GRC + Mojo/Service Manager lifetime semantic problems. |
|||||||||
Issue descriptionDocumenting this here to keep some public records of my investigations. I'm trying to land a CL that uses GRC: https://chromium-review.googlesource.com/c/578482/ It updates RFH to lazily create FrameCUs. This happens for pretty much every FrameCU. This in turn has caused at least 3, distinct non-deterministic failures in content_browsertests and browser_tests. I suspect most/all of these issues are GRC object-lifetime management issues. It's possible that there are Error #1: See Android failures on PS#23 https://chromium-review.googlesource.com/c/578482/23 I was able to reproduce this locally by patching PS#23 onto fd9ff4c76da4e1317b9b8fe96c4e347734ffcaff. Local symbolization shows that the error is: 0) ~LogMessage [DCHECK] 1) base::WeakPtrFactory<service_manager::ServiceContextRefFactory>::GetWeakPtr() 2) service_manager::ServiceContextRefFactory::CreateRef() 3) std::__ndk1::default_delete<service_manager::ServiceContextRef>::operator() 4) std::__ndk1::unique_ptr<service_manager::ServiceContextRef, std::__ndk1::default_delete<service_manager::ServiceContextRef> >::reset(service_manager::ServiceContextRef*) 5) ~unique_ptr 6) ~CoordinationUnitProviderImpl ... This occurs because the CoordinationUnitProviderImpl is destroyed after the ServiceContextRefFactory. The lifetime of the latter is managed by ResourceCoordinatorService. The lifetime of the former is implicitly tied to MessagePipeHandle/InterfaceRequest (?). It's not really clear to me what that is, but it's outliving the ServiceContextRefFactory. I fixed this in PS#24 by making the ResourceCoordinatorService own CoordinationUnitProviderImpl, so ensure that the CoordinationUnitProviderImpl is destroyed before the ServiceContextRefFactory: https://chromium-review.googlesource.com/c/578482/23..24 [Two more errors to follow].
,
Jul 27 2017
The next problem can be seen in PS#24 and PS#26. https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Flinux_chromium_chromeos_rel_ng%2F477670%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch_%2F0%2Fstdout """ [4720:4925:0726/193712.373950:WARNING:embedded_test_server.cc(219)] Request not handled. Returning 404: /favicon.ico BrowserTestBase received signal: Segmentation fault. Backtrace: #0 0x00000387f97c base::debug::StackTrace::StackTrace() #1 0x000003f0ea62 content::(anonymous namespace)::DumpStackTraceSignalHandler() #2 0x7fbdfbefacb0 <unknown> #3 0x000002985aa7 resource_coordinator::CoordinationUnitImpl::SetProperty() #4 0x0000021f8d1d resource_coordinator::mojom::CoordinationUnitStubDispatch::Accept() #5 0x000004dfb0d2 mojo::InterfaceEndpointClient::HandleValidatedMessage() #6 0x000004dfac86 mojo::FilterChain::Accept() #7 0x000004dfc425 mojo::InterfaceEndpointClient::HandleIncomingMessage() #8 0x000004e042c6 mojo::internal::MultiplexRouter::ProcessIncomingMessage() #9 0x000004e03a0f mojo::internal::MultiplexRouter::Accept() #10 0x000004dfac86 mojo::FilterChain::Accept() #11 0x000004df9f6b mojo::Connector::ReadSingleMessage() #12 0x000004dfa871 mojo::Connector::ReadAllAvailableMessages() #13 0x000004dfa729 mojo::Connector::OnHandleReadyInternal() #14 0x000001a264c5 chromeos::file_system_provider::Int64ToIntCompletionCallback() #15 0x000004e13216 mojo::SimpleWatcher::OnHandleReady() #16 0x00000196cb3e _ZN4base8internal12InvokeHelperILb1EvE8MakeItSoIRKMN3arc40ArcActiveDirectoryEnrollmentTokenFetcherEFvN6policy22DeviceManagementStatusEiRKN21enterprise_management24DeviceManagementResponseEERKNS_7WeakPtrIS5_EEJS7_iSB_EEEvOT_OT0_DpOT1_ #17 0x0000038800f0 base::debug::TaskAnnotator::RunTask() #18 0x00000389f715 base::MessageLoop::RunTask() #19 0x00000389f9cb base::MessageLoop::DeferOrRunPendingTask() #20 0x00000389fdd4 base::MessageLoop::DoWork() #21 0x0000038a2189 base::MessagePumpLibevent::Run() #22 0x00000389f34b base::MessageLoop::Run() #23 0x0000038c99ba base::RunLoop::Run() #24 0x0000038f4d77 base::Thread::Run() #25 0x00000242d2d8 content::BrowserThreadImpl::IOThreadRun() #26 0x00000242d466 content::BrowserThreadImpl::Run() #27 0x0000038f527d base::Thread::ThreadMain() #28 0x0000038ee35f base::(anonymous namespace)::ThreadFunc() #29 0x7fbdff5c5184 start_thread #30 0x7fbdfbfc1bed clone """ Based on the nature of the error, I immediately guessed that the observers have been destroyed. There's only ever 1 observer created, and it's lifetime is tied to the ResourceCoordinatorService. This again suggests that we're getting a message after the ResourceCoordinatorService. I worked around this problem by in PS#25 and PS#27 by deregistering the observer on destruction. [PS#26 and PS#27 were just tests to verify that my fix was indeed doing the right thing]. https://chromium-review.googlesource.com/c/578482/24..25 I think I'm starting to see a similar pattern. The lifetime of the CUImpl is tied to the lifetime of the interface request, which is tied to other objects [e.g. RFH in the case of the FrameCU]. However, the CUImpl depends on objects in the ResourceCoordinatorService, whose lifetime is independent. In some browser_tests or content_browsertests, the ResourceCoordinatorService is destroyed before the RFH, which is causing this class of problems.
,
Jul 27 2017
A DCHECK in ~WeakPtr typically means incorrect usage, i.e. WeakPtrs are not being invalidated and dereferenced from the same single thread.
,
Jul 27 2017
I also have a small fix on the creation of CU in one of my pending patches.
,
Jul 27 2017
Hacked up a CL to trigger: https://chromium-review.googlesource.com/c/590700/
,
Jul 27 2017
The third problem I reported in Issue 749781 . Unfortunately, binaries from CQ runs are not recorded, since I can't get further information about the access to a destroyed object. rockot@ indicates some type of threading issues. Observations: 1) In tests, RFH can outlive ResourceCoordinatorService. This causes problems #1 and #2. 2) The ResourceCoordinator service does not touch threading at all. It uses the implicit TaskRunners present in the tasks that create it. https://cs.chromium.org/search/?q=taskrunner+file:resource_coordinator&type=cs Note that resource_coordinator/tracing and resource_coordinator/memory_instrumentation, despite sharing a directory structure, are unrelated. I tried to compile the object file locally, with the same gn args, with the premise that offset <2270> in the function resource_coordinator::mojom::CoordinationUnitStubDispatch::Accept should be relatively stable. Based on the results, it seems quite likely that <2270> corresponds to: impl->SetProperty() https://cs.chromium.org/chromium/src/out/Debug/gen/services/resource_coordinator/public/interfaces/coordination_unit.mojom.cc?q=coordination_unit.mojom.cc&sq=package:chromium&dr&l=721 This is: 1) Not too surprising, since my CL only adds a call to SetProperty(), not to the other potential messages. 2) Slightly surprising in that this is the same error I saw in problem #2, only 1 frame higher! This suggests that the |impl| object has been destroyed at the time this code is called. Hm. I'm not sufficiently familiar with Mojo internals to know the lifetime semantics of |impl|, but I'd be quite curious how this related to both ResourceCoordinatorService and resource_coordinator::ResourceCoordinatorInterface.
,
Jul 27 2017
@oysteine: It's "working" https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.mac%2Fmac_chromium_rel_ng%2F511463%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch_%2F0%2Flogs%2FPushMessagingBrowserTest.PushEventOnShutdown%2F0 [29429:38915:0727/143433.080739:FATAL:coordination_unit_impl.cc(41)] Check failed: g_cu_map().empty(). 0 browser_tests 0x000000010542bdfc base::debug::StackTrace::StackTrace(unsigned long) + 28 1 browser_tests 0x000000010544fb90 logging::LogMessage::~LogMessage() + 224 2 browser_tests 0x00000001041a20b1 resource_coordinator::CoordinationUnitImpl::AssertNoActiveCoordinationUnits() + 161 3 browser_tests 0x00000001041b01d9 resource_coordinator::ResourceCoordinatorService::~ResourceCoordinatorService() + 25 4 browser_tests 0x00000001041b026e resource_coordinator::ResourceCoordinatorService::~ResourceCoordinatorService() + 14 5 browser_tests 0x000000010613e6fd service_manager::ServiceContext::~ServiceContext() + 173 6 browser_tests 0x000000010613e73e service_manager::ServiceContext::~ServiceContext() + 14
,
Jul 27 2017
rockot@: What's the recommended fix here; tearing down the bindings when the service is shut down, or make the bound impl objects resistant to the service having been shut down already?
,
Jul 27 2017
Either one would be reasonable reasonable, but this is a typical kind of problem when StrongBinding which is why I try to discourage it. I see no great reason why the service shouldn't just own all the CoordinatorUnitImpls. It's a little extra work to manually watch for connection errors and clean them up, but the benefits of lifecycle clarity are probably worth it.
,
Jul 27 2017
Makes sense to me; I'll make the change. Thanks!
,
Jul 28 2017
,
Jul 28 2017
,
Jul 28 2017
,
Jul 28 2017
,
Jul 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647 commit 1f637d9b2977128cc48202a0cb5c9c5f8ed7a647 Author: Oystein Eftevaag <oysteine@chromium.org> Date: Sat Jul 29 14:16:46 2017 GRC: Make CoordinationUnitImpls have their lifespan explicitly managed by the service. A few consequences of this: * To be able to easily tear down all the active CoordinationUnitImpls, the static CU map now owns the CoordinationUnitImpls as unique_ptrs, rather than just holding a raw pointer to them. * Because of the above, client code that creates CoordinationUnitImpls must now explicitly call Destruct() on these. This mainly impacts test code, so I've made a TestCoordinationUnitWrapper which RAII handles this. Bug: 749804 Change-Id: I251c17606c2b055a6a66409681c954cdd76b58d6 Reviewed-on: https://chromium-review.googlesource.com/590700 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: Zhen Wang <zhenw@chromium.org> Cr-Commit-Position: refs/heads/master@{#490654} [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/BUILD.gn [delete] https://crrev.com/bf6f663689975e8edc7f0cccbafd7b0a6158599c/services/resource_coordinator/coordination_unit/coordination_unit_factory.cc [delete] https://crrev.com/bf6f663689975e8edc7f0cccbafd7b0a6158599c/services/resource_coordinator/coordination_unit/coordination_unit_factory.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_graph_observer_unittest.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_impl.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_impl.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_impl_unittest.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_impl_unittest_util.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_impl_unittest_util.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_manager.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_manager.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_provider_impl.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/coordination_unit_provider_impl.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/mock_coordination_unit_graphs.cc [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/coordination_unit/mock_coordination_unit_graphs.h [modify] https://crrev.com/1f637d9b2977128cc48202a0cb5c9c5f8ed7a647/services/resource_coordinator/resource_coordinator_service.cc
,
Jul 29 2017
,
Jul 31 2017
Does this need a merge to M61?
,
Jul 31 2017
That'd probably be a good idea, I'm sure this would cause some rare crashes otherwise.
,
Jul 31 2017
Pls apply appropriate OSs. Thank you.
,
Jul 31 2017
,
Jul 31 2017
Bumping pri since this could lead to occasional race conditions on browser shutdown.
,
Aug 1 2017
Your change meets the bar and is auto-approved for M61. Please go ahead and merge the CL to branch 3163 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), ketakid @(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 3 2017
Pls merge you change to M61 branch 3163 by 5:00 PM PT, Friday (08/04) so we can take it in for next week M61 Beta release. Thank you.
,
Aug 7 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/164211ffe050ec02e64b771858662ad168daadf8 commit 164211ffe050ec02e64b771858662ad168daadf8 Author: Oystein Eftevaag <oysteine@chromium.org> Date: Mon Aug 07 18:29:27 2017 GRC: Make CoordinationUnitImpls have their lifespan explicitly managed by the service. A few consequences of this: * To be able to easily tear down all the active CoordinationUnitImpls, the static CU map now owns the CoordinationUnitImpls as unique_ptrs, rather than just holding a raw pointer to them. * Because of the above, client code that creates CoordinationUnitImpls must now explicitly call Destruct() on these. This mainly impacts test code, so I've made a TestCoordinationUnitWrapper which RAII handles this. TBR=oysteine@chromium.org (cherry picked from commit 1f637d9b2977128cc48202a0cb5c9c5f8ed7a647) Bug: 749804 Change-Id: I251c17606c2b055a6a66409681c954cdd76b58d6 Reviewed-on: https://chromium-review.googlesource.com/590700 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: Zhen Wang <zhenw@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#490654} Reviewed-on: https://chromium-review.googlesource.com/604210 Reviewed-by: Oystein Eftevaag <oysteine@chromium.org> Cr-Commit-Position: refs/branch-heads/3163@{#361} Cr-Branched-From: ff259bab28b35d242e10186cd63af7ed404fae0d-refs/heads/master@{#488528} [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/BUILD.gn [delete] https://crrev.com/e4d101223d670c88b0a6423c198efb15e2d67064/services/resource_coordinator/coordination_unit/coordination_unit_factory.cc [delete] https://crrev.com/e4d101223d670c88b0a6423c198efb15e2d67064/services/resource_coordinator/coordination_unit/coordination_unit_factory.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_graph_observer_unittest.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_impl.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_impl.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_impl_unittest_util.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_impl_unittest_util.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_manager.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_manager.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_provider_impl.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/coordination_unit_provider_impl.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/mock_coordination_unit_graphs.cc [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/coordination_unit/mock_coordination_unit_graphs.h [modify] https://crrev.com/164211ffe050ec02e64b771858662ad168daadf8/services/resource_coordinator/resource_coordinator_service.cc
,
Aug 10 2017
Issue 754234 has been merged into this issue. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by oysteine@chromium.org
, Jul 27 2017