Issue metadata
Sign in to add a comment
|
The start task of ServiceWorker can't detect the renderer crash if it crashes before SendStartWorker() |
||||||||||||||||||||||
Issue description
If a renderer process crashes, EmbeddedWorkerRegistry::RemoveProcess() is
called on the IO thread. And if the embedded worker is in |worker_process_map_|,
EmbeddedWorkerInstance::OnDetached() is called.
The embedded worker was set in |worker_process_map_| by BindWorkerToProcess()
which was called from EmbeddedWorkerInstance::SendStartWorker().
The thread hopping of starting service worker is like this:
IO thread : EmbeddedWorkerInstance::Start()
-> StartTask::Start()
-> ServiceWorkerProcessManager::AllocateWorkerProcess()
UI thread : ServiceWorkerProcessManager::AllocateWorkerProcess()
IO thread : StartTask::OnProcessAllocated()
UI thread : SetupOnUI()
IO thread : StartTask::OnSetupOnUICompleted()
-> EmbeddedWorkerInstance::SendStartWorker()
If EmbeddedWorkerRegistry::RemoveProcess() is called between
StartTask::OnProcessAllocated() and StartTask::OnSetupOnUICompleted(),
EmbeddedWorkerInstance::OnDetached() can't be called.
In this case, I think CallDetach() which is a connection_error_handler of
EmbeddedWorkerInstanceClientPtr must be called. But I don't know why, it is not
called. So the the start worker task will not finish.
I noticed this issue while writing tests for StartServiceWorkerForNavigationHint().
https://chromium-review.googlesource.com/c/530731/1/chrome/browser/chrome_service_worker_browsertest.cc#714
What steps will reproduce the problem?
(1) Patch https://chromium-review.googlesource.com/c/530731/1
(2) Build browser_tests
(3) run browser_tests with "--gtest_filter=ChromeServiceWorkerNavigationHintTest.FailedDueToRendererCrash" several times.
Expected: Test passes.
Actual: Sometimes the test time out. (not 100%)
,
Jun 13 2017
I think moving "registry_->BindWorkerToProcess();" from EmbeddedWorkerInstance::SendStartWorker() to OnProcessAllocated() can fix this problem. shimazu@ What do you think?
,
Jun 15 2017
Is this a regression with Mojo?
,
Jun 22 2017
I think this bug exists before mojofying StartWorker. https://codereview.chromium.org/2374353002 The embedded_worker_id was registered to the |worker_process_map_| when SendStartWorker() is called. https://codereview.chromium.org/2374353002/diff/1/content/browser/service_worker/embedded_worker_registry.cc#oldcode272 This is same as the current logic.
,
Jun 22 2017
Ah, now the test hits the DCHECK(dispatcher_host) which was recently added by shimazu@. https://chromium.googlesource.com/chromium/src/+blame/f50e2dcfad8c3e03eb9445243e4b5e8fccf15b87/content/browser/service_worker/service_worker_provider_host.cc#691 I created a CL to make the test hit this crash by using BrowserThread::PostDelayedTask() to call SetupOnUI() from OnProcessAllocated(). https://chromium-review.googlesource.com/c/544478/ shimazu@ Please try this.
,
Jun 26 2017
,
Jun 26 2017
IIUC, CallDetach should be called when the renderer has crashed. rockot, can I ask you the behavior when the crash is notified? This is where an error callback is set. https://cs.chromium.org/chromium/src/content/browser/service_worker/embedded_worker_instance.cc?type=cs&q=EmbeddedWorkerInstance&l=474 EmbeddedWorkerInstance's |client_| is an InterfacePtr, and the request is bound with a RenderProcessHost by using BindInterface(). https://cs.chromium.org/chromium/src/content/browser/service_worker/embedded_worker_instance.cc?type=cs&l=100 If the renderer crashed before the request is bound with the RenderProcessHost, is the error handler called in the BindInterface()? Or is it happened asynchronously?
,
Jun 26 2017
It will always happen asynchronously.
,
Jul 4 2017
,
Jul 4 2017
Investigating now.
,
Jul 4 2017
This bug is a bit deep. It turns out CallDetached() is called when the renderer is crashed. The problem is we immediately try to restart the worker and ServiceWorkerProcessManager picks the process that already crashed. Then for the second start, neither EmbeddedWorkerRegistry::RemoveProcess() nor Mojo is able to detect that the process has "already crashed" so the second start stalls. In some more detail, horo@'s test is doing this: 1. [IO thread] Start the worker for registration. Finish. 2. [IO thread] Start stopping the worker via StopAllWorkers() 3. [UI thread] renderer crash. 4. [IO thread] Mojo triggers EmbeddedWorkerInstance::CallDetach() which calls EmbeddedWorkerInstance::Detach() which posts task to ServiceWorkerProcessManager::ReleaseWorkerProcess(). 5. [UI thread] ServiceWorkerProcessManager::ReleaseWorkerProcess() decrements RPHI ref count, RPHI::Cleanup() is called. 6. [IO thread] ServiceWorkerVersion::OnStopped() wants to restart the worker. It calls EmbeddedWorkerInstance::Start() again. 7. [UI thread] ServiceWorkerProcessManager::AllocateProcess() is called, increments the ref count of the RPHI that was already Cleanup()’d, returns it. 8. [IO thread] EmbeddedWorkerInstance::StartTask::OnProcessAllocated gets the process id but doesn't yet register with the EmbeddedWorkerRegistry. 9. [IO thread] Due to crashed process from step 3 (via mojo::Connector::HandleError), the ProviderHost for the document is destructed and posts task for ServiceWorkerProcessManager::RemoveProcessReference. 10. [UI thread] RemoveProcessReference removes the process from the list. It’s too late because step 7 already grabbed it. 11. [IO thread] Due the crashed process from step 3 (via ServiceWorkerDispatcherHost::OnFilterRemoved), ServiceWorkerDispatcherHost is destructed and called EmbeddedWorkerRegistry::RemoveProcess(). But the process id is not mapped to the worker yet, so OnDetached() isn’t called. 12. [IO thread] SendStartWorker is called. This apparently goes into the abyss and Mojo doesn’t detect the renderer is crashed because it already crashed and was detected in step 4. I'm not sure why Mojo is happily using the crashed renderer process. What may be happening is we're trying to use the render process for chrome://crash and it looks like a normal process to the browser process but a EmbeddedWorkerInstanceClient endpoint is never created for it, so client_->StartWorker() looks fine to the browser process. When I register to EmbeddedWorkerRegistry immediately in OnProcessAllocated, then in step 11 RemoveProcess() can call OnDetached(). But it still looks possible for a race condition to occur where step 11 happens before step 8. It seems to me that it's safest if step 7 can detect that the render process host impl is already crashed. I think we can do this by: - Iterating through AllHostsIterator() to see if the host is still there - Make SWProcessManager a RenderProcessHostObserver - Remove the RPH tracking code from SWProcessManager and just use SiteInstance I like the final option best but I'm not sure if we still rely on the RPH code. clamy@ do you know if we can delete the RPH code in SWProcessManager yet?
,
Jul 4 2017
@falken: I think we might have to wait until PlzNavigate ships. In particular, without PlzNavigate we don't track which sites are pending in the navigation process.
,
Jul 4 2017
Thanks, clamy@. I investigated further. It turns out in step 5, RPHI::Cleanup() is called, but it won't do anything since there are still listeners_ in RPHI. Then in step 7 we keep the RPHI alive even more by incrementing worker ref count. So the RPHI is never Cleanup()'d or destructed, apparently. That means my ideas in comment 11 won't help: the RPHI is alive and well from its point of view. RPHI::ProcessDied() does get called between step 8 and 9, which I suspect is what triggers steps 9-11. It's looking like we need to do: - Register to EWRegistry immediately in OnProcessAllocated, so that when step 11 happens OnDetached() is called. - Additionally in SendStartWorker, check whether the dispatcher host exists for the process id (this guards against step 11 happening before step 8, though I'm not totally sure that's possible). BTW horo@ to get your test to pass today, I think you just need to wait for StopAllServiceWorkersForOrigin to actually stop the worker. Then after the crash when it stops, SWVersion won't try to restart it since the original status will be STARTING instead of STOPPING. I'm not sure this is all the same bug as issue 736203, though.
,
Jul 5 2017
Also, the timeout timer should eventually time out starting the worker, and the failure callback would be called then. So I don't think this causes indefinite hangs in production (just waiting for a long time). It is probably not good that we try to reuse the crashed process though.
,
Jul 5 2017
Are step 4 and 9 both triggered by step 3, but are executed in separate tasks and have time lags?
,
Jul 5 2017
kinuko@ I believe so. step 4 is triggered by: #1 0x7eff316cc1ae content::(anonymous namespace)::CallDetach() #2 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #3 0x7eff329df589 mojo::InterfaceEndpointClient::NotifyError() #4 0x7eff329e86bf mojo::internal::MultiplexRouter::ProcessNotifyErrorTask() #5 0x7eff329e5e5f mojo::internal::MultiplexRouter::ProcessTasks() #6 0x7eff329e467e mojo::internal::MultiplexRouter::OnPipeConnectionError() #7 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #8 0x7eff329d9216 mojo::Connector::HandleError() #9 0x7eff329d9f12 mojo::Connector::OnHandleReadyInternal() #10 0x7eff35410c9a mojo::SimpleWatcher::OnHandleReady() #11 0x7eff354110f2 mojo::SimpleWatcher::Context::Notify() #12 0x7eff3540fafe mojo::SimpleWatcher::Context::CallNotify() #13 0x7eff2ff0850e mojo::edk::Watch::InvokeCallback() #14 0x7eff2ff043f5 mojo::edk::RequestContext::~RequestContext() #15 0x7eff2fef72c3 mojo::edk::NodeChannel::OnChannelError() step 9 is triggered by: #3 0x7eff31732f29 content::ServiceWorkerProviderHost::~ServiceWorkerProviderHost() #4 0x7eff316e1b0c IDMap<>::Remove() #5 0x7eff316e18a3 content::ServiceWorkerContextCore::RemoveProviderHost() #6 0x7eff317325bc content::(anonymous namespace)::RemoveProviderHost() #7 0x7eff31724a94 _ZN4base8internal7InvokerINS0_9BindStateIPFvNS_7WeakPtrIN7content24ServiceWorkerInternalsUIEEEiNS4_23ServiceWorkerStatusCodeEEJS6_iS7_EEEFvvEE3RunEPNS0_13BindStateBaseE #8 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #9 0x7eff329df589 mojo::InterfaceEndpointClient::NotifyError() #10 0x7eff329a403a IPC::(anonymous namespace)::ChannelAssociatedGroupController::NotifyEndpointOfError() #11 0x7eff329a5d45 IPC::(anonymous namespace)::ChannelAssociatedGroupController::OnPipeError() #12 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv #13 0x7eff329d9216 mojo::Connector::HandleError() #14 0x7eff329d9f12 mojo::Connector::OnHandleReadyInternal() #15 0x7eff35410c9a mojo::SimpleWatcher::OnHandleReady() step 11 is triggered by: #1 0x7eff316d34ba content::EmbeddedWorkerRegistry::RemoveProcess() #2 0x7eff316e167e content::ServiceWorkerContextCore::RemoveDispatcherHost() #3 0x7eff3170c07c content::ServiceWorkerDispatcherHost::OnFilterRemoved() #4 0x7eff30f3c6c9 content::BrowserMessageFilter::Internal::OnFilterRemoved() #5 0x7eff32995079 IPC::ChannelProxy::Context::OnChannelClosed() step 4 is triggered specifically by EmbeddedWorkerInstanceClient (created for start worker) dying step 9 is triggered specifically by ServiceWorkerNetworkProvider (for the document) dying step 11 is triggered specifically by the IPC channel dying
,
Jul 5 2017
Thanks, I see because they're on different pipes. And it'd probably mean any destruction sequence triggered by connection error could run in any order. Tough.
,
Jul 5 2017
I think I see why Mojo doesn't re-detect the crashed process. RenderProcessHostImpl::ProcessDied() calls RPHI::EnableSendQueue() which sets up the channels again in anticipation of this RPHI being reused for another process. So Mojo considers this RPHI to be a clean one ready to host a new process. I don't see an easy way around this. We could add an observer for RenderProcessExited but more observers of lifetime doesn't seem like the answer. If we could access the |instance_id_| we could detect if the RPHI was recycled unexpectedly. But that likely needs to be done on the UI thread and we've set everything up for message sending on the IO thread. In the short term the simplest solution seems to just be to check if SWDispatcherHost exists in SendStartWorker. My reasoning is as follows: 1) If the ServiceWorkerDispatcherHost is removed before SendStartWorker, we’ll discover it in SendStartWorker and bail. 2) If the ServiceWorkerDispatcherHost is removed after SendStartWorker, then EmbeddedWorkerRegistry will call Detach() since SendStartWorker called BindWorkerToProcess. 3) We expect that ServiceWorkerDispatcherHost be created and registered to ServiceWorkerContextCore by the time SendStartWorker() is called. This is because we expect that RenderProcessHostImpl::Init() to have been called already, which constructs the DispatcherHost and posts a task to the IO thread to initialize the DispatcherHost. So by the time OnProcessAllocated() is called on the IO thread, which eventually leads to SendStartWorker() on the IO thread, the DispatcherHost should have been created.
,
Jul 5 2017
Re c#16, c#17:
step 9 and 11 might be coming from the same task because SWNetworkProvider uses channel-associated interface, but the main problem seems happening due to difference between step 3 and step {9, 11}.
,
Jul 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/28232bff5f48813f417f3d808a68a0858f1765ad commit 28232bff5f48813f417f3d808a68a0858f1765ad Author: Matt Falkenhagen <falken@chromium.org> Date: Wed Jul 05 10:57:32 2017 service worker: Handle renderer crash by checking for SWDispacherHost Before this patch, if the renderer crash was detected by EmbeddedWorkerRegistry in between OnProcessAllocated and SendStartWorker, the embedded worker would never know because EmbeddedWorkerRegistry::RemoveProcess would not know to call EmbeddedWorkerInstance::OnDetach() because the worker was not yet registered for the process via EmbeddedWorkerRegistry::BindWorkerToProcess. See bug for more details. The solution is to check for the existence of ServiceWorkerDispatcherHost in SendStartWorker. My reasoning is as follows: 1) If the ServiceWorkerDispatcherHost is removed before SendStartWorker, we’ll discover it in SendStartWorker and bail. 2) If the ServiceWorkerDispatcherHost is removed after SendStartWorker, then EmbeddedWorkerRegistry will call Detach() since SendStartWorker called BindWorkerToProcess. 3) We expect that ServiceWorkerDispatcherHost be created and registered to ServiceWorkerContextCore by the time SendStartWorker() is called. This is because we expect that RenderProcessHostImpl::Init() to have been called already, which constructs the DispatcherHost and posts a task to the IO thread to initialize the DispatcherHost. So by the time OnProcessAllocated() is called on the IO thread, which eventually leads to SendStartWorker() on the IO thread, the DispatcherHost should have been created. Bug: 732729 , 738310 Change-Id: I4e46a2265c059724849a90bfbfc68b3bfdbf5be8 Reviewed-on: https://chromium-review.googlesource.com/558599 Commit-Queue: Matt Falkenhagen <falken@chromium.org> Reviewed-by: Tsuyoshi Horo <horo@chromium.org> Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Reviewed-by: Makoto Shimazu <shimazu@chromium.org> Cr-Commit-Position: refs/heads/master@{#484238} [modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance.cc [modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance.h [modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance_unittest.cc
,
Jul 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/daf95c19c43619e2b49d807e2b31e771512f2a4b commit daf95c19c43619e2b49d807e2b31e771512f2a4b Author: Matt Falkenhagen <falken@chromium.org> Date: Fri Jul 07 04:23:01 2017 service worker: Remove some unnecessary thread hops during startup. Before this patch embedded worker startup did "allocate process" (UI), "handle result" (IO), "register to devtools" (UI), "complete" (IO). Now it does: "allocate process and register to devtools" (UI), "complete" (IO). This fixes a failing CHECK(rph) when the allocated RenderProcessHost has been destroyed by the time we return to the UI thread to check it. Bug: 738310 , 732729 , 561209 Change-Id: I70b9b2512a01997e56db64121a5e9f3543d3f37a Reviewed-on: https://chromium-review.googlesource.com/561027 Commit-Queue: Matt Falkenhagen <falken@chromium.org> Reviewed-by: Makoto Shimazu <shimazu@chromium.org> Cr-Commit-Position: refs/heads/master@{#484831} [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance.cc [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance.h [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance_unittest.cc [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager.cc [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager.h [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager_unittest.cc [modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/tools/metrics/histograms/enums.xml
,
Jul 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/16a04784ff29ac7f8caf4c6fc52031807f6683ae commit 16a04784ff29ac7f8caf4c6fc52031807f6683ae Author: Matt Falkenhagen <falken@chromium.org> Date: Fri Jul 07 08:13:54 2017 service worker: Cleanups in EmbeddedWorkerInstance::StartTask Refactoring following https://chromium-review.googlesource.com/561027 Bug: 738310 , 732729 Change-Id: Ia3f7a7bf561cad7c405bf2b9e899f8d9e42aded7 Reviewed-on: https://chromium-review.googlesource.com/562938 Commit-Queue: Matt Falkenhagen <falken@chromium.org> Reviewed-by: Kinuko Yasuda <kinuko@chromium.org> Reviewed-by: Makoto Shimazu <shimazu@chromium.org> Reviewed-by: Hiroki Nakagawa <nhiroki@chromium.org> Cr-Commit-Position: refs/heads/master@{#484857} [modify] https://crrev.com/16a04784ff29ac7f8caf4c6fc52031807f6683ae/content/browser/service_worker/embedded_worker_instance.cc [modify] https://crrev.com/16a04784ff29ac7f8caf4c6fc52031807f6683ae/content/browser/service_worker/embedded_worker_instance.h [modify] https://crrev.com/16a04784ff29ac7f8caf4c6fc52031807f6683ae/content/browser/service_worker/service_worker_process_manager.cc [modify] https://crrev.com/16a04784ff29ac7f8caf4c6fc52031807f6683ae/content/browser/service_worker/service_worker_process_manager.h
,
Jul 14 2017
I'm fairly confident that comment 21 was the real fix for this issue. Now we bind to the RPH right when it's returned, instead of bouncing through IO and back to UI, which meant the process could have been destroyed in the meantime. A related issue: I think SWProcessManager should check RPH::HasConnection() before returning an "existing process" because the existing process might have been destroyed and the RPH is just being reused. However, the RPH tracking code should go away once PlzNavigate ships. Also we probably should not be trying to restart a worker that was Detached, it looks possible to get into an infinite loop of choosing a bad process.
,
Sep 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7 commit 079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7 Author: Matt Falkenhagen <falken@chromium.org> Date: Fri Sep 29 02:33:23 2017 service worker: Rely completely on SiteInstance when PlzNavigate is enabled This is a step toward removing the process tracking code, which we shouldn't need to do in the PlzNavigate case. In PlzNavigate mode, we can rely on SiteInstance to track renderer process. I also somewhat suspect the IPC_FAILED errors are coming from using a dead process, which can happen when we don't rely on SiteInstance. See: https://bugs.chromium.org/p/chromium/issues/detail?id=732729#c23 Bug: 732729 , 765526 Change-Id: I56d54f3c0b7200bf3e38693436619dd56c2811c2 Reviewed-on: https://chromium-review.googlesource.com/668354 Reviewed-by: Tsuyoshi Horo <horo@chromium.org> Commit-Queue: Matt Falkenhagen <falken@chromium.org> Cr-Commit-Position: refs/heads/master@{#505261} [modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager.cc [modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager.h [modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager_unittest.cc |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by horo@chromium.org
, Jun 13 2017