New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 732729 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug-Regression

Blocking:
issue 736203
issue 736649



Sign in to add a comment

The start task of ServiceWorker can't detect the renderer crash if it crashes before SendStartWorker()

Project Member Reported by horo@chromium.org, Jun 13 2017

Issue description

If a renderer process crashes, EmbeddedWorkerRegistry::RemoveProcess() is
called on the IO thread. And if the embedded worker is in |worker_process_map_|,
EmbeddedWorkerInstance::OnDetached() is called.
The embedded worker was set in |worker_process_map_| by BindWorkerToProcess()
which was called from EmbeddedWorkerInstance::SendStartWorker().

The thread hopping of starting service worker is like this:
  IO thread : EmbeddedWorkerInstance::Start()
              -> StartTask::Start()
                -> ServiceWorkerProcessManager::AllocateWorkerProcess()
  UI thread : ServiceWorkerProcessManager::AllocateWorkerProcess()
  IO thread : StartTask::OnProcessAllocated()
  UI thread : SetupOnUI()
  IO thread : StartTask::OnSetupOnUICompleted()
              -> EmbeddedWorkerInstance::SendStartWorker()

If EmbeddedWorkerRegistry::RemoveProcess() is called between 
StartTask::OnProcessAllocated() and StartTask::OnSetupOnUICompleted(),
EmbeddedWorkerInstance::OnDetached() can't be called.

In this case, I think CallDetach() which is a connection_error_handler of
EmbeddedWorkerInstanceClientPtr must be called. But I don't know why, it is not
called. So the the start worker task will not finish.

I noticed this issue while writing tests for StartServiceWorkerForNavigationHint().
https://chromium-review.googlesource.com/c/530731/1/chrome/browser/chrome_service_worker_browsertest.cc#714


What steps will reproduce the problem?
(1) Patch https://chromium-review.googlesource.com/c/530731/1
(2) Build browser_tests
(3) run browser_tests with "--gtest_filter=ChromeServiceWorkerNavigationHintTest.FailedDueToRendererCrash" several times.

Expected: Test passes.
Actual: Sometimes the test time out. (not 100%)
 

Comment 1 by horo@chromium.org, Jun 13 2017

Description: Show this description

Comment 2 by horo@chromium.org, Jun 13 2017

I think moving "registry_->BindWorkerToProcess();" from EmbeddedWorkerInstance::SendStartWorker() to OnProcessAllocated() can fix this problem.

shimazu@
What do you think?

Comment 3 by falken@chromium.org, Jun 15 2017

Labels: -Type-Bug -Pri-3 M-61 Pri-2 Type-Bug-Regression
Status: Available (was: Untriaged)
Is this a regression with Mojo?

Comment 4 by horo@chromium.org, Jun 22 2017

I think this bug exists before mojofying StartWorker.
https://codereview.chromium.org/2374353002

The embedded_worker_id was registered to the |worker_process_map_| when SendStartWorker() is called.
https://codereview.chromium.org/2374353002/diff/1/content/browser/service_worker/embedded_worker_registry.cc#oldcode272
This is same as the current logic.

Comment 5 by horo@chromium.org, Jun 22 2017

Owner: shimazu@chromium.org
Status: Assigned (was: Available)
Ah, now the test hits the DCHECK(dispatcher_host) which was recently added by shimazu@.

https://chromium.googlesource.com/chromium/src/+blame/f50e2dcfad8c3e03eb9445243e4b5e8fccf15b87/content/browser/service_worker/service_worker_provider_host.cc#691

I created a CL to make the test hit this crash by using BrowserThread::PostDelayedTask() to call SetupOnUI() from OnProcessAllocated().
https://chromium-review.googlesource.com/c/544478/

shimazu@
Please try this.
Blocking: 736649
Cc: roc...@chromium.org
Labels: -Pri-2 Pri-1
Status: Started (was: Assigned)
IIUC, CallDetach should be called when the renderer has crashed.

rockot, can I ask you the behavior when the crash is notified?

This is where an error callback is set.
https://cs.chromium.org/chromium/src/content/browser/service_worker/embedded_worker_instance.cc?type=cs&q=EmbeddedWorkerInstance&l=474

EmbeddedWorkerInstance's |client_| is an InterfacePtr, and the request is bound with a RenderProcessHost by using BindInterface().
https://cs.chromium.org/chromium/src/content/browser/service_worker/embedded_worker_instance.cc?type=cs&l=100

If the renderer crashed before the request is bound with the RenderProcessHost, is the error handler called in the BindInterface()? Or is it happened asynchronously?


Comment 8 by roc...@chromium.org, Jun 26 2017

It will always happen asynchronously.
Blocking: 736203
Cc: shimazu@chromium.org
Owner: falken@chromium.org
Investigating now.
Cc: clamy@chromium.org
This bug is a bit deep.

It turns out CallDetached() is called when the renderer is crashed. The problem is we immediately try to restart the worker and ServiceWorkerProcessManager picks the process that already crashed. Then for the second start, neither EmbeddedWorkerRegistry::RemoveProcess() nor Mojo is able to detect that the process has "already crashed" so the second start stalls. In some more detail, horo@'s test is doing this:

1. [IO thread] Start the worker for registration. Finish.
2. [IO thread] Start stopping the worker via StopAllWorkers()
3. [UI thread] renderer crash.
4. [IO thread] Mojo triggers EmbeddedWorkerInstance::CallDetach() which calls EmbeddedWorkerInstance::Detach() which posts task to ServiceWorkerProcessManager::ReleaseWorkerProcess().
5. [UI thread] ServiceWorkerProcessManager::ReleaseWorkerProcess() decrements RPHI ref count, RPHI::Cleanup() is called.
6. [IO thread] ServiceWorkerVersion::OnStopped() wants to restart the worker. It calls EmbeddedWorkerInstance::Start() again.
7. [UI thread] ServiceWorkerProcessManager::AllocateProcess() is called, increments the ref count of the RPHI that was already Cleanup()’d, returns it.
8. [IO thread] EmbeddedWorkerInstance::StartTask::OnProcessAllocated gets the process id but doesn't yet register with the EmbeddedWorkerRegistry.
9. [IO thread] Due to crashed process from step 3 (via mojo::Connector::HandleError), the ProviderHost for the document is destructed and posts task for ServiceWorkerProcessManager::RemoveProcessReference.
10. [UI thread] RemoveProcessReference removes the process from the list. It’s too late because step 7 already grabbed it.
11. [IO thread] Due the crashed process from step 3 (via ServiceWorkerDispatcherHost::OnFilterRemoved), ServiceWorkerDispatcherHost is destructed and called EmbeddedWorkerRegistry::RemoveProcess(). But the process id is not mapped to the worker yet, so OnDetached() isn’t called.
12. [IO thread] SendStartWorker is called. This apparently goes into the abyss and Mojo doesn’t detect the renderer is crashed because it already crashed and was detected in step 4.

I'm not sure why Mojo is happily using the crashed renderer process. What may be happening is we're trying to use the render process for chrome://crash and it looks like a normal process to the browser process but a EmbeddedWorkerInstanceClient endpoint is never created for it, so client_->StartWorker() looks fine to the browser process.

When I register to EmbeddedWorkerRegistry immediately in OnProcessAllocated, then in step 11 RemoveProcess() can call OnDetached(). But it still looks possible for a race condition to occur where step 11 happens before step 8.

It seems to me that it's safest if step 7 can detect that the render process host impl is already crashed. I think we can do this by:
- Iterating through AllHostsIterator() to see if the host is still there
- Make SWProcessManager a RenderProcessHostObserver
- Remove the RPH tracking code from SWProcessManager and just use SiteInstance

I like the final option best but I'm not sure if we still rely on the RPH code.

clamy@ do you know if we can delete the RPH code in SWProcessManager yet?

@falken: I think we might have to wait until PlzNavigate ships. In particular, without PlzNavigate we don't track which sites are pending in the navigation process.
Thanks, clamy@.

I investigated further. It turns out in step 5, RPHI::Cleanup() is called, but it won't do anything since there are still listeners_ in RPHI. Then in step 7 we keep the RPHI alive even more by incrementing worker ref count. So the RPHI is never Cleanup()'d or destructed, apparently. That means my ideas in comment 11 won't help: the RPHI is alive and well from its point of view.

RPHI::ProcessDied() does get called between step 8 and 9, which I suspect is what triggers steps 9-11.

It's looking like we need to do:
- Register to EWRegistry immediately in OnProcessAllocated, so that when step 11 happens OnDetached() is called.
- Additionally in SendStartWorker, check whether the dispatcher host exists for the process id (this guards against step 11 happening before step 8, though I'm not totally sure that's possible).

BTW horo@ to get your test to pass today, I think you just need to wait for StopAllServiceWorkersForOrigin to actually stop the worker. Then after the crash when it stops, SWVersion won't try to restart it since the original status will be STARTING instead of STOPPING.

I'm not sure this is all the same bug as issue 736203, though.
Also, the timeout timer should eventually time out starting the worker, and the failure callback would be called then. So I don't think this causes indefinite hangs in production (just waiting for a long time).

It is probably not good that we try to reuse the crashed process though.

Are step 4 and 9 both triggered by step 3, but are executed in separate tasks and have time lags?
kinuko@ I believe so.

step 4 is triggered by:
#1 0x7eff316cc1ae content::(anonymous namespace)::CallDetach()
#2 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#3 0x7eff329df589 mojo::InterfaceEndpointClient::NotifyError()
#4 0x7eff329e86bf mojo::internal::MultiplexRouter::ProcessNotifyErrorTask()
#5 0x7eff329e5e5f mojo::internal::MultiplexRouter::ProcessTasks()
#6 0x7eff329e467e mojo::internal::MultiplexRouter::OnPipeConnectionError()
#7 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#8 0x7eff329d9216 mojo::Connector::HandleError()
#9 0x7eff329d9f12 mojo::Connector::OnHandleReadyInternal()
#10 0x7eff35410c9a mojo::SimpleWatcher::OnHandleReady()
#11 0x7eff354110f2 mojo::SimpleWatcher::Context::Notify()
#12 0x7eff3540fafe mojo::SimpleWatcher::Context::CallNotify()
#13 0x7eff2ff0850e mojo::edk::Watch::InvokeCallback()
#14 0x7eff2ff043f5 mojo::edk::RequestContext::~RequestContext()
#15 0x7eff2fef72c3 mojo::edk::NodeChannel::OnChannelError()

step 9 is triggered by:
#3 0x7eff31732f29 content::ServiceWorkerProviderHost::~ServiceWorkerProviderHost()
#4 0x7eff316e1b0c IDMap<>::Remove()
#5 0x7eff316e18a3 content::ServiceWorkerContextCore::RemoveProviderHost()
#6 0x7eff317325bc content::(anonymous namespace)::RemoveProviderHost()
#7 0x7eff31724a94 _ZN4base8internal7InvokerINS0_9BindStateIPFvNS_7WeakPtrIN7content24ServiceWorkerInternalsUIEEEiNS4_23ServiceWorkerStatusCodeEEJS6_iS7_EEEFvvEE3RunEPNS0_13BindStateBaseE
#8 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#9 0x7eff329df589 mojo::InterfaceEndpointClient::NotifyError()
#10 0x7eff329a403a IPC::(anonymous namespace)::ChannelAssociatedGroupController::NotifyEndpointOfError()
#11 0x7eff329a5d45 IPC::(anonymous namespace)::ChannelAssociatedGroupController::OnPipeError()
#12 0x7eff329da2a5 _ZNO4base8CallbackIFvvELNS_8internal8CopyModeE0ELNS2_10RepeatModeE0EE3RunEv
#13 0x7eff329d9216 mojo::Connector::HandleError()
#14 0x7eff329d9f12 mojo::Connector::OnHandleReadyInternal()
#15 0x7eff35410c9a mojo::SimpleWatcher::OnHandleReady()

step 11 is triggered by:
#1 0x7eff316d34ba content::EmbeddedWorkerRegistry::RemoveProcess()
#2 0x7eff316e167e content::ServiceWorkerContextCore::RemoveDispatcherHost()
#3 0x7eff3170c07c content::ServiceWorkerDispatcherHost::OnFilterRemoved()
#4 0x7eff30f3c6c9 content::BrowserMessageFilter::Internal::OnFilterRemoved()
#5 0x7eff32995079 IPC::ChannelProxy::Context::OnChannelClosed()

step 4 is triggered specifically by EmbeddedWorkerInstanceClient (created for start worker) dying
step 9 is triggered specifically by ServiceWorkerNetworkProvider (for the document) dying
step 11 is triggered specifically by the IPC channel dying


Thanks, I see because they're on different pipes. And it'd probably mean any destruction sequence triggered by connection error could run in any order. Tough.
I think I see why Mojo doesn't re-detect the crashed process. RenderProcessHostImpl::ProcessDied() calls RPHI::EnableSendQueue() which sets up the channels again in anticipation of this RPHI being reused for another process.

So Mojo considers this RPHI to be a clean one ready to host a new process.

I don't see an easy way around this. We could add an observer for RenderProcessExited but more observers of lifetime doesn't seem like the answer.

If we could access the |instance_id_| we could detect if the RPHI was recycled unexpectedly. But that likely needs to be done on the UI thread and we've set everything up for message sending on the IO thread.

In the short term the simplest solution seems to just be to check if SWDispatcherHost exists in SendStartWorker. My reasoning is as follows:

1) If the ServiceWorkerDispatcherHost is removed before SendStartWorker,
we’ll discover it in SendStartWorker and bail.

2) If the ServiceWorkerDispatcherHost is removed after SendStartWorker,
then EmbeddedWorkerRegistry will call Detach() since SendStartWorker
called BindWorkerToProcess.

3) We expect that ServiceWorkerDispatcherHost be created and registered
to ServiceWorkerContextCore by the time SendStartWorker() is called.
This is because we expect that RenderProcessHostImpl::Init() to have
been called already, which constructs the DispatcherHost and posts a
task to the IO thread to initialize the DispatcherHost. So by the time
OnProcessAllocated() is called on the IO thread, which eventually leads
to SendStartWorker() on the IO thread, the DispatcherHost should have
been created.
Re c#16, c#17: 
step 9 and 11 might be coming from the same task because SWNetworkProvider uses channel-associated interface, but the main problem seems happening due to difference between step 3 and step {9, 11}. 

Project Member

Comment 20 by bugdroid1@chromium.org, Jul 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/28232bff5f48813f417f3d808a68a0858f1765ad

commit 28232bff5f48813f417f3d808a68a0858f1765ad
Author: Matt Falkenhagen <falken@chromium.org>
Date: Wed Jul 05 10:57:32 2017

service worker: Handle renderer crash by checking for SWDispacherHost

Before this patch, if the renderer crash was detected by
EmbeddedWorkerRegistry in between OnProcessAllocated and SendStartWorker,
the embedded worker would never know because
EmbeddedWorkerRegistry::RemoveProcess would not know to call
EmbeddedWorkerInstance::OnDetach() because the worker was not yet
registered for the process via EmbeddedWorkerRegistry::BindWorkerToProcess.
See bug for more details.

The solution is to check for the existence of ServiceWorkerDispatcherHost
in SendStartWorker. My reasoning is as follows:

1) If the ServiceWorkerDispatcherHost is removed before SendStartWorker,
we’ll discover it in SendStartWorker and bail.

2) If the ServiceWorkerDispatcherHost is removed after SendStartWorker,
then EmbeddedWorkerRegistry will call Detach() since SendStartWorker
called BindWorkerToProcess.

3) We expect that ServiceWorkerDispatcherHost be created and registered
to ServiceWorkerContextCore by the time SendStartWorker() is called.
This is because we expect that RenderProcessHostImpl::Init() to have
been called already, which constructs the DispatcherHost and posts a
task to the IO thread to initialize the DispatcherHost. So by the time
OnProcessAllocated() is called on the IO thread, which eventually leads
to SendStartWorker() on the IO thread, the DispatcherHost should have
been created.

Bug:  732729 ,  738310 
Change-Id: I4e46a2265c059724849a90bfbfc68b3bfdbf5be8
Reviewed-on: https://chromium-review.googlesource.com/558599
Commit-Queue: Matt Falkenhagen <falken@chromium.org>
Reviewed-by: Tsuyoshi Horo <horo@chromium.org>
Reviewed-by: Kinuko Yasuda <kinuko@chromium.org>
Reviewed-by: Makoto Shimazu <shimazu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#484238}
[modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance.cc
[modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance.h
[modify] https://crrev.com/28232bff5f48813f417f3d808a68a0858f1765ad/content/browser/service_worker/embedded_worker_instance_unittest.cc

Project Member

Comment 21 by bugdroid1@chromium.org, Jul 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/daf95c19c43619e2b49d807e2b31e771512f2a4b

commit daf95c19c43619e2b49d807e2b31e771512f2a4b
Author: Matt Falkenhagen <falken@chromium.org>
Date: Fri Jul 07 04:23:01 2017

service worker: Remove some unnecessary thread hops during startup.

Before this patch embedded worker startup did "allocate process" (UI),
"handle result" (IO), "register to devtools" (UI), "complete" (IO).

Now it does: "allocate process and register to devtools" (UI),
"complete" (IO).

This fixes a failing CHECK(rph) when the allocated RenderProcessHost has
been destroyed by the time we return to the UI thread to check it.

Bug:  738310 ,  732729 , 561209
Change-Id: I70b9b2512a01997e56db64121a5e9f3543d3f37a
Reviewed-on: https://chromium-review.googlesource.com/561027
Commit-Queue: Matt Falkenhagen <falken@chromium.org>
Reviewed-by: Makoto Shimazu <shimazu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#484831}
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance.cc
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance.h
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/embedded_worker_instance_unittest.cc
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager.cc
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager.h
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/content/browser/service_worker/service_worker_process_manager_unittest.cc
[modify] https://crrev.com/daf95c19c43619e2b49d807e2b31e771512f2a4b/tools/metrics/histograms/enums.xml

Status: Fixed (was: Started)
I'm fairly confident that comment 21 was the real fix for this issue. Now we bind to the RPH right when it's returned, instead of bouncing through IO and back to UI, which meant the process could have been destroyed in the meantime.

A related issue: I think SWProcessManager should check RPH::HasConnection() before returning an "existing process" because the existing process might have been destroyed and the RPH is just being reused. However, the RPH tracking code should go away once PlzNavigate ships. Also we probably should not be trying to restart a worker that was Detached, it looks possible to get into an infinite loop of choosing a bad process.
Project Member

Comment 24 by bugdroid1@chromium.org, Sep 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7

commit 079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7
Author: Matt Falkenhagen <falken@chromium.org>
Date: Fri Sep 29 02:33:23 2017

service worker: Rely completely on SiteInstance when PlzNavigate is enabled

This is a step toward removing the process tracking code, which we
shouldn't need to do in the PlzNavigate case. In PlzNavigate mode,
we can rely on SiteInstance to track renderer process.

I also somewhat suspect the IPC_FAILED errors are coming from using a
dead process, which can happen when we don't rely on SiteInstance.
See: https://bugs.chromium.org/p/chromium/issues/detail?id=732729#c23

Bug:  732729 ,  765526
Change-Id: I56d54f3c0b7200bf3e38693436619dd56c2811c2
Reviewed-on: https://chromium-review.googlesource.com/668354
Reviewed-by: Tsuyoshi Horo <horo@chromium.org>
Commit-Queue: Matt Falkenhagen <falken@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505261}
[modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager.cc
[modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager.h
[modify] https://crrev.com/079829bbf720bbbe1fe0a95c47ce6bb18ad2c0f7/content/browser/service_worker/service_worker_process_manager_unittest.cc

Sign in to add a comment