New issue
Advanced search Search tips

Issue 600226 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug



Sign in to add a comment

ServiceWorker.StartWorker.Status has a lot of failures for "activate" event

Project Member Reported by nhiroki@chromium.org, Apr 4 2016

Issue description

Cc: peter@chromium.org mvanouwe...@chromium.org
Labels: -Pri-2 Pri-1
Owner: falken@chromium.org
Status: Assigned (was: Available)
This looks real, but I'm not sure it's a code regression. The metric is now alternating between 0.1% on weekends and 0.5% on weekdays. Interestingly, the total service worker startup count spikes up on weekends, and down on weekdays.

Not being able to find a process seems more likely with push notifications than navigations. And indeed ServiceWorker.FetchEvent.MainResource.Status is healthy.

Peter, Michael, do you have metrics on how often "DispatchPushEvent" fails due to the service worker unable to start up?

Anyway, I think my first task here will be to add StartWorker.Status_{StartPurpose} metrics, so we can better understand the problem.

Check this uma timeline: https://goo.gl/U2qhTx

The "Notifications.PersistentWebNotificationClickResult#Service Worker error" line kind of corresponds.
Thanks, that's informative. The raw counts from the push and notification click UMAs make up only a fraction of the PROCESS_NOT_FOUND failures, so it seems we're failing often somewhere else too.

BTW, there's a lot of "Database error" for Notifications.PersistentWebNotificationClickResult. Is that expected? Could that be related to  issue 534537 ?

Comment 4 by peter@chromium.org, Apr 6 2016

Yes - we need to clear notifications associated with a Service Worker when it gets unregistered, but currently aren't doing that yet because of a pending refactor. On the list for M52.

Comment 5 by falken@chromium.org, Apr 19 2016

Status: Started (was: Assigned)

Comment 6 by falken@chromium.org, Apr 19 2016

The only way to get ERROR_PROCESS_NOT_FOUND is if RenderProcessHost::Init returns false, and that only happens if the path for the renderer is not found.

Under typical operation, that path is /system/bin/app_process32 on my tablet, which comes from PathProviderAndroid (with base::FILE_EXE key).

So it seems either PathProviderAndroid is not registered by the time PathService::Get is called, or PathProviderAndroid is failing because readlink() is failing here:

      int bin_dir_size = readlink(kProcSelfExe, bin_dir, PATH_MAX);
      if (bin_dir_size < 0 || bin_dir_size > PATH_MAX) {
        NOTREACHED() << "Unable to resolve " << kProcSelfExe << ".";
        return false;
      }
Project Member

Comment 7 by bugdroid1@chromium.org, Apr 25 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c1a595d1736123171d216b67535303916d95874d

commit c1a595d1736123171d216b67535303916d95874d
Author: falken <falken@chromium.org>
Date: Mon Apr 25 14:02:00 2016

Add CHECK in case PathProviderAndroid is failing.

In  bug 600226 , RenderProcessHost::Init() is returning
false. The only possible cause I can see is if the path
service returned false, so see if PathProviderAndroid
is returning false.

BUG= 600226 

Review URL: https://codereview.chromium.org/1914983002

Cr-Commit-Position: refs/heads/master@{#389468}

[modify] https://crrev.com/c1a595d1736123171d216b67535303916d95874d/base/base_paths_android.cc

Project Member

Comment 8 by bugdroid1@chromium.org, Apr 25 2016

Labels: merge-merged-2716
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c1a595d1736123171d216b67535303916d95874d

commit c1a595d1736123171d216b67535303916d95874d
Author: falken <falken@chromium.org>
Date: Mon Apr 25 14:02:00 2016

Add CHECK in case PathProviderAndroid is failing.

In  bug 600226 , RenderProcessHost::Init() is returning
false. The only possible cause I can see is if the path
service returned false, so see if PathProviderAndroid
is returning false.

BUG= 600226 

Review URL: https://codereview.chromium.org/1914983002

Cr-Commit-Position: refs/heads/master@{#389468}

[modify] https://crrev.com/c1a595d1736123171d216b67535303916d95874d/base/base_paths_android.cc

Some crash reports are trickling in:
F/chromium( 6511): [FATAL:base_paths_android.cc(28)] Check failed: bin_dir_size > 0 && bin_dir_size <= PATH_MAX. Unable to resolve /proc/self/exe. bin_dir_size=-1: Permission denied

This means readlink() is failing with EACCESS: "Search permission is denied for a component of the path prefix."
I don't know why that'd be. Maybe the file is somehow temporarily locked? Or Chrome is in a unprivileged state.

https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20AND%20product.version%3D%2752.0.2718.2%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20base%3A%3APathProviderAndroid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D&stbtiq=&reportid=0210fae200000000&index=6#7

Comment 11 by torne@chromium.org, Apr 28 2016

Is this happening in the renderer process? If so then I think this is WAI and caused by SELinux preventing it.

Comment 12 by torne@chromium.org, Apr 28 2016

Oh, no, I see it's the browser process. Let me take a closer look.

Comment 13 by torne@chromium.org, Apr 28 2016

Side question: why are we trying to find the executable directory on Android anyway? /proc/self/exe is always going to point to app_process (i.e. the JVM binary), not anything to do with us, and this directory is not going to be something we can use for anything else.

Comment 14 by torne@chromium.org, Apr 28 2016

No idea why /proc/self/exe would be inaccessible for the browser process (and it's definitely the browser process, from the pids in the logs), but it seems like the solution here is to not get this hopefully-unused and almost-certainly useless data in the first place ;)
Project Member

Comment 15 by bugdroid1@chromium.org, May 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fa35291713d19ec1f47f59fd5ba6586a85ca1738

commit fa35291713d19ec1f47f59fd5ba6586a85ca1738
Author: falken <falken@chromium.org>
Date: Tue May 10 02:55:43 2016

Revert of Add CHECK in case PathProviderAndroid is failing. (patchset #2 id:20001 of https://codereview.chromium.org/1914983002/ )

Reason for revert:
We got enough crash reports, reverting now.

Original issue's description:
> Add CHECK in case PathProviderAndroid is failing.
>
> In  bug 600226 , RenderProcessHost::Init() is returning
> false. The only possible cause I can see is if the path
> service returned false, so see if PathProviderAndroid
> is returning false.
>
> BUG= 600226 
>
> Committed: https://crrev.com/c1a595d1736123171d216b67535303916d95874d
> Cr-Commit-Position: refs/heads/master@{#389468}

TBR=torne@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 600226 

Review-Url: https://codereview.chromium.org/1959153003
Cr-Commit-Position: refs/heads/master@{#392530}

[modify] https://crrev.com/fa35291713d19ec1f47f59fd5ba6586a85ca1738/base/base_paths_android.cc

Sorry for coming back to this late. Thanks for your comments torne@!

It turns out from UMA added in comment 9, the majority of PROCESS_NOT_FOUND come from starting a SW for activation. That was unexpected. I think what's happening is some race with browser shutdown: you close the last tab which releases the last controllee so activation starts, but shutdown also starts happening, and somehow that causes readlink to fail.

Activate fails often with "START_WORKER_FAILED" as well, and it also fails on other platforms, so it indeed looks like we're trying to activate at a bad time in the browser's life.
Labels: -OS-Android OS-All
Summary: ServiceWorker.StartWorker.Status has a lot of failures for "activate" event (was: ServiceWorker.StartWorker.Status reports ERROR_PROCESS_NOT_FOUND on Android Stable)
The general bug is starting to start a worker and shutdown happens immediately after. This happens often for the activate event. I can repro failures on Linux easily.

There seems to be a few codepaths that can lead to a start worker failure during shutdown. One is that SWProcessManager chooses uses an existing process and increments ref count, then immediately after it decrements ref count in Shutdown(), letting the renderer die, resulting in a start worker failure.

Another I've seen, which I can't track down well/understand, is RenderProcessHostImpl::OnChannelError() gets called.

Not too sure yet the best way to solve this. One maybe hacky way is for activation is to first hop to the UI thread and back, so if you managed to survive that shutdown is probably not happening.
Would it help to check directly whether a shutdown is in progress? "browser_shutdown::GetShutdownType() != browser_shutdown::NOT_VALID"
That's worth considering. It'd have to be plumbed to content/ somehow of course.

We (content/) get notified when shutdown is starting via BrowserContext::NotifyWillBeDestroyed (I got some pushback when adding that). That immediately calls ServiceWorkerProcessManager::Shutdown, so we might already have all the info we need.
I'm still crawling through shutdown/tab close code. It's looking like some sequence like the following:
ApplicationLifetime::AttemptExitInternal
BrowserCloseManager::TryToCloseBrowsers
Browser::OnWindowClosing
TabStripModel::InternalCloseTabs

starts closing tabs before we enter browser_shutdown::SetTryingToQuit on OnShutdownStarting.

So the last tab gets closed and we start trying to start the worker, then soon after ServiceWorkerProcessManager::Shutdown is called.
hi i confirmed. There is a bug. 

The problem is:
there is a wrong event sequence.  
after "install" event , service is ready even if  not always there is "activate" event executed before.
So the result is first execution controller is null. In firefox the behaviour is right.
#21 I think that may be a different bug. Can you please explain the steps to reproduction the problem?
Project Member

Comment 23 by bugdroid1@chromium.org, Jun 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6daa334159472055f2d967e17cc8de3bc98ff232

commit 6daa334159472055f2d967e17cc8de3bc98ff232
Author: falken <falken@chromium.org>
Date: Fri Jun 03 06:40:31 2016

service worker: Avoid starting up for activation during shutdown

Activation is prone to start failures due to browser shutdown, since
it's triggered when tabs close. This patch:
- Delays activation for a second, when triggered by tab close
- Tries to detect when shutdown happened when activation fails

BUG= 600226 

Review-Url: https://codereview.chromium.org/2027583002
Cr-Commit-Position: refs/heads/master@{#397649}

[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_job_unittest.cc
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_metrics.cc
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_metrics.h
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_process_manager.cc
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_process_manager.h
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_registration.cc
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_registration.h
[modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/tools/metrics/histograms/histograms.xml

Status: Fixed (was: Started)
UMA is looking good. The last Android Dev without the fix (52.0.2743.32) had 99.26% overall startup success, with 74.96% success for activation. The most recent two Android Devs (53.0.2767.6, 53.0.2763.0) have 99.83% overall startup success, with 99.71% success for activation.

Marking this fixed as of 53.0.2763.0.
Labels: M-53

Sign in to add a comment