ServiceWorker.StartWorker.Status has a lot of failures for "activate" event |
||||||
Issue description
,
Apr 6 2016
Check this uma timeline: https://goo.gl/U2qhTx The "Notifications.PersistentWebNotificationClickResult#Service Worker error" line kind of corresponds.
,
Apr 6 2016
Thanks, that's informative. The raw counts from the push and notification click UMAs make up only a fraction of the PROCESS_NOT_FOUND failures, so it seems we're failing often somewhere else too. BTW, there's a lot of "Database error" for Notifications.PersistentWebNotificationClickResult. Is that expected? Could that be related to issue 534537 ?
,
Apr 6 2016
Yes - we need to clear notifications associated with a Service Worker when it gets unregistered, but currently aren't doing that yet because of a pending refactor. On the list for M52.
,
Apr 19 2016
,
Apr 19 2016
The only way to get ERROR_PROCESS_NOT_FOUND is if RenderProcessHost::Init returns false, and that only happens if the path for the renderer is not found.
Under typical operation, that path is /system/bin/app_process32 on my tablet, which comes from PathProviderAndroid (with base::FILE_EXE key).
So it seems either PathProviderAndroid is not registered by the time PathService::Get is called, or PathProviderAndroid is failing because readlink() is failing here:
int bin_dir_size = readlink(kProcSelfExe, bin_dir, PATH_MAX);
if (bin_dir_size < 0 || bin_dir_size > PATH_MAX) {
NOTREACHED() << "Unable to resolve " << kProcSelfExe << ".";
return false;
}
,
Apr 25 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c1a595d1736123171d216b67535303916d95874d commit c1a595d1736123171d216b67535303916d95874d Author: falken <falken@chromium.org> Date: Mon Apr 25 14:02:00 2016 Add CHECK in case PathProviderAndroid is failing. In bug 600226 , RenderProcessHost::Init() is returning false. The only possible cause I can see is if the path service returned false, so see if PathProviderAndroid is returning false. BUG= 600226 Review URL: https://codereview.chromium.org/1914983002 Cr-Commit-Position: refs/heads/master@{#389468} [modify] https://crrev.com/c1a595d1736123171d216b67535303916d95874d/base/base_paths_android.cc
,
Apr 25 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c1a595d1736123171d216b67535303916d95874d commit c1a595d1736123171d216b67535303916d95874d Author: falken <falken@chromium.org> Date: Mon Apr 25 14:02:00 2016 Add CHECK in case PathProviderAndroid is failing. In bug 600226 , RenderProcessHost::Init() is returning false. The only possible cause I can see is if the path service returned false, so see if PathProviderAndroid is returning false. BUG= 600226 Review URL: https://codereview.chromium.org/1914983002 Cr-Commit-Position: refs/heads/master@{#389468} [modify] https://crrev.com/c1a595d1736123171d216b67535303916d95874d/base/base_paths_android.cc
,
Apr 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/84d425e0db07b37475a7cb467fec6f0fca41dc6d commit 84d425e0db07b37475a7cb467fec6f0fca41dc6d Author: falken <falken@chromium.org> Date: Wed Apr 27 05:25:04 2016 service worker: Add UMA for start status grouped by start purpose. Mostly for debugging bug 600226 , but it's useful beyond that. BUG= 600226 Review URL: https://codereview.chromium.org/1921503003 Cr-Commit-Position: refs/heads/master@{#389993} [modify] https://crrev.com/84d425e0db07b37475a7cb467fec6f0fca41dc6d/content/browser/service_worker/service_worker_metrics.cc [modify] https://crrev.com/84d425e0db07b37475a7cb467fec6f0fca41dc6d/content/browser/service_worker/service_worker_metrics.h [modify] https://crrev.com/84d425e0db07b37475a7cb467fec6f0fca41dc6d/tools/metrics/histograms/histograms.xml
,
Apr 28 2016
Some crash reports are trickling in: F/chromium( 6511): [FATAL:base_paths_android.cc(28)] Check failed: bin_dir_size > 0 && bin_dir_size <= PATH_MAX. Unable to resolve /proc/self/exe. bin_dir_size=-1: Permission denied This means readlink() is failing with EACCESS: "Search permission is denied for a component of the path prefix." I don't know why that'd be. Maybe the file is somehow temporarily locked? Or Chrome is in a unprivileged state. https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20AND%20product.version%3D%2752.0.2718.2%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20base%3A%3APathProviderAndroid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D&stbtiq=&reportid=0210fae200000000&index=6#7
,
Apr 28 2016
Is this happening in the renderer process? If so then I think this is WAI and caused by SELinux preventing it.
,
Apr 28 2016
Oh, no, I see it's the browser process. Let me take a closer look.
,
Apr 28 2016
Side question: why are we trying to find the executable directory on Android anyway? /proc/self/exe is always going to point to app_process (i.e. the JVM binary), not anything to do with us, and this directory is not going to be something we can use for anything else.
,
Apr 28 2016
No idea why /proc/self/exe would be inaccessible for the browser process (and it's definitely the browser process, from the pids in the logs), but it seems like the solution here is to not get this hopefully-unused and almost-certainly useless data in the first place ;)
,
May 10 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fa35291713d19ec1f47f59fd5ba6586a85ca1738 commit fa35291713d19ec1f47f59fd5ba6586a85ca1738 Author: falken <falken@chromium.org> Date: Tue May 10 02:55:43 2016 Revert of Add CHECK in case PathProviderAndroid is failing. (patchset #2 id:20001 of https://codereview.chromium.org/1914983002/ ) Reason for revert: We got enough crash reports, reverting now. Original issue's description: > Add CHECK in case PathProviderAndroid is failing. > > In bug 600226 , RenderProcessHost::Init() is returning > false. The only possible cause I can see is if the path > service returned false, so see if PathProviderAndroid > is returning false. > > BUG= 600226 > > Committed: https://crrev.com/c1a595d1736123171d216b67535303916d95874d > Cr-Commit-Position: refs/heads/master@{#389468} TBR=torne@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG= 600226 Review-Url: https://codereview.chromium.org/1959153003 Cr-Commit-Position: refs/heads/master@{#392530} [modify] https://crrev.com/fa35291713d19ec1f47f59fd5ba6586a85ca1738/base/base_paths_android.cc
,
May 17 2016
Sorry for coming back to this late. Thanks for your comments torne@! It turns out from UMA added in comment 9, the majority of PROCESS_NOT_FOUND come from starting a SW for activation. That was unexpected. I think what's happening is some race with browser shutdown: you close the last tab which releases the last controllee so activation starts, but shutdown also starts happening, and somehow that causes readlink to fail. Activate fails often with "START_WORKER_FAILED" as well, and it also fails on other platforms, so it indeed looks like we're trying to activate at a bad time in the browser's life.
,
May 23 2016
The general bug is starting to start a worker and shutdown happens immediately after. This happens often for the activate event. I can repro failures on Linux easily. There seems to be a few codepaths that can lead to a start worker failure during shutdown. One is that SWProcessManager chooses uses an existing process and increments ref count, then immediately after it decrements ref count in Shutdown(), letting the renderer die, resulting in a start worker failure. Another I've seen, which I can't track down well/understand, is RenderProcessHostImpl::OnChannelError() gets called. Not too sure yet the best way to solve this. One maybe hacky way is for activation is to first hop to the UI thread and back, so if you managed to survive that shutdown is probably not happening.
,
May 23 2016
Would it help to check directly whether a shutdown is in progress? "browser_shutdown::GetShutdownType() != browser_shutdown::NOT_VALID"
,
May 23 2016
That's worth considering. It'd have to be plumbed to content/ somehow of course. We (content/) get notified when shutdown is starting via BrowserContext::NotifyWillBeDestroyed (I got some pushback when adding that). That immediately calls ServiceWorkerProcessManager::Shutdown, so we might already have all the info we need.
,
May 24 2016
I'm still crawling through shutdown/tab close code. It's looking like some sequence like the following: ApplicationLifetime::AttemptExitInternal BrowserCloseManager::TryToCloseBrowsers Browser::OnWindowClosing TabStripModel::InternalCloseTabs starts closing tabs before we enter browser_shutdown::SetTryingToQuit on OnShutdownStarting. So the last tab gets closed and we start trying to start the worker, then soon after ServiceWorkerProcessManager::Shutdown is called.
,
May 26 2016
hi i confirmed. There is a bug. The problem is: there is a wrong event sequence. after "install" event , service is ready even if not always there is "activate" event executed before. So the result is first execution controller is null. In firefox the behaviour is right.
,
May 26 2016
#21 I think that may be a different bug. Can you please explain the steps to reproduction the problem?
,
Jun 3 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6daa334159472055f2d967e17cc8de3bc98ff232 commit 6daa334159472055f2d967e17cc8de3bc98ff232 Author: falken <falken@chromium.org> Date: Fri Jun 03 06:40:31 2016 service worker: Avoid starting up for activation during shutdown Activation is prone to start failures due to browser shutdown, since it's triggered when tabs close. This patch: - Delays activation for a second, when triggered by tab close - Tries to detect when shutdown happened when activation fails BUG= 600226 Review-Url: https://codereview.chromium.org/2027583002 Cr-Commit-Position: refs/heads/master@{#397649} [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_job_unittest.cc [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_metrics.cc [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_metrics.h [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_process_manager.cc [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_process_manager.h [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_registration.cc [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/content/browser/service_worker/service_worker_registration.h [modify] https://crrev.com/6daa334159472055f2d967e17cc8de3bc98ff232/tools/metrics/histograms/histograms.xml
,
Jun 22 2016
UMA is looking good. The last Android Dev without the fix (52.0.2743.32) had 99.26% overall startup success, with 74.96% success for activation. The most recent two Android Devs (53.0.2767.6, 53.0.2763.0) have 99.83% overall startup success, with 99.71% success for activation. Marking this fixed as of 53.0.2763.0.
,
Jun 22 2016
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by falken@chromium.org
, Apr 6 2016Labels: -Pri-2 Pri-1
Owner: falken@chromium.org
Status: Assigned (was: Available)
This looks real, but I'm not sure it's a code regression. The metric is now alternating between 0.1% on weekends and 0.5% on weekdays. Interestingly, the total service worker startup count spikes up on weekends, and down on weekdays. Not being able to find a process seems more likely with push notifications than navigations. And indeed ServiceWorker.FetchEvent.MainResource.Status is healthy. Peter, Michael, do you have metrics on how often "DispatchPushEvent" fails due to the service worker unable to start up? Anyway, I think my first task here will be to add StartWorker.Status_{StartPurpose} metrics, so we can better understand the problem.