
Issue 707621

Starred by 2 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug-Regression




ServiceWorkers randomly disappearing

Reported by ryan@cyph.com, Apr 3 2017

Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36

Steps to reproduce the problem:
Open a site that registers a ServiceWorker, and then continue using the site regularly.

What is the expected behavior?
The ServiceWorker should persist indefinitely until explicitly unregistered by the user.

(If there's a specific intentional reason that ServiceWorkers no longer persist indefinitely, it would be very much appreciated if they could stick around for at least 60 days.)

What went wrong?
The ServiceWorker disappears at some point (even after a long period of daily use) for no apparent reason.

Did this work before? Yes 53.0.2785.116

Does this work in other browsers? Yes

Chrome version: 57.0.2987.133  Channel: stable
OS Version: OS X 10.11.6
Flash Version: 

I'm not entirely sure when this became a problem. I know that I'd never personally experienced it before October 6 of last year, which would've been shortly after I upgraded to 53.0.2785.143, and I'm pretty certain that it wasn't an issue at all before then (or at least it was rare enough that I'd gone at least a year without experiencing it) — so my guess is that 53.0.2785.116 is the last version in which it worked.

I would have reported this immediately, as the impact on our application is severe*, but at the time I thought it was just user error (i.e. the result of a mistake on my part from messing with settings or something). Since then, I've run into the problem myself a few more times, and we've recently received some reports from impacted users.

*: Specifically, when this bug is triggered it bricks the Cyph application at https://cyph.ws. This is due to an HPKP-Suicide-based content pinning scheme that relies on the ServiceWorker for "offline" availability; when the ServiceWorker disappears our users just get stuck with a scary TLS pinning error screen. More detail if needed: https://cyph.team/websigndoc
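For context, the general shape of the pattern is roughly the following. This is a minimal sketch rather than our actual WebSign code (see the linked doc for that); the cache name and asset list are placeholders:

// sw.js -- minimal sketch of a cache-first ServiceWorker; not the real
// WebSign implementation. 'app-shell-v1' and the asset list are made up.
const CACHE = 'app-shell-v1';

self.addEventListener('install', event => {
  event.waitUntil(
    caches.open(CACHE).then(cache => cache.addAll(['/', '/app.js']))
  );
});

self.addEventListener('fetch', event => {
  // Serve from the cache first so the page keeps working even when the
  // server is unreachable; fall back to the network only when necessary.
  event.respondWith(
    caches.match(event.request).then(cached => cached || fetch(event.request))
  );
});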
 

Comment 1 by ryan@cyph.com, Apr 3 2017

Also just noticed I neglected to mention two things:

1. This problem is occurring while the server is for all intents and purposes offline (see extended explanation at the bottom). It seems to me that extra care should be given to keeping a ServiceWorker available when its server is offline, given that one of the major selling points of ServiceWorker/AppCache is offline availability.

2. In case it's relevant, the Cache-Control header in our case is "private, max-age=31536000".
Labels: Needs-Feedback
There is nothing by design that automatically unregisters a service worker based on a time limit.

The registration is supposed to last until explicitly unregistered or cleared by some user action. I think the quota manager can also evict the registration when quota is exceeded.

Is it possible users are clearing browsing data like local storage/cookies? That would remove the service worker.

When you see it next can you check chrome://serviceworker-internals/ and see if all service workers got removed or if it was just yours? It's possible that the database got corrupted and we had to clear it, in which case all service workers would be lost.
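If it's more convenient, something along these lines from the page's console would at least show whether your own origin's registrations are gone (it won't show other origins' workers the way serviceworker-internals does):

// Rough sketch: list this origin's remaining registrations from a page
// on the affected site.
navigator.serviceWorker.getRegistrations().then(registrations => {
  console.log(registrations.length
    ? registrations.map(r => r.scope)
    : 'no registrations left for this origin');
});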

Comment 3 by ryan@cyph.com, Apr 3 2017

> There is nothing by design that automatically unregisters a service worker based on a time limit.
>
> The registration is supposed to last until explicitly unregistered or cleared by some user action. I think the quota manager can also evict the registration when quota is exceeded.

Hmm, well then this is more of a feature request than a bug report, but would it be possible for ServiceWorkers associated with currently-inaccessible servers to be given a higher priority there, given that they'd be more important to hold onto in cases where they can't be readily replaced?

> Is it possible users are clearing browsing data like local storage/cookies? That would remove the service worker.

That's interesting to know; thanks. It's a possibility for the reports we've received from affected users, but at least in my case (I've probably been experiencing this personally at least once every 2 - 8 weeks since October) that definitely isn't it, as I rarely clear my site data (looking at my current cookies / local storage, I see some records that are definitely ~2 years old).

(Also less bug and more feature request: it'd be nice if it were made more difficult for users to accidentally wipe SWs (or at least inaccessible-server-SWs) through this method, possibly by splitting SWs into a separate category or adding an extra confirmation step for clearing inaccessible-server-SWs or something.)

> When you see it next can you check chrome://serviceworker-internals/ and see if all service workers got removed or if it was just yours? It's possible that the database got corrupted and we had to clear it, in which case all service workers would be lost.

Not sure (I'll make a note to check for this next time it happens to me), but the oldest SW in that list for me right now is from March 16, which suggests that most likely all the SWs got removed last time it happened.

Is it possible that some Chrome updates between October and now might have included breaking changes in the database format that would have caused the same effect as corruption? Alternatively, maybe there's just a bug that's causing this recurring corruption? That said, I have intact download history going back to mid-2015, browsing history that goes back pretty far (at least further back than the last time my SWs got cleared), and (as mentioned) ~2 years of site data; does that rule out this being a database corruption issue, or could something like this cause the SWs to be cleared without touching the rest of my data?

Comment 4 by sheriffbot@chromium.org (Project Member), Apr 3 2017

Cc: falken@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "falken@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 5 by mek@chromium.org, Apr 3 2017

Components: Blink>Storage
We do have some heuristics these days to try to identify "important" sites, and ask the user separately if they really meant to delete those when clearing all storage (don't remember the exact details). I don't believe availability of the host the service worker is on is currently taken into account for this though.

Just to be sure, does your site use unregister() at all?

> Is it possible that some Chrome updates between October and now might have included breaking changes in the database format that would have caused the same effect as corruption? Alternatively, maybe there's just a bug that's causing this recurring corruption?

This is a good question. I can't think of any such breaking change, and our metrics haven't shown any widespread recent regressions.

However, issue 706491 could also be explained by database corruption.

Comment 7 by ryan@cyph.com, Apr 3 2017

> We do have some heuristics these days ...

Awesome, that's good to know. Well, you definitely have my vote for taking SW host availability into account for that.

> Just to be sure, does your site use unregister() at all?

Nope, just did a recursive grep to make sure; that definitely isn't the issue here. (This issue also doesn't seem to happen in Firefox.)

Also, going back to your earlier question about whether it affects all the ServiceWorkers or just one, I just remembered a relevant detail that pretty much answers that: we have a bunch of different environments on separate hostnames using the same SW pinning setup (including e.g. https://cyph.im, which is just a few lines of JS to redirect to our application at https://cyph.ws rather than something complex that might be accidentally calling unregister), and every time I've experienced this I've lost the SWs and AppCaches for all of those environments at once, not just one in particular.
Labels: Needs-Triage-M57
Labels: TE-NeedsTriageHelp
> every time I've experienced this I've lost the SWs and AppCaches for all of those environments at once, not just one in particular.

Thanks, that helps a lot. This sounds like a general storage issue or the quota manager kicking in, then. If just the SW database got corrupted, then AppCache would not be affected.

Storage folks: WDYT?
A few asides...

> I think the quota manager can also evict the registration when quota is exceeded.

Note that we don't evict an origin when it's over quota. We start evicting origins in LRU order when Chrome itself is using more space than it thinks it should. If overall storage is not an issue, then all that happens when a site is over quota is that the site's operations (cache puts, IDB transactions, etc.) start getting QuotaExceededError.
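
In code terms, an over-quota origin would see something like the following on a Cache Storage write (just a sketch; the cache name, asset path and payload are hypothetical):

// Sketch of an over-quota cache write; 'app-cache', '/big-asset' and
// hugePayload are made up for illustration.
async function cacheBigAsset(hugePayload) {
  try {
    const cache = await caches.open('app-cache');
    await cache.put('/big-asset', new Response(hugePayload));
  } catch (err) {
    if (err.name === 'QuotaExceededError') {
      // The origin is over its quota: this write fails, but nothing gets
      // evicted unless Chrome as a whole is also over its storage budget.
    }
  }
}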

> it'd be nice if it were made more difficult for users to accidentally wipe SWs (or at least inaccessible-server-SWs) through this method

We have that on mobile and are bringing it to desktop. If the site is granted the persistent storage permission (see storage.spec.whatwg.org) or meets other heuristics (like being registered for push notifications, having a bookmark, etc.), we interrupt the user during storage clearing and say "that includes these important sites - are you sure?" and they can choose to uncheck specific sites or cancel.

...

I agree with the guess that for some of the users the loss of the SW is likely caused by clearing browsing data. Our stats show that users do this with extremely high frequency (either for privacy reasons or as the first step in diagnosing a problem, i.e. following questionable advice on forums for "my browser is slow..."). It might be worth asking users that report the problem if they did this, just to get an idea of whether this correlates.

But as for your machine...

First off, thanks for reporting that this goes back as far as 53. I don't recall any storage-wide changes in that time frame, but that's helpful to start looking.

Guesses include:

* Extensions - extensions can trigger clearing browsing data. Make sure you know what your extensions are doing.
* Storage pressure from other origins - e.g. using a handful of other sites that are pushing Chrome past the limit of drive space it's comfortable using, triggering eviction of origins by LRU

Prior to 57, Chrome used only a fraction (1/3) of "free space". So if your hard drive was 100GB but 99% full, Chrome would use at most 300MB; if 3 sites each stored 100MB, then Chrome would aggressively evict other sites to keep below 300MB total.

In 58 we changed things so Chrome looks at total disk space, not just free space; in the above scenario, Chrome would consider using up to 30GB (although in the case where the disk is mostly full, things still get complicated).

So... obvious questions to ask: how much drive space do you have free, and what amount of storage is used by other sites? chrome://settings/cookies shows this (far right column for anything "big")
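
For the current origin specifically, something like the snippet below gives a rough programmatic view of usage and the quota Chrome would allow it, assuming a build new enough to expose navigator.storage.estimate():

// Sketch: approximate usage/quota for the current origin's storage.
navigator.storage.estimate().then(({usage, quota}) => {
  console.log(`~${(usage / 1e6).toFixed(1)} MB used, ~${(quota / 1e6).toFixed(1)} MB allowed`);
});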



Comment 12 by ryan@cyph.com, Apr 7 2017

> Thanks, that helps a lot. This sounds like a general storage issue or the quota manager kicking in, then. If just the SW database got corrupted, then AppCache would not be affected.

Sorry I neglected to mention that detail about AppCache in my original report! I kind of mentally group them together as one feature since for our purposes they're effectively the same thing.

> We have that on mobile and are bringing it to desktop. ...

Awesome, that's great to hear!

> I agree with the guess that for some of the users the loss of the SW is likely caused by clearing browsing data. ...

Got it, I'll confirm this next time it comes up. I agree that it probably is the case for at least some of these users.

> Extensions - extensions can trigger clearing browsing data.

Hm, that's definitely possible. I don't have many extensions enabled (certainly not any that I'd expect to do anything like that), but is there a specific permission I should check for to see if any of them has the ability to do this?

> Storage pressure from other origins ... So... obvious questions to ask: how much drive space do you have free, and what amount of storage is used by other sites?

As far as site data within Chrome, I see ~250 MB used by Google Drive, ~10 MB used by a small handful of other sites, and ~800 KB used by each of the Cyph environments. Usually I have a pretty decent amount of free space (between 15 and 50 GB at any given time), although on rare occasions a Docker bug (https://github.com/docker/for-mac/issues/371) that I have to manage a bit causes it to drop below 1 GB if I stop paying attention to it for a while. I haven't noticed a correlation between the instances where I've gotten a low disk space warning from OS X and the instances where Chrome has cleared out my ServiceWorkers/AppCaches, but I hadn't been looking for a connection there. I'll test intentionally using up my free space in the morning and see if I can reproduce the SW/AC issue.

That said, the first time I ever ran into this problem was a bit interesting, and seems like it might possibly contradict the storage contention idea (or indicate a bug in the eviction logic). On October 6, I'd had Cyph (https://cyph.ws) open and running as normal, and then right after opening it (while the tab was still open) I opened another cyph.ws link in a separate tab and ran into the NET::ERR_SSL_PINNED_KEY_NOT_IN_CERT_CHAIN screen that having Cyph's ServiceWorker/AppCache removed causes. I don't remember having any disk space issues around that time, and it seems odd that a ServiceWorker/AppCache that was in active use would be removed by design.

Comment 13 Deleted

Comment 14 by ryan@cyph.com, Apr 7 2017

Update: tested the storage contention idea by reducing my free space to under 10 MB (created a large sparse file with mkfile -n), and I did reproduce this problem (mostly; for whatever reason the ServiceWorkers and AppCaches from Google Drive still stuck around).

This makes sense as an explanation to me, and it's nice to know that it's at least a very edge case problem.

A few questions:

* Is this something that the heuristics being ported from mobile for identifying "important" sites may help mitigate, or is that entirely unrelated to the automatic storage eviction logic?

* Any ideas about what Drive may be doing differently for its SWs/ACs to have survived this? Is it just whitelisted somewhere within Chrome itself (similarly to the HSTS/HPKP preload list), or is this possibly something that we could implement as well?

* One interesting thing I confirmed was that Firefox didn't have the same problem; repeatedly opening and closing Firefox before/during/after the free disk space reduction didn't affect the availability of our ServiceWorkers there. Any thoughts on whether it would make sense for Chrome to adopt this behaviour, and be similarly conservative in evicting ServiceWorkers?
Drive has an extension which grants the origin a special "unlimited storage" permission that opts it out of eviction.

The "open web" version of that we've implemented is the "persistent storage permission", see https://developers.google.com/web/updates/2016/06/persistent-storage - an origin can request the permission, which may be granted on a heuristic basis. Once granted the origin will not be flushed due to storage pressure (so this is a good signal to enable offline UI, etc). Firefox is implementing too, unsure when they're going to ship. They may show a user prompt rather than using heuristics.

Managing storage is a very browser-specific thing and we all use different heuristics, balancing keeping the web ephemeral with empowering apps and not surprising users. We're starting to at least discuss more predictable behavior here, with ideas mostly going into https://storage.spec.whatwg.org

Other than an origin with "unlimited" or "persistent" permission being opted out, we just use an LRU; we don't prioritize based on other signals (like having a SW), although that's plausible. Since the "persistent" permission is new, we'd encourage sites to try it out.


...

It sounds like we have explanations for the behavior seen on your machine and your users' machines. Can we resolve this issue pending further data?

Comment 16 by ryan@cyph.com, Apr 7 2017

Nice, thanks a lot! That persistent storage permission sounds like exactly what we need.

I agree that there's no evidence for root causes other than storage contention and manual user intervention, and we have a good solution for the former now, so you can mark this as resolved.
Status: WontFix (was: Unconfirmed)
Thanks for your patience and investigation!

Do let us know if more data comes in.

Comment 18 by ryan@cyph.com, Apr 7 2017

Sounds good, thanks! I can at least verify that the Persistent Storage permission is working as intended after manually granting it to https://cyph.ws and repeating my earlier test.
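
For reference, I checked the grant with something like this from the console on https://cyph.ws (just a quick sketch):

// Sketch: confirm whether persistent storage has been granted.
navigator.storage.persisted().then(persisted => {
  console.log(persisted ? 'persistent storage granted' : 'still best-effort storage');
});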
