ConstraintError (Key already exists in the object store.) on CrOS 58.0.3029.6
Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36

Steps to reproduce the problem:
1. Have offline enabled in Google Docs on ChromeOS 58.0.3029.6.
2. Open a Google Docs document.

What is the expected behavior? Should work.

What went wrong? ConstraintError (Key already exists in the object store) is thrown when trying to write to IndexedDB.

Did this work before? Yes, 58.0.3027.0
Does this work in other browsers? N/A
Chrome version: 58.0.3029.6
Channel: dev
OS Version: 58.0.3029.6
Flash Version:

We are working on a better repro case. Basically, we read the database using ObjectStore.get(), which returns a falsy value; we then try to write a new document using ObjectStore.add(), which fails with the constraint error above. This issue only started appearing in Docs on 3/8/2017, when Chrome 58.0.3029.6 started rolling out, and it only appears on CrOS. The issue also doesn't seem to reproduce for newly created documents, so our suspicion is that something may be going wrong in the update flow between the two Chrome versions.
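For context, a minimal TypeScript sketch of the access pattern described above. The database/store names, record shape, and keyPath are assumptions for illustration, not Docs' actual code:

// Assumes an object store "documents" created with { keyPath: "id" }.
function loadOrCreate(db: IDBDatabase, docId: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction("documents", "readwrite");
    const store = tx.objectStore("documents");
    const getReq = store.get(docId);
    getReq.onsuccess = () => {
      // A deserialization failure surfaces as a null result, which is
      // indistinguishable from "no record" to the application.
      if (!getReq.result) {
        // The key actually still exists on disk, so this add() fails
        // with ConstraintError: "Key already exists in the object store."
        const addReq = store.add({ id: docId, body: "" });
        addReq.onsuccess = () => resolve(addReq.result);
        addReq.onerror = () => reject(addReq.error);
      } else {
        resolve(getReq.result);
      }
    };
    getReq.onerror = () => reject(getReq.error);
  });
}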
Mar 13 2017
This looks to me like it was caused by the V8 version rolling back: https://chromium.googlesource.com/chromium/src/+/a8be5d4dabe8587567639e63831e3fc74ee089c2. That would mean the same root cause as bug 698214. Basically, the structured clone code has for some time had an assumption that the wire format version would not go backwards, but the V8 rollback for the branch did just that. As a result, objects written by newer versions of the serialization logic are being rejected by older versions. Blink handles this, and all serialization errors, by producing null (which is indeed a falsy value). In that bug, I thought we'd determined that only Canary would see this issue. Is CrOS dev channel unusual in this regard, as compared to the other platform dev channels?
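To illustrate just the failure mode, a TypeScript schematic (this is not the actual Blink/V8 code; the header layout and version constant are simplified assumptions):

// Schematic only: the real logic lives in V8's ValueSerializer/ValueDeserializer.
const LATEST_WIRE_VERSION = 12; // the most this (rolled-back) build can read

function deserialize(buffer: Uint8Array): unknown {
  const version = buffer[0]; // simplified; the real header is a tag plus a varint
  if (version > LATEST_WIRE_VERSION) {
    // Data written by a newer build is rejected. Blink reports every
    // serialization error to the page as null, a falsy value that an
    // application can mistake for "no such record".
    return null;
  }
  return decodePayload(buffer.subarray(1)); // stand-in for the real decoder
}

function decodePayload(bytes: Uint8Array): unknown {
  return JSON.parse(new TextDecoder().decode(bytes)); // placeholder codec
}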
Mar 13 2017
That theory seems to match what we're seeing. So we do *not* expect this issue to propagate beyond the dev channel? If so, I'd like to instruct users who reported this issue to disable and re-enable Docs offline so they are able to use their documents again. Also, are there any mechanisms being put in place to prevent this from happening in the future, especially in prod? What happens if we ever need to roll back in prod? That could potentially brick all Docs users that have offline enabled. Shouldn't wire format changes be backwards compatible?
Mar 13 2017
Changes are backwards compatible (i.e. new versions of the code can read old versions of the data); this is a case where forwards compatibility was violated due to the rollback (i.e. old versions of the code cannot read new versions of the data). As a defense against this, we embed the wire format version in the database and fail the open if it's a "future" version, rather than hitting undefined behavior. In this case I don't think the version number was bumped (jbroman@?).
Mar 13 2017
The format is backward compatible (newer versions can read data written by older versions), but not forward compatible (old versions cannot read data written by newer versions, and in fact don't even try). I don't know the background behind this decision; it's been this way for some (all?) time. Being perfectly forward compatible would seem impossible (there will be new kinds of values that didn't exist in old versions). Extreme measures are, of course, possible (like requiring everything to be in the deserializer a milestone before it's emitted by the serializer), but that'd be very rough to work with. jsbell, do you know the storage team's historical thinking about this? The particular wire format change in this case is very small, so we could cherry-pick the change forward, or cherry-pick a change that makes it accept the new version without ever writing it. I'd thought that unnecessary because it hadn't hit anyone but Canary (and Canary quickly moved forward past the V8 change again, and so once again was able to read the data).
Mar 13 2017
#5: Looking more closely, I was unaware of that copy of the wire format version (it's somewhat detached from the rest of the serialization code). It's a little more complicated now (because there can be wire format version changes on the V8 side, as well), so that mechanism should probably be updated to include a combined version. Though if I had known about that, the failure mode would have changed (database fails to open), but there would still be an issue here, correct? Or do IDB apps recover gracefully?
Mar 13 2017
Web apps using IDB need to handle self-induced version skew: if you previously did indexedDB.open('db', 2), then indexedDB.open('db', 1) would error out in a similar way. The error code will be different, but a browser-induced failure here obviously isn't something developers can test for; it goes beyond specified behavior.
The other option we've considered is blowing away the database, which we do in the case of corruption.
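For reference, a short sketch of what the self-induced case looks like from a page (the database name and versions are arbitrary):

// After a successful indexedDB.open("db", 2), reopening with a lower
// version fails; per spec, the request errors with a "VersionError".
const req = indexedDB.open("db", 1);
req.onerror = () => {
  // req.error.name === "VersionError" for self-induced skew. A
  // browser-induced format rollback is a different, untestable failure.
  console.error("open failed:", req.error?.name, req.error?.message);
};
req.onsuccess = () => req.result.close();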
Mar 13 2017
Yes, sorry, I meant forward compatibility. Usually when we deal with storage at the app level, we push out the reading part first and only after a while do we start writing data in the new version. This way we are still able to roll back without breaking things. I don't know if that can be done in your case, though. Blowing away the database is definitely not an option we prefer, as it can potentially delete users' unsaved changes, causing data loss.
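As a sketch of that two-phase practice (the version constants and record shape here are hypothetical):

// Ship the reader for version N+1 one release before the writer starts
// emitting it; rolling back the writing release then leaves readable data.
const MAX_READABLE_VERSION = 3; // understood as of release R
const WRITE_VERSION = 2;        // bumped to 3 only in release R+1

interface StoredRecord { v: number; payload: string; }

function read(rec: StoredRecord): string {
  if (rec.v > MAX_READABLE_VERSION) throw new Error("unreadable future record");
  // ... per-version handling for 1..MAX_READABLE_VERSION ...
  return rec.payload;
}

function write(payload: string): StoredRecord {
  return { v: WRITE_VERSION, payload };
}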
Mar 13 2017
For the immediate issue, how do people feel about cherry-picking the read half of 6543519977b2012b58a4ffef28b8527db404fbdb into M58? The patch would look roughly like:

 // Version 12: regexp and string objects share normal string encoding
 static const uint32_t kLatestVersion = 12;
+// Version 13: host objects have an explicit tag (rather than handling all
+// unknown tags)
+static const uint32_t kLatestVersionForReading = 13;

   // wasmWireByteLength:uint32_t, then raw data
   // compiledDataLength:uint32_t, then raw data
   kWasmModule = 'W',
+  // The delegate is responsible for processing all following data.
+  // This "escapes" to whatever wire format the delegate chooses.
+  kHostObject = '\\',
 };

 Maybe<bool> ValueDeserializer::ReadHeader() {
   if (position_ < end_ &&
       *position_ == static_cast<uint8_t>(SerializationTag::kVersion)) {
     ReadTag().ToChecked();
-    if (!ReadVarint<uint32_t>().To(&version_) || version_ > kLatestVersion) {
+    if (!ReadVarint<uint32_t>().To(&version_) ||
+        version_ > kLatestVersionForReading) {
       isolate_->Throw(*isolate_->factory()->NewError(
       ...

     case SerializationTag::kWasmModule:
       return ReadWasmModule();
-    default:
-      // TODO(jbroman): Introduce an explicit tag for host objects to avoid
-      // having to treat every unknown tag as a potential host object.
-      position_--;
+    case SerializationTag::kHostObject:
       return ReadHostObject();
+    default:
+      // Before there was an explicit tag for host objects, all unknown tags
+      // were delegated to the host.
+      if (version_ < 13) {
+        position_--;
+        return ReadHostObject();
+      }
+      return MaybeHandle<Object>();
   }
 }

Alternatively, we could cherry-pick the whole patch (the writing side is just a bump of kLatestVersion and one line to include SerializationTag::kHostObject before each host object), which changes more but is also exactly the patch that's tested in canary/ToT, whereas I'd have to test this locally. Regardless, if this is the issue, it shouldn't end up affecting beta/stable users, but dev channel users do deserve a resolution of some kind.
Mar 13 2017
I think I'd lean slightly towards the "read half" patch above, to keep it defensive in dev rather than pushing out the writing change, just in case further churn happens.

Sounds like the follow-up actions are:
* Work on the plumbing for getting the serialization version(s) persisted. It's definitely gotten more complicated over time. :(
* Teach the IDB code to handle a failure here if our other defenses fail - i.e. replace the DCHECKs with handling the serialization failure, pass an error through to JS, and force-close the connection (as we do on I/O error).
Mar 13 2017
(Merge request note: the merge would be on the V8 side.)
Mar 14 2017
There's already a pending merge request of those (entire!) changes for node.js: https://bugs.chromium.org/p/v8/issues/detail?id=6080. If CrOS needs it too, then maybe we should just do that? (Re #3: yes, it seems CrOS Dev is unusual in that it shipped a 3027 build on Dev, which the desktop platforms only shipped on Canary.)
Mar 14 2017
The last 58 Dev in ChromeOS is scheduled to go live today, and I assume it's already built. Even if we merge back the feature/fix, it wouldn't help. If ChromeOS does another 58 push on Dev, a merge should be OK. In that case, please merge the whole patch, not only the reading part; the whole patch is at least tested :-). Bernie, please decide regarding the merge, as you know the actual ChromeOS release plan better.
Mar 14 2017
We will be doing another 58 push on Dev, likely next Tuesday; today's is already built and is not a beta candidate.
Mar 14 2017
Could a v8 committer please cherry-pick 6543519977b2012b58a4ffef28b8527db404fbdb for me, then? (Alas I can only cherry-pick to Chromium branches.)
Mar 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/41e12ea699bfb7ab564606b8764b3dcd5d12244e

commit 41e12ea699bfb7ab564606b8764b3dcd5d12244e
Author: Jakob Kummerow <jkummerow@chromium.org>
Date: Wed Mar 15 12:12:55 2017

Merged: ValueSerializer: Add an explicit tag for host objects.
Revision: 6543519977b2012b58a4ffef28b8527db404fbdb

BUG=chromium:686159, chromium:700603, v8:6080
LOG=N
NOTRY=true
NOPRESUBMIT=true
NOTREECHECKS=true
R=hablich@chromium.org

Review-Url: https://codereview.chromium.org/2749263002 .
Cr-Commit-Position: refs/branch-heads/5.8@{#29}
Cr-Branched-From: eda659cc5e307f20ac1ad542ba12ab32eaf4c7ef-refs/heads/5.8.283@{#1}
Cr-Branched-From: 4310cd02d2160b1457baed81a2f40063eb264a21-refs/heads/master@{#43429}

[modify] https://crrev.com/41e12ea699bfb7ab564606b8764b3dcd5d12244e/src/value-serializer.cc
[modify] https://crrev.com/41e12ea699bfb7ab564606b8764b3dcd5d12244e/test/unittests/value-serializer-unittest.cc
Mar 17 2017
Hi, what's the ETA for the fix to hit the Chrome OS dev channel, please?
Mar 17 2017
#16: bhthompson above is the TPM, so sounds like next Tuesday.
Mar 17 2017
That is correct, Tuesday is the next 58 dev target.
Mar 20 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Mar 20 2017
Merged to 58 above (on the V8 5.8 branch).
Mar 21 2017
crbug/698214 suggests that it might have also affected stable versions? I was under the impression that this was caught in the dev channel on CrOS and in canary on other builds. Please confirm.
Mar 22 2017
jbroman@, could you confirm which platforms/versions are affected? We've received a user report from Mac 57.0.2987.110. The impacted user count is much lower than on CrOS, but I just wanted to double-check. Also, in which Chrome version is the roll-forward included?
Mar 22 2017
The CL we believe to be responsible for this was not in M57. Assuming the diagnosis was correct, an M57 user should only observe this issue if:
- data is corrupted on disk, or
- they were previously running a greater milestone, and downgraded to 57.
Mar 22 2017
Confirmed that the user recently switched from the beta to the stable channel.
Mar 22 2017
jsbell could comment further, but my understanding is that we don't support profile downgrades. We should be behaving slightly better than this, but I accidentally broke the downgrade detection (and intend to fix it). But even if that were working correctly, I think the failure mode would be that the IDB database would fail to open (and the data would appear to be "gone" to the application).
Mar 22 2017
Correct. After double-checking the code: if we detect "from the future" data on open, we report it as a variant of data loss; the open's upgradeneeded event has nonstandard dataLoss == "total" / dataLossMessage == "I/O error checking schema" properties. I have issue 703704 filed to track handling this post-open, but we should really never get that far. Feedback welcome!
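For illustration, a sketch of how a page could observe that. dataLoss/dataLossMessage are Chrome-only, nonstandard properties absent from the standard typings, so the cast is purely illustrative, and the database name is made up:

const req = indexedDB.open("docs-offline", 1); // name is illustrative
req.onupgradeneeded = (event) => {
  // Chrome exposes nonstandard dataLoss/dataLossMessage on this event;
  // they are not part of the TypeScript DOM typings, hence the cast.
  const e = event as IDBVersionChangeEvent & {
    dataLoss?: string;
    dataLossMessage?: string;
  };
  if (e.dataLoss === "total") {
    // The store was reset (e.g. "from the future" data was detected);
    // the app should treat local data as gone and re-sync from the server.
    console.warn("IndexedDB data loss:", e.dataLossMessage);
  }
};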
Mar 24 2017
We have verified that a user who was affected upgraded to 58.0.3029.31 and the issue was resolved for him. Thank you for fixing this.

At this point, I think a postmortem would be appropriate. This issue has caused around 4,000 daily crashes affecting up to 700 users. If the original change had reached the stable channel and been rolled back there, the impact would have been catastrophic: just in Docs, 20% of prod users (users with offline enabled) would have been unable to access their documents.

I also think that we can draw useful action items from this:
1) How to better handle object deserialization in IndexedDB (fail fast on read?) [i.e. jsbell@'s crbug/703704]
2) How to avoid making changes in Chrome that can break on rollback. This is something that apps that use storage have to deal with all the time, and it should be standard practice in Chrome IMO.
3) Regression tests for Chrome that verify upgrade/downgrade flows (i.e. create a database, upgrade Chrome, verify that all the contents are intact and the database is functional, then downgrade and verify again).
4) How apps like Docs can be more resilient to these problems (e.g. we will probably try to catch errors during database initialization and fall back to server-only; see the sketch after this list).
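A rough sketch of the fallback in item 4 (the function, database name, and error-handling policy here are hypothetical, not Docs' actual code):

// Hypothetical: try to initialize offline storage; on any failure,
// disable offline and serve from the network only.
async function initStorage(): Promise<IDBDatabase | null> {
  try {
    return await new Promise<IDBDatabase>((resolve, reject) => {
      const req = indexedDB.open("docs-offline", 1);
      req.onsuccess = () => resolve(req.result);
      req.onerror = () => reject(req.error);
    });
  } catch (err) {
    console.warn("Offline storage unavailable, using server only:", err);
    return null; // callers treat null as "server-only mode"
  }
}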
Apr 18 2017
Hi - was there a postmortem or a list of takeaways from this issue? It seems to me that there is a missing best practice here that could prevent this from happening again, possibly in stable, where it would have caused significant damage.
Apr 18 2017
Thanks for prodding - we're getting started on the PM now.
May 10 2017
Thank you!