GCM.Database.Open indicates a 14% failure rate |
||||
Issue descriptionOpening the GCM Database indicates a failure rate of 14%; 20% on Windows, 10% on Mac, 8% on Linux and almost none (0.02%) on Chrome OS. The vast majority of failures fall into the "invalid argument" bucket. In comparison, opening the GCMKeyStore succeeds in 99.9% of attempts, the Notification Database succeeds in 99.3% of attempts. Chris, any idea what this means? A search indicates that it might indicate concurrent access to the database, but that'd presumably be an IO error?
,
Sep 26 2016
Would this happen when the database does not exist and |create_if_missing| is set to FALSE? We also see 25.9% of loads failing because the database does not exist. This is a much bigger number (factor of 2.5) than the "Invalid Argument" errors, however.
,
Sep 26 2016
We actually check whether the *directory* exists prior to opening the database, and bail out if it doesn't and |create_if_missing| is set to FALSE. When loading the database fails (tracked through GCM.LoadStatus) for any other reason than the database not existing, we'll "reset" (blow away and start over) the database. This happens in 10.5% (!) of all loads.
,
Sep 26 2016
Locking failures do indeed seem to result in IOError instead. I looked through for the possible causes of InvalidArgument errors: Ones that can happen at DB Open include: - "does not exist (create_if_missing is false)", - "exists (error_if_exists is true)" and - "does not match existing comparator". There are some others that don't seem to occur during Open (but I haven't checked all codepaths): - "Invalid database key ID", - "unknown file type" (in leveldb::DumpFile), - "not a dump-able file type", - "changing comparator while building table". Finally there are also some InvalidArgument statuses constructed indirectly via DatabaseErrorToStatus, and it's hard to track those, though perhaps they are just deserializations of the ones above.
,
Sep 27 2016
Peter pointed out https://github.com/google/leveldb/issues/215. Since the GCMStore only checks if the directory exists, it could indeed be unintentionally hitting the "does not exist (create_if_missing is false)" case quite frequently. We should probably add an explicit DeleteDir call and/or check more carefully in GCMStore whether the database exists, in order to narrow down the cause of this.
,
Sep 27 2016
Discussed offline with johnme@ and it does appear as though an empty directory could result in this error. He's going to add a check for leveldb's "CURRENT" file - in addition to checking for the existence of the leveldb directory.
,
Sep 28 2016
I believe this also explains the high rate of "Infinite store reset" errors in GCM.ResetStore (about one Infinite store reset error for every two Invalid argument errors in GCM.Database.Open): 1. DB directory exists but is empty. 2. GCMClientImpl::Start is called with StartMode DELAYED_START. 3. GCMStoreImpl::Backend::OpenStoreAndLoadData thinks the DB already exists since the directory exists, so fails to return STORE_DOES_NOT_EXIST. 4. GCMStoreImpl::Backend::OpenStoreAndLoadData tries to load DB with leveldb::Options.create_if_missing=false, which fails with kInvalidArgument "does not exist (create_if_missing is false)". 5. GCMClientImpl::OnLoadCompleted enters the !result->success block, but does not enter the result->store_does_not_exist block due to #3; instead it calls ResetStore. 6. GCMClientImpl::ResetStore destroys the store without removing its directory, then triggers another load with the same StartMode (in this case DELAYED_START). 7. That load fails for the same reason. 8. GCMClientImpl::ResetStore gets called again, sees that this is the second time it tried to reset the store, logs the GCM.ResetStore UMA, and bails out, setting state_=UNINITIALIZED, which causes any future calls to GCMClientImpl::Start to fail (until the browser is restarted at least). Devices likely remain persistently stuck in this state, except perhaps if a call to GCMClientImpl::Start with StartMode IMMEDIATE_START wins the race condition and results in a load with leveldb::Options.create_if_missing=true (haven't checked if that's possible, or if the two DELAYED_START load attempts would always happen first).
,
Sep 29 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6d6c7809c0fff72b92a8038b7d65909b7449d3da commit 6d6c7809c0fff72b92a8038b7d65909b7449d3da Author: johnme <johnme@chromium.org> Date: Thu Sep 29 22:27:16 2016 GCM Store: Fix invalid argument errors Opening the GCM leveldb database resulted in a high rate of "does not exist (create_if_missing is false)" invalid argument errors, since the GCM Store was incorrectly considering the database to already exist if its directory exists, when actually the directory sometimes continues to exist after the database is destroyed. This is expected to significantly reduce the following UMA buckets: - GCM.Database.Open "Invalid Argument" percent (from ~20% to < 1%) - GCM.LoadStatus "Store open failed" percent (from ~10% to < 1%) - GCM.ResetStore "Infinite store reset" count (from ~10% of GCM.Database.Open count to < 1%) BUG= 650254 Review-Url: https://codereview.chromium.org/2380003002 Cr-Commit-Position: refs/heads/master@{#421957} [modify] https://crrev.com/6d6c7809c0fff72b92a8038b7d65909b7449d3da/components/gcm_driver/gcm_client_impl_unittest.cc [modify] https://crrev.com/6d6c7809c0fff72b92a8038b7d65909b7449d3da/google_apis/gcm/engine/gcm_store_impl.cc [modify] https://crrev.com/6d6c7809c0fff72b92a8038b7d65909b7449d3da/google_apis/gcm/engine/gcm_store_impl_unittest.cc
,
Oct 26 2016
mattgaunt reproduced this, and his GCM Store directory was not quite empty, instead it contained the following: - a LOCK file - a LOG file - an Encryption directory (itself containing files called 000003.log, CURRENT, LOCK, LOG, LOG.old, and MANIFEST-000001) Fortunately https://codereview.chromium.org/2380003002 explicitly checks for a CURRENT file (since we weren't sure if the left over directories were actually empty), so it should handle this fine (and I confirmed that unit tests still successfully reset the DB when there are LOCK and LOG files). Looking at UMA (for Channel == Beta AND Milestone == 55): - GCM.Database.Open "Invalid Argument" count is now tiny - GCM.LoadStatus "Store open failed" count is now tiny - GCM.ResetStore "Infinite store reset" count is now tiny (it's a reasonable fraction of GCM.ResetStore, but the total count for GCM.ResetStore is now tiny) So it seems pretty clear that this was indeed the primary cause of failure, and that it has indeed been fixed. Woohoo \o/ |
||||
►
Sign in to add a comment |
||||
Comment 1 by peter@chromium.org
, Sep 26 2016