New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 809176 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug-Regression



Sign in to add a comment

Large number of "InvalidStateError requesting filesystem" reports

Project Member Reported by ohsnapit...@google.com, Feb 5 2018

Issue description

Google Docs is seeing a large number of "InvalidStateError requesting filesystem" reports, which appear to have started to increase in late January.

The substantial majority of the reports appears to be from Windows machines running Chrome 63 and 64.

Here's the breakdown of our top error reports by OS and Chrome version:

Windows NT 6.1; Win64; x64		63.0.3239.132	36.06%
Windows NT 10.0; Win64; x64		63.0.3239.132	26.24%
Windows NT 6.1				63.0.3239.132	9.93%
Windows NT 6.3; Win64; x64		63.0.3239.132	6.26%
Windows NT 6.1; Win64; x64		64.0.3282.119	3.54%
Windows NT 10.0; Win64; x64		64.0.3282.119	3.47%
Other (All OS and Chrome Version)			14.50%

Looking at these numbers, over 85% of our errors come from Windows machines. This is disproportionate to our usage numbers, where Windows makes up about 35% of our total users.

 
Labels: Needs-Triage-M64
Labels: -Pri-2 Pri-1
Changing the priority on this as we're seeing the errors continue to rise.
It's worth noting that we also appear to be seeing an increase in timeouts when requesting the filesystem via goog.fs.getPersistent. Our current timeout is set to 30 seconds.

These timeout errors also appear to be increasing specifically for Windows machines. 
Cc: dmu...@chromium.org
* 63 is r496892 through r508578. 

* The front-end never generates InvalidStateError. This will only be produced by the back-end passing base::File::FILE_ERROR_FAILED through. Unfortunately, that doesn't narrow it down much.

* The FS API uses leveldb to manage both top level origin-> directory mapping per-origin a mapping of paths to files. It's possible recent leveldb changes (in 63) we suspect are responsible for IDB timeouts on Windows could be causing timeouts here too. (But shouldn't be causing errors, so they may be two distinct issues...)

Comment 6 by cmumford@google.com, Feb 15 2018

As of Tue Feb 13 the beta channel has completely returned to it's normal error rate prior to the spike starting on 1/24. The stable channel, which spiked at 4x the normal error level, is down to 2.4x the normal level. I think that by mid next week (2/21) we should know if it is still trending down, or if it has leveled off at a new higher error rate (requiring further investigation).

Also, the other affected databases (notification, and appCache) have nearly identical trends.
Here's some more results of my investigation. leveldb::DB::Open() is failing, returning a status of I/O Error. There are very few (only three) places in leveldb that create a Status object with this value, and I think that none of them are in the Open() call sequence. However, ChromiumEnv (env_chromium.cc) almost exclusively creates I/O Errors (via a call to MakeIOError) - all of which are logged to UMA. (yay!)

There are two UMA values whose trends exactly match the trends of the top level db open UMA which are: WebCore.IndexedDB.LevelDBOpenErrors.BFE.LockFile and WebCore.IndexedDB.LevelDBOpenErrors.BFE.RenameFile. These two functions are logging FAILED values, again exactly matching the increases at the top level open call (WebCore.IndexedDB.BackingStore.OpenStatus=OpenAttemptNoRecovery). Whenever FAILED is returned File::OSErrorToFileError() logs the actual error value to PlatformFile.UnknownErrors.Windows. The LOCK_VIOLATION curve increase exactly matches the other UMA value increases. There are ~120 different values, and this was found by visual inspection, but it does appear as though ::CreateFile (used by LockFile) and ::RenameFile (Used by RenameFile) are failing with LOCK_VIOLATION.

This bug and  issue 810842  are really the same -  issue 810842  indicates that appcache.InitResult have both increased in the same way. appCache uses SQLite, which does not use base/file.

So it does look like something on Windows has changed. I think the next step is to look at the Windows Updates to see what changes they have introduced.
Cc: cmumford@chromium.org
 Issue 810842  has been merged into this issue.
A few more observations from UMA.

1. Simultaneous increases in beta and stable channels (already knew this).
2. Simultaneous increases in M62, 63, & 64.
3. Simultaneous increasees in Win7 through Win10.
4. Both 32 & 64 bit systems.
5. Both malware and non-malware infected systems.
6. Both spinning disk and SSD's.

It looks like a Windows update distributed widely (like a security patch), or maybe an antivirus definition update.
Cc: jimmyshen@google.com
Owner: jsb...@chromium.org
Status: Assigned (was: Untriaged)
Please update the bug with the latest status, likely from emails in Spring 18
Labels: -Pri-1 Pri-2
We never came to an actionable conclusion here. Dropping priority as we basically acknowledged this as the new normal, but the higher rate is still disturbing.
We seem to have seen a strong resurgence of this issue starting around 12/10 (see https://uma.googleplex.com/p/chrome/timeline_v2/?sid=e97a76a9ef483c693c8c599545d093ce).

It appears that the FileSystem.OriginDatabaseInit IO Errors in Chrome Stable for Windows machines have almost doubled over the past month.
 

Sign in to add a comment