Increase in IndexedDB open timeouts in Chrome 63
Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.24 Safari/537.36

Steps to reproduce the problem:
In Google Docs we're seeing our number of IndexedDB database open timeouts more than double with the Chrome 63 release. The database open request has a timeout of 10 seconds.

What is the expected behavior? What went wrong?
This issue was discovered by monitoring that is in place to track database open timeouts. Breaking down by browser version, it is clear that the timeouts have increased with the rollout of Chrome 63.

Did this work before? Yes
Does this work in other browsers? No

Chrome version: 63.0.3239.84
Channel: stable
OS Version: OS X 10.12.6
Flash Version:
Dec 15 2017
Could be related to bug 785305?
Dec 15 2017
dmurph@ - can you take a look?
Dec 15 2017
I'm fairly sure this is not related to bug 785305 -- the comments there say that we've had that issue since M50. Also, in case you're wondering (I was), we've re-enabled large value wrapping in https://crrev.com/c/765057 which landed in M64 (first shipped in 64.0.3268.0). The only other project that comes to mind is index tombstone deletion. However, it should only kick in on database close, and the CLs I've reviewed for it also landed in M64.
Dec 15 2017
Taking another pass through the data, this does not appear to be as significant an increase as originally reported. Instead of doubling, we appear to see about a 40-50% increase in errors in Chrome 63.
Dec 15 2017
The issue also appears to be specific to Linux and Mac; ChromeOS and Windows appear stable. Here is the breakdown of the timeout increase from Chrome 62 to Chrome 63 over the past 5 days:
ChromeOS - 3%
Linux - 107%
Mac - 82%
Windows - 6%
Dec 19 2017
OK, so our open times show that, from the browser side, things have not changed much: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=2e649c9eb5f59da2a6a1c41db26687a3 (the 99th-percentile time here is 180ms). So it seems like something is happening on the renderer side before we get the open request, or in the scheduling between the browser and the renderer.
Dec 19 2017
I'm working on a patch to improve this performance by not re-queuing all IDB response events, which should help a fair amount. We're also going to be refactoring this area soon (next year), which should help iron things out. I'll keep poking at this though.
Dec 19 2017
Thanks Dan, I figure any additional information I can provide here might help. Unfortunately we don't have anything in terms of a consistent repro case; we really just have reporting numbers. Looking at the past week (12/12 - 12/18), we continue to see the same trend on Mac and Linux. Overall increase:
ChromeOS - 2%
Linux - 111%
Mac - 145%
Windows - 8%
It's also worth noting that while Linux and Mac seem to have increased significantly, the error percentage (error count relative to page load count) varies significantly for each platform, in both 62 and 63. Below are the error percentage changes from 62 to 63, as well as the general usage per platform:
ChromeOS - 0.23% -> 0.23% (20% of all loads)
Linux - 0.078% -> 0.165% (1% of all loads)
Mac - 0.022% -> 0.052% (17% of all loads)
Windows - 0.058% -> 0.063% (61% of all loads)
Dec 20 2017
I did an investigation of the Blink IDB changes between 62 and 63, and I don't see anything that would affect open performance. Value wrapping may have been turned back on here? But that shouldn't matter at all for open. Question: what is your timeout number? I think I remember reading 10 seconds somewhere. This sounds like it could also be a scheduling issue, or a task starvation issue. I'm going to focus on making IDB schedule fewer tasks to hopefully alleviate this. I'll keep poking at any differences between 62 and 63 as well.
Dec 20 2017
Another question: is this measured until the database is opened fully? Does this account for the versionchange event? That could slow things down.
Dec 21 2017
We set our 10 second timeout after calling indexedDB.open, and the timeout is disposed when any of onsuccess, onerror, onblocked, or onupgradeneeded is called.
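For context, the monitoring is roughly the following sketch (openWithTimeout and reportOpenTimeout are illustrative TypeScript names, not our actual code, and real handlers would of course also do the normal success/upgrade work):

const DB_OPEN_TIMEOUT_MS = 10000;  // the 10 second timeout mentioned above

function openWithTimeout(name: string, version: number): void {
  const request = indexedDB.open(name, version);

  // Start the timer when open() is issued; report if it fires first.
  const timer = setTimeout(() => reportOpenTimeout(name), DB_OPEN_TIMEOUT_MS);
  const dispose = () => clearTimeout(timer);

  // The timeout is disposed as soon as any of these four callbacks fires.
  request.onsuccess = dispose;
  request.onerror = dispose;
  request.onblocked = dispose;
  request.onupgradeneeded = dispose;
}

function reportOpenTimeout(name: string): void {
  // Stand-in for the reporting pipeline behind the numbers in this thread.
  console.warn('IndexedDB open timed out after 10s: ' + name);
}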
Dec 22 2017
I'm merging the locking-up bug into this one, as the focus is on timeouts where we end up locking up.

I looked at the UMAs again, and it looks like there is an increase in the time for onerror to be sent for opens (but not onblocked/onsuccess) on the backend: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1f4a90c2e966675ce1dd6313bd4af35d

I combed through all backend changes in the diff from M62 to M63, and sadly nothing is sticking out to me. Some sweeper changes are there, and the metadata code location refactoring is in there, so I'm going to take a closer look at metadata reading error handling. But this still doesn't really give us info for a 'locked' state, where that time wouldn't be reported anyway.

Carrying over info from the other bug: a user reported a locked-up IDB and saw a lock file error message in their log. Lock file issues should pop up here: https://cs.chromium.org/chromium/src/content/browser/indexed_db/indexed_db_backing_store.cc?l=955 And we should be sending the error back here: https://cs.chromium.org/chromium/src/content/browser/indexed_db/indexed_db_factory_impl.cc?type=cs&l=592
Dec 22 2017
Investigating error code differences for Mac: it looks like we have a higher percentage of ERANGE errors on Mac as well, so I'll look at possible overflows around file offsets & lengths. https://uma.googleplex.com/p/chrome/histograms/?endDate=20171022&dayCount=14&histograms=PlatformFile.UnknownErrors.Posix&fixupData=true&showMax=true&filters=milestone%2Ceq%2C63%2Cplatform%2Ceq%2CM%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
Dec 22 2017
I think I found a potential cause: a maximum number of open file handles was introduced for leveldb, here: https://chromium-review.googlesource.com/c/chromium/src/+/654070 +cmumford I'm guessing that this could cause leveldb to perform extra slowly if there are a lot of files open elsewhere? This especially matters for opens, where all of the files are 'open on access' - I've seen these leveldb file reads be pretty small, especially while reading the log file and doing a compaction / recovery.
Jan 9 2018
Because this is Mac/POSIX only (and Windows has 'unlimited' file descriptors), I'm more sure that it's the above change. I'm going to add a flag to disable it, as well as reporting on how many files we open past our limit (and then close right away). This way we can confirm this is the issue when it happens on a dev machine. Plan is to:
* Add metrics and reporting
* Add an off-switch, which will be turned on by default in the worst case scenario
* Improve the eviction policy, hopefully using our knowledge of the leveldatabase file access patterns
Jan 17 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/67bb8ac2a1b5d4f0d9cacecf43000dd9f666510d

commit 67bb8ac2a1b5d4f0d9cacecf43000dd9f666510d
Author: Daniel Murphy <dmurph@chromium.org>
Date: Wed Jan 17 05:24:44 2018

[LevelDatabase] Add option to bypass file handle limits.

Bug: 795369
Change-Id: I4a4102f7094166b7ab55250a1727d60a1f69165f
Reviewed-on: https://chromium-review.googlesource.com/861341
Reviewed-by: Victor Costan <pwnall@chromium.org>
Commit-Queue: Daniel Murphy <dmurph@chromium.org>
Cr-Commit-Position: refs/heads/master@{#529622}

[modify] https://crrev.com/67bb8ac2a1b5d4f0d9cacecf43000dd9f666510d/third_party/leveldatabase/env_chromium.cc
Feb 16 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/2a9a8201b8c0b213aa41b6a390e9324113176f15

commit 2a9a8201b8c0b213aa41b6a390e9324113176f15
Author: Daniel Murphy <dmurph@chromium.org>
Date: Fri Feb 16 10:20:31 2018

[LevelDatabase] Switched file handle eviction to LRU (previously MRU)

During compaction each level is scanned at the same time, but each level is read in a sequential manner. This means there are files being read from every level at once. Files are read multiple times during scanning, first reading the end of the file and then contents inside. Because of this behavior, most-recently-used eviction is the worst choice:
1. The multiple reads within a file (in a non-sequential manner) cause a ton of filesystem thrashing.
2. Files are never re-read once they are finished being scanned in a level, so there is no reason to keep the handle around.

Least-recently-used eviction more closely models the reading behavior of a leveldb compaction, and this uses leveldb's sharded LRU cache.

File thrashing here is the suspected cause of the assigned bug, where leveldb never opens. This is a speculative but educated guess fix.

Bug: 795369
Change-Id: I7f4965a2a93b05ae558688e58d220b0fcf549430
Reviewed-on: https://chromium-review.googlesource.com/905575
Commit-Queue: Victor Costan <pwnall@chromium.org>
Reviewed-by: Victor Costan <pwnall@chromium.org>
Reviewed-by: Chris Mumford <cmumford@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537275}

[modify] https://crrev.com/2a9a8201b8c0b213aa41b6a390e9324113176f15/third_party/leveldatabase/env_chromium.cc
[modify] https://crrev.com/2a9a8201b8c0b213aa41b6a390e9324113176f15/third_party/leveldatabase/env_chromium.h
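To illustrate the eviction-policy point outside of Chromium code (a generic TypeScript sketch; HandleCache, FileHandle, openFile, and closeFile are made-up stand-ins, not the env_chromium.cc API): an LRU cache keeps the handles a compaction is still re-reading and evicts the ones it has finished scanning, whereas MRU evicts exactly the handle most likely to be needed again right away.

type FileHandle = { name: string };

class HandleCache {
  // A Map iterates in insertion order, so the first key is the least
  // recently used entry.
  private handles = new Map<string, FileHandle>();

  constructor(private readonly maxOpen: number) {}

  get(name: string): FileHandle {
    let handle = this.handles.get(name);
    if (handle) {
      // Refresh recency: move the entry to the back of the map.
      this.handles.delete(name);
    } else {
      handle = this.openFile(name);
      if (this.handles.size >= this.maxOpen) {
        // Evict the least recently used handle (the front of the map).
        const oldest = this.handles.keys().next().value as string;
        this.closeFile(this.handles.get(oldest)!);
        this.handles.delete(oldest);
      }
    }
    this.handles.set(name, handle);
    return handle;
  }

  // Stand-ins for real filesystem calls.
  private openFile(name: string): FileHandle { return { name }; }
  private closeFile(_handle: FileHandle): void {}
}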
Mar 5 2018
Marking as fixed - this should be fixed in M66.