Issue metadata
Sign in to add a comment
|
Metrics startup regression in 55.0.2868.0 |
||||||||||||||||||||||
Issue descriptionVersion: 55.0.2868.0 Canary OS: Windows The startup regressions affects all startups but I'm using WarmStartup because those are less noisy to repro. Timeline: https://uma.googleplex.com/timeline_v2?sid=68a3eb300afbf48364db8484559d90ca Timeline split by version clearly highlights 55.0.2868.0 as the culprit. The range for a single day is still pretty large: https://chromium.googlesource.com/chromium/src/+log/55.0.2867.0..55.0.2868.0?pretty=fuller&n=10000 It would be great to have a manual bisect of this performed this way for each test build: 1) Launch Chrome once to make it warm 2) Exit Chrome without recording any stat 3) Launch Chrome and record the value @ chrome://histograms/Startup.FirstWebContents.MainNavigationStart.WarmStartup 4) Compare this with pre-recorded stats on your machine for 55.0.2867.0 and 55.0.2868.0 on your machine to determine whether this is a "good" or "bad" value. I also suggest doing (3)/(4) a few times to get a good average and avoid incorrectly reporting a given bisect version as good/bad.
,
Sep 28 2016
Tested this on Windows 10 comparing b/w the chrome version: 55.0.2867.0 and 55.0.2868.0 but didn't observe any major spike averaged over 4 repetition of the below test steps.
Test steps followed:
========================================
1) Launched Chrome once to make it warm.
2) Exit Chrome without recording any stat.
3) Launched Chrome and navigated to @ chrome://histograms/Startup.FirstWebContents.MainNavigationStart.WarmStartup (doesn't show anything)
3.1> 'Stats accumulated from browser startup to previous page load; reload to get stats as of this page load' is shown instead.
4)Quit and Relaunched Chrome again.
5)Navigated to @ chrome://histograms/Startup.FirstWebContents.MainNavigationStart.WarmStartup and recorded 'average'
6)Repeated steps from 1-5 by uninstalling and reinstalling chrome.
Below is the test result:
OS: Windows 10 Average1 Average2 Average3 Average4 Average
55.0.2867.0 337 323 332 339 332.75
55.0.2868.0 330 338 345 341 338.5
gab@: Could you please take a look and confirm if anything missed in the above test steps.
Thanks in advance!
,
Sep 28 2016
For (3), it probably is that Chrome isn't fully warm on the second launch (surprising), you can see this via the suffix used @ chrome://histograms/Startup.FirstWebContents.MainNavigationStart 300ms is also surprisingly fast, can you try with a more realistic profile? e.g. hooked to sync with lots of history, cookies, bookmarks, etc.? (in general I'd recommend that QA be done from such profiles instead of blank profiles) From https://uma.googleplex.com/timeline_v2?sid=5f0e9d83753773c3851e4861580f507f the median regressed from ~950ms -> ~1150ms.
,
Sep 28 2016
,
Sep 29 2016
Given Windows dev released this on the 23rd: win,dev,55.0.2868.3,2016-09-23 18:33:02.137650 It sounds like it's a binary change in 2868. +wittman, can we compare the profiles between 2868 and 2867 and see if we see any difference?
,
Sep 29 2016
I don't see anything obvious in the browser process startups in aggregate: 2867: https://uma.googleplex.com/p/chrome/callstacks?sid=8fdfc06ed0e5699b53ae03711e7ae327 2868: https://uma.googleplex.com/p/chrome/callstacks?sid=67b8ccdd78217404d7ef84d3a77aa639 I tried looking at days post-release to focus on warm startups, but we had some symbolization failures last week that are affecting the execution frequency percentages and making it hard to reason about the data on a day-by-day basis.
,
Sep 29 2016
I did some random drilling down and found a difference - but I'd really like to see this supported by default in the tool. For example, base::IsOnNetworkDrive() while initializing pepper flash goes from 17ms to 26ms here: https://uma.googleplex.com/p/chrome/callstacks?sid=a6e609b296db4a1c6bf28946251e4edc https://uma.googleplex.com/p/chrome/callstacks?sid=32e2e5014b80d1013df9de1107784afe However, again this is nowhere near the big difference in the overall metric. :(
,
Sep 29 2016
Well, I glanced more at the sampling profiler data and really I don't see any smoking gun. It really feels like the last startup regression - where everything seemed a bit slower (with some being a bit more disproportionately slower than others like what I mentioned in comment 7). I checked the process creation UMA metric to see if we could be creating more processes that are slowing everything down - like was the cast last time - but didn't seem to be the case. (The good news is my browsing of sampling profiler data I found a few cases where we're doing some work on startup that we probably shouldn't be - that might be low hanging fruits to help. Unrelated to this, but I'll follow up on those with separate bugs.)
,
Sep 29 2016
Everything being slower was my impression as well. I was able to reprocess the data for individual days; looking at the second day of release for both releases shows a 440ms overall slowdown which is more heavily distributed towards the pre-message-loop code. 2867 9/22: https://uma.googleplex.com/p/chrome/callstacks?sid=3da9f8ae8781d283cdc9a9cbee0a6e09 2868 9/23: https://uma.googleplex.com/p/chrome/callstacks?sid=f6dcdb4a069a55074d7e5e2067333eb0
,
Sep 29 2016
None of Startup.LoadTime.ProcessCreateToDllMain2.WarmStartup Startup.BrowserMessageLoopStartTimeFromMainEntry2.WarmStartup Startup.BrowserMainToRendererMain.WarmStartup appear to be affected so it's likely something between message loop starting and main navigation starting (but not BrowserMainToRendererMain) and if it's systemic that leaves quite a bit of options opened still!
,
Sep 29 2016
This bug is reported as M55 Beta blocker.Please try to resolve this before M55 branch on Oct 6th,2016 so it has enough baking time in Dev.
,
Sep 29 2016
gab@, I am able to reproduce this issue on chrome#55.0.2868.0 for Win7/10 64-bit OS, however repro is not consistent enough to run a bisect. Are there any others checks i can perform while reproducing this? I am seeing the record (chrome://histograms/Startup.FirstWebContents.MainNavigationStart.WarmStartup) values in the range of 1000 - 5000 ms. Thank you!
,
Sep 30 2016
When you say "able to repro" do you mean you can see it being on average slower on 2868 than 2867? If so you can bisect by running multiple times for each test build?
,
Sep 30 2016
Yes, on 2868 the average range is 1000 - 5000ms however on 2867 i didn't see the case where it exceeds 1000ms on most of the chrome launches. I tried to run the bisect script (launching each chrome revision 5 times), however i am ending up with the wrong suspect (means i am seeing different good/bad revisions each time i run the bisect). Let me try one more time and see if that helps. Thank you!
,
Oct 3 2016
Thanks gab@ and mano@ for the additional info. Can confirm the average values on bisect builds as mentioned in the C#14. I am able to reproduce the issue with my corp account having lots of bookmarks,histories, cookies. Bisect result: ============== https://chromium.googlesource.com/chromium/src/+log/962fe81d78f08b5776f9fa658a301c2e2cf43bf4..30ebb94f3ea635cf436d4c6e29c67f834b6b6c4f jam@: Could you please take a look at this and help in further investigation. Thank you!
,
Oct 3 2016
Ah indeed, I should have thought about it as I was even notified that this CL might affect that very metric but had forgotten when notified of the regression a few days later... So it's not a regression in startup per se, it's a change in the way the metric is computed. It appears this CL adds *very slow* WebContents to the mix of profiled first web contents, the 95th percentile of the metric highlights this quite well, going from 6s to 326s for Startup.FirstWebContents.MainNavigationStart.WarmStartup https://uma.googleplex.com/timeline_v2?sid=cd9c49eaf7b74b8724fd14310861c5d0 The data volume however hasn't changed so it's not because we're reporting more profiles. Perhaps this CL makes us miss a signal and we only get it much later (e.g. on next navigation)?
,
Oct 3 2016
I'm not sure I understand gab's comments about "this CL adds *very slow* WebContents to the mix of profiled first web contents"? Can you elaborate? The change moved the watching to earlier, so I doubt this would make us miss a signal?
,
Oct 3 2016
Well somehow the 95th percentile jumped more than 50X (from 6s to 326s), so there are some WebContents for which the timing is now way off. I initially thought those were "new" but after looking at the volume which hasn't changed I suspect some web contents instead don't report until much later (I forgot to rephrase the quoted paragraph after verifying volume hence the slight discrepancy in wording you highlighted).
,
Oct 3 2016
Right I understand the metric went up, although we all agree that the specific change (r420175) didn't affect the timings. The hypothesis in comment 16 seems like the most plausible solution, i.e. that where I moved FirstWebContentsProfiler construction to misses some codepaths and so we only get it for the second navigation after an arbitrary amount of time of the user reading the first page. I did code inspection and it looks like in the session restore case is the culprit. I'm working on a fix.
,
Oct 4 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426 commit ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426 Author: jam <jam@chromium.org> Date: Tue Oct 04 19:24:14 2016 Fix FirstWebContentsProfiler not getting created when using session restore. This regressed in r420175 because a different code path is taken that doesn't go through browser_navigator.cc's CreateTargetContents. BUG= 650349 Review-Url: https://codereview.chromium.org/2388803003 Cr-Commit-Position: refs/heads/master@{#422883} [modify] https://crrev.com/ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426/chrome/browser/metrics/first_web_contents_profiler.cc [modify] https://crrev.com/ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426/chrome/browser/metrics/first_web_contents_profiler.h [modify] https://crrev.com/ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426/chrome/browser/sessions/session_restore.cc [modify] https://crrev.com/ede1c0cf9ffbbace44aa77d1bb3c3abc6a460426/chrome/browser/ui/browser_navigator.cc
,
Oct 6 2016
The fix is working fine on the last(55.0.2881.5 -64 bit PGO) and the latest canary(55.0.2882.0 - 32bit PGO) as tested on Windows-10. 55.0.2881.5 =========== 397+372+370+367+373=375.8 55.0.2882.0 =========== 345+350+375+371+334=355.0 Adding the verified label therefore. jam@: Please close the issue if there is no further work to be done here. Thank you!
,
Oct 6 2016
,
Oct 17 2016
The metric has not recovered : https://uma.googleplex.com/timeline_v2?sid=6879659bcf6c3f0c2ee38880c5e106a7 (removing ReleaseBlock though since we know it's only a measurement error, not a true regression) @ajha: what was your verification?
,
Oct 19 2016
Attached is the screen-cast of the verification steps performed to verify the issue and same steps performed for bisect as updated in C#15. The average range over 5 runs is < 1000ms as observed in good build in C#14 and C#15. Steps followed: =============== 1. Launched chrome and signed into chrome using account with bookmarks/history/cookies(corp account) 2. Once the account was synced, closed and relaunched the chrome. 3. Navigated to chrome://histograms/Startup.FirstWebContents.MainNavigationStart.WarmStartup and observed the mean. 4. Tried the same 4/5 times by quitting/relaunching chrome and navigating to the URL. 5. Observed the mean averaged under < 1000 ms over 5 runs. gab@: Please review the attached screen-cast and confirm if anything being missed there.
,
Oct 20 2016
Hmmm the "mean" of a single version doesn't mean anything, you need to figure out the mean of a version with the issue and one without on the same machine to see whether it changed anything. Either way, timeline clearly shows that this isn't fixed. Will mark as blocking M55 Stable (6 weeks to go) as fake regressions could still hide real regressions and that's undesired.
,
Oct 20 2016
Are we certain that it's a measurement change caused by that CL or is there still doubt that it's the cause and not something else? If there's still doubt, could we revert it or disable to verify if it recovers?
,
Oct 20 2016
+benhenry who's a TPM who's in charge of system health / performance in Chrome
,
Oct 24 2016
Removing TE-Verified labels, as per comment# 26
,
Oct 24 2016
re revert: we need to figure out which code path is being missed and is therefore causing the metric to be higher. reverting won't help because we do need this change to switch to PlzNavigate, and we we'd be at square 1 otherwise. how important is this metric btw? has it caught any real regressions in the past?
,
Oct 24 2016
So we're sure it's caused by that change and not by anything else in the range? If so, I agree no need to revert - I just don't want us to end up doing a bunch of work focusing on that change to eventually figure out it's something else entirely. Re: importance of this metric - I believe the specific one Gab linked to in the original comment is a breakdown of Startup.FirstWebContents.NonEmptyPaint2 which also shows the regression (https://uma.googleplex.com/timeline_v2?sid=b933e19bc388b2d7c2db8052a93e464f) - and that metric is the main desktop startup metric we use - and was used to find e.g. go/m51-startup-regression-postmortem
,
Oct 24 2016
@asvitkine: yeah it's pretty certain that this change is what caused it, since the metric changed right when it landed and and the change was the only one that touched that code. I've been looking at these code paths and I might just change how it's done. The issue is that the implementation before my change didn't work with PlzNavigate. My attempt at updating it to work with PlzNavigate depended on hooking all the places that could add an active WebContents to the tab strip, but clearly even with the above fix we're missing some. Instead of trying to play whack-a-mole with different code paths, I'll switch back to the original approach. However that didn't work with PlzNavigate because WebContentsObserver::DidStartNavigation was called before that point. So if we did miss a DidStartNavigation call, I'll use the time from NavigationHandle::NavigationStart() which is when the navigation started. This does have some subtle differences from previous implementation for non-PlzNavigate: this will be the time that the renderer started the navigation, as opposed to currently when the browser received the IPC. Unless the system is extremely overloaded and there are large delays from renderer to browser UI thread, this should be about the same. We'll verify when the cl lands.
,
Oct 24 2016
(changing the summary to indicate that this isn't a real startup regression)
,
Oct 24 2016
OK, thanks for confirming - your proposed change SGTM.
,
Oct 24 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c commit f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c Author: jam <jam@chromium.org> Date: Mon Oct 24 20:58:25 2016 A different approach to fixing FirstWebContentsProfiler with PlzNavigate. The previous fix depended on ensuring we add hooks to all the code paths that cause the first active WebContents to get added to the tab strip. We're still missing some code paths. So instead restore the old behavior. Even though it did miss navigation starts, fix that by using the time from NavigationHandle for PlzNavigate. This should restore the metrics for non-PlzNavigate case. For PlzNavigate, the start timings were already different regardless from non-PlzNavigate because of how navigations are structured very differently. This change shouldn't affects things too much for PlzNavigate though, because the only difference should be the IPC time from the renderer to the browser thread. BUG= 650349 Review-Url: https://codereview.chromium.org/2448553002 Cr-Commit-Position: refs/heads/master@{#427144} [modify] https://crrev.com/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c/chrome/browser/metrics/chrome_browser_main_extra_parts_metrics.cc [modify] https://crrev.com/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c/chrome/browser/metrics/first_web_contents_profiler.cc [modify] https://crrev.com/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c/chrome/browser/metrics/first_web_contents_profiler.h [modify] https://crrev.com/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c/chrome/browser/sessions/session_restore.cc [modify] https://crrev.com/f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c/chrome/browser/ui/browser_navigator.cc
,
Oct 26 2016
**** Bulk edit - please ignore if not applicable **** A friendly reminder that M55 Stable is launch is coming soon! Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and get it merged into the release branch ASAP so it gets enough baking time in Beta (before Stable promotion). Thank you!
,
Oct 26 2016
Looks like the latest canary graphs have recovered.
,
Oct 26 2016
Your change meets the bar and is auto-approved for M55 (branch: 2883)
,
Oct 26 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e9a4b4a1a46d33af714fd1a0821bda65734b87cc commit e9a4b4a1a46d33af714fd1a0821bda65734b87cc Author: John Abd-El-Malek <jam@chromium.org> Date: Wed Oct 26 23:11:08 2016 A different approach to fixing FirstWebContentsProfiler with PlzNavigate. The previous fix depended on ensuring we add hooks to all the code paths that cause the first active WebContents to get added to the tab strip. We're still missing some code paths. So instead restore the old behavior. Even though it did miss navigation starts, fix that by using the time from NavigationHandle for PlzNavigate. This should restore the metrics for non-PlzNavigate case. For PlzNavigate, the start timings were already different regardless from non-PlzNavigate because of how navigations are structured very differently. This change shouldn't affects things too much for PlzNavigate though, because the only difference should be the IPC time from the renderer to the browser thread. BUG= 650349 Review-Url: https://codereview.chromium.org/2448553002 Cr-Commit-Position: refs/heads/master@{#427144} (cherry picked from commit f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c) Review URL: https://codereview.chromium.org/2450363002 . Cr-Commit-Position: refs/branch-heads/2883@{#321} Cr-Branched-From: 614d31daee2f61b0180df403a8ad43f20b9f6dd7-refs/heads/master@{#423768} [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/chrome_browser_main_extra_parts_metrics.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/first_web_contents_profiler.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/first_web_contents_profiler.h [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/sessions/session_restore.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/ui/browser_navigator.cc
,
Oct 27 2016
Tested the issue on Windows 10 using chrome dev version #56.0.2902.0 as per comment #0. Attaching a screen cast for reference. jam@ - Could you please verify the screencast and please let us know if it is the expected behavior. Thanks...!!
,
Oct 27 2016
The fix is verified by looking at the uma stats.
,
Oct 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e9a4b4a1a46d33af714fd1a0821bda65734b87cc commit e9a4b4a1a46d33af714fd1a0821bda65734b87cc Author: John Abd-El-Malek <jam@chromium.org> Date: Wed Oct 26 23:11:08 2016 A different approach to fixing FirstWebContentsProfiler with PlzNavigate. The previous fix depended on ensuring we add hooks to all the code paths that cause the first active WebContents to get added to the tab strip. We're still missing some code paths. So instead restore the old behavior. Even though it did miss navigation starts, fix that by using the time from NavigationHandle for PlzNavigate. This should restore the metrics for non-PlzNavigate case. For PlzNavigate, the start timings were already different regardless from non-PlzNavigate because of how navigations are structured very differently. This change shouldn't affects things too much for PlzNavigate though, because the only difference should be the IPC time from the renderer to the browser thread. BUG= 650349 Review-Url: https://codereview.chromium.org/2448553002 Cr-Commit-Position: refs/heads/master@{#427144} (cherry picked from commit f4ba53f555685c67fc3fb4ffdb18c4ff7baea19c) Review URL: https://codereview.chromium.org/2450363002 . Cr-Commit-Position: refs/branch-heads/2883@{#321} Cr-Branched-From: 614d31daee2f61b0180df403a8ad43f20b9f6dd7-refs/heads/master@{#423768} [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/chrome_browser_main_extra_parts_metrics.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/first_web_contents_profiler.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/metrics/first_web_contents_profiler.h [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/sessions/session_restore.cc [modify] https://crrev.com/e9a4b4a1a46d33af714fd1a0821bda65734b87cc/chrome/browser/ui/browser_navigator.cc
,
Oct 28 2016
As per comment #41, adding the verified labels. Thanks...!!
,
Oct 28 2016
Should we wait until the graph fully recovers before marking as verified? Looking at the data I see the latest data point at 2480 whereas before the regression it was around ~2000: https://uma.googleplex.com/timeline_v2?sid=23ed380f261b7a47464724488125fb42 But maybe there's a different view that shows the recovery more clearly?
,
Oct 28 2016
The 1-day view shows the full recovery already. I have a reminder set up to look at the 3-day graph on Monday.
,
Nov 4 2016
Verified. The full timeline hasn't fully recovered to previous values yet [1] but this is because of old versions with the issue still being mixed in the aggregate values. Filtering out by "dominant version" [2] shows a full recovery. [1] https://uma.googleplex.com/timeline_v2?sid=e0704189b928be1582f4159f1dc75974 [2] https://uma.googleplex.com/timeline_v2?sid=0c523f0ff99e4674c53e5d687bb1a338
,
Nov 4 2016
[Automated comment] removing mislabelled merge-merged-2840 |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by gab@chromium.org
, Sep 26 2016