Renderer CPM higher on Windows Stable starting in M64
Issue description

Chrome Version: 64.0.3282.39
OS: Windows

Renderer CPM increased significantly around Dec 16th and was still high by the end of the month, jumping from around 120 CPM to roughly 155 CPM, almost a 30% increase. https://uma.googleplex.com/p/chrome/timeline_v2?sid=12bfb5044a2c0cf51e7462676f878c18
Jan 9 2018
This renderer spike is an aggregation of all "beta" milestones starting with M62. If you select M64 alone, you see a dip in the renderer CPM. M64 Beta: https://uma.googleplex.com/p/chrome/timeline_v2?sid=f43b0f1504993caadd4eafbb563ee069 >=M62 Beta: https://uma.googleplex.com/p/chrome/timeline_v2?sid=3bb0f03ee7c90dfccf54b4b5b5f302ea
Jan 10 2018
More analysis here: https://docs.google.com/document/d/1AVlTmcrSxYwk0rxQqjSWD0FQtqiLcPh1a2N_8XOu1qE/edit The main difference between the renderer crashes on M63 and M64 Beta is a spike of about 10% in the "[Out of Memory] base::PartitionBucket::SlowPathAlloc" magic signature. crbug/788293. Regarding #2: since the pageloads drop after the version update, I am not sure we can be confident about the data. https://uma.googleplex.com/timeline_v2?sid=642990975f2ba3e96aaf438a94a58e96 Is there a way to see what experiments may have been turned on and may be causing the increase in renderer CPM?
Jan 11 2018
Jan 16 2018
Jan 16 2018
Jan 16 2018
Jan 16 2018
Adding this week's V8 stability and memory sheriffs.
Jan 17 2018
Looks like there's a 50% bump in sandbox OOM kills on 64-bit around this time too. https://uma.googleplex.com/p/chrome/timeline_v2/?sid=234d5838a2d54c6b203bc5df37a581e6
Jan 18 2018
Dev also increases around the same time and version: https://uma.googleplex.com/p/chrome/timeline_v2?sid=f71bf72d9456c459b3dbd45d5297e7ce 65 looks normal: https://uma.googleplex.com/p/chrome/timeline_v2?sid=f1ffdba3a3215fba0f8da093fe628b6d Long shot: was some Finch experiment rolled out/ramped up on 64 only? Maybe site isolation related?
Jan 18 2018
Jan 18 2018
Jan 18 2018
I've prepared a patch to show how to disable address space reservation on 32-bit Windows. It's a speculative fix that we may want to merge to M64 if we can't identify the cause of the OOMs. It should merge cleanly, and I think it's a safe change. I would expect this to help allocation failures that don't involve base/allocator/partition_allocator/page_allocator. Note that partition_alloc uses page_allocator (see comment #3). https://chromium-review.googlesource.com/c/chromium/src/+/874755
Jan 18 2018
Jan 19 2018
I've attached vmmap (available as part of the Sysinternals suite at https://docs.microsoft.com/en-us/sysinternals/downloads/sysinternals-suite) recordings of slashdot.org from M63 and M65. The bug is quite obvious once you know what to look for. If we look at the Private Data line, we can see that Private WS and Committed are both lower in M65 compared to M63; however, Size (reserved address space) is higher. If we select the Private Data line and then sort the bottom data by Size, the huge number of allocations with 960 K as their size and 512 K as their Committed/Private/Total WS/Private WS becomes obvious. These 960 K regions also contain two blocks - one committed and one reserved. On M63 these regions have a size of 512 K and a single block. A vmmap capture of a build with the fix will make it obvious whether the fix worked or not. I don't know why the private committed data is lower in M65, but that's good.
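To make the pattern concrete, here is a minimal sketch - not Chromium or V8 code - of how an aligned reservation on Windows can end up as a 960 K reserved region with only 512 K committed. The 512 K request size, 512 K alignment, and 64 K allocation granularity are assumptions chosen to match the block sizes seen in the vmmap capture.

#include <windows.h>
#include <cstdint>

// Reserve enough extra space that an aligned sub-range must exist, then
// commit only the aligned portion. The padding is never released, which is
// what shows up in vmmap as a 960 K region containing one 512 K committed
// block plus a reserved remainder.
void* ReserveAlignedWithoutTrimming(size_t size, size_t alignment) {
  const size_t kGranularity = 64 * 1024;               // Windows allocation granularity.
  size_t padded = size + (alignment - kGranularity);   // 512 K + 448 K = 960 K.
  uint8_t* base = static_cast<uint8_t*>(
      ::VirtualAlloc(nullptr, padded, MEM_RESERVE, PAGE_NOACCESS));
  if (!base)
    return nullptr;
  uintptr_t aligned =
      (reinterpret_cast<uintptr_t>(base) + alignment - 1) & ~(alignment - 1);
  // Commit only the aligned 512 K the caller asked for; the rest of the
  // reservation stays reserved but uncommitted.
  return ::VirtualAlloc(reinterpret_cast<void*>(aligned), size, MEM_COMMIT,
                        PAGE_READWRITE);
}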
Jan 19 2018
Jan 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4

commit a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4
Author: Bill Budge <bbudge@chromium.org>
Date: Fri Jan 19 10:17:02 2018

[memory] Change OS::Allocate on Windows to not over allocate.

- Changes OS::Allocate to first try an exact size aligned allocation, then padded allocations. All padded allocations should be trimmed.

Bug: chromium:800511
Change-Id: Iccab2eddbf2a3b08d2b83b95f96c766c9fad7a82
Reviewed-on: https://chromium-review.googlesource.com/875242
Reviewed-by: Hannes Payer <hpayer@chromium.org>
Commit-Queue: Bill Budge <bbudge@chromium.org>
Cr-Commit-Position: refs/heads/master@{#50706}

[modify] https://crrev.com/a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4/src/base/platform/platform-cygwin.cc
[modify] https://crrev.com/a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4/src/base/platform/platform-win32.cc
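As a rough illustration of the approach described in the commit message - a sketch under assumptions, not the actual V8 OS::Allocate implementation - the idea is to try an exact-size reservation first and, when padding is needed, release the padded reservation and re-reserve only the aligned range so no extra address space is kept:

#include <windows.h>
#include <cstdint>

// Sketch of "try exact, then padded-and-trimmed" aligned allocation.
void* AllocateAlignedTrimmed(size_t size, size_t alignment) {
  // 1. Exact-size attempt. Reservations are 64 K granular, so the result is
  //    often already aligned well enough.
  void* base = ::VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE);
  if (base) {
    if (reinterpret_cast<uintptr_t>(base) % alignment == 0)
      return base;
    ::VirtualFree(base, 0, MEM_RELEASE);
  }
  // 2. Padded attempt: find an aligned address inside a larger reservation,
  //    release the whole reservation, then re-reserve exactly
  //    [aligned, aligned + size) so the padding is trimmed. Another thread
  //    can grab the range between the free and the re-reserve, hence the
  //    retry loop.
  for (int attempt = 0; attempt < 3; ++attempt) {
    void* padded = ::VirtualAlloc(nullptr, size + alignment, MEM_RESERVE,
                                  PAGE_NOACCESS);
    if (!padded)
      return nullptr;
    uintptr_t aligned =
        (reinterpret_cast<uintptr_t>(padded) + alignment - 1) & ~(alignment - 1);
    ::VirtualFree(padded, 0, MEM_RELEASE);
    base = ::VirtualAlloc(reinterpret_cast<void*>(aligned), size,
                          MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (base)
      return base;
  }
  return nullptr;
}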
Jan 19 2018
Looks like this was the likely root-cause of issue 799837.
Jan 19 2018
I agree. My "fix" helped by releasing address space, but didn't address the root cause.
Jan 19 2018
Looks like we need to check that the fix in #17 looks good in the vmmap for the next Canary, then merge this to M64. I'm assigning to Bill since it looks like he's best placed to make sure this happens.
Jan 20 2018
I verified that allocations which previously weren't trimmed (512K committed with 960K reserved) are now trimmed. The browser seemed stable with a lot of clicking around. V8 team are cc'ed to assist with merging if needed. Thanks everyone for your help fixing this.
Jan 20 2018
This bug requires manual review: We are only 2 days from stable. Please contact the milestone owner if you have questions. Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Jan 20 2018
I think the fix is in v8 6.6.1 which is https://chromium.googlesource.com/v8/v8/+log/refs/heads/6.6.1 and is not yet on Canary but should be in 66.0.3326.0 (and it will also need an M65 merge). Since we won't be cutting a new Beta until Monday (at the earliest) I vote for giving this 24hrs or so on M66 Canary, then merging to M65 then M64 on Monday morning.
Jan 20 2018
This is already merged to 6.5. https://chromium.googlesource.com/v8/v8.git/+log/6.5-lkgr
Jan 20 2018
To clarify, I tested a pre-release Canary build, 65.0.3325.3, 32 bit on Windows.
Jan 20 2018
Approving merge for M64 (conditional on it being well tested and verified in Canary with a weekend's worth of data to ensure no issues, regressions, or stability concerns).
Jan 22 2018
hpayer@ - can you please merge this to the M64 V8 branch? bbudge@ is OOO. (Marking it as assigned until the fix has landed in 6.4.)
Jan 22 2018
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/66948e96674bfccad51219e29e3650957b714099

commit 66948e96674bfccad51219e29e3650957b714099
Author: Hannes Payer <hpayer@chromium.org>
Date: Mon Jan 22 15:06:58 2018

Merged: [memory] Change OS::Allocate on Windows to not over allocate.
Revision: a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4

BUG=chromium:800511
LOG=N
NOTRY=true
NOPRESUBMIT=true
NOTREECHECKS=true
R=mlippautz@chromium.org

Change-Id: I59b42759b36af19824acc88742c498736725cda5
Reviewed-on: https://chromium-review.googlesource.com/878621
Reviewed-by: Michael Lippautz <mlippautz@chromium.org>
Cr-Commit-Position: refs/branch-heads/6.4@{#80}
Cr-Branched-From: 0407506af3d9d7e2718be1d8759296165b218fcf-refs/heads/6.4.388@{#1}
Cr-Branched-From: a5fc4e085ee543cb608eb11034bc8f147ba388e1-refs/heads/master@{#49724}

[modify] https://crrev.com/66948e96674bfccad51219e29e3650957b714099/src/base/platform/platform-cygwin.cc
[modify] https://crrev.com/66948e96674bfccad51219e29e3650957b714099/src/base/platform/platform-win32.cc
Jan 22 2018
Jan 22 2018
Jan 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/8fa2c8d2189b7847ce46b05531b52d7091487b96

commit 8fa2c8d2189b7847ce46b05531b52d7091487b96
Author: Michael Hablich <hablich@chromium.org>
Date: Wed Jan 24 15:37:48 2018

Merged: [memory] Change OS::Allocate on Windows to not over allocate.
Revision: a9aeaa65e654a1590695f82dc80c1ce9e4c02ef4

BUG=chromium:800511,805439
LOG=N
NOTRY=true
NOPRESUBMIT=true
NOTREECHECKS=true
R=machenbach@chromium.org

Change-Id: Ic048861e0ce04dea0ee7ec2ee71112c29ff11506
Reviewed-on: https://chromium-review.googlesource.com/883803
Reviewed-by: Michael Hablich <hablich@chromium.org>
Cr-Commit-Position: refs/branch-heads/6.5@{#7}
Cr-Branched-From: 73c55f57fe8506011ff854b15026ca765b669700-refs/heads/6.5.254@{#1}
Cr-Branched-From: 594a1a0b6e551397cfdf50870f6230da34db2dc8-refs/heads/master@{#50664}

[modify] https://crrev.com/8fa2c8d2189b7847ce46b05531b52d7091487b96/src/base/platform/platform-cygwin.cc
[modify] https://crrev.com/8fa2c8d2189b7847ce46b05531b52d7091487b96/src/base/platform/platform-win32.cc
Jan 29 2018
Unfortunately, looking at this more closely, it's clear to me from overall M64 beta stability that the root cause of the increase in CPM is hangs (WAIT_TIMEOUT or RESULT_CODE_HUNG), not OOMs. https://uma.googleplex.com/p/chrome/timeline_v2?sid=209097b90d18dc30dadc4d0bdb2967f6 These hangs are being investigated in https://bugs.chromium.org/p/chromium/issues/detail?id=806661. It seems the reduction later on in the beta promotion was due to RESULT_CODE_KILLED_BAD_MESSAGE mostly vanishing.
Jan 29 2018
Sorry, to clarify: it seems the OOM fix in #17 did fix 32-bit (and helped a ton), but 64-bit is still suffering mostly from hangs...
Jan 29 2018
Jan 30 2018
The hangs may be: https://bugs.chromium.org/p/chromium/issues/detail?id=793428
Jan 30 2018
Feb 1 2018
wfh@, friendly ping to get an update on this issue, as its priority changed to P0 and it's a stable blocker for M65. Thanks in advance!
Feb 1 2018
Hi jmukthavaram@ - we're still actively investigating this.
Feb 1 2018
Feb 1 2018
Feb 1 2018
As mentioned in b/806661, I looked into the UMA data. Both WAIT_TIMEOUT and STATUS_ACCESS_VIOLATION have serious regressions that could be root causes of the renderer regression.
Feb 1 2018
This regression has graduated to stable.
Feb 1 2018
Feb 1 2018
Feb 8 2018
FWIW, looking at renderer_launch_count and comparing with page_load_count, it seems M64 does have an increase in the average number of renderer launches per page load, possibly naturally as a result of site isolation.

Renderer launches per 100 page loads (http://shortn/_0jurDWftEA):
M61 - 7
M62 - 6.6
M63 - 7.1
(avg for M61-M63 is 6.9)
M64 - 8.6 (24% increase)

This could explain some of the overall increase, but when slicing renderer_crash_count over renderer_launch_count there still appears to be around an 18% overall increase in M64 (http://shortn/_slROUHfQFI)
Feb 12 2018
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Feb 12 2018
Feb 13 2018
M65 Stable promotion is coming VERY soon. Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and request a merge into the release branch ASAP. Thank you.
Feb 15 2018
Renderer CPM for the latest M64 spin, 64.0.3282.167/168, on 64-bit seems far lower:

63.0.3239.132 - 85 CPM
64.0.3282.140 - 291 CPM
64.0.3282.167/168 - 100 CPM

Also, renderer CPM on M65 beta seems lower too. It seems a fix landed between .140 and .167 that un-regressed this, or the population changed somehow. There doesn't appear to be a discernible change in crash/ signatures between .140 and .167, which indicates some kind of (large) data gap.
Feb 21 2018
Just to update on the latest behavior of this issue, we are observing the CPM below for the renderer process on both the M64 and M65 channels on Windows.

M64:
63.0.3239.132 - 115.954 CPM
64.0.3282.140 - 146.898 CPM
64.0.3282.167 - 132.595 CPM
64.0.3282.168 - 124.136 CPM

M65:
65.0.3325.51 - 120.474 CPM
65.0.3325.73 - 117.958 CPM

Thanks!
Feb 21 2018
M65 Stable promotion is coming VERY soon. Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and request a merge into the release branch ASAP. Merge has to happen latest by 4:00 PM PT Monday (02/26/18) in order to make it to last M65 beta release next week. Thank you.
Feb 26 2018
Gentle ping! This is marked as a stable blocker for M65. Please merge the fix ASAP, as M65 stable promotion is scheduled for tonight. Thanks!
Feb 27 2018
There is no new fix yet and M65 stable promotion is coming next week. Renderer CPM on M65 beta seems lower per comments #49 and #50. Should we still consider this a blocker for M65?
Feb 27 2018
Yes, this should still be a blocker. Looking at 14-day aggregated renderer CPM numbers for milestones (at peak data volume over a sliding 14-day window), I see:

M63: 105.856 (data vol: 3.1 billion)
M64: 122.253 (data vol: 2.8 billion)
M65: 126.721 (data vol: 2.6 billion)

Source: https://uma.googleplex.com/p/chrome/timeline_v2?sid=63f098b254ee189c60d693566e019d29

Given that, for the last regression on M64, stable turned out to be worse than beta (in terms of % regression), I do not have any data to conclude that this is fixed on M65 (yet).
Mar 2 2018
I just went through the top 100 renderer crashes. The main new crashes since M63 are issue 793887 and issue 809784. I commented on issue 793887 and I'm hoping we can reduce the number of crashes reported there. Issue 809784 is still a bit concerning and may need more investigation. I also have the feeling that the number of OOMs is proportionally increasing; do we have any metrics that track that specifically?
Mar 2 2018
Mar 2 2018
Thanks for looking at this. We do track the OOMs (from renderers) - they are also in CrashExitCodes.Renderer. On 32-bit they are all in the "Out of Memory" bucket, and on 64-bit they are in SUM("Out of Memory" and SBOX_FATAL_MEMORY_EXCEEDED).
They can also be tracked via OOM sad tabs which are Tabs.SadTab.OomCreated
There's a ton of data on M64 here -> http://shortn/_kaWYXYeYt0
TL;DR: OOMs actually seem *down* to me, with plain old crashes (exceptions) up, and WAIT_TIMEOUT up massively. It's interesting that you believe OOMs are going up on crash/ - possibly because we are losing some large proportion of crashes, so proportionally the OOMs are increased...?
Mar 2 2018
Something to note about issue 809784 is that crashes with that magic signature used to be lumped in with "v8::internal::`anonymous namespace'::Invoke", which includes any crash in V8 generated code. So a spike there isn't necessarily "new".
Mar 5 2018
Comment #55 should've said issue 793277 instead of 793887.
Mar 5 2018
I compared the data on crash with the UMA data we have, looking at the 14-day aggregate (https://uma.googleplex.com/p/chrome/histograms/?endDate=20180303&dayCount=14&histograms=CrashExitCodes.Renderer&fixupData=true&showMax=true&filters=channel%2Ceq%2C4%2Csimple_version%2Ceq%2C64.0.3282.186%2Cplatform%2Ceq%2CW%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial and https://uma.googleplex.com/p/chrome/histograms/?endDate=20180303&dayCount=14&histograms=CrashExitCodes.Renderer&fixupData=true&showMax=true&filters=channel%2Ceq%2C4%2Csimple_version%2Ceq%2C63.0.3239.132%2Cplatform%2Ceq%2CW%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial and https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20AND%20reporttime%20%2F%201000%20%3E%201518825600%20AND%20reporttime%20%2F%201000%20%3C%201520035200%20AND%20expanded_custom_data.ChromeCrashProto.ptype%3D%27renderer%27&compProp=product.Version&v1=63.0.3239.132&v2=64.0.3282.186#-magicsignature:30,-magicsignature2:30,-stablesignature:30,+crashreason:1000). I've added the data to the spreadsheet http://shortn/_kaWYXYeYt0 , and indeed there doesn't seem to be a significant increase in OOMs on the renderer process. There's a significant increase in OOMs on the browser process (7% to 11%).
Mar 5 2018
Thanks for that data, lfg. As part of this bug I have only looked at renderer data and have not looked at browser data at all (I rarely look at browser; I think siggi looks at browser far more than I do). I'm not even sure of the best way to analyze browser memory usage... or the number of OOMs (c.f. issue 667354) - there is a UMA metric, Stability.BrowserExitCodes, but I am not sure how much I trust it, as looking at http://shortn/_FuuExyJRgM and comparing SUM(anything other than normal exit) with the value of "crashes" in the stability counts shows dramatically different values... I suggest we track any browser regressions in a separate bug to keep this one about the renderer.
Mar 8 2018
I know of one bug that seems to be hitting quite a few Google employees and probably others, causing crashes that evade crashpad. The problem is tracked in crbug.com/792289 and it is heap corruption inside Microsoft's crypto code. When the heap detects heap corruption it does a fast-fail and no crash is recorded. The dates don't line up particularly well so it is probably unrelated, but... If nothing else, heap corruption is one cause of crashes that crashpad cannot catch.
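For anyone unfamiliar with why these crashes never show up: a fast-fail terminates the process immediately without dispatching to structured exception handling, so a registered unhandled-exception filter (which is roughly how crash reporters hook in) never runs. A tiny standalone illustration, not Chrome code:

#include <windows.h>
#include <intrin.h>
#include <cstdio>

LONG WINAPI MyFilter(EXCEPTION_POINTERS*) {
  std::puts("unhandled exception filter ran");  // Never printed for a fast-fail.
  return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
  ::SetUnhandledExceptionFilter(MyFilter);
  // FAST_FAIL_FATAL_APP_EXIT is one of the documented fast-fail codes; the
  // heap uses its own code when it detects metadata corruption. The process
  // dies here with no dump and without running MyFilter.
  __fastfail(FAST_FAIL_FATAL_APP_EXIT);
}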
Mar 8 2018
Issue 813218 has been affecting WebRTC based apps pretty severely recently. Might be related to this.
Mar 10 2018
Mar 12 2018
Friendly ping to get an update on this issue as it is marked as a stable blocker. Thanks!
Mar 20 2018
[stability sheriff] Should this be a P0 and ReleaseBlock-Stable? If this isn't being actively worked on, we should remove those labels now.
Mar 21 2018
The remaining regression is still being worked on, but after careful consideration we have decided not to block releases on it.
Mar 21 2018
Mar 23 2018
Sep 8
No longer on the Chrome team; e-mail me @google.com if any attention is still required from me here, otherwise good luck!
Comment 1 by wfh@chromium.org, Jan 9 2018
Labels: -Pri-3 Stability OS-Windows Pri-2