[Stability]: Mac renderer crash rate spikes from 67.0.3396.30 to 67.0.3396.40 due to GPU_DEAD_ON_ARRIVAL.
Issue description
UMA: https://uma.googleplex.com/timeline_v2?sid=28d2e886899cb10a68fbb565fa7a7a06
Chrome Dash: https://chromedash.googleplex.com/dashboard?dashboard=desktop-release-beta
Based on a comparison between 66.0.3359.106 and 67.0.3396.40, below are a few magic signatures that have spiked by 1% or more: https://goto.google.com/xbjdh
crbug/840446: Offscreen::getCG
crbug/819685: blink::MarkingVisitor::Visit
crbug/820218: v8::internal::ConcurrentMarkingVisitor::ProcessStrongHeapObject
,
May 14 2018
I'm not observing a significant change in renderer crash rates on beta according to UMA. Chrome dash asserts that 3396.40 has double the CPM of 3396.30, but none of the crash signatures appear to have changed significantly. The only thing I can think of is more OOMs, potentially due to site isolation?
,
May 14 2018
Adding creis@ for site isolation experiment.
,
May 14 2018
pbommana - can you clarify what data you're using that shows that the crash rate is elevated [more than expected] for beta renderer channel?
,
May 14 2018
Comment 2: Site Isolation generally reduces the number of OOMs (and the renderer CPMs on Mac in general), likely because we have more, smaller processes. Here's a link to the renderer CPMs for Mac Beta with and without Site Isolation: https://uma.googleplex.com/p/chrome/variations/?sid=8dce10390becb6d6b8a57a453e93ef78
,
May 14 2018
@creis - thanks for the link. Attaching two graphs of per-renderer and total memory usage comparing .30 and .40 - no significant movement between them. It's not really clear to me how we could be doubling CPM uniformly across all crash categories. Maybe we've changed how we measure CPM, or else Chrome dash has an accounting error?
,
May 14 2018
@creis - btw, per-process memory usage is lower for OOPIF, but total memory usage is higher: https://uma.googleplex.com/p/chrome/variations/?sid=4b6242d9e68f96cccf380e10d09c070c
,
May 14 2018
Right-- total memory use is expected to go up with Site Isolation, but we haven't seen that affect renderer OOM reports in practice.
,
May 14 2018
Got it - we're seeing a spike of ~500 renderer crash exit codes, most of which are GPU_DEAD_ON_ARRIVAL.
,
May 14 2018
+ sunnyps, piman. Also adding to GPU triage queue.
,
May 14 2018
@#9: renderer exits with GPU_DEAD_ON_ARRIVAL? Sounds like a bucketing issue, the only thing that returns that is the GPU process: https://cs.chromium.org/search/?q=RESULT_CODE_GPU_DEAD_ON_ARRIVAL&sq=package:chromium
,
May 14 2018
This graph shows all UMA-reported renderer crashes. The remainder of the ~500 spike is caused by INVALID_CMDLINE_URL, but that appears to have a natural variation of a couple hundred across beta versions. The spike in GPU_DEAD_ON_ARRIVAL appears to be the only major change between versions.
,
May 16 2018
This UMA histogram is reporting ChildProcessTerminationInfo::exit_code: https://cs.chromium.org/chromium/src/chrome/browser/metrics/chrome_stability_metrics_provider.cc?g=0&l=101 The exit code appears to be just that: https://cs.chromium.org/chromium/src/base/process/kill_posix.cc?type=cs&g=0&l=45 The termination status has some more information: https://cs.chromium.org/chromium/src/base/process/kill_posix.cc?type=cs&g=0&l=56 So it's really not clear to me how this UMA histogram is supposed to use the CrashExitCodes enum. Specifically, we're seeing a spike of exit code 4 from renderers. Maybe this is divide by zero? As piman@ points out, RESULT_CODE_GPU_DEAD_ON_ARRIVAL makes no sense as that's only well defined for the GPU process. Over to histogram owner wfh@ to make sense of this.
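For reference, the difference between an exit code and a termination-by-signal status on POSIX can be sketched as below. This is a standalone Python illustration using the standard `os` wait-status helpers, not Chromium's code; the helper names `describe_wait_status` and `run_child` are made up for this sketch. It shows that the same number 4 can come from a normal exit or from a signal, and only the raw wait status tells them apart.

```python
import os
import signal
import sys


def describe_wait_status(status: int) -> tuple:
    """Classify a raw waitpid() status as a normal exit or a signal death."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


def run_child(code: str) -> tuple:
    """Spawn a Python child running `code` and classify how it ended (hypothetical helper)."""
    pid = os.posix_spawn(sys.executable, [sys.executable, "-c", code], os.environ)
    _, status = os.waitpid(pid, 0)
    return describe_wait_status(status)


# A child that returns 4 from a normal exit.
clean_exit = run_child("import sys; sys.exit(4)")

# A child that is terminated by SIGILL.
illegal_instruction = run_child(
    "import os, signal; os.kill(os.getpid(), signal.SIGILL)"
)

print(clean_exit)
print(illegal_instruction)
```

A histogram that records only the bare number would put both children in the same bucket.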
,
May 16 2018
sorry I know nothing about macOS stability.
,
May 16 2018
Assigning back to ellyjones@ (Mac TL).
,
May 16 2018
If it's due to a crash, I believe the status is the signal number, and 4 would be SIGILL. Can we check in crash/ ?
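A quick way to check the signal-number mapping, sketched with Python's standard `signal` module (the helper name `signal_name` is made up for this sketch; numbering is platform-specific, but 4 is SIGILL on both macOS and Linux):

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Numbers that are not defined signals on this platform land here.
        return f"unknown signal {number}"


print(signal_name(4))
```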
,
May 16 2018
M67 Stable promotion is coming soon. Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and request a merge into the release branch ASAP. If fix is ready to be merged by Monday 4:00 PM PT, we can take it in for next week last M67 beta release. Thank you.
,
May 17 2018
The only place we're seeing this regression is in the UMA metric. crash/ is not showing any movement.
,
May 17 2018
+rkaplow@, ptal comment #13 and #18. Thank you.
,
May 18 2018
I'm having trouble making sense of this analysis so far. The Mac renderer CPM did seem to go up; however, the signal is very noisy and it has been at this level recently: https://uma.googleplex.com/timeline_v2?sid=28d2e886899cb10a68fbb565fa7a7a06 Based on https://uma.googleplex.com/timeline_v2?sid=1f763764c6b6589d72ce565e5223ff50 the increase could be the GPU. However, if we look at it on a 1-day basis https://uma.googleplex.com/timeline_v2?sid=1a0f523cc03ac6a74d696f0bcb942fe0 it was just a one-day spike and isn't recurring. And it's only SO not sure this is worth investigating
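To illustrate why the aggregation window matters here, a minimal sketch with made-up daily counts (these numbers are hypothetical and are not taken from UMA or this bug): a single bad day looks like a sustained elevation in a 7-day rolling view, while the 1-day view shows it as an isolated spike.

```python
def rolling_mean(values, window):
    """Trailing rolling mean; one output per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily crash counts: flat baseline with one spike on day 3.
daily = [100] * 3 + [600] + [100] * 10

weekly = rolling_mean(daily, 7)

print(daily)
print([round(w, 1) for w in weekly])
```

In the 7-day series, every window that contains the spike day reads well above baseline, so the elevation appears to last for several points even though only one day was affected.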
,
May 18 2018
Thanks rkaplow. You're right, the 1-day view makes it clear that this is just a 1-day spike. Let's WontFix this for now and reopen if there's something more actionable.
Comment 1 by gov...@chromium.org
, May 14 2018
Owner: ellyjo...@chromium.org
Status: Assigned (was: Untriaged)