M58 Stability: Spike in browser crash rate for Windows.
Issue description

This bug tracks the spike in the browser crash rate on Windows.

58.0.3029.68 - BETA

=================== UMA Browser histogram =====================
https://uma.googleplex.com/timeline_v2?sid=589c19e47093b0f71e94ae3ed6314072
There was a spike in Windows browser crashes in M58, but there is no obvious regression in go/chromecrash.

Version comparison of M58 vs M57
================================
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D&compProp=product.Version&v1=58.0.3029.68&v2=57.0.2987.98#stablesignature:1000

Adding to the stability sheriffs' queue for investigation.
,
Apr 17 2017
It's hard to pinpoint any one signature as the cause of the roughly 22% increase in CPM (https://uma.googleplex.com/timeline_v2?sid=9a6089f3d1d2f1ca00b5a8bd16e9f02e). The increase seems to be across all Windows platforms. crbug/707735 caused some of the spike, but it was fixed last week and the fix is reflected in 58.0.3029.68. Comparing 57 and 58 (latest), nothing really sticks out as a clear culprit for the CPM increase. A few signatures that may be contributing to the spike (based on crash report comparison):
crbug/614753 - StartupBrowserCreator::ProcessCmdLineImpl
crbug/710420 - views::View::SchedulePaint
crbug/697827 - content::NavigationControllerImpl::RendererDidNavigateToExistingPage
,
Apr 17 2017
,
Apr 18 2017
A big part of this might be hangs? The reason "Simulated Exception" is 12% of crashes in 58.0.3029.33 but only 1.7% in 58.0.3029.19: https://crash.corp.google.com/dremel_query_ui?q=SELECT%20product.version%2C%20SUM(hit)%2FCOUNT(*)%0AFROM%20(SELECT%20product.version%2C%20crash.reason%20%3D%20%27Simulated%20Exception%27%20AS%20hit%0AFROM%20crash.prod.latest%0AWHERE%20product.name%20%3D%20%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%20%3D%20%27browser%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%0A)GROUP%20BY%201%0AORDER%20BY%201%20DESC https://crash.corp.google.com/dremel_query_ui?q=SELECT%20a.x%2C%20a.crash.reason%2C%20a.N%20%2F%20b.Tot%0AFROM%20(SELECT%20product.version%20AS%20x%2C%20crash.reason%2C%20COUNT(*)%20as%20N%0AFROM%20crash.prod.latest%0AWHERE%20product.name%3D%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20%2757%27%20%3C%20product.Version%20AND%20product.version%20%3C%20%276%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%0AGROUP%20BY%20x%2C%20crash.reason)%20AS%20a%0AJOIN%20(SELECT%20product.version%20AS%20y%2C%20COUNT(*)%20AS%20Tot%0AFROM%20crash.prod.latest%0AWHERE%20product.name%3D%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20%2757%27%20%3C%20product.Version%20AND%20product.version%20%3C%20%276%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%0AGROUP%20BY%20y)%20AS%20b%0AON%20a.x%20%3D%20b.y%0AORDER%20BY%20a.x%20DESC%2C%20a.crash.reason There's a jump in frames w/ purecall in .54 (2% of browser crashes) continuing in .68 (4.7%) which is probably the issue with third party software in Issue 710420 : 
https://crash.corp.google.com/dremel_query_ui?q=SELECT%20product.version%2C%20COUNT(*)%0AFROM%20crash.prod.latest%0AWHERE%20product.name%3D%27Chrome%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%0AOMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName%3D%27purecall%27)%20%3D%200%0AGROUP%20BY%201%0AORDER%20BY%201%20DESC That would leave about ~5%? unexplained though. It looks like 32-bit regressed more markedly than 64-bit between .19 and .33 so slicing by bitness might be useful: https://uma.googleplex.com/timeline_v2?sid=4c32160add5cff76073cce1b2dca2982
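The Dremel links in this comment carry the whole SQL statement URL-encoded in the `q` parameter, which makes them hard to read in the thread. As a minimal sketch (the helper name `decode_dremel_query` is an assumption for illustration, not part of any crash tooling), the purecall query above can be decoded with the Python standard library:

```python
from urllib.parse import parse_qs, urlsplit


def decode_dremel_query(url: str) -> str:
    """Return the decoded SQL text stored in the `q` parameter of a link."""
    query_string = urlsplit(url).query
    return parse_qs(query_string)["q"][0]


# The purecall link from the comment above, split only for line length.
PURECALL_URL = (
    "https://crash.corp.google.com/dremel_query_ui?q="
    "SELECT%20product.version%2C%20COUNT(*)%0A"
    "FROM%20crash.prod.latest%0A"
    "WHERE%20product.name%3D%27Chrome%27"
    "%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27"
    "%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%0A"
    "OMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName"
    "%3D%27purecall%27)%20%3D%200%0A"
    "GROUP%20BY%201%0A"
    "ORDER%20BY%201%20DESC"
)

if __name__ == "__main__":
    print(decode_dremel_query(PURECALL_URL))
```

Decoded, the query counts beta-channel browser crash reports per version, omitting any report whose crashed stack has no `purecall` frame.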
,
Apr 18 2017
,
Apr 18 2017
From 58.0.3029.19 to 58.0.3029.54, the spike is markedly higher on 32-bit than on 64-bit. Renderer crashes are going down on both 64-bit and 32-bit, but browser crashes are increasing.
,
Apr 19 2017
Bug 710420 is attributed to AV software. It may be that the vendor released an update at this time that exacerbated an existing issue. Unfortunately it looks like there is little to be done on our end. Bug 614753 (which I have looked at in a previous sheriffing shift) is attributed to third party software, and perhaps additional distribution of it is triggering an increase in frequency. A partial fix to Bug 697827 was merged into the M58 branch on 3/17. It looks like for some reason the fix on desktop will be different, so I will try to find an owner.
,
Apr 19 2017
Comparing crash reports across 58.0.3029.68 and 58.0.3029.19 shows a big spike in base::i18n::InitializeICU (Bug 445616) in .19 which resolved itself in .68. Third-party software interference is also suspected in this bug. There are also a lot of utility process crashes (Bug 704495), but I believe these are existing crashes whose stack traces were redistributed by a server-side change in Crash. I pinged Bug 697827; I'm not sure there is a lot to follow up on otherwise.
,
Apr 19 2017
I talked to amineer@, but he is a bit swamped. There is a spike in renderer crashes, but it seems to be stack trace renaming on old bugs. There is also a new bug: https://bugs.chromium.org/p/chromium/issues/detail?id=712969
,
Apr 27 2017
Working purely on the data in Stability.Counts UMA, sliced by version rather than channel, I see:
57.0.2987.88 -> 57...98 - ~13% increase.
57...98 -> 58.0.3029.19 - ~7% increase.
58...19 -> 58...33 - No significant change.
58...33 -> 58...41 - No significant change.
58...41 -> 58...68 - ~5% reduction.
That seems to fit with the InitializeICU comment describing a spike between ...19 and ...68, but leaves a 13% increase between 57...88 and ...98 to explain.
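To see what these step changes add up to, they can be compounded as multiplicative factors. This is only a rough sketch: the percentages are the approximate figures quoted above, "no significant change" is treated as a factor of exactly 1.0, and the final reduction is assumed to be roughly 5 percent.

```python
import math

# Approximate version-to-version changes from the comment above,
# expressed as multiplicative factors (labels use the comment's shorthand).
STEPS = [
    ("57.0.2987.88 -> 57...98", 1.13),   # ~13% increase
    ("57...98 -> 58.0.3029.19", 1.07),   # ~7% increase
    ("58...19 -> 58...33", 1.00),        # no significant change (assumed 1.0)
    ("58...33 -> 58...41", 1.00),        # no significant change (assumed 1.0)
    ("58...41 -> 58...68", 0.95),        # assumed to mean a ~5% reduction
]


def cumulative_factor(steps):
    """Compound the per-step factors into one overall factor."""
    return math.prod(factor for _, factor in steps)


if __name__ == "__main__":
    net = cumulative_factor(STEPS)
    print(f"net change 57.0.2987.88 -> 58...68: {100 * (net - 1):+.1f}%")
```

Under those assumptions the net change from 57.0.2987.88 to 58...68 is about +15%, most of which comes from the first step inside M57.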
,
Apr 27 2017
Re #10: Hmmm, the numbers fit better with a spike from M57...98->M58...19 once you re-introduce channel, which of course is important due to M57...98 having been promoted to Beta at that time. The signatures that I see in the top ten only starting from M58...19 on beta-channel, and still there in M58...81, are:
- StartupBrowserCreator::ProcessCmdLineImpl (see issue 614753, filed for M52; third-party software injecting stuff).
- mojo::edk::Core::WaitManyInternal (see issue 627960, filed for M54; GPU-related hangs being mis-attributed to Mojo).
Bruce, could you take a look at those two signatures to see if there's anything actionable?
,
Apr 28 2017
crbug.com/614753 - it's still malware. I've annotated the bug. This is the query I used: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27StartupBrowserCreator%3A%3AProcessCmdLineImpl%27%20AND%20product.Version%3D%2758.0.3029.81%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports
crbug.com/627960 - crash rate seems too low to be of interest. This is the query I used: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BThreadWatcher%20UI%20hang%5D%20mojo%3A%3Aedk%3A%3ACore%3A%3AWaitManyInternal%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=
Are those the signatures you were wondering about?
,
May 3 2017
Running the following dremel queries shows the magic signatures for hangs (crash.reason = 'Simulated Exception') and we see that there are a lot more of these on M58 beta than M57 beta: For M58: https://crash.corp.google.com/dremel_query_ui?q=SELECT%20product.version%2C%20custom_data.ChromeCrashProto.magic_signature_1.name%20as%20magsig%2C%20COUNT(*)%20as%20total%20FROM%20crash.prod.latest%20WHERE%20product.name%20%3D%20%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%20%3D%20%27browser%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%20AND%20crash.reason%20%3D%20%27Simulated%20Exception%27%20AND%20product.version%20LIKE%20%2758%25%27%20GROUP%20BY%20product.version%2C%20magsig%20ORDER%20BY%20product.version%20DESC%2C%20total%20DESC%3B For M57: https://crash.corp.google.com/dremel_query_ui?q=SELECT%20product.version%2C%20custom_data.ChromeCrashProto.magic_signature_1.name%20as%20magsig%2C%20COUNT(*)%20as%20total%20FROM%20crash.prod.latest%20WHERE%20product.name%20%3D%20%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%20%3D%20%27browser%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%20AND%20crash.reason%20%3D%20%27Simulated%20Exception%27%20AND%20product.version%20LIKE%20%2757%25%27%20GROUP%20BY%20product.version%2C%20magsig%20ORDER%20BY%20product.version%20DESC%2C%20total%20DESC%3B Now to figure out how to do a join to get the crash percentage for these...
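One way to get the percentages without writing the join in Dremel is to export both result sets and join them locally. The sketch below assumes the hang query's rows are available as `(version, magsig, total)` tuples and that a second query has produced the total browser crash reports per version. The function name and all of the sample numbers are hypothetical.

```python
def hang_signature_shares(hang_rows, totals_by_version):
    """Join hang counts to per-version totals and return percentage shares.

    hang_rows: iterable of (version, magic_signature, hang_report_count)
    totals_by_version: dict of version -> total browser crash reports
    """
    shares = {}
    for version, magsig, count in hang_rows:
        total = totals_by_version.get(version)
        if not total:
            # No denominator for this version, so no share can be computed.
            continue
        shares[(version, magsig)] = 100.0 * count / total
    return shares


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    rows = [
        ("58.0.3029.81", "sig_a", 30),
        ("58.0.3029.81", "sig_b", 10),
        ("57.0.2987.98", "sig_a", 5),
    ]
    totals = {"58.0.3029.81": 200, "57.0.2987.98": 100}
    for key, pct in sorted(hang_signature_shares(rows, totals).items()):
        print(key, f"{pct:.1f}%")
```

Versions missing from the totals are skipped rather than reported as zero, so a gap in the denominator query does not look like a drop in hangs.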
,
May 8 2017
For common crashes you can sum the CPM table like this: https://crash.corp.google.com/dremel_query_ui?q=SELECT%20cpm_info.version_cpms.version%2C%20SUM(cpm_info.version_cpms.cpm)%0AFROM%20FLATTEN((SELECT%20Signature%2C%20cpm_info.version_cpms.version%2C%20cpm_info.version_cpms.cpm%0AFROM%20crash.analysis.prod.latest%0AWHERE%20product%20%3D%20%27Chrome%27%20AND%20cpm_info.channel%20%3D%20%27beta%27%0AAND%20Signature%20IN%20(SELECT%20custom_data.ChromeCrashProto.magic_signature_1.name%20FROM%20crash.prod.latest%20WHERE%20product.name%20%3D%20%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%20%3D%20%27browser%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%20AND%20crash.reason%20%3D%20%27Simulated%20Exception%27%20AND%20product.version%20LIKE%20%2758%25%27)%20AND%20cpm_info.version_cpms.version%20LIKE%20%2758%25%27)%2C%20cpm_info.version_cpms)%0AGROUP%20BY%201%0AORDER%20BY%201%20DESC In this case it excludes 9,649 reports in 1,153 signatures (these ones <https://crash.corp.google.com/dremel_query_ui?q=SELECT%20custom_data.ChromeCrashProto.magic_signature_1.name%2C%20COUNT(*)%20FROM%20crash.prod.latest%20WHERE%20product.name%20%3D%20%27Chrome%27%20AND%20custom_data.ChromeCrashProto.ptype%20%3D%20%27browser%27%20AND%20custom_data.ChromeCrashProto.channel%20%3D%20%27beta%27%20AND%20crash.reason%20%3D%20%27Simulated%20Exception%27%20AND%20product.version%20LIKE%20%2758%25%27%0AAND%20custom_data.ChromeCrashProto.magic_signature_1.name%20NOT%20IN%20(SELECT%20Signature%0AFROM%20crash.analysis.prod.latest%0AWHERE%20cpm_info.channel%20%3D%20%27beta%27%20AND%20product%20%3D%20%27Chrome%27%0AAND%20cpm_info.version_cpms.version%20LIKE%20%2758%25%27)%0AGROUP%20BY%201%0AORDER%20BY%202%20DESC%2C%201>) which don't appear in the analysis table. That's 64% of reports. So you might be able to impute a CPM from that? I guess for 58.0.3029.81 it is around 1.2 CPM.
Note you can't compare 1.2 CPM from crash to UMA CPM because they're measuring different things: crash measures crash reports per million page loads; UMA measures unclean shutdowns per million page loads.
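The imputation suggested above can be sketched as a simple scale-up: if the analysis table only covers a fraction of the hang reports, divide the summed CPM by that fraction. This assumes the uncovered reports occur at the same rate per page load as the covered ones, which the thread does not confirm. The function names and the example inputs (other than the 64% figure) are hypothetical.

```python
def events_per_million(events: float, page_loads: float) -> float:
    """CPM as defined in the comment: events per million page loads."""
    return events / page_loads * 1_000_000


def impute_total_cpm(covered_cpm: float, excluded_fraction: float) -> float:
    """Scale a CPM summed over covered signatures up to all reports.

    excluded_fraction is the share of reports missing from the analysis
    table (0.64 in the comment above). Assumes the missing reports behave
    like the covered ones.
    """
    covered_fraction = 1.0 - excluded_fraction
    if covered_fraction <= 0:
        raise ValueError("no covered reports to scale from")
    return covered_cpm / covered_fraction


if __name__ == "__main__":
    # Hypothetical inputs: 3 events over 2 million page loads,
    # and a covered CPM of 0.9 with 64% of reports excluded.
    print(events_per_million(3, 2_000_000))
    print(impute_total_cpm(0.9, 0.64))
```

Even with the scale-up, the result stays a crash-report CPM and is not comparable to the UMA unclean-shutdown CPM, for the reason given in the note above.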
,
May 24 2017
ligimole@: Just checking in as stability sheriff. It looks like the crash rate on the original link (https://uma.googleplex.com/timeline_v2?sid=589c19e47093b0f71e94ae3ed6314072) has fallen back down again, closer to where it was (though not quite as low as it was in 59.0.3071.19). Is it worth putting more time into this issue at this point?
,
May 24 2017
Currently M58 is in stable, seeing ~300 CPM, which is closer to M57. Looks like the issue is resolved, closing now. https://uma.googleplex.com/timeline_v2?sid=40ccec39a4579e91695f35c6fa2cebd9
Comment 1 by ligim...@chromium.org, Apr 17 2017