Crash spike in ResourceDispatcher::OnSetDataBuffer
Issue description

https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20%20AND%20custom_data.ChromeCrashProto.ptype%3D%27renderer%27%20AND%20cpu.architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20content%3A%3AResourceDispatcher%3A%3AOnSetDataBuffer%27%20AND%20custom_data.ChromeCrashProto.channel%3D%27canary%27%20AND%20product.version%20CONTAINS%20%2752.0%27%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,processuptime,exploitabilityrating,crashanalysisresult,osversion,gpuvendorid,gpudeviceid,gpudriverversion,cpuarchitecture,cpuinfo,experiments:20,3rdparty,+oncrashedthread,exe%2Fdllmismatch,isasan

About 90% of the crashes occur on 32-bit Windows and 10% on 64-bit, so the crash appears to be agnostic to bitness.

There are actually 2 crashes with this signature. Most crashes are in this failed assertion:

  CHECK((shm_valid && shm_size > 0) || (!shm_valid && !shm_size));

Although some are in:

  CrashOnMapFailure();

This latter crash has been around forever and has different properties, so I'm going to ignore it in this bug.

The crash tends to happen immediately after launch (0ms uptime). Using windbg shows some interesting properties:

The shm_handle has been successfully attachment brokered. Based on the previous line, we know that it is a valid handle. Its contents vary, but look legitimate, e.g.:

  0:000> dt shm_handle
  Local var @ r14 Type base::SharedMemoryHandle*
     +0x000 handle_ : 0x00000000`00000378 Void
     +0x008 pid_ : 0xba4
     +0x00c ownership_passes_to_ipc_ : 0

The other function parameters look like garbage:

  request_id = 0n0
  shm_size = 0n0
  renderer_pid = 0x1000

renderer_pid is always 0x1000, even though the real pid is almost certainly 0xba4. The part of the assertion that is failing is (shm_valid && shm_size > 0): the handle is valid, but the size is 0. We find ourselves in a familiar situation.
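To make the failing condition concrete, here is a minimal standalone sketch of the invariant from the CHECK above. It is not the Chromium code: the function names SetDataBufferArgsConsistent and DemoInvariant are hypothetical, and the size 4096 is only an illustrative value.

```cpp
#include <cassert>

// Same boolean condition as the CHECK quoted in the report: a valid handle
// must arrive with a positive size, and an invalid handle with a zero size.
// (Hypothetical helper name, for illustration only.)
constexpr bool SetDataBufferArgsConsistent(bool shm_valid, int shm_size) {
  return (shm_valid && shm_size > 0) || (!shm_valid && !shm_size);
}

// Walks through the combinations discussed in the report.
inline void DemoInvariant() {
  // Observed in the crash dumps: valid handle, shm_size == 0. CHECK fails.
  assert(!SetDataBufferArgsConsistent(true, 0));
  // Expected normal case: valid handle, positive size (4096 is illustrative).
  assert(SetDataBufferArgsConsistent(true, 4096));
  // Also accepted: invalid handle with zero size.
  assert(SetDataBufferArgsConsistent(false, 0));
}
```

With the values seen in windbg (a valid brokered handle and shm_size = 0), neither side of the || holds, which matches the reported assertion failure.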
I'm going to add debugging to Canary to find out what's going on. The possibilities:

1. The sender of the IPC message is passing through garbage.
2. Something is modifying the IPC message in flight.
3. There's a bug in deserialization.
4. There's a memory corruption error on the receiver side.

I went through this process not too long ago: https://bugs.chromium.org/p/chromium/issues/detail?id=493414. My suspicion is in-flight modification.
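One generic way to tell these possibilities apart is to tag the serialized bytes on the send side and verify the tag on the receive side before deserializing. The sketch below only illustrates that idea. It is an assumption, not the debugging that was actually added to Canary, and TaggedMessage, TagOnSend and IntactOnReceive are hypothetical names rather than Chromium IPC APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// FNV-1a, 32-bit. Not cryptographic: it catches naive modification of the
// bytes, not an attacker who also recomputes the checksum.
inline uint32_t Fnv1a(const std::vector<uint8_t>& bytes) {
  uint32_t hash = 2166136261u;  // FNV offset basis
  for (uint8_t b : bytes) {
    hash ^= b;
    hash *= 16777619u;  // FNV prime
  }
  return hash;
}

// Hypothetical wrapper: serialized payload plus the sender's checksum.
struct TaggedMessage {
  std::vector<uint8_t> payload;
  uint32_t checksum;
};

// Sender side: compute the checksum over the bytes as serialized.
inline TaggedMessage TagOnSend(std::vector<uint8_t> payload) {
  const uint32_t sum = Fnv1a(payload);
  return TaggedMessage{std::move(payload), sum};
}

// Receiver side: recompute before deserializing and compare.
inline bool IntactOnReceive(const TaggedMessage& message) {
  return Fnv1a(message.payload) == message.checksum;
}
```

Under this sketch, a mismatch on receipt would point toward possibility 2, while a match combined with garbage parameters would point toward possibilities 1, 3 or 4.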
May 10 2016
This problem is still present on the latest beta (51.0.2704.36), which has the attachment brokering fix. I have no idea why there would be a difference between beta and dev, if indeed there is a difference. I looked at all of my changes to ipc/ and content/ that would have landed between beta and dev, and didn't find anything relevant.
May 11 2016
I was looking at the UMA histogram for attachment broker errors on Windows Canary. [I don't see how this would cause the bug, but there is a significant shift in the same time range.]

On 5/4, version 2723, we see around 1 error per 10^8 successes. This trend continues into all later versions.
On 5/3, version 2722, we see around 1 error per 10^7 successes. This trend continues into all earlier versions.

The relevant CL range is 390842 to 390887.

Looking at Windows Canary for ResourceDispatcher::OnSetDataBuffer crashes, we see that there's a sharp dropoff from 2722 to 2723. From 2723 onwards, there are still a couple of crashes, but they don't have the super weird 0x1000 signature. Looking at crashes in 2722, most have the weird 0x1000 signature.
May 11 2016
Ah ha! Looking at the buildspec for 2723.0, we see that it includes revisions up through 391139.
https://uberchromegw.corp.google.com/viewvc/chrome-internal/trunk/tools/buildspec/releases/52.0.2723.0/DEPS?view=markup

2722.0 includes revisions through 390853.

rockot@ turned on Mojo by default in 391029. This seems really promising. Chrome used to pass handles to child processes as a command line flag, which also makes them trivial to sniff.
https://codereview.chromium.org/1941003002

If this were the case, then I would expect to see no crashes in 2716 with this signature when the Mojo Experiment is turned on. Indeed, when we look at crashes with this signature, we see that 50.46% have MojoChannel::Control (non-Mojo) and 48.63% have MojoChannel::Default (non-Mojo). Of the 6 crashes with Mojo turned on, none has the 0x1000 signature.
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20content%3A%3AResourceDispatcher%3A%3AOnSetDataBuffer%27%20AND%20product.name%3D%27Chrome%27%20AND%20product.version%3D%2752.0.2716.0%27%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse&ignore_case=false&enable_rewrite=false&omit_field_name=custom_data.ChromeCrashProto.experiments.ids&omit_field_value=fd02e767-ca7d8d80&omit_field_opt=%3D#samplereports:5,experiments:1000
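The buildspec reasoning above amounts to a simple range check: a change first ships in the newer build if its revision falls after the last revision of the older build and at or before the last revision of the newer one. A minimal sketch using the revision numbers quoted in this comment (LandedBetweenBuilds and DemoBuildRange are hypothetical names):

```cpp
#include <cassert>

// True if `revision` is newer than the last revision in the older build and
// no newer than the last revision in the newer build.
constexpr bool LandedBetweenBuilds(int revision, int older_last,
                                   int newer_last) {
  return revision > older_last && revision <= newer_last;
}

inline void DemoBuildRange() {
  const int kLastIn2722 = 390853;   // last revision in 52.0.2722.0
  const int kLastIn2723 = 391139;   // last revision in 52.0.2723.0
  const int kMojoDefault = 391029;  // Mojo turned on by default
  // The Mojo change is absent from 2722.0 and present in 2723.0.
  assert(LandedBetweenBuilds(kMojoDefault, kLastIn2722, kLastIn2723));
}
```

So the Mojo change first appears in 2723.0, the same build in which the 0x1000 crashes drop off.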
May 11 2016
This is fixed, as of 391293 (Turn on Mojo Channel reland) https://codereview.chromium.org/1950513002 There are still a couple of odds and ends crashes, but those have way lower frequency, and have been around forever. The Mojo A/B test proves that Mojo fixes the 0x1000 signature crash. Yay. Based on previous research into failures with this particular IPC message, it seems likely that UwS was somehow managing to modify these messages in flight. This would also neatly explain why Mojo fixes the problem.
May 16 2016
We are seeing these crashes on M51 as well; please find the numbers below. If the fix has had good coverage on canary, can we get it into M51 as well, please?
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.ptype%3D%27renderer%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20content%3A%3AResourceDispatcher%3A%3AOnSetDataBuffer%27%20AND%20product.version%20CONTAINS%20%2751.0.%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
May 16 2016
Renderer process MojoChannel is presumably going to launch in M-52, not M-51. I don't think there's anything we can do about this particular crash for M-51.
May 16 2016
Unfortunately the "fix" cannot be turned on in M51. The crashes disappeared as a result of replacing the Chrome IPC subsystem, which is far too significant a change to merge back IMHO. I think this does indicate that the crashes are likely being caused by malware, for whatever that's worth.
May 16 2016
Thank you Erik and rockot@ for the info. I saw that the number of crashes on M52 was very low, and hence thought we could merge the fix into M51. Based on Comment#6, removing the M51 label.
Comment 1 by erikc...@chromium.org, May 10 2016 (attachment, 25.9 KB)