
Issue 609677

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocking:
issue 609262




Crash spike in ResourceDispatcher::OnSetDataBuffer

Project Member Reported by erikc...@chromium.org, May 6 2016

Issue description

https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20%20AND%20custom_data.ChromeCrashProto.ptype%3D%27renderer%27%20AND%20cpu.architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20content%3A%3AResourceDispatcher%3A%3AOnSetDataBuffer%27%20AND%20custom_data.ChromeCrashProto.channel%3D%27canary%27%20AND%20product.version%20CONTAINS%20%2752.0%27%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,processuptime,exploitabilityrating,crashanalysisresult,osversion,gpuvendorid,gpudeviceid,gpudriverversion,cpuarchitecture,cpuinfo,experiments:20,3rdparty,+oncrashedthread,exe%2Fdllmismatch,isasan

About 90% of the crashes occur on 32-bit Windows and 10% on 64-bit Windows, so the crash appears to be agnostic to architecture.

There are actually 2 crashes with this signature. Most crashes are in this failed assertion:
 CHECK((shm_valid && shm_size > 0) || (!shm_valid && !shm_size));

Although some are in:
 CrashOnMapFailure();

This latter crash has been around forever, and has different properties, so I'm going to ignore it in this bug.

The crash tends to happen immediately after launch (0ms uptime). Using windbg shows some interesting properties:

The shm_handle has been successfully attachment brokered. Based on the previous line, we know that it is a valid handle. Its contents vary, but look legitimate:
e.g.:
0:000> dt shm_handle
Local var @ r14 Type base::SharedMemoryHandle*
   +0x000 handle_          : 0x00000000`00000378 Void
   +0x008 pid_             : 0xba4
   +0x00c ownership_passes_to_ipc_ : 0

The other function parameters look like garbage:
                                  request_id = 0n0
                                    shm_size = 0n0
                                renderer_pid = 0x1000

renderer_pid is always 0x1000, even though the real pid is almost certainly 0xba4. The assertion that is failing is (shm_valid && shm_size > 0).
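
To make the failing case concrete, here is a minimal sketch of the invariant behind that CHECK. BufferArgsConsistent is a hypothetical helper name used only for illustration, not a function in the Chromium tree; the condition itself is copied from the CHECK quoted above.

```cpp
// Hypothetical helper (not Chromium code): the same condition as the CHECK in
// ResourceDispatcher::OnSetDataBuffer, pulled out so it can be evaluated alone.
constexpr bool BufferArgsConsistent(bool shm_valid, int shm_size) {
  return (shm_valid && shm_size > 0) || (!shm_valid && !shm_size);
}

// A valid handle with a positive size is consistent.
static_assert(BufferArgsConsistent(true, 4096), "valid handle, non-zero size");
// No handle and no size is also consistent.
static_assert(BufferArgsConsistent(false, 0), "invalid handle, zero size");
// The state seen in the dumps: a valid handle but shm_size = 0. Neither side
// of the || holds, so the CHECK fires.
static_assert(!BufferArgsConsistent(true, 0), "valid handle, zero size");
// The opposite mismatch would fail as well.
static_assert(!BufferArgsConsistent(false, 4096), "invalid handle, non-zero size");
```

With a valid handle, the right-hand side of the || is always false, so the only way to pass is shm_size > 0. That is why a zeroed shm_size alongside a legitimate handle is enough to crash.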

We find ourselves in a familiar situation. I'm going to add debugging to Canary to find out what's going on. The possibilities:

1. The sender of the IPC message is passing through garbage.
2. Something is modifying the IPC message in flight.
3. There's a bug in deserialization.
4. There's a memory corruption error on the receiver side.

I went through this process not too long ago: https://bugs.chromium.org/p/chromium/issues/detail?id=493414. My suspicion is in-flight modification.




 
This problem may be fixed. The attachment broker race condition fix landed after 2716.

Somehow, the latest dev (which is also the first dev after 2716) has seen 0 instances of this crash with the really strange signature (renderer_pid = 0x1000, etc.). I'm going to continue to check in on the crash count. I also have no idea why fixing the attachment broker race condition would be related to this crash.
[Attachment: Screen Shot 2016-05-10 at 9.40.18 AM.png, 25.9 KB]
This problem is still present on the latest beta (51.0.2704.36), which has the attachment brokering fix. I have no idea why there would be a difference between beta and dev, if indeed there is a difference. I looked at all of my changes to ipc/ and content/ that would have occurred between beta and dev, and didn't find anything relevant.
I was looking at the UMA histogram for attachment broker errors on Windows Canary. [I don't see how this would cause the bug, but there is a significant shift in the same time range.]

On 5/4, version 2723, we see around 1 error per 10^8 successes. This trend continues into all future versions.
On 5/3, version 2722, we see around 1 error per 10^7 successes. This trend continues into all past versions.

The relevant CL range is 390842 to 390887. 

Looking at Windows Canary for ResourceDispatcher::OnSetDataBuffer crashes, we see that there's a sharp dropoff from 2722 to 2723. From 2723 onwards, there are still a couple of crashes, but they don't have the super weird 0x1000 signature. In 2722, most crashes have the weird 0x1000 signature.
Ah ha!
Looking at the buildspec for 2723.0, we see that it includes revisions up through 391139. 
https://uberchromegw.corp.google.com/viewvc/chrome-internal/trunk/tools/buildspec/releases/52.0.2723.0/DEPS?view=markup

2722.0 includes revisions through 390853.

rockot@ turned on Mojo by default in 391029. This seems really promising. Chrome used to pass handles to child processes as a command line flag, which also makes them trivial to sniff.
https://codereview.chromium.org/1941003002
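
The revision reasoning above can be sketched as a simple range check. This is only an illustration: FirstShippedIn and the constant names are made up for this comment, and the revision numbers are the ones quoted from the buildspecs and the Mojo CL above.

```cpp
// Hypothetical helper: true when `revision` is included in the newer build
// but was not yet included in the older one.
constexpr bool FirstShippedIn(int older_build_last_revision,
                              int newer_build_last_revision,
                              int revision) {
  return revision > older_build_last_revision &&
         revision <= newer_build_last_revision;
}

constexpr int k2722LastRevision = 390853;      // 2722.0 includes revisions through this
constexpr int k2723LastRevision = 391139;      // 2723.0 includes revisions up through this
constexpr int kMojoByDefaultRevision = 391029; // Mojo turned on by default

// The Mojo change falls inside the window, so 2723 is the first canary with it.
static_assert(FirstShippedIn(k2722LastRevision, k2723LastRevision,
                             kMojoByDefaultRevision),
              "Mojo-by-default first shipped in 2723");
// The last revision of 2722 itself is, by definition, not new in 2723.
static_assert(!FirstShippedIn(k2722LastRevision, k2723LastRevision,
                              k2722LastRevision),
              "revisions already in 2722 are not new in 2723");
```

This only shows that the Mojo change is a candidate for the dropoff between 2722 and 2723. It does not rule out other changes in the same window.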

If this were the case, then I would expect to see no crashes in 2716 with this signature when the Mojo Experiment is turned on. Indeed, when we look at crashes with this signature, we see that 50.46% have MojoChannel::Control (non-Mojo) and 48.63% have MojoChannel::Default (non-Mojo). Of the 6 crashes with Mojo turned on, none has the 0x1000 signature.
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20content%3A%3AResourceDispatcher%3A%3AOnSetDataBuffer%27%20AND%20product.name%3D%27Chrome%27%20AND%20product.version%3D%2752.0.2716.0%27%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse&ignore_case=false&enable_rewrite=false&omit_field_name=custom_data.ChromeCrashProto.experiments.ids&omit_field_value=fd02e767-ca7d8d80&omit_field_opt=%3D#samplereports:5,experiments:1000


Cc: roc...@chromium.org jsc...@chromium.org jam@chromium.org
Labels: M-52
Status: Fixed (was: Assigned)
Summary: Crash spike in ResourceDispatcher::OnSetDataBuffer (was: Crash in ResourceDispatcher::OnSetDataBuffer)
This is fixed, as of 391293 (Turn on Mojo Channel reland)
https://codereview.chromium.org/1950513002

There are still a couple of odds and ends crashes, but those have way lower frequency, and have been around forever. 

The Mojo A/B test proves that Mojo fixes the 0x1000 signature crash. Yay. Based on previous research into failures with this particular IPC message, it seems likely that UwS (unwanted software) was somehow managing to modify these messages in flight. This would also neatly explain why Mojo fixes the problem.


Renderer process MojoChannel is presumably going to launch in M-52, not M-51. I don't think there's anything we can do about this particular crash for M-51.

Comment 8 by roc...@chromium.org, May 16 2016

Unfortunately the "fix" cannot be turned on in M51. The crashes disappeared as a result of replacing the Chrome IPC subsystem, which is far too significant a change to merge back IMHO. I think this does indicate that the crashes are likely being caused by malware, for whatever that's worth.
Labels: -M-51
Thank you Erik and rockot@ for the info. I saw that the number of crashes on M52 was very low, and hence thought we could merge the fix into M51.

Based on Comment#6, removing the M-51 label.
