New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 756115 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

mojo::ReportBadMessage does not allow distinguishing different reasons

Project Member Reported by nasko@chromium.org, Aug 16 2017

Issue description

The existing content::ReceivedBadMessage (and the chrome/ equivalent) allow specifying the reason for the bad message. It is very useful to be able to distinguish which code called it and it is very convenient to filter them through the crash query interface.

We should add support to mojo::ReportBadMessage for similarly reporting a reason code.
 

Comment 1 by roc...@chromium.org, Aug 16 2017

The "bad message reason" thing we have makes me very sad, as it requires
defining enum values at a lower layer which pertain to implementation
details at arbitrarily higher layers. I would like to avoid this if
possible. i.e. I don't want an enum in mojo/public somewhere that has
values like kServiceWorkerBadUrl.

Note that we do capture a detailed crash key string identifying the reason,
but I assume the issue you want to address is specifically the ability to
filter crash reports based on some "reason" key, and I guess we can't query
against partial crash key values?

I suppose we could augment the ReportBadMessage API take an arbitrary
integer value (as well as a the free-form string it takes now) and keep the
awkward system we have in place today where content defines a bunch of
error codes, but that still seems a bit unfortunate to me.

Comment 2 by creis@chromium.org, Aug 16 2017

Cc: nick@chromium.org
We're not partial to the enum, and have considered using strings as well.  The key thing we need is a way to get different magic signatures for different callers of ReportBadMessage.  A string could be fine for this, assuming they're "unique enough" in practice.

Comment 3 by creis@chromium.org, Aug 16 2017

Components: Internals>Mojo
Interesting-- it looks like we already have multiple "Renderer kill 123" signatures already, broken down by a few different stack frames.  We might be in better shape than I thought.  For example:

[Renderer kill 123] mojo::edk::NodeChannel::OnChannelError
[Renderer kill 123] cc::mojom::CompositorFrameSinkRequestValidator::Accept
[Renderer kill 123] cc::mojom::CompositorFrameSinkStubDispatch::Accept
[Renderer kill 123] base::internal::Invoker<base::internal::BindState<void (*)(int,base::FilePath const &),int>,void >::Run
[Renderer kill 123] mojo::MessageHeaderValidator::Accept

Are these sufficient to tease apart the separate types of kills, or will each one have a mix of different "mojo-message-error-*" reasons?

If they have a mix of reasons, we could modify the magic signature special case logic to show some or all of the reason instead of [Renderer kill 123].  (We could also make the reason strings shorter if they're unwieldy for magic signatures, if that's needed.)

Comment 4 by roc...@chromium.org, Aug 16 2017

They will have a mix of reasons. 123 is the generic code used for errors
reported through mojo::ReportBadMessage, so yes, there will be a mix of
mojo-message-error-* key values, all associated with 123 kills.

Comment 5 by roc...@chromium.org, Aug 16 2017

What about just hashing the crash key down to a ~64 bit hex value and
stuffing it in the sig?

Comment 6 by creis@chromium.org, Aug 16 2017

A hash is better than nothing, though not very human friendly.

With ReceivedBadMessage, we get two things from the magic signature:
1) A meaningful stack frame name (the one that tells us the context for why the IPC valid validation).
2) A number that's easy to look up in bad_message.h to disambiguate which kill it was in that function, since it's otherwise impossible to tell if there are multiple kills in the same function (thanks to compiler optimizations).

It sounds like we may be losing #1 (?), but the hash would give us at least part of #2.  The trouble is, there's no way to look up the hash in the source code, and the stack frames in these crash reports are not helpful.  For example, I'm not able to find the code that decided to kill the renderer in crash 93f48933282865df, since it appears to be generated code?

If we had a bunch of magic signatures with these hashes, how do we find the corresponding source code to know why we're crashing?

Comment 7 by roc...@chromium.org, Aug 16 2017

I don't think we're losing either 1 or 2. The only point of a hash in the sig would be to disambiguate magic sigs for easy querying.

In terms of actual concrete stack data, nothing really changes. The magic signature signal becomes a hash instead of an enum value.

We still have the crash key itself (i.e. the source data for the hash) with a free-form string explaining the precise crash reason. For explicit calls to ReportBadMessage this will be whatever the caller passes in.

In validation failures from generated code, the string includes as much detail as we have available about the name of the interface, the method called, and nature of the validation failure (e.g. unexpected null value).

What are we missing?

Comment 8 by creis@chromium.org, Aug 16 2017

Take the crash I mentioned: 93f48933282865df.  The crashing line in the report isn't displayed:
"Couldn't load source file: c:\b\c\b\win_pgo\src\out\release\gen\cc\ipc\compositor_frame_sink.mojom.cc"

On top of that, the message in the dump is:
mojo-message-error-1Received bad user message: Validation failed for CompositorFram
mojo-message-error-2eSink RequestValidator [VALIDATION_ERROR_UNEXPECTED_STRUCT_HEAD
mojo-message-error-3ER]

"Received bad user message: Validation failed for CompositorFrameSink RequestValidator [VALIDATION_ERROR_UNEXPECTED_STRUCT_HEADER]"

That string doesn't show up in the code anywhere.  I can find VALIDATION_ERROR_UNEXPECTED_STRUCT_HEADER in a bunch of places that don't help me understand it, and CompositorFrameSink seems to lead me to generated code and this mojom file, which doesn't shed any light on the kill:
https://cs.chromium.org/chromium/src/services/viz/public/interfaces/compositing/compositor_frame_sink.mojom?type=cs&q=compositor_frame_sink.mojo&sq=package:chromium&l=1

So far, I still don't know which code killed the renderer process, and thus how to triage or diagnose this crash.  (Even if I know it's in CompositorFrameSink, which appears to be related to compositing, I don't see anything about which message or part of the code had the validation that failed.)  How should I be interpreting this?

Comment 9 by roc...@chromium.org, Aug 16 2017

The string is composed at runtime. The stack trace clearly (modulo tribal knowledge, documentation, etc) indicates that a cc::mojom::CompositorFrameSink::SubmitCompositorFrame message is malformed, and specifically that the |metadata| field of the frame has an invalid |selection| field.

Of course the actual reason for the corruption is not clear, but content::ReceivedBadMessage certainly doesn't address that either. All we'd get with legacy IPC + ReceivedBadMessage in fact is the message ID, and the stack would just be in some generic IPC code path. So it seems we actually have slightly more information to work with than before.

There may be room to add even more information somehow, and perhaps we can integrate generated code into whatever tooling does this xrefing, but I still don't understand what information we're actually losing compared to the old system.

Comment 10 by creis@chromium.org, Aug 25 2017

Sorry for the delayed reply; just getting back from traveling.

It's possible all the information is in there, but there's definitely a lot of tribal knowledge needed there, which I am definitely lacking.  I can't seem to locate documentation that would help, either.

Part of the problem is that the words that matter here (e.g., metadata, selection) blend in with the Mojo terms that are around it (e.g., data, validate, etc), at least to those who don't know Mojo internals.

If enough data is there in the crash dump and it's possible to use it for triaging different types of Mojo renderer kills, then can someone put together a page for how to read a stack trace like this and connect it up with what the bug is?  I'm sure that will be hugely useful to stability sheriffs and general Chrome developers.  Or does such a page already exist somewhere?

(Feel free to adjust the bug summary accordingly.)

Comment 11 by creis@chromium.org, Aug 29 2017

Cc: ivanpe@chromium.org
Components: Internals>CrashReporting
To get a different type of example, I forced a crash in Nasko's new localStorage renderer kill logic (by attaching VisualStudio and forcing the security check to fail).  The crash report is bdc614b7179b273f.

This one is pretty nice-- the stack has useful data, and the magic signature is almost what we'd want: 
[Renderer kill 123] content::mojom::StoragePartitionServiceStubDispatch::Accept

A few downsides:

1) The magic signature (and blamed stack frame) should really be one stack frame "higher" (in the UI), since the kill itself is in content::StoragePartitionImpl::OpenLocalStorage.  ivanpe@, do you know why we're blaming the frame below that?  If we can get the frame with the mojo::ReportBadMessage line highlighted, it will be much more useful.

2) The signature still won't be able to distinguish between multiple kills in the same function, but we can probably live with that since the reported message is stored as a crash key.  It's possible (if a bit more awkward) to run queries limiting it to a given kill based on the message for the cases that's needed.

If we can get the stack frame fixed, and if there's some documentation for how to read crashes like the one in comments 8-10, then I think we'll be set.
Regarding one of the questions asked in comment 11:

(1) There is some custom code (in Chrome post-processor) that deals with the Magic Signature generation when we see content::RenderProcessHostImpl::OnMojoError on the stack.

Basically, we cluster such crash reports around the frame that matches the following regex:

/::Accept\(/

For more details take a look at this discussion: 
https://groups.google.com/a/google.com/forum/#!searchin/crash-team/Capturing$20better$20signatures$20for$20common$20stacks%7Csort:relevance/crash-team/5v69lIMS29A/OLiiovKoAgAJ

Here is the code that generates the magic signature: http://google3/googleclient/chrome/stability/crcrash/magic_signature.cc?l=617&rcl=166717095
Status: Assigned (was: Untriaged)
Owner: rockot@google.com

Sign in to add a comment