mojo::ReportBadMessage does not allow distinguishing different reasons |
||||||
Issue descriptionThe existing content::ReceivedBadMessage (and the chrome/ equivalent) allow specifying the reason for the bad message. It is very useful to be able to distinguish which code called it and it is very convenient to filter them through the crash query interface. We should add support to mojo::ReportBadMessage for similarly reporting a reason code.
,
Aug 16 2017
We're not partial to the enum, and have considered using strings as well. The key thing we need is a way to get different magic signatures for different callers of ReportBadMessage. A string could be fine for this, assuming they're "unique enough" in practice.
,
Aug 16 2017
Interesting-- it looks like we already have multiple "Renderer kill 123" signatures already, broken down by a few different stack frames. We might be in better shape than I thought. For example: [Renderer kill 123] mojo::edk::NodeChannel::OnChannelError [Renderer kill 123] cc::mojom::CompositorFrameSinkRequestValidator::Accept [Renderer kill 123] cc::mojom::CompositorFrameSinkStubDispatch::Accept [Renderer kill 123] base::internal::Invoker<base::internal::BindState<void (*)(int,base::FilePath const &),int>,void >::Run [Renderer kill 123] mojo::MessageHeaderValidator::Accept Are these sufficient to tease apart the separate types of kills, or will each one have a mix of different "mojo-message-error-*" reasons? If they have a mix of reasons, we could modify the magic signature special case logic to show some or all of the reason instead of [Renderer kill 123]. (We could also make the reason strings shorter if they're unwieldy for magic signatures, if that's needed.)
,
Aug 16 2017
They will have a mix of reasons. 123 is the generic code used for errors reported through mojo::ReportBadMessage, so yes, there will be a mix of mojo-message-error-* key values, all associated with 123 kills.
,
Aug 16 2017
What about just hashing the crash key down to a ~64 bit hex value and stuffing it in the sig?
,
Aug 16 2017
A hash is better than nothing, though not very human friendly. With ReceivedBadMessage, we get two things from the magic signature: 1) A meaningful stack frame name (the one that tells us the context for why the IPC valid validation). 2) A number that's easy to look up in bad_message.h to disambiguate which kill it was in that function, since it's otherwise impossible to tell if there are multiple kills in the same function (thanks to compiler optimizations). It sounds like we may be losing #1 (?), but the hash would give us at least part of #2. The trouble is, there's no way to look up the hash in the source code, and the stack frames in these crash reports are not helpful. For example, I'm not able to find the code that decided to kill the renderer in crash 93f48933282865df, since it appears to be generated code? If we had a bunch of magic signatures with these hashes, how do we find the corresponding source code to know why we're crashing?
,
Aug 16 2017
I don't think we're losing either 1 or 2. The only point of a hash in the sig would be to disambiguate magic sigs for easy querying. In terms of actual concrete stack data, nothing really changes. The magic signature signal becomes a hash instead of an enum value. We still have the crash key itself (i.e. the source data for the hash) with a free-form string explaining the precise crash reason. For explicit calls to ReportBadMessage this will be whatever the caller passes in. In validation failures from generated code, the string includes as much detail as we have available about the name of the interface, the method called, and nature of the validation failure (e.g. unexpected null value). What are we missing?
,
Aug 16 2017
Take the crash I mentioned: 93f48933282865df. The crashing line in the report isn't displayed: "Couldn't load source file: c:\b\c\b\win_pgo\src\out\release\gen\cc\ipc\compositor_frame_sink.mojom.cc" On top of that, the message in the dump is: mojo-message-error-1Received bad user message: Validation failed for CompositorFram mojo-message-error-2eSink RequestValidator [VALIDATION_ERROR_UNEXPECTED_STRUCT_HEAD mojo-message-error-3ER] "Received bad user message: Validation failed for CompositorFrameSink RequestValidator [VALIDATION_ERROR_UNEXPECTED_STRUCT_HEADER]" That string doesn't show up in the code anywhere. I can find VALIDATION_ERROR_UNEXPECTED_STRUCT_HEADER in a bunch of places that don't help me understand it, and CompositorFrameSink seems to lead me to generated code and this mojom file, which doesn't shed any light on the kill: https://cs.chromium.org/chromium/src/services/viz/public/interfaces/compositing/compositor_frame_sink.mojom?type=cs&q=compositor_frame_sink.mojo&sq=package:chromium&l=1 So far, I still don't know which code killed the renderer process, and thus how to triage or diagnose this crash. (Even if I know it's in CompositorFrameSink, which appears to be related to compositing, I don't see anything about which message or part of the code had the validation that failed.) How should I be interpreting this?
,
Aug 16 2017
The string is composed at runtime. The stack trace clearly (modulo tribal knowledge, documentation, etc) indicates that a cc::mojom::CompositorFrameSink::SubmitCompositorFrame message is malformed, and specifically that the |metadata| field of the frame has an invalid |selection| field. Of course the actual reason for the corruption is not clear, but content::ReceivedBadMessage certainly doesn't address that either. All we'd get with legacy IPC + ReceivedBadMessage in fact is the message ID, and the stack would just be in some generic IPC code path. So it seems we actually have slightly more information to work with than before. There may be room to add even more information somehow, and perhaps we can integrate generated code into whatever tooling does this xrefing, but I still don't understand what information we're actually losing compared to the old system.
,
Aug 25 2017
Sorry for the delayed reply; just getting back from traveling. It's possible all the information is in there, but there's definitely a lot of tribal knowledge needed there, which I am definitely lacking. I can't seem to locate documentation that would help, either. Part of the problem is that the words that matter here (e.g., metadata, selection) blend in with the Mojo terms that are around it (e.g., data, validate, etc), at least to those who don't know Mojo internals. If enough data is there in the crash dump and it's possible to use it for triaging different types of Mojo renderer kills, then can someone put together a page for how to read a stack trace like this and connect it up with what the bug is? I'm sure that will be hugely useful to stability sheriffs and general Chrome developers. Or does such a page already exist somewhere? (Feel free to adjust the bug summary accordingly.)
,
Aug 29 2017
To get a different type of example, I forced a crash in Nasko's new localStorage renderer kill logic (by attaching VisualStudio and forcing the security check to fail). The crash report is bdc614b7179b273f. This one is pretty nice-- the stack has useful data, and the magic signature is almost what we'd want: [Renderer kill 123] content::mojom::StoragePartitionServiceStubDispatch::Accept A few downsides: 1) The magic signature (and blamed stack frame) should really be one stack frame "higher" (in the UI), since the kill itself is in content::StoragePartitionImpl::OpenLocalStorage. ivanpe@, do you know why we're blaming the frame below that? If we can get the frame with the mojo::ReportBadMessage line highlighted, it will be much more useful. 2) The signature still won't be able to distinguish between multiple kills in the same function, but we can probably live with that since the reported message is stored as a crash key. It's possible (if a bit more awkward) to run queries limiting it to a given kill based on the message for the cases that's needed. If we can get the stack frame fixed, and if there's some documentation for how to read crashes like the one in comments 8-10, then I think we'll be set.
,
Aug 29 2017
Regarding one of the questions asked in comment 11: (1) There is some custom code (in Chrome post-processor) that deals with the Magic Signature generation when we see content::RenderProcessHostImpl::OnMojoError on the stack. Basically, we cluster such crash reports around the frame that matches the following regex: /::Accept\(/ For more details take a look at this discussion: https://groups.google.com/a/google.com/forum/#!searchin/crash-team/Capturing$20better$20signatures$20for$20common$20stacks%7Csort:relevance/crash-team/5v69lIMS29A/OLiiovKoAgAJ Here is the code that generates the magic signature: http://google3/googleclient/chrome/stability/crcrash/magic_signature.cc?l=617&rcl=166717095
,
Aug 1
,
Oct 17
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by roc...@chromium.org
, Aug 16 2017