ARM64 abort() crashes not being decoded correctly |
||||
Issue descriptionMost/all calls to abort() on arm64 are not decoding correctly in breakpad - it gets lost in the transition between libc and libwebviewchromium and ends up stack-scanning to the wrong place. Primiano says this is because on arm64 we can't validate that a frame pointer is okay unless the next frame also has a frame pointer, which chromium doesn't. Ideas for how to maybe fix this: 1) zero the unused portion of the message buffer in ~LogMessage before calling BreakDebugger, to clean up potential misleading stack contents. Not sure if the stack layout is actually such that this is relevant. 2) Compile the translation unit containing base::android::BreakDebugger with frame pointers enabled, so that breakpad can find it accurately from the frame pointers in libc and then switch to using chromium's CFI, eliminating stack scanning in this case. This does also happen in Chrome for Android but it doesn't appear to be as widespread? Not sure why? Crashes for WebView: https://crash.corp.google.com/browse?q=product.name%3D%27AndroidWebView%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20logging%3A%3ALogMessage%3A%3A~LogMessage%27%20AND%20stable_signature%20CONTAINS%20%27~LogMessage%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#-samplereports:5,-clientid,sdkversion,cpuarchitecture Crashes for Chrome for Android: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20logging%3A%3ALogMessage%3A%3A~LogMessage%27%20AND%20stable_signature%20CONTAINS%20%27~LogMessage%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports
,
Apr 11 2016
OK, Primiano and I looked into this more and it looks like the problem is that on both arm and arm64, frame pointer based unwinding in breakpad is a huge problem and results in off-by-one stack frames, where the values of fp/sp are out of sync with the values of pc in some cases.
It's not anything to do with not zeroing stack or similar; the failure is *before* we start stack scanning. Ignore the previous posts from us both ;)
The problem is when we crash in a leaf function in libc that does not have a stack frame (which is very common, because syscall wrappers do not create a stack frame, so this happens a lot when we call abort() which leads to a tgkill(self) syscall). What happens in this case is that the correct stack would be:
0 tgkill
1 pthread_kill
2 raise
3 abort
4 base::debug::BreakDebugger
5 logging::LogMessage::~LogMessage
6 <whichever Chromium code hit a LOG(FATAL)>
However, because there was no stack frame for tgkill (all the other functions in this stack created a frame), what we get in frame 0 in breakpad is the program counter pointing to tgkill, but the frame pointer and stack pointer point at the frame created by pthread_kill. The frame pointer unwinding trusts the actual CPU link register here to refer to the correct calling function (which in general wouldn't be reliable as LR can be used as a temp register on arm/arm64, and is destroyed by calls, but in this case happens to still point to pthread_kill). The resulting unwinding ends up being:
0 tgkill (but with the stack frame for pthread_kill)
1 pthread_kill (but with the stack frame for raise)
2 raise (but with the stack frame for abort)
3 abort (but with the stack frame for BreakDebugger, *which didn't set a frame pointer* and so fp here is garbage)
Now everything is broken, because we've recovered a program counter value that's inside the chromium code and so breakpad has CFI data available, *but the current fp/sp values are wrong* due to the off-by-one frame processing, and so applying the CFI data to the wrong state results in garbage and we jump off to space. It then tries to recover by stack scanning, and happens to "miss" and scan to the wrong place (always a risk with stack scanning) and then never makes it back on track again.
On ARM64 we were under the impression that we always set a frame pointer and so it's surprising that BreakDebugger doesn't on some builds (we should look into this), but on ARM we don't always set a frame pointer, and in any case it should not have been necessary for it to have one: if we'd unwound the libc frames correctly we would have arrived in the CFI-covered area with reasonable stack values and could have carried on using CFI to get a correct stack.
The root problem here seems to me to be that the frame pointer based stack unwinding assumes that the previous frame's link register value is where it should get the program counter value from, and this is *never* guaranteed to be reasonable for frame 1, since LR in frame 0 is *not actually valid* most of the time. We tried modifying the FP unwinding code to ignore LR entirely and just read the next PC directly from the current function's frame - in the case explained here, this results in:
0 tgkill (but with the stack frame for pthread_kill)
1 raise (with its correct stack frame - we have "skipped" pthread_kill here because the only record we were ever there was in the unreliable link register of frame 0)
2 abort
3 BreakDebugger
4 rest of stack now looks fairly reasonable because we're doing CFI from here
However, this doesn't seem like it will produce the right results for all crashes - it seems to me like we're potentially introducing the opposite problem in some other cases that don't start in a leaf function?
So to continue to test things, we found a dump that was from a device where we can get the symbols for libc (a nexus), and tested unwinding the same crash using correct CFI for *both* libc and chromium. This seemed to produce an unwinding that was instead missing the BreakDebugger frame, and skipped diretly from abort to ~LogMessage, though it did continue from there, and the two unwindings didn't agree about exactly which stack bytes belonged to which frame. We were increasingly unsure what was correct here, and I've not attached our results because we're not confident we didn't mess things up or are missing something entirely :)
The ARM64 procedure call spec says that it's unspecified at which point in the frame the frame pointer chain goes, and disassembling chromium code shows that the structure gcc produces, for a function which saves two callee-saved registers, creates a frame, and also has 32 bytes of local variables on the stack, looks like:
SP value of a callee-saved register
SP + 8 another callee-saved register
FP -> SP + 16 previous function's FP
SP + 24 previous function's LR (i.e. the return address)
SP + 32 local variables
...
SP + 64 start of previous function's stack frame
This seems like it means there's no sane way for breakpad to work out where the actual stack frame boundaries are, since the FP chain points *somewhere in the middle* of the stack frames. This probably doesn't matter a lot for unwinding purposes, but makes the output from "minidump_stackwalk -s" hard to interpret because the bytes that are between SP and FP are sometimes assigned to the previous frame and sometimes to the current frame depending exactly how you unwound it... :/
Sorry about this massive text dump, and probably some things here are wrong/confusing, but it definitely appears to be the case that at least for arm64 (and we suspect also for arm but haven't dug in detail), frame pointer based unwinding in breakpad produces incorrect results in at least some common cases, and only happens to work frequently because stack scanning happens to recover, or other mitigating coincidences...
We really need some help from breakpad/debugging experts here :)
,
Apr 11 2016
+mark, blundell, rmcilroy, probably the only three other people on this planet who might know something about the breakpad arm64 stack walker. Copy what torne@ said... it was a long day staring at assembly today. The summary to me seems to be: - On arm64, if we always do CFI-based unwinding everything is fine. - This is not the case when the topmost functions in the stack are functions for which we don't have CFI symbols (e.g., crashing in libc abort, like in this issue) - In cases like this, breakpad ends up doing the unwinding as follows: using frame pointer for the topmost libc frames, and CFI once it hits chrome. - As torne@ thoroughly described, the arm64 FP-based unwinding seems to be off-by-one whenever the link-register does not match the PC of the previous frame (for instance, but not exclusively, in the case of a syscall wrapper). - These off by one errors might be masked and recovered (in the sense of producing a valid sequence of call stack, even though the state of each frame is completely wrong) if we have frame pointers (i.e. no -fomit-frame-pointer) also in chrome. - Due to the GYP->GN transition looks like we missed frame pointers on arm64 for a while (M49-M50). Effectively this ended up obliterating what we did in crrev.com/381923002 to address Issue 391706. The first revision that re-disabled FPO optimization on arm64 is 50.0.2641.0 (https://crrev.com/1666033002). - torne@ and I suspect this FP-based unwinding should be likewise wrong on ARM. However the problem there is less evident (at least for Android), because the bionic arm32 libc.so does seems to be built with FPO optimization (so, no frame pointer, as opposite to arm64) causing breakpad to perform a last-resort stack scan instead of fp-based unwinding on arm32. - In essence, it might be that for the cases like #3 a stack-scan might end up producing better results than a fp-driven unwinding with the wrong logic.
,
Apr 12 2016
IIRC, Breakpad on iOS doesn't have CFI so always use fp-based unwinding, and that's always been the case. AFAIK there haven't been any widespread problems with incorrect callstacks being reported as a result. It sounds like you think that Breakpad is making some assumptions about the relationship between FP and LR that aren't valid. Could you restate those assumptions succinctly?
,
Apr 12 2016
The core problem is that breakpad believes that the LR value in frame 0 means something, which is rarely going to be true.
,
Apr 12 2016
Also, you don't necessarily get an incorrect *callstack* in all cases (though we do in the cases we're especially concerned about here) - some of the information reported by minidump_stackwalk is wrong (register values, etc) but it can still be the case that the callstack itself is correct..
,
Apr 12 2016
I walked through this with Primiano this morning. I don't actually think it is due to breakpad believing the LR in frame 0 means something (i.e., in the example above LR is actually correct and breakpad still get's the stackwalk wrong). The conclusion we came to was that the FP frame walking machinery actually puts the callee's LR in the caller's LR context. i.e., in [1] the callers_lr is calculated as the last_fp + 8 - this is the location which the callee pushed the LR it got on entry. However, this is the LR of the callee, not the caller (it was set during the Branch And Link from the caller, so the caller never had this value in it's LR). As such, something like the approach of ignoring LR entirely and always reading caller's PC from the callee's frame might be a solution, however I'm not sure how that would play with the CFI stack walking. https://code.google.com/p/chromium/codesearch#chromium/src/breakpad/src/processor/stackwalker_arm64.cc&q=caller_lr&sq=package:chromium&type=cs&l=199
,
Apr 12 2016
Succinctly I think that the problem is this (which is a code-translation of what rmcilroy wrote in #8): https://codereview.chromium.org/1884503002 It's very interesting to know that iOS uses only FP and no CFI, in fact I think that's makes a bit easier to see what's happening. Take for instance crash ID: 809362c400000000 Download the minidump and the symbols (repro steps in http://www/~primiano/bugs/crbug_601759/repro_steps.txt) and symbolize with and without the change. Here are the result of the iOS stackwalk: with ToT: http://~primiano/bugs/crbug_601759/stackwalk_tot.txt with this patch: http://~primiano/bugs/crbug_601759/stackwalk_with_crrev_1884503002.txt In the case of ToT, Chrome!base::debug::BreakDebugger() shows twice (both in frame #0 and frame #1). I do believe that the stack that you see in frame #1 instead, is the actual stack of Chrome!-[CRWWebController webViewDidChange] [crw_web_controller.mm : 1127 + 0x10] In fact BreakDebugger does not have such a big stack if you look at its disassembly. From a concrete point, this is probably not affecting iOS that much: your top frames are often duplicated (unless the topmost function on the stack is a leaf function that did not emit a frame), and all the stack contents that you see are off by one (but probably nobody ever realized and nobody cares about the context of the stack per each frame). But I believe that in the case of Android, where we mix FP and CFI, this causes us to enter the chrome unwinding at the wrong point, as the stack frames are off by one.
,
Apr 12 2016
Interesting. Note that the arm64 logic in question is just inherited from the arm logic (i.e., stackwalker_arm.cc). Mark would thus be the best person to actually review that change, as he reviewed the initial port of Breakpad to ARM.
,
Apr 12 2016
to make the implication in c#10 explicit, if the fix in your CL is correct then it presumably also needs to be made in stackwalker_arm.cc.
,
Apr 12 2016
Yeah, we only looked at this in detail for arm64 but the logic is indeed the same and we've observed the same issue on arm crashes - it seems to be less common but maybe one of the ways for this to be covered up is more likely to work on arm.
,
Apr 12 2016
Yeah, at least for android, on arm neither chrome nor Android's libc seem to have frame pointer enabled (they also build with FPO). So I guess we never hit that code. The thing that would be really useful here is if somebody did run my patch against some bunch of crashes to make sure that does not break / regress anything else. I spent most of these two days this and my spare bandwidth for this week is really over :-/ If anybody could help on the testing side would be really helpful.
,
Apr 12 2016
On iOS FP is always used as CFI doesn't exist, so that code is hit. Regression testing in Breakpad is always a mess (yay, Crashpad!). AFAIK the typical regression testing methodology is "land the patch and see if things go haywire."
,
Apr 12 2016
On android only a small minority of crashes are arm64 and so changing it just there would be somewhat safer as a test but I guess this isn't the case on iOS and it would affect both, so that's not as good an idea as we originally assumed ;)
,
Apr 12 2016
I think the starting point is to go through the CL with Mark to see what he thinks of the change. Discussion of strategies for landing the change can happen after that IMO.
,
Apr 13 2016
Ok let's do this. Let me take another look to that CL before sending out for review (I wonder whther that can break the other cases, e.g. where we start with CFI and do FP-unwinding later). I am out of bandwidth this week, so can take a look at earliest next week. My understanding is that this is not anymore urgent how we thought in the beginning, as we found out the the cause of WebView crash regression was the lack of FP due to the GN transiton. If anybody wants to look / play / test this in the meantime, I will be more than happy. But I somehow suspect that this will not happen.
,
Apr 13 2016
I'll double-check that we stop having so many bad decodings in webview after M50 goes out, which should be compiled with frame pointers again; if not then we might need to do this more urgently but it seems likely to be fine.
,
Apr 18 2016
M50 webview has almost none of these bad decodings, so building with frame pointers does seem to work around this. I suspect the only reason we ever needed to build with frame pointers is because of this bug, though, and if we fix it then we can build with -fomit-frame-pointer again and get a teeny binsize/perf win. |
||||
►
Sign in to add a comment |
||||
Comment 1 Deleted