Native on-device stack unwinding broke when moving to the lld linker |
||||
Issue descriptionMonochrome native stack dumps are usually corrupt in logcat output and tombstone files. This implies that on-device stack unwinding isn't working, and hence, stack frames can't be properly delineated. This problem likely doesn't affect breakpad/crashpad, as server-side stack unwinding is done with a separate system. However, for developers that wish to decode stacks in local builds, it's a problem. torne@ correctly guessed that this was caused by the move to the LLD linker. More specifically, it might be relocation packing related. That CL explicitly warns that something like this might happen: https://chromium-review.googlesource.com/872225 Related issue 755225 discusses the related issue of maintaining scripts that symbolize device-generated stacks. There is discussion about whether such scripts should be abandoned, in favor of reusing crashpad decoding directly (ie. reusing the server-side decoding mechanism to avoid having multiple implementations). However, that choice is independent of whether a device is capable of generating a proper stack trace in the first place. Even if we don't use that trace, the broken decoding could potentially have broader impact. See discussion starting here: https://bugs.chromium.org/p/chromium/issues/detail?id=755225#c18 This bug explicitly tracks broken on-device debuggerd unwinding.
,
Jan 7
Exhibit A. Monochrome stack traces usually have a few frames, which themselves are not correct: 09-20 15:29:52.043 3198 3198 F DEBUG : backtrace: 09-20 15:29:52.043 3198 3198 F DEBUG : #00 pc 0001a772 /system/lib/libc.so (abort+63) 09-20 15:29:52.043 3198 3198 F DEBUG : #01 pc 01a89e87 /data/app/org.chromium.chrome-Y56YdyZkZ-edGnUBOySeJg==/base.apk (offset 0x21a9000)
,
Jan 7
,
Jan 7
Relocation packing doesn't appear to be the culprit (at least at first glance). Per the comment suggesting relocation packing (linked below), I tried removing the //build/config/android:lld_pack_relocations GN config. And induced crash still only shows two stack frames. https://bugs.chromium.org/p/chromium/issues/detail?id=755225#c20
,
Jan 8
Since we don't build with frame pointers, I don't think that debuggerd is going to be able to unwind the stack at all. I think that's also the reason why stacks only have a few frames. Fixing the symbol lookup of the few frames that we have can be done by looking at the offset of the native library within the apk, see here for details: https://cs.chromium.org/chromium/src/build/android/pylib/symbols/apk_native_libs.py?rcl=3cf9cd8e07ce48b301c6d86e5dead8376ecf765b&l=256 But, the frame pointer issue means that the stack traces will always be very limited. I don't think there's anything to do about that, hence the decision in crbug.com/755225#c13. Minidumps and minidump_stackwalk can probably be made easier to use, and that seems like the only reasonable way to use unwinding data that's necessary for bigger stack traces.
,
Jan 8
No, the issue here is specifically for non-official builds, which include unwind tables. The actual official builds don't include unwinding info so debuggerd can't do anything, but "default" builds do, and debuggerd could unwind them just fine before the lld switch. debuggerd is supposed to be taking the file offset of the .so within the APK into account already, and this also used to work; the address printed for the frame is supposed to be the .so-relative address, and the (offset 0xwhatever) is the offset of the .so within the .apk. The only thing the stack symbolisation script should have to do is to find the correct unstripped .so and pass the addresses in the tombstone/log directly to addr2line for that .so file. It only tries to find APK offsets/etc in order to guess which .so is the right one, not to actually do any calculation with them. The script, as-is, works correctly if you 1) disable lld and 2) only have one APK built in your output dir such that the guessing-which-library code is guaranteed to guess the right thing.
,
Jan 8
Ah, I see. Our use cases have always been official builds. So then I see how the lld change is the problem. Is debuggerd borked across all android versions or just some? (when I looked at unwinding in the official build there appears to be a change in debuggerd behavior between M and O).
,
Jan 8
I'm not sure; I don't remember in which version the load-from-apk stack trace generation was implemented and I'm not sure if it changed later, and I think cjgrant was only testing recent versions.
,
Jan 9
Correct - I was using O, and just moved to P.
,
Jan 9
Matt, on #7, I'm curious about the official build use-case. Developers rarely build official builds when coding, but presumably rely heavily on the stack symbolizer script (either against logcat crash output or a tombstone). Crashes in production builds from the wild (as I understand it) go through a completely different flow (breakpad/server/etc). So what's the intersection of official builds and the stack symbolizer script? Pardon my naivety here.
,
Jan 9
That's a totally reasonable question, and I'm more naive for only thinking of the official build case... My team works on general clank performance (startup, memory use, etc), and there are enough differences in behavior that for consistency we (nearly) only consider the official build. For example, when looking at page load time when invoked from AGSA (which needs an official build).
,
Jan 9
If you're building official builds locally you can also just re-enable unwind tables, even though official builds default to it being off. The presence of unwind tables should not affect startup time or memory use (only APK size). The only place you really can't use unwind tables is in the actual builds from the official release builder.
,
Jan 9
Interesting, I didn't know that. I should play around with that and update/write an explainer. Should those play nicely with debuggerd, assuming the ldd issue gets figured out?
,
Jan 9
Yes, if the issue here is fixed you should always get a reasonable stack from debuggerd if you build with unwind tables. However, the "exclude_unwind_tables" GN variable is not actually declared as an argument, so you can't set it yourself. It's only actually set to true if is_chrome_branded && is_official_build, though - so if you build an official but not-chrome-branded build, you get the unwind tables still. If you need to be able to do it even with is_chrome_branded you should just send a CL to move the default value into a declare_args block so it can be overridden in args.gn :)
,
Jan 14
,
Jan 15
debuggerd is unable to unwind stacks when using lld because lld is generating a readonly segment to load the initial portion of the file instead of including that data into an executable segment (which is very common, but not required):
$ readelf --segments libmonochrome.so
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x00000034 0x00000034 0x00140 0x00140 R 0x4
INTERP 0x000174 0x00000174 0x00000174 0x00013 0x00013 R 0x1
[Requesting program interpreter: /system/bin/linker]
LOAD 0x000000 0x00000000 0x00000000 0x25110d0 0x25110d0 R 0x1000
LOAD 0x2512000 0x02512000 0x02512000 0x6525070 0x6525070 R E 0x1000
LOAD 0x8a38000 0x08a38000 0x08a38000 0x2a2cc8 0x3ff9dc RW 0x1000
DYNAMIC 0x8ccb4d0 0x08ccb4d0 0x08ccb4d0 0x000f0 0x000f0 RW 0x4
GNU_RELRO 0x8a61000 0x08a61000 0x08a61000 0x279cc8 0x27a000 R 0x1
GNU_EH_FRAME 0xcd1590 0x00cd1590 0x00cd1590 0x449f34 0x449f34 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0
NOTE 0x000188 0x00000188 0x00000188 0x000b4 0x000b4 R 0x4
crash_dump's unwinder attempts to parse an ELF file from the mapping containing the program counter, but this mapping does not contain the whole ELF file.
maps contents:
8dbb0000-8dd40000 r--s 00868000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8dd40000-8ddc6000 r--s 009f7000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8ddc6000-8e3d6000 r--s 00244000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8e3d6000-908e8000 r--p 01281000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk <-- start of ELF object
908e8000-96e0e000 r-xp 03793000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk <-- text of that ELF object
96e0e000-96e37000 rw-p 09cb9000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
96e37000-970b1000 r--p 09ce2000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
The flag `-Wl,--no-rosegment` can be used to disable this behavior... and this isn't the only tool to be affected by this: https://cs.chromium.org/chromium/src/build/config/compiler/BUILD.gn?rcl=d3adca4c9f6eec0cd94bbee54db119483772d682&l=740
I've verified that setting `ro_segment_workaround_for_valgrind = true` allows debuggerd to successfully walk the stack.
How about setting this flag (or a more generally named one) on the bots or other situations where we'd like debuggerd to function?
,
Jan 15
,
Jan 15
Wow, that's a really nice find, thanks! I don't think there is any situation where we don't want debuggerd to properly walk the stack in case of a crash (symbolization or not), so I propose to simply enable this for all Android builds instead. Also, for the record, I found this has been fixed in AOSP now: https://android.googlesource.com/platform/system/core/+/01040b10b2393014c2d49265b6ef75a1a2a459a7 https://android.googlesource.com/platform/system/core/+/a09c4a6ff2806644c8baf0b93415078fa252b0e9 http://b/120981155
,
Jan 15
Candidate CL at: https://chromium-review.googlesource.com/c/chromium/src/+/1412153
,
Jan 15
Yeah it seems reasonable to just do it for android in general. I'm not sure what the advantage of having a readonly segment is? I guess it's slightly less memory mapped +x to use to find gadgets, but our main library probably has every conceivable gadget in it already without needing to try to execute non-code.
,
Jan 15
Issue 920460 has been merged into this issue.
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8e191cf853f4ac09743fe9fd6658d684051acfc1 commit 8e191cf853f4ac09743fe9fd6658d684051acfc1 Author: David 'Digit' Turner <digit@google.com> Date: Wed Jan 16 20:55:49 2019 Android: fix stack unwinding issue with lld. By default, lld is generating a small readonly segment to load the start of an ELF file (essentially the ELF header + program headers), followed by an executable segment that contains the executable portion. Other linkers like BFD or Gold actually use a single executable segment for boths. This unfortunately confuses certain tools, like the the Android debuggerd daemon, who then cannot unwind the stack properly when printing crash stack traces, or even Valgrind. This CL adds the -Wl,-no-rosegment linker parameter when building for Android to work around this issue. This should not change the size of speed of generated machine code in any significant ways. BUG=919499 R=cjgrant@chromium.org,mcarry@chromium.org,agrieve@chromium.org,jperaza@chromium.org Change-Id: I4caf295eb184076afc1109993dff9ccdbd1de873 Reviewed-on: https://chromium-review.googlesource.com/c/1412153 Reviewed-by: Peter Wen <wnwen@chromium.org> Reviewed-by: Matthew Cary <mattcary@chromium.org> Reviewed-by: agrieve <agrieve@chromium.org> Reviewed-by: Christopher Grant <cjgrant@chromium.org> Commit-Queue: Peter Wen <wnwen@chromium.org> Cr-Commit-Position: refs/heads/master@{#623348} [modify] https://crrev.com/8e191cf853f4ac09743fe9fd6658d684051acfc1/build/config/compiler/BUILD.gn |
||||
►
Sign in to add a comment |
||||
Comment 1 by cjgrant@chromium.org
, Jan 7