New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 919499 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug



Sign in to add a comment

Native on-device stack unwinding broke when moving to the lld linker

Project Member Reported by cjgrant@chromium.org, Jan 7

Issue description

Monochrome native stack dumps are usually corrupt in logcat output and tombstone files.  This implies that on-device stack unwinding isn't working, and hence, stack frames can't be properly delineated.

This problem likely doesn't affect breakpad/crashpad, as server-side stack unwinding is done with a separate system.  However, for developers that wish to decode stacks in local builds, it's a problem.

torne@ correctly guessed that this was caused by the move to the LLD linker.  More specifically, it might be relocation packing related.  That CL explicitly warns that something like this might happen:

https://chromium-review.googlesource.com/872225

Related issue 755225 discusses the related issue of maintaining scripts that symbolize device-generated stacks.  There is discussion about whether such scripts should be abandoned, in favor of reusing crashpad decoding directly (ie. reusing the server-side decoding mechanism to avoid having multiple implementations).  However, that choice is independent of whether a device is capable of generating a proper stack trace in the first place.  Even if we don't use that trace, the broken decoding could potentially have broader impact.  See discussion starting here:

https://bugs.chromium.org/p/chromium/issues/detail?id=755225#c18

This bug explicitly tracks broken on-device debuggerd unwinding.
 
Labels: DevX
Exhibit A.  Monochrome stack traces usually have a few frames, which themselves are not correct:

09-20 15:29:52.043  3198  3198 F DEBUG   : backtrace:
09-20 15:29:52.043  3198  3198 F DEBUG   :     #00 pc 0001a772  /system/lib/libc.so (abort+63)
09-20 15:29:52.043  3198  3198 F DEBUG   :     #01 pc 01a89e87  /data/app/org.chromium.chrome-Y56YdyZkZ-edGnUBOySeJg==/base.apk (offset 0x21a9000)
Components: -OS>Systems>CrashReporting Internals>CrashReporting
Relocation packing doesn't appear to be the culprit (at least at first glance).

Per the comment suggesting relocation packing (linked below), I tried removing the //build/config/android:lld_pack_relocations GN config.  And induced crash still only shows two stack frames.

https://bugs.chromium.org/p/chromium/issues/detail?id=755225#c20
Since we don't build with frame pointers, I don't think that debuggerd is going to be able to unwind the stack at all. I think that's also the reason why stacks only have a few frames.

Fixing the symbol lookup of the few frames that we have can be done by looking at the offset of the native library within the apk, see here for details: https://cs.chromium.org/chromium/src/build/android/pylib/symbols/apk_native_libs.py?rcl=3cf9cd8e07ce48b301c6d86e5dead8376ecf765b&l=256

But, the frame pointer issue means that the stack traces will always be very limited. I don't think there's anything to do about that, hence the decision in crbug.com/755225#c13.

Minidumps and minidump_stackwalk can probably be made easier to use, and that seems like the only reasonable way to use unwinding data that's necessary for bigger stack traces.
No, the issue here is specifically for non-official builds, which include unwind tables. The actual official builds don't include unwinding info so debuggerd can't do anything, but "default" builds do, and debuggerd could unwind them just fine before the lld switch.

debuggerd is supposed to be taking the file offset of the .so within the APK into account already, and this also used to work; the address printed for the frame is supposed to be the .so-relative address, and the (offset 0xwhatever) is the offset of the .so within the .apk.

The only thing the stack symbolisation script should have to do is to find the correct unstripped .so and pass the addresses in the tombstone/log directly to addr2line for that .so file. It only tries to find APK offsets/etc in order to guess which .so is the right one, not to actually do any calculation with them. The script, as-is, works correctly if you 1) disable lld and 2) only have one APK built in your output dir such that the guessing-which-library code is guaranteed to guess the right thing.
Ah, I see. Our use cases have always been official builds.

So then I see how the lld change is the problem. Is debuggerd borked across all android versions or just some? (when I looked at unwinding in the official build there appears to be a change in debuggerd behavior between M and O).
I'm not sure; I don't remember in which version the load-from-apk stack trace generation was implemented and I'm not sure if it changed later, and I think cjgrant was only testing recent versions.
Correct - I was using O, and just moved to P.
Matt, on #7, I'm curious about the official build use-case.  Developers rarely build official builds when coding, but presumably rely heavily on the stack symbolizer script (either against logcat crash output or a tombstone).  Crashes in production builds from the wild (as I understand it) go through a completely different flow (breakpad/server/etc).  So what's the intersection of official builds and the stack symbolizer script?  Pardon my naivety here.
That's a totally reasonable question, and I'm more naive for only thinking of the official build case...

My team works on general clank performance (startup, memory use, etc), and there are enough differences in behavior that for consistency we (nearly) only consider the official build. For example, when looking at page load time when invoked from AGSA (which needs an official build).
If you're building official builds locally you can also just re-enable unwind tables, even though official builds default to it being off. The presence of unwind tables should not affect startup time or memory use (only APK size). The only place you really can't use unwind tables is in the actual builds from the official release builder.
Interesting, I didn't know that. I should play around with that and update/write an explainer.

Should those play nicely with debuggerd, assuming the ldd issue gets figured out?
Yes, if the issue here is fixed you should always get a reasonable stack from debuggerd if you build with unwind tables.

However, the "exclude_unwind_tables" GN variable is not actually declared as an argument, so you can't set it yourself. It's only actually set to true if is_chrome_branded && is_official_build, though - so if you build an official but not-chrome-branded build, you get the unwind tables still. If you need to be able to do it even with is_chrome_branded you should just send a CL to move the default value into a declare_args block so it can be overridden in args.gn :)
Status: Available (was: Untriaged)
debuggerd is unable to unwind stacks when using lld because lld is generating a readonly segment to load the initial portion of the file instead of including that data into an executable segment (which is very common, but not required):

$ readelf --segments libmonochrome.so
Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x00000034 0x00000034 0x00140 0x00140 R   0x4
  INTERP         0x000174 0x00000174 0x00000174 0x00013 0x00013 R   0x1
      [Requesting program interpreter: /system/bin/linker]
  LOAD           0x000000 0x00000000 0x00000000 0x25110d0 0x25110d0 R   0x1000
  LOAD           0x2512000 0x02512000 0x02512000 0x6525070 0x6525070 R E 0x1000
  LOAD           0x8a38000 0x08a38000 0x08a38000 0x2a2cc8 0x3ff9dc RW  0x1000
  DYNAMIC        0x8ccb4d0 0x08ccb4d0 0x08ccb4d0 0x000f0 0x000f0 RW  0x4
  GNU_RELRO      0x8a61000 0x08a61000 0x08a61000 0x279cc8 0x27a000 R   0x1
  GNU_EH_FRAME   0xcd1590 0x00cd1590 0x00cd1590 0x449f34 0x449f34 R   0x4
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0
  NOTE           0x000188 0x00000188 0x00000188 0x000b4 0x000b4 R   0x4

crash_dump's unwinder attempts to parse an ELF file from the mapping containing the program counter, but this mapping does not contain the whole ELF file.

maps contents:
8dbb0000-8dd40000 r--s 00868000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8dd40000-8ddc6000 r--s 009f7000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8ddc6000-8e3d6000 r--s 00244000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
8e3d6000-908e8000 r--p 01281000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk  <-- start of ELF object
908e8000-96e0e000 r-xp 03793000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk  <-- text of that ELF object
96e0e000-96e37000 rw-p 09cb9000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk
96e37000-970b1000 r--p 09ce2000 fd:20 35203 /data/app/org.chromium.chrome-1/base.apk

The flag `-Wl,--no-rosegment` can be used to disable this behavior... and this isn't the only tool to be affected by this: https://cs.chromium.org/chromium/src/build/config/compiler/BUILD.gn?rcl=d3adca4c9f6eec0cd94bbee54db119483772d682&l=740

I've verified that setting `ro_segment_workaround_for_valgrind = true` allows debuggerd to successfully walk the stack.
How about setting this flag (or a more generally named one) on the bots or other situations where we'd like debuggerd to function?



Cc: jperaza@chromium.org
Wow, that's a really nice find, thanks! I don't think there is any situation where we don't want debuggerd to properly walk the stack in case of a crash (symbolization or not), so I propose to simply enable this for all Android builds instead.

Also, for the record, I found this has been fixed in AOSP now:

https://android.googlesource.com/platform/system/core/+/01040b10b2393014c2d49265b6ef75a1a2a459a7

https://android.googlesource.com/platform/system/core/+/a09c4a6ff2806644c8baf0b93415078fa252b0e9

http://b/120981155

Yeah it seems reasonable to just do it for android in general. I'm not sure what the advantage of having a readonly segment is? I guess it's slightly less memory mapped +x to use to find gadgets, but our main library probably has every conceivable gadget in it already without needing to try to execute non-code.
Issue 920460 has been merged into this issue.
Project Member

Comment 22 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8e191cf853f4ac09743fe9fd6658d684051acfc1

commit 8e191cf853f4ac09743fe9fd6658d684051acfc1
Author: David 'Digit' Turner <digit@google.com>
Date: Wed Jan 16 20:55:49 2019

Android: fix stack unwinding issue with lld.

By default, lld is generating a small readonly segment to load
the start of an ELF file (essentially the ELF header + program
headers), followed by an executable segment that contains the
executable portion. Other linkers like BFD or Gold actually
use a single executable segment for boths.

This unfortunately confuses certain tools, like the the Android
debuggerd daemon, who then cannot unwind the stack properly when
printing crash stack traces, or even Valgrind.

This CL adds the -Wl,-no-rosegment linker parameter when
building for Android to work around this issue. This should
not change the size of speed of generated machine code in
any significant ways.

BUG=919499
R=cjgrant@chromium.org,mcarry@chromium.org,agrieve@chromium.org,jperaza@chromium.org

Change-Id: I4caf295eb184076afc1109993dff9ccdbd1de873
Reviewed-on: https://chromium-review.googlesource.com/c/1412153
Reviewed-by: Peter Wen <wnwen@chromium.org>
Reviewed-by: Matthew Cary <mattcary@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Reviewed-by: Christopher Grant <cjgrant@chromium.org>
Commit-Queue: Peter Wen <wnwen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#623348}
[modify] https://crrev.com/8e191cf853f4ac09743fe9fd6658d684051acfc1/build/config/compiler/BUILD.gn

Sign in to add a comment