kernel modules don't get symbolic addresses in stack crawls |
|||||||||
Issue descriptionIf you look at the list of crash reports on kevin, you see that the top few ones are: * kernel-0xffffffbffc1cb02c * kernel-0xffffffbffc1c202c * kernel-0xffffffbffc1cb024 * kernel-0xffffffbffc1c2024 * kernel-0xffffffbffc1b902c That's not very helpful. Basically: the kernel is unable to give any good stack crawls whenever it gets to an address in a module. That's annoying. ===== This was talked about a bunch in the context of the private bug b/35580462 when we were trying to track down WiFi problems. At the time, Kevin proposed two approaches here: -- Approach #1: make the kernel at least print module-relative offsets * https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/399680 CHROMIUM: printk: Make %pS print offsets into modules AKA: https://patchwork.kernel.org/patch/9391241/ -- Approach #2: don't strip * https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/403312 cros-kernel2: Don't strip local symbols from kernel modules -- Upstream was a little confused by approach #1. I believe they found it strange and unexpected that modules had their symbols stripped but that the main kernel didn't have its symbols stripped. For approach #2 folks were a little concerned about the size impact. ===== It seems like we should come to a resolution here. IMHO the 10-15% increase in kernel module size is probably worth it. NOTE: a third approach was talked about where we somehow came up with a tools infrastructure that made it easy to resolve these values into something sensible by combining with debug symbols. While this is possible, IMHO it's a bit too much overhead. We'd need to dump the module table in all crash dumps, we'd have to integrate it with the crash reporter, and everyone digging into ramoops files would need manually do the resolution (or we'd need to post-process ramoops files?)
,
Oct 17 2017
@1: Yes. See above: > Approach #2: don't strip > > * https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/403312 > cros-kernel2: Don't strip local symbols from kernel modules CL:403312 resolves to the URL I posted.
,
Oct 17 2017
This one works for me: https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/724168 ...and we could probably land that if others agree that the overhead is worth it. If we want to save a bit of space we could probably trim down the set of installed firmware files, which we need to do anyway.
,
Oct 17 2017
Just to give an example of a stack dump I placed in mwifiex. With things stripped (existing), I see: [ 9.258544] [<ffffffc0004cfcc8>] dump_stack+0x8c/0xc8 [ 9.258557] [<ffffffbffc0a7ea0>] mwifiex_add_card+0x110/0x508 [mwifiex] [ 9.258561] [<ffffffbffc0ea114>] 0xffffffbffc0ea114 [ 9.258565] [<ffffffc00051c824>] pci_device_probe+0xa4/0xf4 Without things stripped: [ 9.718673] [<ffffffc0004cfcc8>] dump_stack+0x8c/0xc8 [ 9.723735] [<ffffffbffc0e2ea0>] mwifiex_add_card+0x110/0x434 [mwifiex] [ 9.730353] [<ffffffbffc133114>] mwifiex_pcie_probe+0x114/0x164 [mwifiex_pcie] [ 9.737574] [<ffffffc00051c824>] pci_device_probe+0xa4/0xf4 --- As you can see, _sometimes_ it resolves to a symbol and sometimes not. ...though I'm not convinced that symbol is really right. From what I can remember in the arm32 days the function name was often meaningless because it picked the closest symbol it could find. --- Kirtika pointed to an x86 example from b/67463786. There several modules were involved: [ 7.898893] [<ffffffff9529c060>] dump_stack+0x4e/0x71 [ 7.898913] [<ffffffffc0331622>] iwl_trans_pcie_send_hcmd+0x32f/0x55a [iwlwifi] [ 7.898931] [<ffffffffc037acbb>] ? iwl_mvm_update_d0i3_power_mode+0x68b/0x763 [iwlmvm] [ 7.898947] [<ffffffff94c965e1>] ? finish_wait+0x64/0x64 [ 7.898960] [<ffffffffc032b4ef>] iwl_trans_send_cmd+0xa3/0xa5 [iwlwifi] [ 7.898978] [<ffffffffc03636c1>] iwl_mvm_send_cmd+0x55/0x90 [iwlmvm] [ 7.898993] [<ffffffffc0363744>] iwl_mvm_send_cmd_pdu+0x48/0x5e [iwlmvm] [ 7.899008] [<ffffffffc036123f>] iwl_mvm_get_ctrl_pos+0x191/0x1d4 [iwlmvm] [ 7.899024] [<ffffffffc036132a>] iwl_mvm_phy_ctxt_changed+0x22/0x24 [iwlmvm] [ 7.899039] [<ffffffffc035ada8>] iwl_mvm_rx_mfuart_notif+0x2d89/0x2db4 [iwlmvm] [ 7.899055] [<ffffffffc035ae00>] ? iwl_mvm_ref+0x2d/0x7c [iwlmvm] [ 7.899073] [<ffffffffc02c13ff>] __iwl7000_ieee80211_iter_chan_contexts_atomic+0x69b/0x797 [iwl7000_mac80211] [ 7.899095] [<ffffffffc02c188e>] ieee80211_recalc_chanctx_min_def+0x1b4/0x316 [iwl7000_mac80211] [ 7.899115] [<ffffffffc02c2274>] ieee80211_recalc_smps_chanctx+0x666/0x93c [iwl7000_mac80211] [ 7.899134] [<ffffffffc02c33b7>] ieee80211_vif_use_channel+0x19d/0x20e [iwl7000_mac80211] [ 7.899153] [<ffffffffc02d4177>] __iwl7000_ieee80211_connection_loss+0xb05/0x1166 [iwl7000_mac80211] [ 7.899174] [<ffffffffc02d8fbb>] ieee80211_mgd_auth+0x21d/0x2dc [iwl7000_mac80211] [ 7.899193] [<ffffffffc02ab32b>] __iwl7000_ieee80211_csa_finish+0x628/0x3249 [iwl7000_mac80211] [ 7.899212] [<ffffffffc0246d4e>] cfg80211_mlme_auth+0x164/0x1a5 [cfg80211] [ 7.899228] [<ffffffffc0235bef>] cfg80211_vendor_cmd_reply+0xdff/0x2558 [cfg80211] [ 7.899245] [<ffffffff951dde9c>] genl_family_rcv_msg+0x28f/0x303 It seemed like it again did a pretty good job at at least printing _something_ that wasn't a hex address, though I wonder if it really printed sensible function names. So basically it seems like arm64 is worse than everything else in that it often prints something totally useless. ...but I'd also believe that everything else is not perfect either. I'll try to grab more data.
,
Oct 18 2017
tl; dr:
* The effect of keeping local symbols is ~6% on arm32, ~10% on x86_64, and ~14% on arm64.
* On arm32, this change is least important: things kinda sorta work "well enough"
* On arm64, this change is very important. Things mostly don't work.
* Total space increase on x86_64 and arm64 is 1.7 MB.
We can make things a little less bad with a patch like Kevin's (that at least prints the module). If we're really going to do that then we should try to convince upstream that our use case is a sane one (upstream, by default, seems to keep local symbols in modules).
Details below. I'm mostly comparing veyron_jerry (arm32, kernel v3.14) vs. kevin (arm64, kernel 4.4):
I'm still of the opinion that we should eat the 1.7 MB of disk space. I'll see about going through our firmware to see if there's some low hanging fruit where we can recover space there...
=====
As a simple test, I'll run this on both (after adding a WARN_ON(1) to kevin to match jerry):
echo 1 > /sys/kernel/debug/mwifiex/mlan0/reset
--
BEFORE my change:
jerry:
[ 37.820895] [<c0223fe4>] (warn_slowpath_null) from [<bf22564c>] (mwifiex_sdio_card_reset+0x34/0x2d8 [mwifiex_sdio])
[ 37.831415] [<bf22564c>] (mwifiex_sdio_card_reset [mwifiex_sdio]) from [<bf1f57c8>] (mwifiex_reset_write+0xe4/0x114 [mwifiex])
[ 37.842895] [<bf1f57c8>] (mwifiex_reset_write [mwifiex]) from [<c033fa40>] (vfs_write+0xe4/0x178)
[ 37.851798] [<c033fa40>] (vfs_write) from [<c0340024>] (SyS_write+0x64/0xbc)
kevin:
[ 58.611081] [<ffffffbffc0e35c4>] 0xffffffbffc0e35c4
[ 58.615977] [<ffffffbffc0c1ecc>] mwifiex_debugfs_remove+0x16e4/0x2818 [mwifiex]
[ 58.623310] [<ffffffc00044db98>] full_proxy_write+0x64/0xac
[ 58.628895] [<ffffffc00036c554>] __vfs_write+0x54/0x10c
[ 58.634130] [<ffffffc00036c7b8>] vfs_write+0x98/0xd0
--
AFTER my change:
jerry:
[ 35.883399] [<c0223fe4>] (warn_slowpath_null) from [<bf25064c>] (mwifiex_sdio_card_reset+0x34/0x70 [mwifiex_sdio])
[ 35.893862] [<bf25064c>] (mwifiex_sdio_card_reset [mwifiex_sdio]) from [<bf2227c8>] (mwifiex_reset_write+0xe4/0x114 [mwifiex])
[ 35.905353] [<bf2227c8>] (mwifiex_reset_write [mwifiex]) from [<c033fa40>] (vfs_write+0xe4/0x178)
[ 35.914309] [<c033fa40>] (vfs_write) from [<c0340024>] (SyS_write+0x64/0xbc)
kevin:
[ 80.338236] [<ffffffbffc1355c4>] mwifiex_pcie_card_reset+0x18/0x40 [mwifiex_pcie]
[ 80.345758] [<ffffffbffc105ecc>] mwifiex_reset_write+0xa8/0xb8 [mwifiex]
...clearly things are pretty different.
=====
Overall it's kinda weird that the symbol "mwifiex_sdio_card_reset" wasn't stripped on jerry. Looking in the source code, it's declared like this:
static void mwifiex_sdio_card_reset(struct mwifiex_adapter *adapter)
...so there's no reason it shouldn't be stripped, right? ...but it clearly isn't. Not only does the stack crawl prove that, but you can also see readelf:
--
$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug | grep mwifiex_sdio_card_reset
50: 00000618 112 FUNC LOCAL DEFAULT 2 mwifiex_sdio_card_reset
$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko | grep mwifiex_sdio_card_reset
0x618 <mwifiex_sdio_card_reset>: @0x78
0x688 <mwifiex_sdio_card_reset+0x70>: @0x84
32: 00000618 112 FUNC LOCAL DEFAULT 2 mwifiex_sdio_card_reset
--
Perhaps this has something to do with those odd extra entries? Maybe they somehow cause the linker not to strip things? They look like this:
0x618 <mwifiex_sdio_card_reset>: @0x78
Compact model index: 1
0x9b vsp = r11
0x44 vsp = vsp - 20
0x86 0x83 pop {r4, r5, r11, r13, r14}
0xb0 finish
0xb0 finish
0x688 <mwifiex_sdio_card_reset+0x70>: @0x84
Compact model index: 1
0x9b vsp = r11
0x46 vsp = vsp - 28
0xb1 0x08 pop {r3}
0x86 0x87 pop {r4, r5, r6, r11, r13, r14}
--
Note on kevin it was stripped:
$ aarch64-cros-linux-gnu-readelf -a /build/kevin/usr/lib/debug/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko.debug | grep mwifiex_pcie_card_reset
111: 00000000000025ac 64 FUNC LOCAL DEFAULT 2 mwifiex_pcie_card_reset
$ aarch64-cros-linux-gnu-readelf -a /build/kevin/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | grep mwifiex_pcie_card_reset
--
...actually, I dug up an old email from vapier@ about this. He said:
> the --strip-unneeded option specifically rips out symbols that aren't needed for relocation
> processing. i imagine the ISA on x86_64/aarch64 and their PIC-friendly ISA means they can
> do a lot more relative relocs which means throwing away more symbols, while on the older
> aarch32, the symbols are needed to process relocs. hence in some cases they stay behind.
So basically the answer is: jerry works a bit better because we already don't strip as much on jerry. It's possible that those extra entries have to do with processing relocs.
=====
We can get some idea of how much stripping was happening on arm32 (where we never had a huge problems) vs. arm64 (where we clearly do) by comparing the effect of my CL on jerry and kevin:
Before:
$ du -s /build/kevin/lib/modules/
11696 /build/kevin/lib/modules/
$ du -s /build/veyron_jerry/lib/modules/
7000 /build/veyron_jerry/lib/modules/
After:
$ du -s /build/kevin/lib/modules/
13344 /build/kevin/lib/modules/
$ du -s /build/veyron_jerry/lib/modules/
7412 /build/veyron_jerry/lib/modules/
Thus on jerry the CL results in size that is 1.06 times as big (a 6% gain)
...and on kevin the CL results in size that is 1.14 times as big (a 14% gain)
In other words the devices where this is working "well enough" are working well enough because (for whatever reason) the compiler isn't stripping many symbols.
=====
Let's think about "well enough" a bit more. Let's focus on one difference on jerry, specifically:
mwifiex_sdio_card_reset+0x34/0x70
vs.
mwifiex_sdio_card_reset+0x34/0x2d8
Basically, though jerry seems to have gotten the right function, the "before" and "after" disagree on how big the function is. What's right? Let's use kgdb to answer that. On jerry:
(gdb) monitor lsmod
...
mwifiex_sdio 30777 0xbf22b0c0 0 (Live) 0xbf225000 [ ]
btmrvl_sdio 21517 0xbf21f178 0 (Live) 0xbf21b000 [ ]
btmrvl 18227 0xbf215000 1 (Live) 0xbf212000 [ btmrvl ]
mwifiex 205169 0xbf201184 1 (Live) 0xbf1d6000 [ mwifiex ]
...
(gdb) add-symbol-file /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex.ko.debug 0xbf1d6000
(gdb) add-symbol-file /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug 0xbf225000
(gdb) disass 0xbf22564c
Dump of assembler code for function mwifiex_sdio_card_reset:
0xbf225618 <+0>: mov r12, sp
...
0xbf22564c <+52>: ldr r3, [r5]
...
0xbf225674 <+92>: ldm sp, {r4, r5, r11, sp, pc}
0xbf225678 <+96>: svclt 0x0022b000
0xbf22567c <+100>: svclt 0x002294bb
0xbf225680 <+104>: svclt 0x0022b25c
0xbf225684 <+108>: ; <UNDEFINED> instruction: 0xc10084b8
End of assembler dump.
gdb gives address in decimal, so in hex that means we're actually at mwifiex_sdio_card_reset+0x34/0x70 (not /0x2d8)
=====
The above implies that, perhaps, we thought mwifiex_sdio_card_reset() was too big because we didn't have the next symbol. Let's see if that's true. Using gdb to disassemble right after mwifiex_sdio_card_reset(), I see that the next function is mwifiex_decode_rx_packet(). That function should be 616 bytes long. ...and, in fact, hex(616 + 0x70) = 0x2d8, so it looks like we somehow thought that was part of mwifiex_sdio_card_reset(). Looking through elf files:
$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug | grep mwifiex_decode_rx_packet
53: 00000688 616 FUNC LOCAL DEFAULT 2 mwifiex_decode_rx_packet
$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko | grep mwifiex_decode_rx_packet
...so right, mwifiex_decode_rx_packet() was stripped. Had we done a stack trace there we would have claimed that we were actually in mwifiex_sdio_card_reset().
=====
One thing that seems particularly bad on kevin is that we don't even get the module a lot of the time. For instance, if we claimed we were in mwifiex_sdio_card_reset() when we should have said mwifiex_decode_rx_packet() that's bad, but not as bad as on kevin saying that we're here:
[ 58.611081] [<ffffffbffc0e35c4>] 0xffffffbffc0e35c4
...why do we do that? It looks like for kevin for mwifiex_sdio the symbol table contains almost nothing in it and we don't find any symbol that's close enough. You can see with this:
$ aarch64-cros-linux-gnu-readelf -s /build/kevin/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | grep -v UND
...taking out the junk, you get:
43: 0000000000000000 896 OBJECT GLOBAL DEFAULT 24 __this_module
55: 0000000000000000 28 FUNC GLOBAL DEFAULT 6 cleanup_module
64: 0000000000000000 44 FUNC GLOBAL DEFAULT 4 init_module
79: 0000000000000038 160 OBJECT GLOBAL DEFAULT 8 __mod_pci__mwifiex_ids_de
=====
I have little experience with x86. The claim in an old email thread was that x86_64 was bad, though the above example from Kirtika shows that maybe it's not too bad.
Let's see using the size difference before and after my change:
Before:
$ du -s /build/reef/lib/modules/
17472 /build/reef/lib/modules/
After:
$ du -s /build/reef/lib/modules/
19148 /build/reef/lib/modules/
...aka a 10% gain in size, or halfway between arm32 and arm64. What about if we look at symbols? Certainly there are _a lot_ more symbols in:
x86_64-cros-linux-gnu-readelf -s /build/reef/lib/modules/4.4.86-11896-g52e8e9b1a09e/kernel/drivers/net/wireless/iwl7000/iwlwifi/iwlwifi.ko | grep -v UND
=====
Actually, maybe we can compare apples to apples by using a common module.
aarch64-cros-linux-gnu-readelf -s /build/kevin/lib/modules/4.4.86/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND
vs.
x86_64-cros-linux-gnu-readelf -s /build/reef/lib/modules/4.4.86-11896-g52e8e9b1a09e/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND
vs.
armv7a-cros-linux-gnueabi-readelf -s /build/veyron_jerry/lib/modules/3.14.0/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND
On arm32, we have:
11: 00000000 68 FUNC LOCAL DEFAULT 2 ip6table_filter_hook
14: 00000044 32 FUNC LOCAL DEFAULT 2 ip6table_filter_net_exit
15: 00000064 124 FUNC LOCAL DEFAULT 2 ip6table_filter_net_init
...but on x86_64 and arm64 we don't have anything really interesting...
The above is relevant because we needed to spend time on this when dealing with bug #774279 .
,
Oct 18 2017
I would say go for it. I keep hitting that problem, and it is sometimes annoying when trying to decode stack traces from crashes observed in the field.
,
Oct 18 2017
Agreed with going for it.
,
Oct 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb commit 85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb Author: Douglas Anderson <dianders@chromium.org> Date: Sun Oct 22 19:34:29 2017 cros-kernel2: Don't strip local symbols from kernel modules splitdebug will normally use `strip --strip-unneeded` which deletes local (static) symbols. This severely affects the kernel's ability to print useful backtraces. The problems affect arm64 and x86_64 the worst since on arm32 the compiler / linker tend to keep quite a few local symbols for their own purposes. Depending on the module, the lack of local symbols can either make the stack crawl _very_ hard to interpret (due to the kernel just dumping a raw hex number into the backtrace) or sorta hard to interpret (due to the kernel thinking we're part of whichever closest function it has a symbol for). Adding the symbol information adds some overhead both in terms of disk space used and in terms of RAM used. For disk space used on a few representative kernels, we've got: arm32 (kernel 3.14): 7000K => 7412K arm64 (kernel 4.4): 11696K => 13344K x86_64 (kernel 4.4): 17472K => 19148K ...AKA, it's not huge and we can save that space elsewhere it need be (in fact, CL:729700 ought to do the trick). In terms of RAM used, there will be a little bit but it should be _less than_ the above because we don't load all the modules into memory at once. I could also believe (but didn't confirm) that the kernel stores these strings more efficiently than the elf format. In order to get this behavior that is different than the portage default, we've got to work behind portage's back a little bit. Specifically: a) We've got to tell portage not to strip our modules. We use the STRIP_MASK which seems to work well. b) If we want to keep debug symbols ourselves (for kgdb) then we've got to generate those ourselves. We'll solve that by just calling "modules_install" twice on the kernel, once asking the kernel to to the stripping (it will keep local symbols) and once asking it to give full debug info. NOTE: there are a few side effects of the above change: 1. We no longer have "splitdebug" for the stuff that we stick in /usr/lib/debug. We put full elf files there. This is maginally (10%) bigger but it should be OK. These still seem to work fine for kgdb. 2. We'll get a few bits of extra stuff in the "/usr/lib/debug" because modules_install adds some extra cruft, like the a few firmware files. Again, it shouldn't hurt. 3. Build times will be slightly affected by this change, but how much depends on a lot of factors. Historically each call into the kernel build system was quite slow due to lots of calls to the compiler and the overhead of our compiler wrappers. That's getting fixed, though (see chromium:767073), at least for newer kernels. We're also speeding up our invocations of the compiler. As far as I could tell this change was either neutral or slightly faster on kernel 4.4 w/ the fixes to chromium:767073. Even on clang (the worst case) on kernel 4.4 the overhead was at most a couple of seconds even without those fixes. BUG= chromium:775257 TEST=Check size of modules; do a stack crawl from in a module Change-Id: Ie04c8a3c43640450145f2ff3da4bd4c8a159961f Reviewed-on: https://chromium-review.googlesource.com/724168 Commit-Ready: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb/eclass/cros-kernel2.eclass
,
Oct 23 2017
Landed in M-64. Not suggesting that we pick back to M-63 at the time because: * It's not _that_ critical. * Even though it's safe, it does bloat the root filesystem a little bit. On ToT that was mitigated by the fix for bug #770230 which is probably too risky for M-63.
,
Oct 23 2017
Couple of questions as an inexperienced person who only debugged a kernel ramoops once or twice: 1. Is it correct that this change applies to all architectures? 2. What new implications does this have for x86_64 where "%pS" works mostly correctly and prints the correct function name + offset, at least for the first frame in the stack-dump? Does this mean that I can start using addr2line like so, because the module now contains symbols and debug information? addr2line -e <modulename.ko> <address on stdin> When Kevin helped me debug b/67463786, we had to do a complicated dance to get to the crashing line: - download the debug form of the module. - coalesce regular module and debug form using eu-unstrip. Local (static) symbols were still missing, so use objdump to get a full disasm and then try to guess crashing line from there. How does that change now?
,
Oct 23 2017
So, Doug can correct me if I'm wrong, but: I would expect this to fix many of the cases where a stack trace in the kernel log is missing the name of a local symbol. I would not expect this to add detailed line number info to the .ko files in CrOS images (because that takes up a lot of space). It looks like the fully unstripped modules still wind up in /usr/lib/debug, not in the images. So for pinpointing the exact location + circumstances of a crash in the first frame, you'll probably still need to fetch the .debug files and run eu-unstrip.
,
Oct 23 2017
> 1. Is it correct that this change applies to all architectures?
Yes. It's less important for some (like arm32) where many of the local symbols weren't stripped anyway, but it will still help even there.
> 2. What new implications does this have for x86_64 where "%pS" works
> mostly correctly and prints the correct function name + offset, at
> least for the first frame in the stack-dump?
If it was "mostly correct" before it will be even more correct now. AKA we'll have more functions present than we used to.
> Does this mean that I can start using addr2line like so, because the module now contains symbols and debug information?
I've never used addr2line before, but I wouldn't imagine the use of it would change too much. See below.
> you'll probably still need to fetch the .debug files and run eu-unstrip
Almost. At the moment we're not putting "split debug" files in /usr/lib/debug anymore. We're essentially pre-eu-unstripping them. ...but yes, you still need to fetch the debug tarball separately if you didn't do a local build. If you did a local build then the full unstripped modules will be in /usr/lib/debug.
===
Let's see. If I've got this from my WARN_ON(1):
[ 46.201798] [<ffffffbffc1355c4>] mwifiex_pcie_card_reset+0x18/0x40 [mwifiex_pcie]
...then I can run this (without eu-unstrip because we now have full modules there):
aarch64-cros-linux-gnu-objdump -dr \
/build/kevin/usr/lib/debug/lib/modules/4.4.92/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | \
grep '<mwifiex_pcie_card_reset>:'
...and I get:
00000000000025ac <mwifiex_pcie_card_reset>:
So I guess I could try addr2line of 0x25ac + 0x18 = 0x25c4
addr2line -e \
/build/kevin/usr/lib/debug/lib/modules/4.4.92/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko \
0x25c4
...and I get:
/mnt/host/source/src/third_party/kernel/v4.4/drivers/net/wireless/marvell/mwifiex/pcie.c:2791
So that seems to work OK...
,
Nov 21 2017
Adding Matt Chen and one more person from Intel since he had some feedback on how iwlwifi stack traces in Chrome OS were confusing. Matt, can you provide some examples of bad Chrome OS traces, along with a good Linux example of how you expect it to be?
,
Nov 21 2017
I recently bumped into an example: [ 1626.072587] ================================================================== [ 1626.072622] BUG: KASAN: use-after-free in hci_reset_dev+0xef6/0x104a [bluetooth] at addr ffff88016acaf3f0 [ 1626.072633] Read of size 2 by task kworker/u9:4/6387 [ 1626.072638] ============================================================================= [ 1626.072649] BUG skbuff_head_cache (Tainted: G B W ): kasan: bad access detected [ 1626.072655] ----------------------------------------------------------------------------- [ 1626.072672] INFO: Allocated in __alloc_skb+0x5e/0x2b9 age=0 cpu=2 pid=6030 [ 1626.072685] ___slab_alloc.constprop.69+0x319/0x3fc [ 1626.072697] __slab_alloc.isra.64.constprop.68+0x1d/0x22 [ 1626.072708] kmem_cache_alloc+0x85/0x17e [ 1626.072717] __alloc_skb+0x5e/0x2b9 [ 1626.072743] hci_prepare_cmd+0x2d/0xf1 [bluetooth] [ 1626.072768] hci_req_add_ev+0x61/0xdb [bluetooth] [ 1626.072790] iwl_pci_unregister_driver+0x21c9/0x2737 [iwlwifi] [ 1626.072807] iwl_pcie_irq_handler+0x7c8/0xac7 [iwlwifi] [ 1626.072819] irq_thread_fn+0x43/0x5e [ 1626.072825] irq_thread+0x1f8/0x285 [ 1626.072835] kthread+0x183/0x192 [ 1626.072846] ret_from_fork+0x3f/0x70 [ 1626.072858] INFO: Freed in __kfree_skb+0x94/0x97 age=0 cpu=1 pid=6387 [ 1626.072869] __slab_free+0x68/0x359 [ 1626.072880] kmem_cache_free+0xb1/0x128 [ 1626.072890] __kfree_skb+0x94/0x97 [ 1626.072900] kfree_skb+0xf4/0xfd [ 1626.072926] hci_reset_dev+0xeed/0x104a [bluetooth] [ 1626.072936] process_one_work+0x2eb/0x4c8 [ 1626.072947] worker_thread+0x346/0x48b [ 1626.072956] kthread+0x183/0x192 [ 1626.072966] ret_from_fork+0x3f/0x70 There are some iwlwifi calls in the middle of bluetooh calls, which doesn't make sense.
,
Nov 21 2017
i suspect that's unrelated to this issue, especially given you're using kasan. this was about the backtrace being correct, just the symbols were missing. probably want to file a new bug, and be clear whether this only occurs when using kasan.
,
Nov 21 2017
#14: The backtrace tries to find a close symbol, but the closest symbol it can find is in the iwl driver (which is why the function offset is 0x21c9 and 0x7c8). hci_reset_dev doesn't look correct either. This is more likely hci_rx_work() or another worker function in the same file.
,
Nov 27 2017
@14: can you confirm that you have the fixes from above to adjust the symbol stripping? With just the little stack crawl you provided I can't tell if this is a local build (in which case maybe you didn't sync down the eclass change?) or you have some type of bot doing the kasan build / test?
,
Nov 27 2017
I most likely did *not* have the fixes in that tree. This is a local tree with our latest iwl7000 driver on top that I was debugging and bumped into this issue. I've also seen some wrong symbols showing up in other occasions (not from Kasan). I'll update the tree and check the symbols next time I bump into a stack dump.
,
Nov 27 2017
@18: OK. Until I hear back from you on that then I'll consider the problem fixed then. ;-)
,
Nov 27 2017
Yes, please do. I'll bug you if I see the bug in the future. :)
,
Jan 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/85f3806f1a0df328559b7eb33eaf4c2cb5606925 commit 85f3806f1a0df328559b7eb33eaf4c2cb5606925 Author: Mike Frysinger <vapier@chromium.org> Date: Tue Jan 09 06:42:16 2018 cros-kernel: only mask out kernel modules The strip changes in CL:724168 were meant to change behavior for just kernel modules, but since we masked all of /lib/modules/, we also no longer created debug info for vdso files. All kernel modules live under /lib/modules/<ver>/kernel/, so disable stripping only for that subpath. BUG= chromium:775257 TEST=`emerge-eve chromeos-kernel-4_4` shows vdso .debug files again Change-Id: I196ce9c211df3d902fa5d363d99d5f8bb9210e30 Reviewed-on: https://chromium-review.googlesource.com/853332 Commit-Ready: Mike Frysinger <vapier@chromium.org> Tested-by: Mike Frysinger <vapier@chromium.org> Reviewed-by: Chong Jiang <chongjiang@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> [modify] https://crrev.com/85f3806f1a0df328559b7eb33eaf4c2cb5606925/eclass/cros-kernel2.eclass
,
Jan 9 2018
Mike, can we cherrypick #21 into release-R64-10176.B?
,
Jan 9 2018
,
Jan 9 2018
This bug requires manual review: We are only 13 days from stable. Please contact the milestone owner if you have questions. Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jan 10 2018
We're fairly close to stable. Is this risk for not going in or more of an inconvenience with analysis?
,
Jan 10 2018
This inconveniences offline analysis for CWP. We would have a workaround of building and ingesting the vdso debug info manually, rather than extracting it from the pre-built debug.tgz's. This would need to be done separately per board.
,
Jan 16 2018
Ping. Is there still time to consider this for a cherrypick? The fix in #21 been verified on data coming back from dev channel, version 10299.0.0.
,
Jan 17 2018
ok, let's get it in providing this has had bake and verification time on dev. Please merge asap to get it in the next beta
,
Feb 12 2018
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Feb 16 2018
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Feb 16 2018
I think we've missed the boat for M64, no? Anyway, I don't want to keep getting these emails, so I'll drop the Merge attempt myself :) |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by groeck@chromium.org
, Oct 17 2017