New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 775257 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

kernel modules don't get symbolic addresses in stack crawls

Project Member Reported by diand...@chromium.org, Oct 16 2017

Issue description

If you look at the list of crash reports on kevin, you see that the top few ones are:

* kernel-0xffffffbffc1cb02c
* kernel-0xffffffbffc1c202c
* kernel-0xffffffbffc1cb024
* kernel-0xffffffbffc1c2024
* kernel-0xffffffbffc1b902c

That's not very helpful.  Basically: the kernel is unable to give any good stack crawls whenever it gets to an address in a module.  That's annoying.

=====

This was talked about a bunch in the context of the private bug b/35580462 when we were trying to track down WiFi problems.  At the time, Kevin proposed two approaches here:

--

Approach #1: make the kernel at least print module-relative offsets

* https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/399680
  CHROMIUM: printk: Make %pS print offsets into modules

AKA: https://patchwork.kernel.org/patch/9391241/

--

Approach #2: don't strip

* https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/403312
  cros-kernel2: Don't strip local symbols from kernel modules

--

Upstream was a little confused by approach #1.  I believe they found it strange and unexpected that modules had their symbols stripped but that the main kernel didn't have its symbols stripped.

For approach #2 folks were a little concerned about the size impact.

=====

It seems like we should come to a resolution here.  IMHO the 10-15% increase in kernel module size is probably worth it.

NOTE: a third approach was talked about where we somehow came up with a tools infrastructure that made it easy to resolve these values into something sensible by combining with debug symbols.  While this is possible, IMHO it's a bit too much overhead.  We'd need to dump the module table in all crash dumps, we'd have to integrate it with the crash reporter, and everyone digging into ramoops files would need manually do the resolution (or we'd need to post-process ramoops files?)



 

Comment 1 by groeck@chromium.org, Oct 17 2017

Have you looked at CL:403312 ?

@1: Yes.  See above:

> Approach #2: don't strip
> 
> * https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/403312
>   cros-kernel2: Don't strip local symbols from kernel modules

CL:403312 resolves to the URL I posted.
This one works for me:

  https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/724168

...and we could probably land that if others agree that the overhead is worth it.  If we want to save a bit of space we could probably trim down the set of installed firmware files, which we need to do anyway.

Cc: kirtika@chromium.org
Just to give an example of a stack dump I placed in mwifiex.  With things stripped (existing), I see:

[    9.258544] [<ffffffc0004cfcc8>] dump_stack+0x8c/0xc8
[    9.258557] [<ffffffbffc0a7ea0>] mwifiex_add_card+0x110/0x508 [mwifiex]
[    9.258561] [<ffffffbffc0ea114>] 0xffffffbffc0ea114
[    9.258565] [<ffffffc00051c824>] pci_device_probe+0xa4/0xf4

Without things stripped:

[    9.718673] [<ffffffc0004cfcc8>] dump_stack+0x8c/0xc8
[    9.723735] [<ffffffbffc0e2ea0>] mwifiex_add_card+0x110/0x434 [mwifiex]
[    9.730353] [<ffffffbffc133114>] mwifiex_pcie_probe+0x114/0x164 [mwifiex_pcie]
[    9.737574] [<ffffffc00051c824>] pci_device_probe+0xa4/0xf4

---

As you can see, _sometimes_ it resolves to a symbol and sometimes not.  ...though I'm not convinced that symbol is really right.  From what I can remember in the arm32 days the function name was often meaningless because it picked the closest symbol it could find.

---

Kirtika pointed to an x86 example from b/67463786.  There several modules were involved:

[    7.898893]  [<ffffffff9529c060>] dump_stack+0x4e/0x71
[    7.898913]  [<ffffffffc0331622>] iwl_trans_pcie_send_hcmd+0x32f/0x55a [iwlwifi]
[    7.898931]  [<ffffffffc037acbb>] ? iwl_mvm_update_d0i3_power_mode+0x68b/0x763 [iwlmvm]
[    7.898947]  [<ffffffff94c965e1>] ? finish_wait+0x64/0x64
[    7.898960]  [<ffffffffc032b4ef>] iwl_trans_send_cmd+0xa3/0xa5 [iwlwifi]
[    7.898978]  [<ffffffffc03636c1>] iwl_mvm_send_cmd+0x55/0x90 [iwlmvm]
[    7.898993]  [<ffffffffc0363744>] iwl_mvm_send_cmd_pdu+0x48/0x5e [iwlmvm]
[    7.899008]  [<ffffffffc036123f>] iwl_mvm_get_ctrl_pos+0x191/0x1d4 [iwlmvm]
[    7.899024]  [<ffffffffc036132a>] iwl_mvm_phy_ctxt_changed+0x22/0x24 [iwlmvm]
[    7.899039]  [<ffffffffc035ada8>] iwl_mvm_rx_mfuart_notif+0x2d89/0x2db4 [iwlmvm]
[    7.899055]  [<ffffffffc035ae00>] ? iwl_mvm_ref+0x2d/0x7c [iwlmvm]
[    7.899073]  [<ffffffffc02c13ff>] __iwl7000_ieee80211_iter_chan_contexts_atomic+0x69b/0x797 [iwl7000_mac80211]
[    7.899095]  [<ffffffffc02c188e>] ieee80211_recalc_chanctx_min_def+0x1b4/0x316 [iwl7000_mac80211]
[    7.899115]  [<ffffffffc02c2274>] ieee80211_recalc_smps_chanctx+0x666/0x93c [iwl7000_mac80211]
[    7.899134]  [<ffffffffc02c33b7>] ieee80211_vif_use_channel+0x19d/0x20e [iwl7000_mac80211]
[    7.899153]  [<ffffffffc02d4177>] __iwl7000_ieee80211_connection_loss+0xb05/0x1166 [iwl7000_mac80211]
[    7.899174]  [<ffffffffc02d8fbb>] ieee80211_mgd_auth+0x21d/0x2dc [iwl7000_mac80211]
[    7.899193]  [<ffffffffc02ab32b>] __iwl7000_ieee80211_csa_finish+0x628/0x3249 [iwl7000_mac80211]
[    7.899212]  [<ffffffffc0246d4e>] cfg80211_mlme_auth+0x164/0x1a5 [cfg80211]
[    7.899228]  [<ffffffffc0235bef>] cfg80211_vendor_cmd_reply+0xdff/0x2558 [cfg80211]
[    7.899245]  [<ffffffff951dde9c>] genl_family_rcv_msg+0x28f/0x303

It seemed like it again did a pretty good job at at least printing _something_ that wasn't a hex address, though I wonder if it really printed sensible function names.


So basically it seems like arm64 is worse than everything else in that it often prints something totally useless.  ...but I'd also believe that everything else is not perfect either.  I'll try to grab more data.

tl; dr:

* The effect of keeping local symbols is ~6% on arm32, ~10% on x86_64, and ~14% on arm64.
* On arm32, this change is least important: things kinda sorta work "well enough"
* On arm64, this change is very important.  Things mostly don't work.
* Total space increase on x86_64 and arm64 is 1.7 MB.

We can make things a little less bad with a patch like Kevin's (that at least prints the module).  If we're really going to do that then we should try to convince upstream that our use case is a sane one (upstream, by default, seems to keep local symbols in modules).

Details below.  I'm mostly comparing veyron_jerry (arm32, kernel v3.14) vs. kevin (arm64, kernel 4.4):


I'm still of the opinion that we should eat the 1.7 MB of disk space.  I'll see about going through our firmware to see if there's some low hanging fruit where we can recover space there...

=====

As a simple test, I'll run this on both (after adding a WARN_ON(1) to kevin to match jerry): 
  echo 1 > /sys/kernel/debug/mwifiex/mlan0/reset

--

BEFORE my change:

jerry:

[   37.820895] [<c0223fe4>] (warn_slowpath_null) from [<bf22564c>] (mwifiex_sdio_card_reset+0x34/0x2d8 [mwifiex_sdio])
[   37.831415] [<bf22564c>] (mwifiex_sdio_card_reset [mwifiex_sdio]) from [<bf1f57c8>] (mwifiex_reset_write+0xe4/0x114 [mwifiex])
[   37.842895] [<bf1f57c8>] (mwifiex_reset_write [mwifiex]) from [<c033fa40>] (vfs_write+0xe4/0x178)
[   37.851798] [<c033fa40>] (vfs_write) from [<c0340024>] (SyS_write+0x64/0xbc)

kevin:

[   58.611081] [<ffffffbffc0e35c4>] 0xffffffbffc0e35c4
[   58.615977] [<ffffffbffc0c1ecc>] mwifiex_debugfs_remove+0x16e4/0x2818 [mwifiex]
[   58.623310] [<ffffffc00044db98>] full_proxy_write+0x64/0xac
[   58.628895] [<ffffffc00036c554>] __vfs_write+0x54/0x10c
[   58.634130] [<ffffffc00036c7b8>] vfs_write+0x98/0xd0

--

AFTER my change:

jerry:

[   35.883399] [<c0223fe4>] (warn_slowpath_null) from [<bf25064c>] (mwifiex_sdio_card_reset+0x34/0x70 [mwifiex_sdio])
[   35.893862] [<bf25064c>] (mwifiex_sdio_card_reset [mwifiex_sdio]) from [<bf2227c8>] (mwifiex_reset_write+0xe4/0x114 [mwifiex])
[   35.905353] [<bf2227c8>] (mwifiex_reset_write [mwifiex]) from [<c033fa40>] (vfs_write+0xe4/0x178)
[   35.914309] [<c033fa40>] (vfs_write) from [<c0340024>] (SyS_write+0x64/0xbc)

kevin:

[   80.338236] [<ffffffbffc1355c4>] mwifiex_pcie_card_reset+0x18/0x40 [mwifiex_pcie]
[   80.345758] [<ffffffbffc105ecc>] mwifiex_reset_write+0xa8/0xb8 [mwifiex]

...clearly things are pretty different.

=====

Overall it's kinda weird that the symbol "mwifiex_sdio_card_reset" wasn't stripped on jerry.  Looking in the source code, it's declared like this:

  static void mwifiex_sdio_card_reset(struct mwifiex_adapter *adapter)

...so there's no reason it shouldn't be stripped, right?  ...but it clearly isn't.  Not only does the stack crawl prove that, but you can also see readelf:

--

$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug | grep mwifiex_sdio_card_reset
    50: 00000618   112 FUNC    LOCAL  DEFAULT    2 mwifiex_sdio_card_reset

$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko | grep mwifiex_sdio_card_reset
0x618 <mwifiex_sdio_card_reset>: @0x78
0x688 <mwifiex_sdio_card_reset+0x70>: @0x84
    32: 00000618   112 FUNC    LOCAL  DEFAULT    2 mwifiex_sdio_card_reset

--

Perhaps this has something to do with those odd extra entries?  Maybe they somehow cause the linker not to strip things?  They look like this:

0x618 <mwifiex_sdio_card_reset>: @0x78
  Compact model index: 1
  0x9b      vsp = r11
  0x44      vsp = vsp - 20
  0x86 0x83 pop {r4, r5, r11, r13, r14}
  0xb0      finish
  0xb0      finish

0x688 <mwifiex_sdio_card_reset+0x70>: @0x84
  Compact model index: 1
  0x9b      vsp = r11
  0x46      vsp = vsp - 28
  0xb1 0x08 pop {r3}
  0x86 0x87 pop {r4, r5, r6, r11, r13, r14}

--

Note on kevin it was stripped:

$ aarch64-cros-linux-gnu-readelf -a /build/kevin/usr/lib/debug/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko.debug | grep mwifiex_pcie_card_reset
   111: 00000000000025ac    64 FUNC    LOCAL  DEFAULT    2 mwifiex_pcie_card_reset

$ aarch64-cros-linux-gnu-readelf -a /build/kevin/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | grep mwifiex_pcie_card_reset

--

...actually, I dug up an old email from vapier@ about this.  He said:

> the --strip-unneeded option specifically rips out symbols that aren't needed for relocation 
> processing.  i imagine the ISA on x86_64/aarch64 and their PIC-friendly ISA means they can 
> do a lot more relative relocs which means throwing away more symbols, while on the older 
> aarch32, the symbols are needed to process relocs.  hence in some cases they stay behind.

So basically the answer is: jerry works a bit better because we already don't strip as much on jerry.  It's possible that those extra entries have to do with processing relocs.

=====

We can get some idea of how much stripping was happening on arm32 (where we never had a huge problems) vs. arm64 (where we clearly do) by comparing the effect of my CL on jerry and kevin:

Before:

$ du -s /build/kevin/lib/modules/
11696   /build/kevin/lib/modules/

$ du -s /build/veyron_jerry/lib/modules/
7000    /build/veyron_jerry/lib/modules/

After:

$ du -s /build/kevin/lib/modules/
13344   /build/kevin/lib/modules/

$ du -s /build/veyron_jerry/lib/modules/
7412    /build/veyron_jerry/lib/modules/


Thus on jerry the CL results in size that is 1.06 times as big (a 6% gain)
...and on kevin the CL results in size that is 1.14 times as big (a 14% gain)


In other words the devices where this is working "well enough" are working well enough because (for whatever reason) the compiler isn't stripping many symbols.  

=====

Let's think about "well enough" a bit more.  Let's focus on one difference on jerry, specifically:

  mwifiex_sdio_card_reset+0x34/0x70
vs. 
  mwifiex_sdio_card_reset+0x34/0x2d8


Basically, though jerry seems to have gotten the right function, the "before" and "after" disagree on how big the function is.  What's right?  Let's use kgdb to answer that.  On jerry:

(gdb) monitor lsmod
...
mwifiex_sdio           30777  0xbf22b0c0    0  (Live) 0xbf225000 [ ]
btmrvl_sdio            21517  0xbf21f178    0  (Live) 0xbf21b000 [ ]
btmrvl                 18227  0xbf215000    1  (Live) 0xbf212000 [ btmrvl ]
mwifiex               205169  0xbf201184    1  (Live) 0xbf1d6000 [ mwifiex ]
...

(gdb) add-symbol-file /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex.ko.debug 0xbf1d6000
(gdb) add-symbol-file /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug 0xbf225000

(gdb) disass 0xbf22564c
Dump of assembler code for function mwifiex_sdio_card_reset:
   0xbf225618 <+0>:     mov     r12, sp
   ...
   0xbf22564c <+52>:    ldr     r3, [r5]
   ...
   0xbf225674 <+92>:    ldm     sp, {r4, r5, r11, sp, pc}
   0xbf225678 <+96>:    svclt   0x0022b000
   0xbf22567c <+100>:   svclt   0x002294bb
   0xbf225680 <+104>:   svclt   0x0022b25c
   0xbf225684 <+108>:                   ; <UNDEFINED> instruction: 0xc10084b8
End of assembler dump.

gdb gives address in decimal, so in hex that means we're actually at mwifiex_sdio_card_reset+0x34/0x70 (not /0x2d8)

=====

The above implies that, perhaps, we thought mwifiex_sdio_card_reset() was too big because we didn't have the next symbol.  Let's see if that's true.  Using gdb to disassemble right after mwifiex_sdio_card_reset(), I see that the next function is mwifiex_decode_rx_packet().  That function should be 616 bytes long.  ...and, in fact, hex(616 + 0x70) = 0x2d8, so it looks like we somehow thought that was part of mwifiex_sdio_card_reset().  Looking through elf files:

$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/usr/lib/debug/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko.debug | grep mwifiex_decode_rx_packet
    53: 00000688   616 FUNC    LOCAL  DEFAULT    2 mwifiex_decode_rx_packet
$ armv7a-cros-linux-gnueabi-readelf -a /build/veyron_jerry/lib/modules/3.14.0/kernel/drivers/net/wireless-3.8/mwifiex/mwifiex_sdio.ko | grep mwifiex_decode_rx_packet

...so right, mwifiex_decode_rx_packet() was stripped.  Had we done a stack trace there we would have claimed that we were actually in mwifiex_sdio_card_reset().

=====

One thing that seems particularly bad on kevin is that we don't even get the module a lot of the time.  For instance, if we claimed we were in mwifiex_sdio_card_reset() when we should have said mwifiex_decode_rx_packet() that's bad, but not as bad as on kevin saying that we're here:

[   58.611081] [<ffffffbffc0e35c4>] 0xffffffbffc0e35c4

...why do we do that?  It looks like for kevin for mwifiex_sdio the symbol table contains almost nothing in it and we don't find any symbol that's close enough.  You can see with this:

$ aarch64-cros-linux-gnu-readelf -s /build/kevin/lib/modules/4.4.86/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | grep -v UND

...taking out the junk, you get:

    43: 0000000000000000   896 OBJECT  GLOBAL DEFAULT   24 __this_module
    55: 0000000000000000    28 FUNC    GLOBAL DEFAULT    6 cleanup_module
    64: 0000000000000000    44 FUNC    GLOBAL DEFAULT    4 init_module
    79: 0000000000000038   160 OBJECT  GLOBAL DEFAULT    8 __mod_pci__mwifiex_ids_de

=====

I have little experience with x86.  The claim in an old email thread was that x86_64 was bad, though the above example from Kirtika shows that maybe it's not too bad.  

Let's see using the size difference before and after my change:

Before:

$ du -s /build/reef/lib/modules/
17472   /build/reef/lib/modules/

After:

$ du -s /build/reef/lib/modules/
19148   /build/reef/lib/modules/

...aka a 10% gain in size, or halfway between arm32 and arm64.  What about if we look at symbols?  Certainly there are _a lot_ more symbols in:

x86_64-cros-linux-gnu-readelf -s /build/reef/lib/modules/4.4.86-11896-g52e8e9b1a09e/kernel/drivers/net/wireless/iwl7000/iwlwifi/iwlwifi.ko | grep -v UND

=====

Actually, maybe we can compare apples to apples by using a common module.  

aarch64-cros-linux-gnu-readelf -s /build/kevin/lib/modules/4.4.86/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND

vs. 

x86_64-cros-linux-gnu-readelf -s /build/reef/lib/modules/4.4.86-11896-g52e8e9b1a09e/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND

vs.

armv7a-cros-linux-gnueabi-readelf -s /build/veyron_jerry/lib/modules/3.14.0/kernel/net/ipv6/netfilter/ip6table_filter.ko | grep -v UND


On arm32, we have:

    11: 00000000    68 FUNC    LOCAL  DEFAULT    2 ip6table_filter_hook
    14: 00000044    32 FUNC    LOCAL  DEFAULT    2 ip6table_filter_net_exit
    15: 00000064   124 FUNC    LOCAL  DEFAULT    2 ip6table_filter_net_init

...but on x86_64 and arm64 we don't have anything really interesting...


The above is relevant because we needed to spend time on this when dealing with  bug #774279 .

Comment 6 by groeck@chromium.org, Oct 18 2017

I would say go for it. I keep hitting that problem, and it is sometimes annoying when trying to decode stack traces from crashes observed in the field.


Comment 7 by snanda@chromium.org, Oct 18 2017

Agreed with going for it.
Project Member

Comment 8 by bugdroid1@chromium.org, Oct 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb

commit 85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb
Author: Douglas Anderson <dianders@chromium.org>
Date: Sun Oct 22 19:34:29 2017

cros-kernel2: Don't strip local symbols from kernel modules

splitdebug will normally use `strip --strip-unneeded` which deletes
local (static) symbols.  This severely affects the kernel's ability to
print useful backtraces.

The problems affect arm64 and x86_64 the worst since on arm32 the
compiler / linker tend to keep quite a few local symbols for their own
purposes.

Depending on the module, the lack of local symbols can either make the
stack crawl _very_ hard to interpret (due to the kernel just dumping a
raw hex number into the backtrace) or sorta hard to interpret (due to
the kernel thinking we're part of whichever closest function it has a
symbol for).

Adding the symbol information adds some overhead both in terms of disk
space used and in terms of RAM used.  For disk space used on a few
representative kernels, we've got:
  arm32 (kernel 3.14):  7000K =>  7412K
  arm64 (kernel 4.4):  11696K => 13344K
  x86_64 (kernel 4.4): 17472K => 19148K
...AKA, it's not huge and we can save that space elsewhere it need be
(in fact, CL:729700 ought to do the trick).  In terms of RAM used,
there will be a little bit but it should be _less than_ the above
because we don't load all the modules into memory at once.  I could
also believe (but didn't confirm) that the kernel stores these strings
more efficiently than the elf format.

In order to get this behavior that is different than the portage
default, we've got to work behind portage's back a little bit.
Specifically:
a) We've got to tell portage not to strip our modules.  We use the
   STRIP_MASK which seems to work well.
b) If we want to keep debug symbols ourselves (for kgdb) then we've
   got to generate those ourselves.  We'll solve that by just calling
   "modules_install" twice on the kernel, once asking the kernel to to
   the stripping (it will keep local symbols) and once asking it to
   give full debug info.

NOTE: there are a few side effects of the above change:
1. We no longer have "splitdebug" for the stuff that we stick in
   /usr/lib/debug.  We put full elf files there.  This is maginally
   (10%) bigger but it should be OK.  These still seem to work fine
   for kgdb.
2. We'll get a few bits of extra stuff in the "/usr/lib/debug" because
   modules_install adds some extra cruft, like the a few firmware
   files.  Again, it shouldn't hurt.
3. Build times will be slightly affected by this change, but how much
   depends on a lot of factors.  Historically each call into the
   kernel build system was quite slow due to lots of calls to the
   compiler and the overhead of our compiler wrappers.  That's getting
   fixed, though (see chromium:767073), at least for newer kernels.
   We're also speeding up our invocations of the compiler.  As far as
   I could tell this change was either neutral or slightly faster on
   kernel 4.4 w/ the fixes to chromium:767073.  Even on clang (the
   worst case) on kernel 4.4 the overhead was at most a couple of
   seconds even without those fixes.

BUG= chromium:775257 
TEST=Check size of modules; do a stack crawl from in a module

Change-Id: Ie04c8a3c43640450145f2ff3da4bd4c8a159961f
Reviewed-on: https://chromium-review.googlesource.com/724168
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/85266f1daa7f8f97ee46e1c5a4c0d8aeaee57ccb/eclass/cros-kernel2.eclass

Labels: M-64
Owner: diand...@chromium.org
Status: Fixed (was: Untriaged)
Landed in M-64.  Not suggesting that we pick back to M-63 at the time because:
* It's not _that_ critical.
* Even though it's safe, it does bloat the root filesystem a little bit.  On ToT that was mitigated by the fix for  bug #770230  which is probably too risky for M-63.

Comment 10 by kirtika@google.com, Oct 23 2017

Couple of questions as an inexperienced person who only debugged a kernel ramoops once or twice:
1. Is it correct that this change applies to all architectures? 
2. What new implications does this have for x86_64 where "%pS" works mostly correctly and prints the correct function name + offset, at least for the first frame in the stack-dump? Does this mean that I can start using addr2line like so, because the module now contains symbols and debug information? 
addr2line -e <modulename.ko> 
<address on stdin>

When Kevin helped me debug b/67463786, we had to do a complicated dance to get to the crashing line:
- download the debug form of the module. 
- coalesce regular module and debug form using eu-unstrip.
Local (static) symbols were still missing, so use objdump to get a full disasm
and then try to guess crashing line from there. 
How does that change now? 
So, Doug can correct me if I'm wrong, but:

I would expect this to fix many of the cases where a stack trace in the kernel log is missing the name of a local symbol.

I would not expect this to add detailed line number info to the .ko files in CrOS images (because that takes up a lot of space).  It looks like the fully unstripped modules still wind up in /usr/lib/debug, not in the images.  So for pinpointing the exact location + circumstances of a crash in the first frame, you'll probably still need to fetch the .debug files and run eu-unstrip.
> 1. Is it correct that this change applies to all architectures? 

Yes.  It's less important for some (like arm32) where many of the local symbols weren't stripped anyway, but it will still help even there.


> 2. What new implications does this have for x86_64 where "%pS" works 
> mostly correctly and prints the correct function name + offset, at 
> least for the first frame in the stack-dump? 

If it was "mostly correct" before it will be even more correct now.  AKA we'll have more functions present than we used to.


> Does this mean that I can start using addr2line like so, because the module now contains symbols and debug information? 

I've never used addr2line before, but I wouldn't imagine the use of it would change too much.  See below.


> you'll probably still need to fetch the .debug files and run eu-unstrip

Almost.  At the moment we're not putting "split debug" files in /usr/lib/debug anymore.  We're essentially pre-eu-unstripping them.  ...but yes, you still need to fetch the debug tarball separately if you didn't do a local build.  If you did a local build then the full unstripped modules will be in /usr/lib/debug.


===

Let's see.  If I've got this from my WARN_ON(1):

[   46.201798] [<ffffffbffc1355c4>] mwifiex_pcie_card_reset+0x18/0x40 [mwifiex_pcie]


...then I can run this (without eu-unstrip because we now have full modules there):
  aarch64-cros-linux-gnu-objdump -dr \
   /build/kevin/usr/lib/debug/lib/modules/4.4.92/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko | \
   grep '<mwifiex_pcie_card_reset>:'

...and I get:
  00000000000025ac <mwifiex_pcie_card_reset>:

So I guess I could try addr2line of 0x25ac + 0x18 = 0x25c4
  addr2line -e \
    /build/kevin/usr/lib/debug/lib/modules/4.4.92/kernel/drivers/net/wireless/marvell/mwifiex/mwifiex_pcie.ko \
  0x25c4

...and I get:
  /mnt/host/source/src/third_party/kernel/v4.4/drivers/net/wireless/marvell/mwifiex/pcie.c:2791


So that seems to work OK...

Comment 13 by kirtika@google.com, Nov 21 2017

Cc: matt.c...@intel.com luciano....@intel.com
Adding Matt Chen and one more person from Intel since he had some feedback on how iwlwifi stack traces in Chrome OS were confusing. 
Matt, can you provide some examples of bad Chrome OS traces, along with a good Linux example of how you expect it to be?

I recently bumped into an example:

[ 1626.072587] ==================================================================
[ 1626.072622] BUG: KASAN: use-after-free in hci_reset_dev+0xef6/0x104a [bluetooth] at addr ffff88016acaf3f0
[ 1626.072633] Read of size 2 by task kworker/u9:4/6387
[ 1626.072638] =============================================================================
[ 1626.072649] BUG skbuff_head_cache (Tainted: G    B   W      ): kasan: bad access detected
[ 1626.072655] -----------------------------------------------------------------------------

[ 1626.072672] INFO: Allocated in __alloc_skb+0x5e/0x2b9 age=0 cpu=2 pid=6030
[ 1626.072685] 	___slab_alloc.constprop.69+0x319/0x3fc
[ 1626.072697] 	__slab_alloc.isra.64.constprop.68+0x1d/0x22
[ 1626.072708] 	kmem_cache_alloc+0x85/0x17e
[ 1626.072717] 	__alloc_skb+0x5e/0x2b9
[ 1626.072743] 	hci_prepare_cmd+0x2d/0xf1 [bluetooth]
[ 1626.072768] 	hci_req_add_ev+0x61/0xdb [bluetooth]
[ 1626.072790] 	iwl_pci_unregister_driver+0x21c9/0x2737 [iwlwifi]
[ 1626.072807] 	iwl_pcie_irq_handler+0x7c8/0xac7 [iwlwifi]
[ 1626.072819] 	irq_thread_fn+0x43/0x5e
[ 1626.072825] 	irq_thread+0x1f8/0x285
[ 1626.072835] 	kthread+0x183/0x192
[ 1626.072846] 	ret_from_fork+0x3f/0x70
[ 1626.072858] INFO: Freed in __kfree_skb+0x94/0x97 age=0 cpu=1 pid=6387
[ 1626.072869] 	__slab_free+0x68/0x359
[ 1626.072880] 	kmem_cache_free+0xb1/0x128
[ 1626.072890] 	__kfree_skb+0x94/0x97
[ 1626.072900] 	kfree_skb+0xf4/0xfd
[ 1626.072926] 	hci_reset_dev+0xeed/0x104a [bluetooth]
[ 1626.072936] 	process_one_work+0x2eb/0x4c8
[ 1626.072947] 	worker_thread+0x346/0x48b
[ 1626.072956] 	kthread+0x183/0x192
[ 1626.072966] 	ret_from_fork+0x3f/0x70


There are some iwlwifi calls in the middle of bluetooh calls, which doesn't make sense.
i suspect that's unrelated to this issue, especially given you're using kasan.  this was about the backtrace being correct, just the symbols were missing.

probably want to file a new bug, and be clear whether this only occurs when using kasan.
#14: The backtrace tries to find a close symbol, but the closest symbol it can find is in the iwl driver (which is why the function offset is 0x21c9 and 0x7c8).

hci_reset_dev doesn't look correct either. This is more likely hci_rx_work() or another worker function in the same file.

@14: can you confirm that you have the fixes from above to adjust the symbol stripping?  With just the little stack crawl you provided I can't tell if this is a local build (in which case maybe you didn't sync down the eclass change?) or you have some type of bot doing the kasan build / test?
I most likely did *not* have the fixes in that tree.  This is a local tree with our latest iwl7000 driver on top that I was debugging and bumped into this issue.

I've also seen some wrong symbols showing up in other occasions (not from Kasan).

I'll update the tree and check the symbols next time I bump into a stack dump.
@18: OK.  Until I hear back from you on that then I'll consider the problem fixed then.  ;-)
Yes, please do.  I'll bug you if I see the bug in the future. :)
Project Member

Comment 21 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/85f3806f1a0df328559b7eb33eaf4c2cb5606925

commit 85f3806f1a0df328559b7eb33eaf4c2cb5606925
Author: Mike Frysinger <vapier@chromium.org>
Date: Tue Jan 09 06:42:16 2018

cros-kernel: only mask out kernel modules

The strip changes in CL:724168 were meant to change behavior for just
kernel modules, but since we masked all of /lib/modules/, we also no
longer created debug info for vdso files.  All kernel modules live
under /lib/modules/<ver>/kernel/, so disable stripping only for that
subpath.

BUG= chromium:775257 
TEST=`emerge-eve chromeos-kernel-4_4` shows vdso .debug files again

Change-Id: I196ce9c211df3d902fa5d363d99d5f8bb9210e30
Reviewed-on: https://chromium-review.googlesource.com/853332
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Chong Jiang <chongjiang@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/85f3806f1a0df328559b7eb33eaf4c2cb5606925/eclass/cros-kernel2.eclass

Mike, can we cherrypick #21 into release-R64-10176.B?
Labels: Merge-Request-64
Project Member

Comment 24 by sheriffbot@chromium.org, Jan 9 2018

Labels: -Merge-Request-64 Hotlist-Merge-Review Merge-Review-64
This bug requires manual review: We are only 13 days from stable.
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
We're fairly close to stable.  Is this risk for not going in or more of an inconvenience with analysis?

This inconveniences offline analysis for CWP. We would have a workaround of building and ingesting the vdso debug info manually, rather than extracting it from the pre-built debug.tgz's. This would need to be done separately per board.
Ping. Is there still time to consider this for a cherrypick? The fix in #21 been verified on data coming back from dev channel, version 10299.0.0.
Labels: -Merge-Review-64 Merge-Approved-64
ok, let's get it in providing this has had bake and verification time on dev. Please merge asap to get it in the next beta 
Project Member

Comment 29 by sheriffbot@chromium.org, Feb 12 2018

Cc: josa...@chromium.org
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 30 by sheriffbot@chromium.org, Feb 16 2018

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-64
I think we've missed the boat for M64, no? Anyway, I don't want to keep getting these emails, so I'll drop the Merge attempt myself :)

Sign in to add a comment