New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 906142 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: Nov 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

kernel crash collector misses some signatures

Project Member Reported by semenzato@chromium.org, Nov 16

Issue description

Steve discovered that crash reports such as

https://crash.corp.google.com/browse?stbtiq=006a585ec258db1b

omit the function name from the signature (in this case, [<ffffffff94fb0c27>)

The reason seems to be that the RE for a signature cannot match the crash log:

const char* const kPCRegex[] = {
    0, " PC is at ([^\\+ ]+).*",
    " epc\\s+:\\s+\\S+\\s+([^\\+ ]+).*",  // MIPS has an exception program
                                          // counter
    " EIP: \\[<.*>\\] ([^\\+ ]+).*",  // X86 uses EIP for the program counter
    " RIP  \\[<.*>\\] ([^\\+ ]+).*",  // X86_64 uses RIP for the program counter
};

but the crash log contains "RIP: ..." with a colon.  The fix would seem obvious:

    " RIP[: ] \\[<.*>\\] ([^\\+ ]+).*",

I'll send a CL.
 
I meant to add this log snippet for convenience.  Also, the difference is not just the presence of the colon.


<1>[ 1505.850798] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
<1>[ 1505.850823] IP: [<ffffffff94fb0c27>] list_del_init+0x8/0x1b
<5>[ 1505.850843] PGD 0
<5>[ 1505.850854] Oops: 0002 [#1] SMP
<0>[ 1505.853049] gsmi: Log Shutdown Reason 0x03
<5>[ 1505.853059] Modules linked in: ip6t_REJECT rfcomm i2c_dev uinput zram(C) memconsole zsmalloc(C) snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc fuse nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer r8169 ath9k_btcoex ath9k_common_btcoex ath9k_hw_btcoex ath mac80211 cfg80211 ath3k btusb btrtl btbcm btintel bluetooth
<5>[ 1505.853231] CPU 1
<5>[ 1505.853240] Pid: 2663, comm: quipper Tainted: G WC 3.8.11 #1
<5>[ 1505.853254] RIP: 0010:[<ffffffff94fb0c27>] [<ffffffff94fb0c27>] list_del_init+0x8/0x1b
<5>[ 1505.853272] RSP: 0000:ffff880171789dd8 EFLAGS: 00010293
<5>[ 1505.853282] RAX: ffff880171789de8 RBX: ffff8801715e6b40 RCX: 000000000000003c
<5>[ 1505.853294] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8801715e6b40

Did we really completely miss this? :P :P :P

95906b24fbe4d arch/um/sys-x86_64/sysrq.c (Jeff Dike          2008-02-04 22:31:20 -0800 21)      printk(KERN_INFO "RIP: %04lx:[<%016lx>]\n", PT_REGS_CS(regs) & 0xffff,

Also, we started looking for RIP only in 2012.

commit 6755167e049cd03d20c2e4d1037700a176196101
Author:     Bryan Freed <bfreed@chromium.org>
AuthorDate: Mon Apr 2 17:05:48 2012 -0700
Commit:     Gerrit <chrome-bot@google.com>
CommitDate: Tue Apr 3 17:47:37 2012 -0700

    Better support 64bit x86 kernel crashes.
    
    FindCrashingFunction() fails to find the crashing kernel function because
    the x86 64bit kernel code (arch/x86/kernel/dumpstack.c:die()) issues "RIP"
    instead of "EIP".  So it falls back to using the panic string in the crash
    signature.
    
    This causes the logging_KernelCrash.py test to fail because it looks for the
    crashing function in the signature string.
    
    BUG=chromium-os:22353
    TEST=logging_KernelCrash
    
    Change-Id: Ic4790648facf5f0be2b82001f8e18b606eca16c7
    Reviewed-on: https://gerrit.chromium.org/gerrit/19492
    Reviewed-by: Ben Chan <benchan@chromium.org>
    Tested-by: Bryan Freed <bfreed@chromium.org>
    Commit-Ready: Bryan Freed <bfreed@chromium.org>

diff --git a/crash-reporter/kernel_collector.cc b/crash-reporter/kernel_collector.cc
index d23c59591..0895b4eb3 100644
--- a/crash-reporter/kernel_collector.cc
+++ b/crash-reporter/kernel_collector.cc
@@ -41,6 +41,7 @@ static const char *s_pc_regex[] = {
   0,
   " PC is at ([^\\+ ]+).*",
   " EIP: \\[<.*>\\] ([^\\+ ]+).*",  // X86 uses EIP for the program counter
+  " RIP  \\[<.*>\\] ([^\\+ ]+).*",  // X86_64 uses RIP for the program counter
 };

Seriously? :)


Cc: dgagnon@chromium.org krk@google.com
Labels: XAct
I am not sure this ever worked as intended.

The intention of these patterns seems to be that the group matching the function name should stop at + or space.  But it does not.  For instance we get this signature on kernel 3.8 ARM:

namespace_unlock+0xbc/0x114

apparently from this like in the log:

<5>[ 2643.251426] PC is at namespace_unlock+0xbc/0x114

(this is from https://crash.corp.google.com/browse?stbtiq=6a1811763216104b)

I am not even sure where that signature is generated because kernel_collector.cc always appends the hash:

std::string KernelCollector::ComputeKernelStackSignature(

  return StringPrintf("%s-%s%s-%08X", kKernelExecName,
                      (is_watchdog_crash ? "(HANG)-" : ""),
                      human_string.c_str(), stack_hash);

(human_string is what's matched by the regexes in #2)

Labels: -Pri-3 Pri-2
So for instance, this is what we see today (all kernel crashes):

1	UnspecifiedSignature 16.26%
2	kernel-skl_compute_wm-E7D4AA9D 7.58%
3	dpm_drv_timeout+0x50/0x60 6.46%
4	namespace_unlock+0x84/0x114 1.11%
5	put_prev_task_fair+0x304/0x3f8 0.86%
6	namespace_unlock+0xbc/0x114 0.80%
7	kernel-elan_probe-7C818D34 0.60%
8	kernel-(HANG)-hung_task: blocked tasks-00000000 0.51%
9	mwifiex_wait_queue_complete+0x48/0x154 0.40%
10	kernel-Fatal machine check-00000000 0.36%
11	reset_ctrl_regs+0xd0/0x290 0.31%
12	kernel-(HANG)-hung_task: blocked tasks-50B91DA3 0.28%
13	cpuidle_enter_state+0x60/0xe8 0.25%
14	klist_release+0x3c/0xe4 0.23%
15	kernel-Machine halted.-4924674A 0.22%
16	kernel-Fatal Machine check-00000000 0.21%

To start with, that's too many UnspecifiedSignature.  Also it would be good to be consistent.

The root problem, I believe, is that the crash reporter doesn't have a unit test.
Labels: -Restrict-View-Google
#5 correction: the crash reporter doesn't have a unit test for some signature types.
Status: Started (was: Untriaged)
I've looked at the git history for __show_regs() in arch/x86/um/sysrq_64.c (and previously at arch/um/sys-x86_64/sysrq.c) and at no time RIP was NOT followed by a colon.

Comment 9 Deleted

Cc: benchan@chromium.org
Re comment #5: "...does not have a unit test."

That would be up to the code reviewer to ask for one, right?
Maybe ask the reviewer if they would care to add one now? :D
BTW, is it possible there are multiple places which report "RIP"?

I'm looking at tutorial and the same dmesg example has both with and without ":".

http://d3s.mff.cuni.cz/teaching/crash_dump_analysis/slides/08-linux.pdf
You're right:

./arch/x86/kernel/dumpstack.c:305:	printk(KERN_ALERT "RIP ");

	/* Executive summary in case the oops scrolled away */
	printk(KERN_ALERT "RIP ");
	printk_address(regs->ip, 1);
	printk(" RSP <%016lx>\n", regs->sp);

That's in __die(), after calling show_regs(), which prints "RIP:".  So the version with the colon is probably more reliable.


Want to give some context on how this can impact the Chromeos team during weekly health checks. Health checks are usually done for beta/dev versions released in the previous week. For example, 72.0.3602.0 (11232.0.0) that was pushed on Monday will be checked next week and currently out of the 118 reports, almost all of them are showing the hash with a crash count of 1.

https://screenshot.googleplex.com/QJxYH5KEfqe.png (many more below screenshot limit)

This make it very hard to distinguish which signatures are more impactful and deserves a callout for engineers' time compared to other ones.
Labels: -Pri-2 Pri-1
#13 agree.  Making Pri-1.  I was treating as such anyway.
Sadly the reason for the missing function name in the signature is not obvious.  I wrote the missing unit test for x86_64 and the current code appears to WAI.  (But I made a number of changes so I need to double-check.)
Cc: jwer...@chromium.org satorux@chromium.org
Owner: ----
Ignore #15.  I was confused, and the regular expression was really wrong.
So I think that https://crrev.com/c/1340434 fixes this.
Some emacs-lisp to convert the syslog to a string constant.


;;; Converts text to a C string literal.                                                                                            
(defun text-to-c-string ()
  (interactive)

  (goto-char (point-min))
  (while (search-forward "\\" nil t)
    (replace-match "\\\\\\\\"))
  
  (goto-char (point-min))
  (while (search-forward "\"" nil t)
    (replace-match "\\\\\""))
  
  (goto-char (point-min))
  (while (progn
           (re-search-forward "$" nil t)
           (replace-match "\\\\n")
           (forward-char 1)
           (not (eobp))))
  
  (goto-char (point-min))
  (while (progn
           (insert "\"")
           (= (forward-line) 0)))

  (goto-char (point-min))
  (while (not (eobp))
    ;; Process one line.                                                                                                            
    (beginning-of-line)
    (let ((bol (point)))
      (end-of-line)
      (cond
       ((> (current-column) 73)
        ;; Break at space, if possible.  Otherwise break where we are.                                                              
        (move-to-column 73)
        (re-search-backward "[ \t]" bol t)
        (forward-char 1)
        (insert "\"\n\""))
       (t
        (insert "\"")
        (forward-char 1))))))


Project Member

Comment 18 by bugdroid1@chromium.org, Nov 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465

commit cd616235bf53a6e28fa5bdde3c4ec5f0a7273465
Author: Luigi Semenzato <semenzato@chromium.org>
Date: Sun Nov 18 15:11:08 2018

crash: kernel_collector: fix x86_64 and add test

The regular expression for x86_64 had an extra space.
I don't think this ever worked.

Also added a unit test for that case.

Some small cleanup:

* Removed the print_diagnostics flag and corresponding
printfs.  This was essentially a debugging aid which
took advantage of the fact that the output of crash_reporter
is usually discarded.  I replaced some of those printfs
with LOG(WARNING) where it made sense.

* Refactored ComputeKernelStackSignatureCommon for clarity.

BUG= chromium:906142 
TEST=unit test

Change-Id: I598b6da0f95be9ee99239719561a03a1fd92ec51
Reviewed-on: https://chromium-review.googlesource.com/1340434
Commit-Ready: Luigi Semenzato <semenzato@chromium.org>
Tested-by: Luigi Semenzato <semenzato@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector.h
[modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/crash_reporter.cc
[modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector_test.cc
[modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector.cc

Status: Fixed (was: Started)
I am hoping that builds following this one will show a much smaller number of unspecified signatures.  If that's the case, we can mark this Verified.
semenzato@ Thank you so much for working on this!!

Sign in to add a comment