kernel crash collector misses some signatures |
|||||||||
Issue descriptionSteve discovered that crash reports such as https://crash.corp.google.com/browse?stbtiq=006a585ec258db1b omit the function name from the signature (in this case, [<ffffffff94fb0c27>) The reason seems to be that the RE for a signature cannot match the crash log: const char* const kPCRegex[] = { 0, " PC is at ([^\\+ ]+).*", " epc\\s+:\\s+\\S+\\s+([^\\+ ]+).*", // MIPS has an exception program // counter " EIP: \\[<.*>\\] ([^\\+ ]+).*", // X86 uses EIP for the program counter " RIP \\[<.*>\\] ([^\\+ ]+).*", // X86_64 uses RIP for the program counter }; but the crash log contains "RIP: ..." with a colon. The fix would seem obvious: " RIP[: ] \\[<.*>\\] ([^\\+ ]+).*", I'll send a CL.
,
Nov 16
Did we really completely miss this? :P :P :P 95906b24fbe4d arch/um/sys-x86_64/sysrq.c (Jeff Dike 2008-02-04 22:31:20 -0800 21) printk(KERN_INFO "RIP: %04lx:[<%016lx>]\n", PT_REGS_CS(regs) & 0xffff, Also, we started looking for RIP only in 2012. commit 6755167e049cd03d20c2e4d1037700a176196101 Author: Bryan Freed <bfreed@chromium.org> AuthorDate: Mon Apr 2 17:05:48 2012 -0700 Commit: Gerrit <chrome-bot@google.com> CommitDate: Tue Apr 3 17:47:37 2012 -0700 Better support 64bit x86 kernel crashes. FindCrashingFunction() fails to find the crashing kernel function because the x86 64bit kernel code (arch/x86/kernel/dumpstack.c:die()) issues "RIP" instead of "EIP". So it falls back to using the panic string in the crash signature. This causes the logging_KernelCrash.py test to fail because it looks for the crashing function in the signature string. BUG=chromium-os:22353 TEST=logging_KernelCrash Change-Id: Ic4790648facf5f0be2b82001f8e18b606eca16c7 Reviewed-on: https://gerrit.chromium.org/gerrit/19492 Reviewed-by: Ben Chan <benchan@chromium.org> Tested-by: Bryan Freed <bfreed@chromium.org> Commit-Ready: Bryan Freed <bfreed@chromium.org> diff --git a/crash-reporter/kernel_collector.cc b/crash-reporter/kernel_collector.cc index d23c59591..0895b4eb3 100644 --- a/crash-reporter/kernel_collector.cc +++ b/crash-reporter/kernel_collector.cc @@ -41,6 +41,7 @@ static const char *s_pc_regex[] = { 0, " PC is at ([^\\+ ]+).*", " EIP: \\[<.*>\\] ([^\\+ ]+).*", // X86 uses EIP for the program counter + " RIP \\[<.*>\\] ([^\\+ ]+).*", // X86_64 uses RIP for the program counter }; Seriously? :)
,
Nov 16
,
Nov 16
I am not sure this ever worked as intended. The intention of these patterns seems to be that the group matching the function name should stop at + or space. But it does not. For instance we get this signature on kernel 3.8 ARM: namespace_unlock+0xbc/0x114 apparently from this like in the log: <5>[ 2643.251426] PC is at namespace_unlock+0xbc/0x114 (this is from https://crash.corp.google.com/browse?stbtiq=6a1811763216104b) I am not even sure where that signature is generated because kernel_collector.cc always appends the hash: std::string KernelCollector::ComputeKernelStackSignature( return StringPrintf("%s-%s%s-%08X", kKernelExecName, (is_watchdog_crash ? "(HANG)-" : ""), human_string.c_str(), stack_hash); (human_string is what's matched by the regexes in #2)
,
Nov 16
So for instance, this is what we see today (all kernel crashes): 1 UnspecifiedSignature 16.26% 2 kernel-skl_compute_wm-E7D4AA9D 7.58% 3 dpm_drv_timeout+0x50/0x60 6.46% 4 namespace_unlock+0x84/0x114 1.11% 5 put_prev_task_fair+0x304/0x3f8 0.86% 6 namespace_unlock+0xbc/0x114 0.80% 7 kernel-elan_probe-7C818D34 0.60% 8 kernel-(HANG)-hung_task: blocked tasks-00000000 0.51% 9 mwifiex_wait_queue_complete+0x48/0x154 0.40% 10 kernel-Fatal machine check-00000000 0.36% 11 reset_ctrl_regs+0xd0/0x290 0.31% 12 kernel-(HANG)-hung_task: blocked tasks-50B91DA3 0.28% 13 cpuidle_enter_state+0x60/0xe8 0.25% 14 klist_release+0x3c/0xe4 0.23% 15 kernel-Machine halted.-4924674A 0.22% 16 kernel-Fatal Machine check-00000000 0.21% To start with, that's too many UnspecifiedSignature. Also it would be good to be consistent. The root problem, I believe, is that the crash reporter doesn't have a unit test.
,
Nov 16
,
Nov 16
#5 correction: the crash reporter doesn't have a unit test for some signature types.
,
Nov 16
I've looked at the git history for __show_regs() in arch/x86/um/sysrq_64.c (and previously at arch/um/sys-x86_64/sysrq.c) and at no time RIP was NOT followed by a colon.
,
Nov 16
Re comment #5: "...does not have a unit test." That would be up to the code reviewer to ask for one, right? Maybe ask the reviewer if they would care to add one now? :D
,
Nov 16
BTW, is it possible there are multiple places which report "RIP"? I'm looking at tutorial and the same dmesg example has both with and without ":". http://d3s.mff.cuni.cz/teaching/crash_dump_analysis/slides/08-linux.pdf
,
Nov 16
You're right:
./arch/x86/kernel/dumpstack.c:305: printk(KERN_ALERT "RIP ");
/* Executive summary in case the oops scrolled away */
printk(KERN_ALERT "RIP ");
printk_address(regs->ip, 1);
printk(" RSP <%016lx>\n", regs->sp);
That's in __die(), after calling show_regs(), which prints "RIP:". So the version with the colon is probably more reliable.
,
Nov 16
Want to give some context on how this can impact the Chromeos team during weekly health checks. Health checks are usually done for beta/dev versions released in the previous week. For example, 72.0.3602.0 (11232.0.0) that was pushed on Monday will be checked next week and currently out of the 118 reports, almost all of them are showing the hash with a crash count of 1. https://screenshot.googleplex.com/QJxYH5KEfqe.png (many more below screenshot limit) This make it very hard to distinguish which signatures are more impactful and deserves a callout for engineers' time compared to other ones.
,
Nov 16
#13 agree. Making Pri-1. I was treating as such anyway.
,
Nov 16
Sadly the reason for the missing function name in the signature is not obvious. I wrote the missing unit test for x86_64 and the current code appears to WAI. (But I made a number of changes so I need to double-check.)
,
Nov 17
Ignore #15. I was confused, and the regular expression was really wrong. So I think that https://crrev.com/c/1340434 fixes this.
,
Nov 17
Some emacs-lisp to convert the syslog to a string constant.
;;; Converts text to a C string literal.
(defun text-to-c-string ()
(interactive)
(goto-char (point-min))
(while (search-forward "\\" nil t)
(replace-match "\\\\\\\\"))
(goto-char (point-min))
(while (search-forward "\"" nil t)
(replace-match "\\\\\""))
(goto-char (point-min))
(while (progn
(re-search-forward "$" nil t)
(replace-match "\\\\n")
(forward-char 1)
(not (eobp))))
(goto-char (point-min))
(while (progn
(insert "\"")
(= (forward-line) 0)))
(goto-char (point-min))
(while (not (eobp))
;; Process one line.
(beginning-of-line)
(let ((bol (point)))
(end-of-line)
(cond
((> (current-column) 73)
;; Break at space, if possible. Otherwise break where we are.
(move-to-column 73)
(re-search-backward "[ \t]" bol t)
(forward-char 1)
(insert "\"\n\""))
(t
(insert "\"")
(forward-char 1))))))
,
Nov 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465 commit cd616235bf53a6e28fa5bdde3c4ec5f0a7273465 Author: Luigi Semenzato <semenzato@chromium.org> Date: Sun Nov 18 15:11:08 2018 crash: kernel_collector: fix x86_64 and add test The regular expression for x86_64 had an extra space. I don't think this ever worked. Also added a unit test for that case. Some small cleanup: * Removed the print_diagnostics flag and corresponding printfs. This was essentially a debugging aid which took advantage of the fact that the output of crash_reporter is usually discarded. I replaced some of those printfs with LOG(WARNING) where it made sense. * Refactored ComputeKernelStackSignatureCommon for clarity. BUG= chromium:906142 TEST=unit test Change-Id: I598b6da0f95be9ee99239719561a03a1fd92ec51 Reviewed-on: https://chromium-review.googlesource.com/1340434 Commit-Ready: Luigi Semenzato <semenzato@chromium.org> Tested-by: Luigi Semenzato <semenzato@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector.h [modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/crash_reporter.cc [modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector_test.cc [modify] https://crrev.com/cd616235bf53a6e28fa5bdde3c4ec5f0a7273465/crash-reporter/kernel_collector.cc
,
Nov 18
I am hoping that builds following this one will show a much smaller number of unspecified signatures. If that's the case, we can mark this Verified.
,
Nov 19
semenzato@ Thank you so much for working on this!! |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by semenzato@chromium.org
, Nov 16