veyron_minnie: video_WebRtcPerf: Watchdog detected hard LOCKUP on cpu 3
Issue description
On 10224.0.0 the video_WebRtcPerf test failed because the DUT rebooted with a "Watchdog detected hard LOCKUP on cpu 3" kernel panic. The CPU seems to get stuck after some time. It happened only once, so it might be a flake or a DUT issue, but it is better to keep an eye on this. Errata strike again? (See b/35563293 for a similar issue from long ago.)
Logs: http://ubercautotest.corp.google.com/tko/retrieve_logs.cgi?job=//results/163272731-chromeos-test/chromeos4-row9-rack9-host7
Dashboard: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?milestone=65&daysBack=30&testName=video_WebRtcPerf
,
Dec 20 2017
From logs above:

[19248.944518] tpm_i2c_infineon 1-0020: command 0xba (size 18) returned code 0x0
[19267.898862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[19267.898889] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.14.0 #1
[19267.898924] [<c02fc98c>] (unwind_backtrace) from [<c02f8ec4>] (show_stack+0x20/0x24)
[19267.898942] [<c02f8ec4>] (show_stack) from [<c086a534>] (dump_stack+0x7c/0xc0)
[19267.898956] [<c086a534>] (dump_stack) from [<c0869db4>] (panic+0xa8/0x1fc)
[19267.898973] [<c0869db4>] (panic) from [<c0387474>] (watchdog_timer_fn+0x234/0x26c)
[19267.898992] [<c0387474>] (watchdog_timer_fn) from [<c0209af0>] (hrtimer_interrupt+0x2f8/0x618)
[19267.899013] [<c0209af0>] (hrtimer_interrupt) from [<c0707aac>] (arch_timer_handler_phys+0x38/0x48)
[19267.899031] [<c0707aac>] (arch_timer_handler_phys) from [<c0355878>] (handle_percpu_devid_irq+0xf4/0x19c)
[19267.899052] [<c0355878>] (handle_percpu_devid_irq) from [<c0351ff4>] (generic_handle_irq+0x30/0x40)
[19267.899067] [<c0351ff4>] (generic_handle_irq) from [<c03522ec>] (__handle_domain_irq+0x8c/0xb0)
[19267.899082] [<c03522ec>] (__handle_domain_irq) from [<c0200390>] (gic_handle_irq+0x48/0x6c)
[19267.899096] [<c0200390>] (gic_handle_irq) from [<c02f9ac0>] (__irq_svc+0x40/0x70)
[19267.899107] Exception stack(0xee127f20 to 0xee127f68)
[19267.899118] 7f20: ee127f78 00001186 27ac5a06 00001186 ee7c3288 00000000 ee126028 00000000
[19267.899131] 7f40: 0000004c c103e2d0 c105206c ee127fb4 00000008 ee127f68 c0221580 c02bb6c0
[19267.899142] 7f60: 000f0013 ffffffff
[19267.899154] [<c02f9ac0>] (__irq_svc) from [<c02bb6c0>] (cpuidle_idle_call+0x1b0/0x324)
[19267.899169] [<c02bb6c0>] (cpuidle_idle_call) from [<c0200410>] (arch_cpu_idle+0x18/0x48)
[19267.899183] [<c0200410>] (arch_cpu_idle) from [<c021bc10>] (cpu_startup_entry+0x1c4/0x23c)
[19267.899198] [<c021bc10>] (cpu_startup_entry) from [<c02fad80>] (secondary_start_kernel+0x14c/0x174)
[19267.899230] [<c02fad80>] (secondary_start_kernel) from [<002eec84>] (0x2eec84)
[19269.065909] SMP: failed to stop secondary CPUs
[19269.065930] CPU0 PC: <c0869d0c> panic+0x0/0x1fc
[19269.065947] CPU1 PC: <c02006d4> flush_tlb_page+0x68/0xe0
[19270.246419] SMP: failed to stop secondary CPUs

---

One specific thing that's missing from the logs is the PC for "CPU3". That's really weird. Ah, actually, possibly there's a bug in the printing code in "arch/arm/mach-rockchip/rockchip.c". Sigh. If two CPUs are stuck on the same instruction then it looks possible that we won't print both of them... So I'd guess CPU3 is stuck on either "panic" or "flush_tlb_page".

In any case, I'm not convinced that we're at an errata again. You can certainly get CPUs to look wedged without an errata being involved (I believe getting stuck with interrupts off can cause things like this).

Also note that previously the errata on A17 were characterized by one CPU running in userspace. AKA in the previous bug you pointed at:

<3>[ 3063.355907] CPU3 PC: <aeb874b0> 0xaeb874b0

It was userspace that was running a set of instructions that was tripping an errata. Obviously we could be hitting some different type of errata, and that could be an errata that's triggered by only kernel behavior, but if nothing else it doesn't quite match the previous several errata on this particular CPU model.

---

One thing that strikes me as extra odd is that "CPU0" is sitting at "panic". Where is that panic message printed? I don't think it's this one, since this panic was printed by CPU 2. Is it somehow possible that the system hung while trying to call panic()??

---

How reproducible is this? It looks like it's happened one time? I'm not sure we're going to be able to track down much more from just that. Tracking down the previous errata required several weeks of dedicated (near 100%) time, and then I _had_ a reproducible test case.

---

At this point I probably don't have cycles to own this bug, but I'm happy to support another owner if someone wants to step up.
Moving myself to CC and leaving this "Available". If you disagree with this then please shout.
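For anyone triaging a similar report, the "which CPU has no PC printed" check above can be sketched as a small log scan. This is only an illustration: the function name summarize_panic, the regular expressions, and the default of 4 CPUs are assumptions made for this sketch and are not part of any autotest or kernel tooling. It only reads the three kinds of lines quoted in this comment.

```python
import re

# Patterns for the three line types quoted in the panic log above.
PC_RE = re.compile(r"CPU(\d+) PC: <([0-9a-f]+)> (\S+)")
LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
REPORTER_RE = re.compile(r"CPU: (\d+) PID:")


def summarize_panic(log, num_cpus=4):
    """Summarize which CPUs are accounted for in a lockup panic log.

    num_cpus is an assumption supplied by the caller; the log itself
    does not say how many cores the board has.
    """
    # CPU number -> symbol+offset from the "CPUn PC:" lines.
    pcs = {int(cpu): sym for cpu, _addr, sym in PC_RE.findall(log)}

    lockup = LOCKUP_RE.search(log)
    reporter = REPORTER_RE.search(log)
    locked_cpu = int(lockup.group(1)) if lockup else None
    reporting_cpu = int(reporter.group(1)) if reporter else None

    # The CPU that printed the panic has a backtrace instead of a PC line.
    accounted = set(pcs)
    if reporting_cpu is not None:
        accounted.add(reporting_cpu)

    return {
        "locked_cpu": locked_cpu,
        "reporting_cpu": reporting_cpu,
        "pcs": pcs,
        "missing_pc": sorted(set(range(num_cpus)) - accounted),
    }


SAMPLE_LOG = """\
[19267.898862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[19267.898889] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.14.0 #1
[19269.065909] SMP: failed to stop secondary CPUs
[19269.065930] CPU0 PC: <c0869d0c> panic+0x0/0x1fc
[19269.065947] CPU1 PC: <c02006d4> flush_tlb_page+0x68/0xe0
[19270.246419] SMP: failed to stop secondary CPUs
"""

if __name__ == "__main__":
    print(summarize_panic(SAMPLE_LOG))
```

Run against the excerpt from this bug, the sketch flags CPU3 as the only core with neither a backtrace nor a PC line, which matches the observation above. It cannot tell whether that gap comes from the printing code or from the CPU itself.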
,
Dec 21 2017
It only happened once in the last 30 days.
,
Dec 21
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jan 6
Seems obsolete
Comment 1 by wuchengli@chromium.org, Dec 20 2017
Owner: diand...@chromium.org
Status: Available (was: Untriaged)