Host did not return from reboot
Issue description: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-paladin/builds/596. Probably a one-off; reporting for posterity.
,
Jun 2 2016
,
Jun 3 2016
Promoting to generic "Host did not return from reboot" tracker to collect occurrences
,
Jun 3 2016
,
Jun 7 2016
,
Jun 7 2016
There have been 86 autofiled issues with "Host did not return from reboot" recently. We need to investigate this and see if there is something we can do to improve the tests and/or infrastructure to reduce flakiness.
,
Jun 7 2016
This is generally not an infra bug. If we tell a DUT to reboot and it doesn't come back, that indicates a problem on the DUT.
,
Jun 7 2016
,
Jun 9 2016
,
Jun 9 2016
akeshet@ who from the product team do we want to add to this conversation?
,
Jun 10 2016
Who owns the system services? Usually they would be the go-to person. +adlr/sameer for guidance.
,
Jun 10 2016
Adlr & I are a good start for this. When the DUTs fail to reboot, are any logs collected when the systems are subsequently repaired? Getting those logs will be helpful to triage this further.
,
Jun 11 2016
Re Comment 8 - I might disagree with the statement. There are many reasons a device doesn't respond to an ssh command for login. It could be the infra, the USB dongle initialization, DHCP failures, ssh not starting properly, actual DUT failures, etc. In order to get to the bottom of this we should really build a side channel via servo connected to the DUT to collect information about the device - an EC event log would be a good start. Is the Infra team (usually Simran in the past) willing to help plumb up log collection via servo?
,
Jun 15 2016
The logs that are collected are full of shill crashes; no symbols though, as minidump_stackwalker was not found.
,
Jun 15 2016
We have also been seeing shill crashes elsewhere. We may need to figure out how to get symbols for shill crashes again. +ejcaruso@
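For reference, a minimal sketch of symbolizing one of those shill minidumps by hand once symbols are available (paths and file names below are hypothetical; the underlying breakpad tool is minidump_stackwalk, the helper the logs report missing under the name minidump_stackwalker):

# assumes breakpad's minidump_stackwalk is installed and a symbols directory
# laid out as <symbols>/shill/<module-id>/shill.sym; the .dmp name is a placeholder
minidump_stackwalk /var/spool/crash/shill.12345.dmp /path/to/breakpad-symbols > shill_stack.txt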
,
Jun 15 2016
Assigning this to the current gardener to track.
,
Jun 17 2016
Green again.
,
Jun 17 2016
Reopening to track other occurrences of "Host did not return from reboot". We have been seeing them for a long time, so I'm not convinced that this is fully resolved just from a few days of absence.
,
Jun 17 2016
,
Jun 18 2016
Assigning to next week's gardener.
,
Jun 18 2016
,
Jun 20 2016
hshi@ - Can you confirm that you are looking into this this week? Maybe we can look for recent failures and then connect to the DUT to look at any logs?
,
Jun 20 2016
,
Jun 20 2016
,
Jun 20 2016
The PFQ hasn't been green for days (https://uberchromegw.corp.google.com/i/chromeos/builders/master-chromium-pfq); however, I don't see any recent occurrences of "Host did not return from reboot".
,
Jun 21 2016
,
Jun 23 2016
I looked at all the logs for the recent 'Host did not return from reboot' failure Ketaki pointed us at this morning: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/333/steps/HWTest%20%5Barc-bvt-cq%5D/logs/stdio

provision_AutoUpdate starts at '2016-06-23 01:56:10', it reboots at '2016-06-23 01:56:51', and reboot.verify times out at '2016-06-23 02:06:55'. The next test passed without issue: provision_AutoUpdate starts at '2016-06-23 02:08:30' and succeeds at '2016-06-23 02:09:11'.

reboot.verify failed because the ssh connection never went through for 8 min:

06/23 01:58:52.858 DEBUG| abstract_ssh:0625| Host chromeos4-row6-rack9-host21 is now unreachable over ssh, is down
06/23 01:58:52.859 DEBUG| ssh_host:0180| Running (ssh) 'true'
06/23 02:06:53.244 WARNI| base_utils:0910| run process timeout (480) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_ufGOFwssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host21 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
06/23 02:06:55.260 DEBUG| abstract_ssh:0550| Host chromeos4-row6-rack9-host21 is still down after waiting 483 seconds

The console-ramoops available in the following test: https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/67533590-chromeos-test/chromeos4-row6-rack9-host19/sysinfo/console-ramoops suggests that the kernel actually rebooted properly and the USB ethernet interface went up:

[ 9.972516] smsc75xx 1-3:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC1E1

then waited for 8 min, mostly idle, before being shut down gracefully (by what?), as shown by the firmware eventlog:

295 | 2016-06-23 01:36:42 | Kernel Event | Clean Shutdown
296 | 2016-06-23 01:36:44 | System boot | 3264
297 | 2016-06-23 01:36:44 | System Reset
298 | 2016-06-23 01:57:14 | Kernel Event | Clean Shutdown
299 | 2016-06-23 01:57:16 | System boot | 3265
300 | 2016-06-23 01:57:16 | System Reset
301 | 2016-06-23 02:05:07 | Kernel Event | Clean Shutdown
302 | 2016-06-23 02:05:09 | System boot | 3266
303 | 2016-06-23 02:05:09 | System Reset

So either the failure was at a higher network level (no DHCP lease, shill flake, DUT ssh daemon issue), or the host didn't try the connection hard enough and a reboot 'fixed' it (but was that reboot even necessary, and was the graceful reboot issued over ssh? That would confirm that the issue is the latter).
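For reference, the firmware eventlog snippets quoted above come from the DUT itself; a minimal sketch of pulling the tail of that log from a live lab host (assumes a standard test image, which ships mosys):

# run from any machine with ssh access to the lab DUT
ssh root@chromeos4-row6-rack9-host21 mosys eventlog list | tail -n 20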
,
Jun 23 2016
The analysis in #28 was actually wrong; the second test was not on the DUT I thought it was.
,
Jun 23 2016
Got the logs from the right machine this time:

<<< firmware eventlog before the test >>>
274 | 2016-06-23 01:09:15 | System boot | 3031
275 | 2016-06-23 01:09:15 | System Reset
276 | 2016-06-23 01:17:40 | Kernel Event | Clean Shutdown
277 | 2016-06-23 01:17:42 | System boot | 3032
278 | 2016-06-23 01:17:42 | System Reset

<<< firmware eventlog after the test (recorded by another test on the DUT) >>>
274 | 2016-06-23 01:09:15 | System boot | 3031
275 | 2016-06-23 01:09:15 | System Reset
276 | 2016-06-23 01:17:40 | Kernel Event | Clean Shutdown
277 | 2016-06-23 01:17:42 | System boot | 3032
278 | 2016-06-23 01:17:42 | System Reset
279 | 2016-06-23 02:45:19 | System boot | 3033
280 | 2016-06-23 02:45:19 | SUS Power Fail
281 | 2016-06-23 02:45:19 | System Reset
282 | 2016-06-23 02:45:19 | Wake Source | Power Button | 0
283 | 2016-06-23 02:45:56 | Kernel Event | Clean Shutdown
284 | 2016-06-23 02:45:58 | System boot | 3034
285 | 2016-06-23 02:45:58 | System Reset
286 | 2016-06-23 02:47:26 | Kernel Event | Clean Shutdown
287 | 2016-06-23 02:47:28 | System boot | 3035
288 | 2016-06-23 02:47:28 | System Reset
[...]

So the machine didn't even complete a clean shutdown when asked to reboot in provision_AutoUpdate.
,
Jun 23 2016
The matching host log from cautotest:

Provision 67539165 chromeos-test cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8490.0.0-b281/arc-bvt-cq/cheets_KeyboardTest 2016-06-23 03:03:31 Completed
Job 67533586 chromeos-test cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest Failed
Repair 67533586 chromeos-test cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest 2016-06-23 02:32:06 Completed
Provision 67533586 chromeos-test cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest 2016-06-23 01:55:49 Failed
Job 67532923 chromeos-test cyan-cheets-android-pfq/R53-8490.0.0-rc2/arc-bvt-cq/video_VideoSanity.vp8 2016-06-23 01:52:45 Completed

The repair log is here: https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/hosts/chromeos4-row6-rack9-host21/56791460-repair/debug/autoserv.DEBUG

Interesting facts -> the forced power cycle we are seeing in the firmware eventlog at 279-282:

06/23 02:45:17.129 ERROR| repair:0397| Repair failed: Power cycle the host with RPM
[...]
06/23 02:45:17.132 INFO | repair:0140| Repair action triggered: Reset the DUT via servo
[...]
06/23 02:45:40.077 DEBUG| abstract_ssh:0542| Host chromeos4-row6-rack9-host21 is now up

So a simple cold reset 'fixed' the device. We would need to save the logs in this case just after the reset to be able to investigate further: firmware eventlog, messages, net.log, ...
,
Jun 24 2016
I have uploaded the following CL to try to get logs in the case mentioned in #31, so we can debug it further: https://chromium-review.googlesource.com/#/c/355739/
,
Jul 8 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8e13b68c1bb5aafd6f2853079d38b14a9f384537

commit 8e13b68c1bb5aafd6f2853079d38b14a9f384537
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Fri Jun 24 00:38:06 2016

    autotest: record logs after forcing reset when repairing

    If the DUT was no longer accessible through SSH, ensure we are
    recording the logs after re-gaining SSH access (e.g. after trigger a
    servo reset) before clobbering them.

    BUG=chromium:615662
    TEST=run a local autoserv instance: sudo ./server/autoserv -R ...

    Change-Id: I4aa549145ad724368290ea7566ca41577d92194b
    Reviewed-on: https://chromium-review.googlesource.com/355739
    Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
    Tested-by: Vincent Palatin <vpalatin@chromium.org>
    Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/8e13b68c1bb5aafd6f2853079d38b14a9f384537/server/hosts/cros_repair.py
,
Jul 8 2016
Re-assigning to me since I'm investigating it. Now that the new logging is in place, if you see a new occurrence of the 'Host did not return from reboot' error, please link it here.
,
Jul 9 2016
Issue 626713 had an interesting reboot failure after AU on Butterfly, but those machines have no servo :(
,
Jul 10 2016
Why do any machines in the lab not have servo? Aviv?
,
Jul 10 2016
There are a few devices that never had full support of servo (had issues that ultimately we couldn't overcome). Butterfly was one of them. See https://bugs.chromium.org/p/chromium/issues/detail?id=383097. I believe this is true for Parrot as well. +Richard knows the full list.
,
Jul 11 2016
Currently, the list of devices not supporting servo is: butterfly, parrot, parrot_ivb, stout, x86-mario, x86-zgb.
,
Jul 12 2016
Happened again on cyan-cheets-paladin: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-paladin/builds/1065
,
Jul 12 2016
The DUT was chromeos4-row6-rack9-host11. It recovered after a servo cold reset in repair. The repair logs are here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row6-rack9-host11/57292226-repair/

The repair started at 2016-07-12 09:39:18 PST; the hard reset happened at 09:53:

START ---- repair.reset timestamp=1468342356 localtime=Jul 12 09:52:36
GOOD ---- verify.ssh timestamp=1468342389 localtime=Jul 12 09:53:09
END GOOD ---- repair.reset timestamp=1468342389 localtime=Jul 12 09:53:09
GOOD ---- verify.cros timestamp=1468342392 localtime=Jul 12 09:53:12

The original test logs for the timestamps are here: https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/69352354-chromeos-test/chromeos4-row6-rack9-host11/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG
,
Jul 12 2016
The firmware event log looks like this:

247 | 2016-07-12 06:44:14 | Kernel Event | Clean Shutdown
248 | 2016-07-12 06:44:16 | System boot | 3390
249 | 2016-07-12 06:44:16 | System Reset
250 | 2016-07-12 06:53:33 | Kernel Event | Clean Shutdown
251 | 2016-07-12 06:53:35 | System boot | 3391
252 | 2016-07-12 06:53:35 | System Reset
253 | 2016-07-12 09:52:38 | System boot | 3392
254 | 2016-07-12 09:52:38 | SUS Power Fail
255 | 2016-07-12 09:52:38 | System Reset
256 | 2016-07-12 09:52:38 | Wake Source | Power Button | 0

There is no trace of the shutdown/reboot initiated by the test at 09:00:

07/12 09:00:02.520 DEBUG| ssh_host:0180| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!'
07/12 09:00:02.825 DEBUG| base_utils:0297| [stdout] 12365
07/12 09:00:02.827 DEBUG| abstract_ssh:0604| Host chromeos4-row6-rack9-host11 pre-shutdown boot_id is 66464aba-279d-44da-b232-30bd85347efe
07/12 09:00:02.828 DEBUG| ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
07/12 09:00:03.131 DEBUG| base_utils:0278| [stdout] 66464aba-279d-44da-b232-30bd85347efe
07/12 09:00:04.138 DEBUG| ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
07/12 09:02:02.714 WARNI| base_utils:0910| run process timeout (118) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_t2jdl0ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host11 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
07/12 09:02:03.724 DEBUG| abstract_ssh:0625| Host chromeos4-row6-rack9-host11 is now unreachable over ssh, is down
07/12 09:02:03.726 DEBUG| ssh_host:0180| Running (ssh) 'true'
07/12 09:10:04.066 WARNI| base_utils:0910| run process timeout (480) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_t2jdl0ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host11 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
07/12 09:10:06.084 DEBUG| abstract_ssh:0550| Host chromeos4-row6-rack9-host11 is still down after waiting 483 seconds
07/12 09:10:06.086 INFO | server_job:0129| ABORT ---- reboot.verify timestamp=1468339806 localtime=Jul 12 09:10:06 Host did not return from reboot

So it still seems that Cyan is failing on the shutdown/reboot path, not on the way up at boot.
,
Jul 12 2016
Would it be possible to send the alt-volup-x key combo via servo before servo initiates the cold reset? That may allow us to generate and gather the more interesting kernel logs that we will likely want here.
,
Jul 13 2016
There is a slightly-corrupted console-ramoops collected (hopefully it's the right one; I'm not 100% sure). I'm attaching it; nothing outstanding in it though. I will evaluate adding the Alt+VolUp+X in the repair phase.
,
Jul 13 2016
Assuming that the console-ramoops is complete, the kernel never got to the point of asking the firmware for a reboot. Do we need to invent an equivalent of pm_print_times that logs more details about the shutdown/restart sequence so that we know where in the kernel things stalled?
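One existing knob that gets close, as a hedged sketch: the initcall_debug kernel parameter also makes device_shutdown() log each device as its shutdown callback runs (on reasonably recent kernels), which would at least narrow down where the reboot stalls. The make_dev_ssd.sh flags below are quoted from memory, so double-check them on the DUT:

# run on the DUT as root (dev-mode test image); flag names from memory
/usr/share/vboot/bin/make_dev_ssd.sh --save_config /tmp/cmdline
# append " initcall_debug" to the dumped kernel command line file(s), then:
/usr/share/vboot/bin/make_dev_ssd.sh --set_config /tmp/cmdline
reboot
# after the next stalled reboot + servo reset, the per-device shutdown messages
# should show up in the console ramoops (/dev/pstore/console-ramoops)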
,
Jul 13 2016
Shouldn't the watchdog reboot us? Given that this issue seems to be cyan-specific, is there a watchdog there? Has somebody (e.g. Intel) really run some intensive reboot testing before shipping this platform?
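A quick hedged check for whether the platform exposes a hardware watchdog at all (run as root on a Cyan DUT; purely illustrative):

ls -l /dev/watchdog* 2>/dev/null        # watchdog character device(s), if any
dmesg | grep -i -e watchdog -e iTCO     # iTCO_wdt is the usual driver for the Intel PCH watchdog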
,
Jul 20 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/3acbbe57aff59159db96a80bbb9e475b1ce454a4

commit 3acbbe57aff59159db96a80bbb9e475b1ce454a4
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Tue Jul 19 15:40:12 2016

    servod: add sysrq-x key combination

    Allow to trigger the kernel System Request X (sysrq-x) to force a
    kernel panic on keyboard drivers supporting it.

    BUG=chromium:615662
    TEST=manual, run "dut-control kb_press:sysrq_x kb_press:sysrq_x kb_press:sysrq_x"
    and see the main processor rebooting, then find
    'Kernel panic - not syncing: ChromeOS X Key' in /dev/pstore/console-ramoops.

    Change-Id: Ie83eb2306645595f927bfcfe23d35031c0df26c7
    Reviewed-on: https://chromium-review.googlesource.com/361760
    Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
    Tested-by: Vincent Palatin <vpalatin@chromium.org>
    Reviewed-by: Todd Broch <tbroch@chromium.org>

[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/keyboard_handlers.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/servo_server.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/drv/kb.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/data/keyboard.xml
,
Jul 20 2016
Sorry, I had missed c#45. wmatrix indicates various platform_KernelErrorPaths tests passing for Cyan: https://wmatrix.googleplex.com/platform/kernel_all?platforms=cyan

I think it depends on why and when the reboot processing got stuck. The watchdogs won't trigger:
- on a somewhat benign "while(1) { do_something; msleep(1000); }" loop in some driver, or
- very late in the reboot sequence, where the watchdogs have already been disabled.
,
Jul 20 2016
Adding Shyam from Intel. I'd like to see Intel run some reboot tests on Cyan and debug in case they hit issues.
,
Jul 20 2016
@comment 48: I think you have cc'd the wrong Shyam.
,
Jul 20 2016
The right Shyam is here. Acking Puneet's ask.
,
Jul 20 2016
,
Jul 20 2016
,
Jul 20 2016
,
Jul 21 2016
We are already looking into this request. We will update the status/results later today.
,
Jul 22 2016
Thanks Margarita! Do let Google know if we need more data on this.
,
Jul 22 2016
Margarita - are you just trying to answer the question in comment 45? Or are you trying to come up with a new experiment on stress testing a reboot cycle on the Cyan platform?
,
Jul 22 2016
margarita@ I'll take this one. We'll be using the bootperf tool to stress test two cyans, one with a cyan-cheets image and the other with a regular cyan image. This should tell us whether the problem occurs on Cyan builds that have ARC++ or not, but do you already know whether this is a cheets-specific issue? We'll get back to you regarding the watchdog as we find out more information.

Regarding the question of testing: yes, the ChromeOS teams at Intel conducted numerous reboot and suspend/resume stress tests before BSW and cyan were released. I'll provide more information as we get further into the debugging process.
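Until then, a crude hedged sketch of a server-side reboot stress loop, mirroring the boot_id check in the autotest logs quoted earlier in this bug (this is not the bootperf tool; $DUT and the iteration count are placeholders, and root ssh access to a test image is assumed):

for i in $(seq 1 200); do
  old=$(ssh root@$DUT cat /proc/sys/kernel/random/boot_id)
  ssh root@$DUT reboot
  sleep 90   # generous margin for shutdown + firmware + boot
  new=$(ssh -o ConnectTimeout=10 root@$DUT cat /proc/sys/kernel/random/boot_id 2>/dev/null)
  if [ -z "$new" ] || [ "$new" = "$old" ]; then
    echo "DUT did not return from reboot on iteration $i"
    break
  fi
done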
,
Jul 25 2016
vpalatin@ what is the Coreboot version on the DUTs you're reproducing this on?
,
Jul 25 2016
> vpalatin@ what is the Coreboot version on the DUTs you're reproducing this on?

Those are lab devices, so there is not a single answer. The last one was: fwid = Google_Cyan.7287.57.60
,
Jul 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e

commit e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Thu Jul 14 09:31:58 2016

    autotest: try to repair DUT with sysrq-x first

    When the DUT was no longer accessible through SSH, try first to repair
    it by pressing 3 times sysrq-x before trying the servo cold reset.
    This way we can try to force the kernel to trigger a panic and get the
    last kernel logs in the console ramoops after reboot.

    BUG=chromium:615662
    TEST=run a local autoserv instance: sudo ./server/autoserv -R -m ...
    after stopping 'shill' on the DUT to simulate a networking issue.
    CQ-DEPEND=CL:361760

    Change-Id: I31c4bf567655d4207c7aa9e490b1fcbcaa9ef8d2
    Reviewed-on: https://chromium-review.googlesource.com/361691
    Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
    Tested-by: Vincent Palatin <vpalatin@chromium.org>
    Reviewed-by: Sameer Nanda <snanda@chromium.org>
    Reviewed-by: Todd Broch <tbroch@chromium.org>

[modify] https://crrev.com/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e/server/cros/servo/servo.py
[modify] https://crrev.com/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e/server/hosts/cros_repair.py
,
Jul 27 2016
Saw a "Host did not return from reboot" on a couple of canaries today:
https://uberchromegw.corp.google.com/i/chromeos/builders/elm-cheets-release/builds/226
https://uberchromegw.corp.google.com/i/chromeos/builders/oak-release/builds/268
,
Jul 28 2016
For the elm-cheets-release build, the machine was salvaged by a servo cold-reset during the repair (this repair doesn't have a sysrq-x yet); the leftover in the ramoops tends to show that the SoC reset was called (see below).

The test timeline was the following regarding reboots:
1st reboot succeeded: reboot.verify START Jul 27 13:06:26, GOOD Jul 27 13:07:02
2nd reboot failed: reboot.verify START Jul 27 13:13:01, ABORT Jul 27 13:23:06

The firmware eventlog actually recorded a 'boot' for both of them but is suggesting that the 2nd one ended up in recovery:

278 | 2016-07-27 13:06:29 | System boot | 0
279 | 2016-07-27 13:13:10 | System boot | 0
280 | 2016-07-27 13:13:10 | Chrome OS Recovery Mode | Unknown
281 | 2016-07-27 13:13:12 | System boot | 0
282 | 2016-07-27 13:13:12 | Chrome OS Recovery Mode | Unknown
283 | 2016-07-27 15:21:33 | System boot | 0

The repair job just did the servo cold reset at 'Jul 27 15:21:32' and everything looks in a proper state afterwards (as in https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/hosts/chromeos2-row5-rack7-host1/57897614-repair/status.log). So the failure was spuriously going to Recovery mode; not sure if there is any hint why.

ramoops leftovers:
[ 314.222106] EHT4-fs (imcblk0p%): mounting ext2 file systEm using the ext4 subays4em
[ 314.225882] EPT4-fs (mmcbhk0p5): mounted filesystem without journal. Opts:
[ 392.255693] smsc95xx\001-2.1.1.1:1.0 eth1: hardware isf't capable of remote wakeup
[ 392.509223] init: console-ttiS0 main process (1301) killed by TERM signal
[ 392.509678] init: daisydog main process (1338) killed bx TERM signah
[ 392.510699] ifit: powerd main `rocess (1 09) kIh,ed by TERM signal
[ 392.411307] init: chapsd main process\00(1878) killed by TERM signal
[ 392.111773] init: permission_brokdr m`in prObess (2322) killed\00By TERM signal
[ 382.513137] init: crash-sender main process (4228) killed by TERM signal
[ 392.513826] init: crks-machine-id-regen-periodic main process (4234) iilled by TERM signal
[ !92.514443] init: mtpd main process (4241) kilded by TERM signal
[ 392.515069] inip: log-rgtate main process (450 kildeD\00bq TERM signal
[ 392.51565 ] init: cras main process (4201) killed by TERM sigjAl
[ 392.517702] ijit: bluetoothd mail process (4322) killed by TERM signad
[\00 292.522322] ini4: w!r.-collector mai prOcess (5890) killed by TERM signal
[ 392.523105] init: tlsdate` main prmc%ss (5927) killed by TERM signal
[ 392.523643] init: netfilter-queue\00main process (5966) +illed by TERM siGnal
[ 392.537590] EXT4-fs (dm-0) re-mounted. Opts:
[ 392.554399] init: cryptohomed main process (1979! killed by TERL signal
[ 392.692(79] init: rEcover_dutS\00eain `rocacs (5884) killed by TERM signal
[ 392.773025] Unsafe core_pattern used wath suhd_dulpable=2. Pipe handler gr fullx qUaliFied cOra `ump `ath reqei2ed.
[ 393.828659] reboot: R%starting system
,
Jul 28 2016
For the oak-release job, after the test failed, the next 'Provision' succeeded without a 'Repair'. After extracting the firmware eventlog from one of the subsequent test jobs:

240 | 2016-07-27 12:45:26 | System boot | 0
241 | 2016-07-27 12:52:15 | System boot | 0   <== failed reboot in the test
242 | 2016-07-27 12:52:15 | Chrome OS Recovery Mode | Unknown
243 | 2016-07-27 12:52:17 | System boot | 0
244 | 2016-07-27 12:52:17 | Chrome OS Recovery Mode | Unknown
245 | 2009-12-31 16:00:00 | System boot | 0
246 | 2009-12-31 16:00:00 | System boot | 0
247 | 2009-12-31 16:00:00 | System boot | 0
248 | 2009-12-31 16:00:00 | System boot | 0
249 | 2009-12-31 16:00:00 | System boot | 0
250 | 2009-12-31 16:00:00 | System boot | 0
251 | 2009-12-31 16:00:09 | System boot | 0
252 | 2009-12-31 16:00:00 | System boot | 0
253 | 2009-12-31 16:00:00 | System boot | 0
254 | 2009-12-31 16:00:00 | System boot | 0
255 | 2009-12-31 16:00:00 | System boot | 0
256 | 2009-12-31 16:00:00 | System boot | 0
257 | 2009-12-31 16:00:00 | System boot | 0
258 | 2009-12-31 16:00:00 | System boot | 0
259 | 2009-12-31 16:00:00 | System boot | 0
260 | 2009-12-31 16:00:00 | System boot | 0
261 | 2009-12-31 16:00:00 | System boot | 0
262 | 2009-12-31 16:00:00 | System boot | 0
263 | 2009-12-31 16:00:00 | System boot | 0
264 | 2009-12-31 16:00:00 | System boot | 0
265 | 2009-12-31 16:00:00 | System boot | 0
266 | 2009-12-31 16:43:45 | System boot | 0
267 | 2016-07-27 20:44:45 | System boot | 0
268 | 2016-07-27 20:51:20 | System boot | 0

Same issue as the elm: the machine went to recovery.
,
Jul 28 2016
"Unknown", hmmm... unfortunately, mosys' recovery reason table is heavily out of date (opened issue 632413). Can you please run this on the affected DUT and attach elog.hex so I can fish out the reason manually and get an idea what happened here?

flashrom -r -i RW_ELOG:/tmp/elog.bin
xxd /tmp/elog.bin > /tmp/elog.hex
,
Jul 28 2016
The logs extracted from the lab machines as requested by Julius:

* from elm chromeos2-row5-rack7-host1:
  - https://storage.googleapis.com/chromeos-vpa/chromeos2-row5-rack7-host1.cros/mosys_eventlog.txt
  - https://storage.googleapis.com/chromeos-vpa/chromeos2-row5-rack7-host1.cros/elog.bin
  - https://storage.googleapis.com/chromeos-vpa/chromeos2-row5-rack7-host1.cros/elog.bin
* from oak chromeos4-row11-rack8-host11:
  - https://storage.googleapis.com/chromeos-vpa/chromeos4-row11-rack8-host11.cros/mosys_eventlog.txt
  - https://storage.googleapis.com/chromeos-vpa/chromeos4-row11-rack8-host11.cros/elog.bin
  - https://storage.googleapis.com/chromeos-vpa/chromeos4-row11-rack8-host11.cros/elog.hex
,
Jul 28 2016
Thanks Vincent. Our mystery recovery reason is 0x2d ("Error trying to clear TPM owner") in all cases. Since these devices shouldn't be transitioning into dev mode, the only way this can happen is if something set the clear_tpm_owner_request NVRAM flag (with crossystem). Some part of the three TPM commands involved in that (force clear, enable, activate) went wrong.
Unfortunately, that's all I can tell you right now... to debug it further, we'd probably have to catch it in the act (UART output should at least tell us which of the commands failed, and with what error code). A good first step would be looking up which test exactly it ran during/before that error (maybe some sort of TPM test? That clear_tpm flag must be set by something...), running it in isolation and seeing if that repros more quickly.
,
Jul 29 2016
The `crossystem clear_tpm_owner_request=1` command runs every time
we do a powerwash. That happens in a couple of cases:
* When running the powerwash test.
* One possible repair strategy is to run powerwash and then
reinstall via AU.
Additionally, we recently changed provisioning code to invoke that
command immediately before rebooting after AU.
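For reference, a quick way to inspect the related state on a live DUT (a sketch; both are standard crossystem properties, though recovery_reason reflects the current boot, so it reads 0 after a normal boot):

# run on the DUT as root
crossystem clear_tpm_owner_request   # 1 if an owner clear is still pending for the next boot
crossystem recovery_reason           # recovery reason of the current boot (0x2d == 45 in the cases above)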
,
Jul 29 2016
Oak/Elm specific investigation forked to http://crosbug.com/p/55764.
,
Aug 11 2016
This issue seems to be happening pretty regularly with the cyan Android PFQ builds:
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-android-pfq/builds/57
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-android-pfq/builds/58
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-android-pfq/builds/60
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-android-pfq/builds/71 (this one was also suffering from other issues)
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-android-pfq/builds/72
,
Aug 12 2016
@davidriley Yes, all the Cyan derivatives have a real unsolved issue: they sometimes get stuck on the shutdown path, as mentioned in comments #41 to #59.
,
Aug 17 2016
Vincent, Intel will have an update on this today. Thanks, Shyam
,
Aug 19 2016
We were able to reproduce locally and have an engineer working on it. We will provide an update by next week.
,
Aug 19 2016
Google, Can you please add matthew.s.atwood@intel.com and Aditya Swarup (aditya.swarup@intel.com) to this bug for faster updates? Intel is currently unable to add engineers to this bug. Thanks a lot, Shyam
,
Aug 19 2016
+Matt & Aditya
,
Aug 19 2016
,
Aug 19 2016
As this is the 'generic' bug for all the reboot failures, I have opened issue 639301 for the investigation of the Cyan shutdown failure: https://bugs.chromium.org/p/chromium/issues/detail?id=639301 Please put your progress there.
,
Aug 19 2016
Locally this problem is much easier to reproduce with a cyan-cheets build than with a cyan build. I think the first step is to see what changes with the kernel USE flags/cmdline/config, which will hopefully point us in the right direction. We are also in the process of collecting our own failure logs for both cases to ensure this is the same issue. Expect an update next week.
,
Aug 22 2016
As per comment #76, please put the Cyan investigations in issue 639301.
,
Aug 30 2016
,
Sep 19 2017
I made comments in https://bugs.chromium.org/p/chromium/issues/detail?id=639301#c75 that cyan was worse than other boards. wizpig (45% suite success rate) is also worse than other boards running the same bvt-inline suite (81% suite success rate). Provision rates for that same suite are at 93% success for wizpig, vs 99% across the 14 boards running that suite. wizpig's numbers do appear to be partially caused by two bad DUTs:

chromeos2-row8-rack8-host17: 100.0% 56/56
chromeos2-row8-rack8-host18: 98.0% 56/57
chromeos2-row8-rack8-host21: 65.0% 25/38
chromeos6-row2-rack19-host1: 100.0% 56/56
chromeos6-row2-rack19-host3: 100.0% 56/56
chromeos6-row2-rack20-host20: 96.0% 54/56
chromeos6-row2-rack20-host22: 100.0% 56/56
chromeos6-row2-rack20-host4: 100.0% 56/56
chromeos6-row2-rack20-host7: 0.0% 0/12

edgar seems to perform a lot better running bvt-cq: it didn't show provision failures and has the same success rate as the average across all boards.
Comment 1 by akes...@chromium.org, Jun 2 2016
Summary: cyan-cheets provision: Host did not return from reboot (was: provision: Host did not return from reboot)