
Issue 615662

Starred by 6 users

Issue metadata

Status: Assigned
Owner:
Last visit > 30 days ago
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 546424
issue 589027
issue 590491
issue 598529
issue 639301

Blocking:
issue 636065




Host did not return from reboot

Project Member Reported by kevcheng@chromium.org, May 28 2016

Issue description

Blockedon: 598529
Summary: cran-cheets provision: Host did not return from reboot (was: provision: Host did not return from reboot)
Status: Unconfirmed (was: Untriaged)
Summary: cyan-cheets provision: Host did not return from reboot (was: cran-cheets provision: Host did not return from reboot)
Summary: Host did not return from reboot (was: cyan-cheets provision: Host did not return from reboot)
Promoting to generic "Host did not return from reboot" tracker to collect occurrences
Blockedon: 590491
Blockedon: 546424
Cc: akes...@chromium.org steve...@chromium.org jen...@chromium.org achuith@chromium.org sosa@chromium.org
Labels: -Pri-3 Hotlist-CrOS-Gardener Build-PFQ-Failures OS-Chrome Pri-1
Status: Available (was: Unconfirmed)
There have been 86 autofiled issues with "Host did not return from reboot" recently.

We need to investigate this and see if there is something we can do to improve the tests and/or infrastructure to reduce flakiness.

Comment 8 by sosa@chromium.org, Jun 7 2016

Components: -Infra>Client>ChromeOS
This is generally not an infra bug. If we tell a DUT to reboot and it doesn't come back, that indicates a problem on the DUT. 

Comment 9 by sosa@chromium.org, Jun 7 2016

Components: OS>Systems
Blockedon: 589027
akeshet@ who from the product team do we want to add to this conversation?

Comment 12 by sosa@chromium.org, Jun 10 2016

Cc: adlr@chromium.org snanda@chromium.org
Who owns the system services? Usually they would be the go-to person.

+adlr/sameer for guidance
Cc: puneetster@chromium.org
Adlr & I are a good start for this.

When the DUTs fail to reboot, are any logs collected when the systems are subsequently repaired? Getting those logs will be helpful to triage this further.
Re comment 8 - I might disagree with that statement. There are many reasons a device doesn't respond to an ssh command for login: it could be the infra, USB dongle initialization, DHCP failures, ssh not starting properly, actual DUT failures, etc. To get to the bottom of this we should really build a side channel via servo connected to the DUT to collect information about the device - an EC event log would be a good start. Is the Infra team (usually Simran in the past) willing to help plumb up log collection via servo?
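For reference, a rough sketch of the kind of servo side channel being suggested, assuming a servod instance is already running for this DUT on the default port and the board exposes an EC UART (control and port names are illustrative and vary per setup):

# Ask servod which pty is wired to the EC console.
EC_PTY=$(dut-control -p 9999 ec_uart_pty | cut -d: -f2)
# Attach to that console and keep a copy of everything the EC prints,
# so the EC-side view of a failed reboot is preserved for triage.
socat -u "${EC_PTY}",raw,echo=0 CREATE:/tmp/ec_console.log &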

Comment 15 by ihf@chromium.org, Jun 15 2016

The logs that are collected are full of shill crashes; no symbols though, as minidump_stackwalker was not found.
66742721-chromeos-test%2Fchromeos4-row3-rack4-host1%2Fdebug%2Fautoserv.txt
382 KB View Download
Cc: ejcaruso@chromium.org
We have also been seeing shill crashes elsewhere. We may need to figure out how to get symbols for shill crashes again.

+ejcaruso@
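For reference, once matching Breakpad symbols for shill are available, the crash dumps mentioned above can be symbolized roughly like this (paths are illustrative):

# Walk the stack of a shill minidump against a local symbol directory.
# /tmp/symbols must contain .sym files generated from the exact shill build.
minidump_stackwalk /tmp/crash/shill.dmp /tmp/symbols > /tmp/shill_stack.txt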

Owner: ihf@chromium.org
Assigning this to current gardener to track.

Comment 18 by ihf@chromium.org, Jun 17 2016

Status: Fixed (was: Available)
Green again.
Status: Available (was: Fixed)
Reopening to track other occurrences of "Host did not return from reboot". We have been seeing them for a long time, so I'm not convinced that this is fully resolved just from a few days of absence.
Cc: ihf@chromium.org
Owner: ----

Comment 21 by ihf@chromium.org, Jun 18 2016

Owner: h...@chromium.org
Assigning to next week's gardener.
Labels: dut-health
hshi@ - Can you confirm that you are looking into this this week? Maybe we can look for recent failures and then connect to the DUT to look at any logs?

Status: Sss (was: Available)
Status: Assigned (was: Sss)

Comment 26 by h...@chromium.org, Jun 20 2016

The PFQ hasn't been green for days (https://uberchromegw.corp.google.com/i/chromeos/builders/master-chromium-pfq), but I don't see any recent occurrences of "Host did not return from reboot".
Labels: -dut-health Hotlist-CrOS-DutHealth
Cc: keta...@chromium.org
I looked at all the logs for the recent 'Host did not return from reboot' failure that Ketaki pointed us at this morning:
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/333/steps/HWTest%20%5Barc-bvt-cq%5D/logs/stdio

provision_AutoUpdate starts at '2016-06-23 01:56:10'
                 it reboots at '2016-06-23 01:56:51'
the reboot.verify times out at '2016-06-23 02:06:55'

The next test passed with issue ...
provision_AutoUpdate starts at '2016-06-23 02:08:30'
                   succeeds at '2016-06-23 02:09:11'

reboot.verify failed because the ssh connection never went through for 8 min:
06/23 01:58:52.858 DEBUG|      abstract_ssh:0625| Host chromeos4-row6-rack9-host21 is now unreachable over ssh, is down
06/23 01:58:52.859 DEBUG|          ssh_host:0180| Running (ssh) 'true'
06/23 02:06:53.244 WARNI|        base_utils:0910| run process timeout (480) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_ufGOFwssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host21 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
06/23 02:06:55.260 DEBUG|      abstract_ssh:0550| Host chromeos4-row6-rack9-host21 is still down after waiting 483 seconds

The console-ramoops available is from the following test:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/67533590-chromeos-test/chromeos4-row6-rack9-host19/sysinfo/console-ramoops
It suggests that the kernel actually rebooted properly and that the USB ethernet interface came up:
...
[    9.972516] smsc75xx 1-3:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC1E1
...
then waited for ~8 min, mostly idle, before being shut down gracefully (by what?), as shown by the firmware eventlog:
295 | 2016-06-23 01:36:42 | Kernel Event | Clean Shutdown
296 | 2016-06-23 01:36:44 | System boot | 3264
297 | 2016-06-23 01:36:44 | System Reset
298 | 2016-06-23 01:57:14 | Kernel Event | Clean Shutdown
299 | 2016-06-23 01:57:16 | System boot | 3265
300 | 2016-06-23 01:57:16 | System Reset
301 | 2016-06-23 02:05:07 | Kernel Event | Clean Shutdown
302 | 2016-06-23 02:05:09 | System boot | 3266
303 | 2016-06-23 02:05:09 | System Reset

So either the failure was at a higher network level (no DHCP lease, shill flake, DUT ssh daemon issue), or the host didn't try hard enough to connect.
A reboot 'fixed' it (but was it even necessary? Was the actual graceful reboot issued over ssh? That would confirm that the issue is the latter).
The analysis in #28 was actually wrong; the second test was not on the DUT I thought.
I got the logs from the right machine this time:
<<< firmware eventlog before the test >>>
274 | 2016-06-23 01:09:15 | System boot | 3031
275 | 2016-06-23 01:09:15 | System Reset
276 | 2016-06-23 01:17:40 | Kernel Event | Clean Shutdown
277 | 2016-06-23 01:17:42 | System boot | 3032
278 | 2016-06-23 01:17:42 | System Reset

<<< firmware eventlog after the test (recorded by another test on the DUT) >>>
274 | 2016-06-23 01:09:15 | System boot | 3031
275 | 2016-06-23 01:09:15 | System Reset
276 | 2016-06-23 01:17:40 | Kernel Event | Clean Shutdown
277 | 2016-06-23 01:17:42 | System boot | 3032
278 | 2016-06-23 01:17:42 | System Reset
279 | 2016-06-23 02:45:19 | System boot | 3033
280 | 2016-06-23 02:45:19 | SUS Power Fail
281 | 2016-06-23 02:45:19 | System Reset
282 | 2016-06-23 02:45:19 | Wake Source | Power Button | 0
283 | 2016-06-23 02:45:56 | Kernel Event | Clean Shutdown
284 | 2016-06-23 02:45:58 | System boot | 3034
285 | 2016-06-23 02:45:58 | System Reset
286 | 2016-06-23 02:47:26 | Kernel Event | Clean Shutdown
287 | 2016-06-23 02:47:28 | System boot | 3035
288 | 2016-06-23 02:47:28 | System Reset
[...]

So the machine didn't even complete a clean shutdown when asked to reboot in provision_AutoUpdate.
The matching host log from cautotest:
Provision 67539165	chromeos-test	cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8490.0.0-b281/arc-bvt-cq/cheets_KeyboardTest 2016-06-23 03:03:31 Completed
Job       67533586	chromeos-test	cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest                                         Failed
Repair    67533586	chromeos-test	cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest                     2016-06-23 02:32:06 Completed
Provision 67533586	chromeos-test	cyan-cheets-chrome-pfq/R53-8490.0.0-rc2/arc-bvt-cq/cheets_ContainerSmokeTest                     2016-06-23 01:55:49 Failed
Job	  67532923	chromeos-test	cyan-cheets-android-pfq/R53-8490.0.0-rc2/arc-bvt-cq/video_VideoSanity.vp8	                 2016-06-23 01:52:45 Completed

The repair log is here:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/hosts/chromeos4-row6-rack9-host21/56791460-repair/debug/autoserv.DEBUG

Interesting facts:
-> the forced power cycle we are seeing in the firmware eventlog at entries 279-282:
06/23 02:45:17.129 ERROR|            repair:0397| Repair failed: Power cycle the host with RPM
[...]
06/23 02:45:17.132 INFO |            repair:0140| Repair action triggered: Reset the DUT via servo
[...]
06/23 02:45:40.077 DEBUG|      abstract_ssh:0542| Host chromeos4-row6-rack9-host21 is now up


So a simple cold reset 'fixed' the device.
We would need to save the logs in this case just after the reset to be able to investigate further:
  firmware eventlog, messages, net.log, ...
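A minimal sketch of that collection, run against the DUT as soon as ssh comes back after the reset (hostname taken from this report, standard Chrome OS log locations):

DUT=chromeos4-row6-rack9-host21
# Firmware eventlog as seen by the OS.
ssh root@${DUT} mosys eventlog list > /tmp/${DUT}.eventlog
# System and network logs covering the failed reboot.
scp root@${DUT}:/var/log/messages /tmp/${DUT}.messages
scp root@${DUT}:/var/log/net.log /tmp/${DUT}.net.log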

I have uploaded the following CL to try to get logs in the case mentioned in #31 so we can debug it further:
https://chromium-review.googlesource.com/#/c/355739/

Comment 33 by bugdroid1@chromium.org, Jul 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8e13b68c1bb5aafd6f2853079d38b14a9f384537

commit 8e13b68c1bb5aafd6f2853079d38b14a9f384537
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Fri Jun 24 00:38:06 2016

autotest: record logs after forcing reset when repairing

If the DUT was no longer accessible through SSH, ensure we are recording the
logs after re-gaining SSH access (e.g. after trigger a servo reset) before
clobbering them.

BUG=chromium:615662
TEST=run a local autoserv instance:
sudo ./server/autoserv -R ...

Change-Id: I4aa549145ad724368290ea7566ca41577d92194b
Reviewed-on: https://chromium-review.googlesource.com/355739
Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
Tested-by: Vincent Palatin <vpalatin@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/8e13b68c1bb5aafd6f2853079d38b14a9f384537/server/hosts/cros_repair.py

Owner: vpalatin@chromium.org
Re-assigning to me since I'm investigating it.

Now that the new logging is in place, if you see a new occurrence of the 'Host did not return from reboot' error, please link it here.
Issue 626713 would have had an interesting reboot failure after AU on Butterfly, but those machines have no servo :(
Why do any machines in the lab not have servo?  Aviv?

Comment 37 by sosa@chromium.org, Jul 10 2016

Cc: jrbarnette@chromium.org
There are a few devices that never had full support of servo (had issues that ultimately we couldn't overcome). Butterfly was one of them. See https://bugs.chromium.org/p/chromium/issues/detail?id=383097. I believe this is true for Parrot as well. +Richard knows the full list.
Currently, the list of devices not supporting servo is:
   butterfly parrot parrot_ivb stout x86-mario x86-zgb

The DUT was chromeos4-row6-rack9-host11.
It recovered after a servo cold reset in repair.
The repair logs are here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row6-rack9-host11/57292226-repair/

The repair started at 2016-07-12 09:39:18 PST; the hard reset happened at 09:53:
	START	----	repair.reset	timestamp=1468342356	localtime=Jul 12 09:52:36	
		GOOD	----	verify.ssh	timestamp=1468342389	localtime=Jul 12 09:53:09	
	END GOOD	----	repair.reset	timestamp=1468342389	localtime=Jul 12 09:53:09	
	GOOD	----	verify.cros	timestamp=1468342392	localtime=Jul 12 09:53:12	

The original test logs with the timestamps are here: https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/69352354-chromeos-test/chromeos4-row6-rack9-host11/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG


The firmware eventlog looks like this:
247 | 2016-07-12 06:44:14 | Kernel Event | Clean Shutdown
248 | 2016-07-12 06:44:16 | System boot | 3390
249 | 2016-07-12 06:44:16 | System Reset
250 | 2016-07-12 06:53:33 | Kernel Event | Clean Shutdown
251 | 2016-07-12 06:53:35 | System boot | 3391
252 | 2016-07-12 06:53:35 | System Reset
253 | 2016-07-12 09:52:38 | System boot | 3392
254 | 2016-07-12 09:52:38 | SUS Power Fail
255 | 2016-07-12 09:52:38 | System Reset
256 | 2016-07-12 09:52:38 | Wake Source | Power Button | 0

There is no trace of the shutdown/reboot initiated by the test at 09:00 :
07/12 09:00:02.520 DEBUG|          ssh_host:0180| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!'
07/12 09:00:02.825 DEBUG|        base_utils:0297| [stdout] 12365
07/12 09:00:02.827 DEBUG|      abstract_ssh:0604| Host chromeos4-row6-rack9-host11 pre-shutdown boot_id is 66464aba-279d-44da-b232-30bd85347efe
07/12 09:00:02.828 DEBUG|          ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
07/12 09:00:03.131 DEBUG|        base_utils:0278| [stdout] 66464aba-279d-44da-b232-30bd85347efe
07/12 09:00:04.138 DEBUG|          ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
07/12 09:02:02.714 WARNI|        base_utils:0910| run process timeout (118) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_t2jdl0ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host11 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
07/12 09:02:03.724 DEBUG|      abstract_ssh:0625| Host chromeos4-row6-rack9-host11 is now unreachable over ssh, is down
07/12 09:02:03.726 DEBUG|          ssh_host:0180| Running (ssh) 'true'
07/12 09:10:04.066 WARNI|        base_utils:0910| run process timeout (480) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_t2jdl0ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row6-rack9-host11 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
07/12 09:10:06.084 DEBUG|      abstract_ssh:0550| Host chromeos4-row6-rack9-host11 is still down after waiting 483 seconds
07/12 09:10:06.086 INFO |        server_job:0129| 			ABORT	----	reboot.verify	timestamp=1468339806	localtime=Jul 12 09:10:06	Host did not return from reboot

So it still seems that Cyan is failing on the shutdown/reboot path, not on the way up at boot.
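For reference, the reboot verification autoserv is performing in the log above boils down to comparing boot_id before and after the reboot; a simplified sketch (not the actual autotest code):

DUT=chromeos4-row6-rack9-host11
OLD_ID=$(ssh root@${DUT} cat /proc/sys/kernel/random/boot_id)
# Same reboot command as in the log: fall back to 'reboot -f' if the clean path hangs.
ssh root@${DUT} '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 &'
# Wait up to ~8 minutes for the host to come back with a *different* boot_id.
for i in $(seq 1 48); do
  sleep 10
  NEW_ID=$(ssh -o ConnectTimeout=10 root@${DUT} cat /proc/sys/kernel/random/boot_id 2>/dev/null)
  if [ -n "${NEW_ID}" ] && [ "${NEW_ID}" != "${OLD_ID}" ]; then
    echo "host returned from reboot"; exit 0
  fi
done
echo "Host did not return from reboot"; exit 1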
Would it be possible to send the alt-volup-x key combo via servo before servo initiates the cold reset?  That may allow us to generate and gather the more interesting kernel logs that we will likely want here.
There is a slightly-corrupted console-ramoops collected (hopefully it's the right one, I'm not 100% sure). I'm attaching it, nothing outstanding though.

I will evaluate adding the Alt+VolUp+X in the repair phase.
console-ramoops.txt
64.1 KB View Download
Assuming that the console-ramoops is complete, the kernel never got to the point of asking the firmware for a reboot. Do we need to invent an equivalent of pm_print_times that logs more details about the shutdown/restart sequence so that we know where in the kernel things stalled?
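A hedged sketch of what already exists along those lines on a suspect DUT (how much of the shutdown path is covered varies by kernel version, so this may or may not pinpoint the stall):

# Log per-device PM callback timing (mainly suspend/resume, but cheap to leave on).
echo 1 > /sys/power/pm_print_times
# On kernels where initcall_debug also instruments device_shutdown(), adding
# 'initcall_debug' to the kernel command line makes the reboot path log each
# device shutdown; those lines would land in console-ramoops before the stall.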
Shouldn't the watchdog reboot us?
Given this issue seems to be Cyan-specific, is there a watchdog there? Has somebody (e.g. Intel) really run some intensive reboot testing before shipping this platform?

Comment 46 by bugdroid1@chromium.org, Jul 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/hdctools/+/3acbbe57aff59159db96a80bbb9e475b1ce454a4

commit 3acbbe57aff59159db96a80bbb9e475b1ce454a4
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Tue Jul 19 15:40:12 2016

servod: add sysrq-x key combination

Allow to trigger the kernel System Request X (sysrq-x) to force a kernel
panic on keyboard drivers supporting it.

BUG=chromium:615662
TEST=manual, run "dut-control kb_press:sysrq_x kb_press:sysrq_x
kb_press:sysrq_x" and see the main processor rebooting,
then find 'Kernel panic - not syncing: ChromeOS X Key' in
/dev/pstore/console-ramoops.

Change-Id: Ie83eb2306645595f927bfcfe23d35031c0df26c7
Reviewed-on: https://chromium-review.googlesource.com/361760
Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
Tested-by: Vincent Palatin <vpalatin@chromium.org>
Reviewed-by: Todd Broch <tbroch@chromium.org>

[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/keyboard_handlers.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/servo_server.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/drv/kb.py
[modify] https://crrev.com/3acbbe57aff59159db96a80bbb9e475b1ce454a4/servo/data/keyboard.xml

Sorry, I had missed c#45.

wmatrix indicates various platform_KernelErrorPaths tests passing for Cyan:
https://wmatrix.googleplex.com/platform/kernel_all?platforms=cyan

I think it depends on why and when the reboot processing got stuck. The watchdogs won't trigger:
- on a somewhat benign "while(1) { do_something; msleep(1000); }" loop in some driver;
- very late in the reboot sequence, where the watchdogs have already been disabled.
Cc: shyamsundarr@chromium.org
Adding Shyam from Intel.  I'd like to see Intel run some reboot tests on Cyan and debug in case they hit issues.
@comment 48, I think you have cc'd the wrong Shyam.
The right Shyam is here. Acking Puneet's ask.
Cc: -shyamsundarr@chromium.org
Cc: shyam.su...@intel.com
Cc: margarit...@intel.com tony.sun@intel.com
We are already looking into this request.  We will update the status/results later today.
Thanks Margarita! Do let Google know if we need more data on this.
Margarita - are you just trying to answer the question in comment 45?  Or are you trying to come up with a new experiment on stress testing a reboot cycle on the Cyan platform?
margarita@ I'll take this one.

We'll be using the bootperf tool to stress test two cyans, one with a cyan-cheets image and the other with a regular cyan image. This should tell us if the problem occurs on Cyan builds that have ARC++ or not, but do you already know whether this is a cheets specific issue?

We'll get back to you regarding the watchdog as we find out more information. 

Regarding the question of testing, yes the ChromeOS teams at Intel have conducted numerous reboot and suspend/resume stress tests before BSW and cyan were released.

I'll provide more information as we get further into the debugging process.
vpalatin@ what is the Coreboot version on the DUTs you're reproducing this on?
> vpalatin@ what is the Coreboot version on the DUTs you're reproducing this on?

Those are lab devices, so there is not a single answer.
The last one was:
fwid                   = Google_Cyan.7287.57.60

Comment 60 by bugdroid1@chromium.org, Jul 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e

commit e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e
Author: Vincent Palatin <vpalatin@chromium.org>
Date: Thu Jul 14 09:31:58 2016

autotest: try to repair DUT with sysrq-x first

When the DUT was no longer accessible through SSH, try first to repair
it by pressing 3 times sysrq-x before trying the servo cold reset.
This way we can try to force the kernel to trigger a panic and get the
last kernel logs in the console ramoops after reboot.

BUG=chromium:615662
TEST=run a local autoserv instance:
sudo ./server/autoserv -R -m ...
after stopping 'shill' on the DUT to simulate a networking issue.
CQ-DEPEND=CL:361760

Change-Id: I31c4bf567655d4207c7aa9e490b1fcbcaa9ef8d2
Reviewed-on: https://chromium-review.googlesource.com/361691
Commit-Ready: Vincent Palatin <vpalatin@chromium.org>
Tested-by: Vincent Palatin <vpalatin@chromium.org>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Reviewed-by: Todd Broch <tbroch@chromium.org>

[modify] https://crrev.com/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e/server/cros/servo/servo.py
[modify] https://crrev.com/e7dc9284f1b7ed7dc98803641b8ed32a1082bb2e/server/hosts/cros_repair.py

For the elm-cheets-release build, the machine was salvaged by a servo cold reset during the repair (this repair doesn't have sysrq-x yet);
the leftover in the ramoops tends to show that the SoC reset was called (see below).
The test timeline was the following regarding reboots:
1st reboot succeeded: 
 reboot.verify START Jul 27 13:06:26  GOOD  Jul 27 13:07:02
2nd reboot failed:
 reboot.verify START Jul 27 13:13:01  ABORT Jul 27 13:23:06
The firmware eventlog actually recorded a 'boot' for both of them but suggests that the 2nd one ended up in recovery:
278 | 2016-07-27 13:06:29 | System boot | 0
279 | 2016-07-27 13:13:10 | System boot | 0
280 | 2016-07-27 13:13:10 | Chrome OS Recovery Mode | Unknown
281 | 2016-07-27 13:13:12 | System boot | 0
282 | 2016-07-27 13:13:12 | Chrome OS Recovery Mode | Unknown
283 | 2016-07-27 15:21:33 | System boot | 0

The repair job just did the servo cold reset at 'Jul 27 15:21:32' and everything looks in a proper state afterwards (as in https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/hosts/chromeos2-row5-rack7-host1/57897614-repair/status.log ).
So the failure was a spurious transition to Recovery mode; not sure if there is any hint as to why.

ramoops leftovers:
[  314.222106] EHT4-fs (imcblk0p%): mounting ext2 file systEm using the ext4 subays4em
[  314.225882] EPT4-fs (mmcbhk0p5): mounted filesystem without journal. Opts: 
[  392.255693] smsc95xx\001-2.1.1.1:1.0 eth1: hardware isf't capable of remote wakeup
[  392.509223] init: console-ttiS0 main process (1301) killed by TERM signal
[  392.509678] init: daisydog main process (1338) killed bx TERM signah
[  392.510699] ifit: powerd main `rocess (1 09) kIh,ed by TERM signal
[  392.411307] init: chapsd main process\00(1878) killed by TERM signal
[  392.111773] init: permission_brokdr m`in prObess (2322) killed\00By TERM signal
[  382.513137] init: crash-sender main process (4228) killed by TERM signal
[  392.513826] init: crks-machine-id-regen-periodic main process (4234) iilled by TERM signal
[  !92.514443] init: mtpd main process (4241) kilded by TERM signal
[  392.515069] inip: log-rgtate main process (450	 kildeD\00bq TERM signal
[  392.51565 ] init: cras main process (4201) killed by TERM sigjAl
[  392.517702] ijit: bluetoothd mail process (4322) killed by TERM signad
[\00 292.522322] ini4: w!r.-collector mai prOcess (5890) killed by TERM signal
[  392.523105] init: tlsdate` main prmc%ss (5927) killed by TERM signal
[  392.523643] init: netfilter-queue\00main process (5966) +illed by TERM siGnal
[  392.537590] EXT4-fs (dm-0) re-mounted. Opts: 
[  392.554399] init: cryptohomed main process (1979! killed by TERL signal
[  392.692(79] init: rEcover_dutS\00eain `rocacs (5884) killed by TERM signal
[  392.773025] Unsafe core_pattern used wath suhd_dulpable=2. Pipe handler gr fullx qUaliFied cOra `ump `ath reqei2ed.
[  393.828659] reboot: R%starting system
For the oak-release job, after the test failed, the next 'Provision' succeeded without a 'Repair'.

After extracting the firmware eventlog from one of the subsequent test jobs:
240 | 2016-07-27 12:45:26 | System boot | 0
241 | 2016-07-27 12:52:15 | System boot | 0      <== failed reboot in the test
242 | 2016-07-27 12:52:15 | Chrome OS Recovery Mode | Unknown
243 | 2016-07-27 12:52:17 | System boot | 0
244 | 2016-07-27 12:52:17 | Chrome OS Recovery Mode | Unknown
245 | 2009-12-31 16:00:00 | System boot | 0          
246 | 2009-12-31 16:00:00 | System boot | 0
247 | 2009-12-31 16:00:00 | System boot | 0
248 | 2009-12-31 16:00:00 | System boot | 0
249 | 2009-12-31 16:00:00 | System boot | 0
250 | 2009-12-31 16:00:00 | System boot | 0
251 | 2009-12-31 16:00:09 | System boot | 0
252 | 2009-12-31 16:00:00 | System boot | 0
253 | 2009-12-31 16:00:00 | System boot | 0
254 | 2009-12-31 16:00:00 | System boot | 0
255 | 2009-12-31 16:00:00 | System boot | 0
256 | 2009-12-31 16:00:00 | System boot | 0
257 | 2009-12-31 16:00:00 | System boot | 0
258 | 2009-12-31 16:00:00 | System boot | 0
259 | 2009-12-31 16:00:00 | System boot | 0
260 | 2009-12-31 16:00:00 | System boot | 0
261 | 2009-12-31 16:00:00 | System boot | 0
262 | 2009-12-31 16:00:00 | System boot | 0
263 | 2009-12-31 16:00:00 | System boot | 0
264 | 2009-12-31 16:00:00 | System boot | 0
265 | 2009-12-31 16:00:00 | System boot | 0
266 | 2009-12-31 16:43:45 | System boot | 0
267 | 2016-07-27 20:44:45 | System boot | 0        
268 | 2016-07-27 20:51:20 | System boot | 0

Same issue as on elm; the machine went to recovery.
"Unknown", hmmm... unfortunately, mosys' recovery reason table is heavily out of date (opened  issue 632413 ). Can you please run this on the affected DUT and attach elog.hex so I can fish out the reason manually and get an idea what happened here?

flashrom -r -i RW_ELOG:/tmp/elog.bin
xxd /tmp/elog.bin > /tmp/elog.hex
Thanks Vincent. Our mystery recovery reason is 0x2d ("Error trying to clear TPM owner") in all cases. Since these devices shouldn't be transitioning into dev mode, the only way this can happen is if something set the clear_tpm_owner_request NVRAM flag (with crossytem). Some part of the three TPM commands involved in that (force clear, enable, activate) went wrong.

Unfortunately, that's all I can tell you right now... to debug it further, we'd probably have to catch it in the act (UART output should at least tell us which of the commands failed, and with what error code). A good first step would be looking up which test exactly it ran during/before that error (maybe some sort of TPM test? That clear_tpm flag must be set by something...), running it in isolation and seeing if that repros more quickly.
The `crossystem clear_tpm_owner_request=1` command runs every time
we do a powerwash.  That happens in a couple of cases:
  * When running the powerwash test.
  * One possible repair strategy is to run powerwash and then
    reinstall via AU.

Additionally, we recently changed provisioning code to invoke that
command immediately before rebooting after AU.
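For context, a simplified sketch of the ordering that the provisioning change introduces and why a TPM hiccup can surface as the recovery reason discussed above (this is not the actual provisioning code):

# At the end of auto-update, provisioning now requests a TPM owner clear and reboots:
crossystem clear_tpm_owner_request=1
reboot
# On the next boot the firmware runs the TPM force-clear/enable/activate sequence;
# if any of those commands fails, it records recovery reason 0x2d
# ("Error trying to clear TPM owner") and drops into recovery mode.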

Oak/Elm specific investigation forked to http://crosbug.com/p/55764.
@davidriley
Yes, all the Cyan derivatives have a real unsolved issue; they sometimes get stuck on the shutdown path, as mentioned in comments #41 to #59.



Vincent,

Intel will have an update on this today.

Thanks,
Shyam
We were able to reproduce this locally and have an engineer working on it. We will provide an update by next week.
Google, 

Can you please add matthew.s.atwood@intel.com and Aditya Swarup (aditya.swarup@intel.com) to this bug for faster updates? Intel is currently unable to add engineers to this bug.

Thanks a lot,
Shyam
Cc: aditya.s...@intel.com matthew....@intel.com
+Matt & Aditya
Blockedon: 639301
As this is the 'generic' bug for all the reboot failures, I have opened issue 639301
for the investigation of the Cyan shutdown failure:
https://bugs.chromium.org/p/chromium/issues/detail?id=639301
Please put your progress there.
Locally this problem is much easier to reproduce with a cyan-cheets build than with a cyan build. I think the first step is to see what changes with the kernel USE flags/cmdline/config, to hopefully help point us in the right direction.

We are also in the process of collecting our own failure logs for both cases to ensure this is the same issue. Expect an update next week.
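A quick, hedged starting point for that comparison using only what is visible at runtime on two DUTs (hostnames below are placeholders; USE-flag differences would need the board overlays instead):

CYAN_DUT=cyan-dut.example           # placeholder: DUT running a cyan build
CHEETS_DUT=cyan-cheets-dut.example  # placeholder: DUT running a cyan-cheets build
# Compare kernel command line and kernel version strings.
for f in /proc/cmdline /proc/version; do
  diff <(ssh root@${CYAN_DUT} cat $f) <(ssh root@${CHEETS_DUT} cat $f)
done
# Loaded modules often differ between the two configurations as well.
diff <(ssh root@${CYAN_DUT} lsmod | sort) <(ssh root@${CHEETS_DUT} lsmod | sort)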
As per comment #76, please put the Cyan investigations in issue 639301.
Blocking: 636065
I made comments in https://bugs.chromium.org/p/chromium/issues/detail?id=639301#c75 that cyan was worse than other boards. wizpig (45% suite success rate) is also worse than other boards running the same bvt-inline suite (81% suite success rate). Provision success for that same suite is 93% for wizpig, vs. 99% across the 14 boards running that suite. wizpig's poor rate does appear to be partially explained by two bad DUTs:
chromeos2-row8-rack8-host17: 100.0% 56/56
chromeos2-row8-rack8-host18: 98.0% 56/57
chromeos2-row8-rack8-host21: 65.0% 25/38
chromeos6-row2-rack19-host1: 100.0% 56/56
chromeos6-row2-rack19-host3: 100.0% 56/56
chromeos6-row2-rack20-host20: 96.0% 54/56
chromeos6-row2-rack20-host22: 100.0% 56/56
chromeos6-row2-rack20-host4: 100.0% 56/56
chromeos6-row2-rack20-host7: 0.0% 0/12

edgar seems to perform a lot better running bvt-cq: it didn't show provision failures and has the same success rate as the average across all boards.

Sign in to add a comment