
Issue 849367 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 847716




All bvt veyron_rialto DUTs are failing repair (can't boot from USB)

Project Member Reported by pprabhu@chromium.org, Jun 4 2018

Issue description

Blocking: 847716
They're all failing to even boot from USB as part of the final repair attempt:

START	----	repair	timestamp=1528135231	localtime=Jun 04 11:00:31	
	GOOD	----	verify.servo_ssh	timestamp=1528135233	localtime=Jun 04 11:00:33	
	GOOD	----	verify.brd_config	timestamp=1528135234	localtime=Jun 04 11:00:34	
	GOOD	----	verify.ser_config	timestamp=1528135234	localtime=Jun 04 11:00:34	
	GOOD	----	verify.job	timestamp=1528135235	localtime=Jun 04 11:00:35	
	GOOD	----	verify.servod	timestamp=1528135240	localtime=Jun 04 11:00:40	
	GOOD	----	verify.pwr_button	timestamp=1528135240	localtime=Jun 04 11:00:40	
	GOOD	----	verify.lid_open	timestamp=1528135240	localtime=Jun 04 11:00:40	
	GOOD	----	verify.update	timestamp=1528135244	localtime=Jun 04 11:00:44	
	GOOD	----	verify.PASS	timestamp=1528135244	localtime=Jun 04 11:00:44	
	FAIL	----	verify.ssh	timestamp=1528135829	localtime=Jun 04 11:10:29	No answer to ping from chromeos2-row2-rack10-host4
	START	----	repair.rpm	timestamp=1528135829	localtime=Jun 04 11:10:29	
		FAIL	----	repair.rpm	timestamp=1528136105	localtime=Jun 04 11:15:05	chromeos2-row2-rack10-host4 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1528136105	localtime=Jun 04 11:15:05	
	START	----	repair.sysrq	timestamp=1528136105	localtime=Jun 04 11:15:05	
		FAIL	----	repair.sysrq	timestamp=1528136338	localtime=Jun 04 11:18:58	Host chromeos2-row2-rack10-host4 is still offline after sysrq.
	END FAIL	----	repair.sysrq	timestamp=1528136338	localtime=Jun 04 11:18:58	
	START	----	repair.servoreset	timestamp=1528136338	localtime=Jun 04 11:18:58	
		FAIL	----	repair.servoreset	timestamp=1528136564	localtime=Jun 04 11:22:44	Host chromeos2-row2-rack10-host4 is still offline after servoreset.
	END FAIL	----	repair.servoreset	timestamp=1528136564	localtime=Jun 04 11:22:44	
	START	----	repair.firmware	timestamp=1528136564	localtime=Jun 04 11:22:44	
		FAIL	----	repair.firmware	timestamp=1528136564	localtime=Jun 04 11:22:44	Firmware repair is not applicable to host chromeos2-row2-rack10-host4.
	END FAIL	----	repair.firmware	timestamp=1528136564	localtime=Jun 04 11:22:44	
	START	----	repair.usb	timestamp=1528136564	localtime=Jun 04 11:22:44	
		FAIL	----	repair.usb	timestamp=1528137082	localtime=Jun 04 11:31:22	DUT failed to boot from USB after 300 seconds
	END FAIL	----	repair.usb	timestamp=1528137082	localtime=Jun 04 11:31:22	
END FAIL	----	repair	timestamp=1528137082	localtime=Jun 04 11:31:22	
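For anyone triaging these by hand, the failing steps can be pulled out of a status.log with a few lines of Python. This is only a sketch: it assumes the fields are tab-separated in the order shown above (status, `----`, step name, `timestamp=`, `localtime=`, optional message), and the names `Entry`, `parse_line` and `failures` are made up for illustration, not part of autotest.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    status: str     # e.g. "START", "GOOD", "FAIL", "END FAIL"
    name: str       # e.g. "repair.usb"
    timestamp: int  # seconds since the epoch
    message: str    # trailing free-text field, may be empty


def parse_line(line: str) -> Optional[Entry]:
    """Parse one status.log line; return None if it does not look like one.

    Assumption: fields are tab-separated, as in the log pasted above.
    """
    fields = line.strip().split("\t")
    if len(fields) < 4 or not fields[3].startswith("timestamp="):
        return None
    status, _, name, stamp = fields[:4]
    message = fields[5] if len(fields) > 5 else ""
    return Entry(status, name, int(stamp.split("=", 1)[1]), message)


def failures(log_text: str) -> List[Entry]:
    """Return the entries whose status is exactly FAIL (not END FAIL)."""
    entries = (parse_line(line) for line in log_text.splitlines())
    return [e for e in entries if e is not None and e.status == "FAIL"]


# A few lines copied from the log above, with tabs written out explicitly.
SAMPLE = (
    "START\t----\trepair\ttimestamp=1528135231\tlocaltime=Jun 04 11:00:31\t\n"
    "\tGOOD\t----\tverify.servod\ttimestamp=1528135240\tlocaltime=Jun 04 11:00:40\t\n"
    "\tFAIL\t----\tverify.ssh\ttimestamp=1528135829\tlocaltime=Jun 04 11:10:29\t"
    "No answer to ping from chromeos2-row2-rack10-host4\n"
    "\tSTART\t----\trepair.usb\ttimestamp=1528136564\tlocaltime=Jun 04 11:22:44\t\n"
    "\t\tFAIL\t----\trepair.usb\ttimestamp=1528137082\tlocaltime=Jun 04 11:31:22\t"
    "DUT failed to boot from USB after 300 seconds\n"
    "\tEND FAIL\t----\trepair.usb\ttimestamp=1528137082\tlocaltime=Jun 04 11:31:22\t\n"
)

for entry in failures(SAMPLE):
    print(entry.name, "->", entry.message)
```

Matching on the exact status `FAIL` skips the `END FAIL` lines, so each failed step is listed once.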
Filed b/109669494 to try to get some logs from one of the dead DUTs.
They all have separate servos, so this is not a common labstation issue.


pprabhu@pprabhu:chromiumos$ dut-status -p bvt -b veyron_rialto -n | xargs -i atest host stat {} | grep servo_host
servo_host : chromeos2-row1-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host4-servo
servo_host : chromeos2-row2-rack10-host5-servo
servo_host : chromeos2-row2-rack10-host6-servo
servo_host : chromeos2-row2-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host9-servo
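The "no shared servo host" check can also be done mechanically rather than by eye. A rough sketch, assuming the `servo_host : <name>` line format shown in the output above; `servo_hosts` and `shared` are hypothetical helper names:

```python
from collections import Counter
from typing import List

# The grep output pasted above.
OUTPUT = """\
servo_host : chromeos2-row1-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host4-servo
servo_host : chromeos2-row2-rack10-host5-servo
servo_host : chromeos2-row2-rack10-host6-servo
servo_host : chromeos2-row2-rack10-host7-servo
servo_host : chromeos2-row2-rack10-host9-servo
"""


def servo_hosts(text: str) -> List[str]:
    """Collect the value of every 'servo_host : <name>' line."""
    hosts = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "servo_host":
            hosts.append(value.strip())
    return hosts


def shared(hosts: List[str]) -> List[str]:
    """Return servo hosts that appear more than once, sorted by name."""
    return sorted(h for h, n in Counter(hosts).items() if n > 1)


hosts = servo_hosts(OUTPUT)
print(len(hosts), "servo hosts; shared:", shared(hosts))
```

An empty `shared` list means every DUT reports a different servo host.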


Another chance is that we have a bad veyron_rialto stable image.
Labels: Hotlist-Deputy
> Another chance is that we have a bad veyron_rialto stable image.

We have enough data to answer that question easily...

$ dut-status -b veyron_rialto -p bvt
hostname                       S   last checked         URL
chromeos2-row1-rack10-host7    NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack10-host7/564195-repair/
chromeos2-row2-rack10-host4    NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host4/564196-repair/
chromeos2-row2-rack10-host5    NO  2018-06-01 12:37:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host5/545573-repair/
chromeos2-row2-rack10-host6    NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host6/564198-repair/
chromeos2-row2-rack10-host7    NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host7/564199-repair/
chromeos2-row2-rack10-host9    NO  2018-06-06 11:38:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row2-rack10-host9/564197-repair/

Looking at status.log from all of those failures, there's a
common theme:
 1) The DUT is offline.
 2) Servo verification reports no errors.
 3) Re-installing from USB fails like this:

	START	----	repair.usb	timestamp=1528311642	localtime=Jun 06 12:00:42	
		FAIL	----	repair.usb	timestamp=1528312159	localtime=Jun 06 12:09:19	DUT failed to boot from USB after 300 seconds
	END FAIL	----	repair.usb	timestamp=1528312159	localtime=Jun 06 12:09:19	
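To compare how long the repair.usb step ran on each host, the START and END timestamps can be subtracted. A minimal sketch, again assuming tab-separated status.log fields as pasted above; `step_duration` is a made-up helper, not an autotest function:

```python
from typing import Optional

# The three repair.usb lines quoted above, with tabs written out explicitly.
LOG = (
    "\tSTART\t----\trepair.usb\ttimestamp=1528311642\tlocaltime=Jun 06 12:00:42\t\n"
    "\t\tFAIL\t----\trepair.usb\ttimestamp=1528312159\tlocaltime=Jun 06 12:09:19\t"
    "DUT failed to boot from USB after 300 seconds\n"
    "\tEND FAIL\t----\trepair.usb\ttimestamp=1528312159\tlocaltime=Jun 06 12:09:19\t\n"
)


def step_duration(log_text: str, step: str) -> Optional[int]:
    """Seconds between the START and END lines of `step`, or None if either is missing."""
    start = end = None
    for line in log_text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 4 or fields[2] != step:
            continue
        if not fields[3].startswith("timestamp="):
            continue
        stamp = int(fields[3].split("=", 1)[1])
        if fields[0] == "START":
            start = stamp
        elif fields[0].startswith("END"):
            end = stamp
    if start is None or end is None:
        return None
    return end - start


print("repair.usb ran for", step_duration(LOG, "repair.usb"), "seconds")
```

For the lines above this gives 517 seconds, so the whole step takes longer than the 300-second boot wait named in the failure message.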

> Another chance is that we have a bad veyron_rialto stable image.

Just to be clear:  A bad stable image isn't likely.  The chosen
image file is the latest Beta build for rialto, an R66 release.
If the image were bad, the most likely cause would be corruption
in Google Storage, and that's not very likely at all.

The first thing to look for is whether any recent servo changes
adversely impacted rialto.

Status: Fixed (was: Started)
The internal bug tracking the recovery is fixed. All Rialtos are back.
