Issue 795902

Issue metadata

Status: WontFix
Closed: Jan 2018
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: ----

fizz-release failing since Dec 8

Reported by a project member, Dec 18 2017

Issue description

Filed by on behalf of

fizz-release:804-831 failed

Builders failed on: 
- fizz-release:

Every build of fizz-release since 804 on Dec 8 is failing with messages that look similar to:

provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack5-host2:
0) ChromiumOSUpdateError: chromeos2-row4-rack5-host2 cannot recover from reboot at pre-setup of rootfs update,
1) SSHConnectionError: ssh: connect to host port 22: Connection timed out.

Components: Infra
I've done a spot check of all the repair tasks on fizz DUTs
in the bvt pool.  There are a lot of them: in a 72 hour period,
there were 63 repair events in total.  That's enough that, on average,
every DUT fails at least once on every release builder run.
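
As a rough check of that arithmetic, here is a minimal sketch in Python.
Only the 63 events and the 72 hour window come from the spot check; the
function names are made up for illustration, and the pool size and the
builder run interval are not stated in this bug, so they're left as
parameters to fill in with real values.

```python
def repair_rate_per_hour(events: int, window_hours: float) -> float:
    """Average number of repair events per hour over the sampled window."""
    return events / window_hours


def repairs_per_dut_per_run(events: int, window_hours: float,
                            dut_count: int, run_interval_hours: float) -> float:
    """Average repairs per DUT during one release builder run.

    dut_count and run_interval_hours are NOT given in this bug; the
    caller has to supply the real pool size and build cadence.
    """
    rate = repair_rate_per_hour(events, window_hours)
    return rate * run_interval_hours / dut_count


# Figures from the spot check: 63 repair events in 72 hours.
rate = repair_rate_per_hour(63, 72)
print(rate)        # 0.875 repair events per hour
print(rate * 24)   # 21.0 repair events per day
```

A result of 1.0 or more from repairs_per_dut_per_run() corresponds to
the "every DUT fails at least once per run" claim.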

Below are logs of a prototypical event:

The attached "status.log" file tells the tale.  The key is this part:
	FAIL	----	verify.ssh	timestamp=1513636301	localtime=Dec 18 14:31:41	No answer to ping from chromeos2-row4-rack5-host7
	START	----	repair.rpm	timestamp=1513636301	localtime=Dec 18 14:31:41	
		GOOD	----	verify.ssh	timestamp=1513636349	localtime=Dec 18 14:32:29	
		GOOD	----	verify.power	timestamp=1513636349	localtime=Dec 18 14:32:29	
	END GOOD	----	repair.rpm	timestamp=1513636349	localtime=Dec 18 14:32:29	

The "verify.ssh" line says that the DUT was offline.  The "repair.rpm"
action means that the system used an RPM device to unplug/replug AC
power to the DUT.  The logs show that power cycling AC caused the DUT
to boot up and return to working order.
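
For anyone who wants to scan other status.log files for the same
pattern, here's a rough sketch in Python.  It assumes the tab-separated
layout seen in the excerpt above (status, a "----" field, operation,
then timestamp= and localtime= fields); that layout is inferred from
this one log, and the helper names are hypothetical, not part of the
lab tooling.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    status: str      # e.g. "FAIL", "START", "GOOD", "END GOOD"
    operation: str   # e.g. "verify.ssh", "repair.rpm"
    timestamp: int   # seconds since the epoch, from the timestamp= field


def parse_line(line: str) -> Optional[Entry]:
    """Parse one status.log line; return None if it doesn't fit the layout."""
    # Leading tabs are just nesting depth; trailing tabs are empty messages.
    fields = line.strip("\t\n").split("\t")
    if len(fields) < 4:
        return None
    status, _subdir, operation = fields[:3]
    for field in fields[3:]:
        if field.startswith("timestamp="):
            return Entry(status, operation, int(field[len("timestamp="):]))
    return None


def rpm_recovery_seconds(entries: List[Entry]) -> List[int]:
    """Durations of repair.rpm runs that followed a verify.ssh FAIL and ended GOOD."""
    durations = []
    ssh_failed = False
    start = None
    for e in entries:
        if e.operation == "verify.ssh" and e.status == "FAIL":
            ssh_failed = True
        elif e.operation == "repair.rpm" and e.status == "START" and ssh_failed:
            start = e.timestamp
        elif (e.operation == "repair.rpm" and e.status == "END GOOD"
              and start is not None):
            durations.append(e.timestamp - start)
            ssh_failed = False
            start = None
    return durations


# The five lines quoted above, with tabs written out explicitly.
SAMPLE = [
    "\tFAIL\t----\tverify.ssh\ttimestamp=1513636301\tlocaltime=Dec 18 14:31:41\t"
    "No answer to ping from chromeos2-row4-rack5-host7",
    "\tSTART\t----\trepair.rpm\ttimestamp=1513636301\tlocaltime=Dec 18 14:31:41\t",
    "\t\tGOOD\t----\tverify.ssh\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t",
    "\t\tGOOD\t----\tverify.power\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t",
    "\tEND GOOD\t----\trepair.rpm\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t",
]

entries = [e for e in map(parse_line, SAMPLE) if e is not None]
print(rpm_recovery_seconds(entries))  # prints [48]
```

On this event the DUT was back on ssh 48 seconds after the RPM action
started.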

The code for "repair.rpm" looks for crash dumps and gathers any that
it finds.  The logs show no dumps, so it looks like whatever caused the
problem did not involve a crash.

Components: -Infra OS>Kernel
Status: Assigned (was: Available)
This smells like a system hang, so let's give it to the kernel.

Assigning to a sheriff to find a proper expert.

Status: WontFix (was: Assigned)
Haven't seen any more recurrences of this problem and it's been over a month without any updates.  Closing.