Status: Assigned
Pri: 2

fizz-release failing since Dec 8
Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Dec 18
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of bmgordon@google.com

fizz-release:804-831 failed

Builders failed on: 
- fizz-release: 
  https://luci-milo.appspot.com/buildbot/chromeos/fizz-release/831

Every build of fizz-release since 804 on Dec 8 is failing with messages that look similar to:
    provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack5-host2:
    0) ChromiumOSUpdateError: chromeos2-row4-rack5-host2 cannot recover from reboot at pre-setup of rootfs update,
    1) SSHConnectionError: ssh: connect to host 100.115.226.211 port 22: Connection timed out.
 
Components: Infra
Cc: yungleem@chromium.org
I've done a spot check of all the repair tasks on fizz DUTs
in the bvt pool.  There are a lot of them: in a 72-hour period,
there were 63 repair events in total.  That's enough for every
DUT, on average, to fail at least once on every release builder run.

Below are logs of a prototypical event:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack5-host7/61981800-repair

The attached "status.log" file tells the tale.  The key is this part:
	FAIL	----	verify.ssh	timestamp=1513636301	localtime=Dec 18 14:31:41	No answer to ping from chromeos2-row4-rack5-host7
	START	----	repair.rpm	timestamp=1513636301	localtime=Dec 18 14:31:41	
		GOOD	----	verify.ssh	timestamp=1513636349	localtime=Dec 18 14:32:29	
		GOOD	----	verify.power	timestamp=1513636349	localtime=Dec 18 14:32:29	
	END GOOD	----	repair.rpm	timestamp=1513636349	localtime=Dec 18 14:32:29	

The "verify.ssh" line says that the DUT was offline.  The "repair.rpm"
action means that the system used an RPM device to unplug/replug AC
power to the DUT.  The logs show that power cycling AC caused the DUT
to boot up and return to working order.
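
For anyone who wants to run the same check across more repair logs, here is a
rough sketch of how the excerpt above could be read mechanically.  It assumes
each status.log line is tab-separated, with leading tabs for nesting, and that
the fields are status, subdir, name, "timestamp=...", "localtime=...", and an
optional message, as in the lines quoted above.  The names Entry, parse_line,
and recovery_seconds are made up for illustration; they are not part of
autotest.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    depth: int       # number of leading tabs (nesting level)
    status: str      # e.g. "FAIL", "START", "GOOD", "END GOOD"
    name: str        # e.g. "verify.ssh", "repair.rpm"
    timestamp: int   # seconds since the epoch
    message: str     # free-form text, may be empty


def parse_line(line: str) -> Optional[Entry]:
    """Parse one status.log line; return None if it does not fit the assumed layout."""
    raw = line.rstrip("\n")
    body = raw.lstrip("\t")
    if not body.strip():
        return None
    depth = len(raw) - len(body)
    fields = body.split("\t")
    if len(fields) < 4:
        return None
    status, name = fields[0], fields[2]
    timestamp = None
    message = ""
    for field in fields[3:]:
        if field.startswith("timestamp="):
            timestamp = int(field[len("timestamp="):])
        elif field.startswith("localtime="):
            continue  # redundant with the epoch timestamp
        elif field:
            message = field
    if timestamp is None:
        return None
    return Entry(depth, status, name, timestamp, message)


def recovery_seconds(entries: List[Entry]) -> Optional[int]:
    """Seconds from the first failed verify.ssh to the next good verify.ssh."""
    failed_at = None
    for entry in entries:
        if entry.name != "verify.ssh":
            continue
        if failed_at is None and entry.status == "FAIL":
            failed_at = entry.timestamp
        elif failed_at is not None and entry.status == "GOOD":
            return entry.timestamp - failed_at
    return None


# The five lines quoted from status.log above, with tabs written out explicitly.
SAMPLE = (
    "\tFAIL\t----\tverify.ssh\ttimestamp=1513636301\tlocaltime=Dec 18 14:31:41\t"
    "No answer to ping from chromeos2-row4-rack5-host7\n"
    "\tSTART\t----\trepair.rpm\ttimestamp=1513636301\tlocaltime=Dec 18 14:31:41\t\n"
    "\t\tGOOD\t----\tverify.ssh\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t\n"
    "\t\tGOOD\t----\tverify.power\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t\n"
    "\tEND GOOD\t----\trepair.rpm\ttimestamp=1513636349\tlocaltime=Dec 18 14:32:29\t\n"
)

entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e is not None]
print(recovery_seconds(entries))  # 48
```

On this excerpt the DUT was answering ssh again 48 seconds after the failed
check, which fits the reading that the AC power cycle is what brought it back.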

The code for "repair.rpm" looks for and gathers crash dumps, if they're
found.  The logs show no dumps, so whatever caused the problem, it
doesn't look like it was a crash.

Attachment: status.log (5.5 KB)
Components: -Infra OS>Kernel
Owner: bmgordon@chromium.org
Status: Assigned
This smells like a system hang, so let's give it to the kernel.

Assigning to a sheriff to find a proper expert.
