fizz-release failing since Dec 8
Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Dec 18
Filed by firstname.lastname@example.org on behalf of email@example.com

fizz-release:804-831 failed

Builders failed on:
- fizz-release: https://luci-milo.appspot.com/buildbot/chromeos/fizz-release/831

Every build of fizz-release since 804 on Dec 8 is failing with messages that look similar to:

    provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack5-host2: 0) ChromiumOSUpdateError: chromeos2-row4-rack5-host2 cannot recover from reboot at pre-setup of rootfs update, 1) SSHConnectionError: ssh: connect to host 100.115.226.211 port 22: Connection timed out.
I've done a spot check of all the repair tasks on fizz DUTs in the bvt pool. There are a lot of them: in a 72 hour period, there was a total of 63 repair events. On average, that is enough for every DUT to fail at least once per release builder run.

Below are logs of a prototypical event:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack5-host7/61981800-repair

The attached "status.log" file tells the tale. The key is this part:

    FAIL ---- verify.ssh timestamp=1513636301 localtime=Dec 18 14:31:41 No answer to ping from chromeos2-row4-rack5-host7
    START ---- repair.rpm timestamp=1513636301 localtime=Dec 18 14:31:41
    GOOD ---- verify.ssh timestamp=1513636349 localtime=Dec 18 14:32:29
    GOOD ---- verify.power timestamp=1513636349 localtime=Dec 18 14:32:29
    END GOOD ---- repair.rpm timestamp=1513636349 localtime=Dec 18 14:32:29

The "verify.ssh" line says that the DUT was offline. The "repair.rpm" action means that the system used an RPM device to unplug and replug AC power to the DUT. The logs show that power cycling AC caused the DUT to boot up and return to working order.

The code for "repair.rpm" looks for crash dumps and gathers any it finds. The logs show no dumps, so whatever caused the problem, it does not appear to have been a crash.
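For anyone who wants to repeat this spot check across more repair logs, here is a rough sketch of how the pattern above (a failed "verify.ssh" followed by a "repair.rpm" block that ends GOOD) could be detected automatically. It is not part of autotest. The line layout is assumed purely from the excerpt quoted in this comment, and the names `StatusEntry`, `parse_status_log`, and `recovered_by_rpm` are hypothetical.

```python
import re
from typing import List, NamedTuple


class StatusEntry(NamedTuple):
    status: str      # e.g. "FAIL", "GOOD", "START", "END GOOD"
    operation: str   # e.g. "verify.ssh", "repair.rpm"
    timestamp: int   # value of the timestamp= field (epoch seconds)
    message: str     # trailing free text, possibly empty


# Assumed layout, inferred only from the status.log excerpt in this comment:
#   <STATUS> ---- <operation> timestamp=<epoch> localtime=<Mon DD HH:MM:SS> [message]
ENTRY_RE = re.compile(
    r"^\s*(?P<status>(?:END )?[A-Z]+)\s+----\s+"
    r"(?P<op>\S+)\s+timestamp=(?P<ts>\d+)\s+"
    r"localtime=\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\s*(?P<msg>.*)$"
)


def parse_status_log(text: str) -> List[StatusEntry]:
    """Parse the lines that match the assumed layout; skip everything else."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append(
                StatusEntry(m["status"], m["op"], int(m["ts"]), m["msg"].strip())
            )
    return entries


def recovered_by_rpm(entries: List[StatusEntry]) -> bool:
    """True if verify.ssh failed and a later repair.rpm block ended GOOD."""
    ssh_failed = False
    for e in entries:
        if e.operation == "verify.ssh" and e.status == "FAIL":
            ssh_failed = True
        elif ssh_failed and e.operation == "repair.rpm" and e.status == "END GOOD":
            return True
    return False


# The excerpt quoted above, one entry per line.
SAMPLE = """\
FAIL ---- verify.ssh timestamp=1513636301 localtime=Dec 18 14:31:41 No answer to ping from chromeos2-row4-rack5-host7
START ---- repair.rpm timestamp=1513636301 localtime=Dec 18 14:31:41
GOOD ---- verify.ssh timestamp=1513636349 localtime=Dec 18 14:32:29
GOOD ---- verify.power timestamp=1513636349 localtime=Dec 18 14:32:29
END GOOD ---- repair.rpm timestamp=1513636349 localtime=Dec 18 14:32:29
"""

if __name__ == "__main__":
    entries = parse_status_log(SAMPLE)
    print("recovered by RPM power cycle:", recovered_by_rpm(entries))
    # Seconds between the start and end of the repair.rpm block.
    print("repair.rpm duration (s):", entries[-1].timestamp - entries[1].timestamp)
```

Run over the "status.log" of each repair task, a check like this would separate the DUTs that came back after an AC power cycle from those that failed for other reasons. The real file may use a different separator or carry extra fields, so the regular expression would need checking against an actual log first.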
This smells like a system hang, so let's give it to the kernel. Assigning to a sheriff to find a proper expert.