fizz-release failing since Dec 8
Issue description
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of bmgordon@google.com
fizz-release:804-831 failed
Builders failed on:
- fizz-release: https://luci-milo.appspot.com/buildbot/chromeos/fizz-release/831
Every build of fizz-release since 804 on Dec 8 is failing with messages that look similar to:
provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row4-rack5-host2: 0) ChromiumOSUpdateError: chromeos2-row4-rack5-host2 cannot recover from reboot at pre-setup of rootfs update, 1) SSHConnectionError: ssh: connect to host 100.115.226.211 port 22: Connection timed out.
Dec 18 2017
Dec 19 2017
I've done a spot check of all the repair tasks on fizz DUTs
in the bvt pool. There are a lot of them: in a 72-hour period,
there were 63 repair events in total. That is enough for every
DUT to fail, on average, at least once per release builder run.
Below are logs of a prototypical event:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack5-host7/61981800-repair
The attached "status.log" file tells the tale. The key is this part:
FAIL ---- verify.ssh timestamp=1513636301 localtime=Dec 18 14:31:41 No answer to ping from chromeos2-row4-rack5-host7
START ---- repair.rpm timestamp=1513636301 localtime=Dec 18 14:31:41
GOOD ---- verify.ssh timestamp=1513636349 localtime=Dec 18 14:32:29
GOOD ---- verify.power timestamp=1513636349 localtime=Dec 18 14:32:29
END GOOD ---- repair.rpm timestamp=1513636349 localtime=Dec 18 14:32:29
The "verify.ssh" line says that the DUT was offline. The "repair.rpm"
action means that the system used an RPM device to unplug/replug AC
power to the DUT. The logs show that power cycling AC caused the DUT
to boot up and return to working order.
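To make the reading of the excerpt concrete, here is a minimal sketch that parses status.log lines of the shape quoted above and measures how long the RPM power cycle took to bring the DUT back. The line layout is inferred only from the excerpt, and the names `parse_status_log` and `rpm_recovery_seconds` are hypothetical helpers, not part of autotest.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: every record looks like "<STATUS> ---- <name> timestamp=<epoch> ...",
# as in the excerpt above. This is not taken from the autotest source.
LINE_RE = re.compile(
    r"^(?P<status>START|END GOOD|END FAIL|GOOD|FAIL)\s+----\s+"
    r"(?P<name>\S+)\s+timestamp=(?P<ts>\d+)"
)

class Event(NamedTuple):
    status: str
    name: str
    timestamp: int

def parse_status_log(text: str) -> List[Event]:
    """Return one Event per recognizable line; other lines are skipped."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append(Event(m["status"], m["name"], int(m["ts"])))
    return events

def rpm_recovery_seconds(events: List[Event]) -> Optional[int]:
    """Seconds from the repair.rpm START to the next GOOD verify.ssh, or None."""
    start = None
    for ev in events:
        if ev.name == "repair.rpm" and ev.status == "START":
            start = ev.timestamp
        elif start is not None and ev.name == "verify.ssh" and ev.status == "GOOD":
            return ev.timestamp - start
    return None

# The five lines quoted from status.log above.
SAMPLE = """\
FAIL ---- verify.ssh timestamp=1513636301 localtime=Dec 18 14:31:41 No answer to ping from chromeos2-row4-rack5-host7
START ---- repair.rpm timestamp=1513636301 localtime=Dec 18 14:31:41
GOOD ---- verify.ssh timestamp=1513636349 localtime=Dec 18 14:32:29
GOOD ---- verify.power timestamp=1513636349 localtime=Dec 18 14:32:29
END GOOD ---- repair.rpm timestamp=1513636349 localtime=Dec 18 14:32:29
"""

if __name__ == "__main__":
    print(rpm_recovery_seconds(parse_status_log(SAMPLE)))  # prints 48
```

For the event above, the DUT answered ssh 48 seconds after the power cycle started (14:31:41 to 14:32:29). Running the same check across the other repair logs would show whether every event recovers this way.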
The code for "repair.rpm" looks for crash dumps and gathers any that
it finds. The logs show no dumps, so whatever caused the problem, it
does not appear to have been a crash.
Dec 19 2017
This smells like a system hang, so let's give it to the kernel. Assigning to a sheriff to find a proper expert.
Jan 26 2018
Haven't seen any more recurrences of this problem and it's been over a month without any updates. Closing.
Comment 1 by bmgordon@chromium.org, Dec 18 2017