Powerwash repair can leave behind incorrect dev_image directory
Reported by
jrbarnette@chromium.org,
Apr 13 2016
|
|||||||||
Issue description
In some cases, if powerwash repair fails, it can leave behind
a broken DUT that improperly passes verify checks.
Here's a snippet of history from a particular DUT:
2016-04-11 22:39:35 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/413056-provision/
2016-04-11 22:36:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/413029-verify/
2016-04-11 22:06:03 NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412944-repair/
2016-04-11 22:02:02 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412923-provision/
The provision at 22:02 failed because of a CQ bug that made
the DUT unable to perform AU (the DUT had no working servo).
The repair at 22:06 failed, because AU was the only available
option for repair, and the DUT was unable. However, the repair
did run power wash, and successfully ran stateful update. The
result was that the DUT was left with a pending, working stateful
update that gets applied on the next reboot.
Normally, powerwash makes a DUT fail verify until repaired.
However, the application of the stateful update makes the DUT
appear healthy, because:
1) The python interpreter reappears in /usr/local/bin, and
2) The stateful update regenerates /var/tmp, which clobbers
the /var/tmp/provision_failed flag file.
I don't fully understand how this happened, because I can't see
where the reboot happened in the sequence above. However, the
fact that the DUT passed verify after powerwash is pretty much
dispositve: only a successful stateful_update could do that.
,
Apr 13 2016
I think an analogous problem can happen if stateful update fails,
but the AU step succeeds: On the next boot the DUT will have a
new build with an out-of-date stateful. This might not have the
same consequence of a false repair success, though.
I think what's needed is a check, somewhere, to confirm that
the version of the stateful image matches the installed root FS.
I can think of two ways to make this work:
* Add a verifier check in Autotest.
* Change chromeos_startup to remove rather than install
dev_image_new if there's a mismatch at boot time.
We'd also need to make sure that the stateful content
identifies its proper version, e.g. with a copy of
lsb-release from the root FS.
,
Apr 13 2016
Re this comment:
> I don't fully understand how this happened, because I can't see
> where the reboot happened in the sequence above. However, the
> fact that the DUT passed verify after powerwash is pretty much
> dispositve: only a successful stateful_update could do that.
There's a full explanation of why the reboot occurred in this
particular instance here:
https://bugs.chromium.org/p/chromium/issues/detail?id=602887#c11
N.B. AFAICT, this behavior will only occur when a DUT has a
working servo with a broken USB stick. That's not meant to
happen, but it's not exceedingly rare. So, when this does happen,
we want the broken DUT to _stay_ broken, and not cycle through
failing repair, passing repair, and then failing provision.
,
Apr 18 2016
,
Apr 26 2016
,
Apr 27 2017
This issue has been available for more than 365 days, and should be re-evaluated. Please re-triage this issue. The Hotlist-Recharge-Cold label is applied for tracking purposes, and should not be removed after re-triaging the issue. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 27 2017
Let's call this a (low-priority) feature request; c#2 describes what's being requested.
,
May 16 2017
,
Dec 13 2017
,
May 18 2018
,
May 18 2018
In the current machine_install() code, if AU fails, the update invokes `stateful_update --stateful_change=reset`, which will cancel the pending stateful update. That change was added 7/2017, or a bit more than a year after this was filed. So... Fixed now. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by jrbarnette@chromium.org
, Apr 13 2016For the record, here's the status.log showing the failed repair performing the powerwash: START ---- repair timestamp=1460437567 localtime=Apr 11 22:06:07 GOOD ---- verify.ssh timestamp=1460437589 localtime=Apr 11 22:06:29 GOOD ---- verify.power timestamp=1460437590 localtime=Apr 11 22:06:30 GOOD ---- verify.cros timestamp=1460437592 localtime=Apr 11 22:06:32 FAIL ---- verify.good_au timestamp=1460437593 localtime=Apr 11 22:06:33 Last AU on this DUT failed START ---- repair.firmware timestamp=1460437593 localtime=Apr 11 22:06:33 FAIL ---- repair.firmware timestamp=1460437593 localtime=Apr 11 22:06:33 Firmware repair is not applicable to host chromeos4-row1-rack3-host13. END FAIL ---- repair.firmware timestamp=1460437593 localtime=Apr 11 22:06:33 GOOD ---- verify.tpm timestamp=1460437593 localtime=Apr 11 22:06:33 START ---- repair.powerwash timestamp=1460437593 localtime=Apr 11 22:06:33 START ---- reboot timestamp=1460437593 localtime=Apr 11 22:06:33 GOOD ---- reboot.start timestamp=1460437593 localtime=Apr 11 22:06:33 GOOD ---- reboot.verify timestamp=1460437632 localtime=Apr 11 22:07:12 END GOOD ---- reboot kernel=3.8.11 localtime=Apr 11 22:07:13 timestamp=1460437633 START ---- reboot timestamp=1460437639 localtime=Apr 11 22:07:19 GOOD ---- reboot.start timestamp=1460437639 localtime=Apr 11 22:07:19 GOOD ---- reboot.verify timestamp=1460437661 localtime=Apr 11 22:07:41 END GOOD ---- reboot kernel=3.8.11 localtime=Apr 11 22:07:41 timestamp=1460437661 FAIL ---- repair.powerwash timestamp=1460437722 localtime=Apr 11 22:08:42 Failed to install device image using payload at http://100.107.160.8:8082/update/wolf-release/R50-7978.48.0 on chromeos4-row1-rack3-host13. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError END FAIL ---- repair.powerwash timestamp=1460437722 localtime=Apr 11 22:08:42 START ---- repair.usb timestamp=1460437722 localtime=Apr 11 22:08:42 FAIL ---- repair.usb timestamp=1460437772 localtime=Apr 11 22:09:32 Download image to usb failed. END FAIL ---- repair.usb timestamp=1460437772 localtime=Apr 11 22:09:32 END FAIL ---- repair timestamp=1460437772 localtime=Apr 11 22:09:32 ... and here's the subsequent passing verify: START ---- verify timestamp=1460439401 localtime=Apr 11 22:36:41 GOOD ---- verify.ssh timestamp=1460439401 localtime=Apr 11 22:36:41 GOOD ---- verify.power timestamp=1460439401 localtime=Apr 11 22:36:41 GOOD ---- verify.tpm timestamp=1460439402 localtime=Apr 11 22:36:42 GOOD ---- verify.good_au timestamp=1460439402 localtime=Apr 11 22:36:42 GOOD ---- verify.cros timestamp=1460439405 localtime=Apr 11 22:36:45 GOOD ---- verify.PASS timestamp=1460439405 localtime=Apr 11 22:36:45 END GOOD ---- verify timestamp=1460439405 localtime=Apr 11 22:36:45 {'host_attributes': {'powerunit_hostname': 'chromeos4-row1_2-rack3-rpm2', 'powerunit_outlet': '.A13'}, 'hostname': 'chromeos4-row1-rack3-host13'} verified successfully