
Issue 602873

Starred by 3 users

Issue metadata

Status: Fixed
Owner: ----
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




Powerwash repair can leave behind incorrect dev_image directory

Reported by jrbarnette@chromium.org, Apr 13 2016

Issue description

In some cases, if powerwash repair fails, it can leave behind
a broken DUT that improperly passes verify checks.

Here's a snippet of history from a particular DUT:
    2016-04-11 22:39:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/413056-provision/
    2016-04-11 22:36:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/413029-verify/
    2016-04-11 22:06:03  NO http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412944-repair/
    2016-04-11 22:02:02  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack3-host13/412923-provision/

The provision at 22:02 failed because of a CQ bug that made
the DUT unable to perform AU (the DUT had no working servo).

The repair at 22:06 failed because AU was the only available
repair option, and the DUT was unable to perform it.  However,
the repair did run powerwash, and successfully ran stateful
update.  The result was that the DUT was left with a pending,
working stateful update that would be applied on the next reboot.
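For context: a stateful update doesn't take effect immediately;
it stages its content and leaves a flag for chromeos_startup to
act on at boot.  Here's a rough Python sketch of that boot-time
behavior, assuming the staged layout (dev_image_new plus a
.update_available flag file); the real logic lives in the
chromeos_startup shell script, and details here are illustrative:

    import os
    import shutil

    STATEFUL = '/mnt/stateful_partition'

    def apply_pending_stateful_update():
        # A .update_available flag marks a staged stateful update.
        flag = os.path.join(STATEFUL, '.update_available')
        if not os.path.exists(flag):
            return
        staged = os.path.join(STATEFUL, 'dev_image_new')
        target = os.path.join(STATEFUL, 'dev_image')
        if os.path.isdir(staged):
            # Replace the live dev_image with the staged one.
            shutil.rmtree(target, ignore_errors=True)
            os.rename(staged, target)
        os.remove(flag)

This is why the failed repair above is dangerous: the staged
content sits dormant until some later reboot silently applies it.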

Normally, powerwash makes a DUT fail verify until it is repaired.
However, applying the stateful update makes the DUT appear
healthy (see the sketch below), because:
 1) The python interpreter reappears in /usr/local/bin, and
 2) The stateful update regenerates /var/tmp, which clobbers
    the /var/tmp/provision_failed flag file.
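For illustration, here's a minimal sketch (in Python, not the
actual Autotest verifier code) of the two checks the applied
stateful update inadvertently satisfies; run() is a hypothetical
helper that executes a command on the DUT and returns its exit
status:

    def dut_appears_healthy(run):
        # Check 1: the python interpreter is back, because the
        # stateful update restored /usr/local/bin.
        have_python = run('test -x /usr/local/bin/python') == 0
        # Check 2: the provision-failed marker is gone, because
        # the stateful update regenerated /var/tmp.
        flag_cleared = run('test ! -f /var/tmp/provision_failed') == 0
        return have_python and flag_cleared

Both signals come back clean even though the DUT was never
actually repaired.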

I don't fully understand how this happened, because I can't see
where the reboot happened in the sequence above.  However, the
fact that the DUT passed verify after powerwash is pretty much
dispositive: only a successful stateful_update could do that.

 
For the record, here's the status.log showing the failed
repair performing the powerwash:

START	----	repair	timestamp=1460437567	localtime=Apr 11 22:06:07	
	GOOD	----	verify.ssh	timestamp=1460437589	localtime=Apr 11 22:06:29	
	GOOD	----	verify.power	timestamp=1460437590	localtime=Apr 11 22:06:30	
	GOOD	----	verify.cros	timestamp=1460437592	localtime=Apr 11 22:06:32	
	FAIL	----	verify.good_au	timestamp=1460437593	localtime=Apr 11 22:06:33	Last AU on this DUT failed
	START	----	repair.firmware	timestamp=1460437593	localtime=Apr 11 22:06:33	
		FAIL	----	repair.firmware	timestamp=1460437593	localtime=Apr 11 22:06:33	Firmware repair is not applicable to host chromeos4-row1-rack3-host13.
	END FAIL	----	repair.firmware	timestamp=1460437593	localtime=Apr 11 22:06:33	
	GOOD	----	verify.tpm	timestamp=1460437593	localtime=Apr 11 22:06:33	
	START	----	repair.powerwash	timestamp=1460437593	localtime=Apr 11 22:06:33	
		START	----	reboot	timestamp=1460437593	localtime=Apr 11 22:06:33	
			GOOD	----	reboot.start	timestamp=1460437593	localtime=Apr 11 22:06:33	
			GOOD	----	reboot.verify	timestamp=1460437632	localtime=Apr 11 22:07:12	
		END GOOD	----	reboot	kernel=3.8.11	localtime=Apr 11 22:07:13	timestamp=1460437633	
		START	----	reboot	timestamp=1460437639	localtime=Apr 11 22:07:19	
			GOOD	----	reboot.start	timestamp=1460437639	localtime=Apr 11 22:07:19	
			GOOD	----	reboot.verify	timestamp=1460437661	localtime=Apr 11 22:07:41	
		END GOOD	----	reboot	kernel=3.8.11	localtime=Apr 11 22:07:41	timestamp=1460437661	
		FAIL	----	repair.powerwash	timestamp=1460437722	localtime=Apr 11 22:08:42	Failed to install device image using payload at http://100.107.160.8:8082/update/wolf-release/R50-7978.48.0 on chromeos4-row1-rack3-host13. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError
	END FAIL	----	repair.powerwash	timestamp=1460437722	localtime=Apr 11 22:08:42	
	START	----	repair.usb	timestamp=1460437722	localtime=Apr 11 22:08:42	
		FAIL	----	repair.usb	timestamp=1460437772	localtime=Apr 11 22:09:32	Download image to usb failed.
	END FAIL	----	repair.usb	timestamp=1460437772	localtime=Apr 11 22:09:32	
END FAIL	----	repair	timestamp=1460437772	localtime=Apr 11 22:09:32	


... and here's the subsequent passing verify:
START	----	verify	timestamp=1460439401	localtime=Apr 11 22:36:41	
	GOOD	----	verify.ssh	timestamp=1460439401	localtime=Apr 11 22:36:41	
	GOOD	----	verify.power	timestamp=1460439401	localtime=Apr 11 22:36:41	
	GOOD	----	verify.tpm	timestamp=1460439402	localtime=Apr 11 22:36:42	
	GOOD	----	verify.good_au	timestamp=1460439402	localtime=Apr 11 22:36:42	
	GOOD	----	verify.cros	timestamp=1460439405	localtime=Apr 11 22:36:45	
	GOOD	----	verify.PASS	timestamp=1460439405	localtime=Apr 11 22:36:45	
END GOOD	----	verify	timestamp=1460439405	localtime=Apr 11 22:36:45	{'host_attributes': {'powerunit_hostname': 'chromeos4-row1_2-rack3-rpm2', 'powerunit_outlet': '.A13'}, 'hostname': 'chromeos4-row1-rack3-host13'} verified successfully

I think an analogous problem can happen if the stateful update
fails but the AU step succeeds:  on the next boot the DUT will
have a new build with an out-of-date stateful partition.  That
case might not have the same consequence of a false repair
success, though.

I think what's needed is a check, somewhere, to confirm that
the version of the stateful image matches the installed root FS.
I can think of two ways to make this work:
  * Add a verifier check in Autotest (sketched below).
  * Change chromeos_startup to remove, rather than install,
    dev_image_new if there's a mismatch at boot time.

We'd also need to make sure that the stateful content
identifies its proper version, e.g. with a copy of
lsb-release from the root FS.
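As a rough illustration of the first option, here's a minimal
sketch of such a verifier check, assuming (hypothetically) that
the stateful payload ships its copy of lsb-release alongside
dev_image_new; read_file() stands in for fetching a file's
contents from the DUT:

    def parse_lsb_release(text):
        """Parse KEY=value lines from an lsb-release file."""
        values = {}
        for line in text.splitlines():
            if '=' in line:
                key, _, value = line.partition('=')
                values[key.strip()] = value.strip()
        return values

    def check_stateful_matches_rootfs(read_file):
        # The stateful-side path is an assumption; the payload
        # would need to be changed to include this copy.
        rootfs = parse_lsb_release(read_file('/etc/lsb-release'))
        stateful = parse_lsb_release(read_file(
            '/mnt/stateful_partition/dev_image_new/lsb-release'))
        key = 'CHROMEOS_RELEASE_VERSION'
        if rootfs.get(key) != stateful.get(key):
            raise Exception(
                'Pending stateful update version %s does not '
                'match root FS version %s' %
                (stateful.get(key), rootfs.get(key)))

A mismatch would then make verify fail, so the DUT stays broken
instead of bouncing between passing verify and failing provision.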

Re this comment:
> I don't fully understand how this happened, because I can't see
> where the reboot happened in the sequence above.  However, the
> fact that the DUT passed verify after powerwash is pretty much
> dispositive: only a successful stateful_update could do that.

There's a full explanation of why the reboot occurred in this
particular instance here:
    https://bugs.chromium.org/p/chromium/issues/detail?id=602887#c11

N.B.  AFAICT, this behavior will only occur when a DUT has a
working servo with a broken USB stick.  That's not meant to
happen, but it's not exceedingly rare.  So, when this does happen,
we want the broken DUT to _stay_ broken, and not cycle through
failing repair, passing repair, and then failing provision.

Comment 4 by autumn@chromium.org, Apr 18 2016

Owner: jrbarnette@chromium.org

Comment 5 by benhenry@google.com, Apr 26 2016

Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS

Comment 6 by sheriffbot@chromium.org, Apr 27 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been available for more than 365 days, and should be re-evaluated. Please re-triage this issue.
The Hotlist-Recharge-Cold label is applied for tracking purposes, and should not be removed after re-triaging the issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Labels: -Type-Bug -Pri-2 Pri-3 Type-Feature
Owner: ----
Let's call this a (low-priority) feature request; c#2 describes
what's being requested.

Comment 8 by aut...@google.com, May 16 2017

Labels: OKR Hotlist-Fixit
Labels: -OKR
Cc: jrbarnette@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: -Hotlist-Recharge-Cold
Status: Available (was: Untriaged)
Status: Fixed (was: Available)
In the current machine_install() code, if AU fails, the code
invokes `stateful_update --stateful_change=reset`, which cancels
the pending stateful update.
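Schematically, the shape of that fix looks like the following
sketch (hypothetical names, not the actual machine_install()
source):

    def machine_install_sketch(do_stateful_update, do_au, run):
        # Stage the stateful content (dev_image_new etc.) first.
        do_stateful_update()
        try:
            # Then run the root FS auto-update.
            do_au()
        except Exception:
            # AU failed: reset the staged stateful content so a
            # later reboot can't silently apply it.
            run('stateful_update --stateful_change=reset')
            raise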

That change was added in July 2017, a bit more than a year after
this issue was filed.

So... Fixed now.
