On whirlwind, powerwash breaks stateful_update
Reported by
jrbarnette@chromium.org,
Jun 22 2016
|
||||||||||
Issue description
Consider the following repair job on a whirlwind:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row10-rack12-host2/56738755-repair/
If you go through status.log from the repair, you find
that repair.powerwash was attempted, and failed with this
line:
FAIL ---- repair.powerwash timestamp=1466542915 localtime=Jun 21 14:01:55 Failed to perform stateful update on chromeos4-row10-rack12-host2
Looking in debug/autoserv.DEBUG, you find these lines:
06/21 14:01:02.456 DEBUG| ssh_host:0180| Running (ssh) '/tmp/stateful_update http://100.107.160.4:8082/static/whirlwind-release/R50-7978.2.0 --stateful_change=clean 2>&1'
06/21 14:01:02.886 DEBUG| base_utils:0278| [stdout] Downloading stateful payload from http://100.107.160.4:8082/static/whirlwind-release/R50-7978.2.0/stateful.tgz
06/21 14:01:02.888 DEBUG| base_utils:0278| [stdout] eval: 1: wget: not found
Logging into the DUT (now running the repair test image), you
can see this:
root@chromeos4-row10-rack12-host2 $ which wget
/usr/local/bin/wget
So, Powerwash on whirlwind test images wipes out `wget`, which
means that repair.powerwash must always fail.
There are two possible fixes:
* Change whirlwind base images to include wget.
* Change stateful_update to use curl, or else to support a
choice of either wget or curl.
,
Jun 23 2016
> How does this work on other chromeos images? Is wget always > included in the base image? Yes, on standard Chrome OS, both curl and wget are part of the root file system in the base image.. > Does whirlwind not depend on some package that normally pulls this in? That would be the expected reason for the difference. There's also an off chance that including 'wget' depends directly on the 'embedded' use flag.
,
Jun 23 2016
I just wonder if wget should be added to a chromeos base package instead of specifically whirlwind or jetstream. Since autotest seems to require it after a powerwash, and this applies to all boards, it seems like this should be pulled into chromeos generically.
,
Jun 23 2016
> I just wonder if wget should be added to a chromeos base > package instead of specifically whirlwind or jetstream. Yes, making sure 'wget' is a dependency of the base package would be a Good Thing.
,
Jun 23 2016
I will make a patch. But can you confirm that "repair.powerwash" is only used on test images? ie this is part of autotest infrastructure?
,
Jun 23 2016
> But can you confirm that "repair.powerwash" is only used > on test images? ie this is part of autotest infrastructure? I'm not sure what you're asking here. Powerwash is a product feature, and it gets tested in contexts other than repair. 'repair.powerwash' _is_ specific to the autotest infrastructure, but it's not the only way this bug can be triggered. All of the cases at work here apply only to test images in the test lab. However, if the fix is merely to make 'wget' part of the base image, then the real risk you want to ask about is "is it OK to include this package in the consumer product, even though it won't be used (or needed) by the consumers?"
,
Jun 23 2016
I intend to make 'wget' only part of the test image, such that this won't impact consumers. I wanted to know if wget was only required by the test infrastructure itself, and it sounds like that is the case.
,
Jun 23 2016
> I intend to make 'wget' only part of the test image, > such that this won't impact consumers. Ah. No, that won't work. 'wget' is _already_ part of the test image only. That's the bug. > I wanted to know if wget was only required by the > test infrastructure itself, and it sounds like that > is the case. The problem is that test-image specific bits are stored in stateful, under /usr/local. When you run 'powerwash', those bits get wiped out. In the case of whirlwind, the bits that get wiped out include the 'wget' command, which breaks our ability to update the device in the lab. There's a powerwash test that runs regularly with every release build, so the only thing protecting us from a total whirlwind outage is that for whirlwind, servo reliably repairs the devices by re-installing from USB. But we'd rather not make lab reliability only as good as servo repair.
,
Jun 24 2016
I'm a little confused. If wget is part of the test image, why is powerwash having problems restoring it? I think python may have a similar issue, I think I've seen python commands fail when Whirlwind DUTs are in this state. I initially thought that issue and the wget issue were linked.
,
Jun 24 2016
> I'm a little confused. If wget is part of the test image, > why is powerwash having problems restoring it? Powerwash restores nothing. It wipes out all content in the stateful partition. With the exception of certain upstart jobs, all test-image specific content is stored in the stateful partition. So, powerwash wipes out all test-image only commands, including 'python' and 'wget'. For purposes of updating a DUT, we can live without python; it's not needed for update. However, the 'stateful_update' script relies on 'wget'. So, because powerwash obliterates 'wget', we can't apply updates after powerwash. Being unable to apply an update isn't 100% fatal; servo repair knows how to fix it, but that's undesirable. One of the standard repair procedures is "power wash, then apply an update". That procedure breaks on whirlwind.
,
Jun 28 2016
,
Jun 28 2016
,
Jun 29 2016
https://chromium-review.googlesource.com/#/c/356826/ > If wget is part of the test image, why is powerwash having problems restoring it? I only meant to include it on the root file system for test images. See above patch. Are you able to test with this patch?
,
Jun 30 2017
,
Jun 30 2017
,
Jul 17 2017
Removing myself as I nolonger work on this.
,
Sep 16 2017
Updating this bug with status. There are two ways to fix this
problem:
1) Add 'wget' to the root file system in test images, so that
'wget' is present even after powerwash. This in the approach
of the CL in c#13.
2) Change 'stateful_update' to use 'curl' instead of 'wget'.
The 'curl' command is already present in the root file system
of all images, because it's used by chromeos-base/dev-install,
which is part of the base Chromium OS image.
I'm partial to option 2). It has the advantage that once the change
is deployed to the lab, the lab will be able to repair any build.
That means the change doesn't have to be merged to beta or stable
brances, and we don't have to worry about old FSI images that must
be installed from time-to-time.
Note that option 2) works in part because 'stateful_update' is part
of stateful(!), so after powerwash, we install the script from the
devserver.
Option 1) has the advantage of not depending on the quirks of the
lab installation flows, but is otherwise inferior because it won't
allow repairing the old builds without 'wget' that will still be
present in the lab from time-to-time.
,
Sep 17 2017
let's use curl to fetch things. we usually need libcurl anyways by system packages.
,
Sep 27 2017
Hrmph! 'wget' was added to jetstream images with crosreview.com/i/383874, which landed in R61-9610.0.0 as a fix for b/62098277. Changing stateful_update to use 'curl' is arguably still better (see c#19 and c#20), so let's leave this open. Changing (and testing) stateful_update is more a thing requiring expertise from the Infra team, though, so let's also change ownership.
,
Nov 27 2017
Should be fixed with the CL for bug 785487 : https://chromium-review.googlesource.com/773114
,
Jan 22 2018
,
Jan 23 2018
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by cbook@google.com
, Jun 23 2016