Fix update_engine not to return "kNonCriticalUpdateInOOBE" in lab/developer contexts
Reported by
jrbarnette@chromium.org,
Apr 5 2016
|
||||||||||||||
Issue description
Eight DUTs in the test lab have failed to repair with an error
that looks like this:
Failed to install device image using payload at http://172.17.40.28:8082/update/x86-alex-release/R50-7978.36.0 on chromeos2-row1-rack4-host10. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError
Here's the full list of problem DUTs:
hostname S last checked URL
chromeos2-row1-rack4-host10 NO 2016-04-05 08:55:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack4-host10/366557-repair/
chromeos2-row1-rack5-host10 NO 2016-04-05 08:55:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack5-host10/366559-repair/
chromeos2-row3-rack9-host5 NO 2016-04-05 08:00:37 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack9-host5/53470404-repair/
chromeos2-row4-rack2-host9 NO 2016-04-05 08:55:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack2-host9/366556-repair/
chromeos2-row4-rack3-host10 NO 2016-04-05 08:55:00 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack3-host10/366558-repair/
chromeos2-row5-rack7-host8 NO 2016-04-05 08:42:37 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row5-rack7-host8/303187-repair/
chromeos4-row2-rack8-host14 NO 2016-04-05 08:45:31 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host14/343597-repair/
chromeos4-row2-rack8-host21 NO 2016-04-05 08:45:34 http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host21/343598-repair/
A sample status.log from one failure is attached.
,
Apr 5 2016
+shuqianz
,
Apr 5 2016
The CL that changed the update_engine behavior is https://android-review.googlesource.com/#/c/211404/ . The Chrome side change just updated the histogram enum. This is part of bug 587101 . I don't think update_engine has actually been uprevved yet, because https://chromium-review.googlesource.com/#/c/336880/ (ToT) and https://chromium-review.googlesource.com/#/c/336812/ (M50) have not landed. I have an action to fix up the autotests before pushing them.
,
Apr 5 2016
,
Apr 5 2016
update_engine _has_ been uprevved, because this code change is breaking repair. Could this be fixed asap? since CQ is failed due to this reason: https://uberchromegw.corp.google.com/i/chromeos/builders/link-paladin/builds/24598 https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/1547
,
Apr 5 2016
,
Apr 5 2016
,
Apr 5 2016
ToT of chromiumos-overlay is c792b80f2466f9304e7e169cc8c672feae3969d0 and AFAICT that does not include my CL. This failure looks different from my VMtest failure. Is it possible that update_engine was compiled from the -9999 ebuild, installed on these machines, and that is preventing them from getting provisioned for a new test run?
,
Apr 5 2016
Sorry, first time being a deputy...Maybe a silly question, how does that tot doesn't include the change, but CQ reports failures about the entries in the change?
,
Apr 5 2016
I haven't determined how exactly the changes got onto the failing DUTs, but the fact is that they're there, and those changes cause failures. We'll need a fix to update_engine ASAP.
,
Apr 5 2016
I believe that the failed DUT derive from this CL:
https://chromium-review.googlesource.com/#/c/336880/
I've marked the CL -1 Verified.
Meantime, we need to figure out how to fix the currently
broken devices.
,
Apr 5 2016
Make provision send a non-empty deadline attribute in the omaha response.
,
Apr 5 2016
It would be better not to require devserver changes to fix this problem. Provisioning is done by using update_engine_client to ask for an unconditional update. The code that's checking for "is OOBE complete?" needs to ignore OOBE status in that particular update case.
,
Apr 5 2016
What does it mean to "ask for an unconditional update"? It is my understanding that the only flag the client can send to the daemon as part of an AttemptUpdate dbus call is kAttemptUpdateFlagNonInteractive. Maybe we need a new "kAttemptUpdateFlagUnconditional" flag which skips the various policy checks?
,
Apr 5 2016
Update after trying to fix these DUTs with help of Richard: Hostname Status chromeos2-row1-rack4-host10 Yes chromeos2-row1-rack5-host10 Yes chromeos2-row3-rack9-host5 Yes chromeos2-row4-rack2-host9 NO, ---, ask for repair, https://b.corp.google.com/u/0/issues/28025023 chromeos2-row4-rack3-host10 Yes chromeos2-row5-rack7-host8 Yes chromeos4-row2-rack8-host14 Yes chromeos4-row2-rack8-host21 NO, ---, ask for repair offline.
,
Apr 5 2016
It looks like this same failure happened on a veyron_speedy @ chromeos4-row4-rack11-host1.
,
Apr 5 2016
Re comment #14: We request update with a command like this:
update_engine_client --update --omaha_url=$URL
For some appropriate "$URL". I'm not sure whether the "do it
unconditionally" logic comes from the devserver response or from
the command, or from the fact that it's a test image, but somehow,
that request forces an update. That behavior needs to continue,
even after the OOBE check code is added.
,
Apr 5 2016
I don't know if there is any "do it unconditionally" logic, but maybe it is something we should add. Is there any documentation on which tests / processes / product requirements / etc. rely on the existing update_engine policies? It would be helpful to understand the full impact before we start pushing changes. Right now it seems like we're just changing things and then looking to see what breaks.
,
Apr 5 2016
Following a tip from sosa@ I found the code relevant to the
"--update" option to update_engine_client:
int UpdateEngineClient::ProcessFlags() {
// ... irrelevant code skipped
// Initiate an update check, if necessary.
if (do_update_request) {
LOG_IF(WARNING, FLAGS_reboot) << "-reboot flag ignored.";
string app_version = FLAGS_app_version;
if (FLAGS_update && app_version.empty()) {
app_version = "ForcedUpdate";
LOG(INFO) << "Forcing an update by setting app_version to ForcedUpdate.";
}
LOG(INFO) << "Initiating update check and install.";
if (!client_->AttemptUpdate(app_version, FLAGS_omaha_url,
FLAGS_interactive)) {
LOG(ERROR) << "Error checking for update.";
return 1;
}
}
// ...
}
I haven't successfully followed the logic through to the
other side of the D-Bus interface, but basically, when "app_version"
is "ForcedUpdate", update_engine is required to apply the update.
It may be that this rule only applies to test images.
,
Apr 5 2016
Following further, I see that the D-Bus call in question winds up in UpdateAttempter::CheckForUpdate(). AFAICT, that function is expected to trigger an update that will happen unconditionally, provided only that it's running a test image.
,
Apr 5 2016
In the other side of the DBus app the "ForcedUpdate" is passed to the dev-server. Making the devserver reply with a non-empty deadline is the way to signal an uncoditional update. This is a fix for a CL that didn't land in update_engine in Chrome OS yet... do you want to "fix" this by landing another CL in update_engine? That won't recover your existing devices.
,
Apr 5 2016
At this point, the failed devices have been recovered. What's required now is that before the update_engine changes can be committed, update_engine must be fixed not to cause this problem again.
,
Apr 5 2016
Adding to comment #22: As I said before, I'm skeptical that changing the devserver is the right fix. However, if you can make that case, then in order to commit the update_engine change what's required is to make and commit the devserver change, and then work with the Infra team to make sure that the change is deployed.
,
Apr 5 2016
Up to you guys. I'm not gonna make any case nor implement the change. I was pointing out a very simple fix for this and the autoupdate_EndToEndTest that will break after provision_Autoupdate is "fixed" by forcing the update in update_engine with a test-only policy that overrides the normal policies that deviates client-side code under test further from the normal behavior. But it is really up to you.
,
Apr 6 2016
This is no longer a lab issue.
,
Apr 7 2016
This error is popping up on veyron_speedy again @ chromeos4-row4-rack11-host1. Does this need to be manually fixed again?
,
Apr 7 2016
I've confirmed that the DUT has been out of service since the bad CQ run that caused the problem in the lab. I've fixed it so that I expect the DUT to go back into service now. However, I can't explain why it stayed broken and undetected for so long. That'll be a separate bug. For anyone else seeing this: This bug is now about fixing update_engine. If there are more DUTs with this symptom, please file a new bug, or contact the deputy directly.
,
Apr 7 2016
,
Apr 7 2016
> basically, when "app_version" is "ForcedUpdate", update_engine is > required to apply the update. It may be that this rule only applies > to test images. It doesn't look like update_engine implements special treatment for "ForcedUpdate". If "ForcedUpdate" is sent to Omaha instead of the CHROMEOS_RELEASE_VERSION field from /etc/lsb-release, then (presumably) Omaha will not perform a version check or try to compute a delta. > Following further, I see that the D-Bus call in question winds up > in UpdateAttempter::CheckForUpdate(). AFAICT, that function is > expected to trigger an update that will happen unconditionally, > provided only that it's running a test image. The term "unconditionally" is somewhat ambiguous, as there are still other situations in which an update will not be applied. For instance, running `update_engine_client --update` from a USB drive will still trigger the "Not enough slots" policy violation in chromeos_policy.cc. I suspect that CPanel policies might also be applied, although I have not tested it. Since the update_engine binary allows overriding app_version from the command line, I'd rather not use the "ForcedUpdate" version string to indicate that the user is trying to force an update. Instead, it makes more sense to add a flag indicating that the request is from the CLI client, which runs in God Mode so it can bypass restrictions that apply to interactive requests from Chrome.
,
Apr 7 2016
I would make the devserver return a non-empty deadline either always or whenever it gets the "ForcedUpdate" string in the app_version. ForcedUpdate is hard-coded in the update_engine_client command. There's no such thing as "God Mode", that would be just yet another way to specify a policy that should override some other policy (like the one kevin is introducing) but not override other policies (like don't update if you only have one slot).
,
Apr 8 2016
My first attempt at a fix, please review: https://android-review.googlesource.com/#/c/214513 I considered making "ForcedUpdate" a constant, but couldn't find an appropriate source file that is shared between update_engine and update_engine_client. I also considered adding a new kAttemptUpdateFlagCli bit, until I realized that plumbing that through update_engine would require touching a whole bunch of files and APIs because the only flag we're actually passing around is the "interactive" bit. Also, I was unclear on whether it's OK to include common_service.h from the various other pieces of update_engine.
,
Apr 8 2016
,
Apr 12 2016
Same error is happening on wolf-tot-paladin : chromeos4-row1-rack3-host13 https://uberchromegw.corp.google.com/i/chromeos/builders/wolf-tot-paladin/builds/6342
,
Apr 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/655e394da6d038ec1bb81389fc3b9bea0044708d commit 655e394da6d038ec1bb81389fc3b9bea0044708d Author: Kevin Cernekee <cernekee@chromium.org> Date: Sat Apr 16 20:06:14 2016 Force --critical_update when invoking devserver `cros flash ssh://<IP> <IMAGE>` will start up devserver.py on the DUT. Since update_engine now skips non-critical updates until OOBE is done, forced updates need to specify a deadline or else the operation may fail. BUG= chromium:600737 TEST=verify that `cros flash` fails with kNonCriticalUpdateInOOBE, then apply this patch and verify that it works again Change-Id: I5ae35a8cbd98d527430555b6b7b4cf66c9551010 Reviewed-on: https://chromium-review.googlesource.com/339274 Commit-Ready: Kevin Cernekee <cernekee@chromium.org> Tested-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Andrew de los Reyes <adlr@chromium.org> Reviewed-by: Alex Deymo <deymo@chromium.org> [modify] https://crrev.com/655e394da6d038ec1bb81389fc3b9bea0044708d/lib/dev_server_wrapper.py
,
Apr 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93 commit 1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93 Author: Kevin Cernekee <cernekee@chromium.org> Date: Sat Apr 16 20:10:15 2016 autoupdate_EndToEnd: Force --critical_update when invoking devserver This test case will fail if the device has not completed OOBE. Fix this by specifying a deadline in the Omaha response. BUG= chromium:600737 TEST=none Change-Id: I0c2ba71bb482e0bbb4b4b717eaf5f08b51775291 Reviewed-on: https://chromium-review.googlesource.com/339216 Commit-Ready: Kevin Cernekee <cernekee@chromium.org> Tested-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Andrew de los Reyes <adlr@chromium.org> Reviewed-by: Alex Deymo <deymo@chromium.org> [modify] https://crrev.com/1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py
,
May 3 2016
This seems to be continually happening for provision jobs for chromeos2-row5-rack10-host11 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row5-rack10-host11/54877191-provision/debug/ What is the correct way to fix this DUT?
,
May 3 2016
Oy, a guado_moblab.
The general procedure is to log in to the DUT, and run these two
commands:
rm /var/tmp/provision_failed
touch /home/chronos/.oobe_completed
Then force a reverify on the host.
I've run the steps; we'll see what happens.
,
May 3 2016
Also, if anyone _else_ find any more DUTs like this: Please file a new bug. This bug isn't about failing DUTs, it's about the software that caused the DUTs to fail, and that software bug has been fixed. |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by xixuan@chromium.org
, Apr 5 2016