New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 600737 link

Starred by 5 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 601511



Sign in to add a comment

Fix update_engine not to return "kNonCriticalUpdateInOOBE" in lab/developer contexts

Reported by jrbarnette@chromium.org, Apr 5 2016

Issue description

Eight DUTs in the test lab have failed to repair with an error
that looks like this:
    Failed to install device image using payload at http://172.17.40.28:8082/update/x86-alex-release/R50-7978.36.0 on chromeos2-row1-rack4-host10. Update failed. Returned update_engine error code: ERROR_CODE=49, ERROR_MESSAGE=ErrorCode::kNonCriticalUpdateInOOBE. Reported error: AutoservRunError

Here's the full list of problem DUTs:
    hostname                       S   last checked         URL
    chromeos2-row1-rack4-host10    NO  2016-04-05 08:55:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack4-host10/366557-repair/
    chromeos2-row1-rack5-host10    NO  2016-04-05 08:55:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row1-rack5-host10/366559-repair/
    chromeos2-row3-rack9-host5     NO  2016-04-05 08:00:37  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack9-host5/53470404-repair/
    chromeos2-row4-rack2-host9     NO  2016-04-05 08:55:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack2-host9/366556-repair/
    chromeos2-row4-rack3-host10    NO  2016-04-05 08:55:00  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack3-host10/366558-repair/
    chromeos2-row5-rack7-host8     NO  2016-04-05 08:42:37  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row5-rack7-host8/303187-repair/
    chromeos4-row2-rack8-host14    NO  2016-04-05 08:45:31  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host14/343597-repair/
    chromeos4-row2-rack8-host21    NO  2016-04-05 08:45:34  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row2-rack8-host21/343598-repair/

A sample status.log from one failure is attached.

 
chromeos2-row1-rack4-host10.log
2.7 KB View Download
Cc: tnagel@chromium.org cernekee@chromium.org
This error just starts from 4.1, so it may be caused by recent change https://codereview.chromium.org/1848533002,

related bug: https://bugs.chromium.org/p/chromium/issues/detail?id=587101

cernekee@, can u have a look at it?
Cc: shuqianz@chromium.org
+shuqianz
The CL that changed the update_engine behavior is https://android-review.googlesource.com/#/c/211404/ .  The Chrome side change just updated the histogram enum.  This is part of  bug 587101 .

I don't think update_engine has actually been uprevved yet, because https://chromium-review.googlesource.com/#/c/336880/ (ToT) and https://chromium-review.googlesource.com/#/c/336812/ (M50) have not landed.  I have an action to fix up the autotests before pushing them.
Cc: aaboagye@chromium.org abhishekbh@chromium.org
update_engine _has_ been uprevved, because this code change is breaking repair.

Could this be fixed asap? since CQ is failed due to this reason:

https://uberchromegw.corp.google.com/i/chromeos/builders/link-paladin/builds/24598
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/1547
Cc: -aaboagye@chromium.org -tnagel@chromium.org kevcheng@chromium.org
Cc: aaboagye@chromium.org
Cc: de...@chromium.org
ToT of chromiumos-overlay is c792b80f2466f9304e7e169cc8c672feae3969d0 and AFAICT that does not include my CL.

This failure looks different from my VMtest failure.  Is it possible that update_engine was compiled from the -9999 ebuild, installed on these machines, and that is preventing them from getting provisioned for a new test run?
Sorry, first time being a deputy...Maybe a silly question, how does that tot doesn't include the change, but CQ reports failures about the entries in the change?
I haven't determined how exactly the changes got onto the failing DUTs,
but the fact is that they're there, and those changes cause failures.

We'll need a fix to update_engine ASAP.

I believe that the failed DUT derive from this CL:
    https://chromium-review.googlesource.com/#/c/336880/

I've marked the CL -1 Verified.

Meantime, we need to figure out how to fix the currently
broken devices.

Make provision send a non-empty deadline attribute in the omaha response.
It would be better not to require devserver changes to fix
this problem.

Provisioning is done by using update_engine_client to ask for
an unconditional update.  The code that's checking for "is OOBE
complete?" needs to ignore OOBE status in that particular update
case.

What does it mean to "ask for an unconditional update"?

It is my understanding that the only flag the client can send to the daemon as part of an AttemptUpdate dbus call is kAttemptUpdateFlagNonInteractive.  Maybe we need a new "kAttemptUpdateFlagUnconditional" flag which skips the various policy checks?
Update after trying to fix these DUTs with help of Richard:

Hostname                       Status

chromeos2-row1-rack4-host10    Yes
chromeos2-row1-rack5-host10    Yes
chromeos2-row3-rack9-host5     Yes
chromeos2-row4-rack2-host9     NO, ---, ask for repair, https://b.corp.google.com/u/0/issues/28025023
chromeos2-row4-rack3-host10    Yes
chromeos2-row5-rack7-host8     Yes
chromeos4-row2-rack8-host14    Yes
chromeos4-row2-rack8-host21    NO, ---, ask for repair offline.
It looks like this same failure happened on a veyron_speedy @ chromeos4-row4-rack11-host1.
Re comment #14:  We request update with a command like this:
    update_engine_client --update --omaha_url=$URL

For some appropriate "$URL".  I'm not sure whether the "do it
unconditionally" logic comes from the devserver response or from
the command, or from the fact that it's a test image, but somehow,
that request forces an update.  That behavior needs to continue,
even after the OOBE check code is added.

I don't know if there is any "do it unconditionally" logic, but maybe it is something we should add.

Is there any documentation on which tests / processes / product requirements / etc. rely on the existing update_engine policies?  It would be helpful to understand the full impact before we start pushing changes.  Right now it seems like we're just changing things and then looking to see what breaks.
Cc: sosa@chromium.org
Owner: cernekee@chromium.org
Following a tip from sosa@ I found the code relevant to the
"--update" option to update_engine_client:

int UpdateEngineClient::ProcessFlags() {
  // ... irrelevant code skipped
  // Initiate an update check, if necessary.
  if (do_update_request) {
    LOG_IF(WARNING, FLAGS_reboot) << "-reboot flag ignored.";
    string app_version = FLAGS_app_version;
    if (FLAGS_update && app_version.empty()) {
      app_version = "ForcedUpdate";
      LOG(INFO) << "Forcing an update by setting app_version to ForcedUpdate.";
    }
    LOG(INFO) << "Initiating update check and install.";
    if (!client_->AttemptUpdate(app_version, FLAGS_omaha_url,
                                FLAGS_interactive)) {
      LOG(ERROR) << "Error checking for update.";
      return 1;
    }
  }
  // ...
}

I haven't successfully followed the logic through to the
other side of the D-Bus interface, but basically, when "app_version"
is "ForcedUpdate", update_engine is required to apply the update.
It may be that this rule only applies to test images.

Following further, I see that the D-Bus call in question winds up
in UpdateAttempter::CheckForUpdate().  AFAICT, that function is
expected to trigger an update that will happen unconditionally,
provided only that it's running a test image.

In the other side of the DBus app the "ForcedUpdate" is passed to the dev-server.

Making the devserver reply with a non-empty deadline is the way to signal an uncoditional update. This is a fix for a CL that didn't land in update_engine in Chrome OS yet... do you want to "fix" this by landing another CL in update_engine? That won't recover your existing devices.
At this point, the failed devices have been recovered.

What's required now is that before the update_engine changes
can be committed, update_engine must be fixed not to cause this
problem again.

Adding to comment #22:  As I said before, I'm skeptical that
changing the devserver is the right fix.  However, if you can
make that case, then in order to commit the update_engine change
what's required is to make and commit the devserver change, and
then work with the Infra team to make sure that the change is
deployed.

Cc: -de...@chromium.org
Up to you guys. I'm not gonna make any case nor implement the change.

I was pointing out a very simple fix for this and the autoupdate_EndToEndTest that will break after provision_Autoupdate is "fixed" by forcing the update in update_engine with a test-only policy that overrides the normal policies that deviates client-side code under test further from the normal behavior. But it is really up to you.
Cc: -shuqianz@chromium.org -sosa@chromium.org -kevcheng@chromium.org
Components: Internals>Installer
Labels: -Infra-ChromeOS
Status: Assigned (was: Available)
This is no longer a lab issue.

Cc: xixuan@chromium.org adlr@chromium.org shchen@chromium.org
This error is popping up on veyron_speedy again @ chromeos4-row4-rack11-host1.  Does this need to be manually fixed again?
Summary: Fix update_engine not to return "kNonCriticalUpdateInOOBE" in lab/developer contexts (was: Devices failing to update for repair, error is "kNonCriticalUpdateInOOBE")
I've confirmed that the DUT has been out of service since the
bad CQ run that caused the problem in the lab.

I've fixed it so that I expect the DUT to go back into service
now.  However, I can't explain why it stayed broken and undetected
for so long.  That'll be a separate bug.

For anyone else seeing this:  This bug is now about fixing
update_engine.  If there are more DUTs with this symptom,
please file a new bug, or contact the deputy directly.

Blocking: 601511
> basically, when "app_version" is "ForcedUpdate", update_engine is
> required to apply the update.  It may be that this rule only applies
> to test images.

It doesn't look like update_engine implements special treatment for "ForcedUpdate".  If "ForcedUpdate" is sent to Omaha instead of the CHROMEOS_RELEASE_VERSION field from /etc/lsb-release, then (presumably) Omaha will not perform a version check or try to compute a delta.

> Following further, I see that the D-Bus call in question winds up
> in UpdateAttempter::CheckForUpdate().  AFAICT, that function is
> expected to trigger an update that will happen unconditionally,
> provided only that it's running a test image.

The term "unconditionally" is somewhat ambiguous, as there are still other situations in which an update will not be applied.  For instance, running `update_engine_client --update` from a USB drive will still trigger the "Not enough slots" policy violation in chromeos_policy.cc.  I suspect that CPanel policies might also be applied, although I have not tested it.

Since the update_engine binary allows overriding app_version from the command line, I'd rather not use the "ForcedUpdate" version string to indicate that the user is trying to force an update.  Instead, it makes more sense to add a flag indicating that the request is from the CLI client, which runs in God Mode so it can bypass restrictions that apply to interactive requests from Chrome.

I would make the devserver return a non-empty deadline either always or whenever it gets the "ForcedUpdate" string in the app_version. ForcedUpdate is hard-coded in the update_engine_client command.

There's no such thing as "God Mode", that would be just yet another way to specify a policy that should override some other policy (like the one kevin is introducing) but not override other policies (like don't update if you only have one slot).
My first attempt at a fix, please review:

https://android-review.googlesource.com/#/c/214513

I considered making "ForcedUpdate" a constant, but couldn't find an appropriate source file that is shared between update_engine and update_engine_client.

I also considered adding a new kAttemptUpdateFlagCli bit, until I realized that plumbing that through update_engine would require touching a whole bunch of files and APIs because the only flag we're actually passing around is the "interactive" bit.  Also, I was unclear on whether it's OK to include common_service.h from the various other pieces of update_engine.
Cc: dgarr...@chromium.org sosa@chromium.org
Same error is happening on wolf-tot-paladin : chromeos4-row1-rack3-host13

https://uberchromegw.corp.google.com/i/chromeos/builders/wolf-tot-paladin/builds/6342
Project Member

Comment 34 by bugdroid1@chromium.org, Apr 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/655e394da6d038ec1bb81389fc3b9bea0044708d

commit 655e394da6d038ec1bb81389fc3b9bea0044708d
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Sat Apr 16 20:06:14 2016

Force --critical_update when invoking devserver

`cros flash ssh://<IP> <IMAGE>` will start up devserver.py on the DUT.
Since update_engine now skips non-critical updates until OOBE is done,
forced updates need to specify a deadline or else the operation may fail.

BUG= chromium:600737 
TEST=verify that `cros flash` fails with kNonCriticalUpdateInOOBE, then
     apply this patch and verify that it works again

Change-Id: I5ae35a8cbd98d527430555b6b7b4cf66c9551010
Reviewed-on: https://chromium-review.googlesource.com/339274
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Andrew de los Reyes <adlr@chromium.org>
Reviewed-by: Alex Deymo <deymo@chromium.org>

[modify] https://crrev.com/655e394da6d038ec1bb81389fc3b9bea0044708d/lib/dev_server_wrapper.py

Project Member

Comment 35 by bugdroid1@chromium.org, Apr 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93

commit 1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Sat Apr 16 20:10:15 2016

autoupdate_EndToEnd: Force --critical_update when invoking devserver

This test case will fail if the device has not completed OOBE.  Fix this
by specifying a deadline in the Omaha response.

BUG= chromium:600737 
TEST=none

Change-Id: I0c2ba71bb482e0bbb4b4b717eaf5f08b51775291
Reviewed-on: https://chromium-review.googlesource.com/339216
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Andrew de los Reyes <adlr@chromium.org>
Reviewed-by: Alex Deymo <deymo@chromium.org>

[modify] https://crrev.com/1fd5e3042c6d8de3bdbb8cc169b5e0c5bd61af93/server/site_tests/autoupdate_EndToEndTest/autoupdate_EndToEndTest.py

This seems to be continually happening for provision jobs for chromeos2-row5-rack10-host11

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row5-rack10-host11/54877191-provision/debug/

What is the correct way to fix this DUT?
Oy, a guado_moblab.

The general procedure is to log in to the DUT, and run these two
commands:
    rm /var/tmp/provision_failed
    touch /home/chronos/.oobe_completed 

Then force a reverify on the host.

I've run the steps; we'll see what happens.

Status: Verified (was: Assigned)
Also, if anyone _else_ find any more DUTs like this:  Please
file a new bug.  This bug isn't about failing DUTs, it's about
the software that caused the DUTs to fail, and that software
bug has been fixed.

Sign in to add a comment