Issue 877107

autoupdate_ForcedOOBEUpdate sometimes kills DUTs

Reported by jrbarnette@chromium.org, Aug 23

Issue description

Recently, a number of DUTs in the test lab have been discovered to have
Google-signed production firmware and software installed.  This isn't
normal.  At deployment time, DUTs have dev-signed firmware and test images.
These units can't be recovered via the usual mechanisms:  They can only
be recovered by taking them back manually through the deployment flow.
Frequently, the procedure requires removing the unit from the shelf.

Exploring the test history of the known failed units, all of them ran
some variant of autoupdate_ForcedOOBEUpdate shortly prior to failing.
Digging through what logs are available, the mechanism seems to be this:
  * The test by design tells Chrome that a mandatory update must be downloaded
    and installed from OOBE.
  * Also by design, the test arranges for the mandatory update to be a test
    image supplied by a lab devserver.
  * Somewhere in the process, the test fails, and the DUT, rather than
    downloading from the lab devserver, checks for updates from the URL
    configured in /etc/lsb-release.
  * Even in test images, the update URL is the standard Omaha URL.  Omaha
    receives the request, and delivers a consumer image to the DUT.
  * The DUT installs the image, and because it's a consumer image, postinst
    for the consumer image runs 'chromeos-firmwareupdate'.
  * Because the DUTs in the lab have write-protect disabled, the firmware
    update installs both the RW and _RO_ firmware, and because it's a consumer
    image, the firmware is Google-signed.
  * Once Google-signed firmware is installed, it's game over.  The DUT is no
    longer able to run tests, and can only be repaired with manual intervention.

The sequence above seems to happen some time _after_ the original test failure.
That is, the download and install are happening in the background even after
the test has stopped.  Subsequent tests eventually fail when the DUT automatically
reboots with the new (consumer) image.

 
I'm going to spend time today evaluating options.  However, unless
there's a clear alternative, the culprit test will need to be disabled
until we have a full explanation.

For the longer term, I think we'll need to do several things:
  * Figure out what cleanup the test is missing, and add that
    cleanup to turn off the OOBE update.
  * Change test images so that they don't have a valid Omaha
    URL in /etc/lsb-release.
  * Look to see if there are also changes to update_engine that would
    break the chain of bad events.
  * Consider re-enabling WP for DUTs in the lab.
  * Consider changing the postinst logic for "should I run firmware
    update" to prevent installing firmware in a case like this.

Cc: vapier@chromium.org bhthompson@google.com
Cc: ahass...@chromium.org
Can you provide links to the example runs you have found?

There are several variations of the test so I am interested to see if all of them are doing this or just the "interrupt" variation.
Based on the one run I do know about:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=221759048

It failed due to a servo error. The test has not used servo in a long time.

I think I could see a scenario where the test fails in the middle of the update (due to a servo failure or something else), removes the new lsb-release file pointing to the devserver, and quits. Then, if some other job (repair/reset) comes along and reboots the DUT before the update fails or completes, it will start a new check to production omaha after the reboot.

One idea: could the test cleanup add a TPM reset?
> There are several variations of the test so I am interested to see
> if all of them are doing this or just the "interrupt" variation.

I saw more than one variation of the test in the failure histories.
I think most/all of the variations can cause this.
Another sample:

$ dut-status -f -u '2018-07-30 23:47:20' -d 2 chromeos4-row7-rack4-host1
chromeos4-row7-rack4-host1
    2018-07-30 22:47:20  NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/855165-repair/
    2018-07-30 22:15:18  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854951-reset/
    2018-07-30 21:59:08  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379537-chromeos-test/
    2018-07-30 21:58:29  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854791-reset/
    2018-07-30 21:57:29  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379458-chromeos-test/
    2018-07-30 21:56:47  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854780-reset/
    2018-07-30 21:55:46  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379415-chromeos-test/
    2018-07-30 21:55:08  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854758-reset/
    2018-07-30 21:51:22  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379343-chromeos-test/
    2018-07-30 21:50:50  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854737-reset/
[ ... ]

The AFE job link:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=222379343

That one's identified as "autoupdate_ForcedOOBEUpdate".

Both of those failures are on old branches (R65 and R66). The variation without a suffix, plain "autoupdate_ForcedOOBEUpdate", seen in comment #8 does not exist anymore.

Looking at the flow of events from the links in comment #7: the last line in the update_engine logs at the time of the failure is:
[0812/122458:INFO:delta_performer.cc(217)] Completed 386/628 operations (61%), 328636781/547713133 bytes downloaded (60%), overall progress 60%

Then in the reset job that follows it, update_engine_client --status reports this:

[0812/122624:INFO:update_engine_client.cc(508)] Querying Update Engine status...
08/12 12:26:25.069 DEBUG|             utils:0286| [stdout] LAST_CHECKED_TIME=1534101981
08/12 12:26:25.070 DEBUG|             utils:0286| [stdout] PROGRESS=0.010031
08/12 12:26:25.070 DEBUG|             utils:0286| [stdout] CURRENT_OP=UPDATE_STATUS_DOWNLOADING
08/12 12:26:25.070 DEBUG|             utils:0286| [stdout] NEW_VERSION=10718.71.2
08/12 12:26:25.070 DEBUG|             utils:0286| [stdout] NEW_SIZE=552249379

So it seems like the update was started again. My guess is that the restart-UI command in reset did this.

It makes sense that this would happen only on older branches, because production omaha will actually have updates for them (vs. running on unreleased ToT builds).
The flow of events in comment #8 is very interesting: the test failed because update_engine_client --status did not change to DOWNLOADING within a certain timeout. But looking at the update_engine logs, the update did start shortly after the test failed.
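
For reference, here is a minimal sketch of the kind of status poll described
above. This is not the test's actual code: the helper name is hypothetical,
and it assumes a server-side autotest 'host' object plus the
update_engine_client --status output format shown in #7.

    # Hedged sketch: poll `update_engine_client --status` on the DUT until
    # CURRENT_OP reports UPDATE_STATUS_DOWNLOADING, or give up after a timeout.
    import time

    def wait_for_downloading(host, timeout=120, poll_interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = host.run('update_engine_client --status').stdout
            fields = dict(line.split('=', 1)
                          for line in out.splitlines() if '=' in line)
            if fields.get('CURRENT_OP') == 'UPDATE_STATUS_DOWNLOADING':
                return True
            time.sleep(poll_interval)
        return False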

Then, if you look at update_engine_client --status in each of the reset logs, you can see the progress grow continually until it reaches FINALIZING, and in the final test logs you can actually see some update_engine logs:
https://stainless.corp.google.com/browse/chromeos-autotest-results/222379537-chromeos-test/
https://storage.cloud.google.com/chromeos-autotest-results/222379537-chromeos-test/chromeos4-row7-rack4-host1/kernel_CryptoAPI/sysinfo/var/log_diff/update_engine/update_engine.20180730-215327

This shows that it was still trying to talk to the devserver:

[0730/215941:INFO:omaha_request_action.cc(680)] Posting an Omaha request to http://100.115.219.132:37374/update

And if I am not mistaken, the new rootfs is a test image:

lsb-release inside the new rootfs:
CHROMEOS_RELEASE_APPID={777CE760-E315-FF86-2837-D51D10BA7C52}
CHROMEOS_BOARD_APPID={777CE760-E315-FF86-2837-D51D10BA7C52}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=4734325
CHROMEOS_ARC_ANDROID_SDK_VERSION=25
GOOGLE_RELEASE=10323.97.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=squawks-release/R65-10323.97.0
CHROMEOS_RELEASE_BUILD_NUMBER=10323
CHROMEOS_RELEASE_BRANCH_NUMBER=97
CHROMEOS_RELEASE_CHROME_MILESTONE=65
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=10323.97.0 (Official Build) dev-channel squawks test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=squawks
CHROMEOS_RELEASE_VERSION=10323.97.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
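
(For reference: the giveaway fields are CHROMEOS_RELEASE_TRACK=testimage-channel
and the trailing "test" in CHROMEOS_RELEASE_DESCRIPTION. Below is a hedged
Python illustration of that check applied to text like the dump above; the
helper itself is hypothetical, not part of any existing tool.)

    # Hedged illustration only: the field names come from the lsb-release
    # dump above, but this helper does not exist anywhere in the codebase.
    def is_test_image(lsb_release_text):
        fields = dict(line.split('=', 1)
                      for line in lsb_release_text.splitlines() if '=' in line)
        return (fields.get('CHROMEOS_RELEASE_TRACK') == 'testimage-channel'
                or fields.get('CHROMEOS_RELEASE_DESCRIPTION', '').endswith('test'))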

The final reset job fails to restart ui, reboots instead, and then we can no longer access the DUT.

Thinking again, restarting UI does not interrupt updates. What happened in #7 was that update_engine_client --status reported progress of 0.000018 even though update_engine.log reported 60%. This sometimes happens when updates are resumed (see crbug.com/874221). The progress of 0.010031 reported by update_engine_client --status in the following reset job was just the update continuing.

So it seems like in both #7 and #8 the updates reported in the reset jobs are still the OOBE job talking to the devserver. 

The update in #7 appears to reboot itself when it is done. I don't see anywhere that it talks to production omaha. The update in #8 completes and sends its update complete event (event code 3) to the devserver. Then it is rebooted by the final reset job and cannot be reached. So again I don't think this talked to production omaha anywhere.
> [ ... ] So again I don't think this talked to production omaha anywhere.

We know that the DUTs installed an image from Omaha:  When you put them
into recovery mode and press <TAB>, they show production firmware keys
and TPM version values, and when you boot them, they're running a
Google-signed OS image.

So somewhere, somehow, the DUTs talked to Omaha.

If you have physical access to one or more of the devices are there update_engine logs for the production omaha update? 
> If you have physical access to one or more of the devices are there update_engine logs for the production omaha update? 

Yes and no.  The first few attempts to look at failed devices did show
that logs were still present from when the DUT was running a test
image.  However, all of the DUTs had been broken for so long that they
deleted the relevant logs almost immediately as part of standard
log rotation/cleanup.

OK.  At this point, I think the next steps are the following:
 A) Disable all forms of the ForcedOOBEUpdate test, both on canary
    and beta channel.
 B) Figure out how this problem happens, and change the test
    to avoid it.
 C) Change the default CHROMEOS_AUSERVER setting in /etc/lsb-release
    to be a non-functional URL, including for beta channel.
 D) Re-enable the test once we've completed either B) or C).

Cc: dchan@google.com
Are there more instances you can link to, similar to #7 and #8? I am happy to look through all you have to find out if there is a production omaha call anywhere in the logs.

There is no need to disable the test; all we need to do is restart update-engine in the test cleanup(), and the in-progress update will stop.
FYI, the test is actually useful (it found a current RB-S bug, crbug.com/873270), so disabling all instances of it seems unnecessarily aggressive.
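
A minimal sketch of that cleanup idea, for illustration only (this is not the
test's actual code; it assumes a server-side autotest with a _host attribute
for the DUT, and the class name simply mirrors the test name):

    # Hypothetical sketch of the proposed cleanup: abort any in-progress
    # update and drop the devserver lsb-release override.
    def cleanup(self):
        try:
            # Restarting the update-engine job aborts an in-progress update.
            self._host.run('restart update-engine', ignore_status=True)
            # Remove the stateful lsb-release override pointing at the devserver.
            self._host.run('rm -f /mnt/stateful_partition/etc/lsb-release',
                           ignore_status=True)
        finally:
            super(autoupdate_ForcedOOBEUpdate, self).cleanup()
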
> [ ... ] all we need to do is restart update-engine in the test cleanup()
> and the in progress update will stop.

IIUC, in the particular case under test, the designed behavior is for
update_engine to restart downloading if it's interrupted.  So, I'm
inclined to be skeptical of this without more information.

Moreover, the proposed fix isn't proof against test aborts that occur at
the wrong moment.

> FYI test is actually useful (found a current RB-S bug  crbug.com/873270 )
> so disabling all instances of it seems unnecessarily aggressive

The problems that we're seeing are expensive to deal with, so it's
necessary to be aggressive.  We need to do something today, and we
want high confidence that whatever we do will address the problem.
So far, disabling the test is the only change I see that meets that
high bar.

Once we've definitively stopped the problem, we'll have breathing room
to work out a more permanent solution, such as by changing CHROMEOS_AUSERVER
in /etc/lsb-release.

If you reboot during an OOBE update it will make another update request at the update screen (or just bypass OOBE altogether as in the bug above). If you restart or stop update-engine the update will just stop.

If the test is aborted without running cleanup() all that will happen is that the new lsb-release file will remain and any subsequent update requests will try to talk to a devserver. 
> If you reboot during an OOBE update it will make another update
> request at the update screen (or just bypass OOBE altogether as
> in the bug above). If you restart or stop update-engine the update
> will just stop.

We have no guarantee that the DUT won't reboot and trigger the problem.


> If the test is aborted without running cleanup() all that will
> happen is that the new lsb-release file will remain and any
> subsequent update requests will try to talk to a devserver. 

We have no guarantee of this, either, since we don't know the full
mechanism for how this problem is occurring.

Even if it is somehow pinging production omaha, that will only return an update response if there is an update available (which won't happen for ToT) and if the omaha config is marked "deadline:now" as needing a forced update. Otherwise it will return no update.
> Even if it is somehow pinging production omaha, that will only
> return an update response if there is an update available (which
> won't happen for ToT) and if the omaha config is marked "deadline:now"
> as needing a forced update. Otherwise it will return no update. 

But despite this, somewhere, somehow, pings to Omaha _are_ getting
update responses.

I believe that there are things that we can do that will allow us
to re-enable the test, even if we don't fully understand the cause.
Most especially, it seems likely that if we fix CHROMEOS_AUSERVER to
have a non-Omaha URL (and merge the fix to the branches), the problem
will stop.  I can also believe that we can track down the root cause
of the problem, and explain how to fix the test to be safe, complete
with a proof via testing.

However, none of the preferred solutions can be executed today and
also have certainty of stopping the problem.  So, we need to stop the
damage ASAP while we sort out a better answer that we can apply in
a week's time or so.

I meant that specifically ToT (daily dev/canary) runs will not get an update from production omaha, because they will always be running a later build than production omaha is serving.

All three runs you have shown me are late branch builds. I can see how production omaha would serve those a build. What were the build numbers of the official build installed on these DUTs by mistake? Knowing that could help us figure out what happened. Did it change branches?  

Do you have an example of this happening on a ToT run?

> Do you have an example of this happening on a ToT run?

No.

The issue is that we can't afford to spend time looking for more data
on this problem before we act.  We need to apply a change that stops
the bleeding soon (like today), and we need absolute certainty that
the change will stop the problem.  Disabling the test meets both
criteria.  So far, nothing else proposed can do that.

+ahassani
Would it be possible to have the dev image NOT talk to omaha at all?
Another option is to block Omaha host/IP for the lab.
If there are only three examples on dead branches, I don't understand the urgency here.
Any test that is (potentially) lab-killing has to be banned until proven otherwise.  The repair involves manual intervention, which is time-consuming.
> If there are only three examples on dead branches, I don't understand the urgency here.

These are just the three that we have with a clear history.  There are two others
that got into this state, but without a complete test history.  And those were
found without a lot of searching.  The techs in the lab also report that they see
this kind of symptom regularly.  Based on that, I'd expect a dozen or two units
in need of attention.

However, making a definitive list is expensive:  Proving this problem has occurred
on any given device requires visual inspection, and we have no good way to create
a list of candidate devices that's short enough to be reasonable:  there are more
than 360 broken devices that might be inspected.

So, although this problem isn't critically urgent (it's P1, not P0), it cannot be
allowed to languish:  There must be mitigation, and the mitigation must be both
certain and soon.

+jrbarnette can you provide a list of hosts or log locations so we can review the logs ourselves?
> +jrbarnette can you provide a list of hosts or log locations so we can review the logs ourselves?

Two known instances with good logs are reported in #c7 and #c8.  There's a
reference to one other in #c5; this is its history:

chromeos4-row2-rack8-host4
    2018-07-28 23:10:19  NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1989221-repair/
    2018-07-28 22:42:07  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1988869-provision/
    2018-07-28 19:53:53  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/221759048-chromeos-test/
    2018-07-28 19:53:20  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1986948-reset/

I know of at least two others, but the logs are incomplete, and I'm not
sure of their names.

I don't understand why devices that have a production image cannot be restored to a test image via servo. I thought the main purpose of the servo setup was to ensure we can bring a DUT back automatically no matter what (unless there's a hardware failure).


I asked Richard the same question in person. His answer, paraphrased:

DUTs in the lab have writable RO firmware and run different firmware signed with test keys. The update installs production firmware with production keys. This breaks servo's ability to run recovery.

> I don't understand why devices that have a production image cannot
> be restored to test image via servo, I thought the main purpose of
> the servo set up was to ensure we can bring a DUT back automatically
> no matter what (unless there's a hardware failure). 

The process of installing dev-signed firmware on a unit requires manual
intervention, roughly equivalent to the manual work we have to do when
devices are first delivered from the factory.

We currently don't have an automated procedure that can use servo to bypass
those manual steps.  It's also not yet clear to me whether it's possible:
part of the process of installing dev-signed firmware requires clearing the
TPM, which can only be done by booting the DUT, which may or may not be
possible without manual intervention.

I did some digging around, and I suspect that it is technically
possible to perform deployment from the factory where the only
manual operation is to remove write-protect.  Everything else is
in principle automatable.

The practical implications for this bug are small:  although in
theory the process can be automated, in practice, the code to
perform the automation doesn't yet exist, and will need several
weeks to develop.  Moreover, the automated procedures would be
predicated on knowing in advance that a DUT has been affected
by this problem, and would still have to be manually invoked by
a human.

Bug 877180 describes the automated procedure that could be used to
fix the DUTs found to be affected by this bug.  Addressing that bug
has value in its own right, since it would make it cheaper to deploy
new DUTs from the factory.

Cc: akes...@chromium.org nsanders@chromium.org
Can we cc someone who has expertise with servo to this bug? If servos are not able to bring a DUT back into service, that seems like a basic problem that should be solved. 

Added a few others to the bug. 


Servo can definitely do this; I've been discussing the procedure offline w/ Richard and John.

The main limitations are:
* Firmware recovery hasn't been needed previously in ATL so there's not much existing infra automation.
* Serial number, hwid, vpd are stored in firmware and not backed up elsewhere, so if they are deleted they can't easily be recovered. So the corrupted firmware should be saved before recovery, and the data restored after repair.
> Can we cc someone who has expertise with servo to this bug? [ ... ]

Wait.  This bug is not about fixing the broken devices; this bug is
about fixing the bugs in update_engine and the associated AU test
that broke the devices in the first place.

The bug about automatically repairing the broken devices is bug 877180.

Project Member

Comment 40 by bugdroid1@chromium.org, Aug 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/02a863a3454edb57060de03357e5b32f022a1e63

commit 02a863a3454edb57060de03357e5b32f022a1e63
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Sat Aug 25 21:33:27 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1187576
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta

Cc: cindyb@chromium.org
Labels: Merge-Request-69
The mitigation change in #c40 needs to be merged to the R69 branch.
The CL is here:
    https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1192119

Project Member

Comment 42 by sheriffbot@chromium.org, Aug 27

Labels: -Merge-Request-69 Merge-Review-69 Hotlist-Merge-Review
This bug requires manual review: We are only 7 days from stable.
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Owner: dhadd...@chromium.org
I'm going to continue working to get the mitigation change
merged back to the branch.  Meantime, follow-up is (has been)
needed on these two items:
 B) Figure out how this problem happens, and change the test
    to avoid it.
 C) Change the default CHROMEOS_AUSERVER setting in /etc/lsb-release
    to be a non-functional URL, including for beta channel.

C) may turn out to require changing update_engine, but I think
somebody else has to own that problem.

Do you have any more examples of this happening to look through?
> Do you have any more examples of this happening to look through?

Given time, we can find them.  The process will be somewhat labor
intensive.

Right now, my guess is that the most cost-effective actions we can take
are
 1) Figure out how to remove all update_engine access to Omaha from
    test images, unless the Omaha URL is supplied on the command line.
 2) Try reproducing the problem.  Focus should probably be on reproducing
    with an older version that's been seen to cause the problem.

Option 1) is particularly important, because it shouldn't be an open-ended
problem, and once it's done, we can reasonably re-enable the test.
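
For context on option 1): autoupdate tests can already name the target server
explicitly on the update_engine_client command line, and the idea is that on a
test image an update check without an explicit URL would simply do nothing. A
hedged sketch of the explicit form (the helper, and the default address taken
from the devserver log line earlier in this bug, are illustrative only):

    # Hedged sketch: trigger an update check against an explicitly named
    # devserver rather than whatever lsb-release points at.
    def check_for_update_from_devserver(host, devserver='100.115.219.132:37374'):
        host.run('update_engine_client --check_for_update '
                 '--omaha_url=http://%s/update' % devserver)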

Cc: dchan@chromium.org
What about blocking omaha access from the lab?  (See #c27.)

> What about blocking omaha access from the lab?  (See #c27.)

I consider this an option of last resort.  My concern is that we'd
be creating a one-off network configuration that we'd have trouble
replicating and maintaining.

Cc: waff...@chromium.org shapiroc@chromium.org
+cc shapiroc, waffles.

We explored disabling omaha access to the lab; the way to do that seems to be from the golden-eye or the omaha end, since omaha has multiple IPs that may change. I'm not sure we should be doing it from that end, though.

Richard, how much work is it to implement the test image change?


> We explored disabling omaha access to the lab, the way to do that
> seems to be from the golden-eye or the omaha end, since omaha has
> multiple IPs that may change. I'm not sure we should be doing it
> from that end though.

I don't know what it even means to disable this from the Golden Eye
or Omaha side.  Although, at first blush, I'd be guessing we shouldn't
do that.


> Richard, how much work is it to implement the test image change?

I believe that it's cheap; that's why I harp on it so much.

However, the changes will have to be done by somebody else, so
my opinion has to be discounted against that.

I believe one or both of the following changes would be entailed:
  * Change /etc/lsb-release so that for test images, CHROMEOS_AUSERVER
    holds an invalid URL.  This change is almost certainly trivial.
    The problem is, I don't know how to prove that it's enough to
    guarantee the fix.
  * Change update_engine so that all uses of kOmahaDefaultAUTestURL
    are blocked, skipped, or otherwise ignored on test images.  There
    are only four references in source, but that doesn't guarantee that
    this is easy.

Here are the logs for chromeos4-row8-rack7-host11

logs.tar
3.6 MB Download
Thank you for the logs! I know what is happening now. It *is* because we are running the test on really old builds.

1. The test creates a new lsb-release at /mnt/stateful_partition/etc/lsb-release. This lsb-release points to a devserver.
2. The test starts a forced update at OOBE.
3. The test fails (e.g. the update started too slowly).
4. The new lsb-release file is deleted as part of the test cleanup()
5. The DUT finishes the update in the background.
6. It automatically reboots when the update is finished. (This is normal for OOBE updates).
7. The DUT reboots back to the OOBE update screen and does another update check (this happens so the DUT can successfully deal with stepping stones).
8. Since we deleted the lsb-release, it makes the request to production omaha (see the sketch below).
9. Since we are running a REALLY old build (it is an R65 build, linked to in comment #50), the production omaha says there is a build and returns the latest serving stable build.

So in this latest example we updated from R65 to R67 stable.
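
To make steps 1 and 8 concrete, here is a hedged Python illustration of the
precedence described above. The real selection happens inside update_engine
(in C++); this is only a model of the behavior, with the paths and field name
taken from the comments and the lsb-release dump earlier in this bug.

    # Hedged illustration, not update_engine's actual code: while the stateful
    # override exists, update checks go to the devserver it names; once the
    # override is deleted, the rootfs /etc/lsb-release value (production Omaha
    # on these builds) is what the next OOBE update check uses.
    import os

    STATEFUL_LSB = '/mnt/stateful_partition/etc/lsb-release'
    ROOTFS_LSB = '/etc/lsb-release'

    def _read_lsb(path):
        with open(path) as f:
            return dict(line.strip().split('=', 1) for line in f if '=' in line)

    def effective_update_url():
        for path in (STATEFUL_LSB, ROOTFS_LSB):
            if os.path.exists(path):
                url = _read_lsb(path).get('CHROMEOS_AUSERVER')
                if url:
                    return url
        return None
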
OK.  I spoke offline with bhthompson@ to try and understand why we
were testing such old builds in the first place.  I don't understand
the full details, but basically, lakitu sometimes needs to build and
test its stuff against branches older than the current stable branch.

The problem is that our current build code doesn't know how to build
and test just lakitu, so those older branch builds end up building and
testing every board, including tests in the lab.

Really, we shouldn't do that.


Putting it all together, I think fixing this bug entails the following
tasks:
  * Fix the test so that it can reasonably be expected to block this
    particular failure in all cases.
  * Change Chrome OS so that test images don't contact the standard
    Omaha URL.  That means changing /etc/lsb-release or update_engine
    or both.
  * Fix it so that building and testing old branches for lakitu doesn't
    require testing in the lab.  Not building all those extra boards would
    be good, too.

Additionally, it would be good if the lab were able to deal with this kind
of failure without requiring manual intervention.  That will be partly
addressed by bug 877180, but additional work is required that will entail
a new project, and a design document.
Cc: norvez@chromium.org
+cc norvez for proposed update engine changes, see bullet 2 in comment #54. 
Can we get a bug filed for bullet #3 so we do not lose track of that? That seems like wasted resources. 
Labels: M-69
Blockedon: 881386 881382 881376
I've filed three bugs, one for each of the items above.

Owner: jrbarnette@chromium.org
Status: Started (was: Assigned)
Given that we have new bugs for the long term fixes, I'm reclaiming
this bug for the immediate mitigation changes.  That means these
merge CLs:
    R65 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211162
    R66 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211163
    R67 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211165
    R68 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211164
    R69 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1192119

All of those need to go in.  The older ones should probably be
considered permanent.

Labels: Merge-Approved-66 Merge-Approved-67 Merge-Approved-65 Merge-Approved-68
Merge approved for 68 and prior.
Project Member

Comment 61 by bugdroid1@chromium.org, Sep 6

Labels: merge-merged-release-R65-10323.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7ee86a49f97dcc7372651e4a5a0c420cfd6c2541

commit 7ee86a49f97dcc7372651e4a5a0c420cfd6c2541
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:41:55 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211162
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/7ee86a49f97dcc7372651e4a5a0c420cfd6c2541/server/site_tests/autoupdate_ForcedOOBEUpdate/control

Project Member

Comment 62 by bugdroid1@chromium.org, Sep 6

Labels: merge-merged-release-R66-10452.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0b252c74f561075a0dc601f30fdb8d323f2dac99

commit 0b252c74f561075a0dc601f30fdb8d323f2dac99
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:43:02 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211163
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta

Project Member

Comment 63 by bugdroid1@chromium.org, Sep 6

Labels: merge-merged-release-R67-10575.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/08be427a7f1f872febd3d89933db2d88b0d17f8c

commit 08be427a7f1f872febd3d89933db2d88b0d17f8c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:44:20 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211165
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta

Project Member

Comment 64 by bugdroid1@chromium.org, Sep 6

Labels: merge-merged-release-R68-10718.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/41974366c3d0990e46a60300ed4cc82c8ba2d1c1

commit 41974366c3d0990e46a60300ed4cc82c8ba2d1c1
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:45:16 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211164
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta

Labels: -Merge-Review-69 Merge-Approved-69
Merge approved, M69.
Project Member

Comment 66 by bugdroid1@chromium.org, Sep 7

Labels: merge-merged-release-R69-10895.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/63371df4f6d822133377c94abacb77ed6949664f

commit 63371df4f6d822133377c94abacb77ed6949664f
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Sep 07 17:16:30 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107 
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1187576
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>
(cherry picked from commit 02a863a3454edb57060de03357e5b32f022a1e63)
Reviewed-on: https://chromium-review.googlesource.com/1192119
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta

Blockedon: -881386 -881382 -881376
Status: Fixed (was: Started)
OK.  At this point, I believe we can be confident that this problem
will quit happening in the lab.  We have three bugs and one project
for how to handle/prevent this sort of failure in future.

So, I'm declaring victory for this bug; the other bugs can track the
other work.

Project Member

Comment 68 by sheriffbot@chromium.org, Sep 11

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Approved-65 -Merge-Approved-66 -Merge-Approved-67 -Merge-Approved-68 -Merge-Approved-69
Project Member

Comment 70 by bugdroid1@chromium.org, Sep 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/eea9f6f3f47206cda3208ad6f5e56df80355d210

commit eea9f6f3f47206cda3208ad6f5e56df80355d210
Author: David Haddock <dhaddock@chromium.org>
Date: Fri Sep 14 19:08:49 2018

Revert "[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate."

This reverts commit 02a863a3454edb57060de03357e5b32f022a1e63.

Reason for revert: I have a CL to fix the test and I will submit this revert with that change

Original change's description:
> [autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.
>
> The test is sometimes causing devices to download and apply consumer
> images from Omaha, which leaves the devices untestable, and requires
> a relatively expensive manual intervention to fix.
>
> This disables the test to stop the bleeding until the problem can be
> put under better control.
>
> BUG= chromium:877107 
> TEST=None
>
> Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
> Reviewed-on: https://chromium-review.googlesource.com/1187576
> Tested-by: Richard Barnette <jrbarnette@chromium.org>
> Reviewed-by: Congbin Guo <guocb@chromium.org>

Bug:  chromium:877107 
Change-Id: If46e80ce26a3dff6b3faae09a9c91a6592c222d6
Reviewed-on: https://chromium-review.googlesource.com/1219549
Commit-Ready: danny chan <dchan@chromium.org>
Tested-by: David Haddock <dhaddock@chromium.org>
Reviewed-by: David Haddock <dhaddock@chromium.org>

[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
