autoupdate_ForcedOOBEUpdate sometimes kills DUTs
Reported by jrbarnette@chromium.org, Aug 23
Issue description
Recently, a number of DUTs in the test lab have been discovered to have
Google-signed production firmware and software installed. This isn't
normal. At deployment time, DUTs have dev-signed firmware and test images.
These units can't be recovered via the usual mechanisms: They can only
be recovered by taking them back manually through the deployment flow.
Frequently, the procedure requires removing the unit from the shelf.
The test history of the known failed units shows that all of them ran
some variant of autoupdate_ForcedOOBEUpdate shortly before failing.
Digging through the available logs, the mechanism appears to be this:
* By design, the test tells Chrome that a mandatory update must be downloaded
and installed from OOBE.
* Also by design, the test arranges for the mandatory update to be a test
image supplied by a lab devserver.
* Somewhere in the process, the test fails, and the DUT, rather than
downloading from the lab devserver, checks for updates from the
URL configured in /etc/lsb-release.
* Even in test images, the update URL is the standard Omaha URL. Omaha
receives the request and delivers a consumer image to the DUT.
* The DUT installs the image, and because it's a consumer image, postinst
for the consumer image runs 'chromeos-firmwareupdate'.
* Because the DUTs in the lab have write-protect disabled, the firmware
update installs both the RW and _RO_ firmware, and because it's a consumer
image, the firmware is Google-signed.
* Once Google-signed firmware is installed, it's game over. The DUT is no
longer able to run tests, and can only be repaired with manual intervention.
The sequence above seems to happen some time _after_ the original test failure.
That is, the download and install continue in the background even after
the test has stopped. Subsequent tests eventually fail when the DUT automatically
reboots with the new (consumer) image. A sketch of the update-URL selection
behavior that makes this possible follows below.
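For reference, this is roughly how the update URL gets chosen on a test image.
The sketch below is illustrative only: the devserver address is a placeholder,
and the override path is the one described later in this thread (see the
root-cause comment); the default URL is taken from the lsb-release dump posted
further down.

$ # Default server baked into the test image (this is the production Omaha URL):
$ grep CHROMEOS_AUSERVER /etc/lsb-release
CHROMEOS_AUSERVER=https://tools.google.com/service/update2

$ # The test overrides it by writing a stateful lsb-release that points at a
$ # lab devserver; update_engine honors this override on test images:
$ echo 'CHROMEOS_AUSERVER=http://<devserver>:<port>/update' \
    > /mnt/stateful_partition/etc/lsb-release

$ # If that override is removed (e.g. by test cleanup) while an update flow is
$ # still pending, the next check falls back to the default URL above.
$ rm /mnt/stateful_partition/etc/lsb-release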
Aug 23
For the longer term, I think we'll need to do several things:
* Figure out what cleanup the test is missing, and add that
cleanup to turn off the OOBE update.
* Change test images so that they don't have a valid Omaha
URL in /etc/lsb-release.
* Look to see if there are also changes to update_engine that would
break the chain of bad events.
* Consider re-enabling WP for DUTs in the lab.
* Consider changing the postinst logic for "should I run firmware
update" to prevent installing firmware in a case like this (a rough
sketch of such a guard follows below).
Aug 23
Can you provide links to the example runs you have found? There are several variations of the test so I am interested to see if all of them are doing this or just the "interrupt" variation.
Aug 23
Based on the one run I do know about:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=221759048
It failed due to a servo error; the test has not used servo in a long time.
I could see a scenario where the test fails in the middle of the update (due
to a servo failure or something else), removes the new lsb-release file
pointing to the devserver, and quits. Then, if some other job (repair/reset)
comes along and reboots the DUT before the update fails or completes, it will
start a new check against production Omaha after the reboot. One idea: the
test cleanup could add a TPM reset?
Aug 23
> There are several variations of the test so I am interested to see
> if all of them are doing this or just the "interrupt" variation.
I saw more than one variation of the test in the failure histories. I think
most/all of the variations can cause this.
Aug 23
One sample failure sequence:
$ dut-status -d 2 -f -u '2018-08-12 14:00:00' chromeos4-row5-rack12-host1
chromeos4-row5-rack12-host1
2018-08-12 13:02:44 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host1/1481543-repair/
2018-08-12 12:30:42 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host1/1481325-reset/
2018-08-12 12:29:23 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/226650058-chromeos-test/
2018-08-12 12:27:50 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host1/1481313-reset/
2018-08-12 12:26:29 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/226652298-chromeos-test/
2018-08-12 12:25:43 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host1/1481301-reset/
2018-08-12 12:20:54 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/226649446-chromeos-test/
2018-08-12 12:19:08 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host1/1481261-reset/
[ ... ]
This is the test job that set the problem in motion:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=226649446
"control.srv" shows that as "autoupdate_ForcedOOBEUpdate.interrupt.full"
Aug 23
Another sample:
$ dut-status -f -u '2018-07-30 23:47:20' -d 2 chromeos4-row7-rack4-host1
chromeos4-row7-rack4-host1
2018-07-30 22:47:20 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/855165-repair/
2018-07-30 22:15:18 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854951-reset/
2018-07-30 21:59:08 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379537-chromeos-test/
2018-07-30 21:58:29 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854791-reset/
2018-07-30 21:57:29 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379458-chromeos-test/
2018-07-30 21:56:47 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854780-reset/
2018-07-30 21:55:46 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379415-chromeos-test/
2018-07-30 21:55:08 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854758-reset/
2018-07-30 21:51:22 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/222379343-chromeos-test/
2018-07-30 21:50:50 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row7-rack4-host1/854737-reset/
[ ... ]
The AFE job link:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=222379343
That one's identified as "autoupdate_ForcedOOBEUpdate".
Aug 23
Both of those failures are on old branches (R65 and R66). The test variation
in comment #8 without a suffix, "autoupdate_ForcedOOBEUpdate", does not exist
anymore.

Looking at the flow of events from the links in comment #7, the last line in
the update_engine logs at the time of the failure is:

[0812/122458:INFO:delta_performer.cc(217)] Completed 386/628 operations (61%), 328636781/547713133 bytes downloaded (60%), overall progress 60%

Then, in the reset job that follows it, update_engine_client --status reports this:

[0812/122624:INFO:update_engine_client.cc(508)] Querying Update Engine status...
08/12 12:26:25.069 DEBUG| utils:0286| [stdout] LAST_CHECKED_TIME=1534101981
08/12 12:26:25.070 DEBUG| utils:0286| [stdout] PROGRESS=0.010031
08/12 12:26:25.070 DEBUG| utils:0286| [stdout] CURRENT_OP=UPDATE_STATUS_DOWNLOADING
08/12 12:26:25.070 DEBUG| utils:0286| [stdout] NEW_VERSION=10718.71.2
08/12 12:26:25.070 DEBUG| utils:0286| [stdout] NEW_SIZE=552249379

So it seems like the update was started again; my guess is that the restart-UI
command in reset did this. It makes sense that this would happen on older
branches only, because production Omaha will actually have updates for these
(vs. when running on unreleased ToT builds).
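For anyone repeating this kind of triage on a live DUT, the checks above boil
down to roughly the following; the paths and flags are the usual ones on a
test image, so adjust as needed:

$ # What update_engine thinks it is doing right now:
$ update_engine_client --status

$ # What it was doing around the failure, according to its own log:
$ grep -E 'delta_performer|Posting an Omaha request' /var/log/update_engine.log | tail -n 20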
Aug 23
The flow of events in comment #8 is very interesting. The test failed because
update_engine_client --status did not change to DOWNLOADING within a certain
timeout, but the update_engine logs show the update did start shortly after
the test failed. If you then look at update_engine_client --status in each of
the reset logs, you can see the update progress grow continually until it gets
to FINALIZING, and in the final test logs you can actually see some
update_engine logs:

https://stainless.corp.google.com/browse/chromeos-autotest-results/222379537-chromeos-test/
https://storage.cloud.google.com/chromeos-autotest-results/222379537-chromeos-test/chromeos4-row7-rack4-host1/kernel_CryptoAPI/sysinfo/var/log_diff/update_engine/update_engine.20180730-215327

This shows that it was still trying to talk to the devserver:

[0730/215941:INFO:omaha_request_action.cc(680)] Posting an Omaha request to http://100.115.219.132:37374/update

And, if I am not mistaken, the new rootfs is a test image. lsb-release inside
the new rootfs:

CHROMEOS_RELEASE_APPID={777CE760-E315-FF86-2837-D51D10BA7C52}
CHROMEOS_BOARD_APPID={777CE760-E315-FF86-2837-D51D10BA7C52}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=CHROMEBOOK
CHROMEOS_ARC_VERSION=4734325
CHROMEOS_ARC_ANDROID_SDK_VERSION=25
GOOGLE_RELEASE=10323.97.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=squawks-release/R65-10323.97.0
CHROMEOS_RELEASE_BUILD_NUMBER=10323
CHROMEOS_RELEASE_BRANCH_NUMBER=97
CHROMEOS_RELEASE_CHROME_MILESTONE=65
CHROMEOS_RELEASE_PATCH_NUMBER=0
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=10323.97.0 (Official Build) dev-channel squawks test
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_BOARD=squawks
CHROMEOS_RELEASE_VERSION=10323.97.0
CHROMEOS_AUSERVER=https://tools.google.com/service/update2

The final reset job fails to restart ui, reboots instead, and then we can no
longer access the DUT.
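A quick way to pull that file off a DUT while you still have shell access; the
partition device below is only an example — the inactive root is whichever of
the two root partitions 'rootdev -s' does not report:

$ rootdev -s                     # currently booted root, e.g. /dev/mmcblk0p3
$ mkdir -p /tmp/new_root
$ mount -o ro /dev/mmcblk0p5 /tmp/new_root
$ cat /tmp/new_root/etc/lsb-release
$ umount /tmp/new_root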
Aug 23
Thinking again, restarting UI does not interrupt updates. What happened in #7
was that update_engine_client --status reported progress of 0.000018 even
though update_engine.log reported 60%. This happens sometimes when updates are
resumed (see crbug.com/874221). The progress of 0.010031 reported by
update_engine_client --status in the following reset job was just the update
continuing.

So it seems like in both #7 and #8 the updates reported in the reset jobs are
still the OOBE job talking to the devserver. The update in #7 appears to
reboot itself when it is done; I don't see anywhere that it talks to
production Omaha. The update in #8 completes and sends its update-complete
event (event code 3) to the devserver, then is rebooted by the final reset job
and cannot be reached. So again, I don't think this talked to production Omaha
anywhere.
Aug 23
> [ ... ] So again I don't think this talked to production omaha anywhere.
We know that the DUTs installed an image from Omaha: When you put them into
recovery mode and press <TAB>, they show production firmware keys and TPM
version values, and when you boot them, they're running a Google-signed OS
image. So somewhere, somehow, the DUTs talked to Omaha.
Aug 23
If you have physical access to one or more of the devices, are there
update_engine logs for the production Omaha update?
Aug 23
> If you have physical access to one or more of the devices are there
> update_engine logs for the production omaha update?
Yes and no. The first few attempts to look at failed devices did show that
logs were still present from when the DUT was running a test image. However,
all of the DUTs had been broken for so long that they almost immediately
deleted the relevant logs as part of standard log rotation/cleanup.
Aug 23
OK. At this point, I think the next steps are the following:
A) Disable all forms of the ForcedOOBEUpdate test, both on canary
and beta channel.
B) Figure out how this problem happens, and change the test
to avoid it.
C) Change the default CHROMEOS_AUSERVER setting in /etc/lsb-release
to be a non-functional URL, including for beta channel.
D) Re-enable the test once we've completed either B) or C).
Aug 23
Are there more instances you can link to, similar to #7 and #8? I am happy to
look through everything you have, to find out whether there is a production
Omaha call anywhere in the logs. There is no need to disable the test; all we
need to do is restart update-engine in the test cleanup(), and the in-progress
update will stop.
Aug 23
FYI, the test is actually useful (it found a current RB-S bug,
crbug.com/873270), so disabling all instances of it seems unnecessarily
aggressive.
Aug 23
> [ ... ] all we need to do is restart update-engine in the test cleanup()
> and the in progress update will stop.
IIUC, in the particular case under test, the designed behavior is for
update_engine to restart downloading if it's interrupted. So, I'm inclined to
be skeptical of this without more information. Moreover, the proposed fix
isn't proof against test aborts that occur at the wrong moment.
Aug 23
> FYI test is actually useful (found a current RB-S bug crbug.com/873270 )
> so disabling all instances of it seems unnecessarily aggressive
The problems that we're seeing are expensive to deal with, so it's necessary
to be aggressive. We need to do something today, and we want high confidence
that whatever we do will address the problem. So far, disabling the test is
the only change I see that meets that high bar. Once we've definitively
stopped the problem, we'll have breathing room to work out a more permanent
solution, such as by changing CHROMEOS_AUSERVER in /etc/lsb-release.
Aug 23
If you reboot during an OOBE update it will make another update request at the update screen (or just bypass OOBE altogether as in the bug above). If you restart or stop update-engine the update will just stop. If the test is aborted without running cleanup() all that will happen is that the new lsb-release file will remain and any subsequent update requests will try to talk to a devserver.
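In case it helps anyone trying this by hand, "stop the in-progress update"
amounts to bouncing the update-engine upstart job on the DUT. A minimal
sketch; the override path is the one the test uses, as described later in this
thread:

$ # Halt whatever update_engine is currently doing:
$ stop update-engine
$ start update-engine

$ # With the stateful lsb-release override still in place, any later update
$ # check goes to the devserver rather than production Omaha:
$ cat /mnt/stateful_partition/etc/lsb-release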
Aug 23
> If you reboot during an OOBE update it will make another update
> request at the update screen (or just bypass OOBE altogether as
> in the bug above). If you restart or stop update-engine the update
> will just stop.
We have no guarantee that the DUT won't reboot and trigger the problem.
> If the test is aborted without running cleanup() all that will
> happen is that the new lsb-release file will remain and any
> subsequent update requests will try to talk to a devserver.
We have no guarantee of this, either, since we don't know the full mechanism
for how this problem is occurring.
Aug 23
Even if it is somehow pinging production Omaha, that will only return an
update response if there is an update available (which won't happen for ToT)
and if the Omaha config is marked "deadline:now" as needing a forced update.
Otherwise it will return no update.
Aug 23
> Even if it is somehow pinging production omaha, that will only
> return an update response if there is an update available (which
> wont happen for ToT) and if the omaha config is marked "deadline:now"
> as needing a forced update. Otherwise it will return no update.
But despite this, somewhere, somehow, pings to Omaha _are_ getting update
responses.

I believe that there are things we can do that will allow us to re-enable the
test, even if we don't fully understand the cause. Most especially, it seems
likely that if we fix CHROMEOS_AUSERVER to have a non-Omaha URL (and merge the
fix to the branches), the problem will stop. I can also believe that we can
track down the root cause of the problem and explain how to fix the test to be
safe, complete with a proof via testing.

However, none of the preferred solutions can be executed today while also
giving certainty of stopping the problem. So, we need to stop the damage ASAP
while we sort out a better answer that we can apply in a week's time or so.
Aug 23
I meant specifically that ToT (daily dev/canary) runs will not get an update
from production Omaha, because they will always be running a later build than
production Omaha is serving. All three runs you have shown me are late branch
builds; I can see how production Omaha would serve those a build.

What were the build numbers of the official builds installed on these DUTs by
mistake? Knowing that could help us figure out what happened. Did it change
branches? Do you have an example of this happening on a ToT run?
Aug 23
> Do you have an example of this happening on a ToT run?
No. The issue is that we can't afford to spend time looking for more data on
this problem before we act. We need to apply a change that stops the bleeding
soon (like today), and we need absolute certainty that the change will stop
the problem. Disabling the test meets both criteria. So far, nothing else
proposed can do that.
Aug 23
+ahassani Would it be possible to have the dev image NOT talk to Omaha at all?
Aug 23
Another option is to block Omaha host/IP for the lab.
Aug 24
If there are only three examples, all on dead branches, I don't understand the
urgency here.
Aug 24
Any test that is (potentially) lab-killing has to be banned until proven
otherwise. The repair involves manual intervention, which is time-consuming.
Aug 24
> If there are only three examples on dead branches I don't understand the urgency here
These are just the three that we have with a clear history. There are two
others that got into this state, but without a complete test history. And
those were found without a lot of searching. The techs in the lab also report
that they see this kind of symptom regularly. Based on that, I'd expect a
dozen or two units in need of attention.

However, making a definitive list is expensive: Proving this problem has
occurred on any given device requires visual inspection, and we have no good
way to create a list of candidate devices that's short enough to be
reasonable: there are more than 360 broken devices that might be inspected.

So, although this problem isn't critically urgent (it's P1, not P0), it cannot
be allowed to languish: There must be mitigation, and the mitigation must be
both certain and soon.
Aug 24
+jrbarnette can you provide a list of hosts or log locations so we can review
the logs ourselves?
Aug 24
> +jrbarnette can you provide a list of hosts or log locations so we can review the logs ourselves?
Two known instances with good logs are reported in #c7 and #c8. There's a
reference to one other in #c5; this is its history:
chromeos4-row2-rack8-host4
2018-07-28 23:10:19 NO https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1989221-repair/
2018-07-28 22:42:07 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1988869-provision/
2018-07-28 19:53:53 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/221759048-chromeos-test/
2018-07-28 19:53:20 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack8-host4/1986948-reset/
I know of at least two others, but the logs are incomplete, and I'm not
sure of their names.
Aug 24
I don't understand why devices that have a production image cannot be restored
to a test image via servo. I thought the main purpose of the servo setup was
to ensure we can bring a DUT back automatically no matter what (unless there's
a hardware failure).
Aug 24
I asked Richard the same question in person. His answer, paraphrased: DUTs in the lab have a writable RO firmware, and have different firmware with test keys. The update installs production firmware with production keys. This breaks servo's ability to run recovery.
Aug 24
> I don't understand why devices that have a production image cannot
> be restored to test image via servo, I thought the main purpose of
> the servo set up was to ensure we can bring a DUT back automatically
> no matter what (unless there's a hardware failure).
The process of installing dev-signed firmware on a unit requires manual
intervention, roughly equivalent to the manual work we have to do when devices
are first delivered from the factory. We currently don't have an automated
procedure that can use servo to bypass those manual steps.

It's also not yet clear to me whether it's possible: part of the process of
installing dev-signed firmware requires clearing the TPM, which can only be
done by booting the DUT, which may or may not be possible without manual
intervention.
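For triage, a rough way to tell whether a still-bootable DUT has been flashed
with production firmware is something like the following; the crossystem
fields and flashrom flag are the standard ones, but treat the exact values as
indicative rather than authoritative:

$ crossystem mainfw_type    # "developer" on dev-signed firmware, "normal" after a production flash
$ crossystem ro_fwid fwid   # RO/RW firmware IDs; a consumer RO fwid is a bad sign
$ flashrom --wp-status      # lab DUTs normally have write-protect disabled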
Aug 25
I did some digging around, and I suspect that it is technically possible to
perform deployment from the factory where the only manual operation is to
remove write-protect. Everything else is in principle automatable.

The practical implications for this bug are small: although in theory the
process can be automated, in practice the code to perform the automation
doesn't yet exist, and will need several weeks to develop. Moreover, the
automated procedures would be predicated on knowing in advance that a DUT has
been affected by this problem, and would still have to be manually invoked by
a human.

Bug 877180 describes the automated procedure that could be used to fix the
DUTs found to be affected by this bug. Addressing that bug has value in its
own right, since it would make it cheaper to deploy new DUTs from the factory.
Aug 25
Can we cc someone who has expertise with servo to this bug? If servos are not able to bring a DUT back into service, that seems like a basic problem that should be solved. Added a few others to the bug.
Aug 25
Servo can definitely do this; I've been discussing the procedure offline with
Richard and John. The main limitations are:
* Firmware recovery hasn't been needed previously in the ATL, so there's not
much existing infra automation.
* Serial number, hwid, and vpd are stored in firmware and not backed up
elsewhere, so if they are deleted they can't easily be recovered. So the
corrupted firmware should be saved before recovery, and the data restored
after repair (see the sketch below).
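For the backup step, something along these lines should work from the DUT (or
via the equivalent flashrom invocation through servo); the output file names
are placeholders:

$ # Save the current (possibly corrupted) AP firmware and the identifying
$ # data before reflashing, so serial number / HWID / VPD can be restored.
$ flashrom -p host -r /tmp/fw_backup.bin
$ vpd -l > /tmp/vpd_backup.txt       # dump VPD key/value pairs
$ crossystem hwid > /tmp/hwid.txt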
Aug 25
> Can we cc someone who has expertise with servo to this bug? [ ... ]
Wait. This bug is not about fixing the broken devices; this bug is about
fixing the bugs in update_engine and the associated AU test that broke the
devices in the first place. The bug about automatically repairing the broken
devices is bug 877180.
Aug 25
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/02a863a3454edb57060de03357e5b32f022a1e63

commit 02a863a3454edb57060de03357e5b32f022a1e63
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Sat Aug 25 21:33:27 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1187576
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/02a863a3454edb57060de03357e5b32f022a1e63/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
Aug 27
The mitigation change in #c40 needs to be merged to the R69 branch.
The CL is here:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1192119
Aug 27
This bug requires manual review: We are only 7 days from stable. Please
contact the milestone owner if you have questions.
Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop)
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Aug 29
I'm going to continue working to get the mitigation change
merged back to the branch. Meantime, follow-up is (has been)
needed on these two items:
B) Figure out how this problem happens, and change the test
to avoid it.
C) Change the default CHROMEOS_AUSERVER setting in /etc/lsb-release
to be a non-functional URL, including for beta channel.
C) may turn out to require changing update_engine, but I think
somebody else has to own that problem.
Aug 30
Do you have any more examples of this happening to look through?
Aug 30
> Do you have any more examples of this happening to look through?
Given time, we can find them. The process will be somewhat labor
intensive.
Right now, my guess is that the most cost-effective actions we can take
are:
1) Figure out how to remove all update_engine access to Omaha from
test images, unless the Omaha URL is supplied on the command line.
2) Try reproducing the problem. Focus should probably be on reproducing
with an older version that's been seen to cause the problem.
Option 1) is particularly important, because it shouldn't be an open-ended
problem, and once it's done, we can reasonably re-enable the test.
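For what it's worth, option 1) wouldn't stop tests from updating, since the
server can already be named explicitly on the command line. A minimal
illustration; the devserver address is a placeholder:

$ # With default Omaha access removed from test images, an update would only
$ # happen when the server is passed explicitly, e.g. a lab devserver:
$ update_engine_client --update --omaha_url=http://<devserver>:<port>/update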
Aug 30
What about blocking Omaha access from the lab? (see c#27)
Aug 30
> what about blocking omaha access from lab? (see c#27).
I consider this an option of last resort. My concern is that we'd be creating
a one-off network configuration that we'd have trouble replicating and
maintaining.
Aug 30
+cc shapiroc, waffles. We explored disabling Omaha access to the lab; the way
to do that seems to be from the Golden Eye or the Omaha end, since Omaha has
multiple IPs that may change. I'm not sure we should be doing it from that
end, though.

Richard, how much work is it to implement the test image change?
Aug 30
> We explored disabling omaha access to the lab, the way to do that
> seems to be from the golden-eye or the omaha end, since omaha has
> multiple IPs that may change. I'm not sure we should be doing it
> from that end though.
I don't know what it even means to disable this from the Golden Eye
or Omaha side. Although, at first blush, I'd be guessing we shouldn't
do that.
> Richard, how much work is it to implement the test image change?
I believe that it's cheap; that's why I harp on it so much.
However, the changes will have to be done by somebody else, so
my opinion has to be discounted against that.
I believe one or both of the following changes would be entailed:
* Change /etc/lsb-release so that for test images, CHROMEOS_AUSERVER
holds an invalid URL (roughly as sketched below). This change is almost
certainly trivial. The problem is, I don't know how to prove that it's
enough to guarantee the fix.
* Change update_engine so that all uses of kOmahaDefaultAUTestURL
are blocked, skipped, or otherwise ignored on test images. There
are only four references in source, but that doesn't guarantee that
this is easy.
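To make the first bullet concrete, the test-image change might look something
like this. The replacement hostname is invented purely for illustration, and
the real change would be made in the build scripts that generate
/etc/lsb-release, not by editing the file on a device:

$ # Today, on a test image (from the lsb-release dump earlier in this bug):
$ grep CHROMEOS_AUSERVER /etc/lsb-release
CHROMEOS_AUSERVER=https://tools.google.com/service/update2

$ # Proposed: bake in a URL that can never reach production Omaha, so a
$ # fallback update check simply fails instead of fetching a consumer image.
$ grep CHROMEOS_AUSERVER /etc/lsb-release
CHROMEOS_AUSERVER=http://invalid.invalid/update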
Aug 30
OK. Regarding more instances of this problem: A tech in Stierlin Ct.
has found 8 more systems with the symptom. One of them failed
barely a week ago. History is below.
chromeos4-row8-rack7-host11
2018-08-24 20:07:23 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack7-host11/1477172-reset/
2018-08-24 20:03:33 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/230784996-chromeos-test/
2018-08-24 20:02:02 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack7-host11/1477161-reset/
2018-08-24 19:58:23 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/230784976-chromeos-test/
2018-08-24 19:56:57 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack7-host11/1477136-reset/
2018-08-24 19:53:15 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/230784958-chromeos-test/
2018-08-24 19:51:48 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack7-host11/1477110-reset/
2018-08-24 19:46:48 -- https://stainless.corp.google.com/browse/chromeos-autotest-results/230784929-chromeos-test/
2018-08-24 19:46:04 OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack7-host11/1477087-reset/
I'm going to see if we can boot it and get post-failure logs from the device.
Aug 30
Here are the logs for chromeos4-row8-rack7-host11
Aug 30
Thank you for the logs! I know what is happening now. It *is* because we are
running the test on really old builds.

1. The test creates a new lsb-release at /mnt/stateful_partition/etc/lsb-release.
This lsb-release points to a devserver.
2. The test starts a forced update at OOBE.
3. The test fails (e.g., the update started too slowly).
4. The new lsb-release file is deleted as part of the test cleanup().
5. The DUT finishes the update in the background.
6. It automatically reboots when the update is finished. (This is normal for
OOBE updates.)
7. The DUT reboots back to the OOBE update screen and does another update
check (this happens so the DUT can successfully deal with stepping stones).
8. Since we deleted the lsb-release, it makes the request to production Omaha.
9. Since we are running a REALLY old build (it is an R65 build linked to in
comment #50), production Omaha says there is a build and returns the latest
serving stable build. So in this latest example we updated from R65 to R67
stable.

A sketch of the cleanup() hardening that follows from this sequence is below.
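A minimal sketch of what the hardened cleanup could do, expressed as the
commands run on the DUT rather than the actual autotest code; the important
part is the ordering — stop the in-flight update *before* removing the
devserver override, so a background update can never be left running with
only the production Omaha URL available:

$ # Kill any in-progress update started by the test:
$ stop update-engine
$ # Only then remove the devserver override written in step 1:
$ rm -f /mnt/stateful_partition/etc/lsb-release
$ # Bring the service back so subsequent jobs find it in a clean state:
$ start update-engine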
Sep 5
OK. I spoke offline with bhthompson@ to try and understand why we were testing such old builds in the first place. I don't understand the full details, but basically, lakitu sometimes needs to build and test its stuff against branches older than the current stable branch. The problem is that our current build code doesn't know how to build and test just lakitu, so those older branch builds end up building and testing every board, including tests in the lab. Really, we shouldn't do that.
Sep 5
Putting it all together, I think fixing this bug entails the following
tasks:
* Fix the test so that it can reasonably be expected to block this
particular failure in all cases.
* Change Chrome OS so that test images don't contact the standard
Omaha URL. That means changing /etc/lsb-release or update_engine
or both.
* Fix it so that building and testing old branches for lakitu doesn't
require testing in the lab. Not building all those extra boards would
be good, too.
Additionally, it would be good if the lab were able to deal with this kind
of failure without requiring manual intervention. That will be partly
addressed by bug 877180, but additional work is required that will entail
a new project, and a design document.
Sep 6
+cc norvez for proposed update engine changes, see bullet 2 in comment #54.
Sep 6
Can we get a bug filed for bullet #3 so we do not lose track of that? That seems like wasted resources.
Sep 6
Given that we have new bugs for the long term fixes, I'm reclaiming
this bug for the immediate mitigation changes. That means these
merge CLs:
R65 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211162
R66 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211163
R67 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211165
R68 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1211164
R69 - https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1192119
All of those need to go in. The older ones should probably be
considered permanent.
Sep 6
Merge approved for 68 and prior.
Sep 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7ee86a49f97dcc7372651e4a5a0c420cfd6c2541

commit 7ee86a49f97dcc7372651e4a5a0c420cfd6c2541
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:41:55 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211162
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/7ee86a49f97dcc7372651e4a5a0c420cfd6c2541/server/site_tests/autoupdate_ForcedOOBEUpdate/control
Sep 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0b252c74f561075a0dc601f30fdb8d323f2dac99

commit 0b252c74f561075a0dc601f30fdb8d323f2dac99
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:43:02 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211163
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/0b252c74f561075a0dc601f30fdb8d323f2dac99/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
Sep 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/08be427a7f1f872febd3d89933db2d88b0d17f8c

commit 08be427a7f1f872febd3d89933db2d88b0d17f8c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:44:20 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211165
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/08be427a7f1f872febd3d89933db2d88b0d17f8c/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
Sep 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/41974366c3d0990e46a60300ed4cc82c8ba2d1c1

commit 41974366c3d0990e46a60300ed4cc82c8ba2d1c1
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Sep 06 17:45:16 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1211164
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/41974366c3d0990e46a60300ed4cc82c8ba2d1c1/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
Sep 7
Merge approved, M69.
Sep 7
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/63371df4f6d822133377c94abacb77ed6949664f

commit 63371df4f6d822133377c94abacb77ed6949664f
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Sep 07 17:16:30 2018

[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.

The test is sometimes causing devices to download and apply consumer
images from Omaha, which leaves the devices untestable, and requires
a relatively expensive manual intervention to fix.

This disables the test to stop the bleeding until the problem can be
put under better control.

BUG= chromium:877107
TEST=None

Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
Reviewed-on: https://chromium-review.googlesource.com/1187576
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>
(cherry picked from commit 02a863a3454edb57060de03357e5b32f022a1e63)
Reviewed-on: https://chromium-review.googlesource.com/1192119
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/63371df4f6d822133377c94abacb77ed6949664f/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta
Sep 10
OK. At this point, I believe we can be confident that this problem will quit happening in the lab. We have three bugs and one project for how to handle/prevent this sort of failure in future. So, I'm declaring victory for this bug; the other bugs can track the other work.
Sep 11
This issue has been approved for a merge. Please merge the fix to any
appropriate branches as soon as possible! If all merges have been completed,
please remove any remaining Merge-Approved labels from this issue. Thanks for
your time! To disable nags, add the Disable-Nags label.
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Sep 14
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/eea9f6f3f47206cda3208ad6f5e56df80355d210

commit eea9f6f3f47206cda3208ad6f5e56df80355d210
Author: David Haddock <dhaddock@chromium.org>
Date: Fri Sep 14 19:08:49 2018

Revert "[autotest] Temporarily disable autoupdate_ForcedOOBEUpdate."

This reverts commit 02a863a3454edb57060de03357e5b32f022a1e63.

Reason for revert: I have a CL to fix the test and I will submit this revert
with that change.

Original change's description:
> [autotest] Temporarily disable autoupdate_ForcedOOBEUpdate.
>
> The test is sometimes causing devices to download and apply consumer
> images from Omaha, which leaves the devices untestable, and requires
> a relatively expensive manual intervention to fix.
>
> This disables the test to stop the bleeding until the problem can be
> put under better control.
>
> BUG= chromium:877107
> TEST=None
>
> Change-Id: I4d25f74abb761bec3d85d10792bc45ba3b8d6c5d
> Reviewed-on: https://chromium-review.googlesource.com/1187576
> Tested-by: Richard Barnette <jrbarnette@chromium.org>
> Reviewed-by: Congbin Guo <guocb@chromium.org>

Bug: chromium:877107
Change-Id: If46e80ce26a3dff6b3faae09a9c91a6592c222d6
Reviewed-on: https://chromium-review.googlesource.com/1219549
Commit-Ready: danny chan <dchan@chromium.org>
Tested-by: David Haddock <dhaddock@chromium.org>
Reviewed-by: David Haddock <dhaddock@chromium.org>

[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.interrupt.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.delta
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.cellular.full
[modify] https://crrev.com/eea9f6f3f47206cda3208ad6f5e56df80355d210/server/site_tests/autoupdate_ForcedOOBEUpdate/control.delta