Rollback protections apparently not right during provisioning
Reported by jrbarnette@chromium.org, Oct 24 2017
Issue description

In a number of cases, provisioning seems to be failing in a way that it shouldn't. Two key examples are in these two bugs:
* Bug 777250 - provision failed because Chrome failed to start.
* Bug 765686 - provision fails because the system-services job gets held up.

In both cases, the failure looks like this:
* Provisioning runs, and reports that the image is successfully installed.
* Post-provisioning, we run verify(), which reports a problem with system-services.

THAT'S SUPPOSED TO BE IMPOSSIBLE:
* If Chrome fails to start, system-services will fail to start.
* If system-services fails to start, then update-engine will fail to start.
* If update-engine fails to start, then we'll never run "chromeos-setgoodkernel".
* If we never run "chromeos-setgoodkernel", then the "successful" and "tries" flags in the GPT for the just-booted kernel will be 0 and non-zero, respectively.
* After the reboot, the provision code specifically tests the "successful" and "tries" flags for correct values. If they're wrong, provisioning will fail.

Put more simply: provisioning is expected to find the errors in those two bugs, and fail before verify() runs, but that's not happening. The logs (and testing) confirm that when provisioning runs, the flags are in fact set wrong after reboot. That is, the flags claim that the build booted successfully, even though it never did. Probably, that's an update-engine bug.

Assuming this is confirmed as an update-engine bug, it's potentially serious. It means that rollback protections for consumer devices may fail in the event of a bad build being rolled out. Rollback protection also matters to the test lab; it's one of our defenses to stop bad builds from disabling testing.
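For reference, here is a minimal sketch (in Python, not the actual provision code) of the kind of post-reboot check described above. The ssh helper and function names are illustrative assumptions only; the kernel partition index depends on which slot was just booted (typically 2 for KERN-A, 4 for KERN-B).

import subprocess

def _dut_output(host, cmd):
    # Run a command on the DUT over ssh as root and return its stdout.
    return subprocess.run(
        ['ssh', 'root@%s' % host, cmd],
        check=True, capture_output=True, text=True).stdout.strip()

def check_kernel_marked_good(host, kernel_partition):
    # After a healthy boot, update-engine runs chromeos-setgoodkernel, which
    # leaves successful=1 and tries=0 on the booted kernel.  If update-engine
    # never started, successful stays 0 and tries stays non-zero, and a check
    # like this is expected to fail the provision.
    disk = _dut_output(host, 'rootdev -s -d')
    successful = int(_dut_output(
        host, 'cgpt show -n -i %d -S %s' % (kernel_partition, disk)))
    tries = int(_dut_output(
        host, 'cgpt show -n -i %d -T %s' % (kernel_partition, disk)))
    if successful != 1 or tries != 0:
        raise RuntimeError(
            'kernel partition %d not marked good: successful=%d tries=%d'
            % (kernel_partition, successful, tries))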
Oct 24 2017
I ran provisioning with customized devserver code that dumps the full output from 'cgpt show' on the boot disk immediately after update_engine installs the build on the inactive slot, and before the reboot. Those logs show that the GPT partition flags were set correctly _at that instant_. Logs are attached. Note that the logs show that the GPT has the correct settings before reboot, but not after reboot. The build installed was one of the bad builds from bug 777250. That build booted but never started update_engine.
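To make that debugging step concrete, here is a rough sketch of that kind of hook (the real change was to the devserver code; the function and logging shown here are illustrative assumptions, not the actual patch):

import logging
import subprocess

def dump_gpt_state(host):
    # Capture the full 'cgpt show' output for the DUT's boot disk right after
    # update_engine finishes installing, so it can be compared with the state
    # seen after the reboot.
    ssh = ['ssh', 'root@%s' % host]
    disk = subprocess.run(ssh + ['rootdev -s -d'], check=True,
                          capture_output=True, text=True).stdout.strip()
    gpt = subprocess.run(ssh + ['cgpt show %s' % disk], check=True,
                         capture_output=True, text=True).stdout
    logging.info('GPT state on %s before reboot:\n%s', host, gpt)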
Oct 24 2017
> Well, one thing I don't understand: if that sequence of
> failures means we never run 'chromeos-setgoodkernel'
> (assuming update_engine fails too), then who sets the
> flag to claim the build booted successfully? How can that be possible?
That's the bug: If we knew the answers to your questions, we'd
know the root cause, and we'd be on our way to a fix.
Recapping what was learned in c#2, the sequence looks like this:
* update_engine installs the build and correctly updates the
CGPT flags:
+ Old partition is marked "priority=1 tries=0 successful=1"
+ New partition is marked "priority=2 tries=6 successful=0"
* After reboot, update_engine never starts, yet we find these
settings:
+ Old partition is marked "priority=1 tries=0 successful=1"
+ New partition is marked "priority=2 tries=0 successful=1"
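For reference, the post-reboot state of the new partition above is exactly what a "mark this kernel good" operation produces. Very roughly (this is an assumption about what chromeos-setgoodkernel boils down to, not the actual script, and the partition number is purely illustrative):

import subprocess

def mark_booted_kernel_good(kernel_partition=4):
    # Clear the retry counter and set the 'successful' bit on the given
    # kernel partition of the boot disk -- i.e. exactly the
    # "tries=0 successful=1" state observed after the reboot.
    disk = subprocess.run(['rootdev', '-s', '-d'], check=True,
                          capture_output=True, text=True).stdout.strip()
    subprocess.run(['cgpt', 'add', '-i', str(kernel_partition),
                    '-T', '0', '-S', '1', disk], check=True)

The open question in this comment is what, if anything, ran an operation like this when update_engine supposedly never started.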
> How can I reproduce the problem? Does cros flash have a similar symptom?
I'm not sure of all the conditions required to reproduce the
problem. The problem does reliably reproduce if you use provisioning
to install this build:
peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278
Based on what I've seen so far, I'd expect you could reproduce the
problem with the following:
* Comment out the 'start on' line in /etc/init/update-engine.conf.
* Install an image with that change using cros flash.
I note that I've seen some cases where the problem didn't reproduce:
the GPT had the expected "unsuccessful" settings after booting a
build that didn't start update_engine. This includes _some_ of the
builds produced because of bug 777250. That means the procedure above
may or may not reproduce the failure reliably.
Oct 25 2017
I can't reproduce it on my veyron_minnie. As soon as I can get a peach_pit DUT, I'll try to reproduce it there.
Oct 25 2017
I can't reproduce this bug with either provisioning or cros flash. I'm assuming provisioning can be done with:

test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit chromeos6-row2-rack11-host17.cros provision_AutoUpdate

Is it possible that a specific DUT is behaving like this?
Oct 25 2017
> Is it possible that a specific DUT is behaving like this?

No. I checked logs from every DUT on that particular build, and they all showed the symptom of this bug. Moreover, when I tested, I used a different DUT from all the others, and it also showed the symptom consistently. I do note that some builds showed the symptom more than others, so it's possible that there is some sort of intermittency in the failures.
Oct 25 2017
Regarding reproducing this with test_that, I'd recommend this command
line:
test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass
You should also be able to use that approach with a custom image built with
a trybot.
Oct 26 2017
> test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass
I've tried the above command, but it doesn't seem to be working. I don't know
what's wrong...
> test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit chromeos6-row2-rack11-host17.cros provision_AutoUpdate
This command is probably, overall, a bad idea. The act of specifying a
build should trigger the necessary provisioning code; it's not clear to
me that provision_AutoUpdate can be specified this way. Using "dummy_Pass"
as the test is the best solution.
You can also just go to the AFE and click the "Create Job" button.
When it comes up, request a run of "dummy_Pass" against whatever build
you think can help debug the problem. If you want custom changes, use
a trybot build.
Here's a sample test job that reproduced the failure:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=151058750
Oct 26 2017
> test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass

I've rechecked: the command above will work. _Except_ right now there may be problems because of a shortage of working, idle peach_pit DUTs in the suites pool.
Oct 26 2017
I had locked one of the DUTs for this test, but I see there is another one which is not locked and is ready. The other ones are in repair and failing repair. I just unlocked mine. Maybe try again now?
Oct 26 2017
OK, I think I found the issue (and it seems there is no real issue). Please let me know if you think my finding is incorrect. I'm using the logs from the job you posted:

> Here's a sample test job that reproduced the failure:
> https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=151058750

file autoupdate_logs/CrOS_update_chromeos6-row2-rack10-host5_20020.log

Update engine finishes the update:

2017/10/23 15:45:55.260 INFO | auto_updater:0700| Waiting for update...status: UPDATE_STATUS_FINALIZING at progress 0.000000
2017/10/23 15:46:05.261 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- update_engine_client --status
2017/10/23 15:46:05.583 DEBUG| cros_build_lib:0642| (stdout): LAST_CHECKED_TIME=1508798536 PROGRESS=1.000000 CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT NEW_VERSION=999999.0.0 NEW_SIZE=518683006

The device is rebooted:

2017/10/23 15:46:08.560 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- reboot

After the device comes up:

2017/10/23 15:46:54.497 DEBUG| auto_updater:1252| Start pre-setup for stateful update...

Then it restarts the update-engine service:

2017/10/23 15:46:59.476 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- stop update-engine
Warning: Permanently added 'chromeos6-row2-rack10-host5,100.115.129.79' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos6-row2-rack10-host5,100.115.129.79' (ED25519) to the list of known hosts.
stop: Unknown instance:
2017/10/23 15:46:59.722 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- start update-engine

This causes update-engine to come up and run chromeos-setgoodkernel.

The device is rebooted again:

2017/10/23 15:47:44.544 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- reboot

After the reboot the cgpt is checked:

2017/10/23 15:48:45.299 DEBUG| cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- cgpt show -n -i 2 -S '$(rootdev -s -d)'
2017/10/23 15:48:45.565 DEBUG| cros_build_lib:0642| (stdout): 1

which shows success. Normally it shouldn't have, because update_engine never came up on its own. But since we force-restart update-engine in the middle, this happens. Probably the reason you see this happen intermittently is that sometimes the device gets rebooted before UE can set the flag :)
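To summarize the sequence pulled out of that log, the provision flow effectively does something like the sketch below between installing the build and checking the CGPT flags (simplified; run_on_dut() stands in for the ssh/RunCommand wrapper the real auto_updater.py code uses):

def provision_sequence(run_on_dut):
    # 1. update_engine installs the new build on the inactive slot and
    #    reports UPDATE_STATUS_UPDATED_NEED_REBOOT.
    run_on_dut('update_engine_client --status')
    # 2. First reboot, into the freshly installed build.
    run_on_dut('reboot')
    # 3. Pre-setup for the stateful update force-restarts update-engine.
    #    Starting the upstart job here is what ends up running
    #    chromeos-setgoodkernel, even though the job never came up on its
    #    own after the reboot ("stop" reports "Unknown instance").
    run_on_dut('stop update-engine')
    run_on_dut('start update-engine')
    # 4. Second reboot, after the stateful update finishes.
    run_on_dut('reboot')
    # 5. Only now are the CGPT flags checked -- by which point they already
    #    read successful=1, tries=0.
    run_on_dut("cgpt show -n -i 2 -S $(rootdev -s -d)")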
Oct 26 2017
Actually, I think the issue is the restarting of UE and the second reboot. I don't know why either of them is needed.
Oct 26 2017
> [ ... ] But since we force restart the update engine in the middle,
> this happens. [ ... ]

Oh, <sigh>. I worried about this sort of thing, and tried to check at least one set of logs for it, but I couldn't see it. I haven't studied the code, but if provisioning manually restarts update_engine after applying an update and rebooting, that's just wrong, and we have to fix it.
Oct 26 2017
I'm not sure I understand it right. The proper fix, in my understanding:
1) Remove the 'restart update-engine' in the pre_check of the stateful update.
2) Remove the 'reboot' after the stateful update is finished.
Is that correct, @ahassani @jrbarnette?
Oct 26 2017
@xixuan, I'm not really sure it's that simple. There should've been a reason we do it this way; dhaddock@ might be able to shed light on this. It could've been written this way so that the errors we see show exactly the cause of the problem. Personally, I like that the current error reporting is more specific; otherwise everything would get blamed on UE. If there's a way to fix this issue but keep reporting the correct cause (which in this case was that Chrome did not come up), I'm all for it.
Oct 26 2017
Re #15, thanks for the clarification.
I compared the old provision flow and the new one. There's a slight difference: the old provision combined update_rootfs and update_stateful together, so the sequence was:

reset_update_engine()
reset_stateful_partition()
if update_rootfs:
    update_rootfs()
update_stateful()

The new provision code splits the flow into two functions, update_rootfs() and update_stateful(), and each function calls reset_update_engine().
Also, the old provision didn't verify that update_stateful succeeded when a rootfs update was also required. The new provision changes that, and that's why we have a second reboot in PostCheckStatefulUpdate.
So we can fix it by checking whether a rootfs update has already been kicked off. If so, we don't reset update-engine for the following stateful update. I'll wait for more comments to see if that's the proper fix.
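In code terms, the proposal in the previous paragraph would look roughly like the sketch below. The function names are placeholders standing in for the real auto_updater.py steps, and 'ops' is just a holder for them; this is not the actual implementation.

def run_update(ops, update_rootfs_needed):
    if update_rootfs_needed:
        ops.reset_update_engine()
        ops.reset_stateful_partition()
        ops.update_rootfs()
        # Proposed fix: skip the update-engine reset/restart (and the extra
        # reboot) before the stateful update when a rootfs update was just
        # applied -- the new build has to mark itself good on its own.
        ops.update_stateful()
    else:
        # Stateful-only update: resetting update-engine here is harmless,
        # since the booted build is already known good.
        ops.reset_update_engine()
        ops.update_stateful()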
Oct 26 2017
I had a look back through my CLs, and the first time I touched auto_updater.py, the provision code already did the UE restart and second reboot. So I just kept it as is when I added the AU test.
Dec 2 2017
We have one support case where a customer demonstrated a successful rollback (from 60 to 59 on peppy) by pressing ALT-vol_up-X six times in a row once the device is booted. So #1 is possible; however, I could not reproduce it on the same device and same version. Video and logs are attached: https://drive.google.com/drive/folders/1UGeYMclfvfrzdsVtBy0XYJTyi1OOEBtP?usp=sharing
Dec 4 2017
Comment 18 may not be directly related to this specific bug, but something that wasn't mentioned: had it just been updated from 59 to 60? If yes, then from the video I can see the reboots happen really fast. Usually UE needs some time to verify the kernel, but since the reboots are happening very fast, UE cannot verify it, and after a few retries it rolls back. I don't fully know whether we have this protection or whether it is just a bug.
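For background on the retry behaviour described here, verified boot's kernel selection works roughly like the sketch below (a simplified model of the GPT-attribute scheme, not the firmware source):

def pick_kernel(slots):
    # slots: list of dicts with 'name', 'priority', 'successful' and 'tries',
    # e.g. the two KERN partitions.  Highest priority is tried first.
    for slot in sorted(slots, key=lambda s: s['priority'], reverse=True):
        if slot['priority'] <= 0:
            continue
        if slot['successful']:
            return slot['name']        # known-good kernel
        if slot['tries'] > 0:
            slot['tries'] -= 1         # burn one try on the unproven kernel
            return slot['name']
        # successful=0 and tries=0: this slot has used up its chances, so
        # fall through to the next (lower-priority) kernel -- i.e. rollback.
    raise RuntimeError('no bootable kernel')

If every boot is cut short before chromeos-setgoodkernel can set successful=1, the new slot's tries eventually reach 0 and the next boot falls back to the old kernel, which would match the rollback in the video.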
Dec 5 2017
> We have one support case, where customer demonstrated successful
> rollback (from 60 to 59 on peppy) by pressing ALT-vol_up-X 6 times
> in a row once device is booted. so #1 is possible, however I could
> not reproduce on same device same version.
Depending on the exact circumstances, rollback in that case may
actually be the system WAI.
In any event, the current explanation for this bug means that it
can only occur in the Autotest test lab, not in a customer
environment. That means that this bug isn't a good vehicle
for discussing the problem in c#18.
marchuk@ - could you please do the following ASAP:
1) File a new bug detailing the circumstances of the events in
the video. Specifically, be sure to provide enough information
for someone to reproduce it locally.
2) Make sure the new bug has the attention of the appropriate
product engineering team. Given that ahassani@ has taken an
interest, he seems a likely starting point.
3) Drop the Hotlist-Enterprise from this bug in favor of the other
bug.
Jun 8 2018
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you.
Aug 29
Marking it as WontFix, as @jrbarnette recently refactored this code.