
Issue 777969


Issue metadata

Status: WontFix
Owner:
Closed: Aug 29
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Rollback protections apparently not right during provisioning

Reported by jrbarnette@chromium.org, Oct 24 2017

Issue description

In a number of cases, provision seems to be failing in a way
that it shouldn't.  Two key examples are in these two bugs:
  * Bug 777250 - provision failed because Chrome failed to start.
  * Bug 765686 - provision fails because the system-services job
    gets held up.

In both cases, the failure looks like this:
  * Provisioning runs, and reports that the image is
    successfully installed.
  * Post-provisioning, we run verify(), which reports a problem
    with system-services.

THAT'S SUPPOSED TO BE IMPOSSIBLE:
  * If Chrome fails to start, system-services will fail to start.
  * If system-services fails to start, then update-engine will
    fail to start.
  * If update-engine fails to start, then we'll never run
    "chromeos-setgoodkernel"
  * If we never run "set good kernel", then the "successful"
    and "tries" flags in the GPT for the just booted kernel
    will be 0 and non-zero, respectively.
  * After the reboot, the provision code specifically tests
    the "successful" and "tries" flags for correct values.
    If they're wrong, provisioning will fail.

Put more simply: Provisioning is expected to find the errors in
those two bugs, and fail before verify() runs, but that's not
happening.
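
For illustration, the post-reboot check could look roughly like the
sketch below.  This is a minimal reconstruction that reads the flags
with 'cgpt show' via subprocess; the helper name, disk path, and
partition number are assumptions for the example, not the actual
provision code.

    # Minimal sketch (assumed, not the real provision code): read the
    # "successful" and "tries" flags of the just-booted kernel slot and
    # fail if chromeos-setgoodkernel never marked the build good.
    import subprocess

    def kernel_flags(disk, partition):
        """Return (successful, tries) for a kernel slot via cgpt."""
        values = []
        for opt in ('-S', '-T'):   # -S: successful flag, -T: tries count
            out = subprocess.check_output(
                ['cgpt', 'show', '-n', '-i', str(partition), opt, disk])
            values.append(int(out.decode().strip()))
        return tuple(values)

    successful, tries = kernel_flags('/dev/mmcblk0', 2)  # example disk/slot
    if successful != 1 or tries != 0:
        raise RuntimeError('new kernel was never marked good; '
                           'provisioning should fail here')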

The logs (and testing) confirm that when provisioning runs,
the flags are in fact set wrong after reboot.  That is, the
flags claim that the build booted successfully, even though
it never did.  Probably, that's an update-engine bug.

Assuming this is confirmed as an update-engine bug, it's
potentially serious.  It means that rollback protections for
consumer devices may fail in the event of a bad build being
rolled out.  Rollback protection also matters to the test lab;
it's one of our defenses to stop bad builds from disabling
testing.

 
Labels: Restrict-View-Google
Well, one thing I don't understand: if that sequence of failures means we don't run 'chromeos-setgoodkernel' (assuming update_engine fails too), then who sets the flag to claim the build booted successfully? How can that be possible?

How can I reproduce the problem? Does a cros flash have a similar symptom?


I ran provisioning with customized devserver code that dumps the
full output from 'cgpt show' on the boot disk immediately after
update_engine installs the build on the inactive slot, and before
the reboot.  Those logs show that the GPT partition flags were set
correctly _at that instant_.

Logs are attached.  Note that the logs show that the GPT has the
correct settings before reboot, but not after reboot.  The build
installed was one of the bad builds from bug 777250.  That build
booted but never started update_engine.
> Well, one thing I don't understand: if that sequence of failures
> means we don't run 'chromeos-setgoodkernel' (assuming update_engine
> fails too), then who sets the flag to claim the build booted
> successfully? How can that be possible?

That's the bug:  If we knew the answers to your questions, we'd
know the root cause, and we'd be on our way to a fix.

Recapping what was learned in c#2, the sequence looks like this:
  * update_engine installs the build and correctly updates the
    CGPT flags:
      + Old partition is marked "priority=1 tries=0 successful=1"
      + New partition is marked "priority=2 tries=6 successful=0"
  * After reboot, update_engine never starts, yet we find these
    settings:
      + Old partition is marked "priority=1 tries=0 successful=1"
      + New partition is marked "priority=2 tries=0 successful=1"

> How can I reproduce the problem? Does a cros flash have a similar symptom?

I'm not sure of all the conditions required to reproduce the
problem.  The problem does reliably reproduce if you use provisioning
to install this build:
    peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278

Based on what I've seen so far, I'd expect you could reproduce the
problem with the following:
  * Comment out the 'start on' line in /etc/init/update-engine.conf.
  * Install an image with that change using cros flash.
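
A hypothetical sketch of the conf edit in the first step (the mounted
rootfs path is an assumption, and this isn't a verified reproduction
recipe):

    # Comment out the 'start on' line so update-engine never starts on
    # the flashed image.  The mounted image rootfs path is an assumption.
    conf = '/tmp/rootfs/etc/init/update-engine.conf'

    with open(conf) as f:
        lines = f.readlines()
    with open(conf, 'w') as f:
        for line in lines:
            if line.lstrip().startswith('start on'):
                line = '# ' + line
            f.write(line)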

I note that I've seen some cases where the problem didn't reproduce:
the GPT had the expected "unsuccessful" settings after booting a
build that didn't start update_engine.  This includes _some_ of the
builds produced because of bug 777250.  That means the procedure above
may or may not reproduce the failure reliably.


I can't reproduce it on my veyron_minnie. As soon as I can get a peach_pit DUT, I'll try to reproduce it there. I could not reproduce this bug with either provisioning or cros flash.

And I'm assuming provisioning can be done with:
test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit chromeos6-row2-rack11-host17.cros provision_AutoUpdate

Is it possible that a specific DUT is behaving like this?
> Is it possible that a specific DUT is behaving like this?

No.  I checked logs from every DUT on that particular build, and
they all showed the symptom of this bug.  Moreover, when I tested,
I used a different DUT from all the others, and it also showed the
symptom consistently.

I do note that some builds showed the symptom more than others, so
it's possible that there's some sort of intermittency in the failures.

Regarding reproducing this with test_that, I'd recommend this command
line:
    test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass

You should also be able to use that approach with a custom image built with
a trybot.

>     test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass

I've tried the above command; it doesn't seem to be working, and I
don't know what's wrong...

> test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit chromeos6-row2-rack11-host17.cros provision_AutoUpdate

This command is probably, overall, a bad idea.  The act of specifying a
build should trigger the necessary provisioning code; it's not clear to
me that provision_AutoUpdate can be specified this way.  Using "dummy_Pass"
as the test is the best solution.

You can also just go to the AFE and click the "Create Job" button.
When it comes up, request a run of "dummy_Pass" against whatever build
you think can help debug the problem.  If you want custom changes, use
a trybot build.

Here's a sample test job that reproduced the failure:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=151058750

>     test_that --build=peach_pit-tot-chrome-pfq-informational/R64-10061.0.0-b8278 --board=peach_pit :lab: dummy_Pass

I've rechecked:  The command above will work.  _Except_ right now
there may be problems because of a shortage of working, idle
peach_pit DUTs in the suites pool.

I had locked one of the DUTs for this test, but I see there is another one that is not locked and is ready. The other ones are repairing and failing repair. I just unlocked mine. Maybe try again now?
Labels: -Restrict-View-Google
OK, I think I found the issue (it seems there is actually no issue). Please let me know if you think my finding is incorrect. I'm using the logs from the job you posted:

>Here's a sample test job that reproduced the failure:
>   https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=151058750

File: autoupdate_logs/CrOS_update_chromeos6-row2-rack10-host5_20020.log

Update engine finishes the update:
2017/10/23 15:45:55.260 INFO |      auto_updater:0700| Waiting for update...status: UPDATE_STATUS_FINALIZING at progress 0.000000
2017/10/23 15:46:05.261 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- update_engine_client --status
2017/10/23 15:46:05.583 DEBUG|    cros_build_lib:0642| (stdout):
LAST_CHECKED_TIME=1508798536
PROGRESS=1.000000
CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
NEW_VERSION=999999.0.0
NEW_SIZE=518683006


The device is rebooted:

2017/10/23 15:46:08.560 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- reboot

After the device comes up:

2017/10/23 15:46:54.497 DEBUG|      auto_updater:1252| Start pre-setup for stateful update...

Then it restarts the update-engine service:

2017/10/23 15:46:59.476 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- stop update-engine
Warning: Permanently added 'chromeos6-row2-rack10-host5,100.115.129.79' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos6-row2-rack10-host5,100.115.129.79' (ED25519) to the list of known hosts.
stop: Unknown instance: 
2017/10/23 15:46:59.722 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- start update-engine

This causes update_engine to come up and run chromeos-setgoodkernel.

The device is rebooted again:

2017/10/23 15:47:44.544 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- reboot

After the reboot, the CGPT 'successful' flag is checked:

2017/10/23 15:48:45.299 DEBUG|    cros_build_lib:0593| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpnO42qo/testing_rsa root@chromeos6-row2-rack10-host5 -- cgpt show -n -i 2 -S '$(rootdev -s -d)'
2017/10/23 15:48:45.565 DEBUG|    cros_build_lib:0642| (stdout):
1

which shows success. Normally it shouldn't, because update_engine never came up on its own. But since we force-restart update-engine in the middle, this happens. Probably the reason you see this happen intermittently is that sometimes the device gets rebooted before UE can set the flag :)
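
In other words, the pre-setup for the stateful update effectively does
something like the following (a hypothetical reconstruction of the
sequence in the log above, not the actual auto_updater.py code):

    # Hypothetical reconstruction: bouncing update-engine here gives UE
    # a chance to run chromeos-setgoodkernel on the never-verified
    # kernel before the second reboot and the CGPT check.
    def pre_setup_stateful_update(run_on_dut):
        run_on_dut('stop update-engine')   # "Unknown instance:" -- the
                                           # job never started on this build
        run_on_dut('start update-engine')  # UE starts, marks kernel good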


Cc: dhadd...@chromium.org
Actually, I think the issue is the restart of UE and the second reboot. I don't know why either of them is needed.
Cc: xixuan@chromium.org
> [ ... ] But since we force restart the update engine in the middle,
> this happens. [ ... ]

Oh, <sigh>.  I worried about this sort of thing, and tried to check
at least one set of logs for it, but I couldn't see it.

I haven't studied the code, but if provisioning manually restarts
update_engine after applying an update and rebooting, that's just
wrong, and we have to fix it.

Owner: xixuan@chromium.org
I'm not sure I understand it right.

Proper fix in my understanding:

1) Remove 'restart update-engine' in pre_check of stateful update
2) Remove 'reboot' after the stateful update is finished.

Is that correct @ahassani @jrbarnette ?
@xixuan, I'm not really sure it's that simple. There should have been a reason we do it this way; dhaddock@ might be able to shed light on this. It could have been written this way so the errors we see point at the exact cause of the problem. Personally, I like that the current error reporting is more specific; otherwise everything would get blamed on UE. If there is a way to fix this issue but keep reporting the correct cause (which in this case was that Chrome did not come up), I'm all for it.
Cc: davidri...@chromium.org
Re #15, thanks for the clarification.

I compared the old provision flow and the new one. There's a slight difference: the old provision code combines update_rootfs and update_stateful, and the sequence is:

reset_update_engine()
reset_stateful_partition()
if update_rootfs:    # i.e., only when a rootfs update is required
    update_rootfs()
update_stateful()

The new provision code splits that flow into two functions, update_rootfs() and update_stateful(), and each function calls 'reset_update_engine()'.

Also, the old provision code doesn't verify that update_stateful succeeded when a rootfs update is also required. The new provision code changes that, which is why we have a second reboot in PostCheckStatefulUpdate.

So we can fix this by checking whether a rootfs update has already been kicked off. If so, we don't reset update_engine for the subsequent stateful update. I'll wait for more comments to see whether that's the proper fix.
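
A rough sketch of that idea (the helper names are hypothetical, not the
actual auto_updater.py API):

    # Only reset update-engine for the stateful update when no rootfs
    # update was applied in this provision attempt, so the restart can't
    # mark an unverified kernel as good.
    def update_stateful(updater, rootfs_already_updated):
        if not rootfs_already_updated:
            updater.reset_update_engine()
        updater.reset_stateful_partition()
        updater.run_stateful_update()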
I had a look back through my CLs; the first time I touched auto_updater.py, the provision code already did the UE restart and second reboot, so I just kept it as-is when I added the AU test.
Cc: marchuk@google.com
Labels: Hotlist-Enterprise
We have one support case where a customer demonstrated a successful rollback (from 60 to 59 on peppy) by pressing ALT-vol_up-X 6 times in a row once the device booted. So #1 is possible; however, I could not reproduce it on the same device with the same version.

Attached video and logs.
https://drive.google.com/drive/folders/1UGeYMclfvfrzdsVtBy0XYJTyi1OOEBtP?usp=sharing
Labels: OS-Chrome
Comment 18 may not be directly related to this specific bug, but something that wasn't mentioned: was it just updated to 60 from 59? If so, from the video I can see the reboots happen really fast. Usually UE needs some time to verify the kernel, but since the reboots are happening very fast, UE cannot verify it, and after a few retries it rolls back. I do not fully know if we have this protection or if it is just a bug.
> We have one support case, where customer demonstrated successful
> rollback (from 60 to 59 on peppy) by pressing ALT-vol_up-X 6 times
> in a row once device is booted. so #1 is possible, however I could
> not reproduce on same device same version.

Depending on the exact circumstances, rollback in that case may
actually be the system WAI.

In any event, the current explanation for this bug means that it
can only occur in the Autotest test lab, not in a customer
environment.  That means this bug isn't a good vehicle for
discussing the problem in c#18.

marchuk@ - could you please do the following ASAP:
 1) File a new bug detailing the circumstances of the events in
    the video.  Specifically, be sure to provide enough information
    for someone to reproduce it locally.
 2) Make sure the new bug has the attention of the appropriate
    product engineering team.  Given that ahassani@ has taken an
    interest, he seems a likely starting point.
 3) Drop the Hotlist-Enterprise label from this bug in favor of the
    other bug.

Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you.
Status: Assigned (was: Available)
Status: WontFix (was: Assigned)
Marking this as WontFix, since @jrbarnette recently refactored this code.
