New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 3 users

Issue metadata

Status: Fixed
Closed: Feb 2017
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug

Blocked on:
issue chrome-os-partner:62911
issue 692625

Sign in to add a comment

reef release keeps failing due to system rolled back to previous build

Project Member Reported by, Feb 9 2017

Issue description

Comment 2 by, Feb 9 2017


Comment 3 by, Feb 9 2017

for M58

Failed steps failed cbuildbot [reef-release] failed paygenbuildcanary failed paygentestdev failed hwtest [sanity]

Comment 4 by, Feb 9 2017

+waihong sheriff

Comment 5 by, Feb 9 2017

We will need green 57 on reef soon otherwise we won't be able to signoff on reef without stresslab test result.

Two issues:
 (1) PaygenTestDev failure
 (2) HWTest failure

(2) is caused by failing to provision. This provision failure may also affect (1) to generate some random failures.

The provision log:

02/09 05:54:31.537 INFO |    cros_build_lib:0564| RunCommand: /tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json
02/09 05:54:33.648 DEBUG|   reporting_utils:0308| return code: 1; command: /tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json
CommandException: No URLs matched: gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json

cmd=['/tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil', '-o', 'Boto:num_retries=10', 'cat', 'gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json'], extra env={'BOTO_CONFIG': '/usr/local/google/home/chromeos-test/.boto'}

I tried to run the same command on my desktop "gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json" that worked fine. Probably some access issue specific in the lab.

Assigned to infra deputy to check the access from the test server.

Comment 8 by, Feb 9 2017

The provision log is actually here:

02/09 05:43:20.951 DEBUG|        dev_server:1856| Failed to trigger auto-update process on devserver
02/09 05:43:20.952 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/09 05:43:22.410 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/09 05:43:24.159 DEBUG|        dev_server:1688| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row4-rack7-host19/59894268-provision/autoupdate_logs/CrOS_update_chromeos2-row4-rack7-host19_2006.log
02/09 05:43:24.160 DEBUG|        dev_server:1955| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/", line 226, in TriggerAU
  File "/home/chromeos-test/chromiumos/chromite/lib/", line 985, in PostCheckCrOSUpdate
    'build' % (self.update_version, self.device.hostname))
  File "/home/chromeos-test/chromiumos/chromite/lib/", line 809, in _VerifyBootExpectations
    raise RootfsUpdateError(rollback_message)
RootfsUpdateError: Build reef-release/R58-9267.0.0 failed to boot on chromeos2-row4-rack7-host19; system rolled back to previous build

And the AU Logs can be found here:

Comment 9 by, Feb 9 2017

Looks like the SSH connection is down post update:

2017/02/09 05:50:40.705 DEBUG|      auto_updater:0907| Start pre-setup for stateful update...
2017/02/09 05:50:40.705 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpudfDzJ/testing_rsa root@ -- sudo stop ap-update-manager
Warning: Permanently added '' (RSA) to the list of known hosts.
stop: Unknown job: ap-update-manager
Timeout, server not responding.
2017/02/09 05:51:21.141 ERROR|     remote_access:0851| Error connecting to device
2017/02/09 05:51:21.142 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpudfDzJ/testing_rsa root@ -- rm -rf /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.8PfuILvL9F
2017/02/09 05:52:24.840 DEBUG|       cros_update:0229| Error happens in CrOS auto-update: SSHConnectionError(None,)
I checked chromeos2-row4-rack6-host2 as an example, and my found is:

1. The host chromeos2-row4-rack6-host2's cros-version label (reef-release/R56-9000.50.0) is never changed, which means it can only pass tests like "autoupdate_EndToEndTest_paygen_au_beta_full_9000.50.0", no other versions.

2. Auto-update in repair works, which installs default "reef-release/R56-9000.50.0".

3. Auto-update in provision of new release version doesn't work, and fail due to one major kind of failures: "Build reef-release/<a new version> failed to boot on chromeos2-row4-rack6-host2", which means the installation process is finished, but the host cannot find right boot partition.
   So I suggest to cros flash all affected DUTs (chromeos2-row4-rack6-host2, chromeos2-row4-rack7-host13, chromeos2-row4-rack7-host19) with a newer version to further see whether these new builds can be installed in the host. Maybe they just have some remaining files of old versions which blocks them to be updated to new ones.

I created a cautotest job to flash all 3 of those devices to: reef-release/R58-9266.0.0

The flash jobs all failed.
The flash jobs do the same things as normal provision jobs, so they failed in the same "system rolled back to previous build" error. May need other ways to flash.
Weirdly cros flash worked...:

$ cros flash chromeos2-row4-rack6-host2.cros remote/reef-release/R58-9266.0.0
14:59:04: NOTICE: Preparing to update the remote device chromeos2-row4-rack6-host2.cros
 [##############################################################################################] 100%
15:04:13: NOTICE: Finalizing image.
15:04:55: NOTICE: Update completed.
15:05:20: NOTICE: rebooting device...
15:07:06: NOTICE: Verifying that the device has been updated...
15:07:07: NOTICE: Update performed successfully.
15:07:07: NOTICE: cros flash completed successfully.

$ ssh root@chromeos2-row4-rack6-host2.cros
The authenticity of host ' (' can't be established.
RSA key fingerprint is SHA256:FzZ7jruzIC9aBJbVgOcmpI5d5szMeTPqQsr0doostwE.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.
localhost ~ # cat /etc/lsb-release 

Xixuan what does that mean?
Here is the straight provision logs:

They look the same as the previous failure as Waihong said.
I will cros flash host13 later to see what happened. 

Looks like provision has more verification steps than cros-flash and these steps fail.
findings after further debugging:

I cannot reproduce the error "system rolled back to previous build" on both host13 & host19. But I have the error that the DUT often lose their SSHconnection during some specific steps of provisioning. 

After add a retry fix of SSHConnectionError, the problem seems gone at least in my workstation, host13 & host19 are successfully provisioned with local devserver & autotest.

So I will upload these fixes, let's see whether this will fix the reef provision problem.
"system rolled back to previous build" means that the build crashed
and burned at initial boot.  Usually, that means the build is bad.

I've reproduced the failure by the simple expedient of running
"dummy_Pass" on a random DUT, using reef-release/R58-9268.0.0:

We need to pass this problem to a sheriff to debug and triage the

Status: Assigned (was: Available)
Simran tells me he's overloaded, so on his behalf:

waihong@ - Tag, you're it!  You should probably pass
this to someone who's sheriff for 2/13.

Comment 20 by, Feb 11 2017

I thought Xixuan was applying a retry fix to try and resolve this?
Summary: reef release keeps failing due to system rolled back to previous build (was: no green build for reef family)
oh, after I apply the retry fix locally, "system rolled back to previous build" error comes back. With Richard's help, it's found that this may be the problem of build itself. 

So I will still upload the retry fix to make the provision error more focused (on system rolled back). The tracking bug would be @Benson will take care of the rollback error.
+aaron +mary

We identified several problems on the reef system deployed here. One is that the repair system firmware was positively ancient, at 8992.0.

Richard bumped it to 9042.14.

Second thing is that according to the logs, the cr50-updater is causing the system to reboot. It's not known how many times it causes the system to reboot shortly after boot, but this could potentially cause a rollback.
I'm going to try to repro this locally on a system using cros flash.
Labels: -Pri-1 Blocks-FSI Pri-0
High criticality to this bug, as we need a good build soon for factory.
I happen to know why "cros flash" works but provision job fails.

"cros flash" does the default "old state" stateful update:

Because it doesn't pass any argument to stateful_update script:

However, provision job does an "clean state" stateful update.

The "clean state" update will remove all other directories except dev_image and var_overlay under /mnt/stateful_partition.

Unfortunately, we keep some state about cr50 update in /mnt/stateful_partition/encrypted (file /var/cache/cr50.some_md5sum.state) which gets removed. So it triggers reboot (4-times) to construct a new cr50 state file:

We should revisit this reboot (4 times) which happens on every provision job in the lab.
Labels: proj-cr50
FWIW, a M57 build passed, despite the odds.

Vadim, why would we be rebooting 4 times though? If the h1 firmware is up to date it should fall all the way through with a completed status, no?

Comment 28 by, Feb 13 2017

Labels: bvttriage
Added this week sheriffs to follow up.
I think we need to bump the fw base of pyro and snappy as well for whatever their fallback version is. That might get us passing builds on all of those boards. It won't cause the general issue of cr50 potentially not liking our build process. I fear that bouncing around back and forth with different images and firmware versions will continue to cause problems.
Blockedon: 692625 chrome-os-partner:62911
Paygenbuildcanary issue tracked here :  issue 692625 
Paygentest autest, provisioning tracked here :

Comment 32 by, Feb 15 2017

so, no matter what the fix is for the 'steady' cr50 image case, there will have to be reboots if the new chrome os image has a newer cr50 image in it.

Is this going to be a problem for the lab?
jrbarnette@ can answer.

How many reboot? I think 4-reboot is not acceptable. Lab machines will be scheduled tests running on different channels. The cr50 image not matched case may happen several times a day.
Re 25, 

The code you're looking is wrong.

cros flash uses codes in:$&l=449

provision also uses auto-update codes in:$&l=1

The code to call provisioning is:$&l=155

Basically, they use the same codes. But you're right at they do use different setting of '--stateful_change=clean'. I enable the clobber_stateful=True for cros flash, but still can successfully finish cros flash. So '--stateful_change=clean' may not explain why cros_flash succeeds but provision fails.
Re c#34, thanks for correcting it.

The multiple reboots happen if the following two conditions happen:
 (1) stateful_change=clean to wipe the cr50 state file in /var/cache
 (2) the usb_updater command fails to updates cr50 firmware

I don't know why (2) happens on some cases but some other cases seem working fine.

For example, this one fails:

Check the /var/log/messages:
2017-02-15T07:22:23.124049-08:00 WARNING kernel: [   14.165281] init: cr50-update main process (1971) terminated with status 3

And this one works:

Check the /var/log/messages:
2017-02-14T01:44:33.806180+00:00 NOTICE cr50-update[2554]: Will run /usr/sbin/usb_updater  -u /opt/google/cr50/firmware/
2017-02-14T01:44:33.862727+00:00 NOTICE cr50-update[2571]: cr50 is running updated firmware
> so, no matter what the fix is for the 'steady' cr50
> image case, there will have to be reboots if the new
> chrome os image has a newer cr50 image in it.
> Is this going to be a problem for the lab?

We need to figure out first what's a problem for the users.
In particular, a user experience that looks like "I rebooted
to get an update, and then I saw login screen, and then the
system rebooted 4 times" won't work for users.  I doubt that
system will be acceptable it reboots even once after login
screen, because users expect to be able to use the system
once login screen is up.

The expected user visible behavior is that any work required
after an update happens while the splash screen is up, before
we start Chrome:
  * If the work is notably long running, we need to display
    a message to explain the delay.
  * If the work will require a reboot at the end, we need
    to display a message setting user expectations.
  * All the work should be done after a single extra reboot.
  * Events of this sort should be relatively rare during the
    life of the product.

We already do stuff like this for firmware update, so the basic
code facilities we need to do it already exist.

If the systems will work as described above, the lab can learn
to cope.  If the systems can't work this way, this feature
needs review at a higher level for its appropriateness, and
getting the lab to cope could be extra work.
Richard : 

There's already a bug for displaying a warning message about an impending reboot:

I agree though that rebooting 4 times is seriously not a good user experience. Going over the update script, that 4 times behavior is how the script deals with unknown errors in the usb_updater. The idea is that rather than dealing with the error in an meaningful way (or even displaying the error to the log), the script just crosses its fingers and hopes a reboot will clear the error condition and we'll be better off the next boot.

Aaron and I were talking last night, and we agreed that this whole rebooting as a last resort is just not right and will definitely need to change.
Project Member

Comment 38 by, Feb 23 2017

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit - Your friendly Sheriffbot
Status: Fixed (was: Assigned)

Sign in to add a comment