reef release keeps failing due to system rolled back to previous build
Issue description

We have not had a green build for the reef family for a long time: https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=amenia%2Cpyro%2Creef%2Csand%2Csnappy&milestone=&chromeOsVersion=&chromeVersion=&startTimeFrom=&startTimeTo=#%2F
Feb 9 2017
For M58: https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/826

Failed steps:
- cbuildbot [reef-release]
- PaygenBuildCanary
- PaygenTestDev
- HWTest [sanity]
Feb 9 2017
+waihong sheriff
Feb 9 2017
We will need a green 57 build on reef soon; otherwise we won't be able to sign off on reef without stress-lab test results.
Feb 9 2017
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/828

Two issues: (1) PaygenTestDev failure; (2) HWTest failure. (2) is caused by failing to provision. This provision failure may also affect (1) and generate some random failures.

The provision log: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/100293536-chromeos-test/hostless/debug/

"""
02/09 05:54:31.537 INFO | cros_build_lib:0564| RunCommand: /tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json
02/09 05:54:33.648 DEBUG| reporting_utils:0308| return code: 1; command: /tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json
CommandException: No URLs matched: gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json
cmd=['/tmp/_autotmp_qwpSmL/common/gsutil_4.19.tar.gz/gsutil/gsutil', '-o', 'Boto:num_retries=10', 'cat', 'gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json'], extra env={'BOTO_CONFIG': '/usr/local/google/home/chromeos-test/.boto'}
"""

I tried to run the same command on my desktop ("gsutil -o 'Boto:num_retries=10' cat gs://chromeos-image-archive/reef-release/R58-9267.0.0/metadata.json") and it worked fine. Probably some access issue specific to the lab. Assigned to the infra deputy to check access from the test server.
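For checking access from the test server, a minimal sketch of the comparison is below. It reruns the same gsutil invocation from the log; the check_gs_access helper is hypothetical, and the reading of the error strings in the comments is an assumption, not the provision code's own logic.

import subprocess

GS_URL = ('gs://chromeos-image-archive/reef-release/'
          'R58-9267.0.0/metadata.json')

def check_gs_access(url):
    # Same gsutil invocation the provision code ran, minus the lab's
    # bundled gsutil tarball.
    cmd = ['gsutil', '-o', 'Boto:num_retries=10', 'cat', url]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    _, err = proc.communicate()
    if proc.returncode == 0:
        print('OK: object is readable from this host')
    else:
        # "No URLs matched" suggests the object is missing;
        # "AccessDenied" would point at an ACL problem instead.
        print('FAILED (%d): %s' % (proc.returncode, err.decode().strip()))

check_gs_access(GS_URL)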
Feb 9 2017
The provision log is actually here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/100293577-chromeos-test/chromeos2-row4-rack7-host19/debug

02/09 05:43:20.951 DEBUG| dev_server:1856| Failed to trigger auto-update process on devserver
02/09 05:43:20.952 DEBUG| base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/handler_cleanup?pid=2006&host_name=chromeos2-row4-rack7-host19"''
02/09 05:43:22.410 DEBUG| base_utils:0185| Running 'ssh 100.115.245.200 'curl "http://100.115.245.200:8082/collect_cros_au_log?pid=2006&host_name=chromeos2-row4-rack7-host19"''
02/09 05:43:24.159 DEBUG| dev_server:1688| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row4-rack7-host19/59894268-provision/autoupdate_logs/CrOS_update_chromeos2-row4-rack7-host19_2006.log
02/09 05:43:24.160 DEBUG| dev_server:1955| Exception raised on auto_update attempt #1:
Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 226, in TriggerAU
    chromeos_AU.PostCheckCrOSUpdate()
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 985, in PostCheckCrOSUpdate
    'build' % (self.update_version, self.device.hostname))
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 809, in _VerifyBootExpectations
    raise RootfsUpdateError(rollback_message)
RootfsUpdateError: Build reef-release/R58-9267.0.0 failed to boot on chromeos2-row4-rack7-host19; system rolled back to previous build

And the AU logs can be found here: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack7-host19/59894268-provision/autoupdate_logs/
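For context on what raises this error: conceptually, after the post-install reboot the updater just compares the build the DUT actually booted against the build it installed. A minimal sketch of that check, loosely modeled on _VerifyBootExpectations in chromite's auto_updater.py; the function signature and the lsb-release detail are illustrative assumptions.

class RootfsUpdateError(Exception):
    """The freshly installed build did not survive its first boot."""

def verify_boot_expectations(hostname, expected_build, booted_build):
    # booted_build would be read from the DUT's /etc/lsb-release
    # (CHROMEOS_RELEASE_BUILDER_PATH) after the post-install reboot.
    # If the DUT came back on the old build, the firmware fell back to
    # the previous kernel/rootfs slot, i.e. a rollback.
    if booted_build != expected_build:
        raise RootfsUpdateError(
            'Build %s failed to boot on %s; system rolled back to '
            'previous build' % (expected_build, hostname))

# Example mirroring the failure above:
try:
    verify_boot_expectations('chromeos2-row4-rack7-host19',
                             'reef-release/R58-9267.0.0',
                             'reef-release/R56-9000.50.0')
except RootfsUpdateError as e:
    print(e)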
Feb 9 2017
Looks like the SSH connection is down post-update:

2017/02/09 05:50:40.705 DEBUG| auto_updater:0907| Start pre-setup for stateful update...
2017/02/09 05:50:40.705 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpudfDzJ/testing_rsa root@100.115.227.16 -- sudo stop ap-update-manager
Warning: Permanently added '100.115.227.16' (RSA) to the list of known hosts.
stop: Unknown job: ap-update-manager
Timeout, server 100.115.227.16 not responding.
2017/02/09 05:51:21.141 ERROR| remote_access:0851| Error connecting to device 100.115.227.16
2017/02/09 05:51:21.142 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpudfDzJ/testing_rsa root@100.115.227.16 -- rm -rf /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.8PfuILvL9F
2017/02/09 05:52:24.840 DEBUG| cros_update:0229| Error happens in CrOS auto-update: SSHConnectionError(None,)
Feb 9 2017
I checked chromeos2-row4-rack6-host2 as an example, and my findings are:

1. The host chromeos2-row4-rack6-host2's cros-version label (reef-release/R56-9000.50.0) never changes, which means it can only pass tests like "autoupdate_EndToEndTest_paygen_au_beta_full_9000.50.0", no other versions.
2. Auto-update in repair works, which installs the default "reef-release/R56-9000.50.0".
3. Auto-update in provision of a new release version doesn't work, and fails with one major kind of failure: "Build reef-release/<a new version> failed to boot on chromeos2-row4-rack6-host2". This means the installation process finished, but the host cannot find the right boot partition.

So I suggest we cros flash all affected DUTs (chromeos2-row4-rack6-host2, chromeos2-row4-rack7-host13, chromeos2-row4-rack7-host19) with a newer version to see whether these new builds can be installed on the hosts. Maybe they just have some remaining files from old versions which block them from being updated to new ones.
Feb 9 2017
I created a cautotest job to flash all 3 of those devices to reef-release/R58-9266.0.0: http://cautotest/afe/#tab_id=view_job&object_id=100350958
Feb 9 2017
The flash jobs all failed.
Feb 9 2017
The flash jobs do the same thing as normal provision jobs, so they failed with the same "system rolled back to previous build" error. We may need another way to flash.
Feb 9 2017
Weirdly, cros flash worked...:

$ cros flash chromeos2-row4-rack6-host2.cros remote/reef-release/R58-9266.0.0
14:59:04: NOTICE: Preparing to update the remote device chromeos2-row4-rack6-host2.cros
[##############################################################################################] 100%
15:04:13: NOTICE: Finalizing image.
15:04:55: NOTICE: Update completed.
15:05:20: NOTICE: rebooting device...
15:07:06: NOTICE: Verifying that the device has been updated...
15:07:07: NOTICE: Update performed successfully.
15:07:07: NOTICE: cros flash completed successfully.

$ ssh root@chromeos2-row4-rack6-host2.cros
The authenticity of host 'chromeos2-row4-rack6-host2.cros.corp.google.com (100.115.226.233)' can't be established.
RSA key fingerprint is SHA256:FzZ7jruzIC9aBJbVgOcmpI5d5szMeTPqQsr0doostwE.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'chromeos2-row4-rack6-host2.cros.corp.google.com' (RSA) to the list of known hosts.

localhost ~ # cat /etc/lsb-release
CHROMEOS_RELEASE_APPID={8396029B-FCEF-9EEC-C684-3BCB8E3E9429}
CHROMEOS_BOARD_APPID={8396029B-FCEF-9EEC-C684-3BCB8E3E9429}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
DEVICETYPE=REFERENCE
CHROMEOS_ARC_VERSION=3704772
CHROMEOS_ARC_ANDROID_SDK_VERSION=25
GOOGLE_RELEASE=9266.0.0
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_BUILDER_PATH=reef-release/R58-9266.0.0
....

Xixuan, what does that mean?
Feb 9 2017
Here are the straight provision logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack6-host2/59900910-provision/debug/ They look the same as the previous failure Waihong described.
Feb 9 2017
I will cros flash host13 later to see what happens. It looks like provision has more verification steps than cros flash, and these steps fail.
Feb 10 2017
Findings after further debugging: I cannot reproduce the "system rolled back to previous build" error on either host13 or host19. But I do see the DUTs often lose their SSH connection during some specific steps of provisioning. After adding a retry for SSHConnectionError, the problem seems gone, at least from my workstation: host13 & host19 are successfully provisioned with a local devserver & autotest. So I will upload these fixes; let's see whether this fixes the reef provision problem. (A rough sketch of the retry idea is below.)
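For reference, the retry is conceptually just a bounded retry loop around each SSH-driven provisioning step. A minimal sketch, assuming a chromite-style SSHConnectionError and a hypothetical step callable; the actual fix lives in the chromite auto-update code.

import time

class SSHConnectionError(Exception):
    """Stand-in for the chromite error raised when the DUT drops SSH."""

def run_with_ssh_retry(step, max_attempts=3, delay_secs=10):
    # Retry a provisioning step that may fail transiently while the DUT
    # reboots or briefly drops its SSH connection mid-update.
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except SSHConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_secs)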
Feb 10 2017
"system rolled back to previous build" means that the build crashed
and burned at initial boot. Usually, that means the build is bad.
I've reproduced the failure by the simple expedient of running
"dummy_Pass" on a random DUT, using reef-release/R58-9268.0.0:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=100541116
We need to pass this problem to a sheriff to debug and triage the
crash.
Feb 11 2017
Simran tells me he's overloaded, so on his behalf: waihong@ - Tag, you're it! You should probably pass this to someone who's sheriff for 2/13.
Feb 11 2017
I thought Xixuan was applying a retry fix to try and resolve this?
Feb 11 2017
Oh, after I applied the retry fix locally, the "system rolled back to previous build" error came back. With Richard's help, we found that this may be a problem with the build itself. I will still upload the retry fix so the provision error is more focused (on the rollback). The tracking bug is crbug.com/672919. @Benson will take care of the rollback error.
Feb 11 2017
+aaron +mary We identified several problems with the reef systems deployed here. One is that the repair firmware was positively ancient, at 8992.0; Richard bumped it to 9042.14. Second, according to the logs, the cr50-updater is causing the system to reboot. It's not known how many times it reboots the system shortly after boot, but this could potentially cause a rollback.
Feb 11 2017
I'm going to try to repro this locally on a system using cros flash.
Feb 11 2017
Marking this bug high criticality, as we need a good build soon for factory.
Feb 11 2017
I happen to know why "cros flash" works but the provision job fails.

"cros flash" does the default "old state" stateful update: https://chromium.googlesource.com/chromiumos/platform/dev-util/+/0.11.257.B/stateful_update#78
because it doesn't pass any argument to the stateful_update script: https://chromium.googlesource.com/chromiumos/chromite/+/2a14f951a57771e7d2e7bb8ae424ab88d8c6604f/cros/commands/cros_flash.py#229

However, the provision job does a "clean state" stateful update: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/client/common_lib/cros/autoupdater.py#499

The "clean state" update removes all directories other than dev_image and var_overlay under /mnt/stateful_partition: https://chromium.googlesource.com/chromiumos/platform/init/+/factory-2848.B/chromeos_startup#222

Unfortunately, we keep some state about the cr50 update in /mnt/stateful_partition/encrypted (the file /var/cache/cr50.some_md5sum.state), which gets removed. That triggers reboots (4 times) to construct a new cr50 state file: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/chromeos-base/chromeos-cr50/files/cr50-update.conf#98

We should revisit this reboot (4 times), which happens on every provision job in the lab. (A sketch of the wipe behavior is below.)
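To make the "clean state" wipe concrete, here is a hedged Python sketch of the behavior described above. The real logic is shell code in chromeos_startup; the preserve list and the cr50 state file location come from this thread, everything else is illustrative.

import os
import shutil

STATEFUL = '/mnt/stateful_partition'
PRESERVE = {'dev_image', 'var_overlay'}  # the two directories named above

def clean_state_wipe(stateful=STATEFUL, preserve=PRESERVE):
    # Remove every top-level entry except the preserved directories.
    # Note this also removes the encrypted stateful data, which is where
    # the /var/cache/cr50.<md5sum>.state file lives, so the cr50 updater
    # loses its state and starts over on the next boot.
    for name in os.listdir(stateful):
        if name in preserve:
            continue
        path = os.path.join(stateful, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)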
Feb 11 2017
FWIW, an M57 build passed, despite the odds: https://uberchromegw.corp.google.com/i/chromeos_release/builders/reef-release%20release-R57-9202.B/builds/19 (9202.24).
Feb 13 2017
Vadim, why would we be rebooting 4 times though? If the h1 firmware is up to date it should fall all the way through with a completed status, no?
Feb 13 2017
Added this week's sheriffs to follow up.
Feb 13 2017
I think we need to bump the fw base of pyro and snappy as well, to whatever their fallback version is. That might get us passing builds on all of those boards. It won't fix the general issue of cr50 potentially not liking our build process, though. I fear that bouncing around back and forth with different images and firmware versions will continue to cause problems.
Feb 15 2017
PaygenBuildCanary issue tracked here: issue 692625
PaygenTest autest / provisioning tracked here: crosbug.com/p/62911
Feb 15 2017
so, no matter what the fix is for the 'steady' cr50 image case, there will have to be reboots if the new chrome os image has a newer cr50 image in it. Is this going to be a problem for the lab?
Feb 15 2017
jrbarnette@ can answer. How many reboots? I think 4 reboots is not acceptable. Lab machines will be scheduled to run tests on different channels, so the mismatched cr50 image case may happen several times a day.
Feb 16 2017
Re c#25: you're looking at the wrong code. cros flash uses the code in:
https://cs.corp.google.com/chromeos_public/chromite/cli/flash.py?q=flash&sq=package:%5Echromeos_(internal%7Cpublic)$&l=449
provision also uses the auto-update code in:
https://cs.corp.google.com/chromeos_public/chromite/lib/auto_updater.py?q=auto_updater.py&sq=package:%5Echromeos_(internal%7Cpublic)$&l=1
The code that calls provisioning is:
https://cs.corp.google.com/chromeos_public/src/platform/dev/cros_update.py?q=cros_update&sq=package:%5Echromeos_(internal%7Cpublic)$&l=155
Basically, they use the same code. But you're right that they use different settings of '--stateful_change=clean'. I enabled clobber_stateful=True for cros flash and could still successfully finish cros flash, so '--stateful_change=clean' may not explain why cros flash succeeds but provision fails. (A sketch of the shared call path is below.)
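In other words, both entry points drive the same updater and differ only in how they handle stateful. A hedged sketch of that shape, with ChromiumOSUpdater stubbed out; the constructor arguments here are illustrative, not the exact chromite API.

class ChromiumOSUpdater(object):
    """Stub standing in for chromite.lib.auto_updater.ChromiumOSUpdater."""

    def __init__(self, device, payload_dir, clobber_stateful=False):
        self.device = device
        self.payload_dir = payload_dir
        self.clobber_stateful = clobber_stateful

    def RunUpdate(self):
        mode = 'clean state' if self.clobber_stateful else 'old state'
        print('Updating %s from %s with a "%s" stateful update'
              % (self.device, self.payload_dir, mode))

def do_cros_flash(device, payload_dir):
    # cros flash: default stateful handling ("old state").
    ChromiumOSUpdater(device, payload_dir, clobber_stateful=False).RunUpdate()

def do_provision(device, payload_dir):
    # provision: wipes stateful ("clean state"), per --stateful_change=clean.
    ChromiumOSUpdater(device, payload_dir, clobber_stateful=True).RunUpdate()

do_cros_flash('chromeos2-row4-rack6-host2', '/tmp/payloads')
do_provision('chromeos2-row4-rack6-host2', '/tmp/payloads')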
Feb 16 2017
Re c#34, thanks for correcting it.
The multiple reboots happen when the following two conditions hold:
(1) stateful_change=clean wipes the cr50 state file in /var/cache
(2) the usb_updater command fails to update the cr50 firmware
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/chromeos-base/chromeos-cr50/files/cr50-update.conf#186
I don't know why (2) happens in some cases while other cases seem to work fine.
For example, this one fails:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101475380-chromeos-test/chromeos2-row4-rack6-host3/
Check the /var/log/messages:
2017-02-15T07:22:23.124049-08:00 WARNING kernel: [ 14.165281] init: cr50-update main process (1971) terminated with status 3
And this one works:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row4-rack6-host3/59971470-provision/sysinfo/var/log/
Check the /var/log/messages:
2017-02-14T01:44:33.806180+00:00 NOTICE cr50-update[2554]: Will run /usr/sbin/usb_updater -u /opt/google/cr50/firmware/cr50.bin.prod
2017-02-14T01:44:33.862727+00:00 NOTICE cr50-update[2571]: cr50 is running updated firmware
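For reference, the reboot-until-it-works behavior described in this thread boils down to a retry counter persisted in the state file. A hedged Python sketch of the control flow: the real logic is the upstart shell script in cr50-update.conf, the usb_updater command line is taken from the log above, and the simplified state-file name and the rest are illustrative assumptions.

import os
import subprocess

STATE_FILE = '/var/cache/cr50.state'  # simplified; the real name embeds an md5sum
MAX_REBOOTS = 4                       # the retry limit described in this thread

def cr50_update_attempt():
    attempts = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            attempts = int(f.read().strip() or 0)
    # Command taken from the successful provision log above.
    rc = subprocess.call(['/usr/sbin/usb_updater', '-u',
                          '/opt/google/cr50/firmware/cr50.bin.prod'])
    if rc == 0:
        with open(STATE_FILE, 'w') as f:
            f.write('0')  # steady state: firmware is current
    elif attempts < MAX_REBOOTS:
        with open(STATE_FILE, 'w') as f:
            f.write(str(attempts + 1))
        subprocess.call(['reboot'])  # hope the error clears on the next boot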
Feb 16 2017
> so, no matter what the fix is for the 'steady' cr50
> image case, there will have to be reboots if the new
> chrome os image has a newer cr50 image in it.
>
> Is this going to be a problem for the lab?
We need to figure out first what's a problem for the users.
In particular, a user experience that looks like "I rebooted
to get an update, and then I saw login screen, and then the
system rebooted 4 times" won't work for users. I doubt that
system will be acceptable it reboots even once after login
screen, because users expect to be able to use the system
once login screen is up.
The expected user visible behavior is that any work required
after an update happens while the splash screen is up, before
we start Chrome:
* If the work is notably long running, we need to display
a message to explain the delay.
* If the work will require a reboot at the end, we need
to display a message setting user expectations.
* All the work should be done after a single extra reboot.
* Events of this sort should be relatively rare during the
life of the product.
We already do stuff like this for firmware update, so the basic
code facilities we need to do it already exist.
If the systems will work as described above, the lab can learn
to cope. If the systems can't work this way, this feature
needs review at a higher level for its appropriateness, and
getting the lab to cope could be extra work.
Feb 16 2017
Richard: There's already a bug for displaying a warning message about an impending reboot: https://code.google.com/p/chrome-os-partner/issues/detail?id=58357

I agree, though, that rebooting 4 times is seriously not a good user experience. Going over the update script, that 4-times behavior is how the script deals with unknown errors in the usb_updater. The idea is that rather than dealing with the error in a meaningful way (or even writing the error to the log), the script just crosses its fingers and hopes a reboot will clear the error condition and we'll be better off on the next boot. Aaron and I were talking last night, and we agreed that this whole rebooting-as-a-last-resort approach is just not right and will definitely need to change.
Feb 23 2017
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot