version labeling on Google_Kevin.8785.94.6 broke automated firmware update
Issue description
See also auto-filed issue 677273.
GS bucket and DUT name: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/93082186-chromeos-test/chromeos2-row6-rack5-host3/
stdio link and snippet: https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release/builds/715/steps/HWTest%20%5Bsanity%5D/logs/stdio
chromeos-server22-37: 335e54c068b69410 3
Autotest instance: cautotest
12-28-2016 [04:59:03] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=93082088
@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=93082088@@@
Suite job [ PASSED ]
provision [ FAILED ]
provision FAIL: DUT firmware requires update from Google_Kevin.8785.118.0 to Google_Kevin.8785.94.6, completed successfully
,
Dec 29 2016
This is still making the kevin canary fail. Richard is probably not around. Can someone else fix this?
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/93264700-chromeos-test/chromeos2-row6-rack5-host3/debug/
12/29 05:21:19.857 DEBUG| base_utils:0280| [stdout] CHROMEOS_RELEASE_VERSION=9132.0.0
12/29 05:21:19.857 DEBUG| base_utils:0280| [stdout] CHROMEOS_AUSERVER=https://tools.google.com/service/update2
12/29 05:21:19.894 INFO | server_job:0153| GOOD ---- verify.cros timestamp=1483017679 localtime=Dec 29 05:21:19
12/29 05:21:19.895 INFO | repair:0105| Skipping this operation: All host verification checks pass
12/29 05:21:19.896 DEBUG| repair:0106| The following dependencies failed:
12/29 05:21:19.896 DEBUG| repair:0108| The firmware on this DUT is up-to-date
12/29 05:21:19.896 ERROR| control:0071| DUT firmware requires update from Google_Kevin.8785.118.0 to Google_Kevin.8785.94.6
Traceback (most recent call last):
,
Dec 29 2016
I'll lock that DUT in the meantime so it doesn't cause more issues.
,
Dec 29 2016
Actually, that won't do anything; they're all at that version. To get the release back to green, I'll temporarily set the stable fw version to 8785.118.0 until Richard or someone on the Kevin team can figure out what the expected fw should be.
,
Dec 29 2016
$ ./atest stable_version modify --board kevin/rwfw --version Google_Kevin.8785.118.0
Stable version for board kevin/rwfw is changed from Google_Kevin.8785.94.6 to Google_Kevin.8785.118.0.
,
Dec 30 2016
> $ ./atest stable_version modify --board kevin/rwfw --version Google_Kevin.8785.118.0
> Stable version for board kevin/rwfw is changed from Google_Kevin.8785.94.6 to Google_Kevin.8785.118.0.
Ai Ya! Why was this necessary? It's very likely that this will cause more trouble, not less.
,
Dec 30 2016
All the DUTs in the bvt pool were at that fw version, so the release builders kept failing the hwtest stage. Since the fw stable_version bump, the builder went green:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release
I can change it back, but the builder will go red again (maybe that's ok given this situation?).
,
Dec 30 2016
The firmware assigned to a board must match the firmware bundled
with the current repair image. The current kevin repair image
is R56-9000.35.0, which bundles firmware Google_Kevin.8785.94.6.
One of two things can/will eventually go wrong:
1) A DUT that doesn't have 8785.118.0 will fail, because the
verify/repair sequence can't work if the target firmware isn't
bundled in the repair image.
2) Next Tuesday, the firmware will be automatically reset to
the firmware bundled in the (next) repair image, which, if
it isn't 8785.118.0, will put us back where we were at the
outset.
The fix requires two things:
1) Set the repair image to a build that bundles the firmware
we want (the .118.0 version).
2) Figure out how all those kevins got the .118.0 firmware
in the first place, and fix it so it won't happen again.
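The rule above (assigned firmware must equal the firmware bundled in the repair image) can be sketched as a small check. This is an illustration only: the BUNDLED_FIRMWARE table and check_stable_versions() are hypothetical names, not part of atest or Autotest, and the one table entry is just the pairing stated in this comment.

```python
# Hypothetical table: repair image -> firmware version its bundle declares.
# The single entry is the pairing quoted in this thread.
BUNDLED_FIRMWARE = {
    'R56-9000.35.0': 'Google_Kevin.8785.94.6',
}


def check_stable_versions(repair_image, assigned_firmware,
                          bundled=BUNDLED_FIRMWARE):
    """Return None if the assignment is consistent, else an error string."""
    if repair_image not in bundled:
        return 'no bundled firmware known for %s' % repair_image
    expected = bundled[repair_image]
    if assigned_firmware != expected:
        return ('assigned firmware %s does not match %s bundled in %s'
                % (assigned_firmware, expected, repair_image))
    return None


if __name__ == '__main__':
    # The temporary workaround from this thread violates the rule.
    print(check_stable_versions('R56-9000.35.0', 'Google_Kevin.8785.118.0'))
```

A check like this would only flag the inconsistent assignment; it says nothing about how the DUTs picked up the .118.0 firmware.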
,
Dec 30 2016
It looks like the 8785.94.6 firmware bundle doesn't identify
itself as expected.
Tracking the history of chromeos2-row6-rack5-host3, you see
this:
A) Last firmware check with Google_Kevin.8785.94.4 assigned:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row6-rack5-host3/59294741-reset/
B) First firmware check after assigning Google_Kevin.8785.94.6:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row6-rack5-host3/59294874-reset/
C) Second firmware check after assigning Google_Kevin.8785.94.6:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row6-rack5-host3/59294926-reset/
A) and B) look like I'd expect; A) reports no firmware update, and
B) reports updating to the bundle in the build.
Looking at C, we discover a problem:
12/27 04:12:12.306 INFO | repair:0327| Verifying this condition: The firmware on this DUT is up-to-date
12/27 04:12:12.471 DEBUG| ssh_host:0177| Running (ssh) 'crossystem fwid'
12/27 04:12:12.898 DEBUG| base_utils:0299| [stdout] Google_Kevin.8785.118.0
Here's how the bundle identified itself:
12/27 04:12:12.899 DEBUG| ssh_host:0177| Running (ssh) 'chromeos-firmwareupdate -V'
12/27 04:12:14.383 DEBUG| base_utils:0280| [stdout]
12/27 04:12:14.384 DEBUG| base_utils:0280| [stdout] flashrom(8): fe63e6a6f2431040d9cb7a62fdb6b11d */build/kevin/usr/sbin/flashrom
12/27 04:12:14.384 DEBUG| base_utils:0280| [stdout] ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, for GNU/Linux 2.6.16, BuildID[sha1]=9537c66614f060e027e143d229586f66d4bfc902, stripped
12/27 04:12:14.384 DEBUG| base_utils:0280| [stdout] 0.9.4 : 65be03a : Nov 08 2016 10:51:57 UTC
12/27 04:12:14.385 DEBUG| base_utils:0280| [stdout]
12/27 04:12:14.385 DEBUG| base_utils:0280| [stdout] BIOS image: 02b38affb90cd9c07dd8425bfffb3be7 */build/kevin/tmp/portage/chromeos-base/chromeos-firmware-kevin-0.0.1-r54/work/chromeos-firmware-kevin-0.0.1/.dist/kevin_fw_8785.94.6_8785.118.0.tbz2/image.bin
12/27 04:12:14.385 DEBUG| base_utils:0280| [stdout] BIOS version: Google_Kevin.8785.94.6
12/27 04:12:14.385 DEBUG| base_utils:0280| [stdout] EC image: 17a2133f4872e7da717bfaaba8026baa */build/kevin/tmp/portage/chromeos-base/chromeos-firmware-kevin-0.0.1-r54/work/chromeos-firmware-kevin-0.0.1/.dist/kevin_ec_8785.94.6_8785.118.0.tbz2/ec.bin
12/27 04:12:14.385 DEBUG| base_utils:0280| [stdout] EC version: kevin_v1.10.116-b2d1ab0
The bundle plainly calls itself Google_Kevin.8785.94.6, but the
bits seem to think they're really 8785.118.0. Hence the trouble.
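The comparison made in the log above can be sketched as follows: pull the "BIOS version:" line out of the `chromeos-firmwareupdate -V` output and compare it to what `crossystem fwid` reports. The function names are made up for illustration and are not Autotest APIs; the sample text is trimmed from the log lines quoted above, with the long image paths left out.

```python
import re

# Trimmed from the [stdout] lines in the log above (image lines omitted).
SAMPLE_UPDATER_OUTPUT = """\
BIOS version: Google_Kevin.8785.94.6
EC version:   kevin_v1.10.116-b2d1ab0
"""


def bundled_bios_version(updater_output):
    """Return the value of the 'BIOS version:' line, or None if absent."""
    match = re.search(r'^BIOS version:[ \t]*(\S+)', updater_output,
                      re.MULTILINE)
    return match.group(1) if match else None


def firmware_mismatch(fwid, updater_output):
    """True if the running fwid differs from what the bundle declares."""
    bundled = bundled_bios_version(updater_output)
    if bundled is None:
        raise ValueError('no "BIOS version:" line in updater output')
    return bundled != fwid.strip()


if __name__ == '__main__':
    # fwid as reported by 'crossystem fwid' in job 59294926-reset.
    print(firmware_mismatch('Google_Kevin.8785.118.0', SAMPLE_UPDATER_OUTPUT))
```

With the values from this DUT the two strings differ, which is the condition that keeps triggering the "requires update" failure.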
,
Dec 30 2016
The reason that this problem showed up now is that as of R57-9129.0.0, kevin bundles Google_Kevin.8785.122.0. That changed the failure mode.
For now, this is the fix:
$ atest stable_version modify -b kevin -i R57-9135.0.0
Stable version for board kevin is changed from R56-9000.35.0 to R57-9135.0.0.
$ atest stable_version modify -b kevin/rwfw -i Google_Kevin.8785.122.0
Stable version for board kevin/rwfw is changed from Google_Kevin.8785.118.0 to Google_Kevin.8785.122.0.
The full root cause can wait until later. Someone who understands firmware builds needs to weigh in with an explanation.
,
Jan 9 2017
Ultimately, this is a bug in the firmware bundle for Google_Kevin.8785.94.6. I think that means the firmware team has to own the problem.
,
Feb 15 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Feb 15 2018
AFAIK, whatever firmware update process caused this problem in
the first place could still happen again. In that case, some
day in the future, some new hardware model will suffer the same
fate. That is, all testing for the model will suddenly start
failing until developers intervene and work around the problem.
So, our options are:
A) Go figure out how to ensure that the version strings
printed by "chromeos-firmwareupdate -V" will always match
the version string reported by "crossystem fwid" after the
firmware is installed.
B) Continue to ignore this, and find out (the hard way) how
long it takes until this problem causes another lab outage.
Option A) ain't free, and the cost of option B) is offset by
the event being unlikely. So, if we judge that the expected
cost of option B) is cheaper, we can just close this as WontFix.
Comment 1 by kevcheng@chromium.org
, Dec 28 2016