HWTest [sanity] failure on tricky-chrome-pfq: bad SSD in chromeos4-row2-rack4-host15
Issue description

tricky-chrome-pfq failed last night on HWTest [sanity]:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=3284873
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8925917399675732544/+/steps/HWTest__sanity_/0/stdout

    01:07:45: WARNING: Exception is not retriable
    return code: 3; command: /b/swarming/wyPJZSD/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --[etc]
    Triggered task: tricky-chrome-pfq/R73-11481.0.0-rc1-sanity chromeos-golo-server2-92: 420cdb7c843e0c10 3
    Autotest instance created: cautotest-prod
    12-28-2018 [00:59:10] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=271348045
    @@@STEP_LINK@Link to suite@http://cautotest-prod/afe/#tab_id=view_job&object_id=271348045@@@
    12-28-2018 [01:07:05] Suite job is finished.
    12-28-2018 [01:07:05] Start collecting test results and dump them to json.
    Suite job  [ PASSED ]
    provision  [ FAILED ]
    provision    FAIL: Download and install failed from chromeos4-devserver5.cros.corp.google.com onto chromeos4-row2-rack4-host15: command execution error

From the provision_AutoUpdate.ERROR log in https://stainless.corp.google.com/browse/chromeos-autotest-results/271348205-chromeos-test/ :

    12/28 01:01:07.047 ERROR| utils:0287| [stderr] [1227/235737:INFO:update_engine_client.cc(508)] Querying Update Engine status...
    12/28 01:01:29.273 ERROR| utils:0287| [stderr] cat: /tmp/sysinfo/autoserv-3fnhDK/.checksum: No such file or directory
    12/28 01:03:14.281 ERROR| utils:0287| [stderr] mux_client_request_session: read from master failed: Broken pipe
    12/28 01:03:20.107 ERROR| utils:0287| [stderr] [1228/010319:INFO:update_engine_client.cc(508)] Querying Update Engine status...
    12/28 01:04:27.382 ERROR| autoupdater:0889| quick-provision script failed; will fall back to update_engine.
    Traceback (most recent call last):
      File "/usr/local/autotest/server/cros/autoupdater.py", line 882, in _install_via_quick_provision
        self._run(command)
      File "/usr/local/autotest/server/cros/autoupdater.py", line 356, in _run
        return self.host.run(cmd, *args, **kwargs)
      File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
        return self.run_very_slowly(*args, **kwargs)
      File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
        ssh_failure_retry_ok)
      File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
        raise error.AutoservRunError("command execution error", result)
    AutoservRunError: command execution error
    * Command: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_wQLndDssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos4-row2-rack4-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_install_via_quick_provision|_run|run] -> ssh_run(/bin/bash /tmp/quick-provision --noreboot tricky-chrome-pfq/R73-11481.0.0-rc1 http://100.115.219.133:8082/static)\";fi; /bin/bash /tmp/quick-provision --noreboot tricky-chrome-pfq/R73-11481.0.0-rc1 http://100.115.219.133:8082/static"
    Exit status: 1
    Duration: 65.7212760448

Assigning to hardware deputy pprabhu@ (who also owns Issue 882152, which has the same traceback for a failure in _install_via_quick_provision).
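For reference, the failing step should be reproducible by hand. A minimal sketch, reusing the host and quick-provision invocation from the traceback above (assumes the script is still at /tmp/quick-provision on the DUT and the devserver at 100.115.219.133:8082 is still serving the payload):

    # Re-run the failing quick-provision step directly against the DUT.
    ssh root@chromeos4-row2-rack4-host15 \
        "/bin/bash /tmp/quick-provision --noreboot tricky-chrome-pfq/R73-11481.0.0-rc1 http://100.115.219.133:8082/static"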
Comment 1 by pprabhu@chromium.org, Dec 28
Looking at the quick-provision.log in the logs above, filesystem hash verification failed in postinst:

    [1228/010426.449473:ERROR:chromeos_verity.cc(280)] Filesystem hash verification failed
    [1228/010426.450787:ERROR:chromeos_verity.cc(281)] Expected 3e5ce7bac0b7e1c526c40be0608d1deec4c9b34f != actual a7ca88b293618a46b2d7cd7548484886a62d1465
    ...
    2018-12-28 01:04:26-08:00 ERROR: FATAL: postinst failed.
    2018-12-28 01:04:26-08:00 INFO: Updated status: FATAL: postinst failed.

So either the ChromeOS image or the DUT's SSD is bad; more likely the latter.

DUT in question: chromeos4-row2-rack4-host15. Looking at its task history:

    test ... test ... test
    2018-12-28 02:42:51  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3945081-provision/
    2018-12-28 02:29:30  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944936-repair/
    2018-12-28 02:24:16  --  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944889-provision/
    test ... test ... test
    2018-12-28 01:12:47  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944163-provision/
    2018-12-28 01:05:17  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944086-repair/
    2018-12-28 01:00:53  --  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944055-provision/
    2018-12-27 21:58:57  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942769-repair/
    2018-12-27 21:51:54  --  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942721-provision/
    2018-12-27 21:45:34  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942658-repair/
    2018-12-27 21:41:50  --  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942624-provision/
    test ... test ... test
    2018-12-27 20:07:44  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942166-provision/
    2018-12-27 13:59:07  OK  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3939758-repair/
    2018-12-27 13:55:23  --  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3939721-provision/

So the DUT does get provisioned and runs quite a few tests, but it keeps falling into short provision-repair loops. Need to dig into some of the failed provision logs here.
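To help tell a bad image apart from a bad drive, a read-only surface scan of the SSD is a cheap first check. A sketch, assuming shell access to the DUT, that badblocks is present on the test image, and that the internal drive enumerates as /dev/sda:

    # Read-only scan for unreadable sectors (-s shows progress, -v is verbose).
    ssh root@chromeos4-row2-rack4-host15 "badblocks -sv /dev/sda"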
Dec 28
Yep, each of those failed provisions is due to the postinst filesystem hash verification failure. Here are the OS versions for the provision attempts pasted above (reverse chronological):

    PASS  tricky-release/R72-11316.50.0
    FAIL  tricky-release/R72-11316.50.0
    PASS  tricky-release/R71-11151.85.0
    FAIL  tricky-chrome-pfq/R73-11481.0.0-rc1
    FAIL  tricky-release/R73-11480.0.0
    FAIL  tricky-release/R73-11480.0.0
    PASS  tricky-chrome-pfq/R73-11479.0.0-rc1
    FAIL  tricky-release/R73-11479.0.0

Nothing distinguishes the repair tasks that left the DUT in a state where the following provision passed from those where it failed. Some repairs do show a filesystem error in status.log:

    Saw file system error: [    5.012493] EXT4-fs error (device sda1): __ext4_get_inode_loc:3769: inode #407977: block 1574058: comm find: unable to read itable block

Overall, this looks like a case of a DUT with a bad SSD.
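A quick way to confirm the drive itself is throwing read errors, while the DUT is still reachable (a sketch; the grep pattern matches the EXT4 error seen in status.log above):

    # Look for filesystem/block-layer read errors in the DUT's kernel log.
    ssh root@chromeos4-row2-rack4-host15 "dmesg | grep -E 'EXT4-fs error|I/O error'"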
Dec 28
Locked the DUT to remove it from the fleet:
pprabhu@pprabhu:chromiumos$ atest host mod -l -r 'bad SSD crbug/918153' chromeos4-row2-rack4-host15
Locked host:
chromeos4-row2-rack4-host15
Filed b/122106764 to get the DUT removed / replaced.
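For the record, once b/122106764 lands a new SSD, the lock can be lifted the same way. A hypothetical follow-up, assuming this autotest deployment's atest host mod supports the matching unlock flag:

    # Return the DUT to the fleet after the SSD swap (flag assumed, not verified here).
    atest host mod -u chromeos4-row2-rack4-host15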
Dec 31
pprabhu@ - Since b/122106764 is tracking the device replacement, and tricky-chrome-pfq has had 6 green runs since this was filed, I'm marking this "Fixed". Please chime in if there's anything else to add here.