
Issue 918153

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 31
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




HWTest [sanity] failure on tricky-chrome-pfq : bad SSD in chromeos4-row2-rack4-host15

Project Member Reported by glevin@chromium.org, Dec 28

Issue description

tricky-chrome-pfq failed last night on HWTest [sanity]:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=3284873

https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8925917399675732544/+/steps/HWTest__sanity_/0/stdout

01:07:45: WARNING: Exception is not retriable return code: 3; command: /b/swarming/wyPJZSD/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --[etc]
Triggered task: tricky-chrome-pfq/R73-11481.0.0-rc1-sanity
chromeos-golo-server2-92: 420cdb7c843e0c10 3
  Autotest instance created: cautotest-prod
  12-28-2018 [00:59:10] Created suite job: http://cautotest-prod/afe/#tab_id=view_job&object_id=271348045
  @@@STEP_LINK@Link to suite@http://cautotest-prod/afe/#tab_id=view_job&object_id=271348045@@@
  12-28-2018 [01:07:05] Suite job is finished.
  12-28-2018 [01:07:05] Start collecting test results and dump them to json.
  Suite job   [ PASSED ]
  provision   [ FAILED ]
  provision     FAIL: Download and install failed from chromeos4-devserver5.cros.corp.google.com onto chromeos4-row2-rack4-host15: command execution error


From the provision_AutoUpdate.ERROR log in
https://stainless.corp.google.com/browse/chromeos-autotest-results/271348205-chromeos-test/ :

12/28 01:01:07.047 ERROR|             utils:0287| [stderr] [1227/235737:INFO:update_engine_client.cc(508)] Querying Update Engine status...
12/28 01:01:29.273 ERROR|             utils:0287| [stderr] cat: /tmp/sysinfo/autoserv-3fnhDK/.checksum: No such file or directory
12/28 01:03:14.281 ERROR|             utils:0287| [stderr] mux_client_request_session: read from master failed: Broken pipe
12/28 01:03:20.107 ERROR|             utils:0287| [stderr] [1228/010319:INFO:update_engine_client.cc(508)] Querying Update Engine status...
12/28 01:04:27.382 ERROR|       autoupdater:0889| quick-provision script failed; will fall back to update_engine.
Traceback (most recent call last):
  File "/usr/local/autotest/server/cros/autoupdater.py", line 882, in _install_via_quick_provision
    self._run(command)
  File "/usr/local/autotest/server/cros/autoupdater.py", line 356, in _run
    return self.host.run(cmd, *args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_wQLndDssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
    chromeos4-row2-rack4-host15 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::_install_via_quick_provision|_run|run] ->
    ssh_run(/bin/bash /tmp/quick-provision --noreboot tricky-chrome-
    pfq/R73-11481.0.0-rc1 http://100.115.219.133:8082/static)\";fi; /bin/bash
    /tmp/quick-provision --noreboot tricky-chrome-pfq/R73-11481.0.0-rc1
    http://100.115.219.133:8082/static"
Exit status: 1
Duration: 65.7212760448


Assigning to hardware deputy pprabhu@ (who also owns Issue 882152, which has the same traceback for a failure in _install_via_quick_provision).
 
Labels: Hotlist-Deputy
Likely unrelated to issue 882152, which was fixed via a devserver push in November.
Status: Started (was: Assigned)
Looking at the quick-provision.log in the logs above, filesystem hash verification failed in postinst:

[1228/010426.449473:ERROR:chromeos_verity.cc(280)] Filesystem hash verification failed
[1228/010426.450787:ERROR:chromeos_verity.cc(281)] Expected 3e5ce7bac0b7e1c526c40be0608d1deec4c9b34f != actual a7ca88b293618a46b2d7cd7548484886a62d1465
...
2018-12-28 01:04:26-08:00 ERROR: FATAL: postinst failed.
2018-12-28 01:04:26-08:00 INFO: Updated status: FATAL: postinst failed.

So, either the ChromeOS image or the DUT's SSD is bad.
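For anyone triaging similar failures, here is a minimal sketch of pulling the expected/actual hashes out of a quick-provision.log. The regex is modeled only on the single chromeos_verity.cc line quoted above, so the log format is an assumption, and `find_hash_mismatches` is a hypothetical helper name, not part of autotest.

```python
import re

# Pattern based on the one chromeos_verity.cc mismatch line quoted in this
# bug; the exact log format is an assumption drawn from that single sample.
MISMATCH_RE = re.compile(
    r"Expected (?P<expected>[0-9a-f]+) != actual (?P<actual>[0-9a-f]+)"
)


def find_hash_mismatches(log_text):
    """Return (expected, actual) hash pairs for each mismatch line found."""
    return [
        (m.group("expected"), m.group("actual"))
        for m in MISMATCH_RE.finditer(log_text)
    ]


# The two log lines from this bug, used as sample input.
sample = (
    "[1228/010426.449473:ERROR:chromeos_verity.cc(280)] "
    "Filesystem hash verification failed\n"
    "[1228/010426.450787:ERROR:chromeos_verity.cc(281)] Expected "
    "3e5ce7bac0b7e1c526c40be0608d1deec4c9b34f != actual "
    "a7ca88b293618a46b2d7cd7548484886a62d1465\n"
)

mismatches = find_hash_mismatches(sample)
for expected, actual in mismatches:
    print("expected", expected, "actual", actual)
```

If the "actual" hash differs between repeated attempts at the same build, that would be one more hint pointing at storage rather than the image, though that comparison was not done here.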

More likely the latter.
DUT in question: chromeos4-row2-rack4-host15

Looking at its task history:

test ... test  ... test

    2018-12-28 02:42:51  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3945081-provision/
    2018-12-28 02:29:30  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944936-repair/
    2018-12-28 02:24:16  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944889-provision/

test ... test ... test

    2018-12-28 01:12:47  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944163-provision/
    2018-12-28 01:05:17  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944086-repair/
    2018-12-28 01:00:53  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3944055-provision/
    2018-12-27 21:58:57  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942769-repair/
    2018-12-27 21:51:54  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942721-provision/
    2018-12-27 21:45:34  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942658-repair/
    2018-12-27 21:41:50  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942624-provision/

test ... test ... test
    2018-12-27 20:07:44  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3942166-provision/
    2018-12-27 13:59:07  OK https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3939758-repair/
    2018-12-27 13:55:23  -- https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row2-rack4-host15/3939721-provision/


So, it seems to be getting provisioned and running quite a few tests, but it does go into short provision-repair loops each time.
Need to dig into some of the failed provision logs here.
Yep, each of those failed provisions is due to postinst filesystem hash verification failure.
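The provision-repair pattern in the task history can be spotted mechanically. This is a rough sketch that assumes the history lines look like the ones pasted above (timestamp, then "OK" or "--", then a results URL ending in "<id>-provision/" or "<id>-repair/") and that "--" marks a failed task. `parse_tasks` and `failed_provisions_followed_by_repair` are hypothetical names for illustration.

```python
import re

# Line shape assumed from the task history pasted in this bug.
TASK_RE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<status>OK|--)\s+"
    r"\S*/(?P<id>\d+)-(?P<kind>provision|repair)/"
)


def parse_tasks(history_text):
    """Return (time, kind, ok) tuples sorted oldest first."""
    tasks = []
    for line in history_text.splitlines():
        m = TASK_RE.search(line)
        if m:
            ok = m.group("status") == "OK"
            tasks.append((m.group("time"), m.group("kind"), ok))
    tasks.sort()  # timestamps are zero-padded, so string order is time order
    return tasks


def failed_provisions_followed_by_repair(tasks):
    """Return timestamps of failed provisions whose next task is a repair."""
    loops = []
    for current, following in zip(tasks, tasks[1:]):
        time, kind, ok = current
        if kind == "provision" and not ok and following[1] == "repair":
            loops.append(time)
    return loops


base = ("https://stainless.corp.google.com/browse/chromeos-autotest-results/"
        "hosts/chromeos4-row2-rack4-host15/")
sample = "\n".join([
    "    2018-12-28 01:12:47  OK " + base + "3944163-provision/",
    "    2018-12-28 01:05:17  OK " + base + "3944086-repair/",
    "    2018-12-28 01:00:53  -- " + base + "3944055-provision/",
])

loops = failed_provisions_followed_by_repair(parse_tasks(sample))
print(loops)
```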

Here's the OS version for each of the provision attempts pasted above (in reverse chronological order):


PASS tricky-release/R72-11316.50.0
FAIL tricky-release/R72-11316.50.0

PASS tricky-release/R71-11151.85.0
FAIL tricky-chrome-pfq/R73-11481.0.0-rc1
FAIL tricky-release/R73-11480.0.0
FAIL tricky-release/R73-11480.0.0
 
PASS tricky-chrome-pfq/R73-11479.0.0-rc1
FAIL tricky-release/R73-11479.0.0
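One way to read the list above is to group the outcomes by build. The sketch below hard-codes the eight attempts listed in this comment; the helper name `outcomes_by_build` is made up for illustration. A build that both passed and failed on the same DUT, or failures spread across several unrelated builds, would suggest the DUT rather than any one image, though that is an inference and not proof.

```python
from collections import defaultdict

# The provision attempts listed in this comment, as (outcome, build) pairs.
attempts = [
    ("PASS", "tricky-release/R72-11316.50.0"),
    ("FAIL", "tricky-release/R72-11316.50.0"),
    ("PASS", "tricky-release/R71-11151.85.0"),
    ("FAIL", "tricky-chrome-pfq/R73-11481.0.0-rc1"),
    ("FAIL", "tricky-release/R73-11480.0.0"),
    ("FAIL", "tricky-release/R73-11480.0.0"),
    ("PASS", "tricky-chrome-pfq/R73-11479.0.0-rc1"),
    ("FAIL", "tricky-release/R73-11479.0.0"),
]


def outcomes_by_build(attempts):
    """Map each build to the set of outcomes seen for it."""
    result = defaultdict(set)
    for outcome, build in attempts:
        result[build].add(outcome)
    return dict(result)


by_build = outcomes_by_build(attempts)

# Builds that both passed and failed on this DUT.
mixed = sorted(b for b, seen in by_build.items() if seen == {"PASS", "FAIL"})
# Distinct builds with at least one failure.
failing = sorted(b for b, seen in by_build.items() if "FAIL" in seen)

print("passed and failed:", mixed)
print("builds with a failure:", len(failing))
```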

There is nothing special about the repair tasks after which the following provision passed, compared with those after which the following provision failed.

Some repairs show:
Saw file system error: [    5.012493] EXT4-fs error (device sda1): __ext4_get_inode_loc:3769: inode #407977: block 1574058: comm find: unable to read itable block

in status.log
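A quick way to check other repair logs for the same symptom is to scan for EXT4-fs error lines. This sketch assumes the kernel message shape seen in the single line quoted above; `find_ext4_errors` is a hypothetical helper, not an existing autotest function.

```python
import re

# Shape assumed from the one EXT4-fs error line quoted in this bug:
#   EXT4-fs error (device <dev>): <function>:<line>: <detail>
EXT4_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>\w+)\): "
    r"(?P<function>\w+):(?P<line>\d+): (?P<detail>.*)"
)


def find_ext4_errors(log_text):
    """Return a dict of parsed fields for each EXT4-fs error line."""
    return [m.groupdict() for m in EXT4_RE.finditer(log_text)]


sample = (
    "Saw file system error: [    5.012493] EXT4-fs error (device sda1): "
    "__ext4_get_inode_loc:3769: inode #407977: block 1574058: comm find: "
    "unable to read itable block\n"
)

errors = find_ext4_errors(sample)
for err in errors:
    print(err["device"], err["function"], err["detail"])
```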

Overall, this appears to be a case of a DUT with a bad SSD.
Locked DUT to remove from fleet.

pprabhu@pprabhu:chromiumos$ atest host mod -l -r 'bad SSD crbug/918153' chromeos4-row2-rack4-host15
Locked host:
        chromeos4-row2-rack4-host15

Filed b/122106764 to get the DUT removed / replaced.
Owner: ----
Status: ExternalDependency (was: Started)
Summary: HWTest [sanity] failure on tricky-chrome-pfq : bad SSD in chromeos4-row2-rack4-host15 (was: HWTest [sanity] failure on tricky-chrome-pfq : quick_provision failure)
Owner: pprabhu@chromium.org
Status: Fixed (was: ExternalDependency)
pprabhu@ - Since b/122106764 is tracking the device replacement, and tricky-chrome-pfq has had 6 green runs since this was filed, I'm marking this "Fixed".
Please chime in if there's anything else to add here.
