DUT (chromeos4-row1-rack13-host9) with bad file system failing multiple tests.
Issue description
https://uberchromegw.corp.google.com/i/chromeos/builders/kip-release/builds/878

01/26 23:12:03.774 DEBUG| base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/get_au_status?build_name=kip-release/R58-9225.0.0&force_update=True&pid=16681&host_name=chromeos4-row1-rack13-host9&full_update=False"''
01/26 23:12:05.073 DEBUG| dev_server:1833| Failed to trigger auto-update process on devserver
01/26 23:12:05.074 DEBUG| base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/handler_cleanup?pid=16681&host_name=chromeos4-row1-rack13-host9"''
01/26 23:12:06.533 DEBUG| base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/collect_cros_au_log?pid=16681&host_name=chromeos4-row1-rack13-host9"''
01/26 23:12:07.860 DEBUG| dev_server:1665| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row1-rack13-host9/445669-provision/20172601231104/autoupdate_logs/CrOS_update_chromeos4-row1-rack13-host9_16681.log
01/26 23:12:07.860 DEBUG| dev_server:1932| Exception raised on auto_update attempt #1:
Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 179, in TriggerAU
    yes=True)
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 663, in __init__
    clobber_stateful=clobber_stateful, yes=yes)
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 202, in __init__
    self.device_dev_dir = os.path.join(self.device.work_dir, 'src')
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 652, in work_dir
    capture_output=True).output.strip()
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 849, in BaseRunCommand
    return self.GetAgent().RemoteSh(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 336, in RemoteSh
    return cros_build_lib.RunCommand(ssh_cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/cros_build_lib.py", line 625, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpXWremS/testing_rsa root@chromeos4-row1-rack13-host9 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
Warning: Permanently added 'chromeos4-row1-rack13-host9,100.115.192.126' (RSA) to the list of known hosts.
mktemp: failed to create directory via template '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.XXXXXXXXXX': Input/output error
Jan 27 2017
Agreed! I misread the log. Thanks.
Jan 27 2017
OTOH, something *is* wrong with that DUT, and it's getting circulated back into supply.
Running on devserver: chromeos4-devserver8
chromeos-test@chromeos4-devserver8:/var/log/devserver$ grep 'mnt/stateful_partition/unencrypted' console.log | grep -i mktemp | grep RunCommandError | awk '{print $20}' | sort | uniq -c | sort -n
6 'root@chromeos4-row1-rack13-host9',
6 root@chromeos4-row1-rack13-host9
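(Aside: the `awk '{print $20}'` step is column-position-dependent, which is why the same host shows up twice above, once quoted and once not. A sketch of a variant that extracts the `root@<host>` token directly instead; the log path and line format are assumptions taken from the one-liner above:)

```shell
# Count RunCommandError/mktemp provisioning failures per DUT in the
# devserver log. Extracting the "root@<host>" token directly avoids
# depending on a fixed awk column (which shifts if the ssh options
# change, and picks up stray quotes). Path and format are assumptions.
count_failing_hosts() {
  local log="${1:-/var/log/devserver/console.log}"
  grep 'mnt/stateful_partition/unencrypted' "$log" \
    | grep -i mktemp \
    | grep RunCommandError \
    | grep -o 'root@[A-Za-z0-9._-]*' \
    | sort | uniq -c | sort -n
}
```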
Jan 27 2017
That DUT was in a provision-repair cycle.
A reset (finally!) this morning seems to have fixed it:
2017-01-27 07:51:20 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278619-chromeos-test/
2017-01-27 07:51:02 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446528-reset/
2017-01-27 07:44:33 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278609-chromeos-test/
2017-01-27 07:44:17 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446526-reset/
2017-01-27 07:37:56 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278034-chromeos-test/
2017-01-27 07:37:36 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446521-reset/
2017-01-27 07:34:40 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446518-repair/
2017-01-27 07:30:39 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446490-provision/
2017-01-27 07:30:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446488-repair/
2017-01-27 07:26:32 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446464-provision/
2017-01-27 07:26:04 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446460-repair/
2017-01-27 07:22:25 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446451-provision/
2017-01-27 07:21:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446450-repair/
2017-01-27 07:18:02 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446449-provision/
2017-01-27 07:17:37 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446448-repair/
2017-01-27 07:13:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446443-provision/
2017-01-27 02:35:03 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446207-repair/
Jan 27 2017
Even worse, I don't see anything different between one of the earlier repairs and the one just before the reset. So likely, it wasn't the repair that triggered the reset... We just got (un)lucky. Also, why are these in the repair / reset logs?
01/27 07:37:45.924 INFO | server_job:0183| INFO ---- cleanup timestamp=1485531465 localtime=Jan 27 07:37:45 Can't cleanup label 'pool:bvt'.
01/27 07:37:45.925 INFO | server_job:0183| INFO ---- cleanup timestamp=1485531465 localtime=Jan 27 07:37:45 Can't cleanup label 'board:kip'.
01/27 07:37:45.925 INFO | server_job:0183| INFO ---- cleanup timestamp=1485531465 localtime=Jan 27 07:37:45 Can't cleanup label 'cros-version:kip-release/R58-9226.0.0'.
Jan 27 2017
I locked the DUT and logged on. First off, I tried the
commands that were reported failing during provisioning:
localhost ~ # mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update
localhost ~ # mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
mktemp: failed to create directory via template '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.XXXXXXXXXX': Input/output error
... so I searched /var/log/messages for references to mmcblk0,
which turned up a fair amount of stuff like this:
2017-01-27T15:56:35.979589+00:00 ERR kernel: [ 222.023789] end_request: I/O error, dev mmcblk0, sector 6488136
2017-01-27T15:56:35.979590+00:00 ERR kernel: [ 222.023796] Buffer I/O error on device mmcblk0p3, logical block 251401
I expect if we power wash the DUT, it will get better for
a while, but maybe it's time just to retire it.
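(The two manual checks above are easy to bundle into one quick triage step for other suspect DUTs. A sketch, assuming the same stateful path and `/var/log/messages` location; `probe_stateful` is a hypothetical helper, not an existing tool:)

```shell
# Hypothetical triage helper: run the same mkdir/mktemp sequence the
# provision step uses, then count mmcblk0 I/O errors in the syslog.
# Both default paths are assumptions taken from the logs above.
probe_stateful() {
  local dir="${1:-/mnt/stateful_partition/unencrypted/preserve/cros-update}"
  local log="${2:-/var/log/messages}"
  local tmp
  if mkdir -p "$dir" && tmp=$(mktemp -d "--tmpdir=$dir" 2>/dev/null); then
    rmdir "$tmp"
    echo "stateful writable"
  else
    echo "stateful NOT writable"
  fi
  # A healthy DUT should report 0 here.
  echo "mmcblk0 I/O errors: $(grep -c 'I/O error, dev mmcblk0' "$log" 2>/dev/null || true)"
}
```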
Jan 27 2017
Given that we think the DUT is bad, I've removed it from
the BVT pool:
balance_pool bvt kip
Balancing kip bvt pool:
Total 6 DUTs, 5 working, 1 broken, 0 reserved.
Target is 6 working DUTs; grow pool by 1 DUTs.
kip bvt pool has 15 spares available.
kip bvt pool will return 1 broken DUTs, leaving 0 still in the pool.
Transferring 1 DUTs from bvt to suites.
Updating host: chromeos4-row1-rack13-host9.
Removing labels ['pool:bvt'] from host chromeos4-row1-rack13-host9
Adding labels ['pool:suites'] to host chromeos4-row1-rack13-host9
Transferring 1 DUTs from suites to bvt.
Updating host: chromeos4-row2-rack13-host5.
Removing labels ['pool:suites'] from host chromeos4-row2-rack13-host5
Adding labels ['pool:bvt'] to host chromeos4-row2-rack13-host5
Jan 27 2017
I've manually power-washed the DUT. We can check on the status of the repair in a bit.
Jan 27 2017
I've just checked the DUT; it shows no evidence of being power-washed. That's... weird. For now, I'm re-running the procedure; I'm going to check on the intermediate steps to make sure it's all working as expected. At some point, either there'll be a clear violation of the laws of physics, or I'll have a more plausible error to point to.
Jan 27 2017
Power wash definitely failed. The basic sequence looks like this:
* Touch the "factory_install_reset" file to request the power wash.
* Reboot to force the power wash.
* Log in, and check on things.
At the end of it all, the "factory_install_reset" file is gone, but the stateful file system is unchanged. Mostly, I count that as "not possible", although if the disk is bad, lots of "impossible" things may suddenly become more likely. I'm going to try to force an install from USB via servo. We'll see where that goes.
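(For reference, the manual power-wash request amounts to writing the flag file and rebooting. A sketch with the stateful root parameterized; the "fast safe" keywords are an assumption about what clobber-state honors on this release, and an empty touched file also requests a wash on builds I've seen:)

```shell
# Sketch of the manual power-wash request described above. On reboot,
# clobber-state sees the flag file, wipes the stateful partition, and
# removes the file -- which is why the file's disappearance alone does
# not prove the wipe actually ran. "fast safe" keywords are assumed.
request_powerwash() {
  local stateful="${1:-/mnt/stateful_partition}"
  echo "fast safe" > "${stateful}/factory_install_reset"
  # reboot   # uncomment on a real DUT; the wipe happens at next boot
}
```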
Jan 27 2017
The repair failed:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446700-repair/
The results look sufficiently definitive to me; time to decommission
this DUT.
Jan 27 2017
Filed this ticket:
https://b2.corp.google.com/issues/34769067
At this point, the DUT has failed repair, and won't (can't) go
back into service for testing. So, there's nothing more to do
here.
Aug 1 2017
Tests passed on the host: https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=6113
Comment 1 by pprabhu@chromium.org, Jan 27 2017
Status: WontFix (was: Untriaged)