
Issue 686107

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




DUT (chromeos4-row1-rack13-host9) with bad file system failing multiple tests.

Project Member Reported by semenzato@chromium.org, Jan 27 2017

Issue description


https://uberchromegw.corp.google.com/i/chromeos/builders/kip-release/builds/878

01/26 23:12:03.774 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/get_au_status?build_name=kip-release/R58-9225.0.0&force_update=True&pid=16681&host_name=chromeos4-row1-rack13-host9&full_update=False"''
01/26 23:12:05.073 DEBUG|        dev_server:1833| Failed to trigger auto-update process on devserver
01/26 23:12:05.074 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/handler_cleanup?pid=16681&host_name=chromeos4-row1-rack13-host9"''
01/26 23:12:06.533 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/collect_cros_au_log?pid=16681&host_name=chromeos4-row1-rack13-host9"''
01/26 23:12:07.860 DEBUG|        dev_server:1665| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row1-rack13-host9/445669-provision/20172601231104/autoupdate_logs/CrOS_update_chromeos4-row1-rack13-host9_16681.log
01/26 23:12:07.860 DEBUG|        dev_server:1932| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 179, in TriggerAU
    yes=True)
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 663, in __init__
    clobber_stateful=clobber_stateful, yes=yes)
  File "/home/chromeos-test/chromiumos/chromite/lib/auto_updater.py", line 202, in __init__
    self.device_dev_dir = os.path.join(self.device.work_dir, 'src')
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 652, in work_dir
    capture_output=True).output.strip()
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 849, in BaseRunCommand
    return self.GetAgent().RemoteSh(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 336, in RemoteSh
    return cros_build_lib.RunCommand(ssh_cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/cros_build_lib.py", line 625, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpXWremS/testing_rsa root@chromeos4-row1-rack13-host9 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
Warning: Permanently added 'chromeos4-row1-rack13-host9,100.115.192.126' (RSA) to the list of known hosts.
mktemp: failed to create directory via template '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.XXXXXXXXXX': Input/output error


 
Owner: semenzato@chromium.org
Status: WontFix (was: Untriaged)
From the exception above, we failed to create the directory *on the DUT*, not on the devserver.
This may mean that the DUT's filesystem is corrupt, but normal repair flow should take care of that and no action is needed.
Agreed!  I misread the log.  Thanks.
Cc: semenzato@chromium.org
Owner: pprabhu@chromium.org
Status: Assigned (was: WontFix)
Summary: DUT (chromeos4-row1-rack13-host9) with bad file system failing multiple tests. (was: mktemp: input/output error creating directory on devserver)
OTOH, something *is* wrong with that DUT, and it keeps getting cycled back into circulation.

Running on devserver: chromeos4-devserver8

chromeos-test@chromeos4-devserver8:/var/log/devserver$ grep 'mnt/stateful_partition/unencrypted' console.log | grep -i mktemp | grep RunCommandError | awk '{print $20}' | sort | uniq -c | sort -n
      6 'root@chromeos4-row1-rack13-host9',
      6 root@chromeos4-row1-rack13-host9
That DUT was in a provision-repair cycle.
A reset (finally!) this morning seems to have fixed it:

    2017-01-27 07:51:20  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278619-chromeos-test/
    2017-01-27 07:51:02  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446528-reset/
    2017-01-27 07:44:33  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278609-chromeos-test/
    2017-01-27 07:44:17  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446526-reset/
    2017-01-27 07:37:56  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98278034-chromeos-test/
    2017-01-27 07:37:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446521-reset/
    2017-01-27 07:34:40  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446518-repair/
    2017-01-27 07:30:39  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446490-provision/
    2017-01-27 07:30:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446488-repair/
    2017-01-27 07:26:32  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446464-provision/
    2017-01-27 07:26:04  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446460-repair/
    2017-01-27 07:22:25  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446451-provision/
    2017-01-27 07:21:58  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446450-repair/
    2017-01-27 07:18:02  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446449-provision/
    2017-01-27 07:17:37  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446448-repair/
    2017-01-27 07:13:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446443-provision/
    2017-01-27 02:35:03  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446207-repair/
Labels: -Pri-1 Pri-2
Owner: jrbarnette@chromium.org
Even worse, I don't see anything different between one of the earlier repairs and the one just before the reset. So it likely wasn't the repair that got the reset to succeed... we just got (un)lucky.

Also, why are these in the repair / reset logs?
01/27 07:37:45.924 INFO |        server_job:0183| 	INFO	----	cleanup	timestamp=1485531465	localtime=Jan 27 07:37:45	Can't cleanup label 'pool:bvt'.
01/27 07:37:45.925 INFO |        server_job:0183| 	INFO	----	cleanup	timestamp=1485531465	localtime=Jan 27 07:37:45	Can't cleanup label 'board:kip'.
01/27 07:37:45.925 INFO |        server_job:0183| 	INFO	----	cleanup	timestamp=1485531465	localtime=Jan 27 07:37:45	Can't cleanup label 'cros-version:kip-release/R58-9226.0.0'.
I locked the DUT and logged in.  First off, I tried the
commands that were reported as failing during provisioning:
    localhost ~ # mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update
    localhost ~ # mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
    mktemp: failed to create directory via template '/mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.XXXXXXXXXX': Input/output error

... so I searched /var/log/messages for references to mmcblk0,
which turned up a fair amount of stuff like this:
    2017-01-27T15:56:35.979589+00:00 ERR kernel: [  222.023789] end_request: I/O error, dev mmcblk0, sector 6488136
    2017-01-27T15:56:35.979590+00:00 ERR kernel: [  222.023796] Buffer I/O error on device mmcblk0p3, logical block 251401
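
The search amounted to something along these lines (a minimal sketch; the exact invocation isn't preserved in these logs):

    # On the DUT: look for block-device errors from the eMMC (mmcblk0).
    grep mmcblk0 /var/log/messages | grep -i 'I/O error'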

I expect if we power wash the DUT, it will get better for
a while, but maybe it's time just to retire it.

Given that we think the DUT is bad, I've removed it from
the BVT pool:
    balance_pool bvt kip

    Balancing kip bvt pool:
    Total 6 DUTs, 5 working, 1 broken, 0 reserved.
    Target is 6 working DUTs; grow pool by 1 DUTs.
    kip bvt pool has 15 spares available.
    kip bvt pool will return 1 broken DUTs, leaving 0 still in the pool.
    Transferring 1 DUTs from bvt to suites.
    Updating host: chromeos4-row1-rack13-host9.
    Removing labels ['pool:bvt'] from host chromeos4-row1-rack13-host9
    Adding labels ['pool:suites'] to host chromeos4-row1-rack13-host9
    Transferring 1 DUTs from suites to bvt.
    Updating host: chromeos4-row2-rack13-host5.
    Removing labels ['pool:suites'] from host chromeos4-row2-rack13-host5
    Adding labels ['pool:bvt'] to host chromeos4-row2-rack13-host5

I've manually power-washed the DUT.  We can check on the status
of the repair in a bit.

I've just checked the DUT; it shows no evidence of being
power-washed.  That's... weird.

For now, I'm re-running the procedure; I'm going to check
on the intermediate steps to make sure it's all working
as expected.  At some point, either there'll be a clear
violation of the laws of physics, or I'll have a more
plausible error to point to.

Power wash definitely failed.  The basic sequence looks like this
(a sketch of the request step follows the list):
  * Touch the "factory_install_reset" file as required to request a power wash.
  * Reboot to force the power wash.
  * Log in, and check on things.
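
For concreteness, the request step amounts to something like the following.  This is a
minimal sketch: on Chrome OS a power wash is requested by writing into the
"factory_install_reset" file in the stateful partition; the exact contents shown here
("fast safe") are the usual convention and are not taken from this session's logs.

    # On the DUT, as root: request a power wash on the next boot.
    echo "fast safe" > /mnt/stateful_partition/factory_install_reset
    reboot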

At the end of it all, the "factory_install_reset" file is gone, but
the stateful file system is unchanged.  Mostly, I count that as
"not possible", although if the disk is bad, lots of "impossible"
things may suddenly become more likely.
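
A simple way to double-check that stateful really survived the reboot (rather than
being wiped and quickly repopulated) is to drop a sentinel file before requesting the
wash and look for it afterwards.  Just a sketch, with a hypothetical file name:

    # Before requesting the power wash (hypothetical marker file):
    touch /mnt/stateful_partition/powerwash_sentinel
    # ... write factory_install_reset and reboot ...
    # After the reboot: if the sentinel survived, stateful was not wiped.
    test -f /mnt/stateful_partition/powerwash_sentinel && echo "stateful NOT wiped"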

I'm going to try to force an install from USB via servo.  We'll see
where that goes.

The repair failed:
    http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack13-host9/446700-repair/

The results look sufficiently definitive to me; time to decommission
this DUT.

Status: Fixed (was: Assigned)
Filed this ticket:
    https://b2.corp.google.com/issues/34769067

At this point, the DUT has failed repair, and won't (can't) go
back into service for testing.  So, there's nothing more to do
here.

Comment 13 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 14 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61
Status: Verified (was: Fixed)
Tests passed on the host:
https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=6113
