Repeated chromeos4-row5-rack13-host5 failures
Issue description
Not investigated in detail, but a flake with this DUT has failed 2 CQ runs in a row with the same error.
START ---- provision timestamp=1507748839 localtime=Oct 11 12:07:19
GOOD ---- verify.servo_ssh timestamp=1507748844 localtime=Oct 11 12:07:24
GOOD ---- verify.update timestamp=1507748849 localtime=Oct 11 12:07:29
GOOD ---- verify.brd_config timestamp=1507748850 localtime=Oct 11 12:07:30
GOOD ---- verify.ser_config timestamp=1507748851 localtime=Oct 11 12:07:31
GOOD ---- verify.job timestamp=1507748852 localtime=Oct 11 12:07:32
GOOD ---- verify.servod timestamp=1507748910 localtime=Oct 11 12:08:30
GOOD ---- verify.pwr_button timestamp=1507748911 localtime=Oct 11 12:08:31
GOOD ---- verify.lid_open timestamp=1507748913 localtime=Oct 11 12:08:33
GOOD ---- verify.PASS timestamp=1507748913 localtime=Oct 11 12:08:33
START provision_AutoUpdate provision_AutoUpdate timestamp=1507748913 localtime=Oct 11 12:08:33
FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1507749061 localtime=Oct 11 12:11:01 Unhandled AutoservSSHTimeout: ('ssh timed out', * Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_IZb0rXssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos4-row5-rack13-host5 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack:
:get_chromeos_release_milestone|_get_lsb_release_content|run] ->
ssh_run(cat \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
Exit status: 255
Duration: 63.3627679348
stderr:
ssh: connect to host chromeos4-row5-rack13-host5 port 22: Connection timed out)
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
return func(*args, **dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
postprocess_profiled_run, args, dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
self.run_once(*args, **dargs)
File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 121, in run_once
with_cheets=with_cheets)
File "/usr/local/autotest/server/afe_utils.py", line 124, in machine_install_and_update_labels
*args, **dargs)
File "/usr/local/autotest/server/hosts/cros_host.py", line 804, in machine_install_by_devserver
force_original = self.get_chromeos_release_milestone() is None
File "/usr/local/autotest/server/hosts/cros_host.py", line 1397, in get_chromeos_release_milestone
lsb_release_content=self._get_lsb_release_content())
File "/usr/local/autotest/server/hosts/cros_host.py", line 1376, in _get_lsb_release_content
'cat "%s"' % client_constants.LSB_RELEASE).stdout.strip()
File "/usr/local/autotest/server/hosts/ssh_host.py", line 318, in run
return self.run_very_slowly(*args, **kwargs)
File "/usr/local/autotest/server/hosts/ssh_host.py", line 307, in run_very_slowly
ssh_failure_retry_ok)
File "/usr/local/autotest/server/hosts/ssh_host.py", line 249, in _run
raise error.AutoservSSHTimeout("ssh timed out", result)
AutoservSSHTimeout: ('ssh timed out', * Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_IZb0rXssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos4-row5-rack13-host5 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack:
:get_chromeos_release_milestone|_get_lsb_release_content|run] ->
ssh_run(cat \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
Exit status: 255
Duration: 63.3627679348
stderr:
ssh: connect to host chromeos4-row5-rack13-host5 port 22: Connection timed out)
END FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1507749061 localtime=Oct 11 12:11:01
END FAIL ---- provision timestamp=1507749061 localtime=Oct 11 12:11:01
INFO ---- ---- timestamp=1507749061 job_abort_reason= localtime=Oct 11 12:11:01
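The ssh options in the failing command above bound how long the harness will wait before raising AutoservSSHTimeout: with ConnectTimeout=30 and ConnectionAttempts=4, ssh can spend up to 120 seconds just trying to connect, so the observed ~63 s duration is consistent with roughly two timed-out attempts. A minimal sketch of that arithmetic (parse_ssh_options is a hypothetical helper, not part of autotest):

```python
import shlex

# The -o options from the failing ssh invocation, abbreviated.
SSH_CMD = (
    '/usr/bin/ssh -a -x -o StrictHostKeyChecking=no '
    '-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 '
    '-o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 '
    '-l root -p 22 chromeos4-row5-rack13-host5'
)

def parse_ssh_options(cmd):
    """Collect every '-o Key=Value' pair from an ssh command line."""
    tokens = shlex.split(cmd)
    opts = {}
    for i, tok in enumerate(tokens):
        if tok == '-o' and i + 1 < len(tokens) and '=' in tokens[i + 1]:
            key, value = tokens[i + 1].split('=', 1)
            opts[key] = value
    return opts

opts = parse_ssh_options(SSH_CMD)
# Worst case before ssh gives up connecting entirely:
budget = int(opts['ConnectTimeout']) * int(opts['ConnectionAttempts'])
print(budget)  # 120 seconds
```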

Oct 11 2017
Looking at the DUT's history, it looks like the DUT thinks it is unplugged, goes offline, fails provision, fails verify, and is then repaired successfully. I've locked the DUT until this is understood.

Oct 11 2017
Software bugs can cause a DUT to believe it has no power. One of our repair rules specifically accommodates that by triggering an update to the stable image if the DUT fails the power check. Is this the only DUT showing the failure? How much faith do we have that this isn't software?

Oct 11 2017
> Software bugs can cause a DUT to believe it has no power.
By this I mean specifically "software bugs in the product", such as kernel driver bugs.

Oct 11 2017
I only noticed this pattern on that DUT, and it was the only DUT to fail to provision for 2 CQ runs in a row.

Oct 11 2017
Looking at the history, the complaint about power only happened once. It's quite odd, since toggling power with the RPM did seem to fix the issue, but it's unlikely that the DUT actually had no power. Looking at the history further back, the DUT has an unfortunate habit of going offline during provisioning tasks. Eventually, it gets back online, typically with some sort of reset from servo...

Oct 11 2017
> I've locked the DUT until this is understood.
It's a good idea to balance pools after locking a DUT, to get the pool back to full strength.

Oct 12 2017
Thanks for the reminder; I usually do, but had forgotten. Could this DUT be failing in a sporadic way? What do I do with it now?

Oct 12 2017
> Could this DUT be failing in a sporadic way? What do I do with it now?
Maybe log in, check for crashes, and poke around in /var/log/messages? Given the failures, it seems unlikely it'll be usable for any testing, but it would be helpful to find solid evidence of a hardware failure.
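The triage suggested above can be sketched as a small script. This is a hypothetical helper (names and crash patterns are placeholders, not an autotest API): it scans a copy of the DUT's /var/log/messages, fetched off the DUT e.g. with scp, for common crash indicators before writing the DUT off as a hardware failure.

```python
import re

# Rough crash indicators to look for in a kernel/system log (assumption:
# this list is illustrative, not exhaustive).
CRASH_PATTERNS = re.compile(
    r'kernel panic|oops|watchdog|BUG:|out of memory', re.IGNORECASE)

def scan_for_crashes(log_text):
    """Return the lines of a messages log that look like crash evidence."""
    return [line for line in log_text.splitlines()
            if CRASH_PATTERNS.search(line)]

# Demo on a tiny synthetic log standing in for the real file:
sample = (
    'Oct 11 12:06:58 localhost kernel: usb 1-2: device descriptor read error\n'
    'Oct 11 12:07:00 localhost kernel: BUG: unable to handle page fault\n'
)
for line in scan_for_crashes(sample):
    print(line)
```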

Oct 12 2017
Back to the deputy.

Oct 14 2017
Transferring outstanding deputy bugs.

Nov 3 2017
Filed a ticket at b/68863158. The deputy can just file the DUT repair ticket and close the bug.
Comment 1 by dgarr...@chromium.org, Oct 11 2017