
Issue 773877


Issue metadata

Status: Fixed
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Repeated chromeos4-row5-rack13-host5 failures

Project Member Reported by dgarr...@chromium.org, Oct 11 2017

Issue description

Not investigated in detail, but a flake on this DUT has failed 2 CQ runs in a row with the same error.

START	----	provision	timestamp=1507748839	localtime=Oct 11 12:07:19	
	GOOD	----	verify.servo_ssh	timestamp=1507748844	localtime=Oct 11 12:07:24	
	GOOD	----	verify.update	timestamp=1507748849	localtime=Oct 11 12:07:29	
	GOOD	----	verify.brd_config	timestamp=1507748850	localtime=Oct 11 12:07:30	
	GOOD	----	verify.ser_config	timestamp=1507748851	localtime=Oct 11 12:07:31	
	GOOD	----	verify.job	timestamp=1507748852	localtime=Oct 11 12:07:32	
	GOOD	----	verify.servod	timestamp=1507748910	localtime=Oct 11 12:08:30	
	GOOD	----	verify.pwr_button	timestamp=1507748911	localtime=Oct 11 12:08:31	
	GOOD	----	verify.lid_open	timestamp=1507748913	localtime=Oct 11 12:08:33	
	GOOD	----	verify.PASS	timestamp=1507748913	localtime=Oct 11 12:08:33	
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1507748913	localtime=Oct 11 12:08:33	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1507749061	localtime=Oct 11 12:11:01	Unhandled AutoservSSHTimeout: ('ssh timed out', * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_IZb0rXssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos4-row5-rack13-host5 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack:
      :get_chromeos_release_milestone|_get_lsb_release_content|run] ->
      ssh_run(cat \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
  Exit status: 255
  Duration: 63.3627679348
  
  stderr:
  ssh: connect to host chromeos4-row5-rack13-host5 port 22: Connection timed out)
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 121, in run_once
      with_cheets=with_cheets)
    File "/usr/local/autotest/server/afe_utils.py", line 124, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 804, in machine_install_by_devserver
      force_original = self.get_chromeos_release_milestone() is None
    File "/usr/local/autotest/server/hosts/cros_host.py", line 1397, in get_chromeos_release_milestone
      lsb_release_content=self._get_lsb_release_content())
    File "/usr/local/autotest/server/hosts/cros_host.py", line 1376, in _get_lsb_release_content
      'cat "%s"' % client_constants.LSB_RELEASE).stdout.strip()
    File "/usr/local/autotest/server/hosts/ssh_host.py", line 318, in run
      return self.run_very_slowly(*args, **kwargs)
    File "/usr/local/autotest/server/hosts/ssh_host.py", line 307, in run_very_slowly
      ssh_failure_retry_ok)
    File "/usr/local/autotest/server/hosts/ssh_host.py", line 249, in _run
      raise error.AutoservSSHTimeout("ssh timed out", result)
  AutoservSSHTimeout: ('ssh timed out', * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_IZb0rXssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos4-row5-rack13-host5 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack:
      :get_chromeos_release_milestone|_get_lsb_release_content|run] ->
      ssh_run(cat \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
  Exit status: 255
  Duration: 63.3627679348
  
  stderr:
  ssh: connect to host chromeos4-row5-rack13-host5 port 22: Connection timed out)
	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1507749061	localtime=Oct 11 12:11:01	
END FAIL	----	provision	timestamp=1507749061	localtime=Oct 11 12:11:01	
INFO	----	----	timestamp=1507749061	job_abort_reason=	localtime=Oct 11 12:11:01	
 
Looking at the DUT's history, it looks like the DUT thinks it is unplugged, goes offline, fails provision, fails verify, and is then repaired successfully.

I've locked the DUT until this is understood.
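
For anyone triaging a similar log: the status lines above are tab-separated, with `timestamp=` and `localtime=` fields after the status, subdir, and test name. Below is a minimal Python sketch that reads such lines and reports how long a test ran before its FAIL entry. The function names `parse_status_line` and `seconds_until_failure` are made up for this illustration and are not part of autotest; the field layout is inferred only from the log pasted in this bug.

```python
import re

# Matches fields such as "timestamp=1507748913" or "job_abort_reason=".
_KEY_VALUE = re.compile(r"^(\w+)=(.*)$")


def parse_status_line(line):
    """Parse one tab-separated status line into a dict.

    Returns None for lines that are not status entries, such as the
    traceback and ssh command lines embedded in the log.
    """
    fields = line.strip().split("\t")
    if len(fields) < 4:
        return None
    entry = {"status": fields[0], "subdir": fields[1], "test": fields[2]}
    for field in fields[3:]:
        match = _KEY_VALUE.match(field)
        if match:
            entry[match.group(1)] = match.group(2)
        else:
            # Anything that is not key=value is treated as the failure reason.
            entry["reason"] = field
    if "timestamp" not in entry:
        return None
    entry["timestamp"] = int(entry["timestamp"])
    return entry


def seconds_until_failure(log_text, test_name):
    """Return seconds between a test's START and its first FAIL, or None."""
    start = None
    for line in log_text.splitlines():
        entry = parse_status_line(line)
        if entry is None or entry["test"] != test_name:
            continue
        if entry["status"] == "START":
            start = entry["timestamp"]
        elif entry["status"] == "FAIL" and start is not None:
            return entry["timestamp"] - start
    return None


# Three lines taken from the log in this bug (tabs written as \t).
SAMPLE = "\n".join([
    "\tSTART\tprovision_AutoUpdate\tprovision_AutoUpdate"
    "\ttimestamp=1507748913\tlocaltime=Oct 11 12:08:33\t",
    "\t\tFAIL\tprovision_AutoUpdate\tprovision_AutoUpdate"
    "\ttimestamp=1507749061\tlocaltime=Oct 11 12:11:01"
    "\tUnhandled AutoservSSHTimeout: ('ssh timed out', * Command: ",
    "  Exit status: 255",
])

print(seconds_until_failure(SAMPLE, "provision_AutoUpdate"))  # 148
```

The 148 seconds agrees with the localtime values in the log (12:08:33 to 12:11:01).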
Cc: jrbarnette@chromium.org
Software bugs can cause a DUT to believe it has no power.
One of our repair rules specifically accommodates that by
triggering update to the stable image if the DUT fails
the power check.

Is this the only DUT showing the failure?  How much faith
do we have that this isn't software?

> Software bugs can cause a DUT to believe it has no power.

By this I mean specifically "software bugs in the product",
such as kernel driver bugs.

I only noticed this pattern on that DUT, and it was the only DUT to fail to provision for 2 CQ runs in a row.
Looking at the history, the complaint about power only happened once.
It's quite odd, since toggling power with the RPM did seem to fix the
issue, but it's unlikely that the DUT actually had no power.

Looking at the history further back, the DUT has an unfortunate habit
of going offline during provisioning tasks.  Eventually, it gets back
online, typically with some sort of reset from servo...

> I've locked the DUT until this is understood.

It's a good idea to balance pools after locking a DUT, to get the
pool back to full strength.

Owner: jrbarnette@chromium.org
Thanks for the reminder. I usually do, but had forgotten.

Could this DUT be failing in a sporadic way? What do I do with it now?
> Could this DUT be failing in a sporadic way? What do I do with it now?

Maybe log in, check for crashes, and poke around in /var/log/messages?

Given the failures, it seems unlikely it'll be usable for any testing,
but it would be helpful to find solid evidence of a hardware failure.
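
As a rough aid for the "poke around in /var/log/messages" step, here is a small Python sketch that lists log lines matching a few crash-related keywords. The keyword list is an assumption chosen for illustration ("Kernel panic" and "Oops" are standard kernel message prefixes; the rest are guesses at what might be interesting), and `suspicious_lines` is a hypothetical helper, not an existing autotest tool. The sample lines are invented placeholders, not output from this DUT.

```python
import re

# Assumed keywords; adjust for the failure being chased.
SUSPICIOUS = re.compile(
    r"kernel panic|\boops\b|watchdog|i/o error|segfault", re.IGNORECASE
)


def suspicious_lines(log_lines, pattern=SUSPICIOUS):
    """Return (line_number, text) pairs for lines matching the pattern."""
    hits = []
    for number, line in enumerate(log_lines, 1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Invented placeholder lines, only to show the shape of the result.
example = [
    "placeholder: normal boot message\n",
    "placeholder: Kernel panic - not syncing\n",
    "placeholder: disk I/O error\n",
]
for number, text in suspicious_lines(example):
    print(number, text)
```

On a real DUT the input would be the opened log file, for example `suspicious_lines(open("/var/log/messages"))`, since a file object iterates line by line.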

Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
Back to the deputy.

Owner: nxia@chromium.org
Transferring outstanding deputy bugs.

Comment 13 by nxia@chromium.org, Nov 3 2017

Cc: dgarr...@chromium.org
Filed a ticket at b/68863158.

The deputy can just file the DUT repair ticket and close the bug.

Comment 14 by nxia@chromium.org, Nov 3 2017

Status: Fixed (was: Assigned)
