New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 619989 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

provision: FAIL: Timeout encountered (during verify.good_au)

Project Member Reported by akes...@chromium.org, Jun 14 2016

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/2288
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/66711811-chromeos-test/chromeos4-row4-rack11-host20/status.log


START	----	provision	timestamp=1465882757	localtime=Jun 13 22:39:17	
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1465882757	localtime=Jun 13 22:39:17	
		START	----	----	timestamp=1465882790	localtime=Jun 13 22:39:50	
			GOOD	----	sysinfo.before	timestamp=1465882790	localtime=Jun 13 22:39:50	
		END GOOD	----	----	timestamp=1465882790	localtime=Jun 13 22:39:50	
		START	----	reboot	timestamp=1465883582	localtime=Jun 13 22:53:02	
			GOOD	----	reboot.start	timestamp=1465883582	localtime=Jun 13 22:53:02	
			GOOD	----	reboot.verify	timestamp=1465883615	localtime=Jun 13 22:53:35	
		END GOOD	----	reboot	kernel=3.14.0	localtime=Jun 13 22:53:36	timestamp=1465883616	
		START	----	reboot	timestamp=1465883974	localtime=Jun 13 22:59:34	
			GOOD	----	reboot.start	timestamp=1465883974	localtime=Jun 13 22:59:34	
			GOOD	----	reboot.verify	timestamp=1465884036	localtime=Jun 13 23:00:36	
		END GOOD	----	reboot	kernel=3.14.0	localtime=Jun 13 23:00:36	timestamp=1465884036	
		GOOD	provision_AutoUpdate	provision_AutoUpdate	timestamp=1465884063	localtime=Jun 13 23:01:03	completed successfully
	END GOOD	provision_AutoUpdate	provision_AutoUpdate	timestamp=1465884063	localtime=Jun 13 23:01:03	
	GOOD	----	verify.ssh	timestamp=1465884063	localtime=Jun 13 23:01:03	
	GOOD	----	verify.power	timestamp=1465884064	localtime=Jun 13 23:01:04	
	GOOD	----	verify.writable	timestamp=1465884064	localtime=Jun 13 23:01:04	
	GOOD	----	verify.tpm	timestamp=1465884065	localtime=Jun 13 23:01:05	
	FAIL	----	verify.good_au	timestamp=1465884071	localtime=Jun 13 23:01:11	Timeout encountered: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_KBBeWDssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row4-rack11-host20 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_verify_list|_verify_host|verify] -> ssh_run(test -f /var/tmp/provision_failed)\";fi; test -f /var/tmp/provision_failed"
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_KBBeWDssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos4-row4-rack11-host20 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::_verify_list|_verify_host|verify] -> ssh_run(test -f
      /var/tmp/provision_failed)\";fi; test -f /var/tmp/provision_failed"
  Exit status: -15
  Duration: 6.02845191956
  
	GOOD	----	verify.python	timestamp=1465884072	localtime=Jun 13 23:01:12	
	GOOD	----	verify.cros	timestamp=1465884077	localtime=Jun 13 23:01:17	
	GOOD	----	verify.label	timestamp=1465884087	localtime=Jun 13 23:01:27	
END FAIL	----	provision	timestamp=1465884087	localtime=Jun 13 23:01:27	
INFO	----	----	timestamp=1465884087	job_abort_reason=	localtime=Jun 13 23:01:27	
 
Cc: jrbarnette@chromium.org
Not clear to me what timed out, seems like that verifier took only a few seconds to complete.
What timed out was this command to the DUT:
    test -f /var/tmp/provision_failed

I checked the source code, and here's the call:
        result = host.run('test -f %s' % host.PROVISION_FAILED,
                          ignore_status=True, timeout=5)

I've no idea why the timeout is set so low.  Although arguably
not actually unreasonable, it is apparently too short.

I expect that the proper fix is to remove the timeout.  The calling
verifier can and _should_ assume that ssh to the DUT works properly.
However, "properly" in this context doesn't include specific timing
guarantees, and having the verifier impose its own timing constraint
is, at best, unhelpful.

Yeah, sounds like we should certainly extend this timeout. And/or wrap it in a retry (do we have retries on such operations, or the concept of a verifier-wide retry?)
Retry in this instance isn't appropriate.  Really, a call-specific
timeout is also not appropriate.  The code should just do this:
        result = host.run('test -f %s' % host.PROVISION_FAILED,
                          ignore_status=True)

Why is retry not appropriate? Don't we want to protect against, say, flaky ssh dropout? (that applies in general to every ssh call we might make)

In #5, what timeout will this result in?
> Don't we want to protect against, say, flaky ssh dropout?

Mostly I think the answer is "no, that's too expensive".  If
there really needs to be a different answer, we should fix it
in 'Host.run()' for all hosts.  Nothing in the `Verifier`
subclasses should use anything but default arguments.  The
point of the verifiers is that they demonstrate the DUT is
working for all ordinary purposes, so applying options that
make failure less (or more) likely defeats the purpose.

> In #5, what timeout will this result in?

The default ssh connection timeout is about 2 minutes.  If
connection succeeds, there's a separate timeout for the total
command runtime; I think that that's 1 hour.

Cc: akes...@chromium.org dshi@chromium.org cywang@chromium.org bccheng@chromium.org
 Issue 620257  has been merged into this issue.
Labels: -Pri-2 Pri-1
Owner: akes...@chromium.org
Status: Started (was: Untriaged)
https://chromium-review.googlesource.com/#/c/352763/
RE #9 this is actually happening a lot lately. See also https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/4961

and also on peppy https://uberchromegw.corp.google.com/i/chromeos/builders/peppy-paladin/builds/12167

Might indicate a product issue as well that is slowing down something in this verifier (sshd being slow to connect?). Still, having this arbitrary 5 second timeout seems like the wrong way to catch this.
This could be caused by slowness in the product, but most
likely that would be "slowness in sshd", which isn't a
problem that is appropriate for a lot of attention.

This could also be slowness on the lab network.  That
_could_ be an important problem, but as noted, the selected
timeout isn't the way to find such a problem.
Project Member

Comment 15 by bugdroid1@chromium.org, Jun 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d3cbc876631e2c772fcc821d2c7e15b7c3e1c664

commit d3cbc876631e2c772fcc821d2c7e15b7c3e1c664
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Jun 15 18:03:49 2016

autotest: remove too-short timeout on verify.good_au

The default host.run timeout should be correct. The arbitrary and short
existing timeout is causing flake.

BUG= chromium:619989 
TEST=None

Change-Id: I023433c0ba7af8da87c7b5b5beefdfe438013012
Reviewed-on: https://chromium-review.googlesource.com/352763
Commit-Queue: Aviv Keshet <akeshet@chromium.org>
Trybot-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/d3cbc876631e2c772fcc821d2c7e15b7c3e1c664/server/hosts/cros_repair.py

Above fix is rolling out to prod now...
Status: Fixed (was: Started)
I haven't seen any occurrences of this since that rollout, so I'm cautiously optimistic that it is fixed...
Closing... please feel free to reopen if its not fixed.
Status: Verified (was: Fixed)

Sign in to add a comment