provision: FAIL: Timeout encountered (during verify.good_au) |
||||||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/2288 https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/66711811-chromeos-test/chromeos4-row4-rack11-host20/status.log START ---- provision timestamp=1465882757 localtime=Jun 13 22:39:17 START provision_AutoUpdate provision_AutoUpdate timestamp=1465882757 localtime=Jun 13 22:39:17 START ---- ---- timestamp=1465882790 localtime=Jun 13 22:39:50 GOOD ---- sysinfo.before timestamp=1465882790 localtime=Jun 13 22:39:50 END GOOD ---- ---- timestamp=1465882790 localtime=Jun 13 22:39:50 START ---- reboot timestamp=1465883582 localtime=Jun 13 22:53:02 GOOD ---- reboot.start timestamp=1465883582 localtime=Jun 13 22:53:02 GOOD ---- reboot.verify timestamp=1465883615 localtime=Jun 13 22:53:35 END GOOD ---- reboot kernel=3.14.0 localtime=Jun 13 22:53:36 timestamp=1465883616 START ---- reboot timestamp=1465883974 localtime=Jun 13 22:59:34 GOOD ---- reboot.start timestamp=1465883974 localtime=Jun 13 22:59:34 GOOD ---- reboot.verify timestamp=1465884036 localtime=Jun 13 23:00:36 END GOOD ---- reboot kernel=3.14.0 localtime=Jun 13 23:00:36 timestamp=1465884036 GOOD provision_AutoUpdate provision_AutoUpdate timestamp=1465884063 localtime=Jun 13 23:01:03 completed successfully END GOOD provision_AutoUpdate provision_AutoUpdate timestamp=1465884063 localtime=Jun 13 23:01:03 GOOD ---- verify.ssh timestamp=1465884063 localtime=Jun 13 23:01:03 GOOD ---- verify.power timestamp=1465884064 localtime=Jun 13 23:01:04 GOOD ---- verify.writable timestamp=1465884064 localtime=Jun 13 23:01:04 GOOD ---- verify.tpm timestamp=1465884065 localtime=Jun 13 23:01:05 FAIL ---- verify.good_au timestamp=1465884071 localtime=Jun 13 23:01:11 Timeout encountered: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_KBBeWDssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row4-rack11-host20 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_verify_list|_verify_host|verify] -> ssh_run(test -f /var/tmp/provision_failed)\";fi; test -f /var/tmp/provision_failed" * Command: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_KBBeWDssh- master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row4-rack11-host20 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_verify_list|_verify_host|verify] -> ssh_run(test -f /var/tmp/provision_failed)\";fi; test -f /var/tmp/provision_failed" Exit status: -15 Duration: 6.02845191956 GOOD ---- verify.python timestamp=1465884072 localtime=Jun 13 23:01:12 GOOD ---- verify.cros timestamp=1465884077 localtime=Jun 13 23:01:17 GOOD ---- verify.label timestamp=1465884087 localtime=Jun 13 23:01:27 END FAIL ---- provision timestamp=1465884087 localtime=Jun 13 23:01:27 INFO ---- ---- timestamp=1465884087 job_abort_reason= localtime=Jun 13 23:01:27
,
Jun 14 2016
What timed out was this command to the DUT:
test -f /var/tmp/provision_failed
I checked the source code, and here's the call:
result = host.run('test -f %s' % host.PROVISION_FAILED,
ignore_status=True, timeout=5)
I've no idea why the timeout is set so low. Although arguably
not actually unreasonable, it is apparently too short.
,
Jun 14 2016
I expect that the proper fix is to remove the timeout. The calling verifier can and _should_ assume that ssh to the DUT works properly. However, "properly" in this context doesn't include specific timing guarantees, and having the verifier impose its own timing constraint is, at best, unhelpful.
,
Jun 14 2016
Yeah, sounds like we should certainly extend this timeout. And/or wrap it in a retry (do we have retries on such operations, or the concept of a verifier-wide retry?)
,
Jun 14 2016
Retry in this instance isn't appropriate. Really, a call-specific
timeout is also not appropriate. The code should just do this:
result = host.run('test -f %s' % host.PROVISION_FAILED,
ignore_status=True)
,
Jun 14 2016
Why is retry not appropriate? Don't we want to protect against, say, flaky ssh dropout? (that applies in general to every ssh call we might make) In #5, what timeout will this result in?
,
Jun 14 2016
> Don't we want to protect against, say, flaky ssh dropout? Mostly I think the answer is "no, that's too expensive". If there really needs to be a different answer, we should fix it in 'Host.run()' for all hosts. Nothing in the `Verifier` subclasses should use anything but default arguments. The point of the verifiers is that they demonstrate the DUT is working for all ordinary purposes, so applying options that make failure less (or more) likely defeats the purpose. > In #5, what timeout will this result in? The default ssh connection timeout is about 2 minutes. If connection succeeds, there's a separate timeout for the total command runtime; I think that that's 1 hour.
,
Jun 15 2016
Issue 620257 has been merged into this issue.
,
Jun 15 2016
,
Jun 15 2016
,
Jun 15 2016
https://chromium-review.googlesource.com/#/c/352763/
,
Jun 15 2016
RE #9 this is actually happening a lot lately. See also https://uberchromegw.corp.google.com/i/chromeos/builders/whirlwind-paladin/builds/4961 and also on peppy https://uberchromegw.corp.google.com/i/chromeos/builders/peppy-paladin/builds/12167 Might indicate a product issue as well that is slowing down something in this verifier (sshd being slow to connect?). Still, having this arbitrary 5 second timeout seems like the wrong way to catch this.
,
Jun 15 2016
,
Jun 15 2016
This could be caused by slowness in the product, but most likely that would be "slowness in sshd", which isn't a problem that is appropriate for a lot of attention. This could also be slowness on the lab network. That _could_ be an important problem, but as noted, the selected timeout isn't the way to find such a problem.
,
Jun 16 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d3cbc876631e2c772fcc821d2c7e15b7c3e1c664 commit d3cbc876631e2c772fcc821d2c7e15b7c3e1c664 Author: Aviv Keshet <akeshet@chromium.org> Date: Wed Jun 15 18:03:49 2016 autotest: remove too-short timeout on verify.good_au The default host.run timeout should be correct. The arbitrary and short existing timeout is causing flake. BUG= chromium:619989 TEST=None Change-Id: I023433c0ba7af8da87c7b5b5beefdfe438013012 Reviewed-on: https://chromium-review.googlesource.com/352763 Commit-Queue: Aviv Keshet <akeshet@chromium.org> Trybot-Ready: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/d3cbc876631e2c772fcc821d2c7e15b7c3e1c664/server/hosts/cros_repair.py
,
Jun 16 2016
Above fix is rolling out to prod now...
,
Jun 16 2016
I haven't seen any occurrences of this since that rollout, so I'm cautiously optimistic that it is fixed...
,
Jun 27 2016
Closing... please feel free to reopen if its not fixed.
,
Jun 27 2016
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by akes...@chromium.org
, Jun 14 2016