daisy_skate-chrome-pfq provision failing repeatedly
Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-chrome-pfq

provision FAIL: Unhandled AutoservSSHTimeout: ('ssh timed out', * Command:

Output below this line is for buildbot consumption:
@@@STEP_LINK@provision: 17 reports, FAIL: Unhandled AutoservSSHTimeout: ('ssh timed out', * Command:@https://code.google.com/p/chromium/issues/detail?id=589367@@@
@@@STEP_LINK@Flaky test dashboard view for test provision@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=provision@@@

Will return from run_suite with status: INFRA_FAILURE

Is this a case of a recent update poisoning daisy_skate devices, or some kind of infra problem?
Comment 1 by sha...@chromium.org, Mar 11 2016
logs:
03/10 19:42:53.427 DEBUG| ssh_host:0153| Running (ssh) '/usr/bin/update_engine_client -status 2>&1 | grep CURRENT_OP'
03/10 19:42:53.696 DEBUG| base_utils:0268| [stdout] CURRENT_OP=UPDATE_STATUS_IDLE
03/10 19:42:53.705 INFO | servo_host:0510| servo host chromeos4-row9-rack7-host11-servo does not require an update.
03/10 19:42:53.707 DEBUG| ssh_host:0153| Running (ssh) 'test -f /var/lib/servod/config'
03/10 19:42:53.885 DEBUG| ssh_host:0153| Running (ssh) 'pgrep servod'
03/10 19:42:54.101 DEBUG| base_utils:0268| [stdout] 476
03/10 19:42:54.102 DEBUG| base_utils:0268| [stdout] 547
03/10 19:42:54.102 DEBUG| base_utils:0268| [stdout] 548
03/10 19:42:54.106 INFO | servo_host:0381| servod is running, PID=476,547,548
03/10 19:42:55.310 INFO | servo:0496| Setting usb_mux_oe1 to on
03/10 19:42:55.653 INFO | servo:0496| Setting prtctl4_pwren to off
03/10 19:42:57.915 DEBUG| servo:0225| Servo initialized, version is servo_v3
03/10 19:42:57.915 INFO | servo_host:0540| Sanity checks pass on servo host chromeos4-row9-rack7-host11-servo
03/10 19:42:57.982 DEBUG| ssh_host:0153| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
03/10 19:42:57.983 INFO | abstract_ssh:0749| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_1XvZtNssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack7-host11'
03/10 19:42:57.984 DEBUG| base_utils:0177| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_1XvZtNssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row9-rack7-host11'
03/10 19:43:03.052 INFO | abstract_ssh:0764| Timed out waiting for master-ssh connection to be established.
03/10 19:45:06.505 ERROR| base_utils:0268| [stderr] ssh: connect to host chromeos4-row9-rack7-host11 port 22: Connection timed out
03/10 19:45:06.506 INFO | remote:0074| Failed to copy /var/log/messages at startup: ('ssh timed out', * Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_1XvZtNssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos4-row9-rack7-host11 "export LIBC_FATAL_STDERR_=1; test ! -e
/var/log/messages || cp -f /var/log/messages
/var/tmp/messages.autotest_start"
Exit status: 255
Duration: 123.423555851
+@jrbarnette, could the ssh timeout also be caused by the servo update?
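For illustration only (this is not the actual autotest code), here is a minimal Python sketch of the kind of reachability probe that fails above: it runs ssh against the DUT with roughly the same options shown in the log and treats a timeout or a non-zero exit as the DUT being unreachable, which is the condition that surfaces as AutoservSSHTimeout. The command invocation and helper names are assumptions for this sketch.

import subprocess

# Hostname copied from the log; everything else here is illustrative.
DUT = "chromeos4-row9-rack7-host11"

# Roughly the same ssh options the autotest log shows above.
SSH_CMD = [
    "/usr/bin/ssh", "-a", "-x",
    "-o", "BatchMode=yes",
    "-o", "ConnectTimeout=30",
    "-o", "StrictHostKeyChecking=no",
    "-o", "UserKnownHostsFile=/dev/null",
    "-l", "root", "-p", "22",
    DUT, "true",
]

def dut_reachable(timeout_secs=60):
    """Return True if the DUT answers a trivial ssh command."""
    try:
        result = subprocess.run(SSH_CMD, capture_output=True,
                                timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        return False
    # ssh exits 255 when the connection itself fails, e.g.
    # "connect to host ... port 22: Connection timed out".
    return result.returncode == 0

print("DUT reachable:", dut_reachable())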
Mar 11 2016
Here's the error that triggered the failure:
03/10 19:45:06.505 ERROR| base_utils:0268| [stderr] ssh: connect to host chromeos4-row9-rack7-host11 port 22: Connection timed out
That's a problem with the DUT, not the servo.
I think (but I haven't checked in depth) that most servo failures will be ignored in this context.
Mar 11 2016
Adding more information, I think many of the failures in question happen because the DUT is offline when provision starts. I don't know why the DUT is offline; that needs investigation.
Mar 11 2016
For the specific failure on chromeos4-row9-rack7-host11, I checked the history: The DUT was offline at the start of provisioning. The provisioning failure triggered repair, and the servo repaired the DUT by power cycling it. The DUT is in service now, and seems to be running tests successfully.
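For reference, here is a rough sketch (under stated assumptions, not the actual repair code) of what that servo repair step amounts to: power cycling the DUT through the servo host with the dut-control CLI. The servod port, the ssh-to-servo-host setup, and the cold_reset control name are assumptions and can differ by board and servo type.

import subprocess
import time

# Servo host taken from the log above; the servod port is an assumed default.
SERVO_HOST = "chromeos4-row9-rack7-host11-servo"
SERVOD_PORT = "9999"

def dut_control(*args):
    """Run dut-control on the servo host over ssh (assumed setup)."""
    cmd = ["/usr/bin/ssh", "-o", "BatchMode=yes", "-l", "root",
           SERVO_HOST, "dut-control", "--port", SERVOD_PORT] + list(args)
    subprocess.run(cmd, check=True)

def power_cycle_dut():
    # Assert and release cold reset to hard power cycle the DUT.
    # "cold_reset" is an assumption; the exact control can vary by board.
    dut_control("cold_reset:on")
    time.sleep(1)
    dut_control("cold_reset:off")

power_cycle_dut()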
Mar 14 2016
@richard - is there any more work needed on this? Should we use this as the FR for making provisioning smarter in the future?
Mar 14 2016
I've filed bug 594828 for improvements to Provision task failure diagnosis. I don't think there's anything else to be done for this particular failure.