
Issue metadata

Status: Duplicate
Merged: issue 692247
Closed: Mar 2017
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 688187


Falco-chrome-pfq failed because of Provision failure: Unhandled DevServerException

Project Member Reported by, Jan 31 2017

Issue description

These provision failures on Falco-chrome-pfq started a few days ago and have persisted since (the latest failed builder shows the same failure: 

Selected error message:
provision                                 [ FAILED ]
provision                                   FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out
00:44:12: ERROR: ** HWTest did not complete due to infrastructure issues (code 3) **

00:44:12: ERROR: BaseException in _RunParallelStages <class 'chromite.lib.failures_lib.StepFailure'>: 
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/", line 625, in Run
    raise failures_lib.StepFailure()
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/cbuildbot/builders/", line 118, in _RunParallelStages
  File "/b/cbuild/internal_master/chromite/lib/", line 678, in RunParallelSteps
    return [queue.get_nowait() for queue in queues]
  File "/b/cbuild/internal_master/chromite/lib/", line 675, in RunParallelSteps
  File "/usr/lib/python2.7/", line 24, in __exit__
  File "/b/cbuild/internal_master/chromite/lib/", line 561, in ParallelTasks
    raise BackgroundFailure(exc_infos=errors)
BackgroundFailure: <class 'chromite.lib.failures_lib.StepFailure'>: 
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/", line 441, in _Run
    self._task(*self._task_args, **self._task_kwargs)
  File "/b/cbuild/internal_master/chromite/cbuildbot/stages/", line 625, in Run
    raise failures_lib.StepFailure()

00:44:12: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Select at 0x6f70a90; Select object>
00:44:17: INFO: cidb query succeeded after 1 retries
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x6f63410>
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x6f63550>
00:44:17: INFO: Running cidb query on pid 1338, repr(query) starts with <sqlalchemy.sql.expression.Select at 0x6f638d0; Select object>
01:02:38: INFO: Running cidb query on pid 31932, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x47ef5d0>
01:02:39: WARNING: HttpsMonitor.send received status 429: {
  "error": {
    "code": 429,
    "message": "Resource has been exhausted (e.g. check quota).",
    "status": "RESOURCE_EXHAUSTED"

Comment 1 by, Jan 31 2017

Components: Infra

Comment 3 by, Jan 31 2017

I think it could be a flaky network connection, a bad network card, or the DUT itself?

Comment 4 by, Jan 31 2017

+current infra trooper 

Comment 5 by, Jan 31 2017

Looking at the debug info of the autotest suite (, I found the following error message: 

01/31 12:24:42.250 ERROR|        dev_server:0488| Devserver check_health failed. Free disk space is low. Only 19GB is available.
01/31 12:24:44.557 ERROR|        base_utils:0280| [stderr] [0131/] Querying Update Engine status...
01/31 12:24:54.193 ERROR|        base_utils:0280| [stderr] 2017-01-31 12:24:53 URL: [11496355/11496355] -> "/tmp/sysinfo/autoserv-eDvJ5H/packages/client-autotest.tar.bz2" [1]
01/31 12:24:55.657 ERROR|        base_utils:0280| [stderr] 2017-01-31 12:24:55 URL: [34489/34489] -> "/tmp/sysinfo/autoserv-eDvJ5H/packages.checksum" [1]
01/31 12:24:56.818 ERROR|        base_utils:0280| [stderr] cat: /tmp/sysinfo/autoserv-eDvJ5H/.checksum: No such file or directory
01/31 12:58:50.854 ERROR|        base_utils:0280| [stderr] ssh: connect to host chromeos2-row3-rack3-host9 port 22: Connection timed out
01/31 12:58:50.857 ERROR|              test:0240| Error cleaning up the sysinfo autotest/host objects, ignoring it
Traceback (most recent call last):
  File "/usr/local/autotest/server/", line 233, in cleanup
  File "/usr/local/autotest/server/hosts/", line 1128, in close
    super(CrosHost, self).close()
  File "/usr/local/autotest/server/hosts/", line 732, in close
    super(AbstractSSHHost, self).close()
  File "/usr/local/autotest/server/hosts/", line 52, in close'rm -rf "%s"' % (utils.sh_escape(dir)))
  File "/usr/local/autotest/server/hosts/", line 283, in run
  File "/usr/local/autotest/server/hosts/", line 232, in _run
    raise error.AutoservSSHTimeout("ssh timed out", result)
AutoservSSHTimeout: ('ssh timed out', * Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_uOpgfpssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos2-row3-rack3-host9 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::close|close|close] -> ssh_run(rm -rf \\\"/tmp/sysinfo
    /autoserv-eDvJ5H\\\")\";fi; rm -rf \"/tmp/sysinfo/autoserv-eDvJ5H\""
Exit status: 255
Duration: 63.1774010658

This seems related to Issue 687237 (several devservers are failing health checks with low-disk messages). I'll wait to see whether the cleanup work in that issue also fixes this one.
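For context, the `check_health` failure quoted in the log above is a free-disk threshold check on the devserver. A minimal sketch of that kind of check (the threshold, function name, and message format here are assumptions, not the devserver's actual API):

```python
import shutil

# Hypothetical threshold mirroring the "Only 19GB is available" message in the
# log above; the real devserver's limit may differ.
MIN_FREE_GB = 20

def check_disk_health(path="/", min_free_gb=MIN_FREE_GB):
    """Return (ok, message) the way a check_health endpoint might."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        return False, "Free disk space is low. Only %dGB is available." % free_gb
    return True, "ok"
```

A devserver failing this check would be skipped for new provisions, which is why the cleanup work in Issue 687237 could plausibly affect this bug.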

Comment 6 by, Feb 1 2017

Several informational builds are starting to be affected by this failure as well...

Any thoughts?
ssh connection timeout might be ?
Components: -Infra
Chrome Infra does not handle issues related to ChromeOS.

Comment 10 by, Feb 1 2017

Components: Infra>Client>ChromeOS

Comment 11 by, Feb 1 2017

I noticed that we had a green cycle last night, so it seemed the cleanup work in Issue 687237 had fixed the problem. But this morning the same failure came back: 

 02-01-2017 [03:50:00] Output below this line is for buildbot consumption:
@@@STEP_LINK@[Auto-Bug]: provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out, 41 reports@
@@@STEP_LINK@[Test-Logs]: provision: FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host3: SSHConnectionError: ssh: connect to host chromeos2-row3-rack4-host3 port 22: Connection timed out@http://cautotest/tko/retrieve_logs.cgi?job=/results/99166108-chromeos-test/@@@
Will return from run_suite with status: INFRA_FAILURE

Comment 12 by, Feb 1 2017

+cc semenzato@  uekawa@

Comment 13 by, Feb 2 2017

This issue persists on Falco-chrome-pfq and has been blocking Chrome PFQ for 3 days. (Note: there was one green Falco-chrome-pfq run yesterday, then the same failure returned on the following runs. The green cycle might be due to the cleanup work in Issue 687237, or because the DHCP config failure in Issue 687402 was resolved.) The latest failed one is

akeshet@, could you take a look at this or find a better owner for this issue, please? It seems likely to be an infra error.
Looking at the past 200 runs, falco-chrome-pfq has been more red than green since around Jan 17.
For the last two runs, chromeos2-row3-rack4-host3 did not respond after auto-update. Whether that points to a faulty device or a faulty change I don't know, but it's hard to believe the problem is specific to falco.

Comment 16 by, Feb 2 2017

An update on this thread: we had another green cycle early this morning. However, looking at the history, it's hard to tell whether this issue has been fixed or the pass was just happenstance. I'll keep monitoring the new runs. 

Comment 17 by, Feb 2 2017

The new Falco-chrome-pfq run (see just failed again with the same SSHConnectionError timeout, along with a new HWTest video_ChromeHWDecodeUsed error (see Issue 688118). 
The last run also had an ssh timeout for chromeos2-row3-rack4-host3.
All of the recent failures were attempts to provision chromeos2-row3-rack4-host3.

We should file an issue for Infra to remove that from the pool.

Comment 20 by, Feb 3 2017

Re #19: Thanks for the observation! Filed Issue 688187 to track the removal of chromeos2-row3-rack4-host3. 

Comment 21 by, Feb 3 2017

Blockedon: 688187
OK, let's see if locking the host improves the situation.
Now it failed with a new and different host.

From the last run, chromeos2-row3-rack4-host1 failed, so it's probably not the hosts.

  provision                                 [ FAILED ]
  provision                                   FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos2-row3-rack4-host1: DeviceNotPingableError: Device chromeos2-row3-rack4-host1 is not pingable.

From the recent builds:
5280 was a provision failure to chromeos2-row3-rack4-host3
5281 was a provision failure to chromeos2-row3-rack4-host3
5282 was a slightly different provision failure to chromeos2-row3-rack3-host9
5283 was a provision failure to chromeos2-row3-rack4-host1
5284 succeeded

So I am pretty certain there was a problem with chromeos2-row3-rack4-host3.

It is less clear for 3-3-9 and 3-4-1, but looking further back in the history, those two hosts failed fairly often (though I'm not sure how many hosts are actually in the pool, so that may or may not be indicative of anything).

It would be helpful to be able to see the history for a particular DUT, but I am unaware of how to do so (there is an open YAQS question,

akeshet@ - any suggestions?

Comment 26 by, Feb 3 2017

Re #24: The DeviceNotPingableError is a different error; it's the same issue as in Issue 683304. It has happened intermittently for the past week, and semenzato@ has CLs to fix it. :)

Now the latest build passed ( Maybe locking chromeos2-row3-rack4-host3 in Issue 688187 did work, or maybe it's just happenstance. I'll keep monitoring it.
I suspect that chromeos2-row3-rack4-host3 was extremely flaky (or just plain bad), and that the others are only semi-flaky. We'll keep an eye on things and may need to investigate further next week.
$ dut-status -f chromeos2-row3-rack4-host3
    2017-02-02 17:39:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99410498-chromeos-test/
    2017-02-02 17:39:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512914-reset/
    2017-02-02 16:40:46  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99396164-chromeos-test/
    2017-02-02 16:40:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512705-reset/
    2017-02-02 16:39:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99395491-chromeos-test/
    2017-02-02 16:31:01  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512673-provision/
    2017-02-02 16:21:17  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512648-repair/
    2017-02-02 16:07:22  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512590-provision/
    2017-02-02 16:06:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512586-repair/
    2017-02-02 16:06:29  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512584-reset/
    2017-02-02 15:54:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99395660-chromeos-test/
    2017-02-02 15:54:27  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512541-reset/
    2017-02-02 15:53:39  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392328-chromeos-test/
    2017-02-02 15:53:15  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512537-reset/
    2017-02-02 15:52:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392322-chromeos-test/
    2017-02-02 15:52:10  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512531-reset/
    2017-02-02 15:51:07  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392301-chromeos-test/
    2017-02-02 15:50:47  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512525-reset/
    2017-02-02 15:49:45  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392291-chromeos-test/
    2017-02-02 15:49:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512516-reset/
    2017-02-02 15:48:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99392285-chromeos-test/
    2017-02-02 15:39:53  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512461-provision/
    2017-02-02 11:30:38  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/99380649-chromeos-test/
    2017-02-02 11:22:26  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512175-provision/
    2017-02-02 11:13:09  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512126-repair/
    2017-02-02 11:02:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512110-provision/

It does seem somewhat provision-failure-prone. You might get some insight examining those repair logs.
I was looking at some arbitrary /var/log/messages files. 



I was a little surprised that the timezones are mixed:
on one boot it is GMT, and on another it became PST.
For some reason the times don't look right.

2017-02-02T17:00:57.486227+00:00 ERR wpa_supplicant[641]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied
2017-02-02T09:01:03.714591-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpu

this is for

   2017-02-02 11:13:09  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1512126-repair/

What's this timestamp? Is this a run at around 16:00 PST?
I'm trying to wrap my head around which run is the failing run and which one is the repair.
The GMT/PST problem is known and probably not fixable.  Usually it's not too difficult to understand by context whether you need to subtract 8 hours from the syslog timestamps.

I've been looking at failures to reconnect after reboot.  See issue 677572, particularly comment 32.  In that failure, eth0 doesn't come up after a reboot.  check_ethernet.hook tries to bring it up, attempting several reboots in the process.  Finally autoserv does an EC reset (or power cycle), which, unlike the reboot, cuts power to the USB dongle.  That makes the dongle work again.

I wonder if this is the same situation.
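The escalation described above (soft recovery first, EC reset / power cycle as a last resort, since only the power cycle also resets the USB dongle) can be sketched as follows. All step names and the function itself are illustrative; this is not the actual check_ethernet.hook or autoserv code.

```python
def recover_network(steps):
    """Try recovery steps in order of increasing severity.

    steps: ordered list of (name, callable) pairs; each callable attempts one
    recovery action (e.g. bring up eth0, reboot, EC reset) and returns True
    once the DUT's network is reachable again.
    Returns the name of the first step that worked, or None if all failed.
    """
    for name, attempt in steps:
        if attempt():
            return name
    return None

# Hypothetical escalation order, mirroring the comment above:
# recover_network([("ifup", try_ifup),
#                  ("reboot", try_reboot),
#                  ("ec_reset", try_ec_reset)])
```

If the dongle failure mode here is the same as in issue 677572, only the final step would ever succeed, which would show up as repeated failed soft-recovery attempts in the logs.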

Owner: ----
chromeos2-row3-rack4-host3 was manually reflashed and seems to provision fine now. LMK if there is more to be done here.
issue 666781 shows that this happens to a few other boards, but falco is the most frequent by far. I suggest keeping this bug open until we can identify why the other falco DUTs are so flaky.
There was definitely a problem with chromeos2-row3-rack4-host3, fixed in issue 688187.

Starting with build #5284 (when chromeos2-row3-rack4-host3 was pulled), we have had 3 failures out of 17 runs, all due to failures on chromeos2-row3-rack4-host1. This isn't *super* high, but it's higher than we would like, so I would file a new issue to have chromeos2-row3-rack4-host1 reflashed to see if that helps reduce falco flakiness. My notes show that chromeos2-row3-rack4-host1 has been more flaky than average for a while, but that has largely been masked by the consistent failures on chromeos2-row3-rack4-host3.

issue 689794 is a similar problem for chromeos2-row3-rack9-host11, which runs samus-android-pfq, so it might suggest something is wrong with the whole chromeos2-row3.

Does anyone know who in the infra team we should route this to?
List of all similar issues: issue 689794, issue 666781, issue 687883, issue 687432

Looks like the problem actually happens on all boards and all DUTs.

Comment 38 by, Feb 9 2017


Comment 39 by, Feb 11 2017

Discussed with xixuan@

She has a retry around these SSHConnectionErrors in the pipeline that should hopefully resolve (or at least reduce) this soon.
When should we expect this to cycle green? The latest run is still failing.
After the CL is pushed in; ETA is the end of this week. We don't want the retry to add to the devserver's load and as a result break the CQ, so we will first add a metric to monitor such errors, then push CL 441389 in to see whether it makes any difference.
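The retry approach described above can be sketched roughly like this. The retry count, delay, and wrapper are assumptions for illustration, not the contents of CL 441389.

```python
import time

class SSHConnectionError(Exception):
    """Stand-in for the autotest SSHConnectionError seen in the logs above."""

def retry_ssh(func, attempts=3, delay=1.0):
    """Call func(), retrying on SSHConnectionError up to `attempts` times.

    Re-raises the last error if every attempt fails; sleeps `delay` seconds
    between attempts to avoid hammering an overloaded devserver.
    """
    for i in range(attempts):
        try:
            return func()
        except SSHConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

The concern voiced above is exactly the trade-off such a wrapper creates: retries mask transient timeouts but multiply load on an already-stressed devserver, hence the plan to land a monitoring metric first.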
xixuan@, is the CL already in? Is issue 692247 caused by the same root cause you are trying to fix by adding more retries?
No, I haven't submitted the CL yet.

I updated that bug (692247); it's not a network issue for that host.

Comment 45 by, Feb 15 2017

OK.  Going back to first principles, here's the history
around the original failure on chromeos2-row3-rack4-host3:

    2017-01-31 01:48:53  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503494-reset/
    2017-01-31 01:47:52  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98984451-chromeos-test/
    2017-01-31 01:36:51  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503470-provision/
    2017-01-31 00:19:15  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503299-repair/
    2017-01-31 00:03:00  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1503245-provision/
    2017-01-30 20:24:12  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502943-repair/
    2017-01-30 20:19:37  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502918-provision/
    2017-01-30 20:18:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502914-repair/
    2017-01-30 20:14:34  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502883-provision/
    2017-01-30 20:13:58  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502878-repair/
    2017-01-30 19:54:50  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502803-provision/
    2017-01-30 19:28:46  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98934521-chromeos-test/
    2017-01-30 19:28:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502743-reset/
    2017-01-30 19:27:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502736-cleanup/
    2017-01-30 18:24:11  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/98928486-chromeos-test/
    2017-01-30 18:19:41  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host3/1502422-reset/

So, the provision failure was part of a period
of general failures that lasted for over 7 hours.

Looking at the central sequence of provision/repair
events, there's a common theme:
  * Provision failed after the devserver was unable to
    talk to the DUT over ssh.
  * The DUT was accessible over ssh less than a minute
    later when the shard started repair.

This is consistent with some more recently observed
problems attributed to devserver load.
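The pattern identified above (a failed provision followed by a successful repair within minutes, implying the DUT was reachable and the devserver was at fault) can be expressed as a small timeline check over dut-status-style history. The event format and function here are illustrative, not a real tool:

```python
from datetime import datetime, timedelta

def suspicious_provisions(events, window_minutes=30):
    """Find failed provisions where the DUT was fine shortly afterward.

    events: list of (timestamp_str, task, ok) tuples sorted newest-first,
    as in dut-status output. Returns the timestamps of failed provision
    events that were followed by a successful repair within the window --
    the signature of a devserver-side failure rather than a dead DUT.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    parsed = [(datetime.strptime(ts, fmt), task, ok) for ts, task, ok in events]
    hits = []
    for i, (t, task, ok) in enumerate(parsed):
        if task == "provision" and not ok:
            # Newest-first ordering: events *after* this one in time appear
            # earlier in the list.
            for t2, task2, ok2 in parsed[:i]:
                if task2 == "repair" and ok2 and t2 - t < timedelta(minutes=window_minutes):
                    hits.append(t)
                    break
    return hits
```

Applied to the history above, the 00:03:00 failed provision followed by the 00:19:15 OK repair would match, consistent with the devserver-load hypothesis.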

Mergedinto: 692247
Status: Duplicate
