Starred by 7 users

Issue metadata

Status: WontFix
Closed: Jun 2017
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 734741
issue 689794
issue 692247
issue 693597
issue 695366
issue 695983
issue 696668
issue 696679
issue 696690
issue 696764
issue 696772
issue 697274
issue 698309


Excessive provisioning failures killing CQ runs

Reported by (Project Member), Feb 23 2017

Issue description

We're seeing a large number of CQ runs failing because
of provisioning failures.  There's no single, identified
root cause.

This is a tracking bug for purposes of reporting status
as we work through the symptoms to real, actionable bugs.

For reference, to recognize a provisioning failure, go
to a failed CQ master run, like this:

The provisioning failures generally report with an error
message like this:
    daisy_skate-paladin: The HWTest [bvt-inline] stage failed: ** Suite timed out before completion **

I'm looking at but not seeing a smoking gun.

Comment 4 by, Feb 23 2017

kevin paladin just failed again w/ provisioning error, "No hosts within board:kevin satisfy arc, cts_abi_x86...". Obvious error here, kevin is an ARM platform!?!

 Issue 695567  filed.

Comment 5 by, Feb 23 2017

I'm completely on the wrong track; please ignore #4.
I've created a spreadsheet of individual board paladin failures
that are, or at least look like, provisioning failures:

I've collected 55+ failures, and that only goes back to
last Saturday or so.

I'm continuing to fill in links to build failures.  The next step would be
for some hardy volunteer to start filling in the column under the rubric
"Test log URL".

For consistency, do it this way:
  * Follow the link to the build failure.
  * Find the "Link to suite" under the suite failure.
  * On the suite page, go to the list of child jobs at the
    bottom, and follow the link to the failed test(s).  The
    failure is likely to be on the last screen; click "Next"
    to find it.
  * For the failed test, paste the URL labeled "All logs".

If this information is painful for you to gather and fill in, there's an alternative: if you modify the classifier library to detect these failures, we can generate some or all of this from a script.  I have a lot of the logic in place from the sheriff-o-matic and suite report work.
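
For what it's worth, here's a rough standalone sketch of the kind of matching such a script could do.  This is not the real classifier library, and the patterns are only illustrative, drawn from symptoms quoted in this bug:

    import re

    # Illustrative symptom patterns only, taken from messages quoted in this bug.
    _PROVISION_FAILURE_PATTERNS = [
        re.compile(r'Suite timed out before completion'),
        re.compile(r'DeviceNotPingableError'),
        re.compile(r'Failed to trigger auto-update process on devserver'),
    ]

    def looks_like_provision_failure(message):
        """Return True if a failure message matches a known provisioning symptom."""
        return any(p.search(message) for p in _PROVISION_FAILURE_PATTERNS)
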
At this point, we've gathered log references to not quite 70
failed provisioning jobs.  I've started to characterize the
symptoms of the various failures.

I've begun attaching symptoms to the most recent failures.
A common theme is that the largest chunk of failures is
attributable to a DUT that is offline when contacted by ssh.

There's no indicator in the final exception as to exactly which
ssh operation actually failed, so it's not known whether the
problem is "reboot failed" or "DUT was offline at the outset".
My most recent experience is that this symptom really
is a failure on the DUT:  Repair fixes it by power cycling the
DUT.
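
One way to make the exception self-describing would be to tag each ssh operation with a label.  A minimal sketch, using hypothetical names (run_ssh_step() is a stand-in, not a real autotest helper):

    import subprocess

    class ProvisionSshError(Exception):
        """Carries the name of the ssh step that failed."""

    def run_ssh_step(host, command, step):
        # Label each ssh operation so the final error distinguishes, e.g.,
        # "DUT offline at the outset" from "reboot failed".
        try:
            subprocess.run(['ssh', host, command], check=True, timeout=60)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
            raise ProvisionSshError('ssh step %r failed on %s: %s' % (step, host, e))

    # Usage sketch:
    #   run_ssh_step(dut, 'true', step='initial reachability check')
    #   run_ssh_step(dut, 'reboot', step='reboot')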

Comment 10 by, Feb 24 2017

I am wondering if we're seeing just the tip of the iceberg with respect to provision failures.  I just randomly ran into this:

    2017-02-23 11:39:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602804-provision/
    2017-02-23 10:14:54  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602682-repair/
    2017-02-23 10:10:57  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/
    2017-02-23 10:10:21  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602665-repair/
    2017-02-23 10:06:29  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602649-provision/
    2017-02-23 10:05:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602638-repair/
    2017-02-23 10:01:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602616-provision/
    2017-02-23 10:01:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602613-repair/
    2017-02-23 09:56:55  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602597-provision/
    2017-02-23 09:56:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602594-repair/
    2017-02-23 09:52:05  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602574-provision/
    2017-02-23 09:51:25  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602571-repair/
    2017-02-23 09:46:47  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602554-provision/
    2017-02-23 09:46:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602551-repair/

I checked the second-to-last provision job (1602669), and although both the DUT and the devserver appear to be in good health, the devserver signals problems very quickly:

02/23 10:13:23.207 DEBUG|         cros_host:0672| Resolving a devserver for auto-update
02/23 10:13:23.211 DEBUG|        dev_server:0694| The host chromeos2-row3-rack4-host1 ( is in a restricted subnet. Try to locate a devserver inside subnet
02/23 10:13:23.214 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:24.905 DEBUG|         cros_host:0719| Resolved devserver for auto-update:
02/23 10:13:24.907 DEBUG|        dev_server:1915| Start CrOS auto-update for host chromeos2-row3-rack4-host1 at 1 time(s).
02/23 10:13:24.909 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:26.415 INFO |        dev_server:1735| Received response from devserver for cros_au call: '[true, 20581]'
02/23 10:13:26.418 DEBUG|        dev_server:1850| start process 20581 for auto_update in devserver
02/23 10:13:26.418 DEBUG|        dev_server:1756| Check the progress for auto-update process 20581
02/23 10:13:26.419 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:27.729 DEBUG|        dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:37.781 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:39.218 DEBUG|        dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:49.270 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:50.723 DEBUG|        dev_server:1853| Failed to trigger auto-update process on devserver
02/23 10:13:50.725 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:52.172 DEBUG|        base_utils:0185| Running 'ssh 'curl ""''
02/23 10:13:53.558 DEBUG|        dev_server:1685| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/20172302101055/autoupdate_logs/CrOS_update_chromeos2-row3-rack4-host1_20581.log
02/23 10:13:53.559 DEBUG|        dev_server:1952| Exception raised on auto_update attempt #1:
DeviceNotPingableError: Device chromeos2-row3-rack4-host1 is not pingable.

The second attempt is almost identical.

So what happened?  The DUT was reachable at 10:13:15, but no longer at 10:13:50?
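
To put a number on the "tip of the iceberg" idea, here's a throwaway sketch that counts failed provisions in a pasted host-history listing like the one above (assuming the format shown: date, time, OK/--, log URL):

    def count_failed_provisions(history_text):
        """Count provision jobs marked '--' in a pasted host-history listing."""
        failed = total = 0
        for line in history_text.splitlines():
            fields = line.split()
            if len(fields) < 4 or '-provision/' not in fields[-1]:
                continue
            total += 1
            if fields[2] == '--':
                failed += 1
        return failed, total

    # For the listing above this reports 6 failed out of 7 provision attempts,
    # each failure followed by a successful repair.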

Comment 12 by, Feb 25 2017

Re #11: the DUT is reachable from the shard, but not reachable from the devserver at that time. The provision at 2017-02-23 11:39:36 happened to select another devserver ( which proved to be OK. So I think the problem is between that devserver and this DUT.

So maybe we can add a retry mechanism: if the DUT is not pingable from a devserver, resolve a different devserver instead of retrying with the same one.
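
A rough sketch of what that could look like (the names below are stand-ins, not the real dev_server API):

    # Stand-in names only; not the real dev_server API.
    class DeviceNotPingableError(Exception):
        """The chosen devserver reports it cannot ping the DUT."""

    def auto_update_with_failover(host, build, resolve_devserver,
                                  trigger_auto_update, max_attempts=3):
        # resolve_devserver(host, exclude) -> a devserver name not in 'exclude'.
        # trigger_auto_update(devserver, host, build) raises
        # DeviceNotPingableError when that devserver cannot reach the DUT.
        tried = set()
        for _ in range(max_attempts):
            devserver = resolve_devserver(host, exclude=tried)
            tried.add(devserver)
            try:
                return trigger_auto_update(devserver, host, build)
            except DeviceNotPingableError:
                continue  # this devserver can't reach the DUT; pick another
        raise DeviceNotPingableError('no devserver in %s could reach %s'
                                     % (sorted(tried), host))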

Comment 14 by, Feb 27 2017

Labels: -Pri-3 Pri-0

Comment 15 by, Feb 27 2017

This is blocking CQ hence P0.

Comment 16 by, Feb 27 2017

Added xixuan@ and dshi@ to CC; they are the current Chrome OS Infrastructure deputies.

Comment 17 by, Feb 27 2017

+sheriff, gardeners FYI

Comment 18 by, Feb 27 2017

Blockedon: 695983
Blockedon: 693597
Blockedon: 696668 696679 696764 696772
Blockedon: 697274
Labels: -Pri-0 Pri-1
OK.  Updating status here.  Executive summary:  We're not seeing
as many provisioning failures in the CQ.  The last two provision
failures to kill a CQ run were these:

Those builds are now 27 and 33 hours old, respectively.

A few actions explain the reduction:
  * We got two failed devservers back online.  Many of the failures
    seemed to be driven by too much devserver load.
  * We made the kip builders experimental; see  bug 696679 .

We also successfully pushed to prod yesterday.  That seems
to have stopped the "Suite timed out before completion" symptom.
Provisioning failures are now reported as test failures (as they
used to be) - see the 13798 master paladin run (above).

Blockedon: 696690 692247 698309
Adding some other bugs identifying provision failures.

I expect that some of the nominally blocking bugs are
duplicates, not actionable, or not provision failures.
A round of cleanup/clarification is needed to get the
list down.

Blockedon: 689794
Blockedon: 695366
Blockedon: 734741
Status: WontFix
Closing this out:  As best I can tell, the original failures
stemmed from a constellation of problems:
 1) Generally, provisioning is subject to intermittent
    failures.  The failure rate may be as high as 5%.
 2) Provisioning failures by themselves don't cause suite
    failures.  Instead, the suite retries the affected test,
    while the system repairs the DUT to prove that the
    provisioning failure wasn't a DUT failure.
 3) The consequence of the repair that follows a provisioning
    failure is that the suite runs with one less DUT, and
    consequently runs slower.
 4) At the time that this bug was filed, there were separate,
    unrelated performance issues causing much of the testing
    to run slow.
 5) Combining 1-4, the consequence of a provisioning failure
    was to slow down testing enough to cause suite timeouts.

The performance issues from 4) have been resolved, so 5)
no longer occurs.  Addressing provisioning failures in general
is an OKR for the Infra team; that work doesn't belong in _this_ bug.
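
As a back-of-envelope illustration of how points 1-5 combine into a timeout, here's a toy calculation; none of the numbers reflect real suite sizes or DUT counts:

    # All numbers below are made up; the point is the shape of the arithmetic.
    def suite_runtime_minutes(num_tests, num_duts, minutes_per_test, slowdown=1.0):
        # Simplistic model: tests spread evenly across the available DUTs.
        tests_per_dut = -(-num_tests // num_duts)  # ceiling division
        return tests_per_dut * minutes_per_test * slowdown

    healthy = suite_runtime_minutes(num_tests=40, num_duts=4, minutes_per_test=8)
    # One provisioning failure sidelines a DUT for repair (points 2-3), and the
    # unrelated performance problems (point 4) add, say, a 1.3x slowdown.
    degraded = suite_runtime_minutes(num_tests=40, num_duts=3, minutes_per_test=8,
                                     slowdown=1.3)
    print(healthy, degraded)   # ~80 vs ~146 minutes: enough to blow a suite timeout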
