Starred by 7 users
Status: WontFix
Owner:
Closed: Jun 30
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 696690
issue 696772
issue 734741
issue 689794
issue 692247
issue 693597
issue 695366
issue 695983
issue 696668
issue 696679
issue 696764
issue 697274
issue 698309



Excessive provisioning failures killing CQ runs
Project Member Reported by jrbarnette@chromium.org, Feb 23 2017
We're seeing a large number of CQ runs failing because
of provisioning failures.  There's no single, identified
root cause.

This is a tracking bug for purposes of reporting status
as we work through the symptoms to real, actionable bugs.

For reference, to recognize a provisioning failure, go
to a failed CQ master run, like this:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13738

The provisioning failures generally report with an error
message like this:
    daisy_skate-paladin: The HWTest [bvt-inline] stage failed: ** Suite timed out before completion **

 
I'm looking at https://viceroy.corp.google.com/chromeos/provision?duration=8d but not seeing a smoking gun.
Cc: adurbin@chromium.org
Cc: davidri...@chromium.org
Comment 4 by sha...@chromium.org, Feb 23 2017
kevin paladin just failed again w/ provisioning error, "No hosts within board:kevin satisfy arc, cts_abi_x86...". Obvious error here, kevin is an ARM platform!?!

 Issue 695567  filed.
Comment 5 by sha...@chromium.org, Feb 23 2017
I'm completely on the wrong track, please ignore #4.
I've created a spreadsheet of individual board paladin failures
that are, or at least look like, provisioning failures:
    https://docs.google.com/spreadsheets/d/1L45c6BbFam2tVeBnAqjePMWYhU2cegT_iXYyFnZSR7A/edit#gid=0

I've collected 55+ failures, and that's only going back to
last Saturday or so.

I'm continuing to fill in links to build failures.  The next step would be
for some hardy volunteer to start filling in the column under the rubric
"Test log URL".

For consistency, do it this way:
  * Follow the link to the build failure.
  * Find the "Link to suite" under the suite failure.
  * On the suite page, go to the list of child jobs at the
    bottom, and follow the link to the failed test(s).  The
    failure is likely to be on the last screen; click "Next"
    to find it.
  * For the failed test, paste the URL labeled "All logs".

If this information is painful to gather and fill in by hand, we can definitely generate some or all of it from a script by modifying the classifier library to detect these failures.  I have a lot of the logic in place from the sheriff-o-matic and suite report work.
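As a rough illustration, a minimal sketch of that kind of script (the CSV filename, the column header, and the symptom patterns below are placeholders, not part of the existing classifier library):

    # Rough sketch only: tally symptom patterns across the "Test log URL"
    # column of a CSV export of the spreadsheet.  The filename, column
    # header, and patterns are illustrative placeholders.
    import csv
    import re
    import urllib.request
    from collections import Counter

    SYMPTOM_PATTERNS = {
        'dut_not_pingable': re.compile(r'is not pingable'),
        'suite_timeout': re.compile(r'Suite timed out before completion'),
    }

    def classify(log_text):
        """Return the first symptom whose pattern matches, else 'unknown'."""
        for name, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(log_text):
                return name
        return 'unknown'

    def main(csv_path='provision_failures.csv', url_column='Test log URL'):
        counts = Counter()
        with open(csv_path) as f:
            for row in csv.DictReader(f):
                url = row.get(url_column, '').strip()
                if not url:
                    continue
                try:
                    text = urllib.request.urlopen(url, timeout=30).read()
                except OSError:
                    counts['fetch_error'] += 1
                    continue
                counts[classify(text.decode('utf-8', errors='replace'))] += 1
        for symptom, n in counts.most_common():
            print('%-20s %d' % (symptom, n))

    if __name__ == '__main__':
        main()
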
At this point, we've gathered log references to not quite 70
failed provisioning jobs.  I've started to characterize the
symptoms of the various failures.

I've begun attaching symptoms to the most recent failures.
A common theme is that the largest chunk of failures is
attributable to a DUT that is offline when contacted by ssh.

There's no indicator in the final exception as to exactly which
ssh operation actually failed, so it's not known whether the
problem is "reboot failed" or "DUT was offline at the outset".
In my most recent experience, this symptom really is a
failure on the DUT:  repair fixes it by power cycling the
DUT.
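One way to get that indicator would be to log DUT reachability separately before and after the reboot step.  A minimal sketch of that kind of instrumentation, using plain ssh from the server (not the actual cros_host code; the timeouts are guesses):

    # Sketch: check DUT reachability before issuing the reboot and again
    # while waiting for it to return, so the final error says which phase
    # failed.  Timeouts and the wait period are illustrative.
    import subprocess
    import time

    def dut_reachable(host, timeout=10):
        """True if 'ssh <dut> true' succeeds within the timeout."""
        result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=%d' % timeout,
             'root@%s' % host, 'true'],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    def reboot_with_diagnosis(host, wait_secs=300):
        if not dut_reachable(host):
            raise RuntimeError('%s was offline at the outset' % host)
        subprocess.run(['ssh', 'root@%s' % host, 'reboot'],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        deadline = time.time() + wait_secs
        while time.time() < deadline:
            time.sleep(10)
            if dut_reachable(host):
                return
        raise RuntimeError('%s did not come back within %ds of the reboot'
                           % (host, wait_secs))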

Comment 10 by hywu@chromium.org, Feb 24 2017
Cc: hywu@chromium.org
I am wondering if we're seeing just the tip of the iceberg with respect to provision failures.  I just randomly ran into this:

    2017-02-23 11:39:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602804-provision/
    2017-02-23 10:14:54  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602682-repair/
    2017-02-23 10:10:57  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/
    2017-02-23 10:10:21  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602665-repair/
    2017-02-23 10:06:29  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602649-provision/
    2017-02-23 10:05:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602638-repair/
    2017-02-23 10:01:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602616-provision/
    2017-02-23 10:01:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602613-repair/
    2017-02-23 09:56:55  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602597-provision/
    2017-02-23 09:56:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602594-repair/
    2017-02-23 09:52:05  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602574-provision/
    2017-02-23 09:51:25  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602571-repair/
    2017-02-23 09:46:47  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602554-provision/
    2017-02-23 09:46:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602551-repair/


I checked the second-to-last provision job (1602669): although both the DUT and the devserver appear to be in good health, the devserver signals problems very quickly:

02/23 10:13:23.207 DEBUG|         cros_host:0672| Resolving a devserver for auto-update
02/23 10:13:23.211 DEBUG|        dev_server:0694| The host chromeos2-row3-rack4-host1 (100.115.226.34) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.224.0:19.
02/23 10:13:23.214 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/check_health?"''
02/23 10:13:24.905 DEBUG|         cros_host:0719| Resolved devserver for auto-update: http://100.115.245.198:8082
02/23 10:13:24.907 DEBUG|        dev_server:1915| Start CrOS auto-update for host chromeos2-row3-rack4-host1 at 1 time(s).
02/23 10:13:24.909 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/cros_au?build_name=falco-release/R58-9310.0.0&async=True&force_update=True&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:26.415 INFO |        dev_server:1735| Received response from devserver for cros_au call: '[true, 20581]'
02/23 10:13:26.418 DEBUG|        dev_server:1850| start process 20581 for auto_update in devserver
02/23 10:13:26.418 DEBUG|        dev_server:1756| Check the progress for auto-update process 20581
02/23 10:13:26.419 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:27.729 DEBUG|        dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:37.781 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:39.218 DEBUG|        dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:49.270 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:50.723 DEBUG|        dev_server:1853| Failed to trigger auto-update process on devserver
02/23 10:13:50.725 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/handler_cleanup?pid=20581&host_name=chromeos2-row3-rack4-host1"''
02/23 10:13:52.172 DEBUG|        base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/collect_cros_au_log?pid=20581&host_name=chromeos2-row3-rack4-host1"''
02/23 10:13:53.558 DEBUG|        dev_server:1685| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/20172302101055/autoupdate_logs/CrOS_update_chromeos2-row3-rack4-host1_20581.log
02/23 10:13:53.559 DEBUG|        dev_server:1952| Exception raised on auto_update attempt #1:
...
DeviceNotPingableError: Device chromeos2-row3-rack4-host1 is not pingable.

The second attempt is almost identical.

So what happened?  The DUT was reachable at 10:13:15, but no longer at 10:13:50?
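For reference, the sequence of devserver calls in that log boils down to the loop below (a sketch reconstructed from the log above; the constants mirror the values in the log, the status check is a simplification, and the real dev_server.py client has more error handling):

    # Sketch of the auto-update flow seen in the log: trigger /cros_au on
    # the devserver, poll /get_au_status, and clean up on failure.  This is
    # not the production dev_server.py client.
    import subprocess
    import time

    DEVSERVER_IP = '100.115.245.198'
    DEVSERVER = 'http://%s:8082' % DEVSERVER_IP
    HOST = 'chromeos2-row3-rack4-host1'
    BUILD = 'falco-release/R58-9310.0.0'

    def devserver_call(endpoint, args):
        """Run curl on the devserver over ssh, as the log shows."""
        query = '&'.join('%s=%s' % item for item in args.items())
        url = '%s/%s?%s' % (DEVSERVER, endpoint, query)
        out = subprocess.check_output(['ssh', DEVSERVER_IP, 'curl "%s"' % url])
        return out.decode('utf-8', errors='replace')

    def auto_update():
        args = {'build_name': BUILD, 'host_name': HOST,
                'force_update': 'True', 'full_update': 'False'}
        start_args = dict(args)
        start_args['async'] = 'True'
        # The response looks like '[true, 20581]': a success flag plus the
        # pid of the auto-update process on the devserver.
        response = devserver_call('cros_au', start_args)
        pid = response.strip('[] \n').split(',')[1].strip()
        try:
            while True:
                status = devserver_call('get_au_status', dict(args, pid=pid))
                if 'CrOS update is just started' not in status:
                    break  # finished, or failed as in the log above
                time.sleep(10)
        finally:
            devserver_call('handler_cleanup',
                           {'pid': pid, 'host_name': HOST})
            devserver_call('collect_cros_au_log',
                           {'pid': pid, 'host_name': HOST})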

Comment 12 by ihf@chromium.org, Feb 25 2017
Cc: ihf@chromium.org
Re #11: the DUT is reachable from the shard, but not reachable from devserver 100.115.245.198 at that time. The provision at 2017-02-23 11:39:36 happened to select another devserver (100.115.245.200), which proved to be OK. So I think the problem is between that devserver and this DUT.

So maybe we can add a retry mechanism: if the DUT is not pingable from a devserver, resolve another devserver instead of retrying with the same one.
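Something along these lines, as a sketch of the proposal (the candidate list and the ping-over-ssh helper are placeholders for whatever dev_server.py actually provides):

    # Sketch of the proposed retry: if the chosen devserver cannot ping the
    # DUT, fall back to another candidate devserver rather than retrying
    # the same one.  Helper names here are placeholders.
    import subprocess

    def pingable_from_devserver(devserver_ip, dut_hostname):
        """True if the devserver can ping the DUT (1 packet, 5s timeout)."""
        result = subprocess.run(
            ['ssh', devserver_ip, 'ping -c 1 -W 5 %s' % dut_hostname],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    def resolve_devserver(candidate_ips, dut_hostname):
        """Return the first candidate devserver that can reach the DUT."""
        for ip in candidate_ips:
            if pingable_from_devserver(ip, dut_hostname):
                return ip
        raise RuntimeError('No candidate devserver can ping %s' % dut_hostname)

    # E.g. with the two devservers mentioned in this bug:
    #   resolve_devserver(['100.115.245.198', '100.115.245.200'],
    #                     'chromeos2-row3-rack4-host1')
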
Comment 14 by oka@chromium.org, Feb 27 2017
Labels: -Pri-3 Pri-0
Comment 15 by oka@chromium.org, Feb 27 2017
This is blocking the CQ, hence P0.
Comment 16 by oka@chromium.org, Feb 27 2017
Cc: dshi@chromium.org xixuan@chromium.org
Added xixuan@ and dshi@ to Cc; they are the Chrome OS Infrastructure Deputies.
Comment 17 by ihf@google.com, Feb 27 2017
Cc: levarum@chromium.org marc...@chromium.org afakhry@chromium.org
+sheriff, gardeners FYI
Comment 18 by oka@chromium.org, Feb 27 2017
Cc: oka@chromium.org
Blockedon: 695983
Blockedon: 693597
Blockedon: 696668 696679 696764 696772
Blockedon: 697274
Labels: -Pri-0 Pri-1
OK.  Updating status here.  Executive summary:  We're not seeing
as many provisioning failures in the CQ.  The last two provision
failures to kill a CQ run were these:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13798
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13794

Those builds are now 27 and 33 hours old, respectively.

There have been a few actions that may explain the reduction:
  * We got two failed devservers back online.  Many of the failures
    seemed to be driven by too much devserver load.
  * We made the kip builders experimental; see  bug 696679 .

We also successfully pushed to prod yesterday.  That seems
to have stopped the "Suite timed out before completion" symptom.
Provisioning failures are now reported as test failures (as they
used to be) - see the 13798 master paladin run (above).

Blockedon: 696690 692247 698309
Adding some other bugs identifying provision failures.

I expect that some of the nominally blocking bugs are
duplicates, not actionable, or not provision failures.
A round of cleanup/clarification is needed to get the
list down.

Blockedon: 689794
Blockedon: 695366
Blockedon: 734741
Status: WontFix
Closing this out:  As best I can tell, the original problem
stemmed from a constellation of problems:
 1) Generally, provisioning is subject to intermittent
    failures.  The failure rate may be as high as 5%.
 2) Provisioning failures by themselves don't cause suite
    failures.  Instead, the suite retries the affected test,
    while the system repairs the DUT to prove that the
    provisioning failure wasn't a DUT failure.
 3) Because of the repair that follows a provisioning failure,
    the suite runs with one fewer DUT, and therefore runs
    slower.
 4) At the time that this bug was filed, there were separate,
    unrelated performance issues causing much of the testing
    to run slow.
 5) Combining 1-4, the consequence of a provisioning failure
    was to slow down testing enough to cause suite timeouts
    (rough numbers sketched below).
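To put rough, purely illustrative numbers on 5) (the DUT count, baseline runtime, slowdown factor, and timeout below are assumptions, not measurements from the affected suites):

    # Purely illustrative arithmetic; none of these numbers are measured.
    dut_count = 6             # assumed DUTs assigned to the suite
    baseline_hours = 2.0      # assumed runtime with all DUTs healthy
    unrelated_slowdown = 1.8  # assumed factor from the separate perf issues
    timeout_hours = 4.0       # assumed suite timeout

    # A provisioning failure sidelines one DUT for repair, so the rest
    # absorb its share of the work.
    with_repair = baseline_hours * dut_count / (dut_count - 1)

    print('slow lab only:          %.1f h' % (baseline_hours * unrelated_slowdown))
    print('provision failure only: %.1f h' % with_repair)
    print('both combined:          %.1f h  (timeout: %.1f h)'
          % (with_repair * unrelated_slowdown, timeout_hours))
    # Either effect alone fits inside the timeout; together they do not.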

The performance issues from 4) have been resolved, so 5)
no longer occurs.  Addressing provisioning failures is an
OKR for the Infra team, and that work doesn't belong in
_this_ bug.
