Excessive provisioning failures killing CQ runs
Reported by jrbarnette@chromium.org, Feb 23 2017
Issue description
We're seeing a large number of CQ runs failing because
of provisioning failures. There's no single, identified
root cause.
This is a tracking bug for purposes of reporting status
as we work through the symptoms to real, actionable bugs.
For reference, to recognize a provisioning failure, go
to a failed CQ master run, like this:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13738
The provisioning failures generally report with an error
message like this:
daisy_skate-paladin: The HWTest [bvt-inline] stage failed: ** Suite timed out before completion **
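(For anyone scripting over builder stage output, here is a minimal sketch of how that signature could be flagged. The regex is inferred only from the example message above; the real stage text may vary.)

import re

# Pattern inferred from the example message above; actual stage text may differ.
PROVISION_TIMEOUT_RE = re.compile(
    r'The HWTest \[(?P<suite>[\w-]+)\] stage failed: '
    r'\*\* Suite timed out before completion \*\*')

def looks_like_provision_timeout(stage_text):
    """Return the suite name if the stage text matches the known signature."""
    match = PROVISION_TIMEOUT_RE.search(stage_text)
    return match.group('suite') if match else None

# Using the example above:
#   looks_like_provision_timeout('daisy_skate-paladin: The HWTest [bvt-inline] '
#       'stage failed: ** Suite timed out before completion **')  -> 'bvt-inline'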
Feb 23 2017
The kevin paladin just failed again with a provisioning error: "No hosts within board:kevin satisfy arc, cts_abi_x86...". Obvious error here: kevin is an ARM platform!?! Issue 695567 filed.
Feb 23 2017
I'm completely on the wrong track, please ignore #4.
Feb 23 2017
I've created a spreadsheet of individual board paladin failures that are,
or at least look like, provisioning failures:
https://docs.google.com/spreadsheets/d/1L45c6BbFam2tVeBnAqjePMWYhU2cegT_iXYyFnZSR7A/edit#gid=0
I've collected 55+ failures, and that's only going back to about
last Saturday.
I'm continuing to fill in links to build failures. The next step would be
for some hardy volunteer to start filling in the column under the rubric
"Test log URL".
For consistency, do it this way:
* Follow the link to the build failure.
* Find the "Link to suite" under the suite failure.
* On the suite page, go to the list of child jobs at the
bottom, and follow the link to the failed test(s). The
failure is likely to be on the last screen; click "Next"
to find it.
* For the failed test, paste the URL labeled "All logs".
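If some hardy volunteer would rather script this walkthrough than click through it, a rough sketch of the link-following is below. It assumes the build-failure, suite, and test pages are plain HTML reachable with corp credentials and that the anchor texts "Link to suite" and "All logs" appear literally; the real pages may need different handling (paging through the child-job list via "Next", authentication, etc.).

# Sketch only: page structure and authentication are assumptions.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_link(page_url, link_text):
    """Return the absolute URL of the first anchor whose visible text matches."""
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for anchor in soup.find_all('a'):
        if anchor.get_text(strip=True) == link_text and anchor.get('href'):
            return urljoin(page_url, anchor['href'])
    return None

def test_log_url(build_failure_url):
    """Walk build failure -> suite page -> "All logs" URL (simplified)."""
    suite_url = find_link(build_failure_url, 'Link to suite')
    if not suite_url:
        return None
    # A real script would first page through the child-job list (the failed
    # test is usually on the last page) before looking for "All logs".
    return find_link(suite_url, 'All logs')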
Feb 23 2017
If this information is painful for you to gather and fill in: if we modify the classifier library to detect these failures, we can definitely generate some or all of this from a script. I have a lot of the logic in place from the sheriff-o-matic and suite report work.
Feb 23 2017
At this point, we've gathered log references to not quite 70 failed provisioning jobs. I've started to characterize the symptoms of the various failures.
Feb 24 2017
I've begun attaching symptoms to the most recent failures. A common theme is that the largest chunk of failures is attributable to a DUT that is offline when contacted by ssh. There's no indicator in the final exception as to exactly which ssh operation actually failed, so it's not known whether the problem is "reboot failed" or "DUT was offline at the outset". In my most recent prior experience, this symptom really is a failure on the DUT: repair fixes it by power cycling the DUT.
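To tell those two cases apart, the harness would have to record which ssh step it was on when the DUT went away. Just as an illustration (not the autotest implementation; host name and timeout are placeholders), something like:

# Sketch: a labeled reachability probe, so the failure says *which* step
# lost the DUT ("offline at the outset" vs. "did not come back from reboot").
import subprocess

def dut_reachable(host, timeout=10):
    """True if a trivial ssh command to the DUT succeeds within the timeout."""
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=%d' % timeout, host, 'true'],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def check_step(step_name, host):
    if not dut_reachable(host):
        raise RuntimeError('DUT %s unreachable during step: %s'
                           % (host, step_name))

# check_step('before provisioning', 'chromeos2-row3-rack4-host1')
# ... trigger update, reboot ...
# check_step('after reboot', 'chromeos2-row3-rack4-host1')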
Feb 24 2017
I am wondering if we're seeing just the tip of the iceberg with respect to provision failures. I just randomly ran into this:
2017-02-23 11:39:36 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602804-provision/
2017-02-23 10:14:54 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602682-repair/
2017-02-23 10:10:57 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/
2017-02-23 10:10:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602665-repair/
2017-02-23 10:06:29 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602649-provision/
2017-02-23 10:05:45 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602638-repair/
2017-02-23 10:01:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602616-provision/
2017-02-23 10:01:13 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602613-repair/
2017-02-23 09:56:55 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602597-provision/
2017-02-23 09:56:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602594-repair/
2017-02-23 09:52:05 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602574-provision/
2017-02-23 09:51:25 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602571-repair/
2017-02-23 09:46:47 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602554-provision/
2017-02-23 09:46:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row3-rack4-host1/1602551-repair/
I checked the second-to-last provision job (1602669), and although the DUT and the devserver both appear to be in good health, the devserver signals problems very quickly:
02/23 10:13:23.207 DEBUG| cros_host:0672| Resolving a devserver for auto-update
02/23 10:13:23.211 DEBUG| dev_server:0694| The host chromeos2-row3-rack4-host1 (100.115.226.34) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.224.0:19.
02/23 10:13:23.214 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/check_health?"''
02/23 10:13:24.905 DEBUG| cros_host:0719| Resolved devserver for auto-update: http://100.115.245.198:8082
02/23 10:13:24.907 DEBUG| dev_server:1915| Start CrOS auto-update for host chromeos2-row3-rack4-host1 at 1 time(s).
02/23 10:13:24.909 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/cros_au?build_name=falco-release/R58-9310.0.0&async=True&force_update=True&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:26.415 INFO | dev_server:1735| Received response from devserver for cros_au call: '[true, 20581]'
02/23 10:13:26.418 DEBUG| dev_server:1850| start process 20581 for auto_update in devserver
02/23 10:13:26.418 DEBUG| dev_server:1756| Check the progress for auto-update process 20581
02/23 10:13:26.419 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:27.729 DEBUG| dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:37.781 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:39.218 DEBUG| dev_server:1791| Current CrOS auto-update status: CrOS update is just started.
02/23 10:13:49.270 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/get_au_status?build_name=falco-release/R58-9310.0.0&force_update=True&pid=20581&host_name=chromeos2-row3-rack4-host1&full_update=False"''
02/23 10:13:50.723 DEBUG| dev_server:1853| Failed to trigger auto-update process on devserver
02/23 10:13:50.725 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/handler_cleanup?pid=20581&host_name=chromeos2-row3-rack4-host1"''
02/23 10:13:52.172 DEBUG| base_utils:0185| Running 'ssh 100.115.245.198 'curl "http://100.115.245.198:8082/collect_cros_au_log?pid=20581&host_name=chromeos2-row3-rack4-host1"''
02/23 10:13:53.558 DEBUG| dev_server:1685| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos2-row3-rack4-host1/1602669-provision/20172302101055/autoupdate_logs/CrOS_update_chromeos2-row3-rack4-host1_20581.log
02/23 10:13:53.559 DEBUG| dev_server:1952| Exception raised on auto_update attempt #1:
...
DeviceNotPingableError: Device chromeos2-row3-rack4-host1 is not pingable.
The second attempt is almost identical.
So what happened? The DUT was reachable at 10:13:15, but no longer at 10:13:50?
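For reference, that curl sequence can be reproduced by hand to test whether a given devserver can drive the update. A rough sketch, assuming direct HTTP access to the devserver on port 8082 (the harness tunnels the same calls over ssh, as the log shows); all parameter values are copied from the log above:

import json
import time
import requests

DEVSERVER = 'http://100.115.245.198:8082'
PARAMS = {
    'build_name': 'falco-release/R58-9310.0.0',
    'host_name': 'chromeos2-row3-rack4-host1',
    'force_update': 'True',
    'full_update': 'False',
}

def trigger_and_poll():
    # Kick off the update asynchronously; the response is [success, pid].
    resp = requests.get(DEVSERVER + '/cros_au',
                        params=dict(PARAMS, **{'async': 'True'}))
    success, pid = json.loads(resp.text)
    print('cros_au ->', success, pid)
    # Poll get_au_status roughly every 10 seconds, as the harness does;
    # the exact "finished" marker isn't shown in the log, so just cap the loop.
    for _ in range(30):
        status = requests.get(DEVSERVER + '/get_au_status',
                              params=dict(PARAMS, pid=pid)).text
        print('get_au_status ->', status)
        time.sleep(10)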
Feb 25 2017
Re #11: the DUT is reachable from the shard, but not reachable from devserver 100.115.245.198 at that time. The provision at 2017-02-23 11:39:36 happened to select another devserver (100.115.245.200), which proved to be OK. So I think the problem is between that devserver and this DUT. Maybe we can add a retry mechanism: if the DUT is not pingable from a devserver, resolve another devserver instead of retrying with the same one.
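A rough sketch of that fallback, with placeholder helpers standing in for the real devserver-resolution and health-check code (the addresses are just the two devservers mentioned in this thread):

import subprocess

def resolve_devserver(dut_hostname, exclude=()):
    """Placeholder: return a devserver in the DUT's restricted subnet."""
    candidates = ['100.115.245.198', '100.115.245.200']
    for devserver in candidates:
        if devserver not in exclude:
            return devserver
    return None

def devserver_can_ping(devserver, dut_hostname):
    """Placeholder: ask the devserver to ping the DUT (one packet)."""
    cmd = ['ssh', devserver, 'ping -c 1 -W 2 %s' % dut_hostname]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def pick_devserver_for(dut_hostname, max_attempts=3):
    """Resolve a devserver that can actually reach the DUT, skipping ones that can't."""
    tried = set()
    for _ in range(max_attempts):
        devserver = resolve_devserver(dut_hostname, exclude=tried)
        if devserver is None:
            break
        if devserver_can_ping(devserver, dut_hostname):
            return devserver
        tried.add(devserver)  # this devserver can't reach the DUT; try another
    raise RuntimeError('No devserver can reach %s' % dut_hostname)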
Feb 27 2017
This is blocking CQ hence P0.
Feb 27 2017
Added xixuan@ and dshi@, who are on the Chrome OS Infrastructure Deputy rotation, to CC.
Feb 27 2017
+sheriff, gardeners FYI
Mar 2 2017
OK. Updating status here. Executive summary: We're not seeing
as many provisioning failures in the CQ. The last two provision
failures to kill a CQ run were these:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13798
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13794
Those builds are now 27 and 33 hours old, respectively.
A few actions explain the reduction:
* We got two failed devservers back online. Many of the failures
seemed to be driven by too much devserver load.
* We made the kip builders experimental; see bug 696679.
We also successfully pushed to prod yesterday. That seems
to have stopped the "Suite timed out before completion" symptom.
Provisioning failures are now reported as test failures (as they
used to be); see the 13798 master-paladin run above.
Jun 30 2017
Closing this out: As best I can tell, the original problem
stemmed from a constellation of problems:
1) Generally, provisioning is subject to intermittent
failures. The failure rate may be as high as 5%.
2) Provisioning failures by themselves don't cause suite
failures. Instead, the suite retries the affected test,
while the system repairs the DUT to prove that the
provisioning failure wasn't a DUT failure.
3) The consequence of the repair that follows a provisioning
failure is that the suite runs with one fewer DUT, and
therefore runs slower.
4) At the time that this bug was filed, there were separate,
unrelated performance issues causing much of the testing
to run slow.
5) Combining 1-4, the consequence of a provisioning failure
was to slow down testing enough to cause suite timeouts.
The performance issues from 4) have been resolved, so 5)
no longer occurs. Addressing provisioning failures is an OKR
for the Infra team, and that work doesn't belong in _this_ bug.