limit provision time to 30 minutes
Issue description
Looking at provision rates across ~100 CQ runs (~1 week) shows an overall provision failure rate of about 7%:
Provision: {
fail: {
average: 0:22:28.392885,
count: 1265,
max: 0:52:37,
min: 0:00:12,
total: 19 days, 17:48:37
},
pass: {
average: 0:10:21.496341,
count: 18038,
max: 0:28:44,
min: 0:03:05,
total: 129 days, 18:02:31
}
},
Data mined from https://davidriley.users.x20web.corp.google.com/builds100-20170602.json under the following path ['all_hqes_timings']['by_test_status']['Provision'].
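For reference, a minimal sketch of how that overall failure rate can be recomputed from the exported JSON. This is not the original mining script; field names beyond the quoted path (e.g. the 'count' entries) are assumptions based on the summary printed above.

import json

with open('builds100-20170602.json') as f:
    data = json.load(f)

provision = data['all_hqes_timings']['by_test_status']['Provision']
fail_count = provision['fail']['count']   # 1265 in this sample
pass_count = provision['pass']['count']   # 18038 in this sample

failure_rate = fail_count / float(fail_count + pass_count)
print('provision failure rate: %.1f%%' % (100 * failure_rate))   # ~6.6%, i.e. "about 7%"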
I sliced the data by looking at builds with more than one provision failure. It shows that the 1265 failures occurred over 11 CQ runs across 54 distinct slave builds, and that the master CQ run always failed in those cases, taking roughly an extra hour to do so compared with runs that had no provision failure.
total failures 1265
above threshold
occurrences: 54
summary master (avg 3:01:07.893784 of 11 builds) status: Counter({u'fail': 11})
summary slaves (avg 2:48:17.454514 of 54 builds) status: Counter({u'fail': 46, u'inflight': 6, u'pass': 2})
below threshold
occurrences: 1808
summary master (avg 1:56:01.518971 of 112 builds) status: Counter({u'fail': 61, u'pass': 51})
summary slaves (avg 1:17:12.898115 of 1632 builds) status: Counter({u'pass': 1542, u'fail': 58, u'inflight': 32})
These numbers seem really high -- 1265 failures over 54 builds is an average of 23 failures in each of those cases. Here's a sample:
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
slave: 1561566 cyan-paladin:2741 -> fail (self destructed False) duration 3:05:58.144340
failures 20
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
slave: 1561576 kevin-paladin:1281 -> fail (self destructed False) duration 2:52:46.076413
failures 28
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
slave: 1561584 veyron_minnie-paladin:2743 -> fail (self destructed False) duration 3:20:28.321197 (slowest_slave,slowest_undestructed_slave)
failures 18
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
slave: 1561592 veyron_mighty-paladin:5475 -> fail (self destructed False) duration 2:28:11.683460
failures 27
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
slave: 1561598 elm-paladin:2930 -> fail (self destructed False) duration 2:39:26.661851
failures 30
Here's a handy viceroy chart of one of them:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1281
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1561576
Here's a viceroy chart of one where the build actually succeeded with only three provision failures:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/5474
master: 1561199 master-paladin:14892 -> fail (self destructed False) duration 2:02:05.533723
slave: 1561365 veyron_mighty-paladin:5474 -> pass (self destructed False) duration 1:43:33.723783 (slowest_slave,slowest_undestructed_slave)
failures 3
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1561365
It's not clear which of these failures are due to the product and which to infrastructure, and that is probably the first thing to investigate.
Immediate takeaways:
- I think we should be aborting faster when there are a bunch of provision failures
- Ensure that we fail provisions more quickly rather than letting them run for 20-50 minutes
,
Jun 3 2017
To start, how about we set the maximum provision time to 30 minutes? That's greater than the max for passes, and it would reduce the average time for failures, so it seems like an easy win. The percentage of failures seems like the main thing to look at, though. One note: part of the reason there are so many failures is that the same CL went through the CQ several times, causing all provisions to fail.
,
Jun 3 2017
Re c#2: You pointed out afterwards that there was a particular bad CL which was causing some boards to fail to boot. I'm not sure if it was really obvious, but it'd be useful if we could extract the fact that the build failed to provision on any DUT.
,
Jun 3 2017
Setting a 30-minute provision limit sounds sensible to me. I'm not sure if there is any other action being requested on this bug. http://shortn/_O1TSBeEqhv also bolsters this case: on ToT, the 99.9th percentile of successful provision jobs is around 30 minutes.
,
Jun 3 2017
Looking over ~400 CQ runs (data included):
Provision: {
fail: {
average: 0:18:37.763157,
count: 2090,
max: 1:07:00,
min: 0:00:11,
total: 27 days, 0:55:25
},
pass: {
average: 0:10:38.022425,
count: 62384,
max: 0:41:08,
min: 0:02:53,
total: 460 days, 16:13:11
}
},
There were 93 successful provisions over 30 minutes. 3 over 40 minutes.
Provision success duration >= 30 seconds
occurrences: 93
summary master (avg 2:47:55.205322 of 32 builds) status: Counter({u'fail': 21, u'pass': 11})
summary slaves (avg 2:12:03.536017 of 40 builds) status: Counter({u'pass': 30, u'fail': 7, u'inflight': 3})
Provision success duration >= 40 seconds
occurrences: 3
summary master (avg 3:43:28.410645 of 2 builds) status: Counter({u'fail': 2})
summary slaves (avg 2:42:51.873022 of 3 builds) status: Counter({u'fail': 2, u'pass': 1})
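The threshold counts above can be reproduced with something like the following sketch, assuming a hypothetical flat list of per-task records with 'status' and 'duration' ("H:MM:SS[.ffffff]") fields; the real export layout may differ.

from datetime import timedelta

def parse_duration(text):
    # Parse a "H:MM:SS[.ffffff]" string into a timedelta.
    hours, minutes, seconds = text.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))

def count_slow_passes(tasks, threshold=timedelta(minutes=30)):
    # Count successful provisions whose duration meets or exceeds the threshold.
    return sum(1 for t in tasks
               if t['status'] == 'pass' and parse_duration(t['duration']) >= threshold)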
,
Jun 3 2017
c#5: s/seconds/minutes/
,
Jun 3 2017
I think we should aim to understand why 7% of provisions are failing, and clearly distinguish between bad product code and infra-related failures. The 400-CQ-run sample (~1 month) shows a 3.35% failure rate, which still seems too high. I believe we have the hardware capacity to handle all of our provision requests, so the remainder is either software deficiencies in our provisioning, insufficient network capacity, and/or poor load balancing, all of which we should be able to fix. The provision failure rate should be entirely due to bad product code (or bad DUTs that should be removed), not infra flake.
I'd really rather drop that timeout to 20 minutes if we can, but I don't think we're there yet. I spent a bunch of time trying to find out where a provision timeout exists, and I can't find any such thing. Maybe there's something in the HQEs, but I couldn't find code to fill in anything for provisions there either. Anybody know offhand?
,
Jun 6 2017
In addition to limiting provision time to 30 minutes (jrbarnette@: do you know how to implement this timeout?), I think we could also implement a policy of:
- do not provision a single DUT more than twice in a single suite
If provisions are not counting against test retries, we might also want to implement a maximum re-provision count across the entire suite, to avoid cases like all the slave builds of https://viceroy.corp.google.com/chromeos/build_details?build_id=1562746
Example: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1562882
The proposal I just listed is somewhat different because it attempts to mitigate the effects of a bad DUT.
davidjames@: Can you add this to your spreadsheet?
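A rough sketch of what such a per-DUT / per-suite provision cap could look like (illustrative only; all names are hypothetical and this is not existing scheduler code):

from collections import Counter

class ProvisionBudget(object):
    # Track provision attempts so a suite never provisions one DUT more than
    # per_dut_limit times, nor exceeds a suite-wide re-provision budget.
    def __init__(self, per_dut_limit=2, suite_limit=20):
        self.per_dut_limit = per_dut_limit
        self.suite_limit = suite_limit
        self.attempts_by_dut = Counter()
        self.total_attempts = 0

    def may_provision(self, dut_hostname):
        return (self.attempts_by_dut[dut_hostname] < self.per_dut_limit
                and self.total_attempts < self.suite_limit)

    def record_attempt(self, dut_hostname):
        self.attempts_by_dut[dut_hostname] += 1
        self.total_attempts += 1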
,
Jun 12 2017
,
Jun 12 2017
They both relate to provisioning but are mostly orthogonal issues.
,
Jun 14 2017
> do not provision a single DUT more than twice in a single suite
See Issue 732001. The bug has touched on a lot of topics. I think the most concrete action here is to reduce the provision timeout to 30m (which sgtm). Therefore, hijacking this bug for that purpose and assigning to deputy as P1.
,
Jun 14 2017
> The bug has touched on a lot of topics. I think the most
> concrete action here is to reduce the provision timeout to 30m
> (which sgtm).
TTBOMK, there's no provision timeout. More broadly, there's no timeout of any sort for special tasks; that includes provision. There's a longstanding request for the feature; see bug 271838. The last few comments on that bug suggest that it's not too much work.
I'll note that, as a practical matter, I think the only way provision tasks can exceed 30 minutes is if they retry internally. I've seen provision logs showing up to 4 different attempts to provision. It's very likely we can get shorter provision tasks just by eliminating most or all of the retry logic.
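To make the "fail faster" idea concrete, here is a rough Python 3 sketch of a hard per-attempt timeout with a capped retry count. The command name and entry point are placeholders, not the actual autotest provisioning code:

import subprocess

PROVISION_TIMEOUT_SECS = 30 * 60

def provision_once(dut, image):
    # Placeholder command; the real provisioning entry point differs.
    cmd = ['provision_dut', '--host', dut, '--image', image]
    subprocess.check_call(cmd, timeout=PROVISION_TIMEOUT_SECS)

def provision_with_budget(dut, image, max_attempts=1):
    # With max_attempts=1 there are no internal retries, so a bad provision
    # fails within one timeout period instead of retrying for 30-50 minutes.
    for _ in range(max_attempts):
        try:
            provision_once(dut, image)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            continue
    return False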
,
Jun 14 2017
I am fine with a 30-minute provision timeout, if it materializes. (20 minutes might be too short on old Chromebooks, but also remember that test images keep growing in size, so it might become too short at some point for newer DUTs as well.) But I also think that issue 732001 may just be the fix: still-ongoing provisioning could be aborted once all tests in the suite finish. With regard to internal retries, these are usually more efficient than external retries (say, because we don't have to transport the image to the DUT again). Getting rid of such optimizations would be sad, so we need to understand the long-running cases better first.
,
Jun 16 2017
jrbarnette@: Is the priority correct? (Pri-1 implies that you are expecting to close this bug roughly within a month.) I'll de-dup the older request into this bug, because that one no longer has much relevant information.
,
Jun 16 2017
Issue 271838 has been merged into this issue.
,
Jun 16 2017
> Issue 271838 has been merged into this issue.
Wait. There are two different fixes possible for this bug:
* Impose a timeout on special tasks (the request in bug 271838)
* Change the retry strategy for provisioning (different from bug 271838)
Unless/until we're committed to the first fix, this bug isn't a duplicate. If we are committing to the first fix, then I'd rather close this bug as a duplicate of the other, because the other bug has more background on the change required.
,
Jun 16 2017
Looks like this bug's ask is the first option (impose a timeout on special tasks). Duplicate either way. Your choice. Just keep one.
,
Jun 16 2017
I'm in favor of the timeout. It's simpler, and we already have data to support that it is safe.
,
Oct 11 2017
Status of this?
,
Oct 12 2017
I'm not working on this. Assuming our goal is a simple, hard timeout, we should address this by addressing bug 271838.
,
Oct 12 2017
,
Oct 13 2017
I did some digging, and I'm no longer convinced of the wisdom of the proposed change.
I started by running a query across all the shards that fetched summary data about every provision task I could get my hands on. Since the putative rationale for this change is "make the CQ better", I winnowed the data down to paladin builds only. The result was records on over a quarter million provision tasks, going back as far as 6/11 this year. The file is attached. My conclusions are below.
First note: in the sample, 3.6% of the provision tasks failed. That's consistent with other observed failure rates.
Digging in further:
> On ToT, the 99.9th percentile of successful provision jobs is around 30min.
In the sample I looked at, 99.2% of the successful tasks completed in less than 30 minutes; that means our likelihood of failing a good provision with a 30-minute timeout is low. Unfortunately, the likelihood of rejecting a good provision task isn't the metric that matters. The metric that matters is "if we get to 30 minutes, what's the chance that the build will fail?" Looking at all the tasks that exceeded the proposed 30-minute limit tells the story differently:
1905 failed
1912 passed
IOW, after 30 minutes there's still a 50-50 chance that the task will succeed. Put another way: for every long-running failure that we make fail faster, we add a false failure for a task that would have succeeded.
Note also that, although individually 99% of provision tasks are under 30 minutes, there are several hundred of them in every CQ run. Grouping the data by build, it turns out that out of 1245 builds recorded in the sample, 177 (14%) of them had at least one successful provision task that ran over 30 minutes. The CQ is somewhat robust to such failures, but not perfectly so. Thus, it's likely that a 30-minute timeout would increase the CQ failure rate; at minimum, false failures caused by the timeout could add to paladin run times.
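For clarity, a sketch of the grouping described above. The record fields ('build_id', 'status', 'duration_secs') are assumptions for illustration; the attached file's actual schema may differ.

THRESHOLD_SECS = 30 * 60

def analyze(tasks):
    # Tasks that would have hit a 30-minute timeout, split by outcome.
    slow = [t for t in tasks if t['duration_secs'] >= THRESHOLD_SECS]
    slow_failed = sum(1 for t in slow if t['status'] == 'fail')   # 1905 in the sample
    slow_passed = sum(1 for t in slow if t['status'] == 'pass')   # 1912 in the sample

    # Fraction of builds with at least one successful provision over 30 minutes.
    all_builds = set(t['build_id'] for t in tasks)
    builds_with_slow_pass = set(t['build_id'] for t in slow if t['status'] == 'pass')
    affected_fraction = (len(builds_with_slow_pass) / float(len(all_builds))
                         if all_builds else 0.0)   # ~14% in the sample
    return slow_failed, slow_passed, affected_fraction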
,
Jan 26 2018
Based on the difficulties cited in fixing this, and on the recent improvements in provisioning performance, this isn't worth doing.