
Issue 729255

Starred by 4 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature

Blocked on:
issue 271838




limit provision time to 30 minutes

Project Member Reported by davidri...@chromium.org, Jun 3 2017

Issue description

Looking at provision rates across ~100 CQ runs (~1 week) shows about 7% overall provision failure rate:
      Provision: {
        fail: {
          average: 0:22:28.392885,
          count: 1265,
          max: 0:52:37,
          min: 0:00:12,
          total: 19 days, 17:48:37
        },
        pass: {
          average: 0:10:21.496341,
          count: 18038,
          max: 0:28:44,
          min: 0:03:05,
          total: 129 days, 18:02:31
        }
      }, 

Data mined from https://davidriley.users.x20web.corp.google.com/builds100-20170602.json under the following path ['all_hqes_timings']['by_test_status']['Provision'].
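As a rough check of that ~7% figure, here's a minimal sketch of the rate computation against the mined JSON. Only the ['all_hqes_timings']['by_test_status']['Provision'] path and the fail/pass counts come from the snippet above; the file name and everything else is illustrative.

    import json

    def provision_failure_rate(path="builds100-20170602.json"):
        # Layout assumed from the snippet above: counts live under
        # ['all_hqes_timings']['by_test_status']['Provision'][status]['count'].
        with open(path) as f:
            data = json.load(f)
        provision = data["all_hqes_timings"]["by_test_status"]["Provision"]
        fails = provision["fail"]["count"]   # 1265 in this ~100-run sample
        passes = provision["pass"]["count"]  # 18038 in this ~100-run sample
        return fails / float(fails + passes)

    # 1265 / (1265 + 18038) ~= 6.6%, i.e. roughly the ~7% cited above
    # (the exact denominator used for the 7% figure isn't stated here).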

I sliced things by looking at builds where there was more than one provision failure.  The data shows that the 1265 failures occurred over 11 CQ runs across 54 distinct slave builds; in those cases the master CQ run always failed, and took about an extra hour to do so versus runs without a provision failure.
total failures 1265
above threshold
occurrences: 54
summary master (avg 3:01:07.893784 of 11 builds) status: Counter({u'fail': 11})
summary slaves (avg 2:48:17.454514 of 54 builds) status: Counter({u'fail': 46, u'inflight': 6, u'pass': 2})
below threshold
occurrences: 1808
summary master (avg 1:56:01.518971 of 112 builds) status: Counter({u'fail': 61, u'pass': 51})
summary slaves (avg 1:17:12.898115 of 1632 builds) status: Counter({u'pass': 1542, u'fail': 58, u'inflight': 32})

These numbers seem really high -- 1265 failures over 54 builds is an average of 23 failures in each of those cases.  Here's a sample:
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
 slave: 1561566 cyan-paladin:2741 -> fail (self destructed False) duration 3:05:58.144340
failures 20
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
 slave: 1561576 kevin-paladin:1281 -> fail (self destructed False) duration 2:52:46.076413
failures 28
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
 slave: 1561584 veyron_minnie-paladin:2743 -> fail (self destructed False) duration 3:20:28.321197 (slowest_slave,slowest_undestructed_slave)
failures 18
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
 slave: 1561592 veyron_mighty-paladin:5475 -> fail (self destructed False) duration 2:28:11.683460
failures 27
master: 1561562 master-paladin:14893 -> fail (self destructed False) duration 3:35:15.405130
 slave: 1561598 elm-paladin:2930 -> fail (self destructed False) duration 2:39:26.661851
failures 30

Here's a handy viceroy chart of one of them:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1281
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1561576

Here's a viceroy chart of one where the build actually succeeded with only three provision failures:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/5474
master: 1561199 master-paladin:14892 -> fail (self destructed False) duration 2:02:05.533723
 slave: 1561365 veyron_mighty-paladin:5474 -> pass (self destructed False) duration 1:43:33.723783 (slowest_slave,slowest_undestructed_slave)
failures 3
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1561365

It's not clear which of these failures are due to product issues versus infrastructure, and that is probably the first thing to investigate.

Immediate takeaways:
- I think we should abort the build faster if there are a bunch of provision failures
- Ensure that we fail provisions quicker rather than letting them run 20-50 minutes
 
Attached is the full set of builds where there was a provision failure (from 126 CQ runs).
Attachment: provision-failures-100.txt (11.5 KB)
To start, how about we set the maximum provision time to 30 minutes? That's greater than the max for passes, and would reduce the average time for failures, so it seems like an easy win.

The percentage of failures seems like the main thing to look at though.
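
For illustration, a minimal sketch of what a hard 30-minute cap could look like if provisioning were driven as an external command. This is not how the scheduler actually hooks in (later comments note there is currently no provision timeout at all), and the command passed in is hypothetical.

    import subprocess

    PROVISION_TIMEOUT_SEC = 30 * 60  # the proposed 30-minute cap

    def run_provision(cmd):
        # cmd is whatever command kicks off provisioning for one DUT
        # (hypothetical here); fail it outright once the cap is hit.
        try:
            subprocess.run(cmd, check=True, timeout=PROVISION_TIMEOUT_SEC)
            return "pass"
        except subprocess.TimeoutExpired:
            return "fail: provision exceeded 30 minutes"
        except subprocess.CalledProcessError:
            return "fail"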

One note: part of the reason there are so many failures is that the same CL went through the CQ several times, causing all provisions to fail.
Re c#2: You pointed out afterwards that there was a particularly bad CL which was causing some boards to fail to boot.  I'm not sure if it was really obvious, but it'd be useful if we could surface the fact that the build failed to provision on any DUT.
Setting a 30min provision limit sounds sensible to me. I'm not sure whether any other action is being requested on this bug.

http://shortn/_O1TSBeEqhv also bolsters this case. On ToT, the 99.9th percentile of successful provision jobs is around 30min.
Looking over ~400 CQ runs (data included):
      Provision: {
        fail: {
          average: 0:18:37.763157,
          count: 2090,
          max: 1:07:00,
          min: 0:00:11,
          total: 27 days, 0:55:25
        },
        pass: {
          average: 0:10:38.022425,
          count: 62384,
          max: 0:41:08,
          min: 0:02:53,
          total: 460 days, 16:13:11
        }
      },

There were 93 successful provisions over 30 minutes, and 3 over 40 minutes.
Provision success duration >= 30 minutes
occurrences: 93
summary master (avg 2:47:55.205322 of 32 builds) status: Counter({u'fail': 21, u'pass': 11})
summary slaves (avg 2:12:03.536017 of 40 builds) status: Counter({u'pass': 30, u'fail': 7, u'inflight': 3})

Provision success duration >= 40 minutes
occurrences: 3
summary master (avg 3:43:28.410645 of 2 builds) status: Counter({u'fail': 2})
summary slaves (avg 2:42:51.873022 of 3 builds) status: Counter({u'fail': 2, u'pass': 1})
Attachment: provision-failures-400.txt (61.3 KB)
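
The duration strings in these dumps are H:MM:SS (with an optional "N days," prefix on totals), so reproducing the over-30/over-40 counts from the attached data is mostly a parsing exercise. A small sketch, assuming the attachment yields one duration string per successful provision; the helper names are illustrative:

    from datetime import timedelta

    def parse_duration(s):
        # Handles '0:41:08', '0:22:28.392885', and '460 days, 16:13:11'.
        days = 0
        if "," in s:
            day_part, s = s.split(",")
            days = int(day_part.split()[0])
        h, m, sec = s.strip().split(":")
        return timedelta(days=days, hours=int(h), minutes=int(m), seconds=float(sec))

    def count_over(durations, minutes):
        limit = timedelta(minutes=minutes)
        return sum(1 for d in durations if d >= limit)

    # count_over(successful_durations, 30) -> 93 and
    # count_over(successful_durations, 40) -> 3 for this ~400-run sample.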
I think we should aim to understand why 7% of provisions are failing, and clearly distinguish between bad product code and infra-related failures.  The 400-CQ-run sample (~1 month) shows a 3.35% failure rate, which still seems too high.  I believe that we have the hardware capacity to handle all of our provision requests, so the failures come down to software deficiencies in our provisioning, insufficient network capacity, and/or poor load balancing, all of which we should be able to fix.  The provision failure rate should be entirely due to bad product code (or bad DUTs that should be removed), not infra flake.

I'd really rather drop that timeout to 20 minutes if we can, but I don't think we're there yet.

I spent a bunch of time trying to find out where a provision timeout exists, and I can't find any such thing.  Maybe there's something in the HQEs but I couldn't find code to fill in anything for provisions there either.  Anybody know offhand?
Cc: keta...@chromium.org jrbarnette@chromium.org pprabhu@chromium.org ihf@chromium.org snanda@chromium.org
In addition to limiting provision time to 30 minutes (jrbarnette@ - do you know how to implement this timeout?), I think we could also implement a policy of:
- do not provision a single DUT more than twice in a single suite

If provisions are not counted against test retries, we might also want to implement a max re-provision count across the entire suite to avoid cases like all the slave builds of:
https://viceroy.corp.google.com/chromeos/build_details?build_id=1562746
Example:
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1562882

The proposal I just listed is somewhat different because it attempts to mitigate the effects of a bad DUT.

davidjames@: Can you add this to your spreadsheet?

Comment 9 by aut...@google.com, Jun 12 2017

Labels: OKR
Owner: akes...@chromium.org
Related to Issue 730067 ? 
They both relate to provisioning but are mostly orthogonal issues.
Labels: -OKR
Owner: jrbarnette@chromium.org
Status: Available (was: Untriaged)
Summary: limit provision time to 30 minutes (was: CQ provision rate has too many failures)
> do not provision a single DUT more than twice in a single suite
See  Issue 732001 .


The bug has touched on a lot of topics. I think the most concrete action here is to reduce the provision timeout to 30m (which sgtm).

Therefore, hijacking this bug for that purpose, and assigning to deputy as P1.


> The bug has touched on a lot of topics. I think the most
> concrete action here is to reduce the provision timeout to 30m
> (which sgtm).

TTBOMK, there's no provision timeout.  More broadly, there's no
timeout of any sort for special tasks; that includes provision.
There's a longstanding request for the feature, see bug 271838.
The last few comments on the bug suggest that it's not too much
work.

I'll note that as a practical matter, I think the only way that
provision tasks can exceed 30 minutes is if they retry internally.
I've seen provision logs showing up to 4 different attempts to
provision.  It's very likely we can get shorter provision tasks
just by eliminating most or all of the retry logic.
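
To make the retry interaction concrete, here's a sketch of how internal retries and an overall cap could be combined rather than removing retries outright. The attempt count comes from the logs mentioned above; the function names and callable are illustrative, not the current provision code.

    import time

    OVERALL_BUDGET_SEC = 30 * 60   # proposed hard cap
    MAX_ATTEMPTS = 4               # provision logs have shown up to 4 attempts

    def provision_with_budget(attempt_once):
        # attempt_once() performs a single provision attempt and returns True
        # on success (hypothetical callable, for illustration only).
        deadline = time.time() + OVERALL_BUDGET_SEC
        for attempt in range(MAX_ATTEMPTS):
            if attempt_once():
                return True
            if time.time() >= deadline:
                break  # keep internal retries, but never past the overall cap
        return False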

Comment 13 by ihf@chromium.org, Jun 14 2017

I am fine with a 30 minute provision timeout, if it materializes. (20 minutes might be too short on old Chromebooks, but also remember that test images keep growing in size, so it might also become too short at some point for newer DUTs.) But I also think that  issue 732001  may just be the fix. Still-ongoing provisioning could be aborted once all tests in the suite finish.

With regard to internal retries, these are usually more efficient than external retries (say, because we don't have to transport the image to the DUT again). Getting rid of such optimizations would be sad. So we need to understand the long-running cases better first.
Labels: -Type-Bug Type-Feature
jrbarnette@: Is the priority correct? (Pri-1 implies that you are expecting to close this bug roughly in a month.)

I'll de-dup the older request to this bug because that no longer has much relevant information.
Cc: cywang@chromium.org dshi@chromium.org bccheng@chromium.org
Issue 271838 has been merged into this issue.
>  Issue 271838  has been merged into this issue.

Wait.  There are two different fixes possible for this bug:
  * Impose a timeout on special tasks (the request in bug 271838)
  * Change the retry strategy for provisioning (different from
    bug 271838).

Unless/until we're committed to the first fix, this bug isn't
a duplicate.  If we are committing to the first fix, then I'd
rather close this bug as a duplicate of the other, because the
other bug has more background on the change required.

Looks like this bug's ask is the first of those (a timeout on special tasks).
Duplicate either way. Your choice. Just keep one.
I'm in favor of the timeout. It's simpler, and we already have data to support that it is safe.
Status: Assigned (was: Available)
Status of this?
Labels: -Pri-1 Pri-2
I'm not working on this.  Assuming our goal is a simple, hard
timeout, we should address this by addressing bug 271838.

Blockedon: 271838
Labels: -Pri-2 Pri-3
I did some digging, and I'm no longer convinced of the wisdom
of the proposed change.

I started by running a query across all the shards that fetched
summary data about every provision task I could get my hands on.
Since the putative rationale for this change is "make the CQ
better", I winnowed the data down to paladin builds only.  The
result was records on over a quarter million provision tasks,
going back as far as 6/11 this year.  The file is attached.

My conclusions are below.

First note:  In the sample, 3.6% of the provision tasks failed.
That's consistent with other observed failure rates.

Digging in further:
> On ToT, the 99.9th percentile of successful provision jobs is around 30min.

In the sample I looked at, 99.2% of the successful tasks completed
in less than 30 minutes; that means our likelihood of failing a
good provisioning with a 30 minute timeout is low.

Unfortunately, the likelihood of rejecting a good provisioning task
isn't the metric that matters.  The metric that matters is "if we get
to 30 minutes, what's the chance that the build will fail?"  Looking
at all the tasks that exceeded the 30 minute timeout tells the
story differently:
   1905 failed
   1912 passed
IOW, after 30 minutes, there's still a 50-50 chance that the task
will succeed.  Put another way:  For every long-running failure
that we make fail faster, we add a false failure for a task that
would have succeeded.

Note also, that although individually 99% of provision tasks are
under 30 minutes, there are several hundred of them in every CQ run.
Grouping the data by build, it turns out that out of 1245 builds
recorded in the sample, 177 (14%) of them had at least one successful
provision task that ran over 30 minutes.  The CQ is somewhat robust
to such failures, but not perfectly so.  Thus, it's likely that a
30 minute timeout would increase the CQ failure rate; at minimum,
false failures caused by the timeout could add to paladin run times.
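
For the record, the figures above reduce to a few simple aggregations over the attached per-task records. A sketch, assuming each record can be read as a (build_id, status, duration) tuple; that shape is an assumption for illustration, not the attachment's actual format.

    from collections import Counter
    from datetime import timedelta

    LIMIT = timedelta(minutes=30)

    def summarize(tasks):
        # tasks: iterable of (build_id, status, duration) with status in
        # {'pass', 'fail'} and duration a timedelta (assumed shape).
        passes = [(b, s, d) for b, s, d in tasks if s == "pass"]
        pct_under_30 = 100.0 * sum(d < LIMIT for _, _, d in passes) / len(passes)
        # ~99.2% of successful tasks finished in under 30 minutes here.

        over_30 = Counter(s for _, s, d in tasks if d >= LIMIT)
        # In this sample: {'fail': 1905, 'pass': 1912} -- roughly a coin flip,
        # so a hard cap converts about as many eventual passes into failures
        # as it saves.

        builds_hit = {b for b, s, d in passes if d >= LIMIT}
        # 177 of 1245 builds (~14%) had at least one successful provision over
        # 30 minutes, i.e. builds a 30-minute cap would newly affect.
        return pct_under_30, over_30, len(builds_hit)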

Status: WontFix (was: Assigned)
Based on the difficulties cited in fixing this, and on the
recent improvements in provisioning performance, this isn't
worth doing.
