Issue 734741

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature

Blocking:
issue 695529




Proposal: Improve DUT utilization

Project Member Reported by steve...@chromium.org, Jun 19 2017

Issue description

While investigating a recent failure I ran into two related issues:

1. We seem to be spending a disproportionate amount of time provisioning DUTs to run HWTest suites. Clicking on 'Suite details' for any bvt-cq or bvt-arc run will generally show a few tests running on half a dozen or more boards, with more time spent provisioning than running tests. An arbitrary recent example, found by searching for 'hwtest [bvt-cq]' on the paladin waterfall:
https://viceroy.corp.google.com/chromeos/suite_details?job_id=124126758

Even though it might add a little to our total test run time, provisioning fewer boards and running more tests on each one could significantly cut down on the amount of provisioning we do, which seems like a potentially large overall win.

2. When we have a shortage of DUTs, the run fails only when the suite times out, with a warning message like "No output from <_BackgroundTask(_BackgroundTask-5:7:2, started)> for 8640 seconds".

We could do a better job of identifying DUT availability before running HWTests and fail early if insufficient DUTs are available. The problem is complicated because sometimes we may have DUTs becoming available "soon", and we want to avoid starving particular builders, but we should be able to at least fail more gracefully than we currently do.
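A rough sketch of the kind of up-front check described above, treating DUTs that are expected to free up "soon" as a reason to wait rather than fail. The function name, the `Decision` values, and the three inputs are all hypothetical; nothing here reflects the actual scheduler code.

```python
from enum import Enum


class Decision(Enum):
    RUN = "run"                # enough DUTs are ready now
    WAIT = "wait"              # enough DUTs should be ready soon
    FAIL_EARLY = "fail_early"  # not enough DUTs even counting "soon" ones


def preflight_decision(ready_duts, soon_duts, min_duts):
    """Decide what to do before scheduling an HWTest suite.

    ready_duts: working DUTs that are idle right now (hypothetical input).
    soon_duts:  DUTs expected to become available shortly (hypothetical input).
    min_duts:   minimum number of DUTs the suite needs.
    """
    if ready_duts >= min_duts:
        return Decision.RUN
    if ready_duts + soon_duts >= min_duts:
        return Decision.WAIT
    return Decision.FAIL_EARLY


print(preflight_decision(ready_duts=1, soon_duts=1, min_duts=4))
```

A real check would also need a policy for how long to wait, so that one builder is not starved while others hold the DUTs.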

We also appear to run bvt-cq even if bvt-inline fails because of a lack of DUT availability, e.g.:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3217

 
Blocking: 695529
Not really "blocking" issue 695529, but potentially related.

Owner: steve...@chromium.org
Status: ExternalDependency (was: Untriaged)
1.
My understanding here is:
- Reduce the number of DUTs per suite (say in the CQ)
- Amount of time spent provisioning is pretty much the same (they happen in parallel)
- Amount of time spent in the suite increases (more tests per DUT)
- But this reduces the amount of provisioning done in the lab.
- This reduced load may speed things up.

We don't really have data to support the last statement above.
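To make the trade-off in the list above concrete, here is a toy model. It assumes every DUT in a suite is provisioned once, all provisions run in parallel, and tests are spread evenly across DUTs. The test count and durations are made-up illustrative numbers, not measurements from the lab.

```python
import math


def suite_cost(num_tests, duts, provision_min, test_min):
    """Return (provision_jobs, wall_clock_minutes) for one suite.

    Assumes one provision per DUT, provisions running in parallel,
    and tests distributed evenly across the DUTs.
    """
    provision_jobs = duts
    tests_per_dut = math.ceil(num_tests / duts)
    wall_clock = provision_min + tests_per_dut * test_min
    return provision_jobs, wall_clock


# Illustrative inputs only: 12 tests, 10 min to provision, 2 min per test.
for duts in (6, 4, 3):
    jobs, minutes = suite_cost(num_tests=12, duts=duts,
                               provision_min=10, test_min=2)
    print(f"{duts} DUTs: {jobs} provision jobs, {minutes} min wall clock")
```

In this model the provisioning time per suite stays constant while the number of provision jobs drops with the DUT count. Whether the reduced lab load then speeds anything up is exactly the part the model does not capture.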

2. This is not the common-case failure mode. It looks like a builder / swarming proxy timeout. Please file a separate bug for just that, with a link to the failed build. We do check for the expected number of DUTs and fail the HWTest if there aren't enough.
Cc: pprabhu@chromium.org
Owner: ----
Status: Untriaged (was: ExternalDependency)
Summary: Proposal: Improve DUT utilization (was: Proposal: Improve DUT utilization and availability checking)
1. We do have data that tells us that less provisioning = fewer provisioning failures (specifically, when devserver load gets high, provisioning failures increase). So we would not necessarily be optimizing for speed here, but for accuracy, which is generally more important.

2. I acknowledge that this is an edge case. I'll try to make time to write it up separately but it's less important, just something I noticed at the same time.

> 1. We do have data that tells us that less provisioning
> = fewer provisioning failures (specifically, when devserver
> load gets high, provisioning failures increase). So we would
> not necessarily be optimizing for speed here, but for accuracy,
> which is generally more important.

Hmmm...  The principal source of load is canary builds, meaning
pool:bvt.  My first thought was that because of the large number of
boards in a canary run, trying to reduce load by reducing DUTs
wouldn't help.  But, pool:bvt typically has only 6 DUTs.  So, reducing
by 1 would mean 17% less bandwidth, at the cost of 20% more time testing
(not counting the provisioning time).  There are multiple caveats:
  * The BVT pools are already deliberately lean.  Ideally, the pool
    utilization is already high, meaning we might not be able to get
    through normal test load if we reduced pool size.
  * The BVT pool is shared with the PFQ.  Those pools must have more
    than 6 DUTs because of the higher load.
  * Existing code requires a minimum of 4 DUTs on any BVT run.  That
    requirement translates into a minimum supply of 6 in the pool:
    For technical reasons, to prevent failures we need 2 DUTs more
    than the declared minimum.  So, if we reduce the pool size, we
    must also change the code.

So, an experiment to see "what happens if we make the BVT pools leaner"
is a step or two above non-trivial.
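The percentages and the pool-size constraint above work out as follows. This is just the arithmetic from the comment; the function names are made up for illustration.

```python
def shrink_pool(duts, removed=1):
    """Return (bandwidth_saving, extra_test_time) as fractions.

    Removing DUTs cuts provisioning bandwidth in proportion, and the
    same tests are spread over the DUTs that remain.
    """
    remaining = duts - removed
    bandwidth_saving = removed / duts
    extra_test_time = duts / remaining - 1
    return bandwidth_saving, extra_test_time


def required_pool_size(declared_min, headroom=2):
    """Pool supply needed so a run with declared_min DUTs does not fail."""
    return declared_min + headroom


saving, extra = shrink_pool(6)
print(f"{saving:.0%} less bandwidth, {extra:.0%} more test time")
print(f"pool size needed for a 4 DUT minimum: {required_pool_size(4)}")
```

Going from 6 DUTs to 5 gives 1/6 (about 17%) less bandwidth and 6/5 - 1 = 20% more testing time. A pool of 5 would also fall below the 4 + 2 = 6 supply that the current minimum implies.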

> 2. I acknowledge that this is an edge case. I'll try to
> make time to write it up separately but it's less important,
> just something I noticed at the same time.

The described error happened because it was a tryjob, and
because the suites pool was devoid of working DUTs.  The
same thing would also happen if all DUTs were tied up in
other testing.  That symptom would not happen to a regular
paladin or canary run.  For those builders, if the default
4 DUT minimum can't be satisfied, the suite will fail up
front.  Here's an example:
    https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/1996/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio

The trybot behavior is unfortunate, and somewhat obnoxious,
but it's unlikely to become important enough to fix, because
the best we could do would be to improve the error message.

Labels: -Pri-2 Pri-3
Labels: -Type-Bug Type-Feature
Owner: pprabhu@chromium.org
Status: Unconfirmed (was: Untriaged)
The ask is still a bit unclear; it needs to be fleshed out.
Status: Archived (was: Unconfirmed)
Bulk closing old unconfirmed issues.
