Proposal: Improve DUT utilization
Issue description

While investigating a recent failure I ran into two related issues:

1. We seem to be spending a disproportionate amount of time provisioning DUTs to run HWTest suites. Clicking on 'Suite details' for any bvt-cq or bvt-arc run will generally show a few tests running on half a dozen or more boards, with more time spent provisioning than running tests. An arbitrary recent example, found by searching for 'hwtest [bvt-cq]' on the paladin waterfall: https://viceroy.corp.google.com/chromeos/suite_details?job_id=124126758 Even though it might add a little to our total test run time, provisioning fewer boards and running more tests on each seems like it could significantly cut down on the amount of provisioning we do, which could be a large overall win.

2. When we have a shortage of DUTs, the run only fails when the suite times out, with a warning message like "No output from <_BackgroundTask(_BackgroundTask-5:7:2, started)> for 8640 seconds". We could do a better job of identifying DUT availability before running HWTests and fail early if insufficient DUTs are available (a rough sketch of such a pre-flight check follows below). The problem is complicated because we may sometimes have DUTs becoming available "soon", and we want to avoid starving particular builders, but we should at least be able to fail more gracefully than we currently do. We also appear to run bvt-cq even if bvt-inline fails because of a lack of DUT availability, e.g.: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3217
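As a rough illustration of the fail-early idea in point 2, here is a minimal Python sketch. count_available_duts is a stand-in for whatever lab query the real code would use, and none of the names below are existing chromite or autotest APIs; this is purely illustrative.

MIN_DUTS = 4  # hypothetical default for the per-suite minimum


class InsufficientDutsError(Exception):
    """Raised when a suite cannot possibly acquire enough DUTs."""


def check_dut_availability(count_available_duts, board, pool,
                           min_duts=MIN_DUTS):
    """Fail fast instead of waiting for the suite timeout to fire."""
    available = count_available_duts(board, pool)  # hypothetical lab query
    if available < min_duts:
        raise InsufficientDutsError(
            'Only %d working DUTs in pool:%s for board %s; need at least '
            '%d. Failing before scheduling HWTest.'
            % (available, pool, board, min_duts))
    return available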
Comment 1 by steve...@chromium.org, Jun 19 2017
1. My understanding here is:
- Reduce the number of DUTs per suite (say, in the CQ).
- The amount of time spent provisioning stays pretty much the same (provisions happen in parallel).
- The amount of time spent in the suite increases (more tests per DUT).
- But this reduces the amount of provisioning done in the lab.
- This reduced load may speed things up.
We don't really have data to support the last statement above (a rough model of the trade-off is sketched below).

2. This is not the common-case failure mode. That looks like a builder / swarming proxy timeout. Please file a separate bug for just that with a link to the failed build. We do check for the expected number of DUTs and fail the HWTest if there aren't enough.
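A back-of-the-envelope model of that trade-off; the timings below are made-up placeholders, not measured values, and suite_estimate is purely illustrative.

def suite_estimate(num_tests, num_duts, provision_min=10, test_min=5):
    """Return (wall-clock minutes, provisioning operations) for one suite run."""
    tests_per_dut = -(-num_tests // num_duts)  # ceiling division
    wall_clock = provision_min + tests_per_dut * test_min
    return wall_clock, num_duts  # one provision per DUT used

# 24 tests on 6 DUTs vs. 3 DUTs:
print(suite_estimate(24, 6))  # (30, 6): faster wall clock, but 6 provisions
print(suite_estimate(24, 3))  # (50, 3): slower, but half the lab provisioning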
Jun 19 2017
1. We do have data that tells us that less provisioning = fewer provisioning failures (specifically, when devserver load gets high, provisioning failures increase). So we would not necessarily be optimizing for speed here, but for accuracy, which is generally more important.

2. I acknowledge that this is an edge case. I'll try to make time to write it up separately, but it's less important; just something I noticed at the same time.
Jun 20 2017
> 1. We do have data that tells us that less provisioning
> = fewer provisioning failures (specifically, when devserver
> load gets high, provisioning failures increase). So we would
> not necessarily be optimizing for speed here, but for accuracy,
> which is generally more important.
Hmmm... The principal source of load is canary builds, meaning
pool:bvt. My first thought was that because of the large number of
boards in a canary run, trying to reduce load by reducing DUTs
wouldn't help. But pool:bvt typically has only 6 DUTs. So, reducing
by 1 would mean 17% less bandwidth, at the cost of 20% more time
testing (not counting the provisioning time; a quick check of this
arithmetic is at the end of this comment). There are multiple caveats:
* The BVT pools are already deliberately lean. Ideally, the pool
utilization is already high, meaning we might not be able to get
through normal test load if we reduced pool size.
* The BVT pools are shared with the PFQ. Those pools must have more
than 6 DUTs because of the higher load.
* Existing code requires a minimum of 4 DUTs on any BVT run. That
requirement translates into a minimum supply of 6 in the pool:
For technical reasons, to prevent failures we need 2 DUTs more
than the declared minimum. So, if we reduce the pool size, we
must also change the code.
So, an experiment to see "what happens if we make the BVT pools leaner"
is a step or two above non-trivial.
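For reference, a quick check of the 17% / 20% figures for the 6 -> 5 DUT case (ignoring provisioning time, as stated above):

old_duts, new_duts = 6.0, 5.0
bandwidth_reduction = (old_duts - new_duts) / old_duts  # 1/6 ~= 0.17
time_increase = old_duts / new_duts - 1                 # 1/5 = 0.20
print('%.0f%% less provisioning bandwidth' % (100 * bandwidth_reduction))
print('%.0f%% more time testing' % (100 * time_increase))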
Jun 20 2017
> 2. I acknowledge that this is an edge case. I'll try to
> make time to write it up separately but it's less important,
> just something I noticed at the same time.
The described error happened because it was a tryjob, and
because the suites pool was devoid of working DUTs. The
same thing would also happen if all DUTs were tied up in
other testing. That symptom would not happen to a regular
paladin or canary run. For those builders, if the default
4 DUT minimum can't be satisfied, the suite will fail up
front. Here's an example:
https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/1996/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio
The trybot behavior is unfortunate, and somewhat obnoxious,
but it's unlikely to become important enough to fix, because
the best we could do would be to improve the error message.
Jun 26 2017
The ask is still a bit unclear; it needs to be fleshed out.
Mar 31 2018
Bulk closing old unconfirmed issues.