
Issue 850276

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Async HWTest from one builder intermingles with Blocking HWTest from another causing blocking HWTest stage to fail

Project Member Reported by pprabhu@chromium.org, Jun 6 2018

Issue description

The canary build https://uberchromegw.corp.google.com/i/chromeos/builders/lars-release/builds/2213 had its HWTest [bvt-arc] stage time out because of the ASyncHWTest [bvt-perbuild] stage of https://uberchromegw.corp.google.com/i/chromeos_release/builders/lars-release%20release-R67-10575.54.0.B/builds/49.

The problem is shown by this snippet of dut-status output:

$ dut-status -f chromeos4-row11-rack7-host22
    2018-06-05 22:36:17  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/967014-provision/
    2018-06-05 22:30:32  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205859765-chromeos-test/
    2018-06-05 22:24:56  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/966963-provision/
    2018-06-05 22:23:28  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205883265-chromeos-test/
    2018-06-05 22:17:37  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/966928-provision/
    2018-06-05 22:15:50  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205859695-chromeos-test/


The suite timeline for the failed canary suite is also instructive. Notice the multiple provisions at the start of the suite. Some of those don't belong to the canary suite: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2638292

We provisioned the DUT to R69-10756.0.0, ran 1 test, provisioned it to R67-10575.54.0, ran 1 test, then provisioned it to R69-10756.0.0 and ran 1 test.
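The cost of that back-and-forth can be made concrete with a small sketch (illustrative only, not Autotest code): a provision is needed whenever the next test wants a different build than the one currently installed on the DUT, so interleaving the two builds' tests forces an extra provision compared to grouping them.

```python
def count_provisions(test_builds, installed=None):
    """Count provisions needed to run tests in the given order.

    test_builds: list of build versions, one per test, in execution order.
    installed: build currently on the DUT (None = unknown/needs provision).
    """
    provisions = 0
    for build in test_builds:
        if build != installed:
            provisions += 1
            installed = build
    return provisions

# The order observed on chromeos4-row11-rack7-host22:
interleaved = ["R69-10756.0.0", "R67-10575.54.0", "R69-10756.0.0"]
# The same three tests, grouped by build:
grouped = ["R69-10756.0.0", "R69-10756.0.0", "R67-10575.54.0"]

print(count_provisions(interleaved))  # 3 provisions
print(count_provisions(grouped))      # 2 provisions
```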

The reason this happens is that not all HWTest requests from a builder are made together. There are multiple HWTest stages, and another builder may request suites at the same priority in between. In this case, the stable builder's perbuild suite interjected between the canary's sanity suite and its other suites.

There are two things to consider here.

(1) short term: perbuild is an async suite. The builder wasn't even waiting for the suite to finish. It delayed suites from the canary that the builder _was_ waiting for, causing the canary build to fail.
  - We should kick off async suites at a lower priority than sync suites.
  - Hand in hand with that, we should allow more time for async suites to finish.

This does risk starving the async suites completely. But if we're so overloaded that the async suites are starved by sync suites, we're already dead -- we can't possibly meet the load requirement for sync suites if we start servicing async suites -- we're going to fail builds.
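Proposal (1) amounts to keying the dispatch queue on priority before arrival order. A minimal sketch of that policy (not the real Autotest scheduler; the priority values are made up for illustration):

```python
import heapq

# Assumed numeric priorities, for illustration only: lower value = served first.
SYNC_PRIORITY = 0
ASYNC_PRIORITY = 10

def schedule(requests):
    """Yield suite names in dispatch order.

    requests: iterable of (arrival_order, name, is_async). Suites are drawn
    from a min-heap keyed on (priority, arrival), so async suites only run
    once no sync suite is waiting.
    """
    heap = []
    for arrival, name, is_async in requests:
        prio = ASYNC_PRIORITY if is_async else SYNC_PRIORITY
        heapq.heappush(heap, (prio, arrival, name))
    while heap:
        _, _, name = heapq.heappop(heap)
        yield name

order = list(schedule([
    (0, "canary bvt-arc", False),
    (1, "release bvt-perbuild", True),
    (2, "canary cts", False),
]))
print(order)  # sync suites first, even though bvt-perbuild arrived earlier
```

With this policy the async perbuild suite can no longer interject between two blocking suites that arrive at different times, which is exactly the failure mode above.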


(2) There is a deeper question around smarter scheduling here. If we had not mixed the suites so, both would have finished faster. A future, smarter scheduler should be able to reorder requests to minimize provisions. They _are_ costly.

 
Oh wait, looks like the async HWTests are already at a lower priority ("PostBuild") than the blocking HWTests ("CQ", "PFQ", or "Build").

This is a harder problem to resolve, and something that's definitely a feature for a future iteration of smart scheduling.
The sequence of events is as follows (numbers are imaginary but indicative):

Let's say the pool has 8 DUTs.

[1] canary creates a blocking sanity suite with a single test at priority Build
[2] Autotest allocates a DUT to the test from [1]; provisioning and execution begin.
[3] release creates an async bvt-perbuild suite, which contains 50 tests.
[4] Autotest allocates 7 DUTs to the first 7 tests from [3]; they are provisioned to the release build.
[5] Test from [2] finishes.
[6] Autotest allocates the last DUT to the 8th test from [3], it is provisioned to release build.
[7] canary creates a blocking cts suite with 10 tests at priority Build.
[8] Autotest waits for tests from [4] and [6] to finish, and allocates those DUTs to tests from [7], since those are higher priority than the ones from [3].
[9] Each of these DUTs is provisioned to the canary build.

The result is that the first DUT is provisioned thrice, in [2], [6] and [9].
It would have been worse if [1] had had more tests (say 20): then all the DUTs would have been provisioned thrice.
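The sequence above can be replayed with a toy model (illustrative, not scheduler code) that tracks the image on each DUT and counts provisions as tests are assigned:

```python
from collections import defaultdict

installed = {}                  # dut -> currently provisioned build
provisions = defaultdict(int)   # dut -> number of provisions so far

def run_test(dut, build):
    """Provision the DUT if needed, then run a test on it."""
    if installed.get(dut) != build:
        provisions[dut] += 1
        installed[dut] = build

run_test(0, "canary")            # [2] sanity test lands on DUT 0
for dut in range(1, 8):
    run_test(dut, "release")     # [4] first 7 perbuild tests take the rest
run_test(0, "release")           # [6] 8th perbuild test reuses DUT 0
for dut in range(8):
    run_test(dut, "canary")      # [8]/[9] cts suite reclaims every DUT

print(dict(provisions))  # DUT 0: 3 provisions; DUTs 1-7: 2 each
```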

It is not possible to avoid this without advance allocation of DUTs by the canary builder (or by something smart that anticipates incoming load), which would have prevented the request in [3] from running and provisioning the release build onto the DUTs.

The other way to "fix" this problem would be true preemption -- the high-priority canary request in [7] simply stops all the low-priority release tests and takes over the DUTs. This would still waste provisions, but would reduce latency for the high-priority blocking tests at the cost of async test latency.
I'd recommend that someone go and identify how often this has happened recently; if it's more than the one build I noticed, we should increase the timeouts to mitigate.
Owner: cra...@chromium.org
Status: Assigned (was: Untriaged)
->craigb is there a tracking label for torch / smart-scheduler related ideas?
