Async HWTest from one builder intermingles with blocking HWTest from another, causing the blocking HWTest stage to fail
Issue description

The canary build https://uberchromegw.corp.google.com/i/chromeos/builders/lars-release/builds/2213 had its HWTest [bvt-arc] stage time out because of the ASyncHWTest [bvt-perbuild] stage of https://uberchromegw.corp.google.com/i/chromeos_release/builders/lars-release%20release-R67-10575.54.0.B/builds/49.

The problem is shown by this snippet of $ dut-status -f chromeos4-row11-rack7-host22:

2018-06-05 22:36:17  OK  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/967014-provision/
2018-06-05 22:30:32  --  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205859765-chromeos-test/
2018-06-05 22:24:56  OK  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/966963-provision/
2018-06-05 22:23:28  --  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205883265-chromeos-test/
2018-06-05 22:17:37  OK  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row11-rack7-host22/966928-provision/
2018-06-05 22:15:50  --  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/205859695-chromeos-test/

The suite timeline for the failed canary suite is also instructive. Notice the multiple provisions at the start of the suite; some of them don't belong to the canary suite:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2638292

We provisioned the DUT to R69-10756.0.0, ran 1 test, provisioned it to R67-10575.54.0, ran 1 test, then provisioned it back to R69-10756.0.0 and ran 1 test.

The reason this happens is that not all HWTest requests from a builder are made together. There are multiple HWTest stages, and another builder may request suites at the same priority in between. In this case, the stable builder's perbuild suite interjected between the canary's sanity and other suites.

There are two things to consider here.

(1) Short term: perbuild is an async suite. The builder wasn't even waiting for that suite to finish, yet it delayed suites from the canary that the builder _was_ waiting for, causing the canary build to fail.
- We should kick off async suites at a lower priority than sync suites (see the sketch after this description).
- Hand in hand with that, we should allow more time for async suites to finish. This does risk starving the async suites completely. But if we're so overloaded that the async suites are starved because of sync suites, we're already dead -- we can't possibly meet the load requirement for sync suites if we start servicing async suites -- we're going to fail builds.

(2) There is a deeper question around smarter scheduling here. If we had not interleaved the suites like this, both would have finished faster. A future, smarter scheduler should be able to reorder requests to minimize provisions. They _are_ costly.
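One way to realize the first bullet in (1): demote async suite requests and give them a longer deadline. A minimal sketch, assuming a priority scale where larger numbers are more urgent; the constants and the create_suite_job callable below are hypothetical, not chromite's actual API:

# Hypothetical sketch: demote async suites below sync suites and give them
# a longer timeout so they still eventually run.

SYNC_PRIORITY = 50            # assumed scale: larger number == more urgent
ASYNC_PRIORITY_DEMOTION = 10  # assumed demotion applied to async suites

SYNC_TIMEOUT_MINS = 4 * 60    # made-up numbers, purely for illustration
ASYNC_TIMEOUT_MINS = 12 * 60  # async suites get extra time since they now wait longer


def schedule_suite(create_suite_job, suite_name, build, is_async):
    """File a HWTest suite request, demoting async suites below sync ones."""
    if is_async:
        priority = SYNC_PRIORITY - ASYNC_PRIORITY_DEMOTION
        timeout_mins = ASYNC_TIMEOUT_MINS
    else:
        priority = SYNC_PRIORITY
        timeout_mins = SYNC_TIMEOUT_MINS
    # create_suite_job stands in for whatever callable actually files the
    # request with the lab; it is passed in so this sketch is self-contained.
    return create_suite_job(name=suite_name, build=build,
                            priority=priority, timeout_mins=timeout_mins)

The exact demotion amount and the longer async timeout would need tuning against real lab load.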
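On point (2), a provision-aware scheduler could batch queued requests by the image they need, so a DUT is provisioned once per image instead of ping-ponging between builds. A rough sketch with made-up types (SuiteRequest and its fields are not real scheduler structures):

import itertools
from collections import namedtuple

# Hypothetical request record; in reality this would come from the scheduler's queue.
SuiteRequest = namedtuple('SuiteRequest', ['suite', 'image', 'priority'])


def order_to_minimize_provisions(requests):
    """Group pending requests by image so each image is provisioned at most once.

    Groups are ordered by the highest priority they contain (assuming larger
    numbers are more urgent), so urgent sync suites still go first; within a
    group no reprovision is needed between tests.
    """
    by_image = {}
    for req in requests:
        by_image.setdefault(req.image, []).append(req)
    groups = sorted(by_image.values(),
                    key=lambda grp: max(r.priority for r in grp),
                    reverse=True)
    return list(itertools.chain.from_iterable(groups))


# With requests shaped like the ones in this bug, the R67 perbuild suite would
# no longer be interleaved between the two R69 canary suites:
reqs = [
    SuiteRequest('sanity',       'lars-release/R69-10756.0.0',  priority=50),
    SuiteRequest('bvt-perbuild', 'lars-release/R67-10575.54.0', priority=50),
    SuiteRequest('bvt-arc',      'lars-release/R69-10756.0.0',  priority=50),
]
ordered = order_to_minimize_provisions(reqs)
# -> both R69 suites back-to-back, then the R67 suite: two provisions instead of three.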
Comment 1 by pprabhu@chromium.org, Jun 6 2018
I'd recommend that someone go and identify how often this has happened recently; if it's more than the one build I noticed, we should increase the timeouts to mitigate.
Jun 7 2018
->craigb: is there a tracking label for torch / smart-scheduler related ideas?