
Issue 850186

Starred by 2 users

Issue metadata

Status: Assigned
Pri: 3
Type: Bug




Consider scheduling strategies for test retries that avoid inter-suite competition

Project Member Reported by pprabhu@chromium.org, Jun 6 2018

Issue description

Example canary build: https://uberchromegw.corp.google.com/i/chromeos/builders/lulu-release/builds/2248

Symptom: One of the non-paygen HWTest suites times out; in this case, HWTest [bvt-arc].

Root cause: The bvt-arc suite was kicked off ~1 hour before paygen. Both suites had a 3 hour timeout. Some tests failed in the bvt-arc suite. By the time they failed and were requeued, the paygen tests had already been scheduled, so the retries ended up behind the paygen tests.
The paygen tests took > 2 hours to finish (successfully), pushing the retries beyond the allowed 3 hour limit for bvt-arc.
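
To make the timing explicit, here is a rough sketch of the arithmetic. It assumes the round numbers from the description (not measured data) and a simplified model in which the requeued retries cannot start until the paygen tests ahead of them have finished. All variable names are made up for illustration.

```python
# Hours relative to the bvt-arc suite start. Round numbers from the
# description above; this is an illustration, not measured data.
SUITE_TIMEOUT_H = 3.0

bvt_arc_start = 0.0
paygen_start = bvt_arc_start + 1.0   # paygen kicked off ~1 hour later
paygen_min_duration = 2.0            # paygen tests took > 2 hours

bvt_arc_deadline = bvt_arc_start + SUITE_TIMEOUT_H

# Simplifying assumption: retries queued behind paygen cannot start
# before the paygen tests ahead of them have finished.
earliest_retry_start = paygen_start + paygen_min_duration

retries_miss_deadline = earliest_retry_start >= bvt_arc_deadline
print(earliest_retry_start, bvt_arc_deadline, retries_miss_deadline)
```

Even with the lower bound of 2 hours for paygen, the retries cannot start before the bvt-arc deadline in this model.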

This is clearly shown by the two suite timelines:
The early bvt-arc suite: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=205881545
Note that only the tests until ~22:30 are actually from this suite. The other tests do not belong to this suite. This is a bug in how suite_timeline reporting works.

And the interjecting paygen suite: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=205898669
 
Cc: xixuan@chromium.org akes...@chromium.org
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Note that both suites have the same priority: Build (http://shortn/_oGeGpyBoZ1)

craigb@: This is something to consider as you think about smarter scheduling of tests.

In this case, I propose that dynamic_suite should schedule retries at suite priority + 1, so that retries get ahead of any other suites that are at the same priority as the current suite.
Suites waiting behind other suites increases both suite latency and the number of timeouts.
We already limit the total number of retries allowed within a suite, so allowing these retries at a higher priority will ensure that suites act as if the retries were scheduled at the same time as the original tests.
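
A minimal sketch of the idea, assuming a toy queue that runs higher-priority jobs first and is FIFO within a priority. This is not the actual dynamic_suite or scheduler code; `JobQueue`, `run_order`, and the numeric `BUILD_PRIORITY` value are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Toy queue: higher priority runs first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: submission order

    def push(self, name, priority):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), name))

    def drain(self):
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order


BUILD_PRIORITY = 50  # hypothetical numeric value for the "Build" priority


def run_order(retry_bump):
    """Order in which jobs run when the retry is queued after paygen."""
    q = JobQueue()
    q.push("paygen:test1", BUILD_PRIORITY)
    q.push("paygen:test2", BUILD_PRIORITY)
    # The bvt-arc retry arrives last, at suite priority + retry_bump.
    q.push("bvt-arc:retry", BUILD_PRIORITY + retry_bump)
    return q.drain()


print(run_order(0))  # today: retry waits behind paygen
print(run_order(1))  # proposal: retry jumps ahead of same-priority suites
```

With `retry_bump=0` the retry runs last; with `retry_bump=1` it runs first, which is the behavior proposed here.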

+xixuan: 
This has another positive side-effect.
Currently, dynamic_suite tries to schedule (manually curated) LONG tests before the SHORT ones. But this breaks with retries -- a retried LONG test ends up behind all the already scheduled SHORT ones. If retried tests have a higher priority, they'll get executed before other tests even in the same suite. This means that a retried LONG test will get executed before original SHORT ones. This also means that a retried SHORT test will get executed before original LONG ones, but once again we're protected by the cap on total number of retries allowed across the suite.
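
A small sketch of the intra-suite effect, under the same assumption as above (higher priority first, submission order within a priority). The test names, `SUITE_PRIORITY` value, and `execution_order` helper are hypothetical, not dynamic_suite code.

```python
SUITE_PRIORITY = 50  # hypothetical numeric suite priority


def execution_order(jobs):
    """jobs: list of (name, priority) in submission order.

    Higher priority runs first. sorted() is stable, so jobs with equal
    priority keep their submission order.
    """
    return [name for name, _ in sorted(jobs, key=lambda job: -job[1])]


def queue_after_long_failure(retry_bump):
    # LONG_a has already run and failed; these are still waiting.
    pending = [
        ("LONG_b", SUITE_PRIORITY),
        ("SHORT_a", SUITE_PRIORITY),
        ("SHORT_b", SUITE_PRIORITY),
    ]
    # The retry is submitted last, at suite priority + retry_bump.
    pending.append(("LONG_a (retry)", SUITE_PRIORITY + retry_bump))
    return execution_order(pending)


print(queue_after_long_failure(0))  # retry lands behind the SHORT tests
print(queue_after_long_failure(1))  # retry runs before the SHORT tests
```

With the bump, the retried LONG test goes back to the front of the queue instead of trailing the SHORT tests.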

The actual implementation should be fairly trivial. Am I missing anything here?
Summary: paygen HWTest interfere with other HWTest suite scheduled earlier, causing that to timeout (was: paygen HWTest interfere with other HWTest suites with stricter deadline)
Cc: pprabhu@chromium.org jclinton@chromium.org
Issue 850196 has been merged into this issue.
#1 sgtm
Owner: akes...@chromium.org
Does qscheduler care about or mitigate this problem?
Or is this a test planning flaw?
Labels: -Pri-2 quotascheduler Pri-3
Summary: consider scheduling strategies for test retries that avoid inter-suite competition (was: paygen HWTest interfere with other HWTest suite scheduled earlier, causing that to timeout)
Possible FR (feature request) for quotascheduler. Low priority.
