
Issue 727033 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Reschedule tests blocked on provision quicker

Project Member Reported by davidri...@chromium.org, May 27 2017

Issue description

In the following build, the entire build was blocked waiting for one test, which was itself blocked on a provision job that eventually failed (allowing the test to be rescheduled on another DUT and the build to pass):
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-paladin/builds/2689
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1551613

For CQ runs, if DUTs are idle while the suite is blocked on provision jobs, reschedule the remaining tests on one of the DUTs that is already properly provisioned and currently idle.

In the example provided, this would have saved about ten minutes on the build.

 
Another example from the same CQ run:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-paladin/builds/2691
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1551635

(These were the two slowest builds of the CQ run; speeding them up by ten minutes would have sped up the entire CQ run by ten minutes.)

Comment 2 by dshi@chromium.org, May 28 2017

Rescheduling a job that is already assigned a DUT can be tricky. A test job creates an HQE, which kicks off the provision special task. Detaching a job from a special task requires some messy changes in the host scheduler logic.

Another way to look at this problem: why didn't the provision job have a timeout of, say, 15 minutes? Then the test wouldn't have waited 35 minutes until the suite job timed out; instead, it could have been retried on another DUT (assuming a provision failure can lead to a test retry).
Well, 15 minutes would be too short a timeout -- we've got provision jobs that take that long. The issue with short timeouts is that if you run into any slow operation that has its own timeout of a few minutes, it could push the entire provision into a timeout state, and then take longer overall.

Two options:
1. don't reschedule jobs, but instead schedule a new job which could run in parallel
2. for CQ, provision without scheduling jobs; only schedule tests when devices are fully provisioned
Cc: ihf@chromium.org

Comment 5 by dshi@chromium.org, May 31 2017

1. don't reschedule jobs, but instead schedule a new job which could run in parallel
This requires the suite job to have more logic to create a test job in the middle of a suite run. It will also have to abort the other test job that got stuck in provision. Otherwise, the suite job will still time out due to that provision job.
I don't have a simple answer for the "reschedule" design. In an edge case, it's possible that the DUTs that finished running other test jobs are assigned to other suites and are being provisioned to a different build, so creating a new test job will require a new provision job anyway. In the real world, we have a dedicated CQ pool, so other DUTs can be idle and still have the required build. Anyhow, it will require quite some logic changes in the suite job.


2. for CQ, provision without scheduling jobs; only schedule tests when devices are fully provisioned
Sadly, the provision special task is tied to a test job. The test job is assigned a host before the provision task is created.
Maybe we can make a change in the suite job that does the following:
1. Search for a DUT that has the desired build label and has been idle for x minutes.
2. Check whether any test job is stuck in provision.
3. If 1 & 2 are both true, decouple the provision task from the test job, and remove the host assignment from the test job. The host scheduler will handle the rest and assign a DUT with the desired version label to the test job.

The tricky thing is that a test job may require multiple version labels (in the case of a FAFT test or an Android testbed test), so we might want to disable the behavior if there are multiple builds or the build is for a testbed.
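
A rough sketch of the three-step check proposed above, including the multi-build/testbed guard. This is only an illustration of the decision logic: the `Dut` and `QueuedJob` classes, their fields, `maybe_detach`, and the 5-minute default are all hypothetical and do not correspond to real AFE or host scheduler code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dut:
    """Hypothetical view of a DUT in the CQ pool."""
    hostname: str
    build_label: str      # build currently provisioned on the DUT
    idle_minutes: float   # how long the DUT has had nothing to run


@dataclass
class QueuedJob:
    """Hypothetical view of a test job created by the suite job."""
    name: str
    required_builds: List[str]          # more than one for FAFT/testbed jobs
    host: Optional[str] = None          # hostname the job is assigned to
    stuck_in_provision: bool = False
    is_testbed: bool = False


def find_idle_provisioned_dut(duts, build, min_idle_minutes, exclude_host=None):
    """Step 1: find a DUT that already has the build and has been idle."""
    for dut in duts:
        if dut.hostname == exclude_host:
            continue
        if dut.build_label == build and dut.idle_minutes >= min_idle_minutes:
            return dut
    return None


def maybe_detach(job, duts, min_idle_minutes=5.0):
    """Return True if the job was detached from its host.

    After detaching, the (real) host scheduler would be expected to match
    the job to a DUT that already carries the desired version label.
    """
    # Guard: skip jobs that need multiple builds or run on a testbed.
    if job.is_testbed or len(job.required_builds) != 1:
        return False
    # Step 2: only act on jobs that are stuck in provision.
    if not job.stuck_in_provision:
        return False
    candidate = find_idle_provisioned_dut(
        duts, job.required_builds[0], min_idle_minutes, exclude_host=job.host)
    if candidate is None:
        return False
    # Step 3: drop the host assignment (and, implicitly, the provision task).
    job.host = None
    job.stuck_in_provision = False
    return True
```

Note that the sketch only clears the assignment; it does not pick the new DUT itself, which matches the idea of leaving the matching to the host scheduler.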

Cc: ayatane@chromium.org
There's already a project underway to separate provisioning into
its own suite.  It seems like most of the stuff under discussion
here either won't work with that design, or would be obviated.

My instinct is either to de-prioritize this until we see the
impact of the planned work, or just close it as WontFix.

Comment 7 by aut...@google.com, Jun 12 2017

Owner: ayatane@chromium.org
@ayatane - can you evaluate this as part of your provision work?
I don't understand the OP's claim. The retry job looks like it was created 2 minutes after the initial attempt failed (in provision). Are you just asking for a shorter provision timeout, as in Issue 732001?

I don't really understand #5.

The retry goes through the following steps in getting created:
 - previous job fails, updates its entry in afe (not sure how much overhead, may wait on some post-job special tasks)
 - dynamic_suite notices that a job has failed and creates new afe_job entry for the retry (~15s average since this is polling with 30s interval)
 - host-scheduler matches job to host (~a few seconds)
 - scheduler picks up matched job and starts relevant tasks (10s of seconds to a minute)

That plausibly adds up to the ~2 minute gap observed in OP data. 
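
As a back-of-the-envelope check of the steps listed above, the per-step delays can be summed as ranges. Only the 30s polling interval and the "10s of seconds to a minute" scheduler range come from the list; the bounds for the AFE update and for "a few seconds" are assumptions made up for this sketch.

```python
POLL_INTERVAL_S = 30  # dynamic_suite polling interval, from the list above

# (low, high) delay per step, in seconds.
STEPS = {
    "previous job updates AFE": (0, 60),           # assumed; real overhead unknown
    "dynamic_suite notices failure": (0, POLL_INTERVAL_S),
    "host-scheduler matches job": (2, 5),          # assumed reading of "a few seconds"
    "scheduler starts tasks": (10, 60),            # "10s of seconds to a minute"
}


def expected_poll_delay(interval_s):
    """Average wait when a failure lands at a uniformly random point
    within a polling interval."""
    return interval_s / 2


def total_range(steps):
    """Sum the low and high bounds across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


if __name__ == "__main__":
    print("average polling delay (s):", expected_poll_delay(POLL_INTERVAL_S))
    print("total delay range (s):", total_range(STEPS))
```

Under these assumed bounds, a gap of about 2 minutes falls inside the range, toward the upper end.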


My temptation is to WontFix this or merge it into 732001, since that seems like it would have the most bang for the buck in OP's example, and 
Status: WontFix (was: Untriaged)
OK, I stopped writing because I understood the original request.

OP wants us to start parallel copies of a test if that is the only remaining test in the suite.

I don't see any convenient way to implement this in dynamic suite, without a major overhaul. And I see more bang for buck in 732001. Therefore, WontFixing.
Sorry, I linked to the wrong bug. The one that has a lot of bang for buck is reducing the provision timeout, whichever one that is.
