
Issue 831873

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jul 24
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




lucifer: Does not handle synch_count=1 jobs with multiple HQEs

Project Member Reported by ayatane@chromium.org, Apr 11 2018

Issue description

lucifer does not handle synch_count=1 jobs with multiple HQEs.

This results in scheduler crashes: https://bugs.chromium.org/p/chromium/issues/detail?id=831689
 
Is this a valid use case at all?
Why did we have multiple DUTs assigned to this HQE in the first place?

If I understand it right, this job would have just run the test thrice, once per DUT.
If that is the case, skylab-swarming will _not_ support such jobs. Just create three different tasks. I also don't see lucifer having to support this in the current way.

We should see why we support this use case, and simply move it to creating three independent jobs.
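
As a rough illustration of "create three independent jobs", the sketch below fans a multi-DUT job request out into one request per DUT. `JobRequest` and `split_per_dut` are hypothetical names invented for this example; they are not part of Autotest, lucifer, or skylab, and the test and host names are placeholders.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical job description: one test and the DUTs it should run on."""
    test_name: str
    hostnames: Tuple[str, ...]


def split_per_dut(request: JobRequest) -> List[JobRequest]:
    """Return one single-DUT request per hostname in the original request."""
    return [replace(request, hostnames=(h,)) for h in request.hostnames]


# Placeholder values, for illustration only.
multi = JobRequest("example_test", ("dut1", "dut2", "dut3"))
for single in split_per_dut(multi):
    print(single.test_name, single.hostnames[0])
```

Each resulting request names exactly one DUT, so every job maps to a single HQE and the multi-HQE case never arises.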
It's a feature that used to work. It's a shortcut that avoids creating a job three times to run it on three DUTs; you can create one job with three DUTs instead.

The hard question is whether it is worth restoring this feature, given how close we are to beginning the skylab migration, and given that the skylab world will not support this feature (i.e., just create three jobs instead).
Cc: jrbarnette@chromium.org

Comment 4 Deleted

A bit more detail: the central issue is how execution groups are handled.

An execution group is a group of HQEs/hosts that run a single autoserv together. Most HQEs are in an execution group by themselves. synch_count jobs have all of their HQEs in one execution group. Thus, most jobs have one execution group. However, non-synch_count jobs with multiple HQEs have multiple execution groups, one per HQE.

(It is theoretically possible to have synch_count jobs with multiple execution groups.  I'm pretty sure no such job has ever been created.)
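
To make the grouping rule concrete, here is a minimal sketch of how a job's HQEs would fall into execution groups under the description above. The `HQE` class and `execution_groups` function are hypothetical, not Autotest code. It assumes that "synch_count job" means `synch_count > 1` (inferred from the issue title) and ignores the theoretical case of a synch_count job with several groups.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class HQE:
    """Hypothetical stand-in for a host queue entry."""
    id: int
    job_id: int
    hostname: str


def execution_groups(hqes: Sequence[HQE], synch_count: int) -> List[List[HQE]]:
    """Partition one job's HQEs into execution groups.

    Assumption: synch_count > 1 puts every HQE in a single group;
    otherwise each HQE forms its own group.
    """
    if synch_count > 1:
        return [list(hqes)]
    return [[hqe] for hqe in hqes]


hqes = [HQE(1, 100, "dut1"), HQE(2, 100, "dut2"), HQE(3, 100, "dut3")]
print(len(execution_groups(hqes, synch_count=1)))  # one group per HQE
print(len(execution_groups(hqes, synch_count=3)))  # a single shared group
```

The first call models the job in this bug: one job id, but three execution groups.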

Lucifer considers each job to be one execution group. The difficulty in restoring support for multiple execution groups is that the execution group transaction locks between Autotest and Lucifer are keyed on job id. Keying on HQE as well would require a database migration and several deployment rounds to roll everything out in a backward-compatible manner.
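
The keying problem can be illustrated with a toy in-memory lock table. This is only a sketch: `GroupLocks` is a hypothetical class, and it models only the choice of key, not how the real locks are stored or handed off between Autotest and Lucifer.

```python
from typing import Callable, Hashable, Set


class GroupLocks:
    """Toy lock table; key_fn decides what an execution group lock is keyed on."""

    def __init__(self, key_fn: Callable[[int, int], Hashable]) -> None:
        self._key_fn = key_fn
        self._held: Set[Hashable] = set()

    def acquire(self, job_id: int, hqe_id: int) -> bool:
        """Take the lock for this group; return False if the key is already held."""
        key = self._key_fn(job_id, hqe_id)
        if key in self._held:
            return False
        self._held.add(key)
        return True


# Keyed on job id only: the second execution group of job 100 collides.
by_job = GroupLocks(lambda job_id, hqe_id: job_id)
print(by_job.acquire(100, 1), by_job.acquire(100, 2))

# Keyed on (job id, HQE id): each execution group gets its own lock.
by_job_and_hqe = GroupLocks(lambda job_id, hqe_id: (job_id, hqe_id))
print(by_job_and_hqe.acquire(100, 1), by_job_and_hqe.acquire(100, 2))
```

With the job-id key, a synch_count=1 job with several HQEs cannot hold independent locks for its groups, which is the limitation described above.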
Labels: -Pri-1 Pri-2
Lowering priority after mitigation: https://bugs.chromium.org/p/chromium/issues/detail?id=832167
Status: WontFix (was: Assigned)
Separate bug for skylab multi-DUT jobs
