
Issue 843652

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 9
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Feature

Blocking:
issue 842068




Add new "PreCQ" dimension to some percentage of Chrome OS Swarming machine pool

Project Member Reported by jclinton@chromium.org, May 16 2018

Issue description

Today, we have a uniform pool of machines that we schedule Swarming tasks against. We have enough capacity for all of the peak load that we might have.

We would like to schedule more PreCQ jobs than we will have capacity for, without starving other workloads (CQ, Release, Tryjob, Firmware, etc.). We want to increase PreCQ board coverage before we have new machines to fully cover it, and we want to use Buildbucket's FIFO to manage the queuing.

To accomplish this, we need to add a new dimension, "role:precq", to all of the machines that we use for PreCQ today, and then update the Buildbucket PreCQ job to require that role.

Once we've done that, PreCQ jobs will only run on up to the number of machines that we have marked with this role, while other jobs can still use machines in the PreCQ pool opportunistically. If those other jobs start starving PreCQ, we can move PreCQ to a higher scheduler priority.
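
As a rough illustration of the behavior described above, here is a minimal sketch of dimension matching, assuming a task is eligible for a bot only when the bot advertises every dimension the task requires. The bot names and the `bot_can_run` and `eligible` helpers are hypothetical and are not Swarming's actual API.

```python
def bot_can_run(bot_dimensions, required_dimensions):
    """Return True if the bot advertises every dimension the task requires."""
    return all(
        value in bot_dimensions.get(key, set())
        for key, value in required_dimensions.items()
    )


def eligible(bots, required_dimensions):
    """Return the sorted names of bots that could run the task."""
    return sorted(
        name for name, dims in bots.items()
        if bot_can_run(dims, required_dimensions)
    )


# Hypothetical bots: one carries the new role dimension, one does not.
bots = {
    "bot-1": {"pool": {"ChromeOS"}, "role": {"precq"}},
    "bot-2": {"pool": {"ChromeOS"}},
}

# A PreCQ request requires the role; a CQ request only requires the pool.
precq_task = {"pool": "ChromeOS", "role": "precq"}
cq_task = {"pool": "ChromeOS"}

print(eligible(bots, precq_task))  # only the role:precq bot
print(eligible(bots, cq_task))     # every bot in the pool
```

Under this assumption, the PreCQ request is confined to labeled bots, while a request that omits the role can land on any bot in the pool, including the labeled ones.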

I have started on this.

 
Blocking: 842068
Don, do you have a spreadsheet somewhere that you are using to keep track of machine needs for each of the jobs that have been moved to Swarming? I want to make the number of machines that receive the "role:precq" label be equal to the number that you have set aside for PreCQ.

Cc: hidehiko@chromium.org nya@chromium.org
Per Don, we have 210 Swarming builders right now, of which 80 came from moving PreCQ builders over to Swarming. We're looking to put an upper limit on PreCQ consumption, so 100 seems like a good guess.

I'm going to label 100 hosts as role:precq.

Per the email, splitting this into 100 bots for "role:precq" and 70 bots for "role:tryjob". For simplicity, the CQ jobs will always be able to trump any other jobs since they have full access to all 210 bots.
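
A small sketch of the resulting capacity ceilings, assuming the two roles are assigned to disjoint sets of bots and the remaining bots carry no role (the thread does not state either). The `build_pool` and `capacity` helpers are hypothetical.

```python
TOTAL_BOTS = 210
ROLE_COUNTS = {"precq": 100, "tryjob": 70}


def build_pool(total, role_counts):
    """Return one role label per bot; None marks a bot with no role.

    Assumes each bot carries at most one role.
    """
    roles = []
    for role, count in role_counts.items():
        roles.extend([role] * count)
    roles.extend([None] * (total - len(roles)))
    return roles


def capacity(pool, required_role=None):
    """Count the bots a job can use; a job with no required role can use all."""
    if required_role is None:
        return len(pool)
    return sum(1 for role in pool if role == required_role)


pool = build_pool(TOTAL_BOTS, ROLE_COUNTS)
print(capacity(pool, "precq"))   # PreCQ ceiling
print(capacity(pool, "tryjob"))  # TryJob ceiling
print(capacity(pool))            # CQ can use every bot
```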

We could slice and dice this further if we so choose.  In the future, using Machine Provider will allow us to set dimensions on machine allocation, avoiding the use of names for identification.  This also allows us to move some builder configs to a single location.

-- Mike
  
Project Member

Comment 6 by bugdroid1@chromium.org, Jun 25 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/d14f3fb5b98cfa7a646d7113f2c2a3b26ba912da

commit d14f3fb5b98cfa7a646d7113f2c2a3b26ba912da
Author: Mike Nichols <mikenichols@chromium.org>
Date: Mon Jun 25 23:09:06 2018

Project Member

Comment 7 by bugdroid1@chromium.org, Jun 25 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/bd9af87f8675132146849ed0a35046810ea55cb7

commit bd9af87f8675132146849ed0a35046810ea55cb7
Author: Mike Nichols <mikenichols@google.com>
Date: Mon Jun 25 23:18:06 2018

Project Member

Comment 8 by bugdroid1@chromium.org, Jun 26 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/6e71c00fe7521b7fa6461373b1351edeefa956b1

commit 6e71c00fe7521b7fa6461373b1351edeefa956b1
Author: Mike Nichols <mikenichols@chromium.org>
Date: Tue Jun 26 17:15:17 2018

Project Member

Comment 9 by bugdroid1@chromium.org, Jun 26 2018

Labels: merge-merged-config
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/manifest-internal/+/2a09b439239959ab7c7b35846b58082caa659165

commit 2a09b439239959ab7c7b35846b58082caa659165
Author: Mike Nichols <mikenichols@chromium.org>
Date: Tue Jun 26 17:16:00 2018

The bot configurations are now in place:

PreCQ:  100 bots
TryJob:  70 bots
Prod:  209 bots
Experimental: 1 bot

We'll continue to adjust these pools as additional capacity is added and as the buildbot migration to Swarming completes.

-- Mike
Status: Fixed (was: Started)
We need to add role:precq to our PreCQ Launcher Buildbucket requests in order to unblock issue 842068. Should be a one-line change. I didn't have a separate bug for that. Could track that here or open another bug.
I don't think this is needed. The benefit of associating the builders with mixins, which in turn carry the roles, is that we can associate builders with bot groups without needing to update any calling tasks:

PreCQ bots:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=pool&c=role&f=pool%3AChromeOS&f=role%3Aprecq&l=210&s=id%3Aasc

PreCQ task after change: 
https://chrome-swarming.appspot.com/task?id=3e55f31d46468a10&refresh=10&request_detail=true

-- Mike

Hrm, yea I see now: looks like you submit a Buildbucket request with Builder name "PreCQ" and role:precq is attached to that.

BTW, can we update the owners of the bot pools to the whole team? https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chrome-swarming/bots.cfg#123

It affects the ability to click buttons on the Swarming UI.

Status: Started (was: Fixed)
I believe the permissions are covered by another bug, but I will keep this open until I can confirm.

-- Mike
Status: Fixed (was: Started)
The permissions live outside the bots.cfg that was changed during this initiative. The latest changes have been submitted. Marking this part as complete.

-- Mike
