Issue 741239

Starred by 2 users

Issue metadata

Status: Duplicate
Owner: ----
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug




Look into being able to better control the rate at which we schedule builds

Project Member Reported by dpranke@chromium.org, Jul 12 2017

Issue description

During the outage we had today (http://o/e/m254945d72000001c), I paused buildbucket, so that we stopped scheduling builds on tryserver.chromium.linux.

By the time we restarted scheduling builds, we had 1700 of them pending. We have no way to control the rate at which things are scheduled, which meant that we immediately hammered the buildbot master, which was probably already borked.

We should investigate better mechanisms for this sort of thing. Ideally whatever we did would also provide feedback to users so that they could tell why their jobs were still pending.
 
Labels: cit-pm-57

Comment 2 by no...@chromium.org, Jul 12 2017

Cc: estaab@chromium.org
Components: -Infra>Platform Infra>Platform>Buildbucket
Owner: ----
Status: Available (was: Untriaged)

Comment 3 by no...@chromium.org, Jul 13 2017

Cc: d...@chromium.org vadimsh@chromium.org
proposed design:

- introduce `peek_percentage`: an attribute of a bucket, an integer value [0..100]
- add get_peek_percentage(bucket) and set_peek_percentage(bucket) APIs
- in the peek API, filter builds: `build_id % 100 < percentage` in frontend
  Note: this is ineffective for low non-zero values of percentage; however, it should happen only during incidents
- deprecate pausing in favor of this (I could not come up with a smooth transition from binary paused/non-paused to a probability)
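
A minimal sketch of the filtering rule above, for illustration only. It is not buildbucket code: the function name `filter_peekable` and the sample build IDs are hypothetical, and only the predicate `build_id % 100 < percentage` and the [0..100] range come from the proposal.

```python
from typing import Iterable, List


def filter_peekable(build_ids: Iterable[int], percentage: int) -> List[int]:
    """Return the build IDs that a peek call would expose.

    Hypothetical helper: applies the proposed rule
    `build_id % 100 < percentage` to each pending build ID.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be an integer in [0..100]")
    return [b for b in build_ids if b % 100 < percentage]


# 1700 made-up, consecutive build IDs standing in for the pending backlog.
pending = list(range(1000, 2700))

print(len(filter_peekable(pending, 0)))    # behaves like "paused"
print(len(filter_peekable(pending, 10)))   # roughly a tenth of the backlog
print(len(filter_peekable(pending, 100)))  # behaves like "not paused"
```

With evenly spread IDs like these, the share of visible builds tracks the percentage. With only a handful of pending builds, a low non-zero percentage can match none of them (for example, IDs 1234, 5678 and 9050 all fail `build_id % 100 < 5`), which is one possible reading of the note about low values.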

+dnj and vadimsh for opinion

the mental model I had was something more along the lines of throttling to X running builds, but maybe it's not obvious what the ranges for X are without also knowing what percentage X is of the total.

Comment 5 by no...@chromium.org, Jul 13 2017

Once a build is scheduled on buildbot, buildbucket has no control over if/when it will run. At that point, buildbot has full control. In practice, the number of running builds is limited by the number of connected slaves.

The proposed design controls how many builds buildbot discovers.

Comment 6 by estaab@chromium.org, Jul 31 2017

We talked about this a few times in person, and I think the conclusion was that it's more important for this to work in LUCI than in buildbot if the implementations would be different. Nodir also said we could change swarmbucket to use task queues so that it's push-based like buildbot, for easier throttling.

Comment 7 by no...@chromium.org, Apr 30 2018

Mergedinto: 812021
Status: Duplicate (was: Available)
