Look into being able to better control the rate at which we schedule builds
Issue description
During the outage we had today (http://o/e/m254945d72000001c), I paused buildbucket, so that we stopped scheduling builds on tryserver.chromium.linux. By the time we restarted scheduling builds, we had 1700 of them pending. We have no way to control the rate at which things are scheduled, which meant that we immediately hammered the buildbot master which was probably already borked. We should investigate better mechanisms for this sort of thing. Ideally whatever we did would also provide feedback to users so that they could tell why their jobs were still pending.
Jul 12 2017
Jul 13 2017
Proposed design:
- Introduce `peek_percentage`: an attribute of a bucket, an integer value in [0..100].
- Add `get_peek_percentage(bucket)` and `set_peek_percentage(bucket)` APIs.
- In the peek API, filter builds in the frontend: `build_id % 100 < percentage`.
  Note: this is ineffective for low non-zero values of percentage; however, it should happen only during incidents.
- Deprecate pausing in favor of this (I could not come up with a smooth transition from binary paused/non-paused to a probability).

+dnj and vadimsh for opinion.
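For illustration only, the peek filter in the proposed design could look roughly like the sketch below. The function name `filter_peekable` and the use of plain integer build ids are assumptions made for the example, not existing buildbucket code.

```python
def filter_peekable(build_ids, peek_percentage):
    """Return the build ids that a peek call would expose.

    Hypothetical helper: applies the rule `build_id % 100 < percentage`
    from the proposal. `peek_percentage` is an integer in [0..100].
    """
    if not 0 <= peek_percentage <= 100:
        raise ValueError("peek_percentage must be in [0..100]")
    # 0 hides every build (equivalent to paused); 100 exposes every build.
    return [b for b in build_ids if b % 100 < peek_percentage]


if __name__ == "__main__":
    pending = list(range(1000, 1200))  # made-up build ids
    visible = filter_peekable(pending, 25)
    print(len(visible), "of", len(pending), "builds visible")
```

Note that the filter is static: for a fixed percentage the same subset of builds stays visible, so the fraction exposed only approximates the percentage if build ids are spread roughly evenly modulo 100 (an assumption here).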
Jul 13 2017
the mental model I had was something more along the lines of throttling to X running builds, but maybe it's not obvious what the ranges for X are without also knowing what percentage X is of the total.
Jul 13 2017
Once a build is scheduled on buildbot, buildbucket has no control over if/when it will run. At that point, buildbot has full control. In practice, the number of running builds is limited by the number of connected slaves. The proposed design controls how many builds buildbot discovers.
Jul 31 2017
We talked about this a few times in person and I think the conclusion was that it's more important that this works in LUCI than in buildbot if the implementation would be different. Nodir also said we could change swarmbucket to use task queues so it's push-based like buildbot for easier throttling.
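As a rough sketch of what push-based throttling could look like, here is a token-bucket limiter draining a backlog of pending builds. This is a generic rate-limiting technique, not swarmbucket's or any task queue's actual implementation; `TokenBucket`, `dispatch`, and the `push` callback are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Allows at most `burst` pushes at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock  # injectable so the behavior can be tested
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return whether it succeeded."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def dispatch(pending, bucket, push):
    """Push builds from the front of `pending` while tokens remain.

    Returns the number of builds pushed; the rest stay pending.
    """
    pushed = 0
    while pending and bucket.try_acquire():
        push(pending.pop(0))
        pushed += 1
    return pushed
```

With a backlog like the 1700 pending builds from the outage, repeated calls to `dispatch` would release them at a bounded rate instead of all at once.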
Apr 30 2018
Comment 1 by dpranke@chromium.org, Jul 12 2017