New issue
Advanced search Search tips

Issue 637166 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2016
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

ChromeOS BuildBot Single Points of Failure

Project Member Reported by d...@chromium.org, Aug 12 2016

Issue description

Several builders on the ChromeOS waterfall have only single machines backing them. This means that, if those builders go offline, the systems in which they participate are temporarily shut down until they can be brought back. These builders include:
- ChromiumOS Waterfall Paladins
- ChromeOS Waterfall PFQs
- ChromeOS Waterfall Canaries

Various CrOS teams depend on the operations and artifacts produced by these builders. During normal business hours, an offline builder can be fixed by kicking or replacing it, but during unsupported hours (e.g., nights, weekends, holidays), teams that may still be operating are generally forced to wait until Infra can resolve the problem.

The rationale for single-builder assignment is twofold:
- If a single builder only builds for a single target, it can reuse checkouts and artifacts from previous builds, reducing bandwidth requirements and build time.
- If a builder were to process multiple builds, it would accumulate artifacts from each of those builds, potentially overwhelming its disk capacity.

The ChromeOS Waterfall's Paladin builders have a "floating" backup builder, which is a single builder assigned to the whole Paladin suite and configured to only accept builds if the dedicated builder has been offline for a sustained period of time. This implementation is dated and brittle, and attempts to apply it with more than 1 slave resulted in unexpected waterfall failure. The code for this is here: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/master/floating_builder.py

Applying the floating builder code to other systems is one option, but the code should probably be rewritten to:
1) Have less complexity.
2) Support multiple floating builders.
3) Be applied to multiple builder groups (e.g., the ones listed at the top of this bug).
 
Project Member

Comment 1 by bugdroid1@chromium.org, Aug 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/525b79ec0f5e47f3d203094c2bd7f1a797736ea2

commit 525b79ec0f5e47f3d203094c2bd7f1a797736ea2
Author: dnj <dnj@chromium.org>
Date: Mon Aug 15 23:48:23 2016

Update floating builder logic, add to "chromiumos"

The floating builder logic had several flaws, notably:
- It was broken: adding more than one (see  crbug.com/600479 ).
- It was too complex. It was written generically with the idea that other
  waterfalls may want to use it (hah!)

This updates the logic to clearly have a primary and floating slave set
rather than a generic "strata" concept, and to specifically prefer them in a
predefined order. This also implements floating builders for PFQs (in builder
config), ups the number of floating builders from 1 to 2 for paladins, and
enrolls the floating builder logic for the "chromiumos" waterfall paladins.

BUG= chromium:637166 
TEST=local

Review-Url: https://codereview.chromium.org/2250443002

[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/master.cfg
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slave_pool.json
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slaves.cfg
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/common/slave_alloc.py
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/cros/builder_config.py
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/floating_builder.py
[add] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/unittests/floating_builder_test.py

Project Member

Comment 2 by bugdroid1@chromium.org, Aug 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/525b79ec0f5e47f3d203094c2bd7f1a797736ea2

commit 525b79ec0f5e47f3d203094c2bd7f1a797736ea2
Author: dnj <dnj@chromium.org>
Date: Mon Aug 15 23:48:23 2016

Update floating builder logic, add to "chromiumos"

The floating builder logic had several flaws, notably:
- It was broken: adding more than one (see  crbug.com/600479 ).
- It was too complex. It was written generically with the idea that other
  waterfalls may want to use it (hah!)

This updates the logic to clearly have a primary and floating slave set
rather than a generic "strata" concept, and to specifically prefer them in a
predefined order. This also implements floating builders for PFQs (in builder
config), ups the number of floating builders from 1 to 2 for paladins, and
enrolls the floating builder logic for the "chromiumos" waterfall paladins.

BUG= chromium:637166 
TEST=local

Review-Url: https://codereview.chromium.org/2250443002

[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/master.cfg
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slave_pool.json
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slaves.cfg
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/common/slave_alloc.py
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/cros/builder_config.py
[modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/floating_builder.py
[add] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/unittests/floating_builder_test.py

Project Member

Comment 3 by bugdroid1@chromium.org, Aug 15 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/build/+/aaf1219fff094c4663185693d469a42cdbfa3515

commit aaf1219fff094c4663185693d469a42cdbfa3515
Author: Dan Jacques <dnj@chromium.org>
Date: Mon Aug 15 23:57:11 2016

Project Member

Comment 4 by bugdroid1@chromium.org, Aug 16 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/build/+/d652f06806611140f48846db6308e6e7de2102d9

commit d652f06806611140f48846db6308e6e7de2102d9
Author: dnj <dnj@google.com>
Date: Tue Aug 16 02:02:11 2016

Project Member

Comment 5 by bugdroid1@chromium.org, Aug 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb

commit d521c25d75f302880ffaa2e5f10aa5cfe1c932cb
Author: dnj <dnj@chromium.org>
Date: Wed Aug 17 01:31:40 2016

Floating builder logic waits at startup.

Currently, the floating builder logic considered slaves that haven't
ever signed in as last seen a long time ago (epoch). This means that if
a master starts up, has a task, and a floating builder connects before a
primary, the floating builder will be immediately given the task, since
the primary is outside of the grace period.

Update this logic: now all slaves are considered last seen when the
floating function was created. This will cause the floating builder to
wait a full grace period before giving builds to floating slaves, even
if it just started up.

BUG= chromium:637166 
TEST=None

Review-Url: https://codereview.chromium.org/2255643002

[modify] https://crrev.com/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb/scripts/master/floating_builder.py
[modify] https://crrev.com/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb/scripts/master/unittests/floating_builder_test.py

Comment 6 by d...@chromium.org, Aug 17 2016

Status: Fixed (was: Assigned)
This has been implemented on "chromiumos" and "chromeos" waterfalls. Net difference:
- Internal paladins now have +1 floating builder (from 1).
- Public paladins now have +2 floating builders (from 0).
- Internal PFQs now have +2 floating builders (from 0).

Net gain: whatever the bug was in the previous floating builder logic that made everything to to hell if the number increased past one seems to not be present in the new logic, so in theory we can add additional floating builders if necessary.

Sign in to add a comment