ChromeOS BuildBot Single Points of Failure |
||
Issue descriptionSeveral builders on the ChromeOS waterfall have only single machines backing them. This means that, if those builders go offline, the systems in which they participate are temporarily shut down until they can be brought back. These builders include: - ChromiumOS Waterfall Paladins - ChromeOS Waterfall PFQs - ChromeOS Waterfall Canaries Various CrOS teams depend on the operations and artifacts produced by these builders. During normal business hours, an offline builder can be fixed by kicking or replacing it, but during unsupported hours (e.g., nights, weekends, holidays), teams that may still be operating are generally forced to wait until Infra can resolve the problem. The rationale for single-builder assignment is twofold: - If a single builder only builds for a single target, it can reuse checkouts and artifacts from previous builds, reducing bandwidth requirements and build time. - If a builder were to process multiple builds, it would accumulate artifacts from each of those builds, potentially overwhelming its disk capacity. The ChromeOS Waterfall's Paladin builders have a "floating" backup builder, which is a single builder assigned to the whole Paladin suite and configured to only accept builds if the dedicated builder has been offline for a sustained period of time. This implementation is dated and brittle, and attempts to apply it with more than 1 slave resulted in unexpected waterfall failure. The code for this is here: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/master/floating_builder.py Applying the floating builder code to other systems is one option, but the code should probably be rewritten to: 1) Have less complexity. 2) Support multiple floating builders. 3) Be applied to multiple builder groups (e.g., the ones listed at the top of this bug).
,
Aug 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/525b79ec0f5e47f3d203094c2bd7f1a797736ea2 commit 525b79ec0f5e47f3d203094c2bd7f1a797736ea2 Author: dnj <dnj@chromium.org> Date: Mon Aug 15 23:48:23 2016 Update floating builder logic, add to "chromiumos" The floating builder logic had several flaws, notably: - It was broken: adding more than one (see crbug.com/600479 ). - It was too complex. It was written generically with the idea that other waterfalls may want to use it (hah!) This updates the logic to clearly have a primary and floating slave set rather than a generic "strata" concept, and to specifically prefer them in a predefined order. This also implements floating builders for PFQs (in builder config), ups the number of floating builders from 1 to 2 for paladins, and enrolls the floating builder logic for the "chromiumos" waterfall paladins. BUG= chromium:637166 TEST=local Review-Url: https://codereview.chromium.org/2250443002 [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/master.cfg [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slave_pool.json [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/masters/master.chromiumos/slaves.cfg [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/common/slave_alloc.py [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/cros/builder_config.py [modify] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/floating_builder.py [add] https://crrev.com/525b79ec0f5e47f3d203094c2bd7f1a797736ea2/scripts/master/unittests/floating_builder_test.py
,
Aug 15 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome/tools/build/+/aaf1219fff094c4663185693d469a42cdbfa3515 commit aaf1219fff094c4663185693d469a42cdbfa3515 Author: Dan Jacques <dnj@chromium.org> Date: Mon Aug 15 23:57:11 2016
,
Aug 16 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome/tools/build/+/d652f06806611140f48846db6308e6e7de2102d9 commit d652f06806611140f48846db6308e6e7de2102d9 Author: dnj <dnj@google.com> Date: Tue Aug 16 02:02:11 2016
,
Aug 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb commit d521c25d75f302880ffaa2e5f10aa5cfe1c932cb Author: dnj <dnj@chromium.org> Date: Wed Aug 17 01:31:40 2016 Floating builder logic waits at startup. Currently, the floating builder logic considered slaves that haven't ever signed in as last seen a long time ago (epoch). This means that if a master starts up, has a task, and a floating builder connects before a primary, the floating builder will be immediately given the task, since the primary is outside of the grace period. Update this logic: now all slaves are considered last seen when the floating function was created. This will cause the floating builder to wait a full grace period before giving builds to floating slaves, even if it just started up. BUG= chromium:637166 TEST=None Review-Url: https://codereview.chromium.org/2255643002 [modify] https://crrev.com/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb/scripts/master/floating_builder.py [modify] https://crrev.com/d521c25d75f302880ffaa2e5f10aa5cfe1c932cb/scripts/master/unittests/floating_builder_test.py
,
Aug 17 2016
This has been implemented on "chromiumos" and "chromeos" waterfalls. Net difference: - Internal paladins now have +1 floating builder (from 1). - Public paladins now have +2 floating builders (from 0). - Internal PFQs now have +2 floating builders (from 0). Net gain: whatever the bug was in the previous floating builder logic that made everything to to hell if the number increased past one seems to not be present in the new logic, so in theory we can add additional floating builders if necessary. |
||
►
Sign in to add a comment |
||
Comment 1 by bugdroid1@chromium.org
, Aug 15 2016