CQ floating build slaves are used when they shouldn't be. Blocks real outage recovery.
Issue description

Our CQ has two 'extra' builders assigned to every paladin builder, which are only supposed to be used if the main builder for that build is down. The current spares:

build202-m2
build107-m2

Those two are being used for somewhat random builds every CQ run. Now that we have a builder which is actually down (build185-m2, https://crbug.com/753186), they are being kept busy by build configs with working builders, and not helping with the outage.
,
Aug 8 2017
CQ Timeline

3:56 PM: CQ Finishes https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15655
  bob-paladin: still running
  caroline-paladin: still running
  cave-paladin: still running
  cyan-paladin: still running
  kip-paladin: still running
  lakitu-paladin: cbuildbot failed
  lumpy-paladin: still running
  quawks-paladin: still running
  reef-paladin: still running
  sentry-paladin: still running
  veyron_minnie-paladin: still running
  wizpig-paladin: still running

3:56 PM: CQ Starts https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15656
15:57:57: INFO: Cancelling obsolete slave builds.
15:57:58: INFO: Going to cancel buildbucket_ids: [u'8971885629071025728', u'8971885635550670272', u'8971885639105501856', u'8971885640470310512', u'8971885647613666112', u'8971885653510662784', u'8971885654954890448', u'8971885655603180480', u'8971885657192664224']
16:02:27: INFO: Build_name auron_paine-paladin buildbucket_id 8971878478475145024 created_timestamp 1502146947367690
*** snipped long list of builders scheduled
16:03:06: INFO: Build_name lakitu-st-paladin buildbucket_id 8971878438332728288 created_timestamp 1502146985651130

betty-paladin Timeline
16:01:30: build173-m2: build aborted (assigned builder)
16:02:36: build202-m2: build started (floating builder)

auron_paine-paladin Timeline
16:01:33: cros-beefy373-c2: build aborted (assigned builder)
16:02:36: build107-m2: build started (floating builder)

What happened was that the builds on the assigned builders were cancelled before new builds were scheduled, but the cancellation + reboot interval was longer than the cancellation-to-scheduling interval. That means the assigned builders weren't available when the new builds started, so the new builds correctly fell back to floating builders.

Looking at the floating builders, they are used for many different builds on a regular basis. As long as all builders are working, this doesn't hurt anything (in fact, it might make the CQ slightly faster). However, if a real builder goes offline, this can prevent the floating builders from being used to replace it.
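To put rough numbers on that margin (a sketch only; the cancel/reboot durations are the approximate measurements reported later in this bug):

from datetime import datetime, timedelta

fmt = '%H:%M:%S'

# From the master log above: obsolete slave builds were cancelled at 15:57:58
# and the replacement builds were scheduled between 16:02:27 and 16:03:06.
cancel_requested = datetime.strptime('15:57:58', fmt)
first_scheduled = datetime.strptime('16:02:27', fmt)
cancel_to_schedule = first_scheduled - cancel_requested
print(cancel_to_schedule)  # 0:04:29

# Approximate cancel + reboot times measured later in this bug:
# ~5.5 minutes for a GCE builder, ~10 minutes for a bare-metal one.
cancel_plus_reboot_gce = timedelta(minutes=5, seconds=30)
cancel_plus_reboot_metal = timedelta(minutes=10)

# In both cases the assigned builder is still aborting/rebooting when the new
# build is scheduled, so the build falls back to a floating builder.
print(cancel_plus_reboot_gce > cancel_to_schedule)    # True
print(cancel_plus_reboot_metal > cancel_to_schedule)  # True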
,
Aug 8 2017
In the short term, the downed builder that brought this to my attention has been fixed.
,
Aug 8 2017
nxia@ pointed out that I made this problem much larger by moving most of the sync time from the cbuildbot sync stage into cbuildbot_launch. That reduced the time interval between cancellation and scheduling.

I see these approaches as possible fixes:
a) Cancel aborted builds before the CQ master exits (this covers the common case, not all cases).
b) Insert a delay into the master cleanup stage after cancelling slaves.
c) Speed up the abort process.
,
Aug 8 2017
Also, this is a performance issue, since it blows away the chroot cache.
,
Aug 9 2017
Looks to me like b) is the best option we have now. Against a): builds may complete before they're canceled by the next master. Against c): Buildbucket-Buildbot controls the abort, and it might be bounded by the polling tick rate.
,
Aug 9 2017
,
Aug 14 2017
,
Aug 14 2017
Add a sleep after cancelling (only if any builds were cancelled), prior to the Buildbucket scheduling requests.
,
Aug 14 2017
,
Aug 15 2017
pprabhu@ suggests only delaying the start of the builds which were canceled in the current master CQ run. So we could:
1) In the CleanUp stage, when obsolete slave builds are canceled, record the build_config of each canceled build in the in-memory metadata.
2) In the ScheduleSlaveStage, schedule the already-completed slaves immediately, sleep x minutes, then schedule the slaves recorded in the metadata.
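A minimal sketch of that two-phase scheduling, assuming a plain-dict metadata store and a schedule_fn callback standing in for the real Buildbucket scheduling call (all names here are illustrative, not the actual chromite API):

import time

CANCELED_CONFIGS_KEY = 'canceled_slave_configs'   # hypothetical metadata key
RESCHEDULE_DELAY_SECONDS = 6 * 60                 # "x mins"; see the timing measurements below

def RecordCanceledSlaves(metadata, canceled_build_configs):
  """CleanUp stage: remember which build configs had their builds canceled."""
  metadata[CANCELED_CONFIGS_KEY] = list(canceled_build_configs)

def ScheduleSlaves(metadata, slave_configs, schedule_fn):
  """Schedule stage: schedule free slaves now, recently-canceled ones later."""
  recently_canceled = set(metadata.get(CANCELED_CONFIGS_KEY, []))
  ready = [c for c in slave_configs if c not in recently_canceled]
  delayed = [c for c in slave_configs if c in recently_canceled]

  for config in ready:
    schedule_fn(config)            # e.g. one Buildbucket request per slave
  if delayed:
    # Give the assigned buildslaves time to finish aborting and rebooting.
    time.sleep(RESCHEDULE_DELAY_SECONDS)
    for config in delayed:
      schedule_fn(config)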
,
Aug 15 2017
You could, but it seems more complex. I'd opt for the simpler solution.
,
Aug 15 2017
I just manually triggered a build and canceled it from buildbucket. The entire process took 5 min 30 s (5 min 20 s for the build to be canceled and 10 s for the GCE builder to restart). So we may need to set the delay to ~6 min.
,
Aug 15 2017
Physical builders take longer to reboot than GCE ones, and the CQ still contains a mix of the two.
,
Aug 15 2017
So the waiting time will be even longer... Are we OK with this simple solution?
,
Aug 15 2017
I'm shocked that the buildbucket cancel took that long. I wonder if it recently got slower?
,
Aug 15 2017
I think it's always been like this.
,
Aug 15 2017
Will buildbucket/swarming affinity solve this?
,
Aug 15 2017
It uses a different bridge, so performance can be different. In my testing, builds started quickly, but I've never tested cancelling them. This is the hacked-up CL I used to schedule tryjobs against swarming. There are two GCE buildslaves to test against. https://chromium-review.googlesource.com/615926
,
Aug 15 2017
I tested again on the bare-metal machines. It took 3~4 mins to cancel a build, and 5~6 mins to restart a physical builder. Do we want to add ~10 mins of extra time?
,
Aug 15 2017
,
Aug 15 2017
Based on a vague memory of a conversation with CrOps, there is a polling loop involved in buildbucket/buildbot cancellation, and propagation of a cancellation is bounded by that polling interval. I *THINK* hinoka@ wrote that code and can speak to its expected performance.

As for after the move to swarming: when I talked over our affinity requirements with maruel@ in the past, we discussed having the swarming service wait a limited time for a builder with the necessary hot cache to become available. That would be the optimal solution for this problem.
,
Aug 15 2017
What's the current plan for this bug?
1) Add a sleep (~10 mins) in the CleanUp stage if there are obsolete slave builds from the previous CQ; OR
2) Record the canceled builds in the CleanUp stage and only delay the scheduling for the canceled slaves.
,
Aug 15 2017
I recommend 1) because of its simplicity, but we should follow up and look for ways to shrink the delay.
,
Aug 16 2017
10 minutes is a major delay. Is there any way to poll buildbucket or buildbot to determine when these rebooting slaves are back? If not, I think it would be better to move build cancellation to the end of the prior build than to add a 10-minute startup delay. The extra info that in-progress builds might produce is nice to have, but I don't think it is worth 10 minutes of latency per run.
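For the polling question: buildbot-eight masters expose a JSON status page (e.g. /json/slaves/<name> with a "connected" field), so a hedged sketch of such a poll might look like this (the master URL, endpoint shape, and timeouts are assumptions, not verified against our masters):

import json
import time
import urllib2

def WaitForSlavesToReconnect(master_url, slave_names, timeout=15 * 60, poll=30):
  """Poll the buildbot master's JSON status until the slaves report connected.

  Returns True if every slave reconnected before the timeout.
  """
  deadline = time.time() + timeout
  pending = set(slave_names)
  while pending and time.time() < deadline:
    for name in sorted(pending):
      url = '%s/json/slaves/%s' % (master_url.rstrip('/'), name)
      try:
        info = json.load(urllib2.urlopen(url, timeout=30))
      except (urllib2.URLError, ValueError):
        continue  # master busy or transient error; try again next pass
      if info.get('connected'):
        pending.discard(name)
    if pending:
      time.sleep(poll)
  return not pending

# Hypothetical usage with a placeholder master URL and slave names:
# WaitForSlavesToReconnect('https://build.chromium.org/p/chromeos',
#                          ['build173-m2', 'cros-beefy373-c2'])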
,
Aug 16 2017
Canceling builds at the end of the completion stage sounds like a better option here: no significant delay, and less confusion for people over the inconsistent statuses of canceled builds.
,
Aug 16 2017
How difficult is it to do a one-off calculation: what % of not-waited-for builds actually complete before they get cancelled by the next CQ master in the current setup? If the answer is low, we're really losing nothing by cancelling at the end of the earlier CQ run, and we don't need to discuss this further :)
,
Aug 16 2017
With an unscientific sample size of one, the time delay between the completion stage finishing and the slaves being scheduled is 13 minutes (between two green builds before 8 AM this morning). That covers the 10-minute delay with little to spare. If we want to maintain optimal performance, that's when we need to cancel slaves.
,
Aug 16 2017
OK. We'll move the cancel_uncompleted_slaves logic to the completion stage. This also allows the clean_up operation to only abort the slaves triggered by the current master_build_id. jkop@, do you want to pick up the change?

Before change: A CQ master may trigger a self-destruction (stop waiting for all slaves to complete) in the completion stage. The next CQ master then runs the CleanUp stage to cancel the obsolete slaves from the previous runs.

After change: No cancel_obsolete_slaves operation in the CleanUp stage. When a self-destruction is triggered in the CQ master completion stage, also cancel the uncompleted slaves.
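Roughly the shape of the change (a sketch only; the helper and client method names below are placeholders, not the actual chromite API):

def CancelBuilds(buildbucket_ids, buildbucket_client):
  """Cancel the given buildbucket builds (stand-in for a shared helper)."""
  for bb_id in buildbucket_ids:
    buildbucket_client.CancelBuild(bb_id)   # placeholder client call

def HandleSelfDestruction(buildbucket_client, scheduled_slave_ids, completed_slave_ids):
  """Completion stage: on self-destruct, cancel this master's own uncompleted slaves."""
  completed = set(completed_slave_ids)
  uncompleted = [bb_id for bb_id in scheduled_slave_ids if bb_id not in completed]
  if uncompleted:
    CancelBuilds(uncompleted, buildbucket_client)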
,
Aug 16 2017
Sure, can do.
,
Aug 21 2017
progress this week.
,
Aug 23 2017
Raised in the CL (https://chromium-review.googlesource.com/c/chromiumos/chromite/+/627277): Do we want to remove the cancellation from cleanup entirely, or leave it in, but presumed to have nothing to cancel?
,
Aug 23 2017
I think we want cleanup to remain. If the previous master builder is killed in an unexpected way, the slaves will still be running. Also, that logic is used for different types of builders, such as release builders.
,
Aug 23 2017
Yeah, that's a compelling pair of reasons, I'll change course.
,
Aug 23 2017
Yeah, dgarrett@ is right. #29 isn't entirely correct in the "after change": we do need to preserve the cancellation in the CleanUp stage in case the previous master build was aborted in the middle of a run.
,
Aug 28 2017
Almost done; the change is breaking some other tests that use ShouldWait. Expected today or tomorrow.
,
Aug 29 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/d3ec0592f61e8daa411da39f93db624381e7bec0

commit d3ec0592f61e8daa411da39f93db624381e7bec0
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Tue Aug 29 07:12:22 2017

master-paladin: Cancel slaves on self-destruct

Create a library function CancelBuilds, called on self-destruct of a master build to cancel all slave builds. Refactor CancelObsoleteBuilds to call this as well, for the shared code. Create mock patching to route around this path in existing tests. Copy and adapt tests from CancelObsoleteSlaveBuilds to also cover CancelBuilds.

BUG= chromium:753189
TEST=cbuildbot/run_tests

Change-Id: I71114b3e6871767a1d1d75e00af680817efaeae8
Reviewed-on: https://chromium-review.googlesource.com/627277
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/lib/builder_status_lib.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/build_status.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/lib/builder_status_lib_unittest.py
,
Aug 29 2017
,
Aug 30 2017
Sweet! Have we verified the fix by looking at behavior in production?
,
Aug 30 2017
I believe this got reverted here: https://chromium-review.googlesource.com/c/chromiumos/chromite/+/642032 So, still going strong... ;)
,
Sep 5 2017
This is still on the chase list. Do we have any eta / blockers for relanding #37?
,
Sep 5 2017
It's headed to pre-CQ now.
,
Sep 6 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/31040ed758becb514db9754baba6f307c3f2f904

commit 31040ed758becb514db9754baba6f307c3f2f904
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Wed Sep 06 23:54:29 2017

master-paladin: Cancel slaves on self-destruct

Create a library function CancelBuilds, called on self-destruct of a master build to cancel all slave builds. Refactor CancelObsoleteBuilds to call this as well, for the shared code. Create mock patching to route around this path in existing tests. Copy and adapt tests from CancelObsoleteSlaveBuilds to also cover CancelBuilds. Add tests for type-checking to ensure JSON serializability of buildbucket_ids, preventing the failure that required a revert.

BUG= chromium:753189
TEST=cbuildbot/run_tests

Change-Id: If45017627c7b22d95af1588ef989ed13794e6452
Reviewed-on: https://chromium-review.googlesource.com/644102
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/buildbucket_lib.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/build_status.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/builder_status_lib.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/builder_status_lib_unittest.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/stages/build_stages.py
,
Sep 8 2017
The master CQ now cancels slaves in CommitQueueCompletion, but the logging should be adjusted as well. For example, in https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16165, cyan-paladin and quawks-paladin were canceled by the CQ master; they did not fail because of "cbuildbot failed". We could change the message to something like "[aborted by self-destruction] cyan-paladin".
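Something like this for the summary line (a sketch of the suggestion only; the function name and parameters are hypothetical):

def FormatSlaveStatusLine(config_name, failure_message, canceled_by_master):
  """Render one slave's line in the master's final status summary."""
  if canceled_by_master:
    # Make it clear the slave was aborted by self-destruction, not broken.
    return '[aborted by self-destruction] %s' % config_name
  return '%s: %s' % (config_name, failure_message)

print(FormatSlaveStatusLine('cyan-paladin', 'cbuildbot failed', True))
# -> [aborted by self-destruction] cyan-paladin
print(FormatSlaveStatusLine('lakitu-paladin', 'cbuildbot failed', False))
# -> lakitu-paladin: cbuildbot failed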
,
Sep 8 2017
I'll write a follow-up CL to make the logging more informative. While it is cancelling builders as it was supposed to, it has not entirely fixed the original root issue. Various builds have been running on the floating buildslaves in the time since the CL landed. (See: https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build107-m2 ) It's possible this will end when other current problems with the CQ pass, but I don't think that's likely. At least some further investigation is needed.
,
Sep 8 2017
Talked to jkop@ offline. CL:644102 does cancel the important, running slaves in the self-destructed CQ's completion stage, but it doesn't cancel the experimental slaves. If the buildslave of an important build fails, this issue will still happen. Now the plan is to also kill the experimental slaves when self-destruction is triggered.
,
Sep 11 2017
Experimental builders are still hogging buildslaves, CL in flight.
,
Sep 13 2017
Issue 721216 has been merged into this issue.
,
Sep 18 2017
New attempt expected to land this week.
,
Sep 21 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/65ecb4e3f8d30d4478591b91436d7b1f35ec480a

commit 65ecb4e3f8d30d4478591b91436d7b1f35ec480a
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Thu Sep 21 08:08:02 2017

Cancel both important and experimental slaves

Change labeling of "scheduled_slaves" to "scheduled_important_slaves" everywhere it previously appeared, making them reflect a change a couple months ago. Fetch all slaves, including experimental, when a master is cancelling builds. Cancel uncompleted experimental slaves when all important builds have completed. Add several tests to verify the old behavior is kept and new behavior works.

BUG= chromium:753189
TEST=tryjob

Change-Id: Ifd73fbdb8d58a27bbc68ce288213ae3fe471530f
Reviewed-on: https://chromium-review.googlesource.com/660881
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/scheduler_stages.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/buildbucket_lib.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/manifest_version_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/scheduler_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/build_status.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/generic_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/builder_status_lib.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/constants.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/buildbucket_lib_unittest.py
,
Sep 21 2017
That change again attacked the (believed) root cause, but did not fix the symptom. More investigation needed.
,
Sep 25 2017
As of now, it appears to have fixed the slow builders problem. The spare capacity builders have been exclusively building tatl-paladin all weekend. I'm tentatively declaring this fixed.
,
Sep 25 2017
Disregard that, it appeared fixed on the external waterfall (go/cros-waterfall) but is still broken on the internal one (uberchromegw/i/chromeos/waterfall).
,
Sep 25 2017
^ link the specific instance of wrong slave being used
,
Sep 25 2017
jkop@'s change has been effectively cancelling slaves (important + experimental) on CQ self-destruction. As far as I can see, the floating buildslaves pick up experimental builds even when the previous CQ didn't self-destruct. So I think this issue is not specific to self-destruction, though self-destruction is one of the causes. The pick_buildslaves_for_experimental_builds logic in the buildbot recipe might need changes to only put experimental builds on floating builders when their assigned buildslaves are not available.
,
Sep 25 2017
The real issue is that we start scheduling new builds before the build slaves have restarted and reconnected from the previous build. Aborting earlier was just a reasonable way to try to give them enough time, but it doesn't seem to be enough in all cases.

Using the LUCI scheduler, I can pretty trivially add a delay between when the CQ master finishes and when the next build starts. Is that a reasonable solution? How much of a delay would be required? 1m? 5m? An alternative solution is to stop rebooting slaves between builds, but we added the reboots for good reason.

PS: My prediction is that after jkop's changes, the slaves which are slow to start (and thus are being replaced by floating builders) are all Golo builders, which have very slow reboot times.
,
Sep 25 2017
We could also solve this by just removing the floating builders, but any CQ builder outage would become a CQ outage.
,
Sep 25 2017
Re #56, "The real issue is that we start scheduling new builds before the build slaves have restarted and reconnected from the previous build." This is not true as I see. For example, build107-m2 picked up guado-accelerator-paladin/builds/456 at Mon Sep 25 13:52:16 2017, but its assigned buildslave has been available since "system boot 2017-09-25 01:47". cros-beefy274-c2 hasn't picked up any guado-accelerator-paladin since #452, although it has been idle and connected since 01:47 (~12 hours before #456 was picked up). https://uberchromegw.corp.google.com/i/chromeos/builders/guado-accelerator-paladin/builds/456 So I guess adding the delays between CQ master and the builds doesn't really solve the issue. Also, as I looked into the history of builds picked up by the floating builders, most of them are experimental slaves, this is not affecting current CQ performance. I think removing the floating builders will fail CQ when the buildslaves for important builders fail. Now the issue should be why the experimental builds are still scheduled on the floating builders even when their assigned buildslaves are idle.
,
Sep 25 2017
Note: system boot isn't what counts; it's when the slave reconnects to the buildbot master. That means that boot has finished, and that the buildbot client software has updated and made a valid connection. On the other hand, the GCE builder being idle for 12 hours suggests that this is a buildbot bug.
,
Sep 26 2017
I'm not sure how to check the exact timestamp of when it was brought up; as you mentioned, "reboot" isn't enough to say it's ready. But as I checked and monitored, the GCE machine cros-beefy274-c2 has been idle and connected and didn't pick up any builds.
,
Sep 26 2017
According to http://shortn/_EnnwK2I9rs, when nextSlave isn't configured, a random buildslave will be picked. Not sure if https://chrome-internal-review.googlesource.com/#/c/chrome/tools/build/+/462954/ is the right way to fix it.
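For context, a minimal sketch of the kind of nextSlave hook involved (Buildbot 0.8-style API as I understand it; the mapping, names, and wiring below are illustrative, not the actual master.cfg):

# Hypothetical assigned-slave map and spare list; the real data lives in the
# buildbot master.cfg / recipe config.
ASSIGNED_SLAVE = {
    'guado-accelerator-paladin': 'cros-beefy274-c2',
}
FLOATING_SLAVES = ['build107-m2', 'build202-m2']

def NextSlavePreferAssigned(builder, available_slavebuilders):
  """nextSlave hook: prefer the builder's assigned slave, else a floating spare."""
  assigned = ASSIGNED_SLAVE.get(builder.name)
  for sb in available_slavebuilders:
    if sb.slave.slavename == assigned:
      return sb
  # Assigned slave not connected/idle (e.g. still rebooting, or down):
  # fall back to a floating spare, then to anything that is available.
  for sb in available_slavebuilders:
    if sb.slave.slavename in FLOATING_SLAVES:
      return sb
  return available_slavebuilders[0] if available_slavebuilders else None

# In master.cfg this would be wired up roughly as:
#   BuilderConfig(name='guado-accelerator-paladin', ...,
#                 nextSlave=NextSlavePreferAssigned)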
,
Sep 27 2017
Looks like the fix in #61 is working. The experimental builds have been scheduled on their assigned buildslaves and the floating buildslaves have been idle since the waterfall restart after CL:*462954 was merged. Sent out the CL for the chromiumos waterfall at https://chromium-review.googlesource.com/c/chromium/tools/build/+/686042
,
Sep 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build/+/d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91

commit d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91
Author: Ningning Xia <nxia@google.com>
Date: Wed Sep 27 19:39:49 2017

Set "nextSlave" for experimental builds.

Experimental builds have been randomly scheduled on "floating" and "assigned" builders. Set "nextSlave" for experimental builders so that the "assgined" builders have the higher priority to be picked up when they're available.

BUG= chromium:753189
TEST=None

Change-Id: I9d089a6c6321c56f6f11cc1541459d42421909b3
Reviewed-on: https://chromium-review.googlesource.com/686042
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Commit-Queue: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91/masters/master.chromiumos/master.cfg
,
Sep 27 2017