
Issue 753189

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



CQ floating build slaves are used when they shouldn't be. Blocks real outage recovery.

Project Member Reported by dgarr...@chromium.org, Aug 8 2017

Issue description

Our CQ has two 'extra' builders available to every paladin builder, which are only supposed to be used if the main builder for that build config is down.

The current spares:
  build202-m2
  build107-m2

Those two are being used for somewhat random builds every CQ run. Now that we have a builder which is actually down (build185-m2, https://crbug.com/753186), they are being kept busy by build configs whose own builders are working, and not helping with the outage.

 
Build history for the two spare builders:

https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build202-m2
https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build107-m2

I believe what is happening is that the CQ has been failing repeatedly and aborting some builds, then scheduling new builds before the designated builders have had time to finish rebooting.
CQ Timeline

3:56 PM: CQ Finishes  https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15655
             bob-paladin: still running
             caroline-paladin: still running
             cave-paladin: still running
             cyan-paladin: still running
             kip-paladin: still running
             lakitu-paladin: cbuildbot failed
             lumpy-paladin: still running
             quawks-paladin: still running
             reef-paladin: still running
             sentry-paladin: still running
             veyron_minnie-paladin: still running
             wizpig-paladin: still running

3:56 PM: CQ Starts  https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15656
15:57:57: INFO: Cancelling obsolete slave builds.
15:57:58: INFO: Going to cancel buildbucket_ids: [u'8971885629071025728', u'8971885635550670272', u'8971885639105501856', u'8971885640470310512', u'8971885647613666112', u'8971885653510662784', u'8971885654954890448', u'8971885655603180480', u'8971885657192664224']
16:02:27: INFO: Build_name auron_paine-paladin buildbucket_id 8971878478475145024 created_timestamp 1502146947367690
*** snipped long list of builders scheduled
16:03:06: INFO: Build_name lakitu-st-paladin buildbucket_id 8971878438332728288 created_timestamp 1502146985651130


betty-paladin Timeline
16:01:30: build173-m2: build aborted  (assigned builder)
16:02:36: build202-m2: build started  (floating builder)


auron_paine-paladin Timeline
16:01:33: cros-beefy373-c2: build aborted  (assigned builder)
16:02:36: build107-m2: build started  (floating builder)


What happened was that the builds on the assigned builders were cancelled before new builds were scheduled, but the cancellation-plus-reboot interval was longer than the cancellation-to-scheduling interval.

That means the assigned builders weren't available as the new builds started, so the new builds correctly fell back to floating builders.

Looking at the floating builders, they are used for many different builds on a regular basis. As long as all builders are working, this doesn't hurt anything (in fact, it might make the CQ slightly faster). However, if a real builder goes offline, this process can prevent the floating builders from being used to replace it.
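
Roughly, the race from the betty-paladin timeline above (the reboot window below is an assumption; later comments measure ~5-10 minutes depending on hardware):

  from datetime import datetime, timedelta

  def ts(hhmmss):
      # Parse an HH:MM:SS time from the timelines above (the date doesn't matter here).
      return datetime.strptime(hhmmss, '%H:%M:%S')

  cancelled = ts('16:01:30')   # betty-paladin aborted on build173-m2 (assigned builder)
  scheduled = ts('16:02:36')   # new build started on build202-m2 (floating builder)

  # Assumed time for the assigned builder to reboot and reconnect.
  reboot_window = timedelta(minutes=6)

  gap = scheduled - cancelled
  print('cancel-to-schedule gap:', gap)                          # 0:01:06
  print('assigned builder back in time?', gap >= reboot_window)  # False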
Cc: nxia@chromium.org akes...@chromium.org
In the short term, the downed builder that brought this to my attention has been fixed.
nxia@ pointed out that I made this problem much larger by moving most of the sync time from the cbuildbot sync stage into cbuildbot_launch. That reduced the time interval between cancellation and scheduling.

I see these possible fixes:
  a) Cancel aborted builds before the CQ master exits (this covers the common case, not all cases).
  b) Insert a delay into the master cleanup stage after cancelling slaves (sketched below).
  c) Speed up the abort process.
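
For option b), a minimal sketch of the idea (the function and constant names are illustrative, not the actual chromite API):

  import time

  CANCEL_REBOOT_DELAY_SECONDS = 6 * 60  # see the timing measurements later in this bug

  def clean_up_stage(cancel_obsolete_slave_builds):
      cancelled_ids = cancel_obsolete_slave_builds()
      # Only pay the delay when something was actually cancelled, so an idle
      # CleanUp stage stays fast.
      if cancelled_ids:
          time.sleep(CANCEL_REBOOT_DELAY_SECONDS)
      return cancelled_ids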
Labels: -Pri-3 Pri-2
Also, this is a performance issue, since it defeats chroot caching.

Comment 6 by nxia@chromium.org, Aug 9 2017

Looks to me like b) is the best option we have now.

a) Builds may complete before they're canceled by the next master.
b) Buildbucket/Buildbot controls the abort; it might be related to the tick rate.

Comment 7 by nxia@chromium.org, Aug 9 2017

Owner: nxia@chromium.org
Labels: Chase-Pending
Labels: -Pri-2 -Chase-Pending Chase Pri-1
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
Add a sleep after cancelling (only if any builds were cancelled), prior to the buildbucket requests.


Owner: nxia@chromium.org

Comment 11 by nxia@chromium.org, Aug 15 2017

Cc: pprabhu@chromium.org
pprabhu@ suggests only delaying the start of the builds that were canceled by the current CQ master. So we can:

1) In the CleanUp stage, when obsolete slave builds are canceled, record the build_configs of those builds in in-memory metadata.
2) In the ScheduleSlaveStage, schedule the already-completed slaves immediately, sleep x mins, then schedule the slaves recorded in the metadata (sketched below).
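
A rough sketch of that two-step approach (the metadata dict and the helpers are stand-ins, not the real chromite stage classes):

  import time

  RESCHEDULE_DELAY_SECONDS = 6 * 60

  def clean_up_stage(metadata, cancel_obsolete_slave_builds):
      # 1) Remember which build configs were just cancelled.
      metadata['cancelled_slave_configs'] = list(cancel_obsolete_slave_builds())

  def schedule_slaves_stage(metadata, all_slave_configs, schedule_build):
      cancelled = set(metadata.get('cancelled_slave_configs', []))
      # 2) Schedule the slaves that were not cancelled right away...
      for config in all_slave_configs:
          if config not in cancelled:
              schedule_build(config)
      # ...and give the cancelled builders time to reboot before rescheduling them.
      if cancelled:
          time.sleep(RESCHEDULE_DELAY_SECONDS)
          for config in cancelled:
              schedule_build(config)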
You could, but it seems more complex. I'd opt for the simpler solution.

Comment 13 by nxia@chromium.org, Aug 15 2017

I just manually triggered a build and canceled it from buildbucket. The entire process took 5 min 30 s (5 min 20 s for the build to be canceled and 10 s for the GCE builder to restart). So we may need to set the delay to 6 min.
Physical builders take longer to reboot than GCE, and the CQ still contains a mix of the two.

Comment 15 by nxia@chromium.org, Aug 15 2017

So the waiting time will be even longer... Are we ok with this simple solution?
I'm shocked that the buildbucket cancel took that long.

I wonder if it recently got slower?

Comment 17 by nxia@chromium.org, Aug 15 2017

I think it's always like this 
Will buildbucket/swarming affinity solve this?
It uses a different bridge, so performance can be different. In my testing, builds started quickly, but I've never tested cancelling them.

This is the hacked-up CL I used to schedule tryjobs against swarming. There are two GCE buildslaves to test against.


https://chromium-review.googlesource.com/615926

Comment 20 by nxia@chromium.org, Aug 15 2017

I tested again on the bare-metal machines. 

It took 3~4 mins to cancel a build, and 5~6 mins to restart a physical builder. 
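
Back-of-the-envelope totals from those measurements:

  # Worst-case delay implied by the numbers above (illustrative arithmetic only).
  gce_delay   = (5 * 60 + 20) + 10   # cancel ~5m20s + GCE restart ~10s
  metal_delay = (4 * 60) + (6 * 60)  # cancel ~3-4m  + physical reboot ~5-6m
  print(gce_delay / 60.0, metal_delay / 60.0)  # 5.5 10.0 (minutes)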

Do we want to add ~10 mins of extra time?

Comment 21 by nxia@chromium.org, Aug 15 2017

Cc: jkop@chromium.org
Based on a vague memory of a conversation with CrOps, there is a polling loop involved in buildbucket/buildbot cancellation, and propagation of a cancellation is bounded by that polling interval. I *THINK* hinoka@ wrote that code and can speak to its expected performance.

As for after the move to swarming:

When I talked over our affinity requirements with maruel@ in the past, we discussed having the swarming service wait a limited time for a builder with the necessary hot cache to be available. That would be the optimal solution for this problem.

Comment 23 by nxia@chromium.org, Aug 15 2017

what's the current plan for this bug?

1) add sleep (~10 mins) in the CleanUp stage if there are obsolete slave builds from the previous CQ;

OR

2) record canceled builds in the CleanUp stage and only delay the scheduling for the canceled slaves 
I recommend 1) because of its simplicity, but we should follow up and look for ways to shrink the delay.
Cc: d...@chromium.org
10 minutes is a major delay. Is there any way to poll buildbucket or buildbot to determine when these rebooting slaves are back?

If not, I think it would be better to move build cancellation to the end of the prior build than to add a 10-minute startup delay. The extra info that in-progress builds might produce is a nice-to-have, but I don't think it is worth 10 minutes of latency per run.

Comment 26 by nxia@chromium.org, Aug 16 2017

Canceling builds at the end of the completion stage sounds like a better option here: no significant delay, and less confusion for people about the inconsistent statuses of canceled builds.
How difficult is it to do a one-off calculation: what % of not-waited-for builds actually complete before they get cancelled by the next CQ master in the current setup?

If the answer is low, we're really losing nothing by cancelling at the end of the earlier CQ run, and we don't need to discuss this further :)
With an unscientific sample size of one, the time delay between the completion stage finishing and the slaves being scheduled is 13 minutes (between two green builds before 8 AM this morning).

That covers the 10-minute delay with little to spare. If we want to maintain optimal performance, that's when we need to cancel slaves.

Comment 29 by nxia@chromium.org, Aug 16 2017

Owner: jkop@chromium.org
Ok. We'll move the cancel_uncompleted_slaves logic to the completion stage. This also allows the clean_up operation to only abort the slaves triggered by the current master_build_id.

jkop@, you want to pick up the change?

Before change:

A CQ master may trigger a self-destruction (stop waiting for all slaves to complete) in the completion stage.
The next CQ master will run the CleanUp stage to cancel the obsolete slaves from the previous runs.

After change:

No cancel_obsolete_slaves operation in the CleanUp stage.
When a self-destruction is triggered in the CQ master completion stage, also cancel the uncompleted slaves (see the sketch below).
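
A hypothetical sketch of the "after change" flow (helper names are illustrative, not chromite's):

  def completion_stage(slave_statuses, should_self_destruct, cancel_builds):
      # slave_statuses: dict of build_config -> 'completed' / 'running' / ...
      if should_self_destruct(slave_statuses):
          uncompleted = [cfg for cfg, status in slave_statuses.items()
                         if status != 'completed']
          # Cancel immediately so the buildslaves start rebooting during the
          # master's own teardown, well before the next run schedules builds.
          cancel_builds(uncompleted)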

Comment 30 by jkop@chromium.org, Aug 16 2017

Status: Started (was: Assigned)
Sure, can do.
progress this week.

Comment 32 by jkop@chromium.org, Aug 23 2017

Raised in the CL (https://chromium-review.googlesource.com/c/chromiumos/chromite/+/627277):

Do we want to remove the cancellation from cleanup entirely, or leave it in, but presumed to have nothing to cancel?
I think we want cleanup to remain.

If the previous master builder is killed in an unexpected way, the slaves will still be running.

Also, that logic is used for different types of builders, such as release builders.

Comment 34 by jkop@chromium.org, Aug 23 2017

Yeah, that's a compelling pair of reasons, I'll change course.

Comment 35 by nxia@chromium.org, Aug 23 2017

Yeah, dgarrett@ is right. #29 isn't entirely correct in the "after change". We do need to preserve the cancellation in the CleanUp stage in case the previous master builds were aborted in the middle of a run.
Almost done; it's breaking some other tests that use ShouldWait. Expected today or tomorrow.

Comment 37 by bugdroid1@chromium.org, Aug 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/d3ec0592f61e8daa411da39f93db624381e7bec0

commit d3ec0592f61e8daa411da39f93db624381e7bec0
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Tue Aug 29 07:12:22 2017

master-paladin: Cancel slaves on self-destruct

Create a library function CancelBuilds, called on self-destruct of a
 master build to cancel all slave builds.
Refactor CancelObsoleteBuilds to call this as well, for the shared code.
Create mock patching to route around this path in existing tests.
Copy and adapt tests from CancelObsoleteSlaveBuilds to also cover CancelBuilds.

BUG= chromium:753189 
TEST=cbuildbot/run_tests
Change-Id: I71114b3e6871767a1d1d75e00af680817efaeae8
Reviewed-on: https://chromium-review.googlesource.com/627277
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/lib/builder_status_lib.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/cbuildbot/build_status.py
[modify] https://crrev.com/d3ec0592f61e8daa411da39f93db624381e7bec0/lib/builder_status_lib_unittest.py

Comment 38 by jkop@chromium.org, Aug 29 2017

Status: Fixed (was: Started)
Sweet!

Have we verified the fix by looking at behavior in production?
Status: Started (was: Fixed)
I believe this got reverted here: https://chromium-review.googlesource.com/c/chromiumos/chromite/+/642032

So, still going strong... ;)
This is still on the chase list. Do we have any ETA / blockers for relanding #37?

Comment 42 by jkop@chromium.org, Sep 5 2017

It's headed to pre-CQ now.

Comment 43 by bugdroid1@chromium.org, Sep 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/31040ed758becb514db9754baba6f307c3f2f904

commit 31040ed758becb514db9754baba6f307c3f2f904
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Wed Sep 06 23:54:29 2017

master-paladin: Cancel slaves on self-destruct

Create a library function CancelBuilds, called on self-destruct of a
 master build to cancel all slave builds.
Refactor CancelObsoleteBuilds to call this as well, for the shared code.
Create mock patching to route around this path in existing tests.
Copy and adapt tests from CancelObsoleteSlaveBuilds to also cover CancelBuilds.
Add tests for type-checking to ensure JSON serializability of
 buildbucket_ids, preventing the failure that required a revert.

BUG= chromium:753189 
TEST=cbuildbot/run_tests

Change-Id: If45017627c7b22d95af1588ef989ed13794e6452
Reviewed-on: https://chromium-review.googlesource.com/644102
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/buildbucket_lib.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/build_status.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/builder_status_lib.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/lib/builder_status_lib_unittest.py
[modify] https://crrev.com/31040ed758becb514db9754baba6f307c3f2f904/cbuildbot/stages/build_stages.py
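
For reference, the kind of guard the commit message describes, as an illustrative sketch (not the actual buildbucket_lib code):

  import json

  def validate_buildbucket_ids(buildbucket_ids):
      # The revert noted above was caused by ids that could not be
      # JSON-serialized into the cancellation request body.
      for bb_id in buildbucket_ids:
          if not isinstance(bb_id, str):
              raise TypeError('buildbucket_id %r is not a string' % (bb_id,))
      return json.dumps(list(buildbucket_ids))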

Comment 44 by nxia@chromium.org, Sep 8 2017

The master CQ canceled slaves in CommitQueueCompletion, but the logging should also be adjusted.

For example, in https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16165,

cyan-paladin and quawks-paladin were canceled by the CQ master; they didn't fail because of "cbuildbot failed". We could change the message to something like

"[aborted by self-destruction] cyan-paladin"

Comment 45 by jkop@chromium.org, Sep 8 2017

I'll write a followup CL to make the logging more informative.

While it is cancelling builds as it was supposed to, it has not entirely fixed the original root issue. Various builds have been running on the floating builders in the time since the CL landed. (See: https://uberchromegw.corp.google.com/i/chromeos/buildslaves/build107-m2 )

It's possible this will end when other current problems with the CQ pass, but I don't think that's likely. At least some further investigation is needed.

Comment 46 by nxia@chromium.org, Sep 8 2017

Talked to jkop@ offline. CL:644102 did cancel important & running slaves in the self-destructed CQ completion stage, but it didn't cancel the experimental slaves. If the buildslave of an important slave fails, this issue will still happen. Now the plan is to also kill experimental slaves when self-destruction is triggered (see the sketch below).
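
Sketch of that plan (helper names are illustrative, not chromite's):

  def cancel_on_self_destruct(important_slaves, experimental_slaves,
                              completed_configs, cancel_builds):
      # Consider both important and experimental slaves, not just important ones.
      all_scheduled = list(important_slaves) + list(experimental_slaves)
      uncompleted = [cfg for cfg in all_scheduled if cfg not in completed_configs]
      if uncompleted:
          cancel_builds(uncompleted)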
Experimental builders are still hogging buildslaves, CL in flight.
Cc: davidri...@chromium.org dgarr...@chromium.org vapier@chromium.org sosa@chromium.org jrbarnette@chromium.org
Issue 721216 has been merged into this issue.
New attempt expected to land this week.

Comment 50 by bugdroid1@chromium.org, Sep 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/65ecb4e3f8d30d4478591b91436d7b1f35ec480a

commit 65ecb4e3f8d30d4478591b91436d7b1f35ec480a
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Thu Sep 21 08:08:02 2017

Cancel both important and experimental slaves

Change labeling of "scheduled_slaves" to "scheduled_important_slaves"
  everywhere it previously appeared, making them reflect a change a
  couple months ago.
Fetch all slaves, including experimental, when a master is cancelling builds.
Cancel uncompleted experimental slaves when all important builds have
  completed.
Add several tests to verify the old behavior is kept and new behavior
  works.

BUG= chromium:753189 
TEST=tryjob

Change-Id: Ifd73fbdb8d58a27bbc68ce288213ae3fe471530f
Reviewed-on: https://chromium-review.googlesource.com/660881
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/scheduler_stages.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/buildbucket_lib.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/manifest_version_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/scheduler_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/build_status.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/cbuildbot/stages/generic_stages_unittest.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/builder_status_lib.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/constants.py
[modify] https://crrev.com/65ecb4e3f8d30d4478591b91436d7b1f35ec480a/lib/buildbucket_lib_unittest.py

Comment 51 by jkop@chromium.org, Sep 21 2017

That change again attacked the (believed) root cause, but did not fix the symptom. More investigation needed.

Comment 52 by jkop@chromium.org, Sep 25 2017

Status: Fixed (was: Started)
As of now, it appears to have fixed the slow builders problem. The spare capacity builders have been exclusively building tatl-paladin all weekend. I'm tentatively declaring this fixed.

Comment 53 by jkop@chromium.org, Sep 25 2017

Status: Started (was: Fixed)
Disregard that; it appeared fixed on the external waterfall (go/cros-waterfall) but is still broken on the internal one (uberchromegw/i/chromeos/waterfall).
^ link the specific instance of wrong slave being used

Comment 55 by nxia@chromium.org, Sep 25 2017

jkop@'s change has been effectively cancelling slaves (important + experimental) on CQ self-destruction. From what I see, the floating buildslaves pick up experimental builds even when the previous CQ didn't self-destruct. So I think this issue is not specific to self-destruction, though self-destruction is one of the causes. The pick_buildslaves_for_experimental_builds logic in the buildbot recipe might need changes to only run experimental builds on floating builders when their assigned buildslaves are not available.
The real issue is that we start scheduling new builds before the build slaves have restarted and reconnected from the previous build. Aborting earlier was just a reasonable way to try to give them enough time, but it doesn't seem to be enough in all cases.

Using the LUCI scheduler, I can pretty trivially add a delay between when the CQ master finishes and when the next build starts.

Is that a reasonable solution? How much of a delay would be required? 1m? 5m?

An alternative solution is to stop rebooting slaves between builds, but we added the reboots for good reason.

PS: My prediction is that after jkop's changes, the slaves which are slow to start (and thus being replaced by floating builders) are all Golo builders, which have very slow reboot times.

We could also solve this by just removing the floating builders, but any CQ builder outage would become a CQ outage.

Comment 58 by nxia@chromium.org, Sep 25 2017

Re #56, 
"The real issue is that we start scheduling new builds before the build slaves have restarted and reconnected from the previous build."

This is not what I see. For example, build107-m2 picked up guado-accelerator-paladin/builds/456 at Mon Sep 25 13:52:16 2017, but its assigned buildslave has been available since "system boot 2017-09-25 01:47". cros-beefy274-c2 hasn't picked up any guado-accelerator-paladin build since #452, although it has been idle and connected since 01:47 (~12 hours before #456 was picked up).

https://uberchromegw.corp.google.com/i/chromeos/builders/guado-accelerator-paladin/builds/456


So I guess adding the delays between CQ master and the builds doesn't really solve the issue.

Also, as I looked into the history of builds picked up by the floating builders, most of them are experimental slaves, so this is not affecting current CQ performance. I think removing the floating builders would fail the CQ when the buildslaves for important builders fail.

Now the issue should be why the experimental builds are still scheduled on the floating builders even when their assigned buildslaves are idle.
Note: System boot isn't what counts; it's when the slave re-connects to the buildbot master. That means that boot has finished, and that the buildbot client software has updated and made a valid connection.

On the other hand, the GCE builder being idle for 12 hours suggests that it's a buildbot bug.
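
If we want to poll for that instead of eyeballing it, something like this could work, assuming a buildbot 0.8-style JSON status endpoint (the exact URL layout on these masters is an assumption):

  import json
  import urllib.request

  MASTER_URL = 'https://uberchromegw.corp.google.com/i/chromeos'

  def slave_is_connected(slave_name):
      # "connected" reflects an active buildbot connection, not just system boot.
      url = '%s/json/slaves/%s' % (MASTER_URL, slave_name)
      with urllib.request.urlopen(url) as resp:
          return bool(json.load(resp).get('connected'))

  # e.g. slave_is_connected('cros-beefy274-c2')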

Comment 60 by nxia@chromium.org, Sep 26 2017

I'm not sure how to check the exact timestamp of when it was brought up; as you mentioned, "reboot" isn't enough to say it's available. But as I checked and monitored, the GCE machine cros-beefy274-c2 has been idle and connected and didn't pick up any build.

Comment 61 by nxia@chromium.org, Sep 26 2017

According to http://shortn/_EnnwK2I9rs, when nextSlave isn't configured, a random buildslave will be picked up. 

Not sure if https://chrome-internal-review.googlesource.com/#/c/chrome/tools/build/+/462954/ is the right way to fix it.
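
The idea behind that CL, roughly (hedged sketch: the nextSlave hook signature is buildbot 0.8's, and the attribute access and slave names below are assumptions):

  import random

  def make_next_slave(assigned_name):
      def next_slave(builder, available_slavebuilders):
          # Prefer the builder's assigned buildslave...
          assigned = [sb for sb in available_slavebuilders
                      if sb.slave.slavename == assigned_name]
          if assigned:
              return assigned[0]
          # ...and only fall back to whatever is left (e.g. a floating spare)
          # when the assigned slave is unavailable.
          return random.choice(available_slavebuilders) if available_slavebuilders else None
      return next_slave

  # e.g. BuilderConfig(..., nextSlave=make_next_slave('cros-beefy274-c2'))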

Comment 62 by nxia@chromium.org, Sep 27 2017

Looks like the fix in #61 is working. The experimental builds have been scheduled on their assigned buildslaves and the floating buildslaves have been idle since the waterfall restart after CL:*462954 was merged.

Sent out the CL for the chromiumos waterfall at
https://chromium-review.googlesource.com/c/chromium/tools/build/+/686042

Comment 63 by bugdroid1@chromium.org, Sep 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91

commit d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91
Author: Ningning Xia <nxia@google.com>
Date: Wed Sep 27 19:39:49 2017

Set "nextSlave" for experimental builds.

Experimental builds have been randomly scheduled on "floating" and
"assigned" builders. Set "nextSlave" for experimental builders so that
the "assgined" builders have the higher priority to be picked up when
they're available.

BUG= chromium:753189 
TEST=None

Change-Id: I9d089a6c6321c56f6f11cc1541459d42421909b3
Reviewed-on: https://chromium-review.googlesource.com/686042
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Commit-Queue: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d50a6366a8eb7a6ee57965bcdb5836ce6efc1b91/masters/master.chromiumos/master.cfg

Comment 64 by jkop@chromium.org, Sep 27 2017

Status: Verified (was: Started)
