
Issue 721216 link


Issue metadata

Status: Duplicate
Merged: issue 753189
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




CQ slaves take 10 minutes to all start (due to old and now obsolete fuzzing delay?)

Project Member Reported by akes...@chromium.org, May 11 2017

Issue description

Example build timeline:
https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=14538

The slave start times are spread across a range suspiciously close to 10 minutes. I've seen the same spread when examining other builds.

Examining the ScheduleSlaves stage of the master, it seems all the buildbucket requests are sent out within a minute.

My guess is that some fuzzed 10 minute delay is being introduced by a buildbot recipe or waterfall config, leftover from the days (prior to the repo cache) when we were worried about DDoSing GoB if we started all slaves at once. dnj, I think you may have implemented this for us. Do you recall? Can we turn this off now?
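For reference, the kind of fuzzed start delay I have in mind looks something like this. This is a hypothetical sketch, not the actual recipe code; `fuzzed_checkout` and `MAX_DELAY_SECONDS` are made-up names:

```python
import random
import time

# Hypothetical: each slave sleeps a random amount before its initial
# checkout so that N slaves don't all hammer GoB at the same instant.
MAX_DELAY_SECONDS = 600  # a 0-600s jitter would explain a ~10 minute spread


def fuzzed_checkout(checkout, max_delay=MAX_DELAY_SECONDS, rng=random.random):
    """Sleep a random fraction of max_delay, then run the checkout step."""
    time.sleep(rng() * max_delay)
    return checkout()
```

With the repo cache in place, removing the jitter (setting `max_delay` to 0) would collapse the start-time spread to just the scheduling variance.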
 
Owner: dgarr...@chromium.org
On further inspection, I think this random delay is due to the InitialCheckout phase, which is part of cbuildbot_launch.

-> Don for ideas on making it faster.

Comment 2 by aut...@google.com, May 16 2017

Labels: -current-issue
Looking at a newer build that's a very simple case (a green build following a green build that ran to full completion).

https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=15054

All slaves were scheduled between 12:39:06 and 12:39:42. This explains up to 35 seconds of variance (not a lot).

Fastest slave start times: x86-generic-paladin (golo builder)
  waterfall:          12:39:24
  cbuildbot_launcher: 12:40:02
  cbuildbot:          12:44:30
  viceroy start:      12:44:31


Slowest slave start times: arm-generic-paladin (GCE builder)
  waterfall start:    12:39:45
  cbuildbot_launcher: 12:40:51
  cbuildbot:          12:50:18
  viceroy start:      12:50:18
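The sync gap can be checked with a little timestamp arithmetic (timestamps copied from the tables above; "sync" here is my shorthand for the cbuildbot_launcher-to-cbuildbot gap, not an official stage name):

```python
from datetime import datetime


def hms(s):
    """Parse an HH:MM:SS wall-clock timestamp."""
    return datetime.strptime(s, "%H:%M:%S")


# x86-generic-paladin (fastest) vs arm-generic-paladin (slowest).
fast_sync = hms("12:44:30") - hms("12:40:02")  # launcher -> cbuildbot
slow_sync = hms("12:50:18") - hms("12:40:51")

print(fast_sync)              # 0:04:28
print(slow_sync)              # 0:09:27
print(slow_sync - fast_sync)  # 0:04:59, roughly the PreSync variance noted below
```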

The primary difference is PreSync variance of about 4.5 minutes. I see nothing in the logs to indicate why.

Looking at the initial example:

This one is dirtier. The run failed, followed a failing run, and aborted slaves left over from the previous run.

Scheduling took between 07:21:54 and 07:22:23 (29 seconds)

Fastest slave: parrot-paladin (golo). Followed a green, non-interrupted build.
  waterfall:          07:22:07
  cbuildbot_launcher: 07:22:27
  cbuildbot:          07:26:17
  viceroy:            07:26:22

Slowest slave: wolf-tot-paladin (GCE). Previous build aborted by the current master.
  waterfall:          07:24:23
  cbuildbot_launcher: 07:25:25
  cbuildbot:          07:38:41
  viceroy:            07:38:41


The launcher spent much longer syncing wolf-tot-paladin. It was NOT a clobber sync, and I can't explain the difference from looking at the logs.
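The same arithmetic on this dirtier build (timestamps from the tables above) shows how lopsided the sync times were:

```python
from datetime import datetime


def delta(start, end):
    """Gap between two HH:MM:SS wall-clock timestamps on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


parrot_sync = delta("07:22:27", "07:26:17")  # launcher -> cbuildbot
wolf_sync = delta("07:25:25", "07:38:41")

print(parrot_sync)  # 0:03:50
print(wolf_sync)    # 0:13:16, despite being a non-clobber sync
```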

There is also a 2:15 delay in waterfall start times. I believe this is because parrot-paladin was not aborted, and so had rebooted before the previous CQ run finished. wolf-tot-paladin was aborted by the master, and so had to reboot before it could pick up the build scheduled on it.

The reboot differences will be larger for golo builders (they take much longer to reboot), and could be mitigated by having the master abort them before it shuts down.
In both of those cases, the sync times varied, but I can't explain why. Will try to use launcher metrics to see how common this is.
Cc: vapier@chromium.org
If you can push vapier@ to take https://gerrit-review.googlesource.com/#/c/102392/, we can use that to answer questions about the sync being slow.
Mergedinto: 753189
Status: Duplicate (was: Untriaged)
I think this delay was caused by incorrect use of the CQ floating build slaves. This is being addressed here:

 https://crbug.com/753189 

So... duping against that.

Comment 8 by nxia@chromium.org, Sep 13 2017

Was this because the initial checkout took time, so the BuildStart stage was delayed?
Yes, and I believe that moving the builds between builders was making that checkout time less predictable. I could be wrong about that.
