CQ slaves take 10 minutes to all start (due to old and now obsolete fuzzing delay?)
Issue description (by akes...@chromium.org, May 11 2017)

Example build timeline: https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=14538

The slave start times are spread over a window suspiciously close to 10 minutes, and I've seen the same spread when examining other builds. Examining the ScheduleSlaves stage of the master, all the buildbucket requests are sent out in under a minute.

My guess is that a fuzzed 10-minute delay is being introduced by a buildbot recipe or waterfall config, left over from the days (prior to the repo cache) when we were worried about DDoSing GoB if we started all slaves at once.

dnj, I think you may have implemented this for us. Do you recall? Can we turn it off now?
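For context, this kind of start fuzzing is usually just a uniform random sleep before the build begins. A minimal sketch in Python of the suspected behavior (the names and the 600-second window are assumptions, not the actual recipe code):

    import random
    import time

    MAX_START_FUZZ_SEC = 600  # assumed 10-minute window

    def start_with_fuzz(run_build):
        # A uniform fuzz like this would produce roughly the observed
        # ~10-minute spread in slave start times, while avoiding all
        # slaves hitting GoB at the same instant.
        time.sleep(random.uniform(0, MAX_START_FUZZ_SEC))
        run_build()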
Comment, May 16 2017
Comment, Jun 14 2017
Looking at a newer build that's a very simple case (a green build following a green build that ran to full completion): https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=15054

All slaves were scheduled between 12:39:06 and 12:39:42, which explains at most 36 seconds of the variance (not a lot).

Fastest slave start times, x86-generic-paladin (golo builder):
  waterfall: 12:39:24
  cbuildbot_launcher: 12:40:02
  cbuildbot: 12:44:30
  viceroy start: 12:44:31

Slowest slave start times, arm-generic-paladin (GCE builder):
  waterfall start: 12:39:45
  cbuildbot_launcher: 12:40:51
  cbuildbot: 12:50:18
  viceroy start: 12:50:18

The primary difference is PreSync variance of about 4.5 minutes. I see nothing in the logs to indicate why.
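As a sanity check, the launcher-to-cbuildbot deltas can be recomputed directly from the timestamps quoted above (plain Python; the dict keys are just labels for the stages):

    from datetime import datetime

    def t(s):
        return datetime.strptime(s, "%H:%M:%S")

    # Timestamps quoted above for build 15054.
    fast = {"launcher": "12:40:02", "cbuildbot": "12:44:30"}
    slow = {"launcher": "12:40:51", "cbuildbot": "12:50:18"}

    for name, stamps in (("fastest", fast), ("slowest", slow)):
        presync = t(stamps["cbuildbot"]) - t(stamps["launcher"])
        print(name, "launcher -> cbuildbot:", presync)
    # fastest: 0:04:28, slowest: 0:09:27 -- roughly the PreSync
    # variance described above.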
Comment, Jun 14 2017
Looking at the initial example: this one is dirtier. The run failed, followed a failing run, and aborted slaves left over from the previous run.

Scheduling took place between 07:21:54 and 07:22:23 (29 seconds).

Fastest slave, parrot-paladin (golo), which followed a green, non-interrupted build:
  waterfall: 07:22:07
  cbuildbot_launcher: 07:22:27
  cbuildbot: 07:26:17
  viceroy: 07:26:22

Slowest slave, wolf-tot-paladin (GCE), whose previous build was aborted by the current master:
  waterfall: 07:24:23
  cbuildbot_launcher: 07:25:25
  cbuildbot: 07:38:41
  viceroy: 07:38:41

The launcher spent much longer syncing wolf-tot-paladin. It was NOT a clobber sync, and I can't explain the difference from looking at the logs.

There is also a roughly 2:15 delay in waterfall start times. I believe this is because parrot-paladin was not aborted, and so had rebooted before the previous CQ run finished; wolf-tot-paladin was aborted by the master, and so had to reboot before it could pick up the build scheduled on it. The reboot difference will be larger for golo builders (they take much longer to reboot), and could be mitigated by having the master abort them before it shuts down.
Comment, Jun 14 2017
In both of those cases, the sync times varied, but I can't explain why. Will try to use launcher metrics to see how common this is.
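One way to answer the "how common" question from the launcher side is to record a per-builder sync duration. A minimal sketch (timed_sync and the log format are hypothetical, not the actual chromite/launcher metrics code):

    import logging
    import time

    def timed_sync(sync_fn, builder_name):
        # Hypothetical helper: time the repo sync and log it per builder,
        # so sync-duration distributions can be compared across slaves.
        start = time.time()
        sync_fn()
        elapsed = time.time() - start
        logging.info('sync duration for %s: %.1fs', builder_name, elapsed)
        return elapsed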
Comment, Jun 21 2017
If you can push vapier@ to take https://gerrit-review.googlesource.com/#/c/102392/, we can use that to answer questions about the sync being slow.
Comment, Sep 13 2017
I think this delay was caused by incorrect use of the CQ floating build slaves. This is being addressed here: https://crbug.com/753189. So... duping against that.
Comment, Sep 13 2017
Was this because the initial checkout took time, so the BuildStart stage was delayed?
Comment, Sep 13 2017
Yes, and I believe that moving the builds between builders was making that checkout time less predictable. I could be wrong about that.