Many timeouts starting PreCQ builds. |
|||||||||
Issue descriptionWe've been seeing a lot of timeouts starting PreCQ builders. An example: https://chromium-review.googlesource.com/c/457845/ We were not able to launch a lumpy-no-vmtest-pre-cq trybot for your change within 30 minutes. This problem can happen if the trybot waterfall is very busy, or if there is an infrastructure issue. Please notify the sheriff and mark your change as ready again. If this problem occurs multiple times in a row, please file a bug. Commit queue documentation: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium-os/commit-queue-overview
,
Mar 29 2017
xixuan is in the process of adding additional builders (both physical and GCE) to the PreCQ pools.
,
Mar 29 2017
,
Mar 29 2017
I'll point out that we changed many of our PreCQ builders to run VM testing a while back, which would have slowed them down quite a bit.
,
Mar 29 2017
,
Mar 29 2017
I feel like it's not just a pure scheduling issue. I think that there is some state that gets stuck after a timeout, and something needs to be restarted or the change needs to be updated to proceed. I was seeing failures across every pre-cq group, including when I tried to start trybots at midnight PDT.
,
Mar 29 2017
I believe that on timeout, we treat it as a bad CL and clear flags to prevent the PreCQ from trying it again until they are manually reset. But we saw that all of the builders were in use when this was happening. Assuming that they were finishing in a reasonable time, the timeouts happened because of a capacity issue.
,
Mar 29 2017
mysql> select date(finish_time), build_config, sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'caroline-pre-cq' and waterfall = 'chromiumos.tryserver' and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time); +-------------------+-----------------+----------+----------+ | date(finish_time) | build_config | avg time | count(*) | +-------------------+-----------------+----------+----------+ | 2017-03-08 | caroline-pre-cq | 01:41:43 | 94 | | 2017-03-09 | caroline-pre-cq | 01:39:30 | 84 | | 2017-03-10 | caroline-pre-cq | 01:42:16 | 78 | | 2017-03-11 | caroline-pre-cq | 01:38:44 | 46 | | 2017-03-12 | caroline-pre-cq | 02:01:41 | 15 | | 2017-03-13 | caroline-pre-cq | 01:57:43 | 64 | | 2017-03-14 | caroline-pre-cq | 01:35:36 | 93 | | 2017-03-15 | caroline-pre-cq | 01:39:33 | 92 | | 2017-03-16 | caroline-pre-cq | 01:36:26 | 83 | | 2017-03-17 | caroline-pre-cq | 01:35:36 | 89 | | 2017-03-18 | caroline-pre-cq | 01:45:39 | 38 | | 2017-03-19 | caroline-pre-cq | 01:39:26 | 5 | | 2017-03-20 | caroline-pre-cq | 01:47:51 | 61 | | 2017-03-21 | caroline-pre-cq | 01:40:56 | 92 | | 2017-03-22 | caroline-pre-cq | 01:34:53 | 96 | | 2017-03-23 | caroline-pre-cq | 01:39:40 | 98 | | 2017-03-24 | caroline-pre-cq | 01:43:44 | 61 | | 2017-03-25 | caroline-pre-cq | 01:38:27 | 35 | | 2017-03-26 | caroline-pre-cq | 02:06:03 | 9 | | 2017-03-27 | caroline-pre-cq | 01:32:41 | 61 | | 2017-03-28 | caroline-pre-cq | 01:40:42 | 97 | | 2017-03-29 | caroline-pre-cq | 01:31:48 | 62 | +-------------------+-----------------+----------+----------+ 22 rows in set (0.73 sec) mysql> select date(finish_time), build_config, sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'samus-no-vmtest-pre-cq' and waterfall = 'chromiumos.tryserver' and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time); +-------------------+------------------------+----------+----------+ | date(finish_time) | build_config | avg time | count(*) | +-------------------+------------------------+----------+----------+ | 2017-03-08 | samus-no-vmtest-pre-cq | 01:00:48 | 93 | | 2017-03-09 | samus-no-vmtest-pre-cq | 00:56:55 | 84 | | 2017-03-10 | samus-no-vmtest-pre-cq | 00:59:08 | 78 | | 2017-03-11 | samus-no-vmtest-pre-cq | 00:57:27 | 39 | | 2017-03-12 | samus-no-vmtest-pre-cq | 01:24:09 | 15 | | 2017-03-13 | samus-no-vmtest-pre-cq | 01:25:12 | 68 | | 2017-03-14 | samus-no-vmtest-pre-cq | 01:01:26 | 93 | | 2017-03-15 | samus-no-vmtest-pre-cq | 00:57:37 | 88 | | 2017-03-16 | samus-no-vmtest-pre-cq | 00:54:32 | 87 | | 2017-03-17 | samus-no-vmtest-pre-cq | 00:56:49 | 93 | | 2017-03-18 | samus-no-vmtest-pre-cq | 01:00:59 | 27 | | 2017-03-19 | samus-no-vmtest-pre-cq | 01:10:44 | 5 | | 2017-03-20 | samus-no-vmtest-pre-cq | 01:00:46 | 66 | | 2017-03-21 | samus-no-vmtest-pre-cq | 00:58:29 | 87 | | 2017-03-22 | samus-no-vmtest-pre-cq | 00:54:54 | 95 | | 2017-03-23 | samus-no-vmtest-pre-cq | 00:59:14 | 99 | | 2017-03-24 | samus-no-vmtest-pre-cq | 00:59:41 | 66 | | 2017-03-25 | samus-no-vmtest-pre-cq | 00:57:05 | 28 | | 2017-03-26 | samus-no-vmtest-pre-cq | 01:41:42 | 9 | | 2017-03-27 | samus-no-vmtest-pre-cq | 00:56:29 | 64 | | 2017-03-28 | samus-no-vmtest-pre-cq | 01:00:03 | 98 | | 2017-03-29 | samus-no-vmtest-pre-cq | 00:51:40 | 61 | +-------------------+------------------------+----------+----------+
,
Mar 29 2017
Both cheets boards, right? Looks like we have a cheets performance issue in the PreCQ. Especially the samus-no-vmtests-pre-cq config. I'm going to split that off into a new bug. Adding capacity, and being better at handling load spike doesn't hurt us, and costs very little.
,
Mar 29 2017
Thank you for the numbers!
,
Mar 29 2017
,
Mar 29 2017
Re: c#7: I was manually resetting that change multiple times -- I question the likelihood that every single one of the build configs every single time I tried it (including after midnight) was unavailable. Re: c#9: I'm not sure if I agree with that statement? I'm not sure of the target time and caroline is worse than samus. mysql> select date(finish_time), build_config, sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'binhost-pre-cq' and waterfall = 'chromiumos.tryserver' and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time); +-------------------+----------------+----------+----------+ | date(finish_time) | build_config | avg time | count(*) | +-------------------+----------------+----------+----------+ | 2017-03-08 | binhost-pre-cq | 00:28:07 | 44 | | 2017-03-09 | binhost-pre-cq | 00:27:02 | 34 | | 2017-03-10 | binhost-pre-cq | 00:26:40 | 29 | | 2017-03-11 | binhost-pre-cq | 00:24:24 | 18 | | 2017-03-12 | binhost-pre-cq | 00:25:57 | 7 | | 2017-03-13 | binhost-pre-cq | 00:26:16 | 30 | | 2017-03-14 | binhost-pre-cq | 00:27:36 | 54 | | 2017-03-15 | binhost-pre-cq | 00:26:12 | 42 | | 2017-03-16 | binhost-pre-cq | 00:28:02 | 39 | | 2017-03-17 | binhost-pre-cq | 00:26:20 | 38 | | 2017-03-18 | binhost-pre-cq | 00:25:42 | 11 | | 2017-03-19 | binhost-pre-cq | 00:26:47 | 1 | | 2017-03-20 | binhost-pre-cq | 00:28:41 | 33 | | 2017-03-21 | binhost-pre-cq | 00:28:07 | 32 | | 2017-03-22 | binhost-pre-cq | 00:27:21 | 44 | | 2017-03-23 | binhost-pre-cq | 00:29:11 | 32 | | 2017-03-24 | binhost-pre-cq | 00:28:56 | 40 | | 2017-03-25 | binhost-pre-cq | 00:24:23 | 8 | | 2017-03-26 | binhost-pre-cq | 00:28:34 | 3 | | 2017-03-27 | binhost-pre-cq | 00:28:01 | 42 | | 2017-03-28 | binhost-pre-cq | 00:28:07 | 42 | | 2017-03-29 | binhost-pre-cq | 00:26:48 | 25 | +-------------------+----------------+----------+----------+ mysql> select date(finish_time), build_config, sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'daisy_spring-no-vmtest-pre-cq' and waterfall = 'chromiumos.tryserver' and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time); +-------------------+-------------------------------+----------+----------+ | date(finish_time) | build_config | avg time | count(*) | +-------------------+-------------------------------+----------+----------+ | 2017-03-08 | daisy_spring-no-vmtest-pre-cq | 00:55:58 | 94 | | 2017-03-09 | daisy_spring-no-vmtest-pre-cq | 00:53:04 | 84 | | 2017-03-10 | daisy_spring-no-vmtest-pre-cq | 00:54:39 | 79 | | 2017-03-11 | daisy_spring-no-vmtest-pre-cq | 00:53:30 | 42 | | 2017-03-12 | daisy_spring-no-vmtest-pre-cq | 01:01:42 | 13 | | 2017-03-13 | daisy_spring-no-vmtest-pre-cq | 00:59:05 | 66 | | 2017-03-14 | daisy_spring-no-vmtest-pre-cq | 00:54:51 | 97 | | 2017-03-15 | daisy_spring-no-vmtest-pre-cq | 00:54:00 | 87 | | 2017-03-16 | daisy_spring-no-vmtest-pre-cq | 00:54:19 | 83 | | 2017-03-17 | daisy_spring-no-vmtest-pre-cq | 00:52:59 | 90 | | 2017-03-18 | daisy_spring-no-vmtest-pre-cq | 00:54:13 | 25 | | 2017-03-19 | daisy_spring-no-vmtest-pre-cq | 00:55:37 | 6 | | 2017-03-20 | daisy_spring-no-vmtest-pre-cq | 00:53:48 | 66 | | 2017-03-21 | daisy_spring-no-vmtest-pre-cq | 00:53:12 | 86 | | 2017-03-22 | daisy_spring-no-vmtest-pre-cq | 00:53:31 | 86 | | 2017-03-23 | daisy_spring-no-vmtest-pre-cq | 00:52:45 | 101 | | 2017-03-24 | daisy_spring-no-vmtest-pre-cq | 00:54:12 | 66 | | 2017-03-25 | daisy_spring-no-vmtest-pre-cq | 00:52:41 | 26 | | 2017-03-26 | daisy_spring-no-vmtest-pre-cq | 00:59:42 | 9 | | 2017-03-27 | daisy_spring-no-vmtest-pre-cq | 00:51:36 | 65 | | 2017-03-28 | daisy_spring-no-vmtest-pre-cq | 00:52:08 | 100 | | 2017-03-29 | daisy_spring-no-vmtest-pre-cq | 00:48:49 | 60 | +-------------------+-------------------------------+----------+----------+
,
Mar 29 2017
binhost is a specialized test, and expected to be faster.
,
Mar 29 2017
Here's chromiumos.tryserver times in the last 3 weeks for build configs with more than 10 builds. mysql> select build_config, sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where waterfall = 'chromiumos.tryserver' and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by build_config having count(*) > 10; +-------------------------------+----------+----------+ | build_config | avg time | count(*) | +-------------------------------+----------+----------+ | amd64-generic-pre-cq | 01:04:04 | 139 | | arm-generic-pre-cq | 00:52:43 | 129 | | binhost-pre-cq | 00:27:32 | 654 | | caroline-pre-cq | 01:40:15 | 1455 | | chell-no-vmtest-pre-cq | 01:28:57 | 115 | | cyan-pre-cq | 01:18:49 | 38 | | daisy_spring-no-vmtest-pre-cq | 00:53:48 | 1436 | | gale-pre-cq | 00:43:23 | 13 | | glados-pre-cq | 01:03:36 | 64 | | gru-no-vmtest-pre-cq | 01:15:54 | 115 | | gru-pre-cq | 01:16:15 | 140 | | kernel-3_14-a-pre-cq | 01:25:38 | 46 | | kernel-3_14-b-pre-cq | 00:57:45 | 47 | | kernel-3_14-c-pre-cq | 00:59:01 | 47 | | lakitu-pre-cq | 01:23:29 | 22 | | lumpy-no-vmtest-pre-cq | 00:55:51 | 1432 | | nyan-no-vmtest-pre-cq | 00:56:34 | 12 | | oak-pre-cq | 01:00:04 | 66 | | rambi-no-vmtest-pre-cq | 01:18:24 | 1417 | | rambi-pre-cq | 01:24:25 | 16 | | reef-no-vmtest-pre-cq | 01:02:09 | 112 | | reef-pre-cq | 01:07:17 | 132 | | rowan-pre-cq | 00:57:38 | 12 | | samus-cheets-no-vmtest-pre-cq | 01:19:26 | 35 | | samus-no-vmtest-pre-cq | 00:59:54 | 1448 | | samus-pre-cq | 01:00:19 | 53 | | sand-pre-cq | 00:52:36 | 17 | | smaug-pre-cq | 00:54:43 | 62 | | strago-no-vmtest-pre-cq | 00:59:49 | 118 | | strago-pre-cq | 00:59:34 | 65 | | veyron_minnie-cheets-pre-cq | 01:00:44 | 45 | | whirlwind-no-vmtest-pre-cq | 00:54:44 | 1431 | | x86-alex-no-vmtest-pre-cq | 00:58:44 | 1433 | +-------------------------------+----------+----------+
,
Mar 29 2017
So.. the overall average is about 60 minutes, with a few outliers. That means the CL I put up is mostly correctly and will nearly double our capacity when dealing with brief spikes. I also think it's appropriate to look at why the slowest builders are slow to see if we can do anything to speed them up. But he numbers put them in perspective and show they aren't THAT much slower.
,
Mar 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/aa1f841d8df44335321ee1d4bd40ee3e9df4abbd commit aa1f841d8df44335321ee1d4bd40ee3e9df4abbd Author: Don Garrett <dgarrett@google.com> Date: Thu Mar 30 00:17:13 2017 PreCQLauncher: Increase launch timeout from 30 -> 90 minutes. We have been seeing a lot of PreCQ jobs timeout while waiting to launch. We are adding more builders, but since we've hit the limit more than once recently. Spot checking (no easy access to hard metrics yet) shows that most PreCQ builds take ~60 minutes. That means that none of them can finish before the 30 minute timeout is hit. So... if we have capacity for 10 builds, and submit 12 builds, 2 will timeout. By making the timeout longer than the average build time, we can handle brief load spikes approaching double our builder capacity. My suspicion is that we see such spikes in the morning, or right after a CQ run with many CLs fails (because developers resubmit them quickly). BUG= chromium:706487 TEST=run_tests Change-Id: I0ca915defe27be17b884fe5e52149e53a25fed52 Reviewed-on: https://chromium-review.googlesource.com/462375 Commit-Ready: Don Garrett <dgarrett@chromium.org> Tested-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/aa1f841d8df44335321ee1d4bd40ee3e9df4abbd/cbuildbot/stages/sync_stages.py
,
Mar 30 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/f7818a9a057188fb46f034323c27950f810c23e5 commit f7818a9a057188fb46f034323c27950f810c23e5 Author: xixuan <xixuan@chromium.org> Date: Thu Mar 30 04:42:59 2017
,
Mar 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/cd955c1550db862cac4128605554b04a582f2f29 commit cd955c1550db862cac4128605554b04a582f2f29 Author: xixuan <xixuan@chromium.org> Date: Thu Mar 30 21:27:20 2017 Add new GCEs to the Pre-CQ slave pool. BUG= 706487 TEST=None Change-Id: Ife51d3708d8d502ea37948a48c60135192809290 Reviewed-on: https://chromium-review.googlesource.com/462337 Reviewed-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: Ryan Tseng <hinoka@chromium.org> Commit-Queue: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/cd955c1550db862cac4128605554b04a582f2f29/masters/master.chromiumos.tryserver/slave_pool.json [modify] https://crrev.com/cd955c1550db862cac4128605554b04a582f2f29/masters/master.chromiumos.tryserver/slaves.cfg
,
Apr 4 2017
,
Apr 11 2017
More occurred with the 90 minute timeout. https://chromium-review.googlesource.com/c/474189/
,
Apr 11 2017
We've added the additional 20 physical and 20 GCE builders, and increased the timeout spent waiting for a job to start from 30 to 90 minutes. Further thoughts.... 1) We need metrics to understand usage. 2) If the PreCQ launcher knows we are overloaded, it can avoid speculative PreCQ runs. 3) Someone should investigate why usage is spiking in depth. CIDB should have the numbers. 4) We should prepare a demand estimate before throwing more hardware at this, and throw hardware based on usage expectations.
,
Apr 11 2017
Numbers that would be useful to have graphed over time. 1) Number of in-flight PreCQ builds. 2) Number of scheduled, but not started PreCQ builds. 3) Split of build configs being built. 4) Average build time, by build config. 5) Reasons for scheduling. a) CQ ready, but PreCQ required. b) Speculative. c) Ready, but validation expired after a week. d) Retrying PreCQ without new CL revision. 6) Number of in-flight builds cancelled by PreCQ launcher.
,
Apr 11 2017
The first two should give us enough to estimate required capacity. The rest are to see efficiency problems.
,
Apr 11 2017
Bug about adding such a dashboard: crbug.com/707348
,
Jun 19 2017
This is an undead bug about a specific incident. If there are TODOs that would alleviate the effects of such an incident, file 'em separately. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by dgarr...@chromium.org
, Mar 29 2017