New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 706487 link

Starred by 5 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 706508



Sign in to add a comment

Many timeouts starting PreCQ builds.

Project Member Reported by dgarr...@chromium.org, Mar 29 2017

Issue description

We've been seeing a lot of timeouts starting PreCQ builders.

An example:

https://chromium-review.googlesource.com/c/457845/

We were not able to launch a lumpy-no-vmtest-pre-cq trybot for your change within 30 minutes.

This problem can happen if the trybot waterfall is very busy, or if there is an infrastructure issue. Please notify the sheriff and mark your change as ready again. If this problem occurs multiple times in a row, please file a bug.

Commit queue documentation: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium-os/commit-queue-overview
 
When manually investigating, I saw that all PreCQ builders were in use.
xixuan is in the process of adding additional builders (both physical and GCE) to the PreCQ pools.
Status: Started (was: Untriaged)
I'll point out that we changed many of our PreCQ builders to run VM testing a while back, which would have slowed them down quite a bit.
Cc: akes...@chromium.org
I feel like it's not just a pure scheduling issue.  I think that there is some state that gets stuck after a timeout, and something needs to be restarted or the change needs to be updated to proceed.  I was seeing failures across every pre-cq group, including when I tried to start trybots at midnight PDT.
I believe that on timeout, we treat it as a bad CL and clear flags to prevent the PreCQ from trying it again until they are manually reset.

But we saw that all of the builders were in use when this was happening. Assuming that they were finishing in a reasonable time, the timeouts happened because of a capacity issue.
mysql> select date(finish_time), build_config,  sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'caroline-pre-cq' and waterfall = 'chromiumos.tryserver'  and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time);
+-------------------+-----------------+----------+----------+
| date(finish_time) | build_config    | avg time | count(*) |
+-------------------+-----------------+----------+----------+
| 2017-03-08        | caroline-pre-cq | 01:41:43 |       94 |
| 2017-03-09        | caroline-pre-cq | 01:39:30 |       84 |
| 2017-03-10        | caroline-pre-cq | 01:42:16 |       78 |
| 2017-03-11        | caroline-pre-cq | 01:38:44 |       46 |
| 2017-03-12        | caroline-pre-cq | 02:01:41 |       15 |
| 2017-03-13        | caroline-pre-cq | 01:57:43 |       64 |
| 2017-03-14        | caroline-pre-cq | 01:35:36 |       93 |
| 2017-03-15        | caroline-pre-cq | 01:39:33 |       92 |
| 2017-03-16        | caroline-pre-cq | 01:36:26 |       83 |
| 2017-03-17        | caroline-pre-cq | 01:35:36 |       89 |
| 2017-03-18        | caroline-pre-cq | 01:45:39 |       38 |
| 2017-03-19        | caroline-pre-cq | 01:39:26 |        5 |
| 2017-03-20        | caroline-pre-cq | 01:47:51 |       61 |
| 2017-03-21        | caroline-pre-cq | 01:40:56 |       92 |
| 2017-03-22        | caroline-pre-cq | 01:34:53 |       96 |
| 2017-03-23        | caroline-pre-cq | 01:39:40 |       98 |
| 2017-03-24        | caroline-pre-cq | 01:43:44 |       61 |
| 2017-03-25        | caroline-pre-cq | 01:38:27 |       35 |
| 2017-03-26        | caroline-pre-cq | 02:06:03 |        9 |
| 2017-03-27        | caroline-pre-cq | 01:32:41 |       61 |
| 2017-03-28        | caroline-pre-cq | 01:40:42 |       97 |
| 2017-03-29        | caroline-pre-cq | 01:31:48 |       62 |
+-------------------+-----------------+----------+----------+
22 rows in set (0.73 sec)

mysql> select date(finish_time), build_config,  sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'samus-no-vmtest-pre-cq' and waterfall = 'chromiumos.tryserver'  and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time);
+-------------------+------------------------+----------+----------+
| date(finish_time) | build_config           | avg time | count(*) |
+-------------------+------------------------+----------+----------+
| 2017-03-08        | samus-no-vmtest-pre-cq | 01:00:48 |       93 |
| 2017-03-09        | samus-no-vmtest-pre-cq | 00:56:55 |       84 |
| 2017-03-10        | samus-no-vmtest-pre-cq | 00:59:08 |       78 |
| 2017-03-11        | samus-no-vmtest-pre-cq | 00:57:27 |       39 |
| 2017-03-12        | samus-no-vmtest-pre-cq | 01:24:09 |       15 |
| 2017-03-13        | samus-no-vmtest-pre-cq | 01:25:12 |       68 |
| 2017-03-14        | samus-no-vmtest-pre-cq | 01:01:26 |       93 |
| 2017-03-15        | samus-no-vmtest-pre-cq | 00:57:37 |       88 |
| 2017-03-16        | samus-no-vmtest-pre-cq | 00:54:32 |       87 |
| 2017-03-17        | samus-no-vmtest-pre-cq | 00:56:49 |       93 |
| 2017-03-18        | samus-no-vmtest-pre-cq | 01:00:59 |       27 |
| 2017-03-19        | samus-no-vmtest-pre-cq | 01:10:44 |        5 |
| 2017-03-20        | samus-no-vmtest-pre-cq | 01:00:46 |       66 |
| 2017-03-21        | samus-no-vmtest-pre-cq | 00:58:29 |       87 |
| 2017-03-22        | samus-no-vmtest-pre-cq | 00:54:54 |       95 |
| 2017-03-23        | samus-no-vmtest-pre-cq | 00:59:14 |       99 |
| 2017-03-24        | samus-no-vmtest-pre-cq | 00:59:41 |       66 |
| 2017-03-25        | samus-no-vmtest-pre-cq | 00:57:05 |       28 |
| 2017-03-26        | samus-no-vmtest-pre-cq | 01:41:42 |        9 |
| 2017-03-27        | samus-no-vmtest-pre-cq | 00:56:29 |       64 |
| 2017-03-28        | samus-no-vmtest-pre-cq | 01:00:03 |       98 |
| 2017-03-29        | samus-no-vmtest-pre-cq | 00:51:40 |       61 |
+-------------------+------------------------+----------+----------+
Both cheets boards, right?

Looks like we have a cheets performance issue in the PreCQ. Especially the samus-no-vmtests-pre-cq config.

I'm going to split that off into a new bug. Adding capacity, and being better at handling load spike doesn't hurt us, and costs very little.
Thank you for the numbers!
Blockedon: 706508
Re: c#7:

I was manually resetting that change multiple times -- I question the likelihood that every single one of the build configs every single time I tried it (including after midnight) was unavailable.

Re: c#9:

I'm not sure if I agree with that statement?  I'm not sure of the target time and caroline is worse than samus.

mysql> select date(finish_time), build_config,  sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'binhost-pre-cq' and waterfall = 'chromiumos.tryserver'  and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time);
+-------------------+----------------+----------+----------+
| date(finish_time) | build_config   | avg time | count(*) |
+-------------------+----------------+----------+----------+
| 2017-03-08        | binhost-pre-cq | 00:28:07 |       44 |
| 2017-03-09        | binhost-pre-cq | 00:27:02 |       34 |
| 2017-03-10        | binhost-pre-cq | 00:26:40 |       29 |
| 2017-03-11        | binhost-pre-cq | 00:24:24 |       18 |
| 2017-03-12        | binhost-pre-cq | 00:25:57 |        7 |
| 2017-03-13        | binhost-pre-cq | 00:26:16 |       30 |
| 2017-03-14        | binhost-pre-cq | 00:27:36 |       54 |
| 2017-03-15        | binhost-pre-cq | 00:26:12 |       42 |
| 2017-03-16        | binhost-pre-cq | 00:28:02 |       39 |
| 2017-03-17        | binhost-pre-cq | 00:26:20 |       38 |
| 2017-03-18        | binhost-pre-cq | 00:25:42 |       11 |
| 2017-03-19        | binhost-pre-cq | 00:26:47 |        1 |
| 2017-03-20        | binhost-pre-cq | 00:28:41 |       33 |
| 2017-03-21        | binhost-pre-cq | 00:28:07 |       32 |
| 2017-03-22        | binhost-pre-cq | 00:27:21 |       44 |
| 2017-03-23        | binhost-pre-cq | 00:29:11 |       32 |
| 2017-03-24        | binhost-pre-cq | 00:28:56 |       40 |
| 2017-03-25        | binhost-pre-cq | 00:24:23 |        8 |
| 2017-03-26        | binhost-pre-cq | 00:28:34 |        3 |
| 2017-03-27        | binhost-pre-cq | 00:28:01 |       42 |
| 2017-03-28        | binhost-pre-cq | 00:28:07 |       42 |
| 2017-03-29        | binhost-pre-cq | 00:26:48 |       25 |
+-------------------+----------------+----------+----------+

mysql> select date(finish_time), build_config,  sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where build_config = 'daisy_spring-no-vmtest-pre-cq' and waterfall = 'chromiumos.tryserver'  and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by date(finish_time);
+-------------------+-------------------------------+----------+----------+
| date(finish_time) | build_config                  | avg time | count(*) |
+-------------------+-------------------------------+----------+----------+
| 2017-03-08        | daisy_spring-no-vmtest-pre-cq | 00:55:58 |       94 |
| 2017-03-09        | daisy_spring-no-vmtest-pre-cq | 00:53:04 |       84 |
| 2017-03-10        | daisy_spring-no-vmtest-pre-cq | 00:54:39 |       79 |
| 2017-03-11        | daisy_spring-no-vmtest-pre-cq | 00:53:30 |       42 |
| 2017-03-12        | daisy_spring-no-vmtest-pre-cq | 01:01:42 |       13 |
| 2017-03-13        | daisy_spring-no-vmtest-pre-cq | 00:59:05 |       66 |
| 2017-03-14        | daisy_spring-no-vmtest-pre-cq | 00:54:51 |       97 |
| 2017-03-15        | daisy_spring-no-vmtest-pre-cq | 00:54:00 |       87 |
| 2017-03-16        | daisy_spring-no-vmtest-pre-cq | 00:54:19 |       83 |
| 2017-03-17        | daisy_spring-no-vmtest-pre-cq | 00:52:59 |       90 |
| 2017-03-18        | daisy_spring-no-vmtest-pre-cq | 00:54:13 |       25 |
| 2017-03-19        | daisy_spring-no-vmtest-pre-cq | 00:55:37 |        6 |
| 2017-03-20        | daisy_spring-no-vmtest-pre-cq | 00:53:48 |       66 |
| 2017-03-21        | daisy_spring-no-vmtest-pre-cq | 00:53:12 |       86 |
| 2017-03-22        | daisy_spring-no-vmtest-pre-cq | 00:53:31 |       86 |
| 2017-03-23        | daisy_spring-no-vmtest-pre-cq | 00:52:45 |      101 |
| 2017-03-24        | daisy_spring-no-vmtest-pre-cq | 00:54:12 |       66 |
| 2017-03-25        | daisy_spring-no-vmtest-pre-cq | 00:52:41 |       26 |
| 2017-03-26        | daisy_spring-no-vmtest-pre-cq | 00:59:42 |        9 |
| 2017-03-27        | daisy_spring-no-vmtest-pre-cq | 00:51:36 |       65 |
| 2017-03-28        | daisy_spring-no-vmtest-pre-cq | 00:52:08 |      100 |
| 2017-03-29        | daisy_spring-no-vmtest-pre-cq | 00:48:49 |       60 |
+-------------------+-------------------------------+----------+----------+
binhost is a specialized test, and expected to be faster.
Here's chromiumos.tryserver times in the last 3 weeks for build configs with more than 10 builds.

mysql> select  build_config,  sec_to_time(avg(time_to_sec(timediff(finish_time, start_time)))) as 'avg time', count(*) from buildTable where waterfall = 'chromiumos.tryserver'  and final = 1 and last_updated > date_sub(current_date(), interval 21 day) group by build_config having count(*) > 10;
+-------------------------------+----------+----------+
| build_config                  | avg time | count(*) |
+-------------------------------+----------+----------+
| amd64-generic-pre-cq          | 01:04:04 |      139 |
| arm-generic-pre-cq            | 00:52:43 |      129 |
| binhost-pre-cq                | 00:27:32 |      654 |
| caroline-pre-cq               | 01:40:15 |     1455 |
| chell-no-vmtest-pre-cq        | 01:28:57 |      115 |
| cyan-pre-cq                   | 01:18:49 |       38 |
| daisy_spring-no-vmtest-pre-cq | 00:53:48 |     1436 |
| gale-pre-cq                   | 00:43:23 |       13 |
| glados-pre-cq                 | 01:03:36 |       64 |
| gru-no-vmtest-pre-cq          | 01:15:54 |      115 |
| gru-pre-cq                    | 01:16:15 |      140 |
| kernel-3_14-a-pre-cq          | 01:25:38 |       46 |
| kernel-3_14-b-pre-cq          | 00:57:45 |       47 |
| kernel-3_14-c-pre-cq          | 00:59:01 |       47 |
| lakitu-pre-cq                 | 01:23:29 |       22 |
| lumpy-no-vmtest-pre-cq        | 00:55:51 |     1432 |
| nyan-no-vmtest-pre-cq         | 00:56:34 |       12 |
| oak-pre-cq                    | 01:00:04 |       66 |
| rambi-no-vmtest-pre-cq        | 01:18:24 |     1417 |
| rambi-pre-cq                  | 01:24:25 |       16 |
| reef-no-vmtest-pre-cq         | 01:02:09 |      112 |
| reef-pre-cq                   | 01:07:17 |      132 |
| rowan-pre-cq                  | 00:57:38 |       12 |
| samus-cheets-no-vmtest-pre-cq | 01:19:26 |       35 |
| samus-no-vmtest-pre-cq        | 00:59:54 |     1448 |
| samus-pre-cq                  | 01:00:19 |       53 |
| sand-pre-cq                   | 00:52:36 |       17 |
| smaug-pre-cq                  | 00:54:43 |       62 |
| strago-no-vmtest-pre-cq       | 00:59:49 |      118 |
| strago-pre-cq                 | 00:59:34 |       65 |
| veyron_minnie-cheets-pre-cq   | 01:00:44 |       45 |
| whirlwind-no-vmtest-pre-cq    | 00:54:44 |     1431 |
| x86-alex-no-vmtest-pre-cq     | 00:58:44 |     1433 |
+-------------------------------+----------+----------+
So.. the overall average is about 60 minutes, with a few outliers. That means the CL I put up is mostly correctly and will nearly double our capacity when dealing with brief spikes.

I also think it's appropriate to look at why the slowest builders are slow to see if we can do anything to speed them up. But he numbers put them in perspective and show they aren't THAT much slower.
Project Member

Comment 16 by bugdroid1@chromium.org, Mar 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/aa1f841d8df44335321ee1d4bd40ee3e9df4abbd

commit aa1f841d8df44335321ee1d4bd40ee3e9df4abbd
Author: Don Garrett <dgarrett@google.com>
Date: Thu Mar 30 00:17:13 2017

PreCQLauncher: Increase launch timeout from 30 -> 90 minutes.

We have been seeing a lot of PreCQ jobs timeout while waiting to
launch. We are adding more builders, but since we've hit the limit
more than once recently.

Spot checking (no easy access to hard metrics yet) shows that most
PreCQ builds take ~60 minutes. That means that none of them can finish
before the 30 minute timeout is hit.

So... if we have capacity for 10 builds, and submit 12 builds, 2 will
timeout. By making the timeout longer than the average build time, we
can handle brief load spikes approaching double our builder capacity.

My suspicion is that we see such spikes in the morning, or right after
a CQ run with many CLs fails (because developers resubmit them
quickly).

BUG= chromium:706487 
TEST=run_tests

Change-Id: I0ca915defe27be17b884fe5e52149e53a25fed52
Reviewed-on: https://chromium-review.googlesource.com/462375
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/aa1f841d8df44335321ee1d4bd40ee3e9df4abbd/cbuildbot/stages/sync_stages.py

Project Member

Comment 17 by bugdroid1@chromium.org, Mar 30 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/f7818a9a057188fb46f034323c27950f810c23e5

commit f7818a9a057188fb46f034323c27950f810c23e5
Author: xixuan <xixuan@chromium.org>
Date: Thu Mar 30 04:42:59 2017

Project Member

Comment 18 by bugdroid1@chromium.org, Mar 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/cd955c1550db862cac4128605554b04a582f2f29

commit cd955c1550db862cac4128605554b04a582f2f29
Author: xixuan <xixuan@chromium.org>
Date: Thu Mar 30 21:27:20 2017

Add new GCEs to the Pre-CQ slave pool.

BUG= 706487 
TEST=None

Change-Id: Ife51d3708d8d502ea37948a48c60135192809290
Reviewed-on: https://chromium-review.googlesource.com/462337
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Ryan Tseng <hinoka@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/cd955c1550db862cac4128605554b04a582f2f29/masters/master.chromiumos.tryserver/slave_pool.json
[modify] https://crrev.com/cd955c1550db862cac4128605554b04a582f2f29/masters/master.chromiumos.tryserver/slaves.cfg

Comment 19 by aut...@google.com, Apr 4 2017

Labels: -current-issue
Cc: dgarr...@chromium.org
More occurred with the 90 minute timeout.
https://chromium-review.googlesource.com/c/474189/
Owner: jrbarnette@chromium.org
We've added the additional 20 physical and 20 GCE builders, and increased the timeout spent waiting for a job to start from 30 to 90 minutes.

Further thoughts....

1) We need metrics to understand usage.
2) If the PreCQ launcher knows we are overloaded, it can avoid speculative PreCQ runs.
3) Someone should investigate why usage is spiking in depth. CIDB should have the numbers.
4) We should prepare a demand estimate before throwing more hardware at this, and throw hardware based on usage expectations.

Cc: pho...@chromium.org
Numbers that would be useful to have graphed over time.

1) Number of in-flight PreCQ builds.
2) Number of scheduled, but not started PreCQ builds.
3) Split of build configs being built.
4) Average build time, by build config.
5) Reasons for scheduling.
  a) CQ ready, but PreCQ required.
  b) Speculative.
  c) Ready, but validation expired after a week.
  d) Retrying PreCQ without new CL revision.
6) Number of in-flight builds cancelled by PreCQ launcher.

The first two should give us enough to estimate required capacity. The rest are to see efficiency problems.
Bug about adding such a dashboard: crbug.com/707348
Status: Archived (was: Started)
This is an undead bug about a specific incident. If there are TODOs that would alleviate the effects of such an incident, file 'em separately.

Sign in to add a comment