linux-chromeos-rel can take > 2 hours to complete on cros CLs
Issue description
linux-chromeos-rel currently has 113 pending builds, despite having a pool of 136. (There are only 82 pending builds though, I'm not sure why that would be). This is a huge CQ bottleneck right now (and has been in the past as well). One problem is that changes detected as affecting cros code can take more than 2 hours! Some recent examples (2 failed, 1 succeeded): https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100749 https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100736 https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100731 The successful build executed 301 steps! Presumably some of these run in parallel, since there are two 58-minute compile stages and telemetry_perf_unittests takes 30 minutes. It seems like it is time to break this builder up?
Apr 13 2018
Recent times are definitely better, but still pretty long for full build/test runs, ranging from 40 mins to 90 mins (there were a couple of slow runs around 8:30 am). My understanding is that we try to keep CQ builders under 40 minutes?
Apr 13 2018
We try to keep successful CQ runs under 40 min on average. A run that fails tests and needs to be retried without the patch is expected (perhaps obviously) to take longer. We are looking at ways to get rid of the "without patch" part generally, but that's not specific to this config. It's been a bad 2-3 weeks for the CQ generally, and the linux-chromeos-rel builder in particular over the past couple days due to the bad image we pushed out. It looks like cycle times for the builder are back closer to normal now, but it is the second-slowest builder, and so we should look to see what the slowest tasks are and what we can do about them (e.g., add more test shards). @liaoyuke - as current CCI trooper, want to take a look at this?
Apr 13 2018
Apr 13 2018
The bot's been in really bad shape the past couple days due to a bad src-side change landing and making browser_tests fail on that bot. The induced load from the subsequent with+without patch behavior caused the pending times we were seeing. It's been reverted twice now: https://chromium-review.googlesource.com/c/chromium/src/+/1011564 https://chromium-review.googlesource.com/c/chromium/src/+/1012138
Apr 13 2018
Make that 3 test-breaking changes that have landed in the past couple days: https://chromium-review.googlesource.com/c/chromium/src/+/1010662 (it's since been reverted) This bot is *not* having a good time. We should look into how those changes made it past the CQ in the first place...
Apr 16 2018
The builder is definitely looking better: https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/?limit=100 But full successful builds can still take about an hour: https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/102731 'compile' is taking 20-30 minutes. Other than that, I think it's just the fact that there are 300 steps, even though few of them take more than a minute.
Apr 16 2018
An hour for a change that causes us to run all of the tests is probably about right these days. There's more stuff we can do to speed up things, but I don't see a real bug here. The 40 minute number is *on average*, not worst case.
Apr 16 2018
I guess that 'average' is tricky here; non-cros changes are very quick, but cros changes appear to be 45 mins+. We are seeing > 1 hr builds again with 31 pending. I was suggesting that we might break the builder up (300 steps seems like a lot), but I understand that introduces an extra maintenance cost.
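To illustrate why the average is tricky for a mixed workload, here is a minimal sketch. The build counts and per-group times are hypothetical, not measurements from this builder, and `mean_cycle_minutes` is an illustrative helper name. It shows how many quick non-cros builds can pull the overall mean well under 40 minutes even when every cros build takes an hour.

```python
def mean_cycle_minutes(groups):
    """Return the overall mean cycle time for (build_count, mean_minutes) groups."""
    total_builds = sum(count for count, _ in groups)
    if total_builds == 0:
        raise ValueError("no builds to average")
    # Weight each group's mean by how many builds it contributes.
    return sum(count * minutes for count, minutes in groups) / total_builds


# Hypothetical mix (assumed numbers): 80 non-cros builds averaging 15 min
# and 20 cros builds averaging 60 min.
overall = mean_cycle_minutes([(80, 15.0), (20, 60.0)])
print(f"overall mean: {overall:.1f} min")
```

With these assumed numbers the overall mean looks healthy, although no cros change finishes in under an hour, so a per-group breakdown may be more informative than a single average.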
Apr 16 2018
We shouldn't be seeing pending builds regardless of the cycle time, so I'll keep looking into it. The sheer number of steps (300+) is not that interesting in itself, since most of the steps are very fast.
Apr 17 2018
Okay, we definitely need more builders in the tryserver.chromium.chromiumos pool. bpastene@ - can you look into adding ~50 more tomorrow (Tuesday) to the 136 we have now? We'll need at least that many if we start adding tests into the CQ so we might as well get started now.
Apr 17 2018
Actually, this should probably go to sergeyberezin@ as current ICC trooper.
Apr 18 2018
It looks like today's peak load was well down from yesterday, so maybe we've cleaned up things enough that we're back to normal and don't need the additional 50. I'm going to take this back and keep an eye on it.
Apr 18 2018
Also looks like the average cycle time is much closer to 40 min.
Apr 19 2018
Frankly, things look good. Even the 95th percentile is now hovering under 60 minutes, compared to last week when it was indeed over 2 hours. I'm calling this closed.
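For readers who want to reproduce a tail-latency check like the 95th percentile mentioned above, here is a minimal sketch using the nearest-rank method. The durations are made-up sample values, and the thread does not say how the dashboards actually compute percentiles, so treat the method and the name `percentile_nearest_rank` as assumptions.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest 1-based rank covering at least pct percent.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical cycle times in minutes (assumed, not real builder data).
durations = [12, 14, 15, 18, 20, 22, 25, 31, 38, 41,
             44, 47, 52, 55, 58, 61, 63, 70, 95, 130]

p50 = percentile_nearest_rank(durations, 50)
p95 = percentile_nearest_rank(durations, 95)
print(f"median: {p50} min, 95th percentile: {p95} min")
```

Tracking a high percentile alongside the mean makes slow outliers visible, such as the over-2-hour runs that prompted this issue.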
Comment 1 by dpranke@chromium.org, Apr 13 2018
Components: -Infra>Platform>Buildbot>TryServer Infra>Platform