
Issue 832355 link


Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




linux-chromeos-rel can take > 2 hours to complete on cros CLs

Project Member Reported by steve...@chromium.org, Apr 12 2018

Issue description

linux-chromeos-rel currently has 113 pending builds, despite having a pool of 136.

(There are only 82 pending builds though, I'm not sure why that would be).

This is a huge CQ bottleneck right now (and has been so in the past also).

One problem is that changes that are detected as affecting cros code can take more than 2 hours!

Some recent examples (2 failed, 1 succeeded):
https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100749
https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100736
https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/100731

The successful build executed 301 steps! Presumably some of these run in parallel, since there are two 58-minute compile stages and telemetry_perf_unittests takes 30 minutes.

It seems like it is time to break this builder up?



 
Cc: estaab@chromium.org jbudorick@chromium.org
Components: -Infra>Platform>Buildbot>TryServer Infra>Platform
The problem isn't the builder. It looks like there were capacity issues in the swarming fleet around the time of those jobs (you can see some jobs were pending for 28+ minutes).

The 58 min compile time is also highly unusual; I wonder if there was something wrong w/ goma. I was OOO for most of the day, so I don't know if there were issues w/ the fleet.

Looking at cycle times now, they seem fairly normal, but of course this is off-peak.

I'll check into this more tomorrow, but cc'ing estaab and jbudorick in case they know something or know who the troopers were.
Recent times are definitely better, but still pretty long for full build/test runs, ranging from 40 mins to 90 mins (there were a couple of slow runs around 8:30 am). My understanding is that we try to keep CQ builders under 40 minutes?

Owner: liaoyuke@chromium.org
Status: Assigned (was: Untriaged)
We try to keep successful CQ runs under 40 min on average. A run that fails tests and needs to be retried without the patch is expected (perhaps obviously) to take longer. We are looking at ways to get rid of the "without patch" part generally, but that's not specific to this config.

It's been a bad 2-3 weeks for the CQ generally, and the linux-chromeos-rel builder in particular over the past couple days due to the bad image we pushed out.

It looks like cycle times for the builder are back closer to normal now, but it is the second-slowest builder, and so we should look to see what the slowest tasks are and what we can do about them (e.g., add more test shards).

@liaoyuke - as current CCI trooper, want to take a look at this?
Components: -Infra>Platform Infra>Client>Chrome
The bot's been in really bad shape the past couple days due to a bad src-side change landing and making browser_tests fail on that bot. The induced load from the subsequent with+without patch behavior caused the pending times we were seeing.

It's been reverted twice now:
https://chromium-review.googlesource.com/c/chromium/src/+/1011564
https://chromium-review.googlesource.com/c/chromium/src/+/1012138 
Make that 3 test-breaking changes that have landed in the past couple days:
https://chromium-review.googlesource.com/c/chromium/src/+/1010662 (it's since been reverted)

This bot is *not* having a good time. We should look into how those changes made it past the CQ in the first place...
The builder is definitely looking better:
https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/?limit=100

But full successful builds can still take about an hour:
https://ci.chromium.org/buildbot/tryserver.chromium.chromiumos/linux-chromeos-rel/102731

'compile' is taking 20-30 minutes. Other than that, I think it's just the fact that there are 300 steps; few of them even take more than a minute.

An hour for a change that causes us to run all of the tests is probably about right these days. There's more stuff we can do to speed up things, but I don't see a real bug here. The 40 minute number is *on average*, not worst case.
I guess that 'average' is tricky here; non-cros changes are very quick, but cros changes appear to be 45 mins+. We are seeing > 1 hr builds again with 31 pending.

I was suggesting that we might break the builder up (300 steps seems like a lot), but I understand that introduces an extra maintenance cost.
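To illustrate why the 'average' is tricky, here is a minimal sketch that splits cycle times by change type before averaging. The sample durations and the "cros"/"non-cros" labels are made up for illustration; they are not taken from the builder's real data.

```python
from statistics import mean


def summarize_cycle_times(builds):
    """Return the mean duration (minutes) per change kind, plus the overall mean.

    `builds` is a list of (kind, minutes) pairs.
    """
    by_kind = {}
    for kind, minutes in builds:
        by_kind.setdefault(kind, []).append(minutes)
    summary = {kind: mean(values) for kind, values in by_kind.items()}
    summary["overall"] = mean(minutes for _, minutes in builds)
    return summary


# Hypothetical sample: six quick non-cros builds and four slow cros builds.
builds = [("non-cros", 10)] * 6 + [("cros", 60)] * 4
summary = summarize_cycle_times(builds)
print(summary)
```

With numbers like these, the overall mean stays under the 40 minute target even though every cros change takes an hour, which is why a per-kind breakdown may be more informative than a single average.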

Owner: dpranke@chromium.org
We shouldn't be seeing pending builds regardless of the cycle time, so I'll keep looking into it.

The 300+ steps are not that interesting in themselves, despite the sheer number, since most of the steps are very fast.
Cc: bpastene@chromium.org
Owner: bpastene@chromium.org
Okay, we definitely need more builders in the tryserver.chromium.chromiumos pool. 

bpastene@ - can you look into adding ~50 more tomorrow (Tuesday) to the 136 we have now? We'll need at least that many if we start adding tests into the CQ so we might as well get started now.
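As a rough way to reason about pool size, here is a back-of-the-envelope sketch using Little's law (builders busy on average = arrival rate x cycle time). The arrival rates are hypothetical, only the pool size of 136 comes from this bug, and the model assumes each build occupies exactly one builder for its whole cycle time.

```python
import math


def builders_needed(arrivals_per_hour, cycle_minutes):
    """Estimate how many builders are busy on average, via Little's law.

    Assumes one builder per build for the full cycle time (a simplification).
    """
    return math.ceil(arrivals_per_hour * cycle_minutes / 60)


POOL_SIZE = 136  # current pool size, from this bug

# Hypothetical load: 136 builds/hour arriving during peak.
at_60_min = builders_needed(136, 60)
at_90_min = builders_needed(136, 90)
print(at_60_min, at_90_min, at_90_min > POOL_SIZE)
```

Under these assumptions, a pool that keeps up at a 60 minute cycle time falls behind as soon as cycle times stretch to 90 minutes at the same arrival rate, and builds start to queue.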
Owner: sergeybe...@chromium.org
Actually, this should probably go to sergeyberezin@ as current ICC trooper.
Owner: dpranke@chromium.org
It looks like today's peak load was well down from yesterday, so maybe we've cleaned up things enough that we're back to normal and don't need the additional 50. 

I'm going to take this back and keep an eye on it.
Status: Started (was: Assigned)
Also looks like the average cycle time is much closer to 40 min.
Status: Fixed (was: Started)
Frankly, things look good. Even the 95th percentile is now hovering under 60 minutes, compared to last week when it was indeed over 2 hours. I'm calling this closed.
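For anyone wanting to check the same thing later, here is a minimal sketch of a nearest-rank percentile over cycle times. The sample durations are invented for illustration, and the dashboards may compute percentiles with a different method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical cycle times in minutes for 20 builds, including one outlier.
cycle_times = [30] * 10 + [40] * 8 + [55, 130]

p50 = percentile(cycle_times, 50)
p95 = percentile(cycle_times, 95)
print(p50, p95, max(cycle_times))
```

The 95th percentile ignores the single slowest run in a sample like this, so it tracks what most slow builds look like rather than the one worst case.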
