
Issue 833720

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Need to increase the Linux swarming pool size for Chrome

Project Member Reported by dpranke@chromium.org, Apr 17 2018

Issue description

It looks like we're maxing out the linux swarming pool for Chrome: http://shortn/_FWC5LZpCxS

I suggest we add ~100 more builders (800 cores) ASAP and see how things are doing.

We should probably also check our monitoring stats to see whether we've been seeing pending tasks, and whether our alerting needs to be adjusted.
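The proposal implies 8 cores per builder (100 builders for 800 cores). A back-of-envelope sketch of that sizing math, with the per-builder ratio inferred from the numbers above rather than taken from any actual fleet configuration:

```python
# Rough sizing for the proposed expansion. CORES_PER_BUILDER is inferred
# from "100 builders (800 cores)" in the description; it is an assumption,
# not a value read from the real bot configs.
CORES_PER_BUILDER = 800 // 100  # 8 cores per builder

def extra_cores(extra_builders: int) -> int:
    """Total extra cores implied by adding `extra_builders` machines."""
    return extra_builders * CORES_PER_BUILDER

assert extra_cores(100) == 800
```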
 
Cc: s...@google.com
Labels: Infra-Troopers
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
I'll look into it as a current trooper.
Labels: -Infra-Troopers Foundation-Troopers
Owner: ----
Status: Untriaged (was: Assigned)
Since this requires updating MP / GCEBackend configs based on available CCompute capacity, I'm pushing this over to Foundation-Troopers queue.

This is causing a whole bunch of ticket alerts SwarmingPendingTimeHigh.

Please ping me if you need help. Thanks!

Comment 3 by s...@google.com, Apr 17 2018

Cc: -s...@google.com
Owner: smut@chromium.org
Status: Started (was: Untriaged)
https://chrome-internal-review.googlesource.com/c/infradata/config/+/610166
Issue 832809 has been merged into this issue.
Issue 832823 has been merged into this issue.
Issue 832824 has been merged into this issue.
Issue 832825 has been merged into this issue.
Issue 833035 has been merged into this issue.
Issue 833064 has been merged into this issue.
Issue 833070 has been merged into this issue.
Issue 833072 has been merged into this issue.
Issue 833077 has been merged into this issue.
Issue 833079 has been merged into this issue.
Issue 833110 has been merged into this issue.
Issue 833129 has been merged into this issue.
Issue 833131 has been merged into this issue.
Issue 833132 has been merged into this issue.
Issue 833139 has been merged into this issue.
Issue 833182 has been merged into this issue.
Issue 833199 has been merged into this issue.
Issue 833201 has been merged into this issue.
Issue 833202 has been merged into this issue.
Issue 833211 has been merged into this issue.
Issue 833428 has been merged into this issue.
Issue 833512 has been merged into this issue.
Issue 833524 has been merged into this issue.
Issue 833525 has been merged into this issue.
Issue 833529 has been merged into this issue.
Issue 833552 has been merged into this issue.
Issue 833553 has been merged into this issue.
Issue 833554 has been merged into this issue.
Issue 833579 has been merged into this issue.
Issue 833674 has been merged into this issue.
Issue 833683 has been merged into this issue.
Issue 833684 has been merged into this issue.
Issue 833693 has been merged into this issue.
Issue 833922 has been merged into this issue.
Issue 833936 has been merged into this issue.
Issue 833940 has been merged into this issue.
Issue 833954 has been merged into this issue.
Issue 833957 has been merged into this issue.
Issue 833972 has been merged into this issue.
Project Member Comment 5 by bugdroid1@chromium.org, Apr 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/850d1ef2692cb0d7047e97edaec79a9d8069d72c

commit 850d1ef2692cb0d7047e97edaec79a9d8069d72c
Author: smut <smut@google.com>
Date: Tue Apr 17 23:59:59 2018

Comment 6 by s...@google.com, Apr 18 2018

Status: Fixed (was: Started)
Status: Assigned (was: Fixed)
It looks like we're better off but still pretty busy, based on:

http://shortn/_qQA7x3q95P

I suggest we either shift 100 VMs from the Win10 GCE pool (which seems to have lots of headroom):

http://shortn/_rVoow6EVL1

Or spin up 100 new ones.

Thoughts?

Comment 8 by s...@google.com, Apr 18 2018

Cc: tandrii@chromium.org
I think it's okay to have maxed-out utilization and tasks waiting a few minutes before they can start, so long as the overall pending time for jobs users care about is low. Is the CQ still out of SLA? If SwarmingPendingTimeHigh_CQ is firing, that's indicative of a need for capacity, but I don't think the utilization graph alone implies that we definitely need more capacity.

Separately from that, yeah we can probably decrease the number of Windows VMs.
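The policy argued for above can be sketched as: treat raw utilization as informational and page for capacity only when user-visible pending time exceeds the SLA bound. This is a minimal illustration; the function name and the 30-second threshold (the SwarmingPendingTimeHigh_CQ figure discussed in the thread) are assumptions, not the actual alerting config:

```python
# Hypothetical sketch: a fully busy pool with low pending time is healthy;
# only pending time beyond the SLA bound indicates a capacity shortfall.
SLA_PENDING_SECONDS = 30.0  # assumed, mirroring the >30s figure in the thread

def should_page(p90_pending_seconds: float) -> bool:
    """Page for capacity only when pending time is out of SLA."""
    return p90_pending_seconds > SLA_PENDING_SECONDS

assert should_page(5.0) is False    # busy but healthy: no page
assert should_page(120.0) is True   # minutes of pending time: page
```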
Sorry, I have no cycles to track the Chromium CQ SLA; I can barely maintain the CQ daemon. I recommend asking the CCI team this question, since (IIUC) they also receive SwarmingPendingTimeHigh pages after the trooper split.
Status: Fixed (was: Assigned)
Looks like we're meeting the (cycle time) SLA this week. 

I'm okay with tasks pending for seconds, but not for minutes; in my experience the latter means we're too close to being out of capacity. I guess we can leave as-is for now.

Comment 11 by s...@google.com, Apr 19 2018

Cc: -tandrii@chromium.org
Could we tune SwarmingPendingTimeHigh_CQ so that it fires enough in advance that it tells us "you'll be out of SLA if you don't add more capacity soon" rather than "too late, you're already out of SLA"? It seems to fire if pending time is >30s for any amount of time. Maybe we want >15s, for example? http://shortn/_5l3McIRE38
I think part of the problem with relying on SwarmingPendingTimeHigh is that it tells you when we're *out* of capacity, not when you have, say, 10% headroom and can afford to add another copy of the layout tests without problems.

I'd like to keep us closer to the latter point to allow for variance and growth, but exceeding that is more of a ticket-level thing than a page-level thing.
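The headroom idea above, as a tiny illustrative check: warn at the ticket level once idle capacity drops below roughly 10%, well before the pool is exhausted. The threshold and names are assumptions for illustration, not actual monitoring config:

```python
# Hypothetical headroom check; the 10% figure comes from the discussion above.
WARN_HEADROOM = 0.10

def headroom(total_bots: int, busy_bots: int) -> float:
    """Fraction of the pool that is currently idle."""
    if total_bots <= 0:
        raise ValueError("pool must be non-empty")
    return (total_bots - busy_bots) / total_bots

def low_headroom(total_bots: int, busy_bots: int) -> bool:
    """True when idle capacity has fallen below the warning line."""
    return headroom(total_bots, busy_bots) < WARN_HEADROOM

# 900 of 1000 bots busy is exactly 10% headroom: right at the line, no warning.
assert not low_headroom(1000, 900)
assert low_headroom(1000, 950)
```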

Comment 13 by s...@google.com, Apr 19 2018

On the other hand, it fired many, many times over the past 5 days according to #4, and you said we remained in SLA this week even though capacity was only added yesterday. Maybe it's giving sufficient advance notice, then.
Cc: martiniss@chromium.org
#11,12: We've been talking about changes to the CCI monitoring to do that.
Cc: kmarshall@chromium.org hzl@chromium.org
Issue 831895 has been merged into this issue.
Owner: s...@google.com
I agree that having a small amount of pending time is fine; it's a bit unclear what the units are in https://bugs.chromium.org/p/chromium/issues/detail?id=832809. When I've gotten alerts in the past, it's been hard to understand the thresholds they're supposed to fire at. That's probably a separate issue, but still annoying.

http://shortn/_vHwZGuq4QG shows that we hit 60 minutes of pending time, which is the existing timeout for tasks, so it looks like tasks expired. I'm not sure where the claim that we remained in SLA came from.

(fixing smut@ email address)
The claim that we remained in SLA comes from go/cq-slo-dash, where to me it looks like we did.
Owner: smut@chromium.org
