Need to increase the Linux swarming pool size for Chrome
Issue description: It looks like we're maxing out the Linux swarming pool for Chrome: http://shortn/_FWC5LZpCxS I suggest we add ~100 more builders (800 cores) ASAP and see how things are doing. We should probably also check our monitoring stats to see whether we've been seeing pending tasks, etc., and whether our alerting needs to be adjusted.
,
Apr 17 2018
Since this requires updating MP / GCEBackend configs based on available CCompute capacity, I'm pushing this over to the Foundation-Troopers queue. This is causing a whole bunch of SwarmingPendingTimeHigh ticket alerts. Please ping me if you need help. Thanks!
,
Apr 17 2018
https://chrome-internal-review.googlesource.com/c/infradata/config/+/610166
,
Apr 17 2018
The following issues have been merged into this issue: 832809, 832823, 832824, 832825, 833035, 833064, 833070, 833072, 833077, 833079, 833110, 833129, 833131, 833132, 833139, 833182, 833199, 833201, 833202, 833211, 833428, 833512, 833524, 833525, 833529, 833552, 833553, 833554, 833579, 833674, 833683, 833684, 833693, 833922, 833936, 833940, 833954, 833957, 833972.
,
Apr 18 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/850d1ef2692cb0d7047e97edaec79a9d8069d72c
commit 850d1ef2692cb0d7047e97edaec79a9d8069d72c
Author: smut <smut@google.com>
Date: Tue Apr 17 23:59:59 2018
,
Apr 18 2018
,
Apr 18 2018
It looks like we're better off but still pretty busy, based on: http://shortn/_qQA7x3q95P I suggest we either shift 100 VMs from the Win10 GCE pool (which seems to have lots of headroom): http://shortn/_rVoow6EVL1 Or spin up another new 100. Thoughts?
,
Apr 18 2018
I think it's okay to have maxed out utilization and tasks having to wait a few minutes before they can start so long as the overall pending time for jobs users care about is low. Is the CQ still out of SLA? If SwarmingPendingTimeHigh_CQ is firing that's indicative of a need for capacity, but I don't think that the utilization graph alone implies that we definitely need more capacity. Separately from that, yeah we can probably decrease the number of Windows VMs.
,
Apr 18 2018
Sorry, I have no cycles to track the Chromium CQ SLA; I can barely maintain the CQ daemon. I recommend asking the CCI team, which also receives SwarmingPendingTimeHigh pages after the trooper split (IIUC).
,
Apr 19 2018
Looks like we're meeting the (cycle time) SLA this week. I'm okay with tasks pending for seconds, but not for minutes; in my experience the latter means we're too close to being out of capacity. I guess we can leave as-is for now.
,
Apr 19 2018
Could we tune SwarmingPendingTimeHigh_CQ so that it fires enough in advance that it tells us "you'll be out of SLA if you don't add more capacity soon" rather than "too late, you're already out of SLA"? It seems to fire if pending time is >30s for any amount of time. Maybe we want >15s, for example? http://shortn/_5l3McIRE38
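To make the proposal above concrete, here is a minimal sketch of a two-level rule: an early warning at >15s and a page at >30s. This is not the actual SwarmingPendingTimeHigh_CQ definition. The function name, the sample format, and the `min_consecutive` parameter (requiring the threshold to be exceeded for several samples in a row rather than "for any amount of time") are assumptions made purely for illustration.

```python
from typing import Sequence


def classify_pending(
    samples_s: Sequence[float],
    warn_s: float = 15.0,      # proposed early-warning threshold from this comment
    page_s: float = 30.0,      # threshold the alert appears to use today
    min_consecutive: int = 3,  # hypothetical: samples in a row needed to fire
) -> str:
    """Return 'page', 'warn', or 'ok' for a series of pending times in seconds."""

    def sustained(threshold: float) -> bool:
        # True if at least min_consecutive samples in a row exceed the threshold.
        run = 0
        for sample in samples_s:
            run = run + 1 if sample > threshold else 0
            if run >= min_consecutive:
                return True
        return False

    if sustained(page_s):
        return "page"
    if sustained(warn_s):
        return "warn"
    return "ok"
```

With these assumed defaults, a series that stays between 15s and 30s for three samples would produce a warning without paging, which is the "you'll be out of SLA soon" signal asked for here.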
,
Apr 19 2018
I think part of the problem with relying on SwarmingPendingTimeHigh is that it tells you when we're *out* of capacity, not when you have, say, 10% headroom and can afford to add another copy of the layout tests w/o problems. I'd like to keep us closer to the latter point to allow for variance and growth, but exceeding that is more of a ticket-level thing than a page-level thing.
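As a rough illustration of the headroom idea, the sketch below computes the fraction of idle bots in a pool and maps it to a severity. The function names, the bot counts as inputs, and the mapping itself (ticket when headroom drops below the 10% target, page only when the pool is fully busy) are assumptions for illustration, not existing monitoring.

```python
def headroom(busy_bots: int, total_bots: int) -> float:
    """Fraction of the pool that is idle, e.g. 0.10 means 10% headroom."""
    if total_bots <= 0:
        raise ValueError("total_bots must be positive")
    return (total_bots - busy_bots) / total_bots


def capacity_action(busy_bots: int, total_bots: int, target: float = 0.10) -> str:
    """Return 'page', 'ticket', or 'ok' for the given pool utilization."""
    h = headroom(busy_bots, total_bots)
    if h <= 0:
        # Pool is fully busy: the out-of-capacity case.
        return "page"
    if h < target:
        # Below the desired headroom: worth a ticket, not a page.
        return "ticket"
    return "ok"
```

For example, under these assumptions a pool of 1000 bots with 950 busy has 5% headroom and would file a ticket rather than page anyone.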
,
Apr 19 2018
On the other hand, it fired many, many times over the past 5 days according to #4, and you said we remained in SLA this week even though capacity was only added yesterday. Maybe it's giving sufficient advance notice, then.
,
Apr 19 2018
#11,12: We've been talking about changes to the CCI monitoring to do that.
,
Apr 19 2018
,
Apr 19 2018
I agree that having a small amount of pending time is fine; it's a bit unclear what the units are in https://bugs.chromium.org/p/chromium/issues/detail?id=832809. I know that when I've gotten alerts in the past, it's been a bit hard to understand the thresholds they're supposed to alert at. That's probably a separate thing, but still annoying. http://shortn/_vHwZGuq4QG shows that we hit a 60-minute pending time, which is the existing timeout for tasks. So it looks like tasks expired. Not sure where the claim that we remained in SLA came from? (fixing smut@ email address)
,
Apr 20 2018
The claim that we remained in SLA comes from go/cq-slo-dash, where to me it looks like we did.
,
Jul 26
Comment 1 by sergeybe...@chromium.org, Apr 17 2018
Labels: Infra-Troopers
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)