
Issue 876034

Starred by 5 users

Issue metadata

Status: Fixed
Closed: Aug 22
Pri: 1
Type: Bug




Shortage of MP-provided capacity

Project Member Reported by cbiesin...@chromium.org, Aug 20

Issue description

Please provide the details for your request here.

Set Pri-0 iff it requires immediate attention, Pri-1 if resolution within a few hours is acceptable, and Pri-2 if it just needs to be handled today.

https://chromium-swarm.appspot.com/task?id=3f71d7c5c5c84a10&refresh=10&show_raw=1
Similar Load: 18619 similar pending tasks, 1633 similar running tasks

That's 18k pending tasks for 1600 bots!
 
Labels: -Infra-Troopers Foundation-Troopers
Machine provider seems to be slow in getting us our usual ~2400 linux bots:
http://shortn/_tVLX9cEuhd

The reduced capacity would explain the long pending times. Over to foundation.
From chat:

For some unexplained reason, the state in MP's guts and the state in GCE desynchronized.

In particular, MP thinks there are 0 VMs in the gce-trusty group in us-east1-b.

But there are in fact 2 VMs there.
Also, MP thinks there are 0 leases in this group.
This makes it set the target size of the group to '1' (+1 over the 0 it expects).
It assumes the group will scale up; then it will set the target size to 2, wait for the scale-up, and so on until the group is at full capacity.
But since there are 2 VMs already, the group doesn't scale up. And MP sits waiting, confused.
Makes sense.
I think we need to delete these 2 VMs manually
how did you figure that out?
I've done this
by adjusting the group to 0 here: https://pantheon.corp.google.com/compute/instanceGroups/list?project=google.com:chromecompute
I see that it's now scaling back up to 1
(and presumably will keep scaling)


We're not sure what the root cause of this desync is yet; once the group is scaled up I'll take an initial look.
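
For illustration, here is a minimal sketch of the incremental scale-up loop described in the chat above and how a stale view of the group size can leave it stuck. All names are hypothetical; this is not the actual Machine Provider code.

  # Hypothetical sketch of the scale-up loop described above; not MP's real code.
  import time

  def scale_up(group, desired_size, set_target_size, recorded_size):
      """Grow `group` one VM at a time until MP's records show `desired_size` VMs."""
      expected = recorded_size(group)  # MP's datastore says 0 for gce-trusty/us-east1-b
      while expected < desired_size:
          target = expected + 1
          set_target_size(group, target)       # e.g. sets the target to 1
          while recorded_size(group) < target:
              # MP's own records never reach the target: the 2 VMs that already
              # exist in GCE are not reflected here, so this inner loop never
              # exits and MP "sits waiting".
              time.sleep(30)
          expected = target

Under this reading, manually resizing the group to 0 (as described in the chat) gives GCE a target it actually has to act on, which matches the group scaling back up to 1 afterwards.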
Cc: s...@google.com
Owner: iannucci@chromium.org
Status: Assigned (was: Untriaged)
Assigning trooper as owner since this is Pri-0.
It looks like this latest CL is interacting badly with the state desync: https://chromium.googlesource.com/infra/luci/luci-py/+/1043540b4376f5cdb007393693b197f74ac06c73

We see a lot of lease denials on the swarming side.
So recap of what we think happened:

  * Swarming makes a bunch of lease requests to MP
  * MP scales up (except for us-east1-b), and fulfils those requests
  * MP keeps trying to scale up us-east1-b, but fails because the group is already bigger than it expects (?)
  * We manually resize the group to 0 so it's <= what MP expects
  * MP starts scaling the group up
  * However, the leases that swarming has outstanding are too old (> 4h?) and so MP denies them instead of fulfilling them

I think we are probably missing alerts for the "many outstanding leases" state and the "many lease denials" state.
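
A minimal sketch of the two pieces mentioned in this recap: denying leases that have been outstanding too long, and the alerts we appear to be missing. The thresholds and names are assumptions for illustration, not the actual luci-py implementation.

  import datetime

  MAX_LEASE_AGE = datetime.timedelta(hours=4)  # the "> 4h?" from the recap above
  MAX_OUTSTANDING_LEASES = 1000                # assumed alert threshold
  MAX_LEASE_DENIALS = 100                      # assumed alert threshold

  def should_deny(lease_created_ts, now):
      """Deny lease requests that have been outstanding longer than MAX_LEASE_AGE."""
      return now - lease_created_ts > MAX_LEASE_AGE

  def check_lease_alerts(outstanding_count, denial_count, alert):
      """Fire the alerts suggested above when either count looks unhealthy."""
      if outstanding_count > MAX_OUTSTANDING_LEASES:
          alert('mp/many-outstanding-leases', outstanding_count)
      if denial_count > MAX_LEASE_DENIALS:
          alert('mp/many-lease-denials', denial_count)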
Shouldn't have made manual adjustments to the instance group. That just desyncs state even more. How was the desync discovered? Did you see 2 VMs in google.com:chromecompute that were not in the datastore of google.com:gce-backend?
Group is now scaled up to ~120
(backend pool is up to 387)
Blockedon: 876143
The MP issue is resolved, and the backend is up to full capacity. However, swarming is manifesting some bad scheduling behavior (see bug 876143).
Blockedon: -876143
Status: Fixed (was: Assigned)
OK, this should be resolved. Removing bug 876143 as a blocker, since that issue is no longer blocking this (but it still needs some underlying work resolved).
Project Member

Comment 14 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef

commit 8660a4e588c8802207d93b29800b1ddb8b9f47ef
Author: Ben Pastene <bpastene@chromium.org>
Date: Tue Aug 21 16:48:26 2018

Labels: chops-pm-91
Cc: iannucci@chromium.org tandrii@chromium.org
Labels: -Pri-0 Pri-1
Owner: s...@google.com
Status: Assigned (was: Fixed)
Summary: Shortage of MP-provided capacity (was: Long pending times on linux)
This isn't resolved yet, sadly. Today, we saw fewer bots than usual
(see http://shortn/_JDQqtAJ6LU ; compare the right side = today with the leftmost side = last Friday).

So bpastene@ added 200 bots: 100 in each of the two zones (see https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef).
However, only 100 showed up, so I conclude that one of the two zones touched is again in some kind of stuck state. Assigning to Sana to figure it out.

Pri-1 because this doesn't (yet?) affect developers.
Cc: -s...@google.com hinoka@chromium.org
Status: Started (was: Assigned)
80 16-core bots were added in us-west1-c this morning:
https://chrome-internal.googlesource.com/infradata/config/+/f4c8debd4756371324ef654970f61f6b0fa2a3f6

This is the equivalent of increasing the standard 8-core VM count by 160. Since it happened during off-peak hours, we would have had capacity to create them when the config change landed; then, when peak hours rolled around and we tried to scale up the standard 8-core VMs, we would have had insufficient capacity. This would account for the missing VMs.

In response, 200 VMs were added this morning:
https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef

We happened to have sufficient quota in us-east1 to bring up half of them, but since the other half were in the same zone we were already out of quota in, naturally they didn't arrive.
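
For reference, the core-count arithmetic behind this explanation. The per-zone quota and in-use figures below are made up purely for illustration.

  # Worked example of the quota math above; quota and in-use numbers are made up.
  cores_for_new_bots = 80 * 16                      # 1280 cores for the new 16-core bots
  equivalent_8_core_vms = cores_for_new_bots // 8   # 160 standard 8-core VMs' worth

  def remaining_8_core_vms(zone_cpu_quota, cores_in_use):
      """How many more standard 8-core VMs fit under a zone's CPU quota."""
      return max(0, (zone_cpu_quota - cores_in_use) // 8)

  # If the 16-core bots come up off-peak, the zone has 160 fewer 8-core slots
  # left for the usual peak-hours scale-up.
  print(equivalent_8_core_vms)                                    # 160
  print(remaining_8_core_vms(16000, 12000 + cores_for_new_bots))  # 340 with these made-up numbers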
Project Member

Comment 18 by bugdroid1@chromium.org, Aug 22

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/63b628ff34ef9c2b38fa46b9cc914b5bdc4c6e61

commit 63b628ff34ef9c2b38fa46b9cc914b5bdc4c6e61
Author: Sana Muttaqi <smut@google.com>
Date: Wed Aug 22 00:08:17 2018

Cc: linds...@chromium.org
Status: Fixed (was: Started)
New capacity is available, so there should be enough for normal Chrome peak-hours capacity + the 80 new 16-core bots (+ some requests in the queue).
