Shortage of MP-provided capacity
Issue description:

Please provide the details for your request here. Set Pri-0 iff it requires immediate attention, Pri-1 if resolution within a few hours is acceptable, and Pri-2 if it just needs to be handled today.

https://chromium-swarm.appspot.com/task?id=3f71d7c5c5c84a10&refresh=10&show_raw=1

Similar Load: 18619 similar pending tasks, 1633 similar running tasks.

That's 18k pending tasks for 1600 bots!
Aug 20
From chat:

For some unexplained reason, the internal state of MP and the state in GCE have desynchronized. In particular, MP thinks there are 0 VMs in the gce-trusty group in us-east1-b, but there are in fact 2 VMs there. MP also thinks there are 0 leases in this group. That makes it set the target size of the group to 1 (+1 over the 0 it expects) and assume the group will scale up. It would then set the target size to 2, wait for the scale-up, and so on until the group reaches full capacity. But since there are already 2 VMs, the group doesn't scale up, and MP sits there waiting, confused.

Makes sense. I think we need to delete these 2 VMs manually. How did you figure that out?

I've done this by adjusting the group to 0 here: https://pantheon.corp.google.com/compute/instanceGroups/list?project=google.com:chromecompute
I see that it's now scaling back up to 1 (and presumably will keep scaling). We're not sure what the root cause of this desync is yet; once the group is scaled up I'll take an initial look.
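To make the stall described above concrete, here is a minimal sketch of the incremental scale-up loop as the chat describes it. The function names and structure are hypothetical illustrations, not the actual Machine Provider code:

```python
# Hypothetical sketch: MP believes the group is empty, so it repeatedly sets
# the target size to "expected + 1" and waits for the group to grow. If GCE
# already has more VMs than MP expects, the group never grows past the target,
# and MP waits forever.

def scale_up_step(expected_size, actual_size, set_target_size):
    """One iteration of the incremental scale-up loop."""
    target = expected_size + 1
    set_target_size(target)
    # The group only grows if the target exceeds the actual instance count.
    if actual_size < target:
        return actual_size + 1      # GCE creates one more VM
    return actual_size              # already at/above target: nothing happens


def scale_to_capacity(expected_size, actual_size, capacity, set_target_size):
    while expected_size < capacity:
        new_actual = scale_up_step(expected_size, actual_size, set_target_size)
        if new_actual == actual_size:
            # Desync: MP thinks the group is smaller than it really is, so the
            # target it sets never triggers a resize. The loop is stuck.
            raise RuntimeError("group did not scale up; state desynchronized?")
        actual_size = new_actual
        expected_size += 1          # MP records the VM it saw arrive


if __name__ == "__main__":
    # gce-trusty in us-east1-b: MP expects 0 VMs, GCE actually has 2.
    try:
        scale_to_capacity(expected_size=0, actual_size=2, capacity=120,
                          set_target_size=lambda t: None)
    except RuntimeError as e:
        print(e)
```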
Aug 20
Assigning trooper as owner since pri-0.
Aug 20
It looks like this latest CL is interacting badly with the state desync: https://chromium.googlesource.com/infra/luci/luci-py/+/1043540b4376f5cdb007393693b197f74ac06c73

We see a lot of lease denials on the Swarming side.
Aug 20
So, a recap of what we think happened:

* Swarming makes a bunch of lease requests to MP.
* MP scales up (except for us-east1-b) and fulfills those requests.
* MP keeps trying to scale up us-east1-b, but fails because the group is already bigger than it expects (?).
* We manually resize the group to 0 so it's <= what MP expects.
* MP starts scaling the group up.
* However, the leases Swarming has outstanding are too old (> 4h?), so MP denies them instead of fulfilling them (see the sketch after this list).

I think we're missing alerts for both the "many outstanding leases" state and the "many lease denials" state.
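As a rough illustration of the last bullet, here is a hedged sketch of an age-based denial check; the 4-hour cutoff and all names are assumptions taken from the recap above, not the real MP/Swarming code:

```python
import datetime

LEASE_REQUEST_MAX_AGE = datetime.timedelta(hours=4)  # assumed cutoff

def should_deny(lease_requested_at, now=None):
    """Deny lease requests that have been outstanding longer than the cutoff."""
    now = now or datetime.datetime.utcnow()
    return now - lease_requested_at > LEASE_REQUEST_MAX_AGE

# While us-east1-b was stuck, Swarming's outstanding requests kept aging, so by
# the time MP finally had capacity they were past the cutoff and got denied:
requested = datetime.datetime.utcnow() - datetime.timedelta(hours=6)
print(should_deny(requested))  # True -> denied instead of fulfilled
```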
Aug 20
We shouldn't have made manual adjustments to the instance group; that just desyncs the state even more. How was the desync discovered? Did you see 2 VMs in google.com:chromecompute that were not in the datastore of google.com:gce-backend?
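For illustration, here is one way such a desync could be spotted, assuming hypothetical helpers that return instance names from GCE and from the gce-backend datastore; this is not the real gce-backend API:

```python
def find_desync(gce_instance_names, backend_instance_names):
    """Return instances GCE reports but gce-backend doesn't know about, and
    vice versa. Both arguments are plain iterables of instance names."""
    gce = set(gce_instance_names)
    backend = set(backend_instance_names)
    return {
        "in_gce_only": sorted(gce - backend),       # e.g. the 2 orphan VMs
        "in_backend_only": sorted(backend - gce),   # records for deleted VMs
    }

# Example with made-up instance names for the us-east1-b gce-trusty group:
print(find_desync(
    gce_instance_names=["gce-trusty-0001", "gce-trusty-0002"],
    backend_instance_names=[],
))
```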
Aug 20
Group is now scaled up to ~120
Aug 20
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/fafeeafd0ff95c25e8a5502f36669450c311cfb8

commit fafeeafd0ff95c25e8a5502f36669450c311cfb8
Author: smut <smut@google.com>
Date: Mon Aug 20 23:45:57 2018

[GCE Backend] Detect instances leased indefinitely

Bug: 876034
Change-Id: I2db83b03daac3cfac8c12656679710bd2bff4dfa
Reviewed-on: https://chromium-review.googlesource.com/1182515
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: smut <smut@google.com>

[modify] https://crrev.com/fafeeafd0ff95c25e8a5502f36669450c311cfb8/appengine/gce-backend/catalog.py
[modify] https://crrev.com/fafeeafd0ff95c25e8a5502f36669450c311cfb8/appengine/gce-backend/catalog_test.py
[modify] https://crrev.com/fafeeafd0ff95c25e8a5502f36669450c311cfb8/appengine/gce-backend/instances.py
[modify] https://crrev.com/fafeeafd0ff95c25e8a5502f36669450c311cfb8/appengine/gce-backend/instances_test.py
[modify] https://crrev.com/fafeeafd0ff95c25e8a5502f36669450c311cfb8/appengine/gce-backend/models.py
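Purely as an illustration of what "leased indefinitely" could mean (this is not taken from the CL above; the field names are made up), a lease record with no expiration could be flagged like so:

```python
# Hypothetical illustration only: treat a leased instance whose lease record
# has no expiration timestamp as "leased indefinitely" and flag it.
def leased_indefinitely(instance):
    """instance: dict with 'leased' and optional 'lease_expiration_ts' (epoch seconds)."""
    return instance.get("leased", False) and instance.get("lease_expiration_ts") is None

instances = [
    {"name": "vm-1", "leased": True, "lease_expiration_ts": 1534809600},
    {"name": "vm-2", "leased": True, "lease_expiration_ts": None},  # flagged
]
print([i["name"] for i in instances if leased_indefinitely(i)])
```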
Aug 20
(backend pool is up to 387)
Aug 21
The MP issue is resolved; the backend is up to full capacity. However, Swarming is manifesting some bad scheduling behavior (see bug 876143).
Aug 21
OK, this should be resolved. Removing bug 876143 as a blocker, since that issue is no longer blocking this one (though it still needs some underlying work).
Aug 21
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef

commit 8660a4e588c8802207d93b29800b1ddb8b9f47ef
Author: Ben Pastene <bpastene@chromium.org>
Date: Tue Aug 21 16:48:26 2018
Aug 21
This isn't resolved yet, sadly. Today we saw fewer bots than usual (see http://shortn/_JDQqtAJ6LU ; compare the right side = today with the leftmost side = last Friday). So bpastene@ added 200 bots, 100 in each of two zones (see https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef). However, only 100 showed up. So I conclude that one of the two zones touched is again in some kind of stuck state. Assigning to Sana to figure it out. Pri-1 because this doesn't (yet?) affect developers.
Aug 21
80 16-core bots were added in us-west1-c this morning: https://chrome-internal.googlesource.com/infradata/config/+/f4c8debd4756371324ef654970f61f6b0fa2a3f6

That is the equivalent of increasing the standard 8-core VM count by 160. Since it happened during off-peak hours, we would've had the capacity to create them when the config change landed; then, when peak hours rolled around and we tried to scale up the standard 8-core VMs, we would've had insufficient capacity. This would account for the missing VMs.

In response, 200 VMs were added this morning: https://chrome-internal.googlesource.com/infradata/config/+/8660a4e588c8802207d93b29800b1ddb8b9f47ef

We happened to have sufficient quota in us-east1 to bring up half of them, but since the other half were in the same zone we were already out of quota in, naturally they didn't arrive.
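A worked version of the core-quota arithmetic above; only the 80 x 16-core and 200 x 8-core figures come from this thread, and the zone quota headroom below is a made-up number purely for illustration:

```python
CORES_PER_STANDARD_VM = 8

def standard_vm_equivalent(vm_count, cores_per_vm):
    """How many standard 8-core VMs' worth of core quota a VM group consumes."""
    return vm_count * cores_per_vm // CORES_PER_STANDARD_VM

# 80 16-core bots consume the same core quota as 160 standard 8-core VMs:
print(standard_vm_equivalent(80, 16))   # 160

# If the zone's off-peak headroom just fit the 16-core bots, the standard VMs
# requested in the same zone at peak could no longer fit:
remaining_cores = 1280                  # hypothetical zone core quota headroom
used_by_16core = 80 * 16                # 1280 cores
print(remaining_cores - used_by_16core) # 0 cores left for the new 8-core VMs
```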
Aug 22
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/63b628ff34ef9c2b38fa46b9cc914b5bdc4c6e61

commit 63b628ff34ef9c2b38fa46b9cc914b5bdc4c6e61
Author: Sana Muttaqi <smut@google.com>
Date: Wed Aug 22 00:08:17 2018
Aug 22
New capacity is available, so there should be enough for normal Chrome peak-hours demand + the 80 new 16-core bots (+ some requests in the queue).