
Issue 700744

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Spread Mp actions over time

Project Member Reported by mar...@chromium.org, Mar 12 2017

Issue description

When Machine Provider restructures the fleet composition, it tends to hit the GCE Instance Group Manager heavily, and the Instance Group Manager doesn't like that. It'd be good to spread the actions (both adding VMs and removing them) over time a bit. The same applies to rolling updates.
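As a rough illustration of what "spreading over time" could mean, the sketch below assigns evenly spaced start offsets to a batch of actions. It is not Machine Provider code; `spread_offsets`, its parameters, and the 10-minute window are all hypothetical.

```python
from typing import List


def spread_offsets(num_actions: int, window_seconds: float) -> List[float]:
    """Return evenly spaced start offsets (seconds) for num_actions in a window.

    Hypothetical helper: the first action starts at 0 and every later action
    starts one fixed interval after the previous one.
    """
    if num_actions <= 0:
        return []
    interval = window_seconds / num_actions
    return [i * interval for i in range(num_actions)]


# Example: 500 VM creations or deletions spread over an assumed 10-minute window.
offsets = spread_offsets(500, 600)
print(len(offsets), offsets[:3])
```

With these assumed numbers, one action starts every 1.2 seconds instead of all 500 hitting the Instance Group Manager at once.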
 

Comment 1 by s...@google.com, Mar 14 2017

VM creation is already rate limited at 100 per instance group manager per minute. This was done to avoid overwhelming the cr-puppet token server.

I understand GCE is worried about our deletion rate as well.

Comment 2 by s...@google.com, Mar 22 2017

Cloud Platform thinks they've solved the mass deletion problem and we don't need to do anything on our end for it.
We are still stressing our own systems when mass respawning bots, in particular Puppet and Isolate.

Comment 4 by s...@google.com, Mar 23 2017

Cc: vadimsh@chromium.org mar...@chromium.org
Creations are already limited to 100 per minute per managed instance group (we have one managed instance group). Do we need to limit that even further for our own systems' sake?

Comment 5 by s...@google.com, Mar 23 2017

Oh, actually we have two managed instance groups because each is limited to 1000 and we need 1300 instances.

That means it's possible for us to spike to 200 creations per minute.
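The arithmetic behind those figures, as a quick sketch. The constants come from the comments above; the 500-VM burst is the "extra VMs" case discussed in this thread, and the assumption that creations split evenly across groups is mine.

```python
import math

INSTANCES_NEEDED = 1300            # fleet size quoted above
MAX_INSTANCES_PER_GROUP = 1000     # per-group size limit quoted above
CREATIONS_PER_GROUP_PER_MIN = 100  # existing creation rate limit

# Number of managed instance groups needed to hold the fleet.
groups = math.ceil(INSTANCES_NEEDED / MAX_INSTANCES_PER_GROUP)

# Worst-case creation rate if every group creates at its limit simultaneously.
peak_per_minute = groups * CREATIONS_PER_GROUP_PER_MIN

# Minutes needed to create 500 VMs at that peak rate (assumes an even split).
minutes_for_500 = math.ceil(500 / peak_per_minute)

print(groups, peak_per_minute, minutes_for_500)
```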
The rate of change should be as smooth as possible. Are there any downsides to making lease durations randomized in Swarming? It seems to be a simple change.
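For illustration only, randomized lease durations might look roughly like the sketch below. This is not Swarming's code; `jittered_lease_seconds`, the 10% jitter, and the one-hour base lease are all hypothetical.

```python
import random


def jittered_lease_seconds(base_seconds, jitter_fraction=0.1, rng=random):
    """Scale base_seconds by a random factor in [1 - jitter, 1 + jitter].

    Hypothetical helper: `rng` can be the random module or a random.Random
    instance, which makes the result reproducible in tests.
    """
    factor = 1.0 + rng.uniform(-jitter_fraction, jitter_fraction)
    return base_seconds * factor


# Example: ten leases with an assumed base duration of one hour.
rng = random.Random(0)
leases = [jittered_lease_seconds(3600, 0.1, rng) for _ in range(10)]
print([round(s) for s in leases])
```

Leases requested at the same moment would then expire at slightly different times rather than all at once.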

Comment 7 by s...@google.com, Mar 23 2017

It doesn't fix the problem. The 810 always-on leases already have their lease end times randomized, because they have naturally spread out across several hours due to random delays in the provisioning steps (and there are a lot of provisioning steps). Adding an additional randomization factor should be unnecessary.

The 500 extra VMs are the problem, but their lease duration is not the issue. At 9pm tonight, when the 500 extra VMs all expire at once, 500 will be deleted (no longer a big deal for Cloud Platform), but we won't have 500 creations, because GCE Backend only maintains 10% more than the number of leased VMs. Since only 810 VMs are leased overnight, 500 VMs are deleted but only 81 replacements are made. Presumably we can handle a single burst of 81 creations per minute.

The problem arises tomorrow morning at 7am when Swarming asks for 500 more VMs. GCE Backend will supply the 81 it held overnight right away. Then with 891 leases it will spin up 89 more. Once those are leased (980 total leased) it will bring up 98 more. Once those are leased (1078 total) it will bring up at most 108 (if they are in two different managed instance groups). Once those are leased (1186) it will bring up another 117. Once leased (1303) it will bring up 7 more to reach the configured maximum of 1310.

It's these hundreds being brought up every minute that are liable to cause problems.
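The ramp-up described above can be approximated with a small simulation. This is a sketch under stated assumptions, not GCE Backend code: `simulate_ramp_up` is a hypothetical name, the 10% spare is rounded to the nearest whole VM, every spare VM is leased before the next batch is created, and the per-group creation rate limit is ignored. The exact batch sizes depend on how the backend really rounds, so the later steps can differ by a VM or two from the figures quoted in the comment.

```python
def simulate_ramp_up(leased, demand, max_vms, spare_percent=10):
    """Return (leased_total, new_batch_size) for each creation burst.

    Assumed model: the backend keeps spare_percent extra VMs on top of the
    leased count (rounded to nearest), capped so the fleet never exceeds
    max_vms. Each step, all spare VMs get leased, then a new spare batch
    is created.
    """
    def spare_for(count):
        wanted = (count * spare_percent + 50) // 100  # integer round-half-up
        return min(wanted, max_vms - count)

    steps = []
    spare = spare_for(leased)  # VMs held in reserve before demand rises
    while leased < demand and spare > 0:
        leased += spare            # every spare VM is leased
        spare = spare_for(leased)  # backend creates the next batch
        if spare > 0:
            steps.append((leased, spare))
    return steps


# Overnight: 810 leased. Morning: demand rises by 500, fleet capped at 1310.
steps = simulate_ramp_up(leased=810, demand=1310, max_vms=1310)
for leased_total, batch in steps:
    print(f"{leased_total} leased -> create {batch}")
```

Under these assumptions, the 81 VMs held overnight are handed out first and the remaining 419 arrive in five back-to-back bursts, which is the pattern the comment is warning about.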

The deletions might have been an issue under the old scheme where at UTC midnight we actually had to delete the VMs and recreate them all in the middle of the day, but the new scheme that went into effect today doesn't require that.

Comment 8 by estaab@chromium.org, Jun 21 2017

Owner: smut@chromium.org
Status: Assigned (was: Untriaged)
Is this still an issue?

Comment 9 by s...@google.com, Jun 21 2017

Cc: -smut@chromium.org
Status: WontFix (was: Assigned)
See #7, I think this is wontfix.
