Issue 620534

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Limit rate of VM respawns and limit how many VMs can respawn at the same time

Project Member Reported by vadimsh@chromium.org, Jun 16 2016

Issue description

It looks like Machine Provider likes to respawn VMs all at once:
1. All 3 MP VMs assigned to chromium-swarm-dev were respawned within the same minute: https://screenshot.googleplex.com/aMdRuUPsfCp.png
2. A chunk of the log showing prod VM respawns, also within the same minute: https://screenshot.googleplex.com/pCJWkGLwQHd.png

This is suboptimal for 3 reasons:
1. When respawns happen, Swarming suddenly loses a large chunk of capacity (because lots of VMs go offline all at once).
2. Load on various bootstrap-related services becomes spiky (the CA master suddenly receives tons of certificates to sign; the static IP assigner hits datastore QPS limits; isolate server QPS skyrockets because there are lots of VMs with a cold cache, etc).
3. If a new VM image is bad (e.g. it can't boot the swarming bot), all VMs merrily go offline at the same time.

Is it possible to restrict respawns somehow? For example, allow no more than X respawns per time interval (e.g. no more than 5 machines per minute).
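
For illustration, the "no more than X respawns per time interval" idea could be sketched as a sliding-window limiter. This is not Machine Provider code: the class name RespawnLimiter, its parameters, and the caller-supplied clock are all hypothetical, and the 5-per-minute figure is just the example from the description.

```python
import collections


class RespawnLimiter(object):
    """Hypothetical helper: allows at most max_respawns per sliding window."""

    def __init__(self, max_respawns, window_secs):
        self.max_respawns = max_respawns
        self.window_secs = window_secs
        # Timestamps of respawns that were allowed, oldest first.
        self._stamps = collections.deque()

    def try_acquire(self, now):
        """Returns True if a respawn may start at time `now` (in seconds)."""
        # Forget respawns that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_secs:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_respawns:
            return False  # Budget for this window is exhausted.
        self._stamps.append(now)
        return True


limiter = RespawnLimiter(max_respawns=5, window_secs=60)
allowed = [limiter.try_acquire(now=0) for _ in range(7)]
```

With these numbers the first 5 requests at t=0 are allowed and the remaining 2 are refused; requests are allowed again once the earlier ones are 60 seconds old. A refused VM would simply be retried on a later pass, which also bounds how many VMs are offline at once if a new image turns out to be bad.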
 
Project Member

Comment 1 by bugdroid1@chromium.org, Jun 29 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/1487682e535e2425d22a6d7fd9997b7826474430

commit 1487682e535e2425d22a6d7fd9997b7826474430
Author: smut <smut@google.com>
Date: Wed Jun 29 23:33:49 2016

Allow no more than 100 instances to be created on each resizing run

Together with the cron configuration for the resize job, this will limit the rate at which instances are created.

BUG= 620534 

Review-Url: https://codereview.chromium.org/2110253003

[modify] https://crrev.com/1487682e535e2425d22a6d7fd9997b7826474430/appengine/gce-backend/instance_group_managers.py
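
A minimal sketch of the per-run cap described in the commit message above, assuming the resize job computes a new target size on each run. The function and constant names are hypothetical and not taken from instance_group_managers.py; only the limit of 100 comes from the commit message. The sketch covers growth only, since shrinking is not discussed here.

```python
# From the commit message: at most 100 instances created per resizing run.
MAX_CREATED_PER_RUN = 100


def capped_target_size(current_size, desired_size,
                       max_step=MAX_CREATED_PER_RUN):
    """Returns the size to request on this run (hypothetical helper)."""
    if desired_size <= current_size:
        # Nothing to create; this sketch leaves the group untouched.
        return current_size
    # Grow toward the desired size, but by no more than max_step per run.
    return min(desired_size, current_size + max_step)


# Simulate successive runs growing a group from 0 to 250 instances.
sizes = []
size = 0
for _ in range(4):
    size = capped_target_size(size, 250)
    sizes.append(size)
```

Here sizes ends up as [100, 200, 250, 250]. The effective creation rate is then at most max_step divided by the interval between runs, which is why the commit message mentions the cron configuration.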

Project Member

Comment 2 by bugdroid1@chromium.org, Jul 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8

commit f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8
Author: smut <smut@google.com>
Date: Wed Jul 20 21:18:57 2016

Look at the current number of idle VMs during resize logic

targetSize includes all VMs that are being created, deleted, etc. By looking only at idle VMs we actually wait for them to be created before flooding GCE with another increase request.

BUG= 620534 

Review-Url: https://codereview.chromium.org/2163413002

[modify] https://crrev.com/f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8/appengine/gce-backend/instance_group_managers.py
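
The effect of counting idle VMs instead of relying on targetSize can be sketched as follows. This is an assumption-laden illustration, not the actual change: it assumes each managed instance is a dict with a currentAction field and that "idle" means currentAction == 'NONE'; the function names and the max_step default are hypothetical.

```python
def count_idle(managed_instances):
    """Counts instances with no operation in progress (assumed: 'NONE')."""
    return sum(
        1 for inst in managed_instances
        if inst.get('currentAction') == 'NONE')


def target_size_from_idle(managed_instances, desired_size, max_step=100):
    """Returns the size to request, or None if no growth is needed."""
    idle = count_idle(managed_instances)
    if desired_size <= idle:
        return None
    # Base the request on VMs that actually exist, not on targetSize,
    # which already includes VMs that are still being created.
    return min(desired_size, idle + max_step)


creating = [{'currentAction': 'CREATING'} for _ in range(100)]
ready = [{'currentAction': 'NONE'} for _ in range(100)]

while_creating = target_size_from_idle(creating, desired_size=250)
after_created = target_size_from_idle(ready, desired_size=250)
```

While the first 100 VMs are still being created, the requested size stays at 100, so no further increase is sent. Once they are idle, the next run asks for 200.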

Status: Available (was: Untriaged)
Project Member

Comment 4 by sheriffbot@chromium.org, Sep 4 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 5 by s...@google.com, Sep 5 2017

Cc: -smut@chromium.org
Labels: -Hotlist-Recharge-Cold
Owner: smut@chromium.org
Status: Fixed (was: Untriaged)
This was done a long time ago.
