Limit rate of VM respawns and limit how many VMs can respawn at the same time
Issue description

It looks like Machine Provider tends to respawn VMs all at once:
1. All 3 MP VMs assigned to chromium-swarm-dev were respawned within the same minute: https://screenshot.googleplex.com/aMdRuUPsfCp.png
2. A chunk of the log of prod VM respawns, also within the same minute: https://screenshot.googleplex.com/pCJWkGLwQHd.png

This is suboptimal for 3 reasons:
1. When respawns happen, Swarming suddenly loses a large chunk of capacity (because lots of VMs go offline all at once).
2. Spiky load on various bootstrap-related services (the CA master suddenly receives tons of certificates to sign; the static IP assigner hits datastore QPS limits; isolate server QPS skyrockets because there are lots of VMs with a cold cache, etc.).
3. If a new VM image is bad (e.g. it can't boot the swarming bot), all VMs merrily go offline at the same time.

Is it possible to restrict respawns somehow? For example, allow no more than X respawns per time interval (e.g. no more than 5 machines per minute).
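To make the "no more than X respawns per time interval" request concrete, here is a minimal sketch of a sliding-window limiter. It is an illustration only, not Machine Provider code: the class name RespawnLimiter, its parameters, and the idea of passing the current time in explicitly are all assumptions made for this example.

```python
import collections


class RespawnLimiter:
    """Hypothetical helper: allows at most max_respawns per sliding window."""

    def __init__(self, max_respawns, window_seconds):
        self.max_respawns = max_respawns
        self.window_seconds = window_seconds
        # Timestamps of respawns that still count against the budget.
        self._timestamps = collections.deque()

    def try_respawn(self, now):
        """Returns True and records the respawn if the budget allows it at `now`."""
        # Drop respawns that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_respawns:
            return False
        self._timestamps.append(now)
        return True


# Example from the description: no more than 5 machines per minute.
limiter = RespawnLimiter(max_respawns=5, window_seconds=60)
allowed = [limiter.try_respawn(now=0) for _ in range(7)]
```

With this budget, the first five requests at the same instant are allowed and the rest are refused until the oldest respawn is at least 60 seconds old. A limiter like this would also bound how many VMs can be lost to a bad image before someone notices.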
Jul 20 2016
The following revision refers to this bug:
https://chromium.googlesource.com/external/github.com/luci/luci-py.git/+/f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8

commit f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8
Author: smut <smut@google.com>
Date: Wed Jul 20 21:18:57 2016

Look at the current number of idle VMs during resize logic

targetSize includes all VMs that are being created, deleted, etc. By looking only at idle VMs we actually wait for them to be created before flooding GCE with another increase request.

BUG= 620534
Review-Url: https://codereview.chromium.org/2163413002

[modify] https://crrev.com/f10f1d7b962785a2f2f3a44bbdc2657beb02cdf8/appengine/gce-backend/instance_group_managers.py
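One possible reading of the commit message above, sketched as a standalone function. This is not the code from instance_group_managers.py: the function name next_target_size and the max_step parameter are hypothetical, and the sketch only illustrates the idea of growing from the number of idle (fully created) VMs instead of from targetSize.

```python
def next_target_size(idle_count, desired_size, max_step):
    """Hypothetical sketch: returns the size to request on the next resize.

    idle_count: VMs with no pending action (already created and settled).
    desired_size: the size the group should eventually reach.
    max_step: assumed cap on how many VMs to add per resize request.
    """
    if idle_count >= desired_size:
        # Nothing to grow; never request more than the desired size.
        return desired_size
    # Grow from what actually exists, not from a target that already
    # counts VMs still being created or deleted.
    return min(desired_size, idle_count + max_step)


# Each request waits for the previous batch to become idle before growing.
sizes = [next_target_size(idle, 20, 5) for idle in (0, 5, 18, 25)]
```

Because the next request is derived from the idle count, a new increase is only issued once the previous batch has finished being created, which spreads VM creation over several resize cycles.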
Sep 1 2016
Sep 4 2017
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Sep 5 2017
This was done a long time ago.
Comment 1 by bugdroid1@chromium.org, Jun 29 2016