Machine Provider: get rid of early_release_secs
Issue description
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/proto/bots.proto?q=early_release_secs continuously trips users up. A recent case is a developer triggering a task with a hard_timeout of 7 hours. The user wonders why none of the 5 idle bots reaps the task; this is because they all synchronously recycle in 4 hours. The current state is that early_release_secs needs to be tuned against the hard_timeout value specified in the tasks, which is unrestricted. This means surprising runtime behavior! Instead, let's get rid of early_release_secs completely and have Swarming *extend* the lease by the hard_timeout value whenever a bot reaps a task that could potentially run past the lease termination timestamp. This will simplify Swarming's inner loop, at the cost of doing an RPC to extend the lease.
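A minimal sketch of the proposed rule, assuming the lease is extended to cover the task's deadline (now + hard_timeout). The function name `lease_after_reap` and its parameters are hypothetical and are not part of Swarming or Machine Provider; the actual extension would be an RPC, which is only marked by a comment here.

```python
from datetime import datetime, timedelta


def lease_after_reap(now, lease_expiration, hard_timeout_secs):
    """Return the lease expiration a bot should hold after reaping a task.

    Hypothetical helper for illustration only.
    """
    task_deadline = now + timedelta(seconds=hard_timeout_secs)
    if task_deadline > lease_expiration:
        # The task could outlive the lease: this is where the lease
        # extension RPC would be issued (assumed, not shown).
        return task_deadline
    # The task fits inside the current lease, so no RPC is needed.
    return lease_expiration


# The case from the description: the lease ends in 4 hours, the task has a
# 7 hour hard_timeout. The date itself is arbitrary.
now = datetime(2000, 1, 1, 12, 0, 0)
lease = now + timedelta(hours=4)
extended = lease_after_reap(now, lease, 7 * 3600)
```

With these inputs the lease is moved out to the task's deadline, whereas a 1 hour task would leave the lease untouched.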
,
Oct 10
(early click) One important thing: as one bot's lease is extended, this means another bot can expire early to reduce the total fleet to the configured number.
,
Oct 10
Why do we need to do anything special to expire another one earlier? Presumably, we aren't going to request a new MP bot unless we are under a certain number. So I think that, other than extending the existing lease, Swarming doesn't have to do anything.
,
Oct 10
Yeah, it should be fine.
,
Oct 11
I think we should consider this in the new design rather than trying to kludge in lease extension in the existing Machine Provider.
,
Oct 25
Still, I'd like to remove res.lease_expiration_ts being passed at this line: https://chromium.googlesource.com/infra/luci/luci-py/+/fcc43278e38f71f51aa20e463f0fae1485c8f73d/appengine/swarming/handlers_bot.py#557 It's impossible for the users to figure out why this is happening.
,
Oct 25
We can't eliminate that yet or the bot could be deleted mid-task. early_release_secs has to be replaced with extensions first. For the time being we could change all leases to have a huge early_release_secs, for example request the VM for 48 hours but make early_release_secs 24 hours. Then only tasks with a timeout > 24 hours would be rejected. This is the same problem we have now, but hopefully this threshold affects far fewer users until the problem is fixed for real.
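The threshold above can be illustrated with a small sketch. It assumes one reading of this thread: a bot stops taking new tasks once it is within early_release_secs of its lease expiration, and before that point it only reaps a task whose hard_timeout ends by the lease expiration. The function `can_reap` and its parameters are hypothetical, not Swarming's actual code.

```python
HOUR = 3600


def can_reap(now_secs, lease_expiration_secs, early_release_secs, hard_timeout_secs):
    """Return True if a leased bot would take the task (assumed model)."""
    release_at = lease_expiration_secs - early_release_secs
    if now_secs >= release_at:
        # Assumption: the bot is released early and takes no more work.
        return False
    # Assumption: the task must be able to finish before the lease expires.
    return now_secs + hard_timeout_secs <= lease_expiration_secs


# 48 hour lease with 24 hours of early release, as suggested above.
lease, early = 48 * HOUR, 24 * HOUR
last_moment = 24 * HOUR - 1  # one second before the early release point
fits = can_reap(last_moment, lease, early, 24 * HOUR)
too_long = can_reap(last_moment, lease, early, 25 * HOUR)
```

Under this model, any task with a hard_timeout of at most early_release_secs is accepted for as long as the bot is still taking work; longer tasks are accepted or rejected depending on how far into the lease the bot is.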
,
Oct 29
Probably worth doing exactly this right now: replace all 24h leases with 48h leases with 24h early_release_secs. This will make this problem less acute.
Comment 1 by mar...@chromium.org, Oct 10