
Issue 894201

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 888603




Machine Provider: get rid of early_release_secs

Project Member Reported by mar...@chromium.org, Oct 10

Issue description

https://cs.chromium.org/chromium/infra/luci/appengine/swarming/proto/bots.proto?q=early_release_secs
continuously trips users up.

A recent use case is a developer triggering a task with a hard_timeout of 7 hours. The user wondered why none of the 5 idle bots reaped the task: it is because they all recycle synchronously in 4 hours.

The current state is that early_release_secs needs to be tuned against the hard_timeout value specified in the tasks, which is unrestricted. This means surprising runtime behavior!
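To make the failure mode concrete, here is a minimal sketch of the kind of check involved. It assumes a simplified model in which a bot skips any task whose hard_timeout would run past the bot's lease deadline; `can_reap` and its parameters are hypothetical names, not the actual Swarming code.

```python
HOUR = 3600


def can_reap(now_ts: int, deadline_ts: int, hard_timeout_secs: int) -> bool:
    """Hypothetical check: would a task started now finish by the bot's deadline?"""
    return now_ts + hard_timeout_secs <= deadline_ts


# Five idle bots that all recycle in 4 hours, and one task with a 7 hour timeout.
now = 0
bot_deadlines = [now + 4 * HOUR] * 5
eligible = [d for d in bot_deadlines if can_reap(now, d, 7 * HOUR)]
print(len(eligible))  # prints 0: no bot can take the task
```

Under this model the task stays pending even though every bot is idle, which is the behavior the user found surprising.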

Let's get rid of early_release_secs completely, and have Swarming *extend* the lease for the hard_timeout value whenever a bot reaps a task that could potentially run past the lease termination timestamp. This will simplify Swarming's inner loop, at the cost of doing an RPC to extend the lease.
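A rough sketch of the proposed extension logic, under one possible reading of the proposal: the lease is pushed out to `now + hard_timeout` only when the task could outlive the current lease. The function name and return shape are hypothetical, and the extension RPC itself is not modeled here.

```python
HOUR = 3600


def lease_after_reap(now_ts: int, lease_expiration_ts: int, hard_timeout_secs: int):
    """Returns (new_expiration_ts, needs_extension_rpc) for a bot reaping a task now.

    Hypothetical helper: the lease is only extended when the task could run
    past the current lease termination timestamp.
    """
    task_deadline_ts = now_ts + hard_timeout_secs
    if task_deadline_ts <= lease_expiration_ts:
        # The task fits in the existing lease; no RPC is needed.
        return lease_expiration_ts, False
    # The task could outlive the lease; extend the lease to cover it.
    return task_deadline_ts, True


# A bot with 4 hours left reaps a 7 hour task: the lease must be extended.
print(lease_after_reap(0, 4 * HOUR, 7 * HOUR))
```

With this approach any bot can reap any task, and the RPC cost is paid only for tasks that do not already fit in the remaining lease.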

 
Cc: -smut@chromium.org s...@google.com
One important thing: as one bot's lease is extended, this means another bot can expire early to reduce the total fleet to the configured number.
Why do we need to do anything special to expire another one earlier?
Presumably, we aren't going to request a new MP bot unless we are under a certain number. So I think that, other than extending the existing lease, Swarming doesn't have to do anything.
Yeah, it should be fine.
I think we should consider this in the new design rather than trying to kludge in lease extension in the existing Machine Provider.
Still, I'd like to remove res.lease_expiration_ts being passed at this line:
https://chromium.googlesource.com/infra/luci/luci-py/+/fcc43278e38f71f51aa20e463f0fae1485c8f73d/appengine/swarming/handlers_bot.py#557

It's impossible for the users to figure out why this is happening.
We can't eliminate that yet or the bot could be deleted mid-task. early_release_secs has to be replaced with extensions first. For the time being we could change all leases to have a huge early_release_secs, for example request the VM for 48 hours but make early_release_secs 24 hours. Then only tasks with a timeout > 24 hours would be rejected. This is the same problem we have now, but hopefully this threshold affects far fewer users until the problem is fixed for real.
Probably worth doing exactly this right now: replace all 24h leases with 48h leases that have a 24h early_release_secs. This will make the problem less acute.
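The arithmetic behind the 48h/24h suggestion can be sketched as follows. This assumes a simplified model in which a bot stops taking new tasks early_release_secs before its lease expires, so a bot still in service always has at least early_release_secs of lease left; `lease_windows` is a hypothetical helper, not part of Machine Provider.

```python
HOUR = 3600


def lease_windows(lease_duration_secs: int, early_release_secs: int):
    """Returns (service_secs, guaranteed_secs) under the assumed model.

    service_secs: how long the bot accepts new tasks.
    guaranteed_secs: the largest hard_timeout that always fits, whenever the
    task is reaped during the service window.
    """
    if not 0 <= early_release_secs <= lease_duration_secs:
        raise ValueError("early_release_secs must be within the lease duration")
    return lease_duration_secs - early_release_secs, early_release_secs


service_secs, guaranteed_secs = lease_windows(48 * HOUR, 24 * HOUR)
print(service_secs // HOUR, guaranteed_secs // HOUR)  # prints 24 24
print(7 * HOUR <= guaranteed_secs)  # prints True: the 7 hour task always fits
```

The bot serves tasks for the same 24 hours as before, but the threshold at which tasks start being rejected moves up to 24 hours.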
