
Issue 913776


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature




Named cache "regeneration time" feature

Project Member Reported by iannu...@google.com, Dec 11

Issue description

It would be great to be able to extend the timeout of a task to account for cold/missing caches.

Scenario:
  * Task A is launched and asks for caches Big, Medium, Small, and Fresh.
  * It can land on a bot with any combination of these caches.
  * If it lands on a bot with all of them, then great! Its 'timeout' value is accurate and it can run for the specified amount of time.
  * If it lands on a bot with none of them, it's going to take a lot longer to run than a task that got lucky and landed on a machine with all of them :(. Right now the only way to account for this is to set ALL tasks' timeouts to the sum of the regeneration times for all caches.

Strawman proposal:
Add 'regeneration time' and 'effective lifespan' parameters to the named cache entries on Task. When the task is assigned to a bot, the bot takes the current age of each named cache actually mounted, and adds a linear interpolation of 'regeneration time' to the task's timeout value.

So, for a missing cache, or a cache whose age is > 'effective lifespan', the full 'regeneration time' will be added. For a 50%-aged cache (e.g. it's 50% of 'effective lifespan'), 50% of its 'regeneration time' would be added.
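To pin down the arithmetic, here is a minimal sketch of the proposed interpolation (the class and function names are made up for illustration, not existing Swarming APIs, and it assumes the bot knows the age of each mounted cache):

from dataclasses import dataclass

@dataclass
class NamedCacheSpec:
    name: str
    regeneration_time: float   # seconds added when the cache is fully cold
    effective_lifespan: float  # age (seconds) past which the cache counts as cold

def extra_timeout(spec, cache_age):
    # cache_age is None when the cache is not mounted on the bot.
    if cache_age is None or cache_age >= spec.effective_lifespan:
        return spec.regeneration_time  # missing or too old: full regeneration time
    # Linear interpolation: a 50%-aged cache gets 50% of its regeneration time.
    return spec.regeneration_time * (cache_age / spec.effective_lifespan)

def effective_execution_timeout(base_timeout, specs, cache_ages):
    # Timeout the bot would actually enforce for the assigned task.
    return base_timeout + sum(
        extra_timeout(s, cache_ages.get(s.name)) for s in specs)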

For bonus points we could also have other parameters like 'purge after effective lifespan' (which just treats the cache as missing if it's too old), or parameters to change the interpolation from linear to something else.

For many many bonus points, apply unicorn blood to the problem (aka "Machine Learning") to automagically figure this out.
 
Labels: -Type-Bug Type-Feature
Case study: the 'recipe bundler' is HIGHLY dependent on its caches being warm. When they're cold, the task easily takes 10x longer than it does with warm caches. It would be great to be able to set the expected timeout of the task to a small value (e.g. 1 minute), but when it runs on a bot with a cold cache, let it run for a much longer time (e.g. 10 minutes).
Components: -Infra>Platform>Swarming Infra>Platform>Buildbucket
I think swarming already has this because expiration is per slice
The expiration of a slice doesn't have anything to do with the timeout of the task, afaik?
Components: -Infra>Platform>Buildbucket Infra>Platform>Swarming
(the feature I'm requesting here cannot be done purely with buildbucket)
The proposal in the bug description would solve the problem 100%. I think the problem can be solved 90% with 10% of the work and complexity required by the proposal, to the extent that nobody cares about the last 10% of the problem.

Let's take the case study in #c2. A build prefers a cache. The build task has two slices, with and without the cache. The two slices can have different timeouts. I believe this fits well into how slices are supposed to be used.
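To illustrate the shape of that (abridged, and only a sketch: the field names loosely follow swarming's new-task request with task_slices / expiration_secs / execution_timeout_secs, the 'bundler' cache name is made up, and targeting warm bots via a 'caches' dimension is an assumption):

task_request = {
    'name': 'recipe-bundler',
    'task_slices': [
        {
            # Slice 1: prefer a bot that already has the named cache warm.
            'expiration_secs': 120,
            'properties': {
                'execution_timeout_secs': 60,   # ~1 minute with a warm cache
                'caches': [{'name': 'bundler', 'path': 'cache/bundler'}],
                'dimensions': [{'key': 'caches', 'value': 'bundler'}],
            },
        },
        {
            # Slice 2: fall back to any bot; assume a cold cache and allow longer.
            'expiration_secs': 3600,
            'properties': {
                'execution_timeout_secs': 600,  # ~10 minutes with a cold cache
                'caches': [{'name': 'bundler', 'path': 'cache/bundler'}],
                'dimensions': [],
            },
        },
    ],
}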

There is a chance the first slice expires, but a bot with the cache still shows up. In that case the timeout is incorrect. This is the last unsolved 10% of the problem. In the worst case, the timeout is larger than it should be. I claim that this case is rare, and thus unimportant. If it is not rare, then we must have other problems, because this is how task fallback in builds is supposed to work; if it doesn't work that way, that should be fixed first. Unfortunately, there is no data to back up or disprove this claim (which is a problem of its own).

Note that both the 90% and 100% solutions require the buildbucket config to learn a "cache regeneration time" concept, which means the 90% solution is extensible to 100% without breaking changes to the public API. So I suggest fixing 90% of the problem first:
- add a regeneration time to CacheEntry in the buildbucket config
- configure timeouts in task slices accordingly (see the sketch after this list). Note that buildbucket
  already generates multiple slices based on caches, so we are talking about ~10 LOC
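As a rough sketch of what that ~10 LOC could look like (regeneration_time_secs on CacheEntry is hypothetical; it does not exist in the config today):

def slice_execution_timeouts(base_timeout_secs, cache_entries):
    # Returns (warm_slice_timeout, fallback_slice_timeout) in seconds.
    # The warm-cache slice keeps the builder's base timeout; the fallback
    # slice, which may land on a bot without the caches, adds their
    # regeneration times.
    regen = sum(getattr(c, 'regeneration_time_secs', 0) or 0
                for c in cache_entries)
    return base_timeout_secs, base_timeout_secs + regen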

If that does not solve the problem sufficiently, then think more about this design.

----

As an aside: in general, the 100% solution changes the meaning of swarming_rpcs.TaskProperties.execution_timeout_secs to something like a "base timeout" that is used to compute the actual timeout. This is a breaking change.
At the very least, the swarming UI will need to be updated; otherwise a user would have to compute the actual timeout used for a task in their head. Since the actual timeout is computed on the bot, the 100% solution also involves updating bot->server communication.
To be explicit, the full unsolved 10% of the problem would be:
  * Task slices with this extra timeout being assigned to bots which actually have the caches available (that is, the timeout would be extended by accident). This could happen in the rather common scenario where the pending time is greater than the expiration of some of the task slices. This will result in many tasks being assigned to bots with warm caches, but with erroneously extended timeouts.
  * Timeout extension wouldn't be based on age of the mounted cache (i.e. a 24h old cache wouldn't get extra time to warm up vs a 10s old cache)
  * Clients of swarming other than buildbucket would have to reimplement this logic

I filed this feature request against swarming because I believe that management of swarming's named caches is a responsibility of swarming. Additionally, I don't believe the implementation of this feature in swarming would be that much work, and the need to resolve it isn't immediate, so we're not forced to take a shortcut. If this were a P1 and we were incapable of adding it to swarming, then sure, we could take a shortcut by adding a workaround to buildbucket, but I don't think we're in that position.

Since the actual state of the caches is only known to the swarming bot at the time the task is assigned, I think the logical place to put this is in swarming.

As for "I claim that this case is rare, and thus unimportant.", IME, rare, unaccounted-for scenarios make debugging a system at scale difficult. I would rather have very predictable behavior from the API. The extended timeout should be tied to the actual state of the caches on the bot that was actually selected, not the caller's best guess.

----

As for changing the meaning of execution_timeout_secs, I don't think this is true; tasks which don't set values for these extra fields would have exactly the same behavior they have today.

The net effect of adding this feature to swarming would be to improve the accuracy of the timeout information expressed by the task. The task execution_timeout_secs would reflect the time we expect the task to take with warm caches. The regen time on each cache would be extra time we add to account for that particular cache being cold. So if we have a task today with a timeout of 30m, with this feature, we'd have a timeout of 20m, plus cache warmup times that scale up to 10m. This allows us to put a tighter bound on the execution of the task in the happy case, but allows it to scale up to account for the 'I got unlucky' case.
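To make those numbers concrete (all values illustrative, and the interpolation is the one from the issue description):

BASE_TIMEOUT_SECS = 20 * 60          # expected runtime with warm caches
REGEN_TIME_SECS = 10 * 60            # extra time if the cache is fully cold
EFFECTIVE_LIFESPAN_SECS = 24 * 3600  # age at which the cache counts as cold

def enforced_timeout(cache_age_secs):
    # cache_age_secs is None when the cache is missing on the bot.
    if cache_age_secs is None or cache_age_secs >= EFFECTIVE_LIFESPAN_SECS:
        return BASE_TIMEOUT_SECS + REGEN_TIME_SECS          # unlucky bot: 30 minutes
    fraction = cache_age_secs / EFFECTIVE_LIFESPAN_SECS
    return BASE_TIMEOUT_SECS + REGEN_TIME_SECS * fraction   # partially-aged cache

print(enforced_timeout(None))        # 1800s -- missing cache
print(enforced_timeout(0))           # 1200s -- freshly regenerated cache
print(enforced_timeout(12 * 3600))   # 1500s -- half-aged cache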
Status: Available (was: Untriaged)
