Swarming task scheduler: Batches of swarming tasks should either all run, or all not run.
Issue description

Note: I'm not very familiar with swarming. Please let me know if I've made incorrect assumptions about how the system works.

In https://bugs.chromium.org/p/chromium/issues/detail?id=908551, we see that there was a period of time when there was insufficient capacity in the macOS swarming fleet. There are examples where we try to run layout tests across 12 shards. Some of the shards finish, and others are never started due to insufficient capacity, e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/194018

In this case, running a subset of the 12 shards is equivalent to running none of the shards, as the chromium test recipe does not gracefully handle partial results [where some tasks return no information]. Eventually, it would be nice if the chromium test recipe could gracefully recover from this failure mode. Regardless, the swarming task scheduler should try to fulfill all tasks associated with a single test suite rather than a subset of the tasks across many test suites.

Current behavior [hypothetical example]:
* 5 CQ try jobs dispatch 12 tasks each.
* Swarming has insufficient capacity, and runs 3 tasks for each try job. The results are eventually thrown away by the chromium recipe.

Behavior if chromium tests were changed to gracefully handle failing shards:
* 5 CQ try jobs dispatch 12 tasks each.
* Swarming has insufficient capacity, and runs 3 tasks for each try job.
* Each of the CQ try jobs handles the failing shards, and redispatches the 9 remaining tasks.
* Swarming has insufficient capacity, and runs 3 tasks for each try job.
* ... etc.; eventually the try jobs will either complete or time out, and the results will be thrown away.

Proposed behavior:
* 5 CQ try jobs dispatch 12 tasks each.
* Swarming has insufficient capacity. Swarming runs all 12 tasks for the first CQ try job, and 3 tasks for the second CQ try job. Future capacity that opens up will be used to run the second CQ try job.
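[Editor's sketch] Below is a minimal sketch of the all-or-nothing admission idea described above, using a toy capacity model. The TaskGroup/Scheduler names and the free-bot counter are illustrative only, not the real Swarming scheduler.

```python
# Toy all-or-nothing admission: a group of shards is started only when there
# is capacity for every task in the group. Names and capacity model are
# hypothetical, not real Swarming code.
from dataclasses import dataclass, field


@dataclass
class TaskGroup:
    group_id: str    # shared identifier for all shards of one test suite
    num_tasks: int   # e.g. 12 webkit_layout_tests shards


@dataclass
class Scheduler:
    free_bots: int
    queue: list = field(default_factory=list)  # pending TaskGroups, FIFO order

    def submit(self, group: TaskGroup) -> None:
        self.queue.append(group)

    def schedule(self) -> list:
        """Start whole groups only; never start a partial batch."""
        started = []
        for group in list(self.queue):
            if group.num_tasks <= self.free_bots:
                self.free_bots -= group.num_tasks
                self.queue.remove(group)
                started.append(group.group_id)
        return started


# With 15 free macOS bots and 5 try jobs of 12 shards each, only one whole
# job starts; the others wait instead of each running 3 useless shards.
s = Scheduler(free_bots=15)
for name in "ABCDE":
    s.submit(TaskGroup(group_id=name, num_tasks=12))
print(s.schedule())  # ['A']
```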
Comment 2, Nov 27
I'm not sure what you mean by "chromium test recipe could gracefully recover". Retrying jobs that failed due to being out of capacity likely just amplifies the failure. Given that we already buffer the jobs for a long time before failing, it's not clear to me that it would make sense to try again any time soon; I'd rather the recipe fail and recover at a higher layer (e.g., the CQ retrying after N hours, when it has a better view of the global situation) than keep those resources locked up.
Comment 3, Nov 27
A long time ago, child tasks had their priority bumped automatically. This had to be removed because something couldn't work with that scheme (I forget exactly what). In practice you'd want to do this manually inside the recipe, which requires knowing the priority at which the current task is running; that is not possible at the moment, and I'm not sure how to pass this information efficiently.
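[Editor's sketch] A tiny sketch of the manual bump being described, assuming the recipe could somehow learn its own priority (the missing piece noted above). The helper name and bump amount are made up for illustration.

```python
# Hypothetical helper for bumping child-task priority inside a recipe.
# Swarming priorities are integers where lower means more urgent, so a
# "bump" subtracts. parent_priority is the value the recipe cannot currently
# discover, which is the gap this comment points out.
def child_priority(parent_priority: int, bump: int = 10) -> int:
    """Return a priority for child tasks slightly more urgent than the parent."""
    return max(0, parent_priority - bump)


# e.g. a build running at priority 30 would trigger its shards at 20.
print(child_priority(30))  # 20
```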
Comment 4, Nov 27
Hm. I think there are several different proposals being floated around here, all of which could help improve CQ behavior when capacity is constrained. Not sure what the best forum is to discuss solutions. Let me try to clarify mine: When a CQ build [chromium test recipe] posts N tasks for a single test suite [e.g. 12 shards for webkit layout tests], they should all be given a unique group identifier. When swarming schedules tasks, it should either schedule all N tasks, or none of them. We should avoid situations where we schedule a subset of the N tasks, since that is likely just going to be wasted work.
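[Editor's sketch] A sketch of the client-side half of this proposal: every shard request for one suite carries the same group identifier. The "scheduling_group" tag name and the all-or-nothing handling on the server are hypothetical; Swarming does accept arbitrary key:value tags on tasks, but does not act on them this way today.

```python
# Hypothetical shard-request builder: attach one shared group id to all N
# shards of a suite so a group-aware scheduler could start all or none.
import uuid


def shard_requests(suite: str, num_shards: int) -> list:
    group_id = uuid.uuid4().hex
    return [
        {
            "name": f"{suite}:{i}",
            "shard_index": i,
            "tags": [f"scheduling_group:{group_id}"],  # hypothetical tag
        }
        for i in range(num_shards)
    ]


# 12 webkit_layout_tests shards, all sharing a single group identifier.
reqs = shard_requests("webkit_layout_tests", 12)
print(len({r["tags"][0] for r in reqs}))  # 1
```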
Comment 5, Nov 27
> In practice you'd want to do this manually inside the recipe.

Is this because the actual sharding is done client-side, and so swarming is actually getting 12 independent tasks, and not a batch of 12 related tasks?
Comment 6, Nov 27
Yes, and because you want to specify the priority accordingly. Swarming intentionally doesn't support what #c4 describes at the moment; that would essentially be a rewrite of the core scheduling. Also, I think your characterization is not very close to reality: assuming that triggering tasks is fast enough, the probability of N CQ tasks triggering close enough together in time to saturate capacity isn't that high. What will help is (ironically) triggering more tasks faster. There's work being done for this on the Go client.
Comment 7, Nov 27
Could you clarify why triggering tasks faster will help? How does swarming scheduling currently work? Is it FIFO? Let's say I have 5 CQ builds, which sequentially trigger 12 webkit_layout_test tasks each. [We can call them A1...A12, B1...B12, etc.] As capacity frees up, will the builds be run in A1...A12, B1...B12, C1...C12 order?
Comment 8, Nov 27
Independent of LIFO or FIFO, clustering of tasks (temporal proximity) will cause the tasks to be scheduled around the same time, based on available capacity. What we could ask for is triggering multiple tasks as a single RPC, forcing them to land as a single streak. That could be possible. My only fear is a higher-than-usual RPC failure rate, due to the much larger DB transactions needed.
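[Editor's sketch] To illustrate the "single streak" point: if two builds trigger their shards one RPC at a time and race each other, the queue can interleave them, splitting freed capacity between partially-run builds, whereas a single batched submission would keep one build's shards contiguous. The list-based queue below and the batching behavior are assumptions for illustration; today Swarming triggers one task per RPC, which is exactly what this comment proposes changing.

```python
# Toy model of queue ordering only; no real Swarming API is involved.
from itertools import chain, zip_longest

# Two builds, A and B, each with 12 shards (A1..A12, B1..B12).
builds = {name: [f"{name}{i}" for i in range(1, 13)] for name in "AB"}

# Per-shard triggering with the builds racing each other: the FIFO queue ends
# up interleaved, so freed capacity is split across both builds.
interleaved = [t for t in chain.from_iterable(zip_longest(*builds.values())) if t]

# Batched triggering (one RPC per build): each build's shards form one
# contiguous streak, so capacity finishes build A before touching build B.
batched = list(chain.from_iterable(builds.values()))

print(interleaved[:4])  # ['A1', 'B1', 'A2', 'B2']
print(batched[:4])      # ['A1', 'A2', 'A3', 'A4']
```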
Comment 9, Nov 27
I am not particularly interested in the proposal in #4, basically because of the rewrite required. Swarming scheduling is currently FIFO, but I'm trying to flip that to LIFO (https://bugs.chromium.org/p/chromium/issues/detail?id=901197). Doing so requires something similar to the graceful failure mode described here. I would like us to eventually be LIFO w/ short expiration timeouts & exponential backoff in trigger retries on the recipe side. This would require the ability to attempt to trigger tasks multiple times within each phase of a trybot's attempts (w/ patch, w/o patch, retry w/ patch), though.
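[Editor's sketch] A minimal sketch of the "short expiration + exponential backoff in trigger retries" shape on the recipe side. The trigger callable, timeout values, and retry counts are assumptions for illustration, not current recipe code.

```python
# Hypothetical recipe-side retry loop: trigger with a short expiration, and
# back off exponentially (with jitter) before re-triggering if the task
# expired without being picked up.
import random
import time


def trigger_with_backoff(trigger, max_attempts=5, base_delay=30.0, expiration=300):
    """trigger(expiration_secs) -> True if the task ran before expiring."""
    for attempt in range(max_attempts):
        if trigger(expiration):
            return True
        delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
        time.sleep(delay)
    return False


# Usage with a stand-in trigger that always hits the capacity wall:
# trigger_with_backoff(lambda exp: False, max_attempts=2, base_delay=0.01)
```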
Comment 1 by jbudorick@chromium.org, Nov 27
Status: Assigned (was: Untriaged)