Process swarming tasks in LIFO order
Issue description
Swarming currently processes tasks in order of priority and then timestamp (both ascending), which amounts to FIFO order within a priority band. When some segment of a swarming pool is overloaded, every task in that segment either completes after a long pending time or expires. Switching to LIFO order within priority bands would mean that the segment could always handle some load with low latency, processing the backlog when it gets the chance and letting it expire otherwise. Internally, see http://shortn/_kegq735sZX for more context.
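To make the difference concrete, here is a minimal sketch of the two orderings and their effect on a backlog. It is not the swarming implementation: `Task`, `queue_key`, `pick_order`, and `simulate` are hypothetical names, lower priority values are assumed to be more urgent, and the model is a single bot that only checks expiration when it reaches a task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int    # assumption: lower value = more urgent
    created_ts: int  # creation time in seconds


def queue_key(task, lifo):
    # Priority always sorts ascending. The timestamp sorts ascending for
    # FIFO and descending (negated) for LIFO within a priority band.
    ts = -task.created_ts if lifo else task.created_ts
    return (task.priority, ts)


def pick_order(tasks, lifo):
    """Returns task names in the order a bot would pick them up."""
    return [t.name for t in sorted(tasks, key=lambda t: queue_key(t, lifo))]


def simulate(tasks, lifo, now, run_secs, expiration_secs):
    """One bot drains a backlog; returns {name: pending_secs} for tasks that ran.

    Simplification: a task counts as expired if its pending time exceeds
    expiration_secs at the moment the bot would have picked it up.
    """
    ran = {}
    clock = now
    for task in sorted(tasks, key=lambda t: queue_key(t, lifo)):
        pending = clock - task.created_ts
        if pending > expiration_secs:
            continue  # expired before a bot reached it
        ran[task.name] = pending
        clock += run_secs
    return ran


if __name__ == "__main__":
    # Ten same-priority tasks created 10s apart, all still queued at t=100.
    backlog = [Task(str(ts), 10, ts) for ts in range(0, 100, 10)]
    for lifo in (False, True):
        label = "LIFO" if lifo else "FIFO"
        print(label, simulate(backlog, lifo, now=100, run_secs=30,
                              expiration_secs=120))
```

In this toy model, FIFO runs tasks whose pending time is already close to the expiration limit, while LIFO runs the newest tasks with short pending times and lets the oldest ones expire.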
Nov 12
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/319238f548e3227c2f3d064c62cbe09d96d30509

commit 319238f548e3227c2f3d064c62cbe09d96d30509
Author: John Budorick <jbudorick@chromium.org>
Date: Mon Nov 12 20:52:24 2018

swarming: process tasks in LIFO order.

Switching to this version of swarming is liable to temporarily confuse swarming as both old- and new-style queue numbers will exist in the service. Deploying it in the latter half of the year should be safer, though, as new tasks will have lower queue numbers than existing tasks.

Bug: 901197
Change-Id: I8e970dfaaf20c0723f32b4b84812f08b2ee7c2cc
Reviewed-on: https://chromium-review.googlesource.com/c/1312957
Commit-Queue: John Budorick <jbudorick@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/Design.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/Detailed-Design.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/User-Guide.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/proto/config.proto
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/proto/config_pb2.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_scheduler_test.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_to_run.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_to_run_test.py
Nov 19
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/19410bf24b3fa855975ec195d62fa65edd0d35b9

commit 19410bf24b3fa855975ec195d62fa65edd0d35b9
Author: John Budorick <jbudorick@google.com>
Date: Mon Nov 19 20:09:37 2018
Nov 20
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/9a265d0e852e59e8a88f29509f17d5589154f234

commit 9a265d0e852e59e8a88f29509f17d5589154f234
Author: John Budorick <jbudorick@google.com>
Date: Tue Nov 20 19:05:21 2018
Nov 21
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/9aa42c24d210a7862087fb702989836f7f99cd0c

commit 9aa42c24d210a7862087fb702989836f7f99cd0c
Author: John Budorick <jbudorick@google.com>
Date: Wed Nov 21 01:00:41 2018
Nov 21
#4 landed LIFO in chromium-swarm; #5 reverted it. I do think that LIFO is the way we want to go here, but it's also clear after today that the chromium_tests and/or swarming recipe modules will need to be revised to cooperate with it. As is, trybots need all of their test tasks to be picked up in order to be effective; with a backlog, FIFO guarantees that (with some delay) but LIFO doesn't. A revised implementation would likely have individual tasks expire much faster but would retry them on expiration inside each phase of a trybot's attempts.
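The retry-on-expiration idea could look roughly like the sketch below. This is only an illustration, not recipe module code: `run_phase`, the `trigger` callable, the `"EXPIRED"` and `"COMPLETED"` state strings, and `max_attempts` are all hypothetical. `trigger(name)` is assumed to start one task with a short expiration, block until it finishes or expires, and return its final state.

```python
def run_phase(task_names, trigger, max_attempts=3):
    """Triggers every task in a phase, re-triggering only those that expired.

    Returns {task_name: final_state}. Tasks that still expire after
    max_attempts are reported as "EXPIRED".
    """
    results = {}
    pending = list(task_names)
    for _ in range(max_attempts):
        if not pending:
            break
        still_expired = []
        for name in pending:
            state = trigger(name)  # hypothetical: runs one short-expiration task
            if state == "EXPIRED":
                still_expired.append(name)  # no bot picked it up; try again
            else:
                results[name] = state
        pending = still_expired
    for name in pending:
        results[name] = "EXPIRED"
    return results
```

With short expirations, a task that lands behind newer work under LIFO is dropped quickly and resubmitted as a new (and therefore front-of-queue) task, rather than waiting out a long expiration.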
Dec 19
Comment 1 by jbudorick@chromium.org, Nov 2