Issue 901197

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 916562




Process swarming tasks in LIFO order

Project Member Reported by jbudorick@chromium.org, Nov 2

Issue description

Swarming currently processes tasks in order of priority and then timestamp (both ascending) -- basically, FIFO order within a priority band. When some segment of a swarming pool is overloaded, all tasks in that segment will either complete with some large pending time or will expire.

Switching to LIFO order within priority bands would mean that the segment could always handle some load with low latency, processing the backlog when it gets the chance and letting it expire otherwise.
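To make the difference concrete, here is a minimal sketch of the two orderings. It is not the swarming implementation: the `Task` tuple, the field names, and the sample values are all hypothetical, and the only thing taken from the description above is that lower priority values are served first and that the timestamp breaks ties within a priority band.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    # Hypothetical stand-in for a pending task; not a swarming type.
    name: str
    priority: int    # lower value is served first
    created_ts: int  # creation time, arbitrary units


def fifo_order(tasks: List[Task]) -> List[Task]:
    # Priority ascending, then oldest first within the band.
    return sorted(tasks, key=lambda t: (t.priority, t.created_ts))


def lifo_order(tasks: List[Task]) -> List[Task]:
    # Priority ascending, then newest first within the band.
    return sorted(tasks, key=lambda t: (t.priority, -t.created_ts))


backlog = [
    Task("old", 20, 100),
    Task("mid", 20, 200),
    Task("new", 20, 300),
    Task("urgent", 10, 250),
]

fifo_names = [t.name for t in fifo_order(backlog)]
lifo_names = [t.name for t in lifo_order(backlog)]
print(fifo_names)  # ['urgent', 'old', 'mid', 'new']
print(lifo_names)  # ['urgent', 'new', 'mid', 'old']
```

In both cases the priority band is respected; only the order inside the band flips. Under LIFO the most recently created task in a band is picked up first, so the oldest tasks are the ones that wait and may expire when the segment stays overloaded.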

Internally, see http://shortn/_kegq735sZX for more context.
 
Labels: -Pri-3 Pri-2
Project Member

Comment 2 by bugdroid1@chromium.org, Nov 12

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/319238f548e3227c2f3d064c62cbe09d96d30509

commit 319238f548e3227c2f3d064c62cbe09d96d30509
Author: John Budorick <jbudorick@chromium.org>
Date: Mon Nov 12 20:52:24 2018

swarming: process tasks in LIFO order.

Switching to this version of swarming is liable to temporarily confuse
swarming as both old- and new-style queue numbers will exist in the
service. Deploying it in the latter half of the year should be safer,
though, as new tasks will have lower queue numbers than existing tasks.

Bug: 901197
Change-Id: I8e970dfaaf20c0723f32b4b84812f08b2ee7c2cc
Reviewed-on: https://chromium-review.googlesource.com/c/1312957
Commit-Queue: John Budorick <jbudorick@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/Design.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/Detailed-Design.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/doc/User-Guide.md
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/proto/config.proto
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/proto/config_pb2.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_scheduler_test.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_to_run.py
[modify] https://crrev.com/319238f548e3227c2f3d064c62cbe09d96d30509/appengine/swarming/server/task_to_run_test.py
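The commit message above does not spell out how queue numbers are encoded. One hypothetical reading, sketched below, is that an old-style number grows with the time elapsed since the start of the year while a new-style number shrinks with it. Under that assumption, the remark about deploying in the latter half of the year follows from simple arithmetic. The names `old_style`, `new_style`, and `YEAR_SPAN` are invented for illustration and are not taken from `task_to_run.py`.

```python
from datetime import datetime

# Assumption: tenths of a second in a non-leap year, used as the flip point.
YEAR_SPAN = 365 * 24 * 60 * 60 * 10


def _ticks_since_year_start(ts: datetime) -> int:
    year_start = datetime(ts.year, 1, 1)
    return int((ts - year_start).total_seconds() * 10)


def old_style(ts: datetime) -> int:
    # Hypothetical FIFO-style number: later tasks get larger numbers.
    return _ticks_since_year_start(ts)


def new_style(ts: datetime) -> int:
    # Hypothetical LIFO-style number: later tasks get smaller numbers.
    return YEAR_SPAN - _ticks_since_year_start(ts)


# Late in the year, a new-style number sorts before an existing old-style one.
late_existing = old_style(datetime(2018, 11, 12, 20, 0, 0))
late_incoming = new_style(datetime(2018, 11, 12, 21, 0, 0))
print(late_incoming < late_existing)

# Early in the year the relationship is reversed.
early_existing = old_style(datetime(2018, 2, 1, 0, 0, 0))
early_incoming = new_style(datetime(2018, 2, 1, 1, 0, 0))
print(early_incoming > early_existing)
```

If the real encoding resembles this, a deployment after mid-year would let new tasks sort ahead of the old-style backlog, which matches the commit message's claim that new tasks will have lower queue numbers than existing tasks.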

Project Member

Comment 3 by bugdroid1@chromium.org, Nov 19

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/19410bf24b3fa855975ec195d62fa65edd0d35b9

commit 19410bf24b3fa855975ec195d62fa65edd0d35b9
Author: John Budorick <jbudorick@google.com>
Date: Mon Nov 19 20:09:37 2018

Project Member

Comment 4 by bugdroid1@chromium.org, Nov 20

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/9a265d0e852e59e8a88f29509f17d5589154f234

commit 9a265d0e852e59e8a88f29509f17d5589154f234
Author: John Budorick <jbudorick@google.com>
Date: Tue Nov 20 19:05:21 2018

Project Member

Comment 5 by bugdroid1@chromium.org, Nov 21

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/9aa42c24d210a7862087fb702989836f7f99cd0c

commit 9aa42c24d210a7862087fb702989836f7f99cd0c
Author: John Budorick <jbudorick@google.com>
Date: Wed Nov 21 01:00:41 2018

#4 landed LIFO in chromium-swarm; #5 reverted it.

I do think that LIFO is the way we want to go here, but it's also clear after today that the chromium_tests and/or swarming recipe modules will need to be revised to cooperate with it. As is, trybots need all of their test tasks to be picked up in order to be effective; with a backlog, FIFO guarantees that (with some delay), but LIFO doesn't. A revised implementation would likely have individual tasks expire much faster and, on expiration, retry them within each phase of a trybot's attempts.
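A rough sketch of what retry-on-expiration inside one phase could look like, purely as an illustration of the idea in this comment. It is not recipe module code: `run_phase`, `run_once`, the `"EXPIRED"` state string, and the attempt limit are all hypothetical.

```python
from typing import Callable, List


def run_phase(task_names: List[str],
              run_once: Callable[[str, int], str],
              max_attempts: int = 3) -> List[str]:
    """Runs each task, re-triggering it when it expires unclaimed.

    Returns the names of tasks that still expired after max_attempts.
    """
    still_expired = []
    for name in task_names:
        for attempt in range(max_attempts):
            if run_once(name, attempt) != "EXPIRED":
                break  # Picked up by a bot; stop retrying this task.
        else:
            still_expired.append(name)
    return still_expired


def fake_run_once(name: str, attempt: int) -> str:
    # Fake scheduler: one task expires once, another always expires.
    if name == "browser_tests" and attempt == 0:
        return "EXPIRED"
    if name == "always_starved":
        return "EXPIRED"
    return "COMPLETED"


leftover = run_phase(
    ["unit_tests", "browser_tests", "always_starved"], fake_run_once)
print(leftover)  # ['always_starved']
```

With short expirations, each retry re-enters the queue as a fresh task, so under LIFO it lands at the front of its priority band instead of waiting behind the backlog.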
Blocking: 916562
