Issue 888560

Issue metadata

Status: Fixed
Owner:
Closed: Sep 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature

Blocked on:
issue 888603



Swarming: account for bots that recently shut down as capacity

Project Member Reported by mar...@chromium.org, Sep 24

Issue description

Right now the capacity is cached for 1 minute after the last bot ping. In practice, there are two use cases that fail in this situation:
- MachineProvider bots that are recycled simultaneously, which destroys capacity for N minutes.
- Android bots that temporarily show up as Linux due to issue 801679.

The workaround is to extend the "capacity" bit after the last bot ping for a few additional minutes. This should cover most cases at a small cost: when a bot breaks for real, tasks triggered in this window may stay pending for significantly longer than expected.

Implementation:
- task_queues.py: make set_has_capacity() cache for more than 61 seconds.
- bot_management.py: make has_capacity() search for bots which had capacity recently via a BotEvent query.
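
The two implementation steps above could be sketched roughly as follows. This is a minimal in-memory illustration, not the actual luci-py code: the CapacityTracker class, the record_bot_event() helper, and both constants are hypothetical, and a dict and a list stand in for the cache and the BotEvent query.

```python
import datetime

# Hypothetical value: how long a positive capacity answer stays cached.
CAPACITY_CACHE_SECS = 5 * 60
# Hypothetical value: how far back to look for bots that were alive recently.
RECENT_BOT_WINDOW_SECS = 10 * 60


class CapacityTracker(object):
  """In-memory stand-in for the capacity cache and the BotEvent query."""

  def __init__(self):
    self._cache = {}   # frozenset of 'key:value' -> expiry datetime
    self._events = []  # list of (timestamp, frozenset of 'key:value')

  def record_bot_event(self, now, dimensions_flat):
    """Remembers that a bot with these dimensions pinged at 'now'."""
    self._events.append((now, frozenset(dimensions_flat)))

  def set_has_capacity(self, now, dimensions_flat,
                       seconds=CAPACITY_CACHE_SECS):
    """Caches a positive capacity answer for 'seconds' seconds."""
    expiry = now + datetime.timedelta(seconds=seconds)
    self._cache[frozenset(dimensions_flat)] = expiry

  def has_capacity(self, now, dimensions_flat):
    """Returns True if a bot can, or recently could, run these dimensions."""
    wanted = frozenset(dimensions_flat)
    expiry = self._cache.get(wanted)
    if expiry is not None and now < expiry:
      return True
    # Cache miss: look for a recent event from a bot whose dimensions
    # include every requested 'key:value' pair.
    cutoff = now - datetime.timedelta(seconds=RECENT_BOT_WINDOW_SECS)
    for ts, dims in self._events:
      if ts >= cutoff and wanted.issubset(dims):
        self.set_has_capacity(now, wanted)
        return True
    return False
```

With these assumed values, a bot that last pinged three minutes ago still counts as capacity for a task requesting a subset of its dimensions, while a bot last seen thirty minutes ago does not.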
 
Blockedon: 888603
Project Member

Comment 2 by bugdroid1@chromium.org, Sep 24

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/328430c35c590dcf37ce728bcb7996666b4d2c85

commit 328430c35c590dcf37ce728bcb7996666b4d2c85
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Sep 24 18:34:52 2018

[swarming] enable indexing for BotEvent.dimensions_flat

This index will be needed to find 'historical capacity' from bots that used to
exist recently.

First, an index needs to be created, which takes time to deploy and
materialize, hence a separate CL to deploy this safely first.

R=jchinlee@chromium.org

Bug:  888560 
Change-Id: I4b21e941f3b67402c1177cef1eb99bfa5eecc8f9
Reviewed-on: https://chromium-review.googlesource.com/1239027
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/328430c35c590dcf37ce728bcb7996666b4d2c85/appengine/swarming/index.yaml
[modify] https://crrev.com/328430c35c590dcf37ce728bcb7996666b4d2c85/appengine/swarming/server/bot_management.py

Project Member

Comment 3 by bugdroid1@chromium.org, Sep 25

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/6739d67d654e2e714cb2836c9f2d1807faf9ae5f

commit 6739d67d654e2e714cb2836c9f2d1807faf9ae5f
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 25 17:18:08 2018

[swarming] enable capacity to outlive the last bot

Apply the bot_death_timeout_secs value when calculating the duration up
to which we consider there is still capacity.

Improve unit tests.

R=jchinlee@chromium.org

Bug:  888560 
Change-Id: I42165f6416380d5509a8e3d05d87fb706b7e684d
Reviewed-on: https://chromium-review.googlesource.com/1239029
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Jao-ke Chin-Lee <jchinlee@chromium.org>
Auto-Submit: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>

[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/task_queues.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/task_queues_test.py
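
As a rough illustration of the commit above, the capacity window can be derived from bot_death_timeout_secs as in the sketch below. The function names are hypothetical and this is not the code from the CL; it only shows the cutoff arithmetic under the assumption that a bot counts as capacity until the timeout has elapsed since it was last seen.

```python
import datetime


def capacity_cutoff(now, bot_death_timeout_secs):
  """Returns the oldest last-seen timestamp that still counts as capacity."""
  return now - datetime.timedelta(seconds=bot_death_timeout_secs)


def counts_as_capacity(last_seen_ts, now, bot_death_timeout_secs):
  """Returns True if a bot last seen at 'last_seen_ts' still counts."""
  return last_seen_ts >= capacity_cutoff(now, bot_death_timeout_secs)
```

Tying the window to the same timeout that declares a bot dead keeps the two notions consistent: capacity lapses at the point where the bot would be considered dead anyway.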

Status: Fixed (was: Assigned)
Deployed. That should be good enough; if not, let's revisit.