Swarming: account for bots that recently shut down as capacity
Issue description

Right now the capacity is cached for 1 minute after the last bot ping. In practice there are two use cases that fail in this situation:
- MachineProvider bots that are recycled simultaneously, which destroys capacity for N minutes.
- Android bots that temporarily show up as Linux due to issue 801679.

The workaround is to extend the "capacity" bit after the last bot ping for a few additional minutes. This should cover most cases, with only a small drawback: when a bot breaks for real, tasks that are triggered in this window may be stuck as pending for significantly longer than expected.

Implementation:
- task_queues.py: make set_has_capacity() cache for more than 61 seconds.
- bot_management.py: make has_capacity() search for bots which had capacity recently via a BotEvent query.
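To illustrate the proposal, here is a rough sketch of a "capacity bit" that outlives the last bot ping by a configurable TTL. It is not the Swarming code: the CapacityCache class, its in-memory dict, the injected clock and the string dimensions key are all assumptions made up for this example. Only the names set_has_capacity() and has_capacity() come from the description above.

```python
import time


class CapacityCache:
    """Toy in-memory stand-in for the 'has capacity' bit.

    Hypothetical helper for illustration only; the real server keeps this
    state elsewhere and keys it differently.
    """

    def __init__(self, ttl_secs, clock=time.time):
        self._ttl_secs = ttl_secs
        self._clock = clock  # injectable so tests can control time
        self._expiry = {}    # dimensions key -> expiry timestamp

    def set_has_capacity(self, dimensions_key):
        # Called on every bot ping: the bit stays set for ttl_secs after it.
        self._expiry[dimensions_key] = self._clock() + self._ttl_secs

    def has_capacity(self, dimensions_key):
        # True while the most recent ping is less than ttl_secs old.
        expiry = self._expiry.get(dimensions_key)
        return expiry is not None and self._clock() < expiry
```

With a TTL of 61 seconds, a fleet that is recycled all at once looks like zero capacity roughly a minute after the last ping. A TTL of a few extra minutes bridges that gap, at the cost of keeping tasks pending longer when the bots are really gone.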
Sep 24
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/328430c35c590dcf37ce728bcb7996666b4d2c85

commit 328430c35c590dcf37ce728bcb7996666b4d2c85
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Sep 24 18:34:52 2018

[swarming] enable indexing for BotEvent.dimensions_flat

This index will be needed to find 'historical capacity' from bots that used to exist recently. First, an index needs to be created, which takes time to deploy and materialize, hence a separate CL to deploy this safely first.

R=jchinlee@chromium.org
Bug: 888560
Change-Id: I4b21e941f3b67402c1177cef1eb99bfa5eecc8f9
Reviewed-on: https://chromium-review.googlesource.com/1239027
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/328430c35c590dcf37ce728bcb7996666b4d2c85/appengine/swarming/index.yaml
[modify] https://crrev.com/328430c35c590dcf37ce728bcb7996666b4d2c85/appengine/swarming/server/bot_management.py
Sep 25
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/6739d67d654e2e714cb2836c9f2d1807faf9ae5f

commit 6739d67d654e2e714cb2836c9f2d1807faf9ae5f
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 25 17:18:08 2018

[swarming] enable capacity to outlive the last bot

Apply the bot_death_timeout_secs value when calculating the duration up to which we consider there is still capacity.

Improve unit tests.

R=jchinlee@chromium.org
Bug: 888560
Change-Id: I42165f6416380d5509a8e3d05d87fb706b7e684d
Reviewed-on: https://chromium-review.googlesource.com/1239029
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Jao-ke Chin-Lee <jchinlee@chromium.org>
Auto-Submit: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>

[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/task_queues.py
[modify] https://crrev.com/6739d67d654e2e714cb2836c9f2d1807faf9ae5f/appengine/swarming/server/task_queues_test.py
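For readers following along, the "historical capacity" check can be pictured roughly as below. This is only a sketch under assumptions, not the code from the CL: had_recent_capacity() is a made-up helper, the events are plain (timestamp, dimensions) tuples instead of datastore BotEvent entities, and the dimensions are assumed to be flat "key:value" strings. Only the bot_death_timeout_secs name comes from the commit message.

```python
import datetime


def had_recent_capacity(events, required_dimensions, now, bot_death_timeout_secs):
    """Returns True if any recent bot event matched all requested dimensions.

    Hypothetical helper for illustration:
    - events: iterable of (timestamp, dimensions) tuples, where dimensions is
      a collection of flat "key:value" strings (assumed format).
    - required_dimensions: the flat "key:value" strings a task asks for.
    """
    # Events older than the death timeout no longer count as capacity.
    cutoff = now - datetime.timedelta(seconds=bot_death_timeout_secs)
    required = set(required_dimensions)
    return any(
        ts >= cutoff and required.issubset(dims) for ts, dims in events
    )
```

The point of using the death timeout as the window is that a bot which stopped pinging is still counted as capacity until the server would declare it dead anyway.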
Sep 25
Deployed. That should be good enough; if not, let's revisit.
Comment 1 by mar...@chromium.org, Sep 24