Issue 843655

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 839173



Swarming: when a leased VM fails to come online fast enough, the associated terminate task lacks capacity

Project Member Reported by mar...@chromium.org, May 16 2018

Issue description

In check_for_connection(), in the failure case, either:
- remove the create_terminate_task() call, or
- make this specific create_terminate_task() call not look for task queues.

I think the second option is the better implementation; it takes a bit more work, but it shouldn't be too much.

This is a significant blocker for issue 839173.

Ref:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/server/lease_management.py
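
A minimal sketch of the second option, for illustration only. This is not the luci-py code: the Scheduler class, the check_capacity flag, and the signatures of create_terminate_task() and check_for_connection() below are all hypothetical stand-ins. It shows a terminate task being scheduled for a bot that never connected, by skipping the capacity lookup that would otherwise reject it.

```python
import dataclasses


class NoCapacityError(Exception):
    """Raised when no known bot could run the requested task."""


@dataclasses.dataclass
class Scheduler:
    # Hypothetical stand-in for the server's view of capacity: the IDs of
    # bots that have connected at least once.
    known_bots: set = dataclasses.field(default_factory=set)
    pending: list = dataclasses.field(default_factory=list)

    def schedule(self, bot_id, kind, check_capacity=True):
        # A leased VM that never came online is not in known_bots, so the
        # default check would reject its terminate task.
        if check_capacity and bot_id not in self.known_bots:
            raise NoCapacityError(bot_id)
        self.pending.append((bot_id, kind))


def create_terminate_task(scheduler, bot_id, check_capacity=True):
    # Hypothetical signature; only the flag matters for this sketch.
    scheduler.schedule(bot_id, 'terminate', check_capacity=check_capacity)


def check_for_connection(scheduler, bot_id, connected, timed_out):
    if connected:
        return 'connected'
    if timed_out:
        # Failure case: zero capacity is expected here, so skip the check.
        create_terminate_task(scheduler, bot_id, check_capacity=False)
        return 'gave_up'
    return 'waiting'
```

With the default check_capacity=True, the same call for an unknown bot raises NoCapacityError, which is the symptom this issue describes.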
 
Project Member

Comment 1 by bugdroid1@chromium.org, May 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/c9816d05fe7f18abcb55a4e8775e2172c708171f

commit c9816d05fe7f18abcb55a4e8775e2172c708171f
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu May 17 18:16:11 2018

[swarming] fix terminate task when MP lease VM not coming up fast enough

This is done by skipping the capacity check in this case. Add unit test.

Tweak on the BotInfo query to do a count(limit=1) instead of a get(), hopefully
it's faster. (?)

Bug: 843655
Change-Id: Ib4c9e765616a21f7c846624f0aacc3d1f27484f7
Reviewed-on: https://chromium-review.googlesource.com/1062626
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/c9816d05fe7f18abcb55a4e8775e2172c708171f/appengine/swarming/handlers_endpoints.py
[modify] https://crrev.com/c9816d05fe7f18abcb55a4e8775e2172c708171f/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/c9816d05fe7f18abcb55a4e8775e2172c708171f/appengine/swarming/server/lease_management.py
[modify] https://crrev.com/c9816d05fe7f18abcb55a4e8775e2172c708171f/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/c9816d05fe7f18abcb55a4e8775e2172c708171f/appengine/swarming/server/task_scheduler_test.py

Comment 2 by s...@google.com, May 17 2018

The point of the termination task is that if the bot connects in the few moments after Swarming gave up on it but before MP could delete it, we need to ensure it does not accept anything other than a termination task. The termination task should be created while the bot isn't there, so that it runs if the bot appears and expires if it doesn't.

If I understand correctly, you're still scheduling the task, but you're not bothering to check for capacity (since it's expected that there will be zero capacity). Is that right?
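
The behavior described in this comment can be sketched roughly as follows. Everything here is an assumption made for illustration (the TerminateTask class, the pending dict keyed by bot ID, and next_action() are hypothetical, not Swarming's data model): a terminate task is queued while the bot is absent, is handed to the bot if it polls before the deadline, and is dropped as expired otherwise.

```python
import dataclasses


@dataclasses.dataclass
class TerminateTask:
    bot_id: str
    expires_at: float  # Timestamp after which the task is no longer handed out.


def next_action(pending, bot_id, now):
    """Decides what happens to the terminate task queued for bot_id at time now.

    pending is a hypothetical dict mapping bot ID to its queued TerminateTask.
    """
    task = pending.pop(bot_id, None)
    if task is None:
        return 'no_task'
    if now >= task.expires_at:
        # The bot never showed up in time; the task simply expires.
        return 'expired'
    # The bot connected late but before expiry: it only gets to terminate.
    return 'terminate'
```

For example, with a task expiring at t=10, a bot that polls at t=5 is told to terminate, while the first check at or after t=10 finds the task expired.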

Comment 3 by mar...@chromium.org, May 17 2018

Status: Fixed (was: Assigned)
Exactly.

It's deployed to prod now and this worked; it removed the errors.