Allow cancelling a running Swarming task |
|||||||||||||||
Issue descriptionActual: It is currently possible to cancel a Swarming task but only if it is PENDING. If the task is running, was_running is True and the task is not canceled. There's value in only canceling when the task is PENDING but it would be valuable to cancel a task that is RUNNING. See CancelResponse in https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/swarming_rpcs.py
,
Feb 12 2018
In fairness, other options are: - /vm-power - gcloud compute instances reset Both do not require ssh access.
,
Mar 19 2018
Issue 823426 has been merged into this issue.
,
Mar 19 2018
Raising priority for ChromeOS builds.
,
Mar 19 2018
,
Mar 19 2018
,
Mar 20 2018
Unfortunately, workarounds from #2 do not work for baremetal hosts. Also they set status of the task to "BOT_DIED", whereas IMHO correct status should be "CANCELLED".
,
Mar 20 2018
What exactly does buildbot do today to cancel an inprogress build? I suspect this is raising signals against it's child process and expecting them to propagate as needed. It seems reasonable to do the same thing.
,
Mar 20 2018
buildbot builds (in general) tend to leak processes when getting terminated. Troopers have wasted a lot of hours battling these over the years (especially on windows). It's possible to implement a child process correctly so it doesn't leak, but at least in chromium we haven't found this to be generally true.
I think the initial implementation of this would need to raise signals against the task and then reboot the host. A better implementation would be to try to contain the process tree of the task (e.g. in a Job Object on windows or cgroup on Linux. OS X is behind the times here, I don't think it has anything like a cgroup).
Strawman proposal:
* Enhance cancel API to allow it to work for tasks in any state
* Bot during heartbeat observes task state. If it's CANCELLED, it then initiates teardown:
* SIGTERM
* wait || sleep 30
* reboot
We can enhance this with Job Objects or other process container mechanisms later.
,
Mar 20 2018
I'm not sure how this should interact with named caches though.
,
Mar 20 2018
Maybe we just add an 'unmanaged' flag on each named cache; if the bot dies or has a cancelled task, swarming will unmount these unmanaged caches and leave them for the next task to clean up. We probably won't set this in chromium, but I'd expect that CrOS would want it for e.g. its chroot.
,
Mar 20 2018
Actually, it could be worthwhile to set this flag on chromium's `git` cache; we don't necessarily want it to be purged when the bot times out or crashes.
,
Mar 20 2018
Vadim suggests (and I agree) that instead of modifying the state to CANCELLED directly, there should be a new state boolean "you should be cancelled". The state would continue to reflect the actual task state as it does today.
,
Mar 20 2018
,
Mar 21 2018
Maybe CANCELLING or BEING_CANCELLED?
,
Mar 21 2018
,
Mar 21 2018
A short summary of the flow I prefer is: - cancel API explicitly handles state=RUNNING - set the bit to kill the task - during /poll, the bot gets the signal and kills the process via normal SIGTERM dance like how it's done with timeout - on last update, state is set to KILLED A new state KILLING could be interesting but we can live without it initially, like how TIMING_OUT doesn't exist.
,
Mar 21 2018
cgroups is issue 764493 and is tracked independently.
,
Mar 21 2018
c#17 SGTM. I think it is better to add a boolean field "is being killed", not a whole new state. This is what Vadim suggested in c#13
,
Mar 21 2018
,
Mar 21 2018
,
Mar 21 2018
Added blocker to CQSets due to timeframe needs.
,
Mar 21 2018
The first step would be to add the new KILLED state in the API and client tool, if someone wants to take on that immediately. Otherwise I'll do in the coming days.
,
Mar 22 2018
,
Mar 23 2018
WIP is at https://chromium-review.googlesource.com/970906, it's not too complicated after all.
,
Mar 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/db862dad8c255bcf88a9cb71d64344cb49a29342 commit db862dad8c255bcf88a9cb71d64344cb49a29342 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Mar 23 20:37:13 2018 [client] enable collecting task with no timeout - Add "./swarming.py collect --timeout -1" In this case, it immediately returns the task's state. It's going to be useful to test a pending task. - Add a unit test for --timeout -1. - Make --timeout 0 work the same as if it was not specified. - Verify --timeout argument has reasonable value. - Add a unit test for cancel, the post data is ignored, so remove it. - Add local_smoke_test for cancel. - Tweaks to local_smoke_test to log more and allow running task_collect multiple times. R=vadimsh@chromium.org Bug: 754390 Change-Id: I5a2704a7b99bed5057d43222aab061e8d51db490 Reviewed-on: https://chromium-review.googlesource.com/977004 Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/appengine/swarming/local_smoke_test.py [modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/client/swarming.py [modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/client/tests/swarming_test.py
,
Mar 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/d15fc3980b4541f6f6b384ba167db4af19179625 commit d15fc3980b4541f6f6b384ba167db4af19179625 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Mon Mar 26 22:28:58 2018 [swarming] overhaul handlers unit tests Will add new tests soon and wanted to clean it up a bit first. Removed 460 lines from the test code by creating reusable expectations. No functional change. Bug: 754390 , 781021 Change-Id: I1f8a5b84312c21e6d4578cae1beb61d281009a4e Reviewed-on: https://chromium-review.googlesource.com/979275 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Robbie Iannucci <iannucci@chromium.org> [modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_bot_test.py [modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_endpoints_test.py [modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_test.py [modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/test_env_handlers.py
,
Mar 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5 commit ef85e9fb88ca82e60f66a43f6fb4946043aa07d5 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Tue Mar 27 19:14:01 2018 [swarming] Add support to cancel a running task. - Add server side support to permits to cancel a task while running. - Make CLI to optionally permit the user to cancel a running task. - Rename the event task_canceled to task_killed, since the event task_canceled can't have ever occured and is now a misnomer. - Change bot task_runner to do the proper SIGTERM/SIGKILL dance on must_stop=True instead of just sending SIGKILL right away. This is confirmed by smoke test. - Fix local_smoke_test expectation when the isolate server is not using the default port. Useful when a task goes astray or slow tasks were triggered by error, or in case of emergency load shedding. The web UI hasn't been updated to show the button for running task. For now a user has to use either the CLI client swarming.py or the API directly. Monitoring for KILLED tasks is not sent yet. Bug: 754390 Change-Id: I806dd0290543a4eae322e30f630403dbc674eb81 Reviewed-on: https://chromium-review.googlesource.com/970906 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Robbie Iannucci <iannucci@chromium.org> [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/event_mon_metrics.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_backend.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_bot.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_bot_test.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_endpoints.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_endpoints_test.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/local_smoke_test.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/bot_management.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_result.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_result_test.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_scheduler.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_scheduler_test.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/swarming_bot/bot_code/task_runner.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/swarming_rpcs.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/test_env_handlers.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/client/swarming.py [modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/client/tests/swarming_test.py
,
Mar 28 2018
,
Mar 28 2018
,
Mar 28 2018
,
Mar 28 2018
The core feature is deployed everywhere. Filed follow up bugs for the remaining tasks. |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by serg...@chromium.org
, Feb 12 2018