New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 754390 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature

Blocking:
issue 826709
issue 826706
issue 826708



Sign in to add a comment

Allow cancelling a running Swarming task

Project Member Reported by mar...@chromium.org, Aug 10 2017

Issue description

Actual:
It is currently possible to cancel a Swarming task but only if it is PENDING. If the task is running, was_running is True and the task is not canceled.

There's value in only canceling when the task is PENDING but it would be valuable to cancel a task that is RUNNING.

See CancelResponse in https://chromium.googlesource.com/infra/luci/luci-py.git/+/master/appengine/swarming/swarming_rpcs.py
 
This would provide parity with Buildbot (when accessed from uberchromegw), which allowed to stop builds. It is also very useful for running debugging builds: one often knows that the build is incorrect early on and does not need to wait until the build can finish. It's especially painful if the number of machines that can run the build is limited and the user can't trigger another debug build until the current build has finished.

The current workaround is to ssh into the machine and reboot it.

Comment 2 by mar...@chromium.org, Feb 12 2018

In fairness, other options are:
- /vm-power
- gcloud compute instances reset

Both do not require ssh access. 

Comment 3 by mar...@chromium.org, Mar 19 2018

Cc: akes...@chromium.org jinjingl@chromium.org no...@chromium.org mar...@chromium.org
 Issue 823426  has been merged into this issue.

Comment 4 by mar...@chromium.org, Mar 19 2018

Labels: -Pri-3 Pri-2
Raising priority for ChromeOS builds.

Comment 5 by mar...@chromium.org, Mar 19 2018

Cc: dgarr...@chromium.org
Labels: Swarming
Unfortunately, workarounds from #2 do not work for baremetal hosts. Also they set status of the task to "BOT_DIED", whereas IMHO correct status should be "CANCELLED".
What exactly does buildbot do today to cancel an inprogress build? I suspect this is raising signals against it's child process and expecting them to propagate as needed.

It seems reasonable to do the same thing.

buildbot builds (in general) tend to leak processes when getting terminated. Troopers have wasted a lot of hours battling these over the years (especially on windows). It's possible to implement a child process correctly so it doesn't leak, but at least in chromium we haven't found this to be generally true.

I think the initial implementation of this would need to raise signals against the task and then reboot the host. A better implementation would be to try to contain the process tree of the task (e.g. in a Job Object on windows or cgroup on Linux. OS X is behind the times here, I don't think it has anything like a cgroup).

Strawman proposal:
  * Enhance cancel API to allow it to work for tasks in any state
  * Bot during heartbeat observes task state. If it's CANCELLED, it then initiates teardown:
    * SIGTERM
    * wait || sleep 30
    * reboot

We can enhance this with Job Objects or other process container mechanisms later.
I'm not sure how this should interact with named caches though.
Maybe we just add an 'unmanaged' flag on each named cache; if the bot dies or has a cancelled task, swarming will unmount these unmanaged caches and leave them for the next task to clean up. We probably won't set this in chromium, but I'd expect that CrOS would want it for e.g. its chroot.
Actually, it could be worthwhile to set this flag on chromium's `git` cache; we don't necessarily want it to be purged when the bot times out or crashes.
Vadim suggests (and I agree) that instead of modifying the state to CANCELLED directly, there should be a new state boolean "you should be cancelled". The state would continue to reflect the actual task state as it does today.
Cc: jclinton@chromium.org
Maybe CANCELLING or BEING_CANCELLED?
Cc: pprabhu@chromium.org
A short summary of the flow I prefer is:
- cancel API explicitly handles state=RUNNING
- set the bit to kill the task
- during /poll, the bot gets the signal and kills the process via normal SIGTERM dance like how it's done with timeout
- on last update, state is set to KILLED

A new state KILLING could be interesting but we can live without it initially, like how TIMING_OUT doesn't exist.
cgroups is issue 764493 and is tracked independently.

Comment 19 by no...@chromium.org, Mar 21 2018

c#17 SGTM. I think it is better to add a boolean field "is being killed", not a whole new state. This is what Vadim suggested in c#13

Comment 20 by efoo@chromium.org, Mar 21 2018

Labels: LUCI-Chromium-CQSets LUCI-Blocker-Chromium-CQSets

Comment 21 by efoo@chromium.org, Mar 21 2018

Labels: -LUCI-Chromium-CQSets -LUCI-Blocker-Chromium-CQSets LUCI-Blocker-Clients LUCI-Clients

Comment 22 by efoo@chromium.org, Mar 21 2018

Labels: LUCI-Blocker-Chromium-CQSets
Added blocker to CQSets due to timeframe needs. 
The first step would be to add the new KILLED state in the API and client tool, if someone wants to take on that immediately. Otherwise I'll do in the coming days.
Cc: -mar...@chromium.org
Labels: -Pri-2 Pri-1
Owner: mar...@chromium.org
Status: Assigned (was: Available)
WIP is at https://chromium-review.googlesource.com/970906, it's not too complicated after all.
Project Member

Comment 26 by bugdroid1@chromium.org, Mar 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/db862dad8c255bcf88a9cb71d64344cb49a29342

commit db862dad8c255bcf88a9cb71d64344cb49a29342
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Mar 23 20:37:13 2018

[client] enable collecting task with no timeout

- Add "./swarming.py collect --timeout -1"
  In this case, it immediately returns the task's state. It's going to be useful
  to test a pending task.
- Add a unit test for --timeout -1.
- Make --timeout 0 work the same as if it was not specified.
- Verify --timeout argument has reasonable value.
- Add a unit test for cancel, the post data is ignored, so remove it.
- Add local_smoke_test for cancel.
- Tweaks to local_smoke_test to log more and allow running task_collect multiple
times.

R=vadimsh@chromium.org
Bug:  754390 
Change-Id: I5a2704a7b99bed5057d43222aab061e8d51db490
Reviewed-on: https://chromium-review.googlesource.com/977004
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/appengine/swarming/local_smoke_test.py
[modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/client/swarming.py
[modify] https://crrev.com/db862dad8c255bcf88a9cb71d64344cb49a29342/client/tests/swarming_test.py

Project Member

Comment 27 by bugdroid1@chromium.org, Mar 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d15fc3980b4541f6f6b384ba167db4af19179625

commit d15fc3980b4541f6f6b384ba167db4af19179625
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Mar 26 22:28:58 2018

[swarming] overhaul handlers unit tests

Will add new tests soon and wanted to clean it up a bit first.
Removed 460 lines from the test code by creating reusable expectations.

No functional change.

Bug:  754390 , 781021 
Change-Id: I1f8a5b84312c21e6d4578cae1beb61d281009a4e
Reviewed-on: https://chromium-review.googlesource.com/979275
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_bot_test.py
[modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_endpoints_test.py
[modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/handlers_test.py
[modify] https://crrev.com/d15fc3980b4541f6f6b384ba167db4af19179625/appengine/swarming/test_env_handlers.py

Project Member

Comment 28 by bugdroid1@chromium.org, Mar 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5

commit ef85e9fb88ca82e60f66a43f6fb4946043aa07d5
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Mar 27 19:14:01 2018

[swarming] Add support to cancel a running task.

- Add server side support to permits to cancel a task while running.
- Make CLI to optionally permit the user to cancel a running task.
- Rename the event task_canceled to task_killed, since the event task_canceled
  can't have ever occured and is now a misnomer.
- Change bot task_runner to do the proper SIGTERM/SIGKILL dance on
  must_stop=True instead of just sending SIGKILL right away. This is confirmed
  by smoke test.
- Fix local_smoke_test expectation when the isolate server is not using the
  default port.

Useful when a task goes astray or slow tasks were triggered by error, or in case
of emergency load shedding.

The web UI hasn't been updated to show the button for running task. For now a
user has to use either the CLI client swarming.py or the API directly.

Monitoring for KILLED tasks is not sent yet.

Bug:  754390 
Change-Id: I806dd0290543a4eae322e30f630403dbc674eb81
Reviewed-on: https://chromium-review.googlesource.com/970906
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/event_mon_metrics.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_bot.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_bot_test.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_endpoints.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/handlers_endpoints_test.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/local_smoke_test.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_result.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_result_test.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/server/task_scheduler_test.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/swarming_bot/bot_code/task_runner.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/swarming_rpcs.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/appengine/swarming/test_env_handlers.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/client/swarming.py
[modify] https://crrev.com/ef85e9fb88ca82e60f66a43f6fb4946043aa07d5/client/tests/swarming_test.py

Blocking: 826706
Blocking: 826708
Blocking: 826709
Status: Fixed (was: Assigned)
The core feature is deployed everywhere. Filed follow up bugs for the remaining tasks.

Sign in to add a comment