Issue 863684

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Swarming: sweep pending tasks to mark as NO_RESOURCE when count of bot which could run the task goes to 0.

Project Member Reported by nednguyen@chromium.org, Jul 14

Issue description

Even with soft device affinity, we sometimes trigger a task and the bot dies afterwards. That leads to tasks pending for 6 to 7 hours. It would be great if tasks with 0 available bots were ended early.

Example: https://chrome-swarming.appspot.com/task?id=3eafc5e5476c0610&refresh=10&show_raw=1

*I am guessing this problem is specific to perf, since we are the only ones who trigger tasks with a specific bot id?
 
Status: Available (was: Untriaged)
Summary: Swarming: sweep pending tasks to mark as NO_RESOURCE when count of bot which could run the task goes to 0. (was: Stop swarming jobs that have no available bots early)
The bot can disappear for two reasons:
1. Bot is manually deleted (which includes MP bots)
2. Bot is marked as dead via the cron job

In both cases, Swarming should sweep for pending tasks (TaskToRun) whose target bot capacity goes to 0 (has_capacity() becomes False), so that these tasks can be aborted immediately with NO_RESOURCE.

For 1, there isn't a central place yet:
https://chromium.googlesource.com/infra/luci/luci-py/+/65de3aef5845afd951e64077df8c50d11cc9a2b1/appengine/swarming/handlers_endpoints.py#818
https://chromium.googlesource.com/infra/luci/luci-py/+/65de3aef5845afd951e64077df8c50d11cc9a2b1/appengine/swarming/server/lease_management.py#950

For 2:
https://chromium.googlesource.com/infra/luci/luci-py/+/65de3aef5845afd951e64077df8c50d11cc9a2b1/appengine/swarming/server/bot_management.py#515
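
To make the proposed sweep concrete, here is a minimal in-memory sketch of the idea, not the actual luci-py code. The classes Bot, PendingTask and State, and the functions bot_matches() and sweep_pending_tasks(), are hypothetical names made up for illustration. The has_capacity() below is a simplified stand-in for the real check and assumes a task can run if at least one live bot carries every dimension the task requests.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    NO_RESOURCE = "NO_RESOURCE"


@dataclass
class Bot:
    # Hypothetical model: dimension key -> set of values the bot advertises.
    bot_id: str
    dimensions: dict
    is_dead: bool = False


@dataclass
class PendingTask:
    # Hypothetical model: dimension key -> single value the task requires.
    task_id: str
    dimensions: dict
    state: State = State.PENDING


def bot_matches(bot, required):
    """True if the bot advertises every dimension value the task requires."""
    return all(
        value in bot.dimensions.get(key, set())
        for key, value in required.items()
    )


def has_capacity(required, bots):
    """Simplified stand-in: True if any live bot could run the task."""
    return any(not b.is_dead and bot_matches(b, required) for b in bots)


def sweep_pending_tasks(tasks, bots):
    """Mark pending tasks as NO_RESOURCE when no live bot can run them.

    Meant to be called right after a bot is deleted or marked dead.
    Returns the ids of the tasks that were aborted.
    """
    aborted = []
    for task in tasks:
        if task.state is State.PENDING and not has_capacity(task.dimensions, bots):
            task.state = State.NO_RESOURCE
            aborted.append(task.task_id)
    return aborted


# Usage: one task is pinned to a specific bot id, the other only to a pool.
bots = [
    Bot("build1", {"id": {"build1"}, "pool": {"perf"}}),
    Bot("build2", {"id": {"build2"}, "pool": {"perf"}}),
]
tasks = [
    PendingTask("t1", {"id": "build1"}),
    PendingTask("t2", {"pool": "perf"}),
]

bots[0].is_dead = True  # e.g. the cron job marked build1 as dead
aborted = sweep_pending_tasks(tasks, bots)
```

In this sketch only the task pinned to the dead bot's id is aborted. The pool-wide task stays pending because another live bot can still pick it up.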


---

Logs, for posterity
https://console.cloud.google.com/logs/viewer?project=chrome-swarming&minLogLevel=0&expandAll=false&timestamp=2018-07-16T17:00:14.979000000Z&customFacets=&limitCustomFacetWidth=true&interval=CUSTOM&resource=gae_app&logName=projects%2Fchrome-swarming%2Flogs%2Fappengine.googleapis.com%252Frequest_log&dateRangeStart=2018-07-14T02:00:00.000Z&dateRangeEnd=2018-07-15T04:00:00.000Z&scrollTimestamp=2018-07-14T10:03:00.134780000Z&filters=text:3eafc5e5476c061
Thanks Marc for pointing out the solution. Can someone from the infra team help with implementing this? It would help a lot with speeding up the perf waterfall cycle time.
I guess #2 could be implemented first, since it's what is occurring in your case. It shouldn't be too hard, since it's already happening in a cron job.

As for when it's going to be implemented, I can't give you a definite timetable yet.