
Issue 876143


Issue metadata

Status: Fixed
Owner:
Closed: Sep 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Swarming: Poor polling performance with high backlog of expired tasks

Project Member Reported by iannucci@chromium.org, Aug 21

Issue description

It appears that all Swarming bots are currently colliding in the poll handler, each trying and failing to expire the same tasks.
 
Project Member

Comment 2 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/a0ce96d265a688d06d2d00089be05d91fb50e2ad

commit a0ce96d265a688d06d2d00089be05d91fb50e2ad
Author: Robert Iannucci <iannucci@chromium.org>
Date: Tue Aug 21 01:54:55 2018

We also deployed a hotpatch (https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1182785) to make the poll handler skip all expiration attempts. This was able to get the bots to a healthy state.

Everything seems fine now, so going to revert the config change landed in comment 2.
Project Member

Comment 4 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/2958dc03b9097aac0bfb19f186c9d47d0d773fbc

commit 2958dc03b9097aac0bfb19f186c9d47d0d773fbc
Author: Robbie Iannucci <iannucci@google.com>
Date: Tue Aug 21 02:17:48 2018

Blocking: -876034
Labels: -Pri-0 Pri-1
No longer an emergency, but we need a long-term fix for this, so demoting to P1.
Labels: chops-pm-91
Owner: mar...@chromium.org
Status: Assigned (was: Untriaged)
Summary: Swarming: Poor polling performance with high backlog of expired tasks (was: Poor scheduler performance in swarming with large numbers of multi-slice tasks)
Status: Fixed (was: Assigned)
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 5

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb

commit b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 04 19:02:12 2018

[swarming] fix inline expiration behavior during bot task polling

- Correctly look at the negative cache for inline expiration:
  - Do it before fetching the TaskToRun entity, which is a DB GET.
  - Check the return value of the negative cache memcache.add(); if it
    failed, expiration should be skipped when doing it inline, as
    another poll handler is likely already expiring it.
- Limit inline expiration to 5 tasks per polling. This is to make sure
  the poll request doesn't go over the 60 seconds limit.

Bug:  876143 
Change-Id: Id7e5a5efe0deeefdebfe0fba676381a1182a5944
Reviewed-on: https://chromium-review.googlesource.com/1197124
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>

[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_scheduler_test.py
[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_to_run.py
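
For readers following the fix, here is a rough sketch of the two ideas in the commit message above: claim an expired task through an add-only negative cache before doing the DB GET, and cap inline expirations at 5 per poll. This is not the code from task_scheduler.py or task_to_run.py. `FakeMemcache`, `poll`, `fetch`, `expire`, the key prefix and the 15 second TTL are all hypothetical names and values chosen for illustration; only the add-returns-False behavior and the limit of 5 come from the commit message.

```python
import time

# Per-poll cap on inline expirations; the value 5 is from the commit message.
MAX_INLINE_EXPIRATIONS = 5


class FakeMemcache:
    """Hypothetical in-memory stand-in for memcache.

    add() succeeds only when the key is absent (or its TTL has lapsed),
    which is the property the negative cache relies on.
    """

    def __init__(self):
        self._data = {}

    def add(self, key, value, ttl=15, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # someone else already claimed this key
        self._data[key] = (value, now + ttl)
        return True


def poll(cache, candidates, now, fetch, expire):
    """Return the first non-expired task id, expiring a few stale ones inline.

    candidates: iterable of (task_id, expires_at) pairs that are cheap to scan.
    fetch: hypothetical callable performing the expensive DB GET.
    expire: hypothetical callable that marks the fetched entity as expired.
    """
    expired_inline = 0
    for task_id, expires_at in candidates:
        if expires_at > now:
            return task_id
        if expired_inline >= MAX_INLINE_EXPIRATIONS:
            continue  # cap reached; leave the rest of the backlog for later
        # Consult the negative cache BEFORE the DB GET. A failed add() means
        # another poll handler is likely already expiring this task.
        if not cache.add("expire:" + task_id, True, now=now):
            continue
        expire(fetch(task_id))
        expired_inline += 1
    return None
```

With a shared cache, a second bot polling right after the first skips the tasks the first one already claimed, without fetching them, and works on the next few in the backlog instead of colliding on the same entities.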
