Swarming: Poor polling performance with high backlog of expired tasks
Issue description: It appears that all Swarming bots are currently colliding in the poll handler, trying and failing to expire the same tasks.
,
Aug 21
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/a0ce96d265a688d06d2d00089be05d91fb50e2ad

commit a0ce96d265a688d06d2d00089be05d91fb50e2ad
Author: Robert Iannucci <iannucci@chromium.org>
Date: Tue Aug 21 01:54:55 2018
,
Aug 21
We also deployed a hotpatch (https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1182785) to make the poll handler skip all expiration attempts. This got the bots back to a healthy state. Everything seems fine now, so I'm going to revert the config change that landed in comment 2.
,
Aug 21
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/2958dc03b9097aac0bfb19f186c9d47d0d773fbc

commit 2958dc03b9097aac0bfb19f186c9d47d0d773fbc
Author: Robbie Iannucci <iannucci@google.com>
Date: Tue Aug 21 02:17:48 2018
,
Aug 21
arg... and apparently viceroy doesn't match swarming. https://viceroy.corp.google.com/chrome_infra/Appengine/swarming?duration=1h has wildly different (and better looking) numbers than https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1534817880000&f=gpu%3Anone&f=os%3AUbuntu-14.04&f=cpu%3Ax86-64&f=pool%3AChrome&f=state%3APENDING&l=50&n=true&s=created_ts%3Adesc&st=1534731480000
,
Aug 21
No longer an emergency, but we need a long-term fix for this, so demoting to P1.
,
Sep 5
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb

commit b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 04 19:02:12 2018

[swarming] fix inline expiration behavior during bot task polling

- Correctly look at the negative cache for inline expiration:
  - Do it before fetching the TaskToRun entity, which is a DB GET.
  - Check at the return value of the negative cache memcache.add(), it if failed, it means that expiration should be skipped when doing it inline as another poll handler is likely already expiring it.
- Limit inline expiration to 5 tasks per polling. This is to make sure the poll request doesn't go over the 60 seconds limit.

Bug: 876143
Change-Id: Id7e5a5efe0deeefdebfe0fba676381a1182a5944
Reviewed-on: https://chromium-review.googlesource.com/1197124
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>

[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_scheduler.py
[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_scheduler_test.py
[modify] https://crrev.com/b2a824348c35774756ea5b5c9c4f1a5d82e9d7cb/appengine/swarming/server/task_to_run.py
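
For readers who don't want to open the diff, here is a rough sketch of the idea in that commit message. It is not the actual task_scheduler.py code. `FakeMemcache`, `poll`, `MAX_INLINE_EXPIRATIONS`, the `expire:` key prefix, and the `(task_id, expiration_ts)` candidate tuples are all made up for illustration. The only behavior borrowed from the real memcache API is that `add()` returns a false value when the key already exists. The sketch assumes a task's expiration time is known before the entity is fetched.

```python
MAX_INLINE_EXPIRATIONS = 5  # cap from the commit message; the name is hypothetical


class FakeMemcache(object):
    """In-process stand-in for memcache (hypothetical, no TTL handling)."""

    def __init__(self):
        self._keys = set()

    def add(self, key):
        # Like memcache.add(): succeeds only if the key is not already set.
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def poll(candidates, now, cache, expire):
    """Returns (first runnable task id or None, ids expired inline).

    candidates: (task_id, expiration_ts) tuples in queue order.
    expire: callback standing in for the costly part (DB GET + write).
    """
    expired = []
    for task_id, expiration_ts in candidates:
        if expiration_ts > now:
            return task_id, expired  # still runnable, hand it to the bot
        if len(expired) >= MAX_INLINE_EXPIRATIONS:
            continue  # keep the poll request short, leave it for later
        # Check the negative cache BEFORE doing any datastore work.
        if not cache.add('expire:' + task_id):
            continue  # another poll handler is likely already expiring it
        expire(task_id)
        expired.append(task_id)
    return None, expired
```

With a shared cache, two polls over the same backlog of eight expired tasks would split the work (five, then three) instead of both trying to expire the same entries, which is the collision described in the issue description.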
Comment 1 by iannucci@chromium.org
, Aug 21