CQ triggering too many jobs
Issue description

Today we experienced large queues caused by the CQ sending far too many tryjobs. There seem to be 2 extra duplicates for each CL, tripling the load!

Example CLs where this happened:
https://codereview.webrtc.org/1715423002/
https://codereview.webrtc.org/1741723002/

A CL from 13 hours ago went in just fine (https://codereview.webrtc.org/1744083002), so this must have been caused by a recent change (and/or infra instability).
,
Feb 29 2016
,
Feb 29 2016
I did add a new trybot to the config 5 hours ago (https://codereview.webrtc.org/1744933002/), but I don't see how that could have affected this... Other than that, we haven't made any CQ-related changes. I did restart the tryserver this morning when adding that bot, but it was completely idle at that point.
,
Feb 29 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/90bcde082a102114f0d031dff1aebd7d31d74ea1

commit 90bcde082a102114f0d031dff1aebd7d31d74ea1
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 11:04:31 2016
,
Feb 29 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/d2205e3330b6f052d251f6788997e4e2d9b717a6

commit d2205e3330b6f052d251f6788997e4e2d9b717a6
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 11:47:21 2016
,
Feb 29 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/cca7c275f4389b395105dc9e48ff75550e068bac

commit cca7c275f4389b395105dc9e48ff75550e068bac
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 12:07:13 2016
,
Feb 29 2016
The following revision refers to this bug:
https://chromium.googlesource.com/external/webrtc.git/+/dffb894a4c6e371666fad45d8bc5c0b8f9441178

commit dffb894a4c6e371666fad45d8bc5c0b8f9441178
Author: Henrik Kjellander <kjellander@webrtc.org>
Date: Mon Feb 29 12:12:08 2016

Enable CQ

This reverts commit 7352804849012c1b6f47d4dbba87a75d9978a1f1 committed in https://codereview.webrtc.org/1749673002/

The CQ is now supposed to be functional again.

BUG= chromium:590671
TBR=tandrii@chromium.org

Review URL: https://codereview.webrtc.org/1744173002 .

Cr-Commit-Position: refs/heads/master@{#11812}

[modify] https://crrev.com/dffb894a4c6e371666fad45d8bc5c0b8f9441178/infra/config/cq.cfg
,
Feb 29 2016
tl;dr: The fix has just landed.

The root cause was a flake in AppEngine/CloudEndpoints while the CQ was scheduling tryjobs: the request actually succeeded on the buildbucket side but looked like a failure from the CQ's point of view, so the CQ retried immediately. It's not easy to tell from the logs how often this happened (though it is possible to find out [1]), but we got alerts 3 times, which means the same PUT command was tried 3 times on at least 3 occasions.

The solution is https://chromereviews.googleplex.com/366097013/, which adds a unique client_operation_id to each tryjob request so that buildbucket can disregard retries of the same tryjob.

[1] By counting how many PUT requests with the same URL happen sequentially with no other CQ messages between them.
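For anyone reading this later, here is a minimal sketch of why a per-request operation ID stops the duplicates. It is not the actual CQ or buildbucket code: `FakeBuildbucket`, `schedule_with_retries`, and the `flaky_responses` knob are hypothetical names made up for illustration, and only the `client_operation_id` field name comes from this thread. The fake service schedules the build and then loses the response, which is roughly the failure described above.

```python
import uuid


class FakeBuildbucket:
    """Hypothetical in-memory stand-in for a build scheduling service."""

    def __init__(self, flaky_responses=0):
        self.builds = []               # every build actually scheduled
        self._by_operation_id = {}     # client_operation_id -> build
        self._flaky_responses = flaky_responses

    def put(self, builder, client_operation_id=None):
        # Deduplicate: a retry carrying a known ID reuses the existing build.
        build = self._by_operation_id.get(client_operation_id)
        if build is None:
            build = {"id": len(self.builds) + 1, "builder": builder}
            self.builds.append(build)
            if client_operation_id is not None:
                self._by_operation_id[client_operation_id] = build
        # Simulate the flake: the build exists, but the caller sees an error.
        if self._flaky_responses > 0:
            self._flaky_responses -= 1
            raise ConnectionError("response lost after the build was scheduled")
        return build


def schedule_with_retries(service, builder, use_operation_id, attempts=3):
    # One ID per logical tryjob request, reused across every retry of it.
    operation_id = uuid.uuid4().hex if use_operation_id else None
    for _ in range(attempts):
        try:
            return service.put(builder, client_operation_id=operation_id)
        except ConnectionError:
            continue
    raise RuntimeError("all attempts failed")
```

With two lost responses in a row, retrying without an ID leaves three builds scheduled for one request, while retrying with the same ID leaves one. The important detail is that the ID is generated once per logical request and not once per attempt.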
,
Feb 29 2016
+nodir@ thanks for letting me know about client_operation_id in a code review last week :)
,
Apr 26 2016
Comment 1 by bugdroid1@chromium.org
, Feb 29 2016