
Issue 590671

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




CQ triggering too many jobs

Project Member Reported by kjellander@chromium.org, Feb 29 2016

Issue description

Today we experienced long tryjob queues caused by the CQ triggering far too many jobs. There appear to be 2 extra duplicates of the jobs for each CL, tripling the load!

Example CLs where this happened:
https://codereview.webrtc.org/1715423002/
https://codereview.webrtc.org/1741723002/

A CL from 13 hours ago went in just fine (https://codereview.webrtc.org/1744083002) so this must have been caused by a recent change (and/or infra instability).

 

Comment 1 by bugdroid1@chromium.org, Feb 29 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/webrtc.git/+/7352804849012c1b6f47d4dbba87a75d9978a1f1

commit 7352804849012c1b6f47d4dbba87a75d9978a1f1
Author: kjellander@webrtc.org <kjellander@webrtc.org>
Date: Mon Feb 29 09:33:50 2016

Disable CQ since being flooded with jobs

BUG= chromium:590671 
TBR=tandrii@chromium.org

Review URL: https://codereview.webrtc.org/1749673002 .

Cr-Commit-Position: refs/heads/master@{#11807}

[modify] https://crrev.com/7352804849012c1b6f47d4dbba87a75d9978a1f1/infra/config/cq.cfg

Comment 2 by pbos@chromium.org, Feb 29 2016

Cc: pbos@chromium.org
I did add a new trybot to the config 5 hours ago (https://codereview.webrtc.org/1744933002/) but I don't see how that could have affected this...

Other than that, we haven't made any CQ-related changes. I did restart the tryserver this morning when adding that bot, but it was completely idle at that point.

Comment 4 by bugdroid1@chromium.org, Feb 29 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/90bcde082a102114f0d031dff1aebd7d31d74ea1

commit 90bcde082a102114f0d031dff1aebd7d31d74ea1
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 11:04:31 2016


Comment 5 by bugdroid1@chromium.org, Feb 29 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/d2205e3330b6f052d251f6788997e4e2d9b717a6

commit d2205e3330b6f052d251f6788997e4e2d9b717a6
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 11:47:21 2016


Comment 6 by bugdroid1@chromium.org, Feb 29 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/cca7c275f4389b395105dc9e48ff75550e068bac

commit cca7c275f4389b395105dc9e48ff75550e068bac
Author: tandrii <tandrii@google.com>
Date: Mon Feb 29 12:07:13 2016


Comment 7 by bugdroid1@chromium.org, Feb 29 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/external/webrtc.git/+/dffb894a4c6e371666fad45d8bc5c0b8f9441178

commit dffb894a4c6e371666fad45d8bc5c0b8f9441178
Author: Henrik Kjellander <kjellander@webrtc.org>
Date: Mon Feb 29 12:12:08 2016

Enable CQ

This reverts commit 7352804849012c1b6f47d4dbba87a75d9978a1f1 committed
in https://codereview.webrtc.org/1749673002/

The CQ is now supposed to be functional again.

BUG= chromium:590671 
TBR=tandrii@chromium.org

Review URL: https://codereview.webrtc.org/1744173002 .

Cr-Commit-Position: refs/heads/master@{#11812}

[modify] https://crrev.com/dffb894a4c6e371666fad45d8bc5c0b8f9441178/infra/config/cq.cfg

Owner: tandrii@chromium.org
Status: Fixed (was: Untriaged)
tl;dr The fix has just landed.

The root cause was flakiness in AppEngine/CloudEndpoints while CQ was scheduling the tryjobs: the request actually succeeded on the buildbucket side but appeared to fail from CQ's point of view, so CQ retried immediately. It is not easy to tell from the logs how often this happened (though it is possible to find out [1]), but we got alerts 3 times, which means that on at least 3 occasions the same PUT command was tried 3 times.

The fix is https://chromereviews.googleplex.com/366097013/, which adds a unique client_operation_id to each tryjob request so that buildbucket can disregard retries of the same tryjob.
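
To illustrate why a per-request operation id makes retries safe, here is a minimal, self-contained sketch. It is not the actual CQ or buildbucket code: FakeBuildService, its put() method, and schedule_with_retry() are hypothetical names invented for this example, and the "flake" is simulated by raising an error after the build has already been recorded.

```python
import uuid


class FakeBuildService:
    """Hypothetical stand-in for a build scheduler (not the real buildbucket API)."""

    def __init__(self, flaky_responses=0):
        self.builds = []    # every build that was actually scheduled
        self._seen = {}     # client_operation_id -> build id
        self._flaky_responses = flaky_responses

    def put(self, builder, client_operation_id=None):
        # Deduplicate: a repeated operation id maps to the existing build.
        if client_operation_id is not None and client_operation_id in self._seen:
            build_id = self._seen[client_operation_id]
        else:
            build_id = len(self.builds) + 1
            self.builds.append((build_id, builder))
            if client_operation_id is not None:
                self._seen[client_operation_id] = build_id
        # Simulated flake: the build is recorded, but the response is lost.
        if self._flaky_responses > 0:
            self._flaky_responses -= 1
            raise ConnectionError("response lost after the request succeeded")
        return build_id


def schedule_with_retry(service, builder, use_operation_id, max_attempts=3):
    # One id per logical tryjob, reused across all retries of that tryjob.
    op_id = uuid.uuid4().hex if use_operation_id else None
    for _ in range(max_attempts):
        try:
            return service.put(builder, client_operation_id=op_id)
        except ConnectionError:
            continue
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

With two lost responses, the variant without an operation id leaves three builds on the service for a single tryjob, which matches the tripled load described in this issue; the variant with an operation id leaves one.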

[1] By counting how many PUT requests with the same URL occur consecutively, with no other CQ messages between them.
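
The counting described in [1] could be sketched roughly as follows. The log format is an assumption made for illustration only (each PUT line is taken to look like "PUT <url>", and any other line counts as another CQ message); the real CQ log format is not shown in this issue, and count_repeated_puts is a hypothetical helper.

```python
def count_repeated_puts(log_lines):
    """Return {url: longest run} for PUT URLs seen at least twice in a row."""
    runs = {}
    prev_url = None
    run_length = 0
    for line in log_lines:
        parts = line.split()
        # Assumed format: "PUT <url>"; anything else breaks the current run.
        url = parts[1] if len(parts) >= 2 and parts[0] == "PUT" else None
        if url is not None and url == prev_url:
            run_length += 1
        else:
            run_length = 1 if url is not None else 0
        prev_url = url
        if run_length >= 2:
            runs[url] = max(runs.get(url, 0), run_length)
    return runs


# Made-up sample lines, not taken from real CQ logs.
sample = [
    "PUT /builds/a",
    "PUT /builds/a",
    "PUT /builds/a",
    "INFO some other CQ message",
    "PUT /builds/b",
]
print(count_repeated_puts(sample))  # {'/builds/a': 3}
```

A run of length 3 for a URL corresponds to one original request plus two immediate retries.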
Cc: no...@chromium.org
+nodir@ thanks for letting me know about client_operation_id in a code review last week :)
Components: Infra>CQ
Labels: -Infra-CommitQueue
