New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 680808 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature



Sign in to add a comment

CQ submits CLs out of order of their readiness for landing

Project Member Reported by vadimsh@chromium.org, Jan 13 2017

Issue description

Example: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2629213002/1

The CL was stuck in the state "Patch is ready to be committed" for 3 hours.

Re-enqueuing same CL didn't help: second attempt is stuck in the same state.

tandrii@ is looking at it now.
 
Labels: Pri-1
Many CLs still landing though, and the delay between "Patch is ready to be committed" and actual commit seems to be around 15 min.

The tree has been closed for one hour (14:53 - 15:53), maybe it caused enough of a queueing to slowdown CQ?..
Status: Started (was: Assigned)
CQ has quite a few CLs in committing state, currently 29 (see attached).
Although 29 issues form a FIFO queue (not sure actually), every 30 minutes CQ restarts and re-discovers all issues in whatever order Rietveld search returns them. Combine this with the fact that 1 CL takes from 2 to 5 minutes to commit (fetch, clean checkout, apply, commit, publish intends for 2PC, push (and on failure, rebase, publish intent for 2PC, ...)}.
Hence, it seems 2629213002 was particularly unlucky one.

This should get better with Gerrit, but we might need to ensure better fairness for Rietveld issues as meantime solution.
current_queue.16.51.PST.txt
858 bytes View Download
Re #1: 
> Many CLs still landing though, and the delay between "Patch is ready to be committed" and actual commit seems to be around 15 min.

Spoke too soon. It is 15 min for CLs that "won the race" and landed relatively fast (https://codereview.chromium.org/2614833003). I now see CLs that where stuck for >1h (https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2623103002/1).

---

IIUC, all we can do now is just wait until it clears itself?
I disabled auto-restarts of CQ: $ touch auto_deploy.pid
This has to be undone this evening.
25 cls in queue.
FTR: committer thread started trying to process each issue in this order for today: https://storage.cloud.google.com/chrome-dumpfiles/x6cmbj3wvq 
actual success closing of CLs upon commits log: https://storage.cloud.google.com/chrome-dumpfiles/w5d6jtxyqa

Over last 20 minutes, there was a rate of 1 CL per 2 minutes. So, basically CQ is under big load today. 
Project Member

Comment 9 by bugdroid1@chromium.org, Jan 13 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/7d10f618a1c4bcaddc47385b7fc47a73b9579bd3

commit 7d10f618a1c4bcaddc47385b7fc47a73b9579bd3
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Fri Jan 13 01:16:42 2017

Summary: Chromium Rietveld CQ submits CLs out of order of their readiness for landing (was: Chromium CQ doesn't submit CLs even though they are ready)
The long queue has been processed. This bug will be open to fix unfairness.
Labels: -Pri-1 Pri-2
Also, another reason for slowness as I've just seen in the logs:
after CQ restart, it took from 6 to 8 minutes for first CL to pass through all verifiers and be put into read queue. This means that committer thread was idling for ~20..25% of time.
Cc: phajdan.jr@chromium.org tandrii@chromium.org serg...@chromium.org
Labels: Type-Bug
Owner: ----
Status: Available (was: Started)
Ah, and this also means that it's not at as easy to make this 100% fair in the current system as the first issue to pass through all verifiers isn't necessarily the oldest non-committed issue that has passed the verifiers.

Even so, it seems a good idea to me to order the queue of the committer thread by **the attempt start time**. Attempt start time isn't the same as the time at which issue has passed all verifiers for the first time, but it's easiest to do and would have helped avoid the reason this bug was filed.

Ultimately, this should be resolved by Gerrit CQ, at least until chromium gets >1 CL per minute or >150 CLs processed by CQ at any point of time.

+sergiyb@ phajdan@ WDYT?
Other than restarts, CQ clears and re-populates the ready queue regularly in scan_Results function. This also adds a chance that committer thread will check the queue between clearing and repopulating, discover that nothing is in the queue and idle for 5 seconds before checking again. Why do we do that and not let committer thread remove completed items from HEAD on its own? Then there will be no need to re-populate the queue and thus committer thread will process items in the same order they were added.

O'course there are restarts and they will clear the queue as well. In this case I think approach suggested by tAndrii to sort by attempt_start_msec is quite reasonable. It will process CLs that were in the queue longest.

Also how would Gerrit CQ help with this? It would still be slow, wouldn't it? Or does it process landing asynchronously?
s/HEAD/queue head/... typing memory
Labels: -Infra-Troopers
Components: -Infra>CQ Infra>Platform>CQdaemon

Comment 18 by efoo@chromium.org, Aug 31 2017

Components: Infra>Platform>CQ

Comment 19 by efoo@chromium.org, Aug 31 2017

Components: -Infra>Platform>CQdaemon
Labels: -Type-Bug Type-Feature
Summary: CQ submits CLs out of order of their readiness for landing (was: Chromium Rietveld CQ submits CLs out of order of their readiness for landing)
Update: the challenge here is ensuring progress if one ready CL can't be landed for external to CQ reasons (ie Gerrit). Today, CQ intentionally randomizes the order, ensuring progress.

Alternatively, CQ can abort an attempt of landing a CL if Gerrit can't land it in reasonable time & number of attempts.

Sign in to add a comment