CQ submits CLs out of order of their readiness for landing |
||||||||||
Issue descriptionExample: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2629213002/1 The CL was stuck in the state "Patch is ready to be committed" for 3 hours. Re-enqueuing same CL didn't help: second attempt is stuck in the same state. tandrii@ is looking at it now.
,
Jan 13 2017
CQ has quite a few CLs in committing state, currently 29 (see attached). Although 29 issues form a FIFO queue (not sure actually), every 30 minutes CQ restarts and re-discovers all issues in whatever order Rietveld search returns them. Combine this with the fact that 1 CL takes from 2 to 5 minutes to commit (fetch, clean checkout, apply, commit, publish intends for 2PC, push (and on failure, rebase, publish intent for 2PC, ...)}. Hence, it seems 2629213002 was particularly unlucky one. This should get better with Gerrit, but we might need to ensure better fairness for Rietveld issues as meantime solution.
,
Jan 13 2017
Re #1: > Many CLs still landing though, and the delay between "Patch is ready to be committed" and actual commit seems to be around 15 min. Spoke too soon. It is 15 min for CLs that "won the race" and landed relatively fast (https://codereview.chromium.org/2614833003). I now see CLs that where stuck for >1h (https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2623103002/1). --- IIUC, all we can do now is just wait until it clears itself?
,
Jan 13 2017
I disabled auto-restarts of CQ: $ touch auto_deploy.pid This has to be undone this evening.
,
Jan 13 2017
FTR, I've sent PSA to chromium devs: https://groups.google.com/a/chromium.org/d/msg/chromium-dev/wuhBT1p8i6Y/J71HfH5GAQAJ
,
Jan 13 2017
25 cls in queue.
,
Jan 13 2017
FTR: committer thread started trying to process each issue in this order for today: https://storage.cloud.google.com/chrome-dumpfiles/x6cmbj3wvq
,
Jan 13 2017
actual success closing of CLs upon commits log: https://storage.cloud.google.com/chrome-dumpfiles/w5d6jtxyqa Over last 20 minutes, there was a rate of 1 CL per 2 minutes. So, basically CQ is under big load today.
,
Jan 13 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/7d10f618a1c4bcaddc47385b7fc47a73b9579bd3 commit 7d10f618a1c4bcaddc47385b7fc47a73b9579bd3 Author: Andrii Shyshkalov <tandrii@chromium.org> Date: Fri Jan 13 01:16:42 2017
,
Jan 13 2017
The long queue has been processed. This bug will be open to fix unfairness.
,
Jan 13 2017
,
Jan 13 2017
Also, another reason for slowness as I've just seen in the logs: after CQ restart, it took from 6 to 8 minutes for first CL to pass through all verifiers and be put into read queue. This means that committer thread was idling for ~20..25% of time.
,
Jan 13 2017
Ah, and this also means that it's not at as easy to make this 100% fair in the current system as the first issue to pass through all verifiers isn't necessarily the oldest non-committed issue that has passed the verifiers. Even so, it seems a good idea to me to order the queue of the committer thread by **the attempt start time**. Attempt start time isn't the same as the time at which issue has passed all verifiers for the first time, but it's easiest to do and would have helped avoid the reason this bug was filed. Ultimately, this should be resolved by Gerrit CQ, at least until chromium gets >1 CL per minute or >150 CLs processed by CQ at any point of time. +sergiyb@ phajdan@ WDYT?
,
Jan 13 2017
Other than restarts, CQ clears and re-populates the ready queue regularly in scan_Results function. This also adds a chance that committer thread will check the queue between clearing and repopulating, discover that nothing is in the queue and idle for 5 seconds before checking again. Why do we do that and not let committer thread remove completed items from HEAD on its own? Then there will be no need to re-populate the queue and thus committer thread will process items in the same order they were added. O'course there are restarts and they will clear the queue as well. In this case I think approach suggested by tAndrii to sort by attempt_start_msec is quite reasonable. It will process CLs that were in the queue longest. Also how would Gerrit CQ help with this? It would still be slow, wouldn't it? Or does it process landing asynchronously?
,
Jan 13 2017
s/HEAD/queue head/... typing memory
,
Jan 13 2017
,
Jan 23 2017
,
Aug 31 2017
,
Aug 31 2017
,
Jul 2
Update: the challenge here is ensuring progress if one ready CL can't be landed for external to CQ reasons (ie Gerrit). Today, CQ intentionally randomizes the order, ensuring progress. Alternatively, CQ can abort an attempt of landing a CL if Gerrit can't land it in reasonable time & number of attempts. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by vadimsh@chromium.org
, Jan 13 2017