CL landed by CQ despite messages on Rietveld say the opposite |
||||||
Issue descriptionCL: https://codereview.chromium.org/2743663005/ Both the ios_simulator and the presubmit were red. Specifically the pre-submit failed in the commit-git-patch step [1] saying: --- HEAD detached at origin/master nothing to commit, working tree clean step returned non-zero exit code: 1 --- The audit trail in the CL says first "Exceeded global retry quota" in #30 then say "Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)" in #31 the CL is still open, and there is no sign of that being commited. However, looks like the CL has been indeed committed: https://chromium.googlesource.com/chromium/src/+/e85f2a786db355e428813ba82a0533c3943ee552 And according to git, the committer of that was "Commit bot <commit-bot@chromium.org>" I have no idea what's going on, but I think two think went wrong here: 1. A CL that failed presubmit and another CQ bot did get landed 2. After landing the CL was not updated nor closed. [1] https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Fchromium_presubmit%2F391498%2F%2B%2Frecipes%2Fsteps%2Fcommit-git-patch%2F0%2Fstdout
,
Mar 23 2017
,
Mar 23 2017
Sergiy, PTAL. Maybe our defense of double commits is broken?
,
Mar 24 2017
Trying to reconstruct what happened on issue 2743663005 patchset 120001 from the logs (times are in UTC on 2017-03-22): 01:23:45 Dry run is triggered by ssid 02:49:55 linux_chromium_rel_ng failed, CQ triggers a retry 03:25:46 ios-simulator and ios-simulator-xcode-clang time out, dry run attempt fails 03:36:31 Prod run is triggered by ssid 04:38:06 CQ triggers builds on 4 builders (linux_chromium_rel_ng, chromium_presubmit, ios-simulator and ios-simulator-xcode-clang) and reuses successful results for other builders from a previous run 05:27:13 All tryjobs pass, CQ is ready to commit the change 05:30:29 CQ starts to execute git-push command. 05:31:54 CQ is stopped to be restarted, git-push is terminated half-way, but CL lands into the repo, CQ fails to update Rietveld. 05:31:?? Tree is closed by ddoman 05:32:12 CQ is started again. 05:32:30 CL is discovered by CQ after restart 05:33:31 CQ discovers that all tryjobs succeeded, but does not attempt to land since tree is closed Unfortunately, we check whether a CL has been committed already and close the tree after the tree check. Therefore issue on Rietveld remained opened for a long period of time after the CL has actually landed into the repo. 06:01:33 CQ is stopped to be restarted again 06:01:53 CL is discovered by CQ after restart 06:02:53 CQ discovers that all tryjobs succeeded, but try job verifier fails due to global quota being exceeded After looking deeper at the logs, I saw that ios-simulator and ios-simulator-xcode-clang builders had quota set to 2 in a recent verifier_jobs_update. Strangely, the previous verifier_jobs_update had quota set to 0 for both of these bots and no bots were triggered by CQ in-between. I've started to investigate how the quota is computed and found out that buildbucket_util.py returns status_changed_ts field as build['timestamp'], which is later compared to attempt_start_ts to detect if a build is part of the current attempt or not. The build that has timed out in the first attempt has failed in CQ, but was actually still being processed by Buildbot. Eventually, it has succeeded and state_changed_ts has been updated when the state changed to COMPLETED. 06:03:06 CQ stops processing the attempt I see the following fixes from this: - Consider checking if we've already committed a CL before doing tree checks. This may be non-trivial because tree is checked on main thread, while committing happens on committer thread. - Use some other field from BuildBucket such that we can identify builds belonging to current attempt more reliably.
,
Mar 29 2017
CL: https://chrome-internal-review.googlesource.com/c/340864/. Checking double-commits before tree checks is very difficult to implement in current CQ architecture, so I'm going to leave this out of this bug.
,
Mar 29 2017
,
Apr 7 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/e3417ebe8b2a12bb886fdb97c10c64b18a35fb1f commit e3417ebe8b2a12bb886fdb97c10c64b18a35fb1f Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Fri Apr 07 11:32:09 2017
,
Apr 7 2017
,
Apr 11 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/14eb07d91bb1262709f4b898552d37bd9bd4a794 commit 14eb07d91bb1262709f4b898552d37bd9bd4a794 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Tue Apr 11 15:32:57 2017
,
Apr 11 2017
CL broke chromium-cq-status app. Fix is here: https://chromium-review.googlesource.com/c/474906/3. Once fix lands and deployed, I'll reland the CL fixing quota computation.
,
Apr 11 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/2da05a3515e56d56667f67a558e0fcdf4b1398e7 commit 2da05a3515e56d56667f67a558e0fcdf4b1398e7 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Tue Apr 11 16:43:03 2017
,
Apr 11 2017
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by bpastene@chromium.org
, Mar 22 2017