New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 704151 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

CL landed by CQ despite messages on Rietveld say the opposite

Project Member Reported by primiano@chromium.org, Mar 22 2017

Issue description

CL: https://codereview.chromium.org/2743663005/

Both the ios_simulator and the presubmit were red.
Specifically the pre-submit failed in the commit-git-patch step [1]

saying:
---
HEAD detached at origin/master
nothing to commit, working tree clean
step returned non-zero exit code: 1
---

The audit trail in the CL says first "Exceeded global retry quota" in #30
then say
"Try jobs failed on following builders:
  chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED,
http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)"
in #31

the CL is still open, and there is no sign of that being commited.

However, looks like the CL has been indeed committed:
https://chromium.googlesource.com/chromium/src/+/e85f2a786db355e428813ba82a0533c3943ee552

And according to git, the committer of that was "Commit bot <commit-bot@chromium.org>"


I have no idea what's going on, but I think two think went wrong here:
1. A CL that failed presubmit and another CQ bot did get landed
2. After landing the CL was not updated nor closed.


[1] https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Fchromium_presubmit%2F391498%2F%2B%2Frecipes%2Fsteps%2Fcommit-git-patch%2F0%2Fstdout
 
Components: -Infra Infra>CQ
Owner: serg...@chromium.org
Status: Assigned (was: Untriaged)
Sergiy, PTAL. Maybe our defense of double commits is broken?
Trying to reconstruct what happened on issue 2743663005 patchset 120001 from the logs (times are in UTC on 
2017-03-22):

01:23:45 Dry run is triggered by ssid
02:49:55 linux_chromium_rel_ng failed, CQ triggers a retry
03:25:46 ios-simulator and ios-simulator-xcode-clang time out, dry run attempt fails

03:36:31 Prod run is triggered by ssid
04:38:06 CQ triggers builds on 4 builders (linux_chromium_rel_ng, chromium_presubmit, ios-simulator and ios-simulator-xcode-clang) and reuses successful results for other builders from a previous run
05:27:13 All tryjobs pass, CQ is ready to commit the change
05:30:29 CQ starts to execute git-push command.
05:31:54 CQ is stopped to be restarted, git-push is terminated half-way, but CL lands into the repo, CQ fails to update Rietveld.
05:31:?? Tree is closed by ddoman
05:32:12 CQ is started again.
05:32:30 CL is discovered by CQ after restart
05:33:31 CQ discovers that all tryjobs succeeded, but does not attempt to land since tree is closed

Unfortunately, we check whether a CL has been committed already and close the tree after the tree check. Therefore issue on Rietveld remained opened for a long period of time after the CL has actually landed into the repo.

06:01:33 CQ is stopped to be restarted again
06:01:53 CL is discovered by CQ after restart
06:02:53 CQ discovers that all tryjobs succeeded, but try job verifier fails due to global quota being exceeded

After looking deeper at the logs, I saw that ios-simulator and ios-simulator-xcode-clang builders had quota set to 2 in a recent verifier_jobs_update. Strangely, the previous verifier_jobs_update had quota set to 0 for both of these bots and no bots were triggered by CQ in-between. I've started to investigate how the quota is computed and found out that buildbucket_util.py returns status_changed_ts field as build['timestamp'], which is later compared to attempt_start_ts to detect if a build is part of the current attempt or not. The build that has timed out in the first attempt has failed in CQ, but was actually still being processed by Buildbot. Eventually, it has succeeded and state_changed_ts has been updated when the state changed to COMPLETED.

06:03:06 CQ stops processing the attempt

I see the following fixes from this:
 - Consider checking if we've already committed a CL before doing tree checks. This may be non-trivial because tree is checked on main thread, while committing happens on committer thread.
 - Use some other field from BuildBucket such that we can identify builds belonging to current attempt more reliably.
CL: https://chrome-internal-review.googlesource.com/c/340864/.

Checking double-commits before tree checks is very difficult to implement in current CQ architecture, so I'm going to leave this out of this bug.
Status: Started (was: Assigned)
Project Member

Comment 7 by bugdroid1@chromium.org, Apr 7 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/e3417ebe8b2a12bb886fdb97c10c64b18a35fb1f

commit e3417ebe8b2a12bb886fdb97c10c64b18a35fb1f
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Fri Apr 07 11:32:09 2017

Status: Fixed (was: Started)
Project Member

Comment 9 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/14eb07d91bb1262709f4b898552d37bd9bd4a794

commit 14eb07d91bb1262709f4b898552d37bd9bd4a794
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Tue Apr 11 15:32:57 2017

Status: Started (was: Fixed)
CL broke chromium-cq-status app. Fix is here: https://chromium-review.googlesource.com/c/474906/3. Once fix lands and deployed, I'll reland the CL fixing quota computation.
Project Member

Comment 11 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/2da05a3515e56d56667f67a558e0fcdf4b1398e7

commit 2da05a3515e56d56667f67a558e0fcdf4b1398e7
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Tue Apr 11 16:43:03 2017

Status: Fixed (was: Started)

Sign in to add a comment