New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 663365 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 662140
Owner: ----
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

CQ commits the same CL twice

Project Member Reported by xlai@chromium.org, Nov 8 2016

Issue description

On Tue, 08 Nov 09:41, tree is closed due to a compilation error:

http://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac%20Builder/builds/184044

The strange thing about this compilation error is that, the CL that
causes this problem is the same CL that's been landed twice. Check up
https://codereview.chromium.org/2485013002, patchset is landed at

https://crrev.com/b2fd274ec5b20ace720420106a13b8c90b61bed1
Cr-Commit-Position: refs/heads/master@{#430607}

and

https://crrev.com/93ed99a0238f25fe391874d3f6dfa02c01acb781
Cr-Commit-Position: refs/heads/master@{#430608}
.

Even more strangely, the second land only lands a partial code of the
first land.

We have to manually revert the second landed commit, because using
CodeReview tool to do the revert will wrongly reverts the first landed
commit, which is not the culprit.

Please investigate. This is quite a serious problem. 
 
Components: -Infra
Labels: Infra-Troopers
Be sure to include the trooper label when filing P0s against infra. (http://g.co/bugatrooper)
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
Looking. While indeed serious, my first hunch is that it's likely a low probability error, will evaluate if it needs to stay trooper-P0 (but should probably stay CQ P0).
Stats: the two commits happened within 1 second of each other (time in UTC):

b2fd274ec5b20ace720420106a13b8c90b61bed1 (original, good):  Tue Nov 08 14:35:53 2016
93ed99a0238f25fe391874d3f6dfa02c01acb781 (additional, bad): Tue Nov 08 14:35:54 2016

From CQ logs:

2016-11-08 14:32:18 UTC	[pid:24867 tid:139871004825344 pending_manager.base:320] Committer thread is processing 2485013002.	 
2016-11-08 14:31:05 UTC	[pid:13058 tid:140511909295872 pending_manager.base:320] Committer thread is processing 2485013002.

Apparently, the same patch got dispatched for commit twice. I have a theory as to why, looking to confirm it (will post later).
Cc: sergeybe...@chromium.org andyb...@chromium.org phajdan.jr@chromium.org serg...@chromium.org tandrii@chromium.org
Labels: -Infra-Troopers -Pri-0 -Build-CommitQueue Pri-1
Owner: ----
Status: Untriaged (was: Assigned)
CQ restarted at 14:31:43 - right in between the two commits: http://chromium-cq-status.appspot.com/recent#action=cq_start,project=chromium

So, my theory is confirmed:

- CQ relies on Rietveld's Datastore (eventual) consistency to know if the issue has been committed.
- CQ keeps internal cache in memory for committed issues to avoid duplicate commits, but this cache is flushed and repopulated from Rietveld on restart.
- Datastore's eventual consistency may lag for minutes, and occasionally hours. SLO: 24h.

What actually happened:

- So, CQ committed the patch, then immediately restarted.
- Rietveld's Datastore didn't catch up, so CQ loaded the issue as not-yet-closed, but ready to commit (since all try jobs passed).
- CQ committed the patchset, which happen to apply nontrivially (usually it fails to apply twice, or resolve to no changes).
- Result: erroneous extra commit.

This was a known possible scenario at the time the feature was implemented (~2 years ago), and this is the first known incident when it caused a problem. I'd say, this is a very rare event, and is not likely to repeat very often. 

I'm downgrading to P1 and sending to the CQ team for triage. I believe that the upcoming transition to Gerrit will solve this problem. 

In the meantime, if AppEngine continues to give us trouble with degraded eventual consistency, we may do something about it. In particular, running CQ longer without interruption will definitely help mitigate this issue, so consider running it under the service_manager.
Mergedinto: 662140
Status: Duplicate (was: Untriaged)
This is very known issue. Your analysis is almost right: datastore isn't to blame. CQ was restarted as it was waiting on the git push to finish, and hence Rietveld issue was never closed.

That said, this is happening with elevated frequencies lately - see issue 432905 and  issue 662140 . Maybe we should fix this, or add trooper book manual item?

Gerrit indeed solves this by ensuring CL isn't committed twice.
Also, the stats of commits in gnumbd-enabled repo are meaningful: timestamp there correcpond to local gnumbd machine clock && it ensures they are monotonically increasing. One has to look at refs/pending/heads/master equivalents - timestamps there are actually those from cq machine.

Sign in to add a comment