CQ committed patch twice |
||||||||||||||||||||
Issue descriptionIn https://codereview.chromium.org/2468333002/, the CQ applied a patch twice, resulting in a broken build. It was landed as both 67b831b414b1a10a87d36ae79d6283e4c4f60ac8 and 8097dff55c84516414ce944ee6bc5435807f4fbc. Because that file is spaced out and regularly formatted, the same patch could be applied twice I guess.
,
Nov 3 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/92c7339480267e5ab6e51c5445d7f9802405ce72 commit 92c7339480267e5ab6e51c5445d7f9802405ce72 Author: Scott Graham <scottmg@chromium.org> Date: Thu Nov 03 19:44:35 2016 Fix presubmit after CQ commit error TBR=jam@chromium.org,ynovikov@chromium.org BUG= 662140 Review URL: https://codereview.chromium.org/2477803003 . Cr-Commit-Position: refs/heads/master@{#429676} [modify] https://crrev.com/92c7339480267e5ab6e51c5445d7f9802405ce72/testing/buildbot/chromium.linux.json
,
Nov 3 2016
CQ team, can you figure out why this happened? Probably not a good thing....
,
Nov 3 2016
I had a similar problem: https://codereview.chromium.org/2468723003/ It looks like the patch was committed, then another message was posted: Failed to apply the patch. On branch working_branch Your branch is up-to-date with 'origin/refs/pending/heads/master'. nothing to commit, working tree clean which looks like it was trying to double-commit to me.
,
Nov 4 2016
I also got comment 4 https://codereview.chromium.org/2460063002#msg12 recently
,
Nov 4 2016
,
Nov 4 2016
The problem mentioned in this issue is related to the fact that CQ is committing and closing the issue on Rietveld in two separate non-atomic steps. Should CQ be restarted right after it has committed but before it has closed the issue, CQ will attempt to commit the patch again after restart. For most patches, this results in an error message since those patches do not apply anymore (e.g. see #4 on this issue, tracked in issue 432905). Some patches, however, can be applied multiple times and this is what happened here. This happens rarely enough that we do not wish to spend resources to fix it immediately and most of the time it does not cause major outages. Instead, the issue will be solved when we'll migrate to Gerrit (tandrii@ can give an ETA for this), which does committing for us in an atomic fashion. I'll keep this issue open to allow others dedupe against it, but reduce the priority.
,
Nov 8 2016
,
Nov 8 2016
Issue 663357 has been merged into this issue.
,
Nov 8 2016
,
Nov 8 2016
Several times a week doesn't seem that rare when each occurrence causes tree closures.
,
Nov 9 2016
If this is causing tree closures, then an interim solution should be determined until the Gerrit migration is complete.
,
Nov 10 2016
As discussed with +sergiyb@ yesterday,
Interim solution outline:
def apply_patch(CL):
git fetch
# NEW CHECK:
new_check_already_known(cl)
apply_issue.py
git push
def new_check_already_known(cl):
for commit in last_20_commits:
if time.time() - commit.timestamp < 5*minutes:
if cl.get_review_url() in commit.full_message:
raise ProbablyAlreadyCommittedSoAbortCQAttempt(commit)
Actual code will be more hairy :( but it seems aborting an attempt and letting developer decide what to do next is better than committing twice.
,
Nov 10 2016
failing to commit sounds acceptable to me (as a user of the CQ I'm already use to it failing to commit in many cases due to flaky tests, tree closures, merge conflicts, etc. --- this is just one more, hopefully rare, reason)
,
Nov 10 2016
+boliu just saw this on tree again
,
Nov 10 2016
this happened again and closed the tree https://codereview.chromium.org/2487713002/#msg55 https://codereview.chromium.org/2487713002/#msg57
,
Nov 10 2016
-> andybons@ for comment ... there's really no world in which committing the same change twice is a good thing to me, so this seems like it should really be a P0. Thoughts?
,
Nov 11 2016
,
Nov 11 2016
This is very bad. In particular, this last instance is horrid. The two versions of the commit aren't just "oh, the patch applied a second time". In the second one, only one of the files was patched again, thereby landing something completely different from anything that ever went up for review. This is definitely P0.
,
Nov 11 2016
Is this perhaps a failure in `git patch apply`?
,
Nov 11 2016
One idea is that patch application failed somehow, but CQ kept on truckin'. That doesn't explain why there would be a separate pass though.
,
Nov 11 2016
Yeah I don't buy the CQ restart argument. This is two separate malformed commits, not a repeat.
,
Nov 11 2016
Not quite -- only the second commit is malformed, and it does appear to be the result of trying to apply the same diff a second time First commit: https://chromium.googlesource.com/chromium/src/+/273cccbc60b358d06c332f64a8b3cc9f5ee42dd6%5E%21/#F2 Second commit: https://chromium.googlesource.com/chromium/src/+/8f60271614652b6110081128604470460af5dadb%5E%21/#F0 The second one is just a piece of the first, and that piece is a self-contained insertion that will apply as many times as you like. So this definitely is a double-patch-application issue. The two big problems are: * Why is it even trying to apply the patch twice (this might be the "cq restart" explanation) * Why isn't it failing instantly when only a portion of the patch applies correctly
,
Nov 11 2016
Looks like I got a similar error just now trying to revert a patch: https://codereview.chromium.org/2496633004/ (see comments #3 and following.
,
Nov 11 2016
I'm going to leave the tree open for the moment, but if I see another issue I'm going to close it until we have a better understanding of what's going on.
,
Nov 11 2016
,
Nov 11 2016
agable and I discussed this, and we're in agreement that cq restart is a likely culprit. As for why it's not failing instantly, we think a git merge strategy is trying to be clever, and ignoring hunks it thinks are already committed. We're looking into applying a more strict merge strategy.
,
Nov 11 2016
CQ is running the following: git apply --index -3 -p<patchlevel>
,
Nov 11 2016
Query for double commits: https://goto.google.com/ogban I see double commits several times a day on weekdays. This apparently has been going on for a long time. So, yes, this is pretty bad actually. I don't know how we didn't notice this before.
,
Nov 11 2016
To be clear, most of these commits likely didn't actually make it to the repo (but I didn't check that yet). I'm blaming CQ restarts for this, will verify with another query.
,
Nov 11 2016
We think that removing -3 (and possibly --index as well) will solve this problem. agable@ is looking into if CQ is in a reboot loop (sergeyberezin is as well).
,
Nov 11 2016
Note: agable and I are going to move forward with removing -3 from the command-line. According to https://git-scm.com/docs/git-apply we want --index kept.
,
Nov 11 2016
For what it is worth, this is the bug (and associate CLs) that added the three-way merge in the first place: https://bugs.chromium.org/p/chromium/issues/detail?id=416003
,
Nov 11 2016
,
Nov 11 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/af8d74cc5bec8c7c939818df4ada9edfc3a218ba commit af8d74cc5bec8c7c939818df4ada9edfc3a218ba Author: Aaron Gable <agable@chromium.org> Date: Fri Nov 11 01:08:22 2016
,
Nov 11 2016
OK, I think I got the query right this time - checking for more than one patch_committing events. Every single one of them is happening around the hour or xx:30 which is when CQ restarts, within 1-3 min of each other. It also shows that it's happening A LOT. Pretty much every half an hour at peak hours. I'm guessing most of these commits just don't go through, but I can't think of a way to verify that right now.
,
Nov 11 2016
Outline of an actual locking mechanism to prevent this from ever happening:
cq detects a cl on rietveld that looks like it needs to be landed
if there isn't a flag message:
download patch
locally commit patch
post flag message "about to push sha FOO with parent sha BAR"
push
mark issue as closed
else (i.e. there is a flag message):
fetch from gob
if BAR is not in the fetch:
something is very wrong
elif both BAR and FOO are in the fetch:
mark issue as closed
elif BAR is the most recent thing in the fetch:
keep fetching until we're confident this isn't a replication problem
try this if/else block again
if BAR has a child commit *other than FOO*:
act as though there wasn't a flag message at all and try everything again
,
Nov 11 2016
So, disabling a 3-way merge will likely decrease the incidence of these, but will not completely prevent them from happening. If a patch can cleanly apply any number of times, it'll still apply. Fundamentally, it's due to the commit + issue close being non-atomic, and CQ has no way of knowing it if it restarts.
,
Nov 11 2016
What CQ does now: check if issue is closed on Rietveld; if not: download patch commit close the issue on Rietveld How would posting a message be different from closing an issue? If the message lags and CQ restarts before it appears, it'll double-commit anyway. Or am I missing something?
,
Nov 11 2016
Yes, the 3-way disable will prevent the worst cases like the one in comment 23 of this bug. It will not prevent things like adding a new whitespace-buffered entry to the DEPS file. For that, we need something like the locking in comment 37. This is a two-phase commit problem which has been solved *hundreds of times*. We can solve it too.
,
Nov 11 2016
The reason my proposal in 37 is different from what it does now is that it records both the actual sha of the commit it wants to push, and the sha of parent of that commit (which we know must exist and be replicated already). This way it can guarantee that it doesn't try to double commit, by being able to verify the existence *or lack of existence* of the commit it thinks it may have committed before.
,
Nov 11 2016
Just verified that CQ tried to double-commit with the same frequency back in May (as far as our logs go). It's a wonder we're having trouble with it only now.
,
Nov 11 2016
But the lack of a message is the problem: the message may have already been posted, but if it's not yet indexed in Datastore, it will not show up, and thus CQ will commit again. Having said that, http://crbug.com/663365#c6 claims CQ didn't get a chance to close an issue, so maybe posting before committing will help.
,
Nov 11 2016
Ok, it is now past 17:30, and the CQs should have restarted and picked up my removal of 3-way merge.
,
Nov 11 2016
The lack of message is the problem, yes, but I've been operating under the assumption that that is more likely to be caused by CQ dying (being restarted) between the commit+push and the POST to Rietveld, not by Rietveld being laggy. If we're confident that it is only due to the latter, then yes, a different solution would be needed.
,
Nov 11 2016
There was an effort at some point (or at least an idea) to run CQ under service_manager. This way, CQ would only restart when there is an actual change pushed, avoiding this problem altogether. We may also be able to shutdown CQ more gracefully, by giving it a lot more time to complete its cycle.
,
Nov 11 2016
The cq.golo has picked up the change. Hopefully that's enough to mitigate. I'm going to lower the priority to P1, and assign to the actual CQ team to follow up and ensure that this truly can't happen. Pawel, please follow up with this quickly. This is a high-impact bug that never should have lived in the CQ for this long, and now needs to be fixed with utmost aclarity.
,
Nov 11 2016
FYI issue 490229 is about better CQ deployment, and in particular, putting it under service_manager.
,
Nov 11 2016
Spoke with agable after a longer convo with stip and jparent... sergeyberezin, we need more precise numbers on how often this is happening. Can you please investigate that? sergiyb, please implement tandrii's solution outlined in https://bugs.chromium.org/p/chromium/issues/detail?id=662140#c13 as it is much simpler (albeit a bit more hacky) than a two-phased commit approach. This is P1 so it overrides any other work being done until it's fixed. Also adding this to the trooper queue so that the current trooper has context into any weirdness happening in the tree due to double commits.
,
Nov 11 2016
,
Nov 11 2016
andybons: Just run the query I posted for deeper analysis: https://goto.google.com/ogban In its current form, each attempt with N commit attepmts will have N rows (N>=2), to list the timestamps when each commit attempt happened. I don't really have any more data than this; not sure how to tell if CQ actually double-committed something. But it attempts double commits literally several times a day - anywhere between 1 and 10 (it varies).
,
Nov 11 2016
I think the reason why we see CQ double commit more frequently lately is due to slower git pushes, which makes them more likely to be interrupted. I like agable@'s idea of using Rietveld comments as a Write Ahead Log. Actually, I have vague memory of this this idea floating around almost two years ago right after V8 switch when CQ tried to commit the same V8 patch twice and I went to investigate, but at the time post comment meant sending email, so it wasn't worth it. Since then, rmistry@ and I made it possible for CQ to be selective about email and visual spam, so this idea is perfect now! The devil, as usual for distributed systems is in the recovery :( In this case, refactoring checkouts.py to break apply_patch into pieces, and then also adding tests. Thus, perhaps better implementation is storing flag in Rietveld instead of parsing commit messages, but skipping recovery for now and just aborting attempt if flag already exists, after which users have to trigger CQ again hence invalidating the flag. Once that works, we can add more recovery logic.
,
Nov 11 2016
Real data: 24 duplicated successful commits since May 2nd when I changed Review-Url footer to its current shape. Full list - attached and https://docs.google.com/a/google.com/spreadsheets/d/1OeLTo7zNHnkXP4H8vzxVuUGnE0QRgiaVoCCfa2esPG8/edit?usp=sharing. Code https://chromium-review.googlesource.com/410201.
,
Nov 11 2016
Update on real data that my script finished running: 94 commits since Blink Merge day. So, we are at ~1.5 duplicate commits per week.
,
Nov 11 2016
Removing Infra-Troopers, since this is assigned, long-lived, and larger than an immediate outage.
,
Nov 11 2016
I will have a look at this bug next Wednesday. Unfortunately, I can't do much before because I'll be on a training in Zurich.
,
Nov 15 2016
@tandrii / @sergeyberezin - thanks for running the queries. Someone (probably in chrome dev) should review the list of double-commits and see if any of them caused problems or otherwise need to be reverted. I'll file a new bug for that work so as to not confuse this issue too much.
,
Nov 16 2016
Filed bug 665663 for the cleanup effort.
,
Nov 17 2016
I think Wed has happened in your timezone. Any update?
,
Nov 17 2016
I've started working on it, but wasn't able to do much progress (3 intern placement interviews, was not feeling well).
,
Nov 17 2016
Update: first version of the CL is out: https://chrome-internal-review.googlesource.com/c/305179/. But this is incomplete yet. I'll continue working on it tomorrow.
,
Nov 17 2016
Update: posted another patchset on https://chrome-internal-review.googlesource.com/c/305179 making it complete, but still pending review. Also separated refactoring changes into another CL https://chrome-internal-review.googlesource.com/c/305259/1 as per offline comment from tandrii@.
,
Nov 18 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/69f12f9b25a7845712dc5570020cae178f91ace8 commit 69f12f9b25a7845712dc5570020cae178f91ace8 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Thu Nov 17 17:58:31 2016
,
Nov 19 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/0c9cbe42e0934c49a68ac4f11794378c9511e253 commit 0c9cbe42e0934c49a68ac4f11794378c9511e253 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Fri Nov 18 16:31:18 2016
,
Nov 21 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/58221b7b3dbb82c7cb86a0194e566597a7046fa5 commit 58221b7b3dbb82c7cb86a0194e566597a7046fa5 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Mon Nov 21 14:54:33 2016
,
Nov 21 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/4b9e346f9efd73ad4b8b3e57278920eccfbe8c40 commit 4b9e346f9efd73ad4b8b3e57278920eccfbe8c40 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Mon Nov 21 15:02:39 2016
,
Nov 21 2016
Issue 666362 has been merged into this issue.
,
Nov 21 2016
I think this was fixed. We'll re-run analysis in two weeks to confirm that there are no further double-commits. If you also spot one, please re-open this bug immediately.
,
Nov 21 2016
Awesome. Thanks, Sergiy.
,
Dec 1 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/f448d8db34d3dcbf181f2b3a506d6e028d0245f3 commit f448d8db34d3dcbf181f2b3a506d6e028d0245f3 Author: Sergiy Byelozyorov <sergiyb@google.com> Date: Thu Dec 01 10:37:54 2016
,
Dec 9 2016
Ping, can you verify that this is resolved?
,
Dec 9 2016
There is actually an open issue 670865 that I will investigate today (didn't investigate it earlier because I was on a sick leave).
,
Dec 9 2016
Just re-run the analysis. No double-commits since Nov 11th.
,
Dec 12 2016
Andrii, while investigating another bug, just found out that we were trying to double-commit CLs approximately once an hour before my fix. Most of these attempts resulted in a confusing "CQ failed to apply patch" message on CLs after they were already landed, but some resulted in a double-commit.
,
Jan 4 2017
Just for the sake of posterity, here's an example of why falling back to 3-way merge *was* desirable. It's a good thing that we have it disabled, because it is too dangerous, but I figured I'd write this down anyway: This CL: https://codereview.chromium.org/2589503002/ (patch set 3) Runs just fine on all the bots: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2589503002/60001 But fails to apply on the CQ: Failed to apply patch for chrome/browser/android/webapk/webapk_update_data_fetcher.cc: While running git apply --index -p1; error: patch failed: chrome/browser/android/webapk/webapk_update_data_fetcher.cc:108 error: chrome/browser/android/webapk/webapk_update_data_fetcher.cc: patch does not apply Patch: chrome/browser/android/webapk/webapk_update_data_fetcher.cc Index: chrome/browser/android/webapk/webapk_update_data_fetcher.cc diff --git a/chrome/browser/android/webapk/webapk_update_data_fetcher.cc b/chrome/browser/android/webapk/webapk_update_data_fetcher.cc index a5a20efebe050d58bcb9d3074a7fff60b8fd0524..53a944e6afaae8181b5262efa0eb85924cd964e5 100644 --- a/chrome/browser/android/webapk/webapk_update_data_fetcher.cc +++ b/chrome/browser/android/webapk/webapk_update_data_fetcher.cc @@ -108,10 +108,10 @@ void WebApkUpdateDataFetcher::FetchInstallableData() { return; InstallableParams params; - params.ideal_icon_size_in_dp = - ShortcutHelper::GetIdealHomescreenIconSizeInDp(); - params.minimum_icon_size_in_dp = - ShortcutHelper::GetMinimumHomescreenIconSizeInDp(); + params.ideal_icon_size_in_px = + ShortcutHelper::GetIdealHomescreenIconSizeInPx(); + params.minimum_icon_size_in_px = + ShortcutHelper::GetMinimumHomescreenIconSizeInPx(); params.check_installable = true; params.fetch_valid_icon = true; InstallableManager::CreateForWebContents(web_contents()); This is because HEAD now has an extra "}" near the top of the context: https://chromium.googlesource.com/chromium/src/+/c44e4f0a2120f24a6701298868befa1b2adc1ce8/chrome/browser/android/webapk/webapk_update_data_fetcher.cc#111 The bots skip by this just fine: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.linux%2Fchromium_presubmit%2F334250%2F%2B%2Frecipes%2Fsteps%2Fbot_update%2F0%2Fstdout (ctrl-f for "error: while searching for") But the CQ fails. Anyway, just thought I'd write that down so we remember why it was useful if we ever revisit this in the future.
,
Mar 27 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/af8d74cc5bec8c7c939818df4ada9edfc3a218ba commit af8d74cc5bec8c7c939818df4ada9edfc3a218ba Author: Aaron Gable <agable@chromium.org> Date: Fri Nov 11 01:15:06 2016
,
Mar 27 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/af9b202bd2a4c57742dd374030e02404f93002cd commit af9b202bd2a4c57742dd374030e02404f93002cd Author: Andrii Shyshkalov <tandrii@chromium.org> Date: Mon Mar 27 17:40:03 2017 |
||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||
Comment 1 by scottmg@chromium.org
, Nov 3 2016