New issue
Advanced search Search tips

Issue 737125 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 648343
Owner: ----
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

bot_update should fail to apply the patch in tryjobs that ran after Gerrit CL has landed

Project Member Reported by machenb...@chromium.org, Jun 27 2017

Issue description

On CL https://chromium-review.googlesource.com/c/541221/ (patch 8) the layout test trybot ran twice inconsistently.

The first time, it compiled, ran the tests and had a lot of failures. The second time, analyze decided that no compile was necessary. Though the CL indeed breaks layout tests.

The two runs:
https://build.chromium.org/p/tryserver.v8/builders/v8_linux_blink_rel/builds/22894
https://build.chromium.org/p/tryserver.v8/builders/v8_linux_blink_rel/builds/22904

In the second run, the problem seems to occur in "git diff to analyze patch", it didn't show any files. Though it seems to be equally called and also the two bot_update steps seem to do the right thing (just had a superficial look).
 
Components: Infra>Codereview>Gerrit
Also adding gerrit label. Maybe it's due to the way bot_update applies gerrit patches...

Comment 2 by aga...@chromium.org, Jun 27 2017

Note that those two runs were on different bots, so this wasn't a case of the first run poisoning the checkout or anything like that.
No, but maybe whatever ran before the second build on _its bot_ poisoned the checkout?
This was the build that ran before on the same bot:
https://build.chromium.org/p/tryserver.v8/builders/v8_linux_blink_rel/builds/22900

Also nothing out of the ordinary as far as I can see...
I SSHed in and couldn't find anything weird with the checkout. I tried rebasing the checkout on the bot though. 

The git diff to analyze patch steps between the two runs look very similar. What's odd to me is that after step 15, the second bot seems to skip a ton of the steps that the first bot runs. 
Ignore my comment before about the git diff steps looking similar. I see the missing files now. 
Every difference after the "git diff" is a result of those missing files. I'm afraid this might not be very actionable, but still scary.

Are there more sanity checks that we could add as a shot in the dark?
Another example:
https://build.chromium.org/p/tryserver.v8/builders/v8_linux_blink_rel/builds/22798
https://build.chromium.org/p/tryserver.v8/builders/v8_linux_blink_rel/builds/22961

Both from same CL/patchset. The second one had an empty "git diff". It was on a different host than the example from the description.
It looks like the second build (22961) happened *after* the change landed (v8 was synced to 46251, but the change landed in 46249).

v8_linux_blink_rel is an experimental bot, and as such, they don't block the commit queue. Depending on busy said experimental bot might be, this can mean that it doesn't get a chance to run until after the change might've landed.

So, the empty diff is to be expected. I don't think there's an easy way to avoid this if you are capacity-constrained.
Cc: tandrii@chromium.org
Components: -Infra>Codereview>Gerrit Infra>CQ
Labels: -Pri-1 -Infra-Troopers -Type-Bug-Regression Pri-2 Type-Bug
Status: Available (was: Untriaged)
Summary: Failed experimental bots may re-run after CL has landed, looking like a no-op (was: Git diff inconsistent after bot_update patch application)
Nice analysis, thanks Dirk!

Yes, it all makes sense now. +tandrii for this CQ use case. This has the extra complication that a V8 patch is applied on a Chromium bot. CQ doesn't know anything about that.

But I think we could maybe teach the recipe to detect that the CL has already landed? I assume 'git diff == no files' already gives us that answer? Maybe we could in this case bale out immediately with a different message.

@tandrii, is CQ possibly re-triggering a failed experimental bot after the CL has landed? If that were possible, maybe we could avoid that. If not, then the second run must have been triggered just before the patch landed. And the patch managed to land before bot_update got to it. There's probably a max-1-minute window where that can happen.
Components: -Infra -Infra>CQ Infra>Codereview>Gerrit
Labels: Milestone-Afterglow Proj-Gerrit-Migration
Summary: bot_update should fail in tryjobs that ran after Gerrit CL has landed (was: Failed experimental bots may re-run after CL has landed, looking like a no-op)
I think apply_issue for Rietveld patches was able to detect this is CL was marked as closed (IIRC). Current rebase logic for Gerrit patches in bot_update isn't that smart: it'd fail iff patch doesn't rebase cleanly. Hence you the observed results. I think best way to implement this is to add extra step:

= git checkout refs/patch
+ # check if has been CL merged using Gerrit API. Race conditions still possible due to eventual consistency, but should be rare. 
+ # For total protection, maybe check that commit message stays the same after rebase?
= git rebase RETCH_HEAD onto refs/head/branch

Also, IMO, CQ is WAI here, and since one can trigger tryjobs with git cl try with the same outcomes, I'm removing CQ component.
Summary: bot_update should fail to apply the patch in tryjobs that ran after Gerrit CL has landed (was: bot_update should fail in tryjobs that ran after Gerrit CL has landed)
Mergedinto: 648343
Status: Duplicate (was: Available)
At this point, this bug is now a duplicate of 648343

Sign in to add a comment