
Issue 883406 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Sep 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




CQ: support retrying child builds

Project Member Reported by tandrii@chromium.org, Sep 12

Issue description

(I'm pretty sure there is V8 bug for this from before, but after 5 minutes of searching, I can't find it)

Scenario: suppose there are two try builders:
  "compiler", which does checkout + compile + isolate + triggering ...
  "tester", which either tests locally or triggers and waits for swarming tasks.

Today, CQ can only trigger "compiler", but is able to block until both compiler and tester turn green. There is a case when CQ can get stuck:

Suppose, compiler is green, but triggered "tester" is red.

Now, even if CQ aborts the attempt upon tester turning red (which may actually not be the case), re-triggering CQ will not be sufficient to make any progress. The only workaround today is to upload a new patchset with some changes (not just to the description) and re-trigger CQ from scratch.


Two proposals to mitigate this in CQ itself:
1. Blunt: re-trigger the job that ultimately triggered "tester", i.e. "compiler". This will retry all child builds even if they were green, so it is a heavy hammer, but it's very robust and easy to implement. It's definitely better than the current workaround, which requires retrying **everything**.

2. Re-trigger just "tester" job itself by specifying all properties that were originally specified by "compiler". This is very efficient, but
a) more work to implement
b) not robust in case of bad compiler artifacts, such as in case of a bad commit which doesn't break compile yet breaks test. In such a case, we ultimately want to re-sync to new healthy tip of tree, re-compile, and re-run all tests.
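
For illustration only, the two proposals can be sketched as follows. This is a minimal model, not CQ code: the `Build` class, both `plan_*` functions, and the property names (`patchset`, `isolated_hash`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Build:
    """Hypothetical, simplified view of a try build."""
    builder: str
    status: str  # "SUCCESS" or "FAILURE"
    properties: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Build"] = None  # the build that triggered this one


def plan_blunt_retry(failed: Build) -> List[dict]:
    """Proposal 1: re-trigger the job that ultimately triggered the failed build."""
    root = failed
    while root.parent is not None:
        root = root.parent
    return [{"builder": root.builder, "properties": dict(root.properties)}]


def plan_targeted_retry(failed: Build) -> List[dict]:
    """Proposal 2: re-trigger only the failed build, reusing its original properties."""
    return [{"builder": failed.builder, "properties": dict(failed.properties)}]


# Made-up example data: a green compiler that triggered a red tester.
compiler = Build("compiler", "SUCCESS", {"patchset": "3"})
tester = Build(
    "tester",
    "FAILURE",
    {"patchset": "3", "isolated_hash": "abc123"},
    parent=compiler,
)

print(plan_blunt_retry(tester))
print(plan_targeted_retry(tester))
```

The blunt plan schedules "compiler" again, which re-syncs and re-compiles, so it also covers the bad-artifact case in (b). The targeted plan reuses whatever "compiler" originally passed down, stale or not.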
 
Possibly an additional point of mitigation, which it seems we would eventually need support for: manually choosing and retrying the failing child build (and having it check what input a parent build might have already fed to it at the given revision)?
Re #1: manually retrying has two parts (and it probably has to be tracked in a different bug):

[Command line]: this is trivial to implement in the "git cl try" command, but that command is not used in Fuchsia. Presumably you can implement a similar helper that uses the Change-Id of, say, HEAD in the user's checkout: first fetch the existing builds so you can copy their inputs, then trigger the child builds with the same inputs.
With the buildbucket v2 API it's a lot easier: just one schedule build call, which is passed the prior build id.

[Gerrit UI]: basically the same as the command line, except written in JavaScript.
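
A rough sketch of the helper logic described above, with no network calls. The build dictionaries and the `parent_id` key are hypothetical. The `templateBuildId` key stands in for "the prior build id" passed to a v2 schedule build call; that field name is an assumption and should be checked against the actual buildbucket v2 API before use.

```python
from typing import Dict, List


def retry_requests(builds: List[Dict], use_prior_build_id: bool = True) -> List[Dict]:
    """Build one schedule request per failed child build.

    `builds` is a hypothetical list of existing builds already fetched for a change.
    """
    requests = []
    for build in builds:
        is_child = build.get("parent_id") is not None
        if build["status"] != "FAILURE" or not is_child:
            continue
        if use_prior_build_id:
            # v2 style: refer to the prior build and let the server copy its inputs.
            # ASSUMPTION: the field name is illustrative, not confirmed by this thread.
            requests.append({"templateBuildId": build["id"]})
        else:
            # Older style: copy the inputs by hand.
            requests.append(
                {"builder": build["builder"], "properties": dict(build["properties"])}
            )
    return requests


# Made-up example data.
existing_builds = [
    {"id": 1, "builder": "compiler", "status": "SUCCESS", "parent_id": None,
     "properties": {"patchset": "3"}},
    {"id": 2, "builder": "tester", "status": "FAILURE", "parent_id": 1,
     "properties": {"patchset": "3", "isolated_hash": "abc123"}},
]

print(retry_requests(existing_builds))
print(retry_requests(existing_builds, use_prior_build_id=False))
```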


Note, however, that CQ will need to be taught to recognize such manual retries. This is also required for proposal 2 (CQ re-triggering "tester" on its own), but not for proposal 1.
Project Member

Comment 3 by bugdroid1@chromium.org, Sep 12

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/50cd15c70c0612e83f6e3d3c7080acae164e3df9

commit 50cd15c70c0612e83f6e3d3c7080acae164e3df9
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Wed Sep 12 22:30:50 2018

I thought proposal 1 is how it already works today? If I retrigger CQ, all the compilers belonging to failed child builds are retriggered. Do you have an example where it got stuck that way? V8 folks are basically working like that all the time.

The only downside you write is some resource waste with multiple children. But e.g. V8 has only a 1:1 parent:child relation and other projects don't support it at all.

The only times where I needed the new patchset work-around was when buildbucket integration had a bug/outage and CQ wouldn't realize jobs are done.
+1 to what Michael said. My experience suggests that it already worked like in proposal 1 and waste of resources is not a concern for V8 due to 1:1 mapping. AFAIK, no other projects used triggered builds in CQ, which is a pity though given how many resources it could save on bots that trigger all tests as swarming tasks. But that's a different discussion...

Not sure whether it's common, but re-triggering the parent builder is also important to handle flakes in the build process that only surface when running tests.
Yes indeed. The compiler always needs to be retriggered, as Sergiy suggested, since the test failures could stem from a dirty ToT. ToT might have advanced by the time of retriggering, and the rerun should account for that.
Ha, thanks for correcting me!
WontFix?
Owner: tandrii@chromium.org
Status: WontFix (was: Available)
Given that the blunt approach already works, I think it's fine to WontFix.
Cc: garymm@google.com
Fuchsia is likely going to implement compiler-tester split in Q1 and I'd like to see proposal #2 be an option. Our compilation is slow and our tests flake a lot, so retrying just the test builds when they fail will save people a lot of time.

tandrii@, would you be open to implementing #2, perhaps providing an option to enable / disable that behavior?
Actually we're leaning towards putting retry logic into our tester recipe, so we probably don't need this.
Sorry for the late reply, but I'm glad you've chosen:

> leaning towards putting retry logic into our tester recipe

Right, retrying at the lowest level is best.
That said, we are considering a new protocol for a given build to report back to CQ whether its result is final (already added) and, if not, whether to retry itself and/or its parent. LMK if you are interested.
How would that be useful if we put the retry logic into our tester recipe?

example: say you have 2 stage jobs:
 compiler (checkout HEAD, apply patch, build, isolate)
  -> tester (run tests on some device, maybe retry on another)
Sometimes tester may realize that the problem is due to a bad HEAD, in which case it would ask to retry the compiler job.
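
One possible shape for such a report, as a sketch only. The protocol was still under consideration in this thread, so `RetryTarget`, `BuildReport`, and `next_action` are all hypothetical names rather than an existing CQ interface.

```python
from dataclasses import dataclass
from enum import Enum


class RetryTarget(Enum):
    NONE = "none"
    SELF = "self"      # retry the reporting build itself
    PARENT = "parent"  # retry the build that triggered it


@dataclass(frozen=True)
class BuildReport:
    """Hypothetical report a build sends back to CQ."""
    builder: str
    passed: bool
    final: bool  # True if the build considers its result final
    retry: RetryTarget = RetryTarget.NONE


def next_action(report: BuildReport, parent_builder: str) -> str:
    """Decide what CQ would do with a report in this sketch."""
    if report.passed:
        return "accept"
    if report.final or report.retry is RetryTarget.NONE:
        return "fail attempt"
    if report.retry is RetryTarget.PARENT:
        return f"retry {parent_builder}"
    return f"retry {report.builder}"


# tester decides the failure is due to a bad HEAD and asks for the compiler to be retried.
bad_head = BuildReport("tester", passed=False, final=False, retry=RetryTarget.PARENT)
print(next_action(bad_head, parent_builder="compiler"))
```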
My initial reaction is that we wouldn't use that feature.
But you should probably ask other CQ users.
