CQ: support retrying child builds
Issue description

(I'm pretty sure there is a V8 bug for this from before, but after 5 minutes of searching I can't find it.)

Scenario: suppose there are two try builders:
"compiler", which does checkout + compile + isolate + triggering ...
"tester", which either tests locally or triggers and waits for swarming tasks.

Today, CQ can only trigger "compiler", but is able to block until both compiler and tester turn green.

There is a case where CQ can get stuck. Suppose compiler is green, but the triggered "tester" is red. Even if CQ aborts the attempt upon tester turning red (which may not actually be the case), re-triggering CQ will not be sufficient to make any progress. The only workaround today is to upload a new patchset with some changes (not just the description) and re-trigger CQ from scratch.

Two proposals to mitigate this in CQ itself:

1. Blunt: re-trigger the job that ultimately triggered "tester", i.e. "compiler". This will retry all child builds even if they were green, so it is a heavy hammer, but it's very robust and easy to implement. It's definitely better than the current workaround, which requires retrying **everything**.

2. Re-trigger just the "tester" job itself, specifying all properties that were originally specified by "compiler". This is very efficient, but a) it is more work to implement and b) it is not robust against bad compiler artifacts, such as a bad commit which doesn't break compile yet breaks tests. In such a case, we ultimately want to re-sync to a new healthy tip of tree, re-compile, and re-run all tests.
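
To make the difference between the two proposals concrete, here is a minimal sketch of the retry planning each one implies. It is not CQ's actual code: the `Build` class, the status strings, the `isolated_hash` property and both `plan_*` functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Build:
    """Hypothetical, simplified view of a try build as CQ might track it."""
    builder: str
    status: str  # "SUCCESS" or "FAILURE" in this sketch
    properties: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Build"] = None  # the build that triggered this one


def root_of(build: Build) -> Build:
    """Walk up the trigger chain to the build CQ itself started."""
    while build.parent is not None:
        build = build.parent
    return build


def plan_blunt_retry(builds: List[Build]) -> List[str]:
    """Proposal 1: re-trigger the top-level builder behind each failed build."""
    roots: List[str] = []
    for build in builds:
        if build.status == "FAILURE":
            name = root_of(build).builder
            if name not in roots:
                roots.append(name)
    return roots


def plan_targeted_retry(builds: List[Build]) -> List[Tuple[str, Dict[str, str]]]:
    """Proposal 2: re-trigger only failed builds, reusing their properties."""
    return [
        (build.builder, dict(build.properties))
        for build in builds
        if build.status == "FAILURE"
    ]


compiler = Build("compiler", "SUCCESS")
tester = Build("tester", "FAILURE", {"isolated_hash": "abc"}, parent=compiler)
blunt = plan_blunt_retry([compiler, tester])
targeted = plan_targeted_retry([compiler, tester])
```

In this sketch the blunt plan names only "compiler", which re-runs everything downstream from a fresh checkout, while the targeted plan names only "tester" and reuses the artifacts the original compiler build produced.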
,
Sep 12
Re #1: manually retrying has two parts (and it probably has to be tracked in a different bug):

[Command line]: it is trivial to implement in the "git cl try" command, but that isn't used in Fuchsia. Presumably you can implement a similar helper which uses the Change-Id of, say, HEAD in the user's checkout: first fetch the existing builds so you can copy their inputs, then trigger child builds with the same inputs. With the buildbucket v2 API it's a lot easier: just 1 schedule build call, which gets passed the prior build id.

[Gerrit UI]: basically the same as the command line, except in JavaScript.

Note, however, that CQ will need to be taught to recognize such manual retries. This is also required for proposal (2) (CQ re-triggering "tester" on its own), but it isn't required for proposal 1.
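
A rough sketch of the command-line helper described above, written against an in-memory stand-in rather than the real buildbucket API. `FakeBuildService`, `BuildRecord`, `retry_failed_children` and the input keys are all hypothetical; a real helper would replace `search` and `add` with actual buildbucket calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class BuildRecord:
    build_id: int
    builder: str
    status: str
    inputs: Dict[str, str] = field(default_factory=dict)


class FakeBuildService:
    """In-memory stand-in for a build service (assumption, not a real API)."""

    def __init__(self) -> None:
        self.builds_by_change: Dict[str, List[BuildRecord]] = {}
        self._next_id = 1

    def add(self, change_id: str, builder: str, status: str,
            inputs: Dict[str, str]) -> BuildRecord:
        # Copy the inputs so the new build does not share state with the old one.
        build = BuildRecord(self._next_id, builder, status, dict(inputs))
        self._next_id += 1
        self.builds_by_change.setdefault(change_id, []).append(build)
        return build

    def search(self, change_id: str) -> List[BuildRecord]:
        # Return a copy so callers can schedule new builds while iterating.
        return list(self.builds_by_change.get(change_id, []))


def retry_failed_children(service: FakeBuildService, change_id: str,
                          child_builders: Set[str]) -> List[BuildRecord]:
    """Fetch existing builds for a change and re-schedule failed child builds
    with the same inputs the original builds received."""
    retried = []
    for old in service.search(change_id):
        if old.builder in child_builders and old.status == "FAILURE":
            retried.append(
                service.add(change_id, old.builder, "SCHEDULED", old.inputs))
    return retried


svc = FakeBuildService()
svc.add("I123", "compiler", "SUCCESS", {"patchset": "1"})
failed = svc.add("I123", "tester", "FAILURE", {"patchset": "1", "isolated": "abc"})
new_builds = retry_failed_children(svc, "I123", {"tester"})
```

CQ would still need a way to recognize `new_builds` as retries of the builds it is waiting on, which is the part the comment flags as missing.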
,
Sep 12
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal/+/50cd15c70c0612e83f6e3d3c7080acae164e3df9 commit 50cd15c70c0612e83f6e3d3c7080acae164e3df9 Author: Andrii Shyshkalov <tandrii@chromium.org> Date: Wed Sep 12 22:30:50 2018
,
Sep 13
I thought proposal 1 is how it already works today? If I retrigger CQ, all the compilers belonging to failed child builds are retriggered. Do you have an example where it got stuck that way? V8 folks are basically working like that all the time. The only downside you mention is some resource waste with multiple children. But e.g. V8 has only a 1:1 parent:child relation and other projects don't support it at all. The only times I needed the new-patchset workaround were when the buildbucket integration had a bug/outage and CQ wouldn't realize jobs were done.
,
Sep 13
+1 to what Michael said. My experience suggests that it already works like proposal 1, and waste of resources is not a concern for V8 due to the 1:1 mapping. AFAIK, no other projects use triggered builds in CQ, which is a pity given how many resources it could save on bots that trigger all tests as swarming tasks. But that's a different discussion... Not sure whether it's common, but re-triggering the parent builder is also important for handling flakes in the build process that only surface when running tests.
,
Sep 13
Yes indeed. The compiler always needs a retrigger, as Sergiy suggested, since the test failures could stem from a dirty ToT. ToT might have advanced by the time of the retrigger, and the rerun should account for that.
,
Sep 14
Ha, thanks for correcting me!
,
Sep 17
WontFix?
,
Sep 17
Given that the blunt approach already works, I think it's fine to WontFix.
,
Dec 29
Fuchsia is likely going to implement compiler-tester split in Q1 and I'd like to see proposal #2 be an option. Our compilation is slow and our tests flake a lot, so retrying just the test builds when they fail will save people a lot of time. tandrii@, would you be open to implementing #2, perhaps providing an option to enable / disable that behavior?
,
Jan 4
Actually we're leaning towards putting retry logic into our tester recipe, so we probably don't need this.
,
Jan 7
Sorry for the late reply, but I'm glad you've chosen:
> leaning towards putting retry logic into our tester recipe
Right, retrying at the lowest level is best.
,
Jan 7
That said, we are considering a new protocol for a given build to report back to CQ whether its result is final (already added) and, if not, whether to retry itself and/or its parent. LMK if you are interested.
,
Jan 7
How would that be useful if we put the retry logic into our tester recipe?
,
Jan 7
Example: say you have 2-stage jobs: compiler (checkout HEAD, apply patch, build, isolate) -> tester (run tests on some device, maybe retry on another). Sometimes tester may realize that the problem is due to usage of a bad HEAD, hence it'd ask to retry the compiler job.
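
One way to picture the reporting protocol under discussion is a small sketch in which a build tells CQ whether its result is final and, if not, what should be retried. The thread does not specify the actual message format, so the `BuildReport` fields and the `builders_to_retry` function here are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BuildReport:
    """Hypothetical report a finished build sends back to CQ."""
    builder: str
    parent_builder: Optional[str]  # builder that triggered this build, if any
    passed: bool
    final: bool                    # True: CQ should not retry anything
    retry_self: bool = False       # ask CQ to re-run this builder
    retry_parent: bool = False     # ask CQ to re-run the triggering builder


def builders_to_retry(report: BuildReport) -> List[str]:
    """Decide which builders CQ should re-trigger based on one report."""
    if report.passed or report.final:
        return []
    targets = []
    if report.retry_parent and report.parent_builder is not None:
        # Re-running the parent (e.g. compiler) re-syncs and rebuilds.
        targets.append(report.parent_builder)
    if report.retry_self:
        targets.append(report.builder)
    return targets


# Tester suspects a bad HEAD, so it asks for the compiler job to be retried.
bad_head = BuildReport("tester", "compiler", passed=False, final=False,
                       retry_parent=True)
# Tester is confident the patch itself is broken: the failure is final.
real_failure = BuildReport("tester", "compiler", passed=False, final=True)
```

Here `builders_to_retry(bad_head)` names only the compiler, while `builders_to_retry(real_failure)` is empty, so CQ would fail the attempt without spending more resources.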
,
Jan 7
My initial reaction is that we wouldn't use that feature. But you should probably ask other CQ users.
Comment 1 by joshuaseaton@google.com
, Sep 12