CQ Bots should have an API that lets them fail without being retried
Issue description
The android-binary-size bot doesn't run any tests, so it shouldn't be flaky at all. As an example (click "show experimental"): https://chromium-review.googlesource.com/c/chromium/src/+/1115702 The bot was run three times there; I'm guessing this is due to retry logic. I'd guess that most bots that fail on the "compile" step would not need to be retried.
,
Aug 14
First, experimental builds aren't retried. But, there were >=3 CQ attempts (Dry run or full run), each of which triggered a new experimental build, because...
Second, CQ attempt N+1 re-uses only green jobs from prior (1..N) attempts. This is true for both experimental and normal tryjobs.
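The reuse rule above can be sketched roughly as follows. This is only an illustration of the described behavior, not CQ's actual code; `TryjobResult` and `builders_to_trigger` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TryjobResult:
    """Hypothetical record of one tryjob in one CQ attempt."""
    builder: str
    passed: bool


def builders_to_trigger(required, prior_attempts):
    """Return the builders that need a fresh build in the next CQ attempt.

    A builder is skipped only if some earlier attempt already has a green
    (passed) result for it; failed results are never reused.
    """
    green = {r.builder for attempt in prior_attempts for r in attempt if r.passed}
    return [b for b in required if b not in green]


# Two prior attempts: linux-rel went green once, android-binary-size never did.
prior = [
    [TryjobResult("linux-rel", True), TryjobResult("android-binary-size", False)],
    [TryjobResult("android-binary-size", False)],
]
next_builders = builders_to_trigger(["linux-rel", "android-binary-size"], prior)
```

Under this reading, a builder that keeps failing is re-triggered on every new attempt, which would explain one build per attempt in the linked CL.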
To make CQ retries per-builder, there are two ways:
1. Static, defined in cq.cfg per builder. Something like:
builder {
  name: "super-deterministic"
  max_retries:
    -1 (disabled)
    0 (default, shares retries budget with other builders)
    N (up to N retries, regardless of other builders)
}
2. Runtime build property, say "cq_retries" which can be set by recipe to tell CQ to perhaps not retry this builder any more.
Opinions?
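For option 1, the three proposed `max_retries` values could be interpreted along these lines. This is a sketch of one possible reading, not an existing cq.cfg feature: `may_retry`, `builder_retries_used`, and `shared_budget_left` are hypothetical names, and "-1 (disabled)" is assumed to mean that retries are disabled for that builder.

```python
def may_retry(max_retries, builder_retries_used, shared_budget_left):
    """Decide whether a failed builder may be retried.

    Assumed semantics of the proposed max_retries field:
      -1: retries are disabled for this builder
       0: default; the builder draws from the budget shared with other builders
       N: up to N retries for this builder, regardless of other builders
    """
    if max_retries < 0:
        return False
    if max_retries == 0:
        return shared_budget_left > 0
    return builder_retries_used < max_retries
```

With this reading, a builder such as "super-deterministic" would set `max_retries: -1` and fail the attempt on its first red build.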
,
Aug 14
I would prefer option 1.
,
Aug 14
I'm not wild about this idea, because while it may reduce the number of unnecessary retries, it will probably also increase the number of false negatives (i.e., false rejections), and keeping the latter low is more important to me. For example, if the compile is ever broken at HEAD, your CL will fail and not be retried, even if your CL isn't at fault. It would be interesting to try to quantify how often we have had unnecessarily retried jobs.
,
Aug 14
,
Aug 14
I wanted to point out that compile can also be flaky. From what I've observed so far, the causes could be:
1) Bot died
2) Unexpected exception during compile without any error message
3) Incorrect gn config of indirect dependencies for generated header files
As Dirk mentioned, we need some data to understand it better.
,
Aug 14
It's not even just compile; there's probably a half dozen steps that might fail that aren't infra failures.
,
Aug 14
I'm also not "wild" about this idea for chromium/src. I think non-Chromium projects might actually benefit from this.
,
Aug 14
But to be clear, this is a new feature request, and it's certainly not Pri1 unless someone can show that doing this for some super stable builder would save us lots of capacity.
,
Oct 12
,
Oct 12
Ah, I was planning on implementing option (2). Didn't realize there was an existing bug for this.
> 2. Runtime build property, say "cq_retries" which can be set by recipe to tell CQ to perhaps not retry this builder any more.
For most recipes, I expect [in the not too distant future] to:
* Need to retry compile failures, as there may be some config issue with the device.
* Not need to retry test failures, as there will be a *very small probability* that the failure is due to flakiness.
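A rough sketch of how option (2) might look, under stated assumptions: the thread does not define the value of "cq_retries", so here it is assumed to be an integer cap on retries, with an unset property falling back to a default. `DEFAULT_RETRIES`, `output_properties_for_failure`, and `should_retry` are hypothetical names, not part of any real recipe or CQ API.

```python
DEFAULT_RETRIES = 2  # hypothetical CQ default, used when the property is unset


def output_properties_for_failure(failed_step):
    """Recipe side: choose build output properties for a failed step.

    Follows the expectation in the comment above: compile failures stay
    retryable (property left unset), test failures ask CQ not to retry.
    """
    if failed_step == "compile":
        return {}
    return {"cq_retries": 0}


def should_retry(output_properties, retries_used):
    """CQ side: honor the assumed cq_retries cap if the recipe set one."""
    allowed = output_properties.get("cq_retries", DEFAULT_RETRIES)
    return retries_used < allowed
```

A recipe with the opposite policy (never retry compile, as suggested in the issue description) would only need to swap which step returns the property.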
,
Oct 16
I feel like most of this falls into the "unnecessary optimization" bucket, but maybe that's only true because we're not notifying devs of a failure until we've given up retrying?
Comment 1 by jbudorick@chromium.org
, Aug 14