
Issue 874117

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature




CQ Bots should have an API that lets them fail without being retried

Project Member Reported by agrieve@chromium.org, Aug 14

Issue description

The android-binary-size bot doesn't run any tests, so it shouldn't be flaky at all.

Looking at this as an example: (click show experimental)
https://chromium-review.googlesource.com/c/chromium/src/+/1115702

The bot was run three times. I'm guessing this is due to retry logic. I'd guess that most bots that fail on the "compile" step would not need to be retried.
 
Components: Infra>Platform>CQ
I think this would need a change in the CQ itself; retries are configurable but do not appear to be configurable on a per-builder basis at the moment.
First, experimental builds aren't retried. However, there were >=3 CQ attempts (dry run or full run), each of which triggered a new experimental build, because...

Second, CQ attempt N+1 re-uses only green jobs from prior (1..N) attempts. This is true for both experimental and normal tryjobs.
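
A minimal sketch of the reuse rule described above, assuming a simplified model in which each attempt records one status string per builder. The function name, the status strings, and the data shape are hypothetical and are not taken from the CQ's actual code:

```python
def builders_to_trigger(required_builders, prior_attempts):
    """Return the builders that attempt N+1 has to trigger again.

    prior_attempts is a list of dicts (attempts 1..N), each mapping a
    builder name to a status string such as "SUCCESS" or "FAILURE".
    This shape is an assumption made for illustration only.
    """
    green = set()
    for attempt in prior_attempts:
        for builder, status in attempt.items():
            if status == "SUCCESS":
                green.add(builder)
    # Only green results are reused; every other builder runs again.
    return [b for b in required_builders if b not in green]


prior = [
    {"linux-rel": "SUCCESS", "android-binary-size": "FAILURE"},
    {"android-binary-size": "FAILURE"},
]
print(builders_to_trigger(["linux-rel", "android-binary-size"], prior))
```

Under this model a builder that never goes green is triggered once per CQ attempt, which would explain a failing experimental builder running three times across three attempts.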


To make CQ retries per-builder, there are two ways:

1. Static, defined in cq.cfg per builder. Something like:
  builder {
    name: "super-deterministic"
    # max_retries takes one of:
    #   -1: retries disabled
    #    0: default, shares the retry budget with other builders
    #    N: up to N retries, regardless of other builders
    max_retries: -1
  }

2. Runtime build property, say "cq_retries" which can be set by recipe to tell CQ to perhaps not retry this builder any more.
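
To make the two options easier to compare, here is a rough sketch of how a retry decision could combine them. Everything in it is an assumption for illustration: the function name, the idea of a shared budget counter, and in particular the reading of "cq_retries" as a cap on the total number of retries for the builder (the proposal above does not pin down its exact meaning).

```python
def retry_allowed(max_retries, retries_used, shared_budget_left, cq_retries=None):
    """Decide whether one more retry of a builder is allowed.

    max_retries follows option 1:
      -1 disables retries, 0 shares the common retry budget,
      N > 0 allows up to N retries for this builder alone.
    cq_retries models option 2: an optional runtime property set by the
    recipe, interpreted here (assumption) as a cap on total retries.
    """
    # Option 2: the recipe can ask the CQ to stop retrying at runtime.
    if cq_retries is not None and retries_used >= cq_retries:
        return False
    # Option 1: static per-builder configuration.
    if max_retries == -1:
        return False
    if max_retries == 0:
        return shared_budget_left > 0
    return retries_used < max_retries


# A "super-deterministic" builder with retries disabled is never retried.
print(retry_allowed(max_retries=-1, retries_used=0, shared_budget_left=5))
# A recipe that sets cq_retries to 0 stops retries even if the config allows them.
print(retry_allowed(max_retries=2, retries_used=0, shared_budget_left=5, cq_retries=0))
```

In this sketch the runtime property can only reduce retries, never grant more than the static config allows; that is a design choice of the example, not something stated in the proposal.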


Opinions?
I would prefer option 1.
Cc: liaoyuke@chromium.org st...@chromium.org
I'm not wild about this idea, because while it may reduce the number of unnecessary retries, it will probably also increase the number of false negatives (i.e., false rejections), and keeping the latter low is more important to me.

For example, if the compile is ever broken at HEAD, your CL will fail and not be retried, even if your CL isn't at fault.

It would be interesting to try and quantify how often we had unnecessary retried jobs.
Cc: dpranke@chromium.org
I wanted to point out that compile could also be flaky. From what I have observed so far, the causes could be:
1) Bot died
2) Unexpected exception during compile without any error message
3) Incorrect gn config of indirect dependencies for generated header files

As Dirk mentioned, we need some data to understand it better.
It's not even just compile; there's probably a half dozen steps that might fail that aren't infra failures. 
I'm also not "wild" about this idea for chromium/src. I think non-Chromium projects might actually benefit from this.
Labels: -Type-Bug Type-Feature
But to be clear, this is a new feature request and it's certainly not Pri1, unless someone can show that doing this for some super stable builder would save us lots of capacity.
Cc: erikc...@chromium.org
Ah, I was planning on implementing option (2). Didn't realize there was an existing bug for this.

> 2. Runtime build property, say "cq_retries" which can be set by recipe to tell CQ to perhaps not retry this builder any more.

For most recipes, I expect [in the not too distant future] to:
* Need to retry compile failures, as there may be some config issue with the device.
* Not need to retry test failures, as there will be *very small probability* that the failure is due to flakiness.
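
As a rough illustration of the policy in the two bullets above, a recipe could pick the value it reports through the proposed "cq_retries" property based on which kind of step failed. The function name, the step-kind strings, and the default of 2 retries are all hypothetical:

```python
def cq_retries_for(failed_step_kind, default_retries=2):
    """Value a recipe might write to the proposed "cq_retries" property.

    failed_step_kind is a hypothetical label such as "compile" or "test".
    """
    if failed_step_kind == "test":
        # Test failures are expected to be real, so ask for no retries.
        return 0
    # Compile (and any other) failures may come from device or config
    # issues, so keep the default retry count.
    return default_retries


print(cq_retries_for("compile"))
print(cq_retries_for("test"))
```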
I feel like most of this falls into the "unnecessary optimization" bucket, but maybe that's only true because we're not notifying devs of a failure until we've given up retrying?
