New issue
Advanced search Search tips

Issue 907081 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature



Sign in to add a comment

CQ: Do not retry on CQ+1

Project Member Reported by mar...@chromium.org, Nov 20

Issue description

Right now, the CQ retries builds in case a build is flaky.

When doing CQ+1, the false positive rate is generally lower, because the true positive is higher.

So we should disable automatic build retry on CQ+1. Opinion?
 
Cc: erikc...@chromium.org
Components: Infra>Client>Chrome
I think that's a quite good idea for chromium/src.
CCI + erikchen@ WDYT?
I think we should continue to do so for now. 

I don't think CQ+1 vs CQ+2 is the main factor for ratio of false positives vs false negatives. I'd like to eventually move towards a world where test failures are not retried, but build/infra issues are retried, as I think that's the real distinguishing factor for false positives vs false negatives.

https://bugs.chromium.org/p/chromium/issues/detail?id=903495#c1


I kinda disagree, when doing a CQ+1 without the CL sent for review, this is clearly a check by a dev to see if the code is good or not, and that the odd of true failure is high(er).
I'm in favor of this for chromium/src; I'm ok with forcing developers to manually intervene to retry at a build level for dry runs.
What at least 3 of us have so far assumed is that not retrying in dry run will actually be a good thing for infra. Will it actually?

I assumed the only benefit is reduction in required fleet capacity, but is this reduction substantial to warrant extra feature in CQ?

Have I missed some other benefit?
The main issues is that doing a "git cl upl -d" currently takes twice as time as it should to send the failure email, because jobs are retried unconditionally.
Oh, latency in response to a developer is something I totally missed, even though I regularly experience it (since I'm a big of fan of "-d" flag).

I guess my personal bias is motivating me to implement this, and suddenly it doesn't look as hard :D
As a Chromium developer, it is time consuming to run a patch through the CQ, and then discover that it's failed [due to unrelated reasons], as the developer now has to retry the CQ and wait again. We currently have no metrics to track the pain of this since false rejects only tracks committed CQ attempts. As such, I'm hesitant to land a change that might have a negative effect on developers.

I've attached a script that looks at all dry runs in the period from 11-01 through 11-22. 

=====Effectiveness of CQ retry for dry runs=====

cq_runs_not_retried 8849
cq_runs_retry_same_result 1531
cq_runs_retry_different_result 993

39% of dry-runs produce a different result when failing builds get retried by CQ. If we got rid of CQ layer retry for dry runs, we'd increase false positives by at least (993) / (993 + 1531 + 8849) = 8.7%.

=====Effectiveness of CQ retry for dry runs [breakdown by builds & failure reason]=====

Breaking down by failure reason of individual builds, we see:

builds_not_retried [('NO_FAILURE_REASON', 298909), (u'PATCH_FAILURE', 6634), (u'COMPILE_FAILURE', 14527), (u'TEST_FAILURE', 13805), (u'INVALID_TEST_RESULTS', 2715)]

builds_retried_same_result [('NO_FAILURE_REASON', 460), (u'PATCH_FAILURE', 56), (u'COMPILE_FAILURE', 339), (u'TEST_FAILURE', 653), (u'INVALID_TEST_RESULTS', 146)]

builds_retried_different_result [('NO_FAILURE_REASON', 454), (u'COMPILE_FAILURE', 303), (u'TEST_FAILURE', 189), (u'INVALID_TEST_RESULTS', 141)]

We see that when a dry-run fails due to a TEST_FAILURE, retrying it will produce a different result 22% of the time. However, when a dry-run fails due to 'COMPILE_FAILURE' or due to 'INVALID_TEST_RESULTS', retrying will produce a different result ~50% of the time. 

=====Effectiveness of CQ retry for all CQ attempts=====
cq_runs_not_retried 14451
cq_runs_retry_same_result 2411
cq_runs_retry_different_result 1628

40% of all CQ attempts produce a different result when failing builds get retired by the CQ. If we got rid of CQ layer retry for all CQ attempts, we'd increase false positives by at least (1628) / (1628 + 2411 + 14451) = 8.8%

builds_not_retried [('NO_FAILURE_REASON', 501070), (u'COMPILE_FAILURE', 15985), (u'TEST_FAILURE', 19332), (u'PATCH_FAILURE', 7311), (u'INVALID_TEST_RESULTS', 3379)]
builds_retried_same_result [('NO_FAILURE_REASON', 748), (u'COMPILE_FAILURE', 462), (u'PATCH_FAILURE', 81), (u'TEST_FAILURE', 1096), (u'INVALID_TEST_RESULTS', 216)]
builds_retried_different_result [('NO_FAILURE_REASON', 794), (u'COMPILE_FAILURE', 462), (u'TEST_FAILURE', 300), (u'INVALID_TEST_RESULTS', 234)]

=====Summary=====
These stats show that CQ-layer retry is currently highly effective at reducing false positives for both dry runs and all CQ attempts. 

A significant number of build failures are due to either 'COMPILE_FAILURE' or 'INVALID_TEST_RESULTS'. Retrying these produces different results ~50% of the time. 
compute_result_fidelity_dry_run.py
7.6 KB View Download
#5: I think so, particularly in outages. Not automatically having the CQ retry builds when we're e.g. dying on mac layout tests persistently (as we were last week) would be great, even if it's limited to dry runs.

#8: I'm curious whether the false positive rate matters for dry runs. My suspicion would be that it doesn't, as I think the use cases are different -- either developers are trying to smoke test their change or they're trying to basically warm up the CQ pending CR+1. The former case doesn't necessarily need to retry everything, as the developer may get sufficient signal from what behaves well; the latter case requires manual developer intervention to land anyway.
#9: A significant number of failures are due to 'COMPILE_FAILURE' [not due to CL, as it passes on retry] or 'INVALID_TEST_RESULTS'. Both prevent the developer from getting much signal. 


Sign in to add a comment