We've disabled 'retry with patch' for GPU integration tests, but we still run CQ-level build retries. I conferred with Ken to come up with the best way to turn down CQ-level build retries.
Observations:
* GPU integration tests have device affinity: results can depend on the specific device that runs them [e.g. GPUs wear down over time].
* We don't know the frequency, but GPU integration tests do occasionally flake due to Chrome bugs.
* The test ordering is not stable. Regardless, we want to run tests in the same order as much as possible. We do not want to retry tests with a different ordering.
Proposal:
* We run GPU tests once during 'with patch'.
* If there is a failure, we redispatch the same swarming task [possibly N times].
* We mark the test run as a success as long as there are [at least M] successes.
* We never trigger 'retry with patch' or CQ-level build retries.
Most likely values for N & M are [N=1, M=1] or [N=3, M=2]. The implementation will likely be shared with Issue 917122.
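A minimal sketch of the proposed policy, to make the roles of N and M concrete. This is not the recipe implementation: `gpu_test_verdict` and `run_task` are hypothetical names, and it assumes that M is counted among the N redispatches only (the initial 'with patch' run has already failed by then) and that redispatching can stop early once the verdict is decided.

```python
from typing import Callable


def gpu_test_verdict(run_task: Callable[[], bool], n: int, m: int) -> bool:
    """Return True if the test run should be marked as a success.

    run_task is a hypothetical callable that dispatches the same swarming
    task once and returns True when the task passes.
    """
    # Initial 'with patch' run: a pass needs no redispatch.
    if run_task():
        return True

    successes = 0
    for attempt in range(n):
        # Redispatch the same task (same tests, same order).
        if run_task():
            successes += 1
        if successes >= m:
            return True
        # Stop early if the remaining redispatches cannot reach M successes.
        remaining = n - attempt - 1
        if successes + remaining < m:
            return False
    return False
```

With [N=1, M=1] a single passing redispatch is enough. With [N=3, M=2] two of the three redispatches must pass, so two consecutive failed redispatches already decide the run as a failure.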
Other proposals, discarded:
* In 'with patch', dispatch N tasks. Only let CL pass if all N tasks pass.
  * PRO: Exponentially small probability that newly introduced flakiness lands.
  * CON: Insufficient device capacity.
  * CON: Task/device-affine flakiness will cause large amounts of false rejects.
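The PRO and the last CON of the discarded proposal are two sides of the same arithmetic, sketched below. It assumes each of the N tasks fails independently with the same probability, which is a simplification: the device-affinity observation above suggests failures may be correlated. The function names are illustrative only.

```python
def pass_all_probability(p_fail: float, n: int) -> float:
    """Probability that all n tasks pass, given a per-task failure rate."""
    return (1.0 - p_fail) ** n


def false_reject_probability(p_fail: float, n: int) -> float:
    """Probability that at least one of n tasks fails, rejecting the CL."""
    return 1.0 - pass_all_probability(p_fail, n)
```

Under that assumption, a newly introduced flake with per-task failure rate p lands with probability (1 - p)^N, which shrinks exponentially in N. The same formula means that pre-existing flakiness unrelated to the CL rejects it with probability 1 - (1 - p)^N, which grows with N.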
Comment 1 by erikc...@chromium.org, Jan 3