
Issue 838735

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: ----




Do not retry a failed build in CQ if the failed tests are known flaky tests

Reported by st...@chromium.org (Project Member), May 1 2018

Issue description

This aligns with the idea of doing a 10X or 30X rerun of failed tests in CQ to know whether they are flaky and, if so, ignore them.

One good example is this CL: https://chromium-review.googlesource.com/c/chromium/src/+/1029556/2
The build that failed due to flaky tests: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_tsan_rel_ng/3667
The retry of that build took over an hour, which is too much overhead: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_tsan_rel_ng/3680

After the new flake detection is in place in Findit, we should have enough data to tell whether a test is VERY flaky at a given commit position.
If a failed test is VERY flaky, we potentially have two options:
1. (preferred option) Make the chromium_trybot recipe ignore such tests (by querying Findit: "hey, is this test XYZ flaky at revision ABC?") and not fail the build.
2. Let the CQ daemon ask Findit the same question, not retry the failed build, and let the CL go through CQ.

The idea is to not retry builds or rerun tests if they are well-known flaky tests, and just let the CL go through CQ.
This could save us capacity.
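To make option 1 concrete, here is a minimal sketch of what a recipe-side check could look like; the endpoint, payload, and helper names are hypothetical assumptions, not an existing chromium_trybot or Findit API:

# Hypothetical sketch only: the endpoint, payload, and helpers below are
# assumptions for illustration, not an existing chromium_trybot/Findit API.
import json
import urllib.request

FINDIT_FLAKE_ENDPOINT = 'https://findit.example/is_test_flaky'  # hypothetical

def is_known_flaky(test_name, step_name, revision):
    # Ask Findit: "is this test flaky at this revision?"
    payload = json.dumps({
        'test_name': test_name,
        'step_name': step_name,
        'revision': revision,
    }).encode('utf-8')
    req = urllib.request.Request(
        FINDIT_FLAKE_ENDPOINT, data=payload,
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get('is_flaky', False)

def non_flaky_failures(failed_tests, step_name, revision):
    # Keep only failures Findit does NOT classify as known flakes; the build
    # would only be marked failed if this list is non-empty.
    return [t for t in failed_tests
            if not is_known_flaky(t, step_name, revision)]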
 

Comment 1 by st...@chromium.org, May 1 2018

Cc: dpranke@chromium.org
+dpranke@ in case there is some red flag on this thought.

Comment 2 by dpranke@chromium.org

I'm not sure what you're suggesting; are you saying that we should not retry a patch if it failed due to a flake?

If so, wouldn't that just cause us to falsely reject a lot more patches?

Also, in general, I'm very leery of trying to make decisions based on near-real-time data as part of a CQ attempt (I have talked to liaoyuke@ about this some, in the context of getting rid of the "without patch" logic altogether).

In this case, for example, if we know a test is flaky, we should just disable it so that someone else doesn't see the failure in the first place. Continuing to run the test, particularly if we don't somehow distinguish it as known to be flaky, just confuses the user.


Comment 5 by st...@chromium.org, May 2 2018

Cc: jparent@chromium.org jbudorick@chromium.org estaab@chromium.org
Sorry for the confusion. I've updated the summary.

A broader idea here is to detect whether failing tests are known flaky tests:
1) preferably by using existing near-real-time data
2) if needed, by doing a 10X or 30X rerun to detect them in real time
Once detected, don't mark the build as failed, so that the CL can go through.

As pointed out in the slide https://docs.google.com/presentation/d/1_ASQ_1bO9GgwT3agTytKlo8HVw3Ub7gjpwzlUB5MZx8/edit#slide=id.g36b1086068_0_8 , one open question is how to ensure that the flakiness is NOT due to the patch being tested. If a test fails (with patch), one possible option (sketched below) is to:
1) Rerun the test 10X or 30X (without patch) --> whether it is flaky without the patch.
2) Rerun the test 10X or 30X (with patch) --> whether it is flaky with the patch.
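A rough sketch of that classification; the rerun count and the rerun() helper are hypothetical assumptions, not existing recipe or Findit code:

# Rough sketch of the with/without-patch classification above. The rerun count
# and the rerun() helper are hypothetical assumptions.
RERUNS = 30  # 10X or 30X, as discussed in this thread

def classify(test, rerun):
    # rerun(test, with_patch, n) is assumed to return how many of n reruns passed.
    passes_without = rerun(test, with_patch=False, n=RERUNS)
    if passes_without == RERUNS:
        # Never fails without the patch: the failure is likely caused by the CL.
        return 'fail-the-build'
    if passes_without == 0:
        # Consistently failing at tip-of-tree (case 1 below): reruns alone
        # cannot exonerate the CL, so fail the build.
        return 'fail-the-build'
    passes_with = rerun(test, with_patch=True, n=RERUNS)
    if passes_with == 0:
        # Flaky without the patch but consistently failing with it: the CL may
        # have made things worse (case 2 below), so do not ignore the failure.
        return 'fail-the-build'
    # Flaky both with and without the patch: treat as a known flake and let
    # the CL go through.
    return 'ignore-known-flake'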

There are quite a few cases to take care of. For example:
1) If a bad commit causes hundreds of tests to fail (either flakily or consistently), how do we handle that appropriately? (Hopefully this is rare; maybe just skip the reruns and fail the CQ build.)
2) If the test is flaky without the patch, is it possible to tell with high confidence that it has not become consistently failing (with patch), so that we don't have to do the 10X or 30X reruns?


BTW, if we are going to get rid of the (without patch) logic, chromium-try-flakes and the new BQ-based flake detection will need a lot of adaptation. That said, we need further discussion before we remove the (without patch) logic.

Comment 6 by st...@chromium.org, May 2 2018

If we could detect flakes soon enough (the sooner the better, but no more than a couple of hours after the bad commit lands) and we could disable flaky tests automatically (gtest seems to be a blocker here), then we might not need the (without patch) logic.

However, for reliability, I'd want to keep (without patch) until all the pieces are sorted out.

Re #2, one concern I have about just disabling a flaky test is that it's very hard to make sure it happens in a timely manner. Say a test is detected as flaky and a bug or a disabling CL is sent to the sheriff's queue; it may take a long time before the bug or CL is picked up, and the gap in between can cause a lot of CLs to be rejected incorrectly.

Should we do both? If a test is confidently known to be flaky, we skip the retry and add a step to explain that although this test failed, evidence shows it's not caused by this CL. At the same time, we do whatever needs to be done to get this test disabled ASAP.
Owner: liaoyuke@chromium.org
Status: Assigned (was: Available)
Labels: -Pri-3 Pri-2
Re #c7 - good point about the time lag. That might be a strong argument to react immediately (and programmatically).

Cc: kbr@chromium.org
We should make sure that we take the flake rate into account. A CL can make a known flaky test flakier than it used to be.
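One way to make "flakier than it used to be" concrete would be a significance check of the with-patch pass rate against the historical rate; the numbers, threshold, and choice of a binomial test below are illustrative assumptions only:

# Illustrative only: the numbers, threshold, and the use of a binomial test are
# assumptions, not something Findit or CQ does today.
from scipy.stats import binomtest

historical_pass_rate = 0.95  # known flaky test's historical pass rate (assumed)
with_patch_runs = 30
with_patch_passes = 20       # hypothetical result of rerunning with the patch

# Is the with-patch pass rate significantly lower than the historical rate?
result = binomtest(with_patch_passes, n=with_patch_runs,
                   p=historical_pass_rate, alternative='less')
if result.pvalue < 0.01:
    print('The CL likely made the test flakier; do not ignore this failure.')
else:
    print('Consistent with the known flake rate; the failure could be ignored.')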
Many thanks, Ned, for the suggestion!

There was some discussion around this, but I don't think we can move forward until the reproducibility issue of flaky tests is sorted out properly (which seems really difficult due to the nature of flakiness itself). Without reproducibility, it will cause false positives: a flaky test's pass rate can vary a lot when we rerun at different revisions, even though there are no noticeable code changes in between. This can be caused by changes in the test environment -- for example, different machine states.

For example, in the linked flake analysis, the flaky test is very reproducible at r574150 (passed 66% with 89 runs), but totally not reproducible at r574149 (passed 100% of 430 runs). However, r574149 is not the culprit as reported in bug 863087. https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWVy1wELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCKgAWNocm9taXVtLm1hYy9NYWMxMC4xMCBUZXN0cy8zMzczMC9pbnRlcmFjdGl2ZV91aV90ZXN0cyBvbiBNYWMtMTAuMTAvVkdGaVJISmhaMmRwYm1jdlJHVjBZV05vVkc5Q2NtOTNjMlZ5VkdGaVJISmhaME52Ym5SeWIyeHNaWEpVWlhOMExrUmxkR0ZqYUZSdlQzZHVWMmx1Wkc5M0x6QT0MCxITTWFzdGVyRmxha2VBbmFseXNpcxgBDA
There are more examples like this, and that's the main reason why we don't enable auto-revert for existing tests which are made flaky.

That said, telling whether a test is flaky to some level of certainty is hard, and telling how flaky a test is seems close to infeasible.

Non-repro of flaky tests is a problem we need to dig into before we can do more optimization.
Besides, reliably measuring the flake/pass rate of a test will require a lot of VM resources, and we might not be able to provision that.

The repro issue is a problem I'd like us to look into with rr, because it would help a lot with fixing flaky tests.
However, rr might not help on a new run of a test, since it is just "record and replay", IIUC. I could be wrong here, and need to dig further once I have some cycles.
Cc: nednguyen@chromium.org
Cc: dtu@chromium.org
#11: interesting. So basically what I am getting from this is that the flake rate is a function of the test binary & hardware (let's call it FlakyRate(test binary, hardware)).

Have you tried using device affinity in Findit? For perf, we know that performance characteristics definitely involve the hardware, so we always pin to the same hardware in our bisecting process.

+Dave can speak more about how Pinpoint works.

Though I agree that we may want to pursue "taking the flake rate into account" after the MVP. Just pursuing what's proposed in #0 may already make things a lot better than they are now.
Unfortunately, we don't know the outstanding factors that impact the reproducibility of flaky tests yet. That's an area I'd like us to dig into afterwards.

What I tried to say is that it's hard for flake rates to be reliable enough to reject CLs in CQ. My worry is that it could cause more CQ false rejections, besides using a lot of machine resources.
There are a lot of other things we could do to mitigate CQ false rejections and improve postsubmit analysis of flaky tests.

Findit doesn't use device affinity for running tests. It might be worth a try to sort out machine-related flakes.
Machine/hardware state is one example I used to illustrate the non-repro issue.
Re: c13: Pinpoint does have device pinning, but we still see many regressions that are not reproducible on any device in the Pinpoint pool.

The function is something more like FailureRate(test binary, hardware, software environment), where "software environment" is mostly unknown factors. I think the biggest software factor is what tests run before or after -- the tests aren't independent of each other. The next step for perf dashboard/Pinpoint may be to automatically run the full benchmark/suite if it doesn't repro on an individual story.

We're also going to analyze if there are any patterns with no-repro bisects, e.g. if they occur more commonly on any particular bots, OSes, benchmarks, or metrics.
Cc: erikc...@chromium.org
+erikchen, who is interested in this idea.

Owner: chanli@chromium.org
Assigning to Chan, who is on this.

Cc: -nednguyen@chromium.org
Could someone please provide a summary of what this feature entails, specifically? The opening comment is a bit vague.

Correct me if I'm wrong: I think we want to utilize our knowledge of flakes to reduce retries: if the failure is a known flake or the failure is confirmed to be a flake, we should ignore this failure.
Here are some questions that might help flesh out the details:
  * What interface/API/service will be used to determine if a test is "flaky"? Does it already exist? Do we need to build it?
  * How will the time range of the flakiness affect the logic?
e.g. imagine CL #1 lands [introduces a non-deterministic failure]. This causes the test to be marked as "flaky" by the service. CL #1 is reverted. Someone tries to reland CL #1, with the same bug. The test will fail, but the service might still consider it to be "flaky", and so let the CL land again?
  * Which of the retry layers are being skipped on test failure?
  * How will we measure the effectiveness of this feature?
Re c21:

  * What interface/API/service will be used to determine if a test is "flaky"? Does it already exist? Do we need to build it?

The initial idea is to have Findit expose an API whose input would be information about a failure (test info, build info, Gerrit CL info, etc.), and whose output would be whether the test is flaky or not (or more info if needed).
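For illustration, the request/response might look roughly like this; the field names are hypothetical assumptions, and only the builder/build/CL identifiers come from the example in the issue description:

# Hypothetical request/response shape for such a Findit API; the field names
# are assumptions. The builder/build/CL numbers are from the example in #0.
request = {
    'test': {'step_name': 'browser_tests (with patch)',   # hypothetical step
             'test_name': 'SuiteName.TestName'},          # hypothetical test
    'build': {'bucket': 'luci.chromium.try',
              'builder': 'linux_chromium_tsan_rel_ng',
              'build_number': 3667},
    'gerrit_cl': {'host': 'chromium-review.googlesource.com',
                  'change': 1029556,
                  'patchset': 2},
}

response = {
    'is_flaky': True,       # the core answer
    'confidence': 0.97,     # optional extra info, if needed
}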

  * How will the time range of the flakiness affect the logic?
e.g. imagine CL #1 lands [introduces a non-deterministic failure]. This causes the test to be marked as "flaky" by the service. CL #1 is reverted. Someone tries to reland CL #1, with the same bug. The test will fail, but the service might still consider it to be "flaky", and so let the CL land again?

That is a good case. I need to think more to look for a good solution for it.
One quick solution I can think of right now is that we might need to check whether the test is flaky or stable at ToT when we get a request about that test.

  * Which of the retry layers are being skipped on test failure?

It depends on the failures in the failed build. If we can confirm that all failed tests in a build are flaky, we might be able to skip the build-level retry; if all failed tests within one step are flaky, we might skip the step-level retry ((retry with patch) or (without patch)); otherwise, we can just skip retrying the flaky tests within the step retries.
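A sketch of that layer-selection logic; the inputs (including the set of Findit-confirmed flaky tests) and the shape of the returned plan are hypothetical assumptions:

# Sketch of the retry-skipping decision above. The inputs and the shape of the
# returned plan are assumptions for illustration.
def plan_retries(failed_tests_by_step, flaky):
    # failed_tests_by_step: {step_name: set of failed test names}
    # flaky: set of failed tests Findit confirmed as flaky
    all_failed = set().union(*failed_tests_by_step.values())
    if all_failed and all_failed <= flaky:
        # Every failure in the build is a known flake: skip the build-level retry.
        return {'retry_build': False, 'retry_tests': {}}
    plan = {'retry_build': True, 'retry_tests': {}}
    for step, failed in failed_tests_by_step.items():
        non_flaky = failed - flaky
        if not non_flaky:
            # All failures in this step are flaky: skip the step-level retry
            # ((retry with patch) / (without patch)) for this step.
            continue
        # Otherwise, only rerun the non-flaky failures within the step retries.
        plan['retry_tests'][step] = non_flaky
    return plan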

  * How will we measure the effectiveness of this feature?

Measurements I can think of: 1. the change in CQ overhead (overhead could decrease by skipping retries, though it could also increase by waiting for Findit to verify whether a test is flaky); 2. the change in false negatives.
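A back-of-the-envelope sketch of tracking those measurements from per-attempt records; the record fields and the definition of a flake-caused rejection are hypothetical assumptions:

# Hypothetical sketch: summarize CQ attempts before vs. after the feature.
# The fields ('minutes', 'rejected', 'rejected_due_to_flake') are assumptions.
def summarize(attempts):
    n = len(attempts)
    avg_minutes = sum(a['minutes'] for a in attempts) / n
    flake_rejections = sum(1 for a in attempts
                           if a['rejected'] and a['rejected_due_to_flake'])
    return {'avg_attempt_minutes': avg_minutes,
            'flake_rejection_rate': flake_rejections / n}

# Compare summarize(attempts_before) with summarize(attempts_after) to see the
# change in overhead and in flake-caused rejections.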
Another thing: this feature will affect how the Flake Detector detects new flake occurrences, which means we might need to adjust the Flake Detector again to adapt to this.
Thanks. I think this is a good idea overall.

I'm excited to see what the design will look like once it's more fleshed out. :)
