We currently do a pretty good job of detecting flakiness and preventing it from causing false rejects. However, we are not very good at detecting CLs that introduce flakiness and preventing those CLs from landing.
There are three cases I think we should consider.
Example 1: CL introduces a new test. Test is flaky.
Expected behavior: CL fails. [Test flakes with patch, doesn't exist without patch.]
Actual behavior: CL passes. The test flakes with the patch, so we mark it as a flaky success and pass the CL.
Example 2: CL renames a flaky test. Test is flaky.
Expected behavior: CL passes.
Actual behavior: CL passes.
Unfortunately, from the recipe's perspective, example 2 is very hard to distinguish from example 1: in both cases the test flakes with the patch and does not exist (under that name) without it.
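
To make that concrete, here is a minimal sketch (the type and field names are mine, not recipe code) of the only signals the recipe has for a test name that flakes in the with-patch run. Examples 1 and 2 produce identical signals:

from typing import NamedTuple

class FlakeSignal(NamedTuple):
    flakes_with_patch: bool      # did the test flake when run with the CL applied?
    exists_without_patch: bool   # is the test present at tip-of-tree?

# Example 1: CL introduces a new flaky test.
new_flaky_test = FlakeSignal(flakes_with_patch=True, exists_without_patch=False)

# Example 2: CL renames an existing flaky test.
renamed_flaky_test = FlakeSignal(flakes_with_patch=True, exists_without_patch=False)

assert new_flaky_test == renamed_flaky_test  # indistinguishable from this data alone
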
Example 3: CL makes an existing non-flaky test flaky.
Expected behavior: CL fails. [Test flakes with patch, doesn't flake without patch.]
Actual behavior: CL passes. The test flakes with the patch, so we mark it as a flaky success and pass the CL.
These are going to be tricky to get right without regressing false rejects or CQ run time.
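
For illustration only, here is one hypothetical shape the decision could take, assuming we were willing to pay for without-patch retries of any test that flakes with the patch. The function and its parameters are made up for this sketch, not existing recipe APIs:

def should_fail_cl(test_name: str,
                   exists_without_patch: bool,
                   flaked_without_patch: bool) -> bool:
    """Decide whether a test that flaked with the patch should fail the CL.

    Assumes the test already flaked in the with-patch run.
    """
    if not exists_without_patch:
        # Example 1 vs. example 2: new test vs. renamed test. Indistinguishable
        # from this signal alone; failing here regresses example 2 (false
        # reject), passing regresses example 1. Pure policy choice.
        return True
    if not flaked_without_patch:
        # Example 3: the test exists and is stable at tip-of-tree but flakes
        # with the patch applied. The CL likely introduced the flakiness.
        return True
    # The test was already flaky without the patch: pre-existing flake, not
    # this CL's fault. Keep the current "flaky success" behavior.
    return False

The cost is exactly the trade-off above: every without-patch retry adds CQ run time, and however we resolve the ambiguous not-exists-without-patch branch, we trade example 1 false passes against example 2 false rejects.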