
Issue 674084

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Add "False Rejections" column and sort by it

Project Member Reported by serg...@chromium.org, Dec 14 2016

Issue description

Currently we only report flaky failures per test and will be reporting them per directory once  issue 674082  is implemented. This is a good but incomplete metric for the actual developer impact. I've seen a test that was very flaky: it frequently failed 3 times before passing on the 4th try, but it never failed all 4 tries and thus never caused a false rejection. Such tests should still be fixed, but we should prioritize our work on decreasing perceivable flakiness and reducing disruption of the developer workflow.

We should instead add a new column that accounts for the actual flaky failures seen by users and sort by this column's value in descending order (possibly breaking ties by flakiness). Since only complete failures (4 failed tries) are visible to users, we would not be able to tell whether they are caused by flakiness or by genuine reliable failures; therefore we would need to compare results from different builders on the same CL, similarly to how it is done by chromium-try-flakes.
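Roughly, the comparison could look like the sketch below; the record layout and result values are just assumptions for illustration, not the actual schema used by chromium-try-flakes or the prototype query.

    # Group try builds by (issue, patchset, builder); a builder that both
    # failed and passed on the same patchset is treated as having flaked.
    from collections import defaultdict

    def flaky_builder_runs(builds):
        results = defaultdict(set)
        for b in builds:  # b: {'issue': ..., 'patchset': ..., 'builder': ..., 'result': ...}
            results[(b['issue'], b['patchset'], b['builder'])].add(b['result'])
        return [key for key, seen in results.items()
                if 'FAILURE' in seen and 'SUCCESS' in seen]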

To verify that this data actually makes sense, we should try to correlate it with flakiness. I'd expect high correlation in general, but low correlation would not necessarily mean that the data is useless; it may instead mean that we have many tests that flake frequently but hardly ever fail 4 tries in the same build and thus do not cause false rejections (similar to the example in the first paragraph above).
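The sanity check itself could be as simple as computing Pearson's correlation between the two per-test metrics; a minimal sketch, assuming we already have per-test flakiness and false-rejection counts:

    # Pearson correlation between per-test flakiness and false rejections.
    def pearson(flakiness, false_rejections):
        n = len(flakiness)
        mx = sum(flakiness) / n
        my = sum(false_rejections) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(flakiness, false_rejections))
        sx = sum((x - mx) ** 2 for x in flakiness) ** 0.5
        sy = sum((y - my) ** 2 for y in false_rejections) ** 0.5
        return cov / (sx * sy)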
 
Another alternative is to add a "waterfall failure rate" column. However, we would need to find a way to distinguish waterfall results from tryserver results (perhaps by cross-joining with the chrome_infra.completed_builds table and checking the category field), and we would also need to check whether the amount of data is sufficient.

This metric is also somewhat biased since it does not filter out reliable failures, but we should expect far fewer of them on the main waterfall, and IMHO it is much easier to implement.
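If we go this route, the filter could look roughly like this; the category value and the record layout are assumptions about chrome_infra.completed_builds, not the actual schema:

    # Failure rate of a test computed only over waterfall builds.
    def waterfall_failure_rate(builds, test_name):
        runs = [b for b in builds
                if b.get('category') == 'waterfall' and test_name in b.get('tests', {})]
        if not runs:
            return None
        failures = sum(1 for b in runs if b['tests'][test_name] != 'PASS')
        return failures / len(runs)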
As part of  issue 673378 , I've observed some tests that were reported as very flaky but had no effect on end users, since they were only flaky when run together with some other test (due to shared global state) and always passed on a second try when the other test was not run. This underlines the need to implement a metric that shows the actual impact on productivity, reflected in an estimated number of false rejections.
Owner: serg...@chromium.org
Status: Started (was: Available)
Prototype query: https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_06._a4c49b_3cab_4f66_beb2_3742cdb5273f.
Components: Infra>CQ Infra>Flakiness
As part of working on this bug, I've discovered that there are CQ attempts that result in a commit but are still marked as failed by CQ, e.g.

https://chromium-cq-status.appspot.com/recent#issue=2631713002,patchset=1,codereview_hostname=codereview.chromium.org

Results from event_mon are attached. Note that the 3rd attempt has both committed=true and failed=true. This should not happen and must be fixed in CQ.
Attachment: attempts.csv (1.2 KB)
The above is preventing us from finding patchsets that have at least one passing attempt. We could of course use a hack and consider attempts resulting in a commit successful (disregarding the value of the failed property), but I'd rather try to fix the bug in CQ.

I've also learned that chromium-try-flakes reports any builds that both passed and failed on CQ, not necessarily those that resulted in a false rejection (a failed attempt followed by a successful attempt). We are going to use a stricter condition and count the actual number of false rejections affecting developers (i.e. those requiring them to re-check the CQ bit). Later we may relax this a bit and count the number of builds that passed/failed.
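For a single patchset, the stricter condition could be sketched like this; the attempt fields are based on the event_mon data above, but the exact counting is only an illustration (it also bakes in the commit-implies-success workaround mentioned above, which we'd rather avoid by fixing CQ):

    # Count false rejections: failed CQ attempts that are later followed by a
    # successful (or committing) attempt on the same patchset.
    def count_false_rejections(attempts):
        # attempts: chronologically ordered dicts with 'failed' and 'committed'.
        rejections = 0
        pending_failures = 0
        for a in attempts:
            succeeded = a['committed'] or not a['failed']  # workaround for the CQ bug above
            if succeeded:
                rejections += pending_failures
                pending_failures = 0
            else:
                pending_failures += 1
        return rejections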
Query that we can use to check that committing attempts are correctly marked as non-failures:

https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_f2._6e5637_c2e4_4a7f_8275_2e3bd67cffa1
Project Member

Comment 9 by bugdroid1@chromium.org, Jan 24 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/a3b26d04538300aa297907a69bdf795d60cb12c2

commit a3b26d04538300aa297907a69bdf795d60cb12c2
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Tue Jan 24 10:30:07 2017

Project Member

Comment 10 by bugdroid1@chromium.org, Jan 25 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/51daee307e01dc3053a099b12ffa58780cf63e16

commit 51daee307e01dc3053a099b12ffa58780cf63e16
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Wed Jan 25 15:44:54 2017

The above two CLs fix bugs in CQ. The prototype query uses data from 1 week ago, so I have updated it with additional checks to make sure it works with the corrupted data too. Unfortunately, this means that I had to exclude all dry runs from consideration. I'll proceed with integrating this query into the current pipeline and exposing false rejection information to users, but will remove these additional checks one week later.
Project Member

Comment 12 by bugdroid1@chromium.org, Jan 27 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/f402dda5482424c1b757e67d637f057f682a0422

commit f402dda5482424c1b757e67d637f057f682a0422
Author: Sergiy Byelozyorov <sergiyb@google.com>
Date: Fri Jan 27 09:56:46 2017

Discovered an interesting corner case:

          "abs-pos-non-replaced-vrl-128.xht": {
            "expected": "FAIL",
            "actual": "IMAGE",
            "bugs": [
              "crbug.com/505151"
            ],
            "reftest_type": [
              "=="
            ],
            "time": 0.4
          },

This test was detected as causing a false rejection even though it didn't actually cause one. The reason is that the layout test launcher seems to ignore tests that are expected to fail, so they do not make the build fail. More details on the logic used by the layout test launcher can be found here: https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/models/test_run_results.py?l=212
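When attributing build failures, our query needs a similar check so that expected failures are not blamed; a simplified sketch of that idea (my own approximation, not the real logic in test_run_results.py):

    # Treat a result as an unexpected failure only if it is a failing result
    # that is not covered by the test's expectations.
    FAILURE_RESULTS = {'FAIL', 'TEXT', 'IMAGE', 'IMAGE+TEXT', 'CRASH', 'TIMEOUT'}

    def is_unexpected_failure(result):
        expected = set(result['expected'].split())
        actual = result['actual'].split()[-1]  # outcome of the last retry
        if actual in expected:
            return False  # e.g. expected FAIL, actual FAIL
        if 'FAIL' in expected and actual in {'TEXT', 'IMAGE', 'IMAGE+TEXT'}:
            return False  # generic FAIL expectation covers these failure types
        return actual in FAILURE_RESULTS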
After some consideration, I've decided that we want to consider any builder that has failed and passed on the same patchset as a flake, even if it didn't lead to a CQ false rejection. With that in mind, I've started a new prototype query: https://plx.corp.google.com/script/#a=qo%7Ci=google%253A%253Ascript_49._25fcef_662e_4ce1_8c0e_bbb4a612390b. The latter query also includes the logic from the layout tests referenced in comment #14, but I still need to make sure that it works as expected.
I manually verified several flaky failures reported by the script, and most of them turned out to be legitimate, but I discovered one more corner case that we should handle. In the script, we first search for builds that have passed and failed on the same patchset and then look for failing tests in the failed build. What can happen, however, is that a build has flaked due to test A, while test B was reliably broken on the waterfall. Our script will report both A and B as flaky even though only A is flaky. We need to compare the outcomes reported in the "(without patch)" and "(with patch)" steps, and if a test fails in both, we should ignore it.
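In other words, only failures that can be attributed to the patch should be considered; a minimal sketch of that comparison (step results are modelled as plain test-to-result maps here, which is a simplification of the real step output):

    # A test is only blamed for the failed build if it fails "(with patch)"
    # but passes "(without patch)".
    def attributable_failures(with_patch, without_patch):
        # Both arguments map test name -> 'PASS' or 'FAIL'.
        return [test for test, result in with_patch.items()
                if result == 'FAIL' and without_patch.get(test) == 'PASS']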
Found two more corner cases:

1) In a failed build:
- with-patch step: test A fails, test B passes
- without-patch step: test A passes, test B fails

Test A is the reason why the build failed, so it should be reported as flaky if there is a passing build on the same patchset. Test B is also flaky, but it is not the reason why the build failed, and therefore reporting it as a false rejection may confuse users.

Resolution: Tests like B are flakes, but they don't cause a false rejection, so we do not report them.

2) A reliably broken test on the waterfall fails in the with-patch step, but the build is interrupted and thus the without-patch step is not run. The build retry actually runs both the with-patch and without-patch steps and the build succeeds. This is a false positive.

Resolution: Do not report them.
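Combining these resolutions with the with/without-patch comparison above, the reporting condition could look roughly like the sketch below (field names and result values are hypothetical, not the actual query output):

    # Report a test only if the failed build ran both steps, the test failed
    # with the patch, passed without it, and the same patchset also has a
    # passing build.
    def should_report(test, failed_build, patchset_has_passing_build):
        with_patch = failed_build.get('with_patch', {})    # test -> 'PASS'/'FAIL'
        without_patch = failed_build.get('without_patch')  # None if the step never ran
        if without_patch is None:
            return False  # corner case 2: interrupted build, nothing to compare against
        if with_patch.get(test) != 'FAIL':
            return False  # corner case 1: test did not actually cause the rejection
        if without_patch.get(test) != 'PASS':
            return False  # reliably broken without the patch, not a flake on this CL
        return patchset_has_passing_build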
After some more testing, I've concluded that the script works as intended and can now be integrated into the plx pipeline to produce these numbers on the Flakiness Surface.
The CL with changes to the app is ready: https://chromium-review.googlesource.com/c/445296/, but we need to add support for the cq_false_rejection field in the chrome_infra.flaky_tests_with_layout_team_dir_info table. That will require adding the prototype query as another node in the hourly workflow and JOINing its results into the flaky_tests_with_layout_team_dir_info table.
CL adding cq_false_rejections to the table: http://cl/148069739 (internal only).
Status: Fixed (was: Started)
This is now finished and deployed. Also created follow-up issue 695061.
