Compute CQ result fidelity
Issue description

Design doc: https://docs.google.com/document/d/1yLZhxvwQw60Tiox2LZkIdQ4ZhBF7DuzoJbCFci-dG0U/edit#

Definition: 1 - (frequency with which CQ-triggered build retries cause a change in the results)

Computation: Iterate through all CQ runs. For each run:
* If there is a build that is repeated and returns a different result, increment X.
* Otherwise, increment Y.
Return 1 - X / (X + Y).

Note: We intentionally only count build repeats that are triggered by the CQ. We ignore build repeats that are triggered by a user or across different patch sets, since those don't reflect the utility of CQ-layer build retries.
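The computation above can be sketched as follows. This is a minimal illustration, not the attached script: the `cq_runs` input shape (each run as a list of `(first_result, retry_result)` pairs for its CQ-repeated builds) is an assumption made for the example.

```python
def cq_result_fidelity(cq_runs):
    """Return 1 - (fraction of CQ runs in which a CQ-triggered build
    retry changed the build's result).

    `cq_runs` is assumed to be a list of runs, where each run is a list
    of (first_result, retry_result) pairs, one per CQ-repeated build.
    """
    changed = 0    # X: runs where a repeated build returned a different result
    unchanged = 0  # Y: all other runs
    for run in cq_runs:
        if any(first != retry for first, retry in run):
            changed += 1
        else:
            unchanged += 1
    return 1 - changed / (changed + unchanged)

# Toy example: 3 runs, one of which had a retry flip FAILURE -> SUCCESS.
runs = [
    [("SUCCESS", "SUCCESS")],   # retry reproduced the result
    [("FAILURE", "SUCCESS")],   # retry changed the result
    [],                         # no CQ-repeated builds at all
]
print(cq_result_fidelity(runs))  # 1 - 1/3, i.e. about 0.667
```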
Dec 14
Jan 3
CQ stats from 12-26 through 01-02, with a focus on builds that fail with the result TEST_FAILURE and are then retried.

"""
total # of CQ runs: 1371
# of CQ runs with at least one failed build that was retried: 285
# of CQ runs with at least one failed build that succeeded on retry: 128
CQ result fidelity: 0.906637490883

total # of builds that were retried on TEST_FAILURE: 128 (percentage that succeeded on retry: 0.3203125)
total # of builds that were retried on TEST_FAILURE on android: 26 (percentage that succeeded on retry: 0.576923076923)
total # of builds that were retried on TEST_FAILURE no retry with patch [GPU]: 44 (percentage that succeeded on retry: 0.340909090909)
total # of builds that were retried on TEST_FAILURE other: 58 (percentage that succeeded on retry: 0.189655172414)
"""
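As a sanity check on the headline number: the fidelity definition counts a run as "changed" when a retried build produces a different result, which here is the 128 runs whose failed build succeeded on retry, out of 1371 total runs.

```python
# Cross-check of the reported fidelity figure from the counts above.
total_runs = 1371
runs_changed_on_retry = 128  # failed build succeeded when the CQ retried it

fidelity = 1 - runs_changed_on_retry / total_runs
print(fidelity)  # approximately 0.906637490883, matching the report
```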
Jan 3
To be clear, that's 32% that succeeded on retry, not 0.32%, right?
Jan 3
Correct.
Jan 10
The following is a list of all non-Android, non-GPU TEST_FAILURE build failures between 12-31 and 1-07 which, when retried, succeeded. They are grouped by failing step name.
('TEST_FAILURE', u'sync_integration_tests (retry with patch summary)') [8925231859663163136, 8925237682720079936]
('TEST_FAILURE', u'sync_integration_tests on (none) GPU on Mac (retry with patch summary)') [8925041592376073616]
('TEST_FAILURE', u'webkit_layout_tests (retry with patch summary)') [8925229363909928608, 8925116958433017824, 8925310677564933568, 8925214693842784256, 8925218966434061088, 8925231151582127296]
('TEST_FAILURE', u'viz_browser_tests (retry with patch summary)') [8925218503517896368]
('TEST_FAILURE', u'webkit_layout_tests on Intel GPU on Mac (retry with patch summary)') [8925343258424393488]
('TEST_FAILURE', u'unit_tests (retry with patch summary)') [8925350930160546400]
('TEST_FAILURE', u'non_single_process_mash_unit_tests (retry with patch summary)') [8925393134052349744]
('TEST_FAILURE', u'network_service_interactive_ui_tests (retry with patch summary)') [8925254178541165104, 8925337285572494240, 8925234373723679232]
('TEST_FAILURE', u'webkit_unit_tests (retry with patch summary)') [8925326985499642944]
('TEST_FAILURE', u'cc_unittests (retry with patch summary)') [8925236558533938304]
('TEST_FAILURE', u'chromedriver_py_tests on (none) GPU on Mac (retry with patch summary)') [8925397717775604848, 8925405287458120512, 8925375178876626048, 8925365553264382784, 8925394361493399712, 8925341195882336800, 8925386439602971808, 8925396545610142016, 8925349837852569280, 8925383897650908912, 8925358978461035600, 8925401856657046864, 8925398167804666304, 8925400190583773504, 8925374664442215600, 8925385835115544368, 8925391474387190736, 8925385580578187232, 8925399000308928960, 8925361666724296464]
('TEST_FAILURE', u'interactive_ui_tests (retry with patch summary)') [8925347138275491824]
On deeper investigation, the test suites that cause the most false rejects (sync_integration_tests, webkit_layout_tests, chromedriver_py_tests) all exhibit the same symptoms:
1) Test state is carried between tests.
2) Test X flakes rarely [< 1%] when run as part of a batch of tests.
3) Test X flakes frequently [often >50%] when run by itself.
Example of such a test [from sync_integration_tests, above]: https://bugs.chromium.org/p/chromium/issues/detail?id=919945
The combination of (2) and (3) means that when test X does flake, it is exceedingly likely to fail during 'retry_with_patch'. But when the whole build is retried by the CQ, the test will likely pass.
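The gap between the two retry strategies can be made concrete with some back-of-the-envelope arithmetic. The rates below are illustrative placeholders consistent with the ranges quoted above ([< 1%] batched, [often >50%] isolated), not measured values.

```python
# Illustrative flake rates only (assumed, not measured):
p_batch = 0.01   # test X flakes ~1% of the time when run in its batch
p_alone = 0.50   # test X flakes ~50% of the time when run by itself

# Given that test X flaked in the initial batched run:
# 'retry with patch' reruns the failed test in isolation, so the retry
# fails with probability p_alone.
p_retry_with_patch_fails = p_alone

# A CQ-level retry of the whole build reruns the original batch, so it
# fails with probability p_batch.
p_full_build_retry_fails = p_batch

# Under these assumptions, 'retry with patch' is ~50x more likely than a
# full build retry to reproduce the failure.
print(p_retry_with_patch_fails / p_full_build_retry_fails)
```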
Unfortunately, there is no way to fix (1). While the behavior may be unintentional for some test suites [e.g. sync_integration_tests (?)], we intentionally run tests in batches for webkit_layout_tests to speed up cycle time.
I think the only retry mechanism that is robust against this type of flaky test will be to retry the entire shard, so as to get the same test ordering, the same binary, and the same state carried between tests. This also means that we can't roll the tree and recompile, since test ordering is unstable, and a small change when the tree is rolled might cause a shard to run entirely different tests.
Comment 1 by erikc...@chromium.org
Nov 9

The script attached computes the effectiveness of CQ build-layer retries, segmented by failure_reason. Note that although we segment by failure_reason, when computing whether a retry produces the same result, we only look at whether the retry passes or fails [infra failures are not distinguished from normal failures].

==========Observations===========
* CQ result fidelity has significantly improved over the last 2 months for 'TEST_FAILURE' [47% -> 86%]
* CQ result fidelity has significantly improved over the last 2 months for 'INVALID_TEST_RESULTS' [26% -> 80%]
* Frequency of INVALID_TEST_RESULTS has dropped by almost 10X over the last 2 months.

==========Next Steps===========
The attached script also nicely spits out all builds that [when automatically retried by the CQ] go from TEST_FAILURE -> SUCCESS or INVALID_TEST_RESULTS -> SUCCESS. I intend to go through these and continue to find and fix the underlying bugs.

Note: In the last two months, I've yet to find an example where CQ result infidelity is caused by test flakiness. So far, root causes have either been test runner bugs or infra bugs.

==========Details===========
Period from 09/01-09/08 [prior to any of my changes]:
builds_retried_same_result [('NO_FAILURE_REASON', 120), (u'COMPILE_FAILURE', 146), (u'TEST_FAILURE', 428), (u'PATCH_FAILURE', 4), (u'INVALID_TEST_RESULTS', 142)]
builds_retried_different_result [('NO_FAILURE_REASON', 74), (u'COMPILE_FAILURE', 25), (u'INVALID_TEST_RESULTS', 401), (u'TEST_FAILURE', 478)]

When a TEST_FAILURE is retried, it returns the same result 47% of the time.
When an INVALID_TEST_RESULTS is retried, it returns the same result 26% of the time.
When a COMPILE_FAILURE is retried, it returns the same result 85% of the time.
When a PATCH_FAILURE is retried, it returns the same result 100% of the time.
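The per-reason percentages for the 09/01-09/08 period can be recomputed directly from the raw same/different counts; a failure reason absent from the "different result" list simply contributes zero. A quick sketch:

```python
# Raw 09/01-09/08 counts, transcribed from the report above.
same = {'NO_FAILURE_REASON': 120, 'COMPILE_FAILURE': 146,
        'TEST_FAILURE': 428, 'PATCH_FAILURE': 4,
        'INVALID_TEST_RESULTS': 142}
different = {'NO_FAILURE_REASON': 74, 'COMPILE_FAILURE': 25,
             'INVALID_TEST_RESULTS': 401, 'TEST_FAILURE': 478}

for reason, n_same in same.items():
    n_diff = different.get(reason, 0)  # e.g. PATCH_FAILURE never differed
    pct = 100.0 * n_same / (n_same + n_diff)
    print('%s: %d%%' % (reason, round(pct)))
# NO_FAILURE_REASON: 62%
# COMPILE_FAILURE: 85%
# TEST_FAILURE: 47%
# PATCH_FAILURE: 100%
# INVALID_TEST_RESULTS: 26%
```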
Period from 10/01-11/08 [after landing 'retry with patch']:
builds_retried_same_result [(u'COMPILE_FAILURE', 153), ('NO_FAILURE_REASON', 120), (u'TEST_FAILURE', 384), (u'PATCH_FAILURE', 24), (u'INVALID_TEST_RESULTS', 71)]
builds_retried_different_result [('NO_FAILURE_REASON', 140), (u'COMPILE_FAILURE', 129), (u'TEST_FAILURE', 291), (u'INVALID_TEST_RESULTS', 32)]

When a TEST_FAILURE is retried, it returns the same result 57% of the time.
When an INVALID_TEST_RESULTS is retried, it returns the same result 68% of the time.
When a COMPILE_FAILURE is retried, it returns the same result 54% of the time.
When a PATCH_FAILURE is retried, it returns the same result 100% of the time.

Period from 11/01-11/08 [after some result fidelity improvements]:
builds_retried_same_result [('NO_FAILURE_REASON', 244), (u'PATCH_FAILURE', 17), (u'COMPILE_FAILURE', 161), (u'TEST_FAILURE', 382), (u'INVALID_TEST_RESULTS', 53)]
builds_retried_different_result [(u'COMPILE_FAILURE', 212), ('NO_FAILURE_REASON', 344), (u'TEST_FAILURE', 60), (u'INVALID_TEST_RESULTS', 13)]

When a TEST_FAILURE is retried, it returns the same result 86% of the time.
When an INVALID_TEST_RESULTS is retried, it returns the same result 80% of the time.
When a COMPILE_FAILURE is retried, it returns the same result 43% of the time.
When a PATCH_FAILURE is retried, it returns the same result 100% of the time.