PASS PASS PASS PASS appears flaky
Issue description

For systems that handle the json-test-results format, such as LUCI and the flakiness dashboard, an "actual" field containing more than one result is considered flaky, since most systems only re-run a test when it is flaky. Telemetry, however, intentionally runs some tests 3-4 times to reduce noise in collected metrics. When a Telemetry test passes 4 times, it is not flaky, so we need to display it as a pass, not a flake (the same goes for skipping 4 times or failing 4 times).

The temporary solution is to update json_3_output_formatter to collapse the actual results to a single PASS, SKIP, or FAIL iff the results of all test runs are the same; if a test has a mix of results, the actual results will contain the exact number of PASSes, SKIPs, and FAILs to help with calculating flakiness for that run (see the sketch below). This temporary solution will allow Milo and the flakiness dashboard to identify rerun tests which passed every run as a pass instead of a flake.

The reason it's only a temporary solution is that we still need to address calculating flakiness percentages over multiple runs of a story over time. If one run of the story results in PASS PASS PASS and another run results in PASS FAIL PASS, the flakiness over the past two runs should be 1/6 (one failure in six executions), but with the current solution it will end up being 1/3, since the system will only see a single pass for the first run.

One possible long-term solution could be to extend the json-test-results format with an optional param that counts the results by type, and continue to use the 'actual' field to show only PASS for PASS PASS PASS. The downside is that this would only be used by Telemetry afaik, so I'm hesitant to update a format that is used so widely.

Another possible long-term solution could be to update the understanding of flakiness to account for teams like Telemetry, which rerun tests not because they are flaky but because they are trying to reduce noise on a collected metric.

Lastly, if we're able to reduce the noise for Telemetry benchmarks, we could remove the feature of repeating benchmarks to lower noise and just repeat a test when it fails. This would put our system in line with the other systems tracking flakiness.

If anyone has suggestions for a more robust solution or comments on the ideas above, please comment on this bug. Thanks!
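To make the collapsing behavior concrete, here is a minimal Python sketch of what the workaround does. The function name and structure are illustrative only, not the actual json_3_output_formatter code:

def _collapse_actual(actual):
  # 'actual' is the space-separated result string from the
  # json-test-results format, e.g. 'PASS PASS PASS PASS'.
  results = actual.split()
  if len(set(results)) == 1:
    # All runs agree, so report a single result instead of a "flake".
    return results[0]
  # Mixed results: keep every run so flakiness can still be computed.
  return actual

assert _collapse_actual('PASS PASS PASS PASS') == 'PASS'
assert _collapse_actual('SKIP SKIP SKIP') == 'SKIP'
assert _collapse_actual('PASS FAIL PASS') == 'PASS FAIL PASS'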
Comment 1 by nedngu...@google.com, Aug 11 2017
For Findit to detect and analyze flaky Telemetry benchmark tests, it needs to know the result of each run of the test. For gtests, the output has a record for each run of the test in a parsable format. IIUC, the current Telemetry test runner already produces such info in some format (mind a link to an example?). Could we keep it in some way?
Aug 12 2017
Why isn't the temporary solution == the permanent solution, i.e., make the flakiness tooling understand that PASS PASS PASS PASS isn't flaky? That just seems like a bug. Or, perhaps, it's an invalid assumption that any test run more than once was run because something was flaky. Most of our test harnesses currently support running a test more than once for any reason, and the json-test-results format handles that just fine. It seems wrong to make changes in that code for anything but a rather urgent need.

Most of the non-gtest-based tests use the "json test results format" ashleymarie@ mentioned above: https://www.chromium.org/developers/the-json-test-results-format . I think even the gtest-based tests get converted to a variant of that format as they are uploaded, though I'm not positive.
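For reference, a trimmed, illustrative example of what an entry in that format looks like for a Telemetry test retried four times (the benchmark/story names and timings here are made up; see the linked spec for the full set of fields):

{
  "version": 3,
  "path_delimiter": "/",
  "num_failures_by_type": {"PASS": 4, "FAIL": 0},
  "tests": {
    "some_benchmark": {
      "some_story": {
        "expected": "PASS",
        "actual": "PASS PASS PASS PASS",
        "times": [12.3, 11.9, 12.1, 12.4]
      }
    }
  }
}

If I read the spec correctly, the optional "times" field already carries one entry per run, so some per-run information survives even when "actual" is collapsed to a single result.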
Aug 12 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/c5b4b8c6ac8c00c744e65916d13e041e35c39611

commit c5b4b8c6ac8c00c744e65916d13e041e35c39611
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Sat Aug 12 02:05:34 2017

Roll src/third_party/catapult/ b55e40d94..5db51352a (6 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/b55e40d94068..5db51352aa5c

$ git log b55e40d94..5db51352a --date=short --no-merges --format='%ad %ae %s'
2017-08-11 kraynov Android systrace: atrace_helper /proc/meminfo global stats.
2017-08-11 ashleymarie Temporary fix for multiple passes to mean pass not flaky
2017-08-11 benjhayden Split TelemetryInfo into separate diagnostics.
2017-08-11 benjhayden Remove HistogramSet support from compare_samples.
2017-08-11 htwiggsmith Integrate tag filter into test picker
2017-08-11 agrieve Reland of ApkHelper: Add GetViewActivityName(). Make GetActivityName() more strict

Created with: roll-dep src/third_party/catapult

BUG=754825, 749239

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: If61ff1ad45ae5736538fdc58e6890741ced4177d
Reviewed-on: https://chromium-review.googlesource.com/612677
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#493937}

[modify] https://crrev.com/c5b4b8c6ac8c00c744e65916d13e041e35c39611/DEPS
Aug 14 2017
I think it's better to fix the tooling to address this problem. Findit inspects the result of each run of a gtest and calculates the pass rate based on that, so it will not have trouble with Telemetry benchmark tests later when they are supported. IIUC from ashleymarie@'s description of the "temporary solution", Milo and the flakiness dashboard currently assume that a rerun test is a flake. So, passing the bug over to sergiyb@ and cc'ing hinoka@.
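For example, per-run pass-rate math along these lines (purely illustrative, not Findit's actual code) yields the 1/6 flakiness from the issue description:

def pass_rate(runs):
  # 'runs' is a list of space-separated result strings, one entry per
  # benchmark run, e.g. ['PASS PASS PASS', 'PASS FAIL PASS'].
  results = [r for run in runs for r in run.split()]
  return sum(r == 'PASS' for r in results) / float(len(results))

# Two runs with 3 retries each and 1 failure: pass rate 5/6,
# i.e. flakiness 1/6.
print(pass_rate(['PASS PASS PASS', 'PASS FAIL PASS']))  # 0.8333...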
Aug 14 2017
Sean is taking over my responsibilities as I am transitioning to a new team.
Aug 14 2017
Hi Sean, would you have time to make the changes needed to allow the flakiness dashboard to understand repeated runs? I went ahead and made a temporary change from PASS PASS PASS to PASS on the Telemetry results side but I don't want this temporary change to stick around for an extended period of time. Thanks!
Sep 26 2017
perezju@ pointed out the same case for SKIP tests in https://bugs.chromium.org/p/chromium/issues/detail?id=717394 . I've got the same workaround out for review now.
Sep 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ac28286446d3a0d6eafcc6885f44594c1e2de14d

commit ac28286446d3a0d6eafcc6885f44594c1e2de14d
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Tue Sep 26 23:05:17 2017

Roll src/third_party/catapult/ cf05c91b6..a17499a86 (2 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/cf05c91b6757..a17499a8648f

$ git log cf05c91b6..a17499a86 --date=short --no-merges --format='%ad %ae %s'
2017-09-26 ashleymarie Temporary fix for multiple skips to mean skip not flaky
2017-09-26 dtu [pinpoint] Move _previous_builds from the Execution class to the Quest instance.

Created with: roll-dep src/third_party/catapult

BUG=754825

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I165d323c6bcea2c29b5cbcf4773174231f1c62ee
Reviewed-on: https://chromium-review.googlesource.com/685642
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504507}

[modify] https://crrev.com/ac28286446d3a0d6eafcc6885f44594c1e2de14d/DEPS
Sep 27 2017
Thanks Ashley for the quick fix/workaround!