
Issue 754825 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




PASS PASS PASS PASS appears flaky

Project Member Reported by ashleymarie@chromium.org, Aug 11 2017

Issue description

For systems that consume the json-test-results format, such as LUCI and the flakiness dashboard, an 'actual' field containing more than one result is treated as flaky, since most systems only re-run a test when it is flaky. Telemetry intentionally runs some tests 3-4 times to reduce noise in the collected metrics. When a Telemetry test passes all 4 times, it is not flaky, so we need to display it as a pass, not a flake (and likewise when it skips or fails all 4 times).

The temporary solution is to update json_3_output_formatter so that the 'actual' field collapses to a single PASS, SKIP, or FAIL if and only if the results of all test runs are the same. If a test has a mix of results, 'actual' will contain the exact sequence of PASSes, SKIPs, and FAILs to help with calculating flakiness for that run. This temporary solution allows Milo and the flakiness dashboard to identify a re-run test that passed every run as a pass instead of a flake.
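
A minimal sketch of that collapsing rule (illustrative only; the function name is made up and this is not the actual json_3_output_formatter code):

  def collapse_actual_results(per_run_results):
      # per_run_results: one result string per run, e.g. ['PASS', 'PASS', 'PASS'].
      if len(set(per_run_results)) == 1:
          # Every run agreed, so report a single result (PASS, SKIP, or FAIL).
          return per_run_results[0]
      # Mixed results: keep one entry per run so flakiness can still be computed.
      return ' '.join(per_run_results)

  collapse_actual_results(['PASS'] * 4)              # -> 'PASS'
  collapse_actual_results(['PASS', 'FAIL', 'PASS'])  # -> 'PASS FAIL PASS'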

The reason it's only a temporary solution is that we still need a way to calculate flakiness percentages over multiple runs of a story over time. If one run of the story results in PASS PASS PASS and another results in PASS FAIL PASS, the flakiness over those two runs should be 1/6 (one failure in six executions), but with the current solution it comes out as 1/4, since the system only sees a single PASS for the first run (one failure in the four results it sees).
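
To make the arithmetic concrete, here is a sketch assuming flakiness is computed as failures divided by the total results seen:

  run1 = ['PASS', 'PASS', 'PASS']
  run2 = ['PASS', 'FAIL', 'PASS']

  # With full per-run data: 1 failure out of 6 executions.
  full = run1 + run2
  print(full.count('FAIL') / len(full))            # 0.1666... == 1/6

  # With run1 collapsed to a single PASS: 1 failure out of 4 visible results.
  collapsed = ['PASS'] + run2
  print(collapsed.count('FAIL') / len(collapsed))  # 0.25 == 1/4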

One possible long-term solution is to extend the json-test-results format with an optional field that counts the results by type, while continuing to use the 'actual' field to show only a single PASS for PASS PASS PASS. The downside is that, as far as I know, only Telemetry would use it, so I'm hesitant to change a format that is this widely used.
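
A hypothetical shape for that optional field (NOT part of the current json-test-results spec; the 'result_counts' name is made up to illustrate the idea):

  result_entry = {
      'expected': 'PASS',
      'actual': 'PASS',        # collapsed: all four runs passed
      'result_counts': {       # hypothetical optional field
          'PASS': 4,
          'SKIP': 0,
          'FAIL': 0,
      },
  }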

Another possible long-term solution is to update the definition of flakiness to account for teams like Telemetry, which re-run tests not because they are flaky but to reduce noise in a collected metric.

Lastly, if we can reduce the noise in Telemetry benchmarks by other means, we could remove the repeat-to-lower-noise feature altogether and only repeat a test when it fails. That would bring our system in line with the other systems that track flakiness.

If anyone has suggestions for a more robust solution or comments on the ideas above, please comment on this bug. Thanks!
 
Cc: dpranke@chromium.org st...@chromium.org
+Dirk: the concept of "tests are repeated but they are not flaky" is only applicable to perf tests. Should we make sure that the other infrastructure in ops (flakiness dashboard / Findit / FlakeAnalyzer, etc.) can understand this?

+stgao@, who is TL of Findit

Comment 2 by st...@chromium.org, Aug 11 2017

For Findit to detect and analyze flaky Telemetry benchmark tests, it needs to know the result of each run of the test. For gtests, the output has a record for each run of the test in a parsable format.
IIUC, the current Telemetry test runner already produces this info in some format (mind sharing a link to an example?).
Could we keep it in some form?
Cc: ashleymarie@chromium.org
Owner: st...@chromium.org
Why isn't the temporary solution == the permanent solution == making the flakiness tooling understand that PASS PASS PASS PASS isn't flaky? That just seems like a bug. Or perhaps it's an invalid assumption that any test run more than once was run because something was flaky.

Most of our test harnesses currently support running a test more than once for any reason, and the json-test-results format handles that just fine. It seems wrong to make changes in that code for anything but a rather urgent need.

Most of the non-gtest-based tests use the "json test results format" ashleymarie@ mentioned above: https://www.chromium.org/developers/the-json-test-results-format . I think even the gtest-based tests get converted to a variant of that format as they are uploaded, though I'm not positive.
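
For reference, an abridged entry in that format for a story run three times (field names follow the linked spec; the benchmark/story names and counts are illustrative):

  results = {
      'version': 3,
      'tests': {
          # The spec nests tests in a trie keyed by path components.
          'some_benchmark': {
              'some_story': {
                  'expected': 'PASS',
                  'actual': 'PASS FAIL PASS',  # one result per run, space-separated
                  'is_unexpected': True,
              },
          },
      },
      'num_failures_by_type': {'PASS': 2, 'FAIL': 1},
  }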


Project Member

Comment 4 by bugdroid1@chromium.org, Aug 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c5b4b8c6ac8c00c744e65916d13e041e35c39611

commit c5b4b8c6ac8c00c744e65916d13e041e35c39611
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Sat Aug 12 02:05:34 2017

Roll src/third_party/catapult/ b55e40d94..5db51352a (6 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/b55e40d94068..5db51352aa5c

$ git log b55e40d94..5db51352a --date=short --no-merges --format='%ad %ae %s'
2017-08-11 kraynov Android systrace: atrace_helper /proc/meminfo global stats.
2017-08-11 ashleymarie Temporary fix for multiple passes to mean pass not flaky
2017-08-11 benjhayden Split TelemetryInfo into separate diagnostics.
2017-08-11 benjhayden Remove HistogramSet support from compare_samples.
2017-08-11 htwiggsmith Integrate tag filter into test picker
2017-08-11 agrieve Reland of ApkHelper: Add GetViewActivityName(). Make GetActivityName() more strict

Created with:
  roll-dep src/third_party/catapult
BUG=754825, 749239 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: If61ff1ad45ae5736538fdc58e6890741ced4177d
Reviewed-on: https://chromium-review.googlesource.com/612677
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#493937}
[modify] https://crrev.com/c5b4b8c6ac8c00c744e65916d13e041e35c39611/DEPS

Comment 5 by st...@chromium.org, Aug 14 2017

Cc: hinoka@chromium.org
Owner: serg...@chromium.org
I think it's better to fix the tooling to address this problem. Findit inspects the result of each run of a gtest and calculates the pass rate based on that, so it will not have trouble with Telemetry benchmark tests later when they are supported.

IIUC from ashleymarie@'s description of the "temporary solution", Milo and the flakiness dashboard currently assume that a re-run test is a flake.
So passing the bug over to sergiyb@ and cc'ing hinoka@.
Owner: seanmccullough@chromium.org
Status: Assigned (was: Untriaged)
Sean is taking over my responsibilities as I am transitioning to a new team.
Hi Sean, would you have time to make the changes needed to allow the flakiness dashboard to understand repeated runs?

I went ahead and made a temporary change from PASS PASS PASS to PASS on the Telemetry results side, but I don't want this temporary change to stick around for an extended period of time.

Thanks!
Labels: -Pri-3 Pri-2
Cc: -dpranke@chromium.org perezju@chromium.org
perezju@ pointed out the same case for SKIP tests in https://bugs.chromium.org/p/chromium/issues/detail?id=717394
I've got the same workaround out for review now.
Project Member

Comment 10 by bugdroid1@chromium.org, Sep 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ac28286446d3a0d6eafcc6885f44594c1e2de14d

commit ac28286446d3a0d6eafcc6885f44594c1e2de14d
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Tue Sep 26 23:05:17 2017

Roll src/third_party/catapult/ cf05c91b6..a17499a86 (2 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/cf05c91b6757..a17499a8648f

$ git log cf05c91b6..a17499a86 --date=short --no-merges --format='%ad %ae %s'
2017-09-26 ashleymarie Temporary fix for multiple skips to mean skip not flaky
2017-09-26 dtu [pinpoint] Move _previous_builds from the Execution class to the Quest instance.

Created with:
  roll-dep src/third_party/catapult
BUG=754825


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I165d323c6bcea2c29b5cbcf4773174231f1c62ee
Reviewed-on: https://chromium-review.googlesource.com/685642
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#504507}
[modify] https://crrev.com/ac28286446d3a0d6eafcc6885f44594c1e2de14d/DEPS

Thanks Ashley for the quick fix/workaround!
Components: Test>Telemetry
Components: -Tests>Telemetry
