Reduce flakiness in blink layout tests. |
|||||
Issue descriptionAccording to go/top-cq-flakes, blink layout tests are one of the top sources of CQ flakiness [around 1/3 of all flakes are caused by blink layout tests]. I've been reducing sources of non-determinism in the test runner in Issue 889036 . One of the last remaining sources of non-determinism in blink layout tests is that the test runner reuses the content shell between tests. This causes pathological behavior with respect to retries. The following example is representative of most failures I've looked at: A layout test T usually passes [flakiness << 10%]. But if it happens to flake, then: * It continues to fail on all retries. * When we run the test in isolation [retry count = 10], then the retries all pass or all fail, depending on whether the the first run passes or fails. This is because we reuse content shell between tests, and some state is carrying through. If we reset the content shell between tests, then retries become independent, and state stops carrying between tests. The test now flakily fails with a fixed probability on subsequent retries. Example: https://bugs.chromium.org/p/chromium/issues/detail?id=875884#c6. The problem is that resetting content shell between tests significantly increases test run time. The cost is roughly 2X on Windows, 3X on Linux and 5X on macOS. In this CL: https://chromium-review.googlesource.com/c/chromium/src/+/1273810 I propose resetting content shell between tests, but only when not running all tests. This gives us the benefit of cleaner test runs, and the cost of increased run time is negligible since typically only a small set of tests are running.
,
Oct 11
+ atotic, dcheng, avi, kbr from https://chromium-review.googlesource.com/c/chromium/src/+/1273810
,
Oct 11
Three more examples of sets of "flaky" layout tests whose behavior changes depending on whether content shell is reset between tests: https://bugs.chromium.org/p/chromium/issues/detail?id=894522 https://bugs.chromium.org/p/chromium/issues/detail?id=889952 https://bugs.chromium.org/p/chromium/issues/detail?id=889500 If we were resetting content shell between tests for developers running a subset of tests, there's a high probability that these bugs would have been caught prior to the test being added
,
Oct 11
Hey, chiming in regarding (1) (Different configuration between <running all tests> vs <running a subset of tests> can cause confusion). As a developer, I will only run a subset of layout tests, and there are only a few ways this can cause confusion in principle: a. My test passes locally, but fails/flakes on the CQ b. My test fails/flakes locally, but passes on the CQ. (a) should not happen, because my local test runs are in the same configuration as the CQ retries. If the test is flaky I should be able to reproduce in the same configuration as the CQ. (b) is the real concern, but it is a minor one. A locally failing test is immediately actionable and debuggable.
,
Oct 11
To csharrison's point, this CL makes it _easier_ to reproduce CQ flakes locally. Here are two examples of tests that frequently flake on the CQ but are hard to reproduce locally. With the CL patched in, the tests easily reproduce locally. https://bugs.chromium.org/p/chromium/issues/detail?id=894522 https://bugs.chromium.org/p/chromium/issues/detail?id=875884#c7
,
Oct 11
Yeah I didn't mention this but I have always had a hard time reproducing flakes of layout tests locally. I even remember one time where I tried to simulate this behavior by running the test suite manually N times rather than passing --repeat.
,
Oct 12
csharrison@ and thakis@ independently suggested that I modify my patch to reset content shell if a user passes --gtest_repeat=X, where X > 1. This way the behavior is consistent between local runs and CQ runs, but we get the benefit of resetting content shell between tests in the most important case, which is repeating the same test. This also makes it easier for Blink developers to repro CQ failures locally when passing --gtest_repeat [which previously did not have the desired effect].
,
Oct 12
,
Oct 12
Re #7: I like that plan, seems like a good compromise between predictability and runtime performance.
,
Oct 23
Hi, I just found my way here after discovering the --reset-shell-between-tests being added to my local runs. I read through the comments and understand the change, except for one thing. Would it be possible to add a way to NOT reset the shell between tests, if desired? Something like --reuse-shell-between-tests? Let me know if it already exists and I missed it. I'm currently trying to reproduce some flakiness that happens on a trybot, but I'm running into comment 4, option a: it only flakes if run on a re-used content_shell. So I'd like to run with --iterations=100 but NOT reset the shell between tests.
,
Oct 23
Thanks for the feedback -- I'm in the process of changing the flags around, see comments on: https://chromium-review.googlesource.com/c/chromium/src/+/1288974 Once that lands and I've refactored the current code to have the same behavior, I'll add a flag to disable this behavior. In the short term, you can modify port/base.py to not add the flag by default: https://chromium-review.googlesource.com/c/chromium/src/+/1273810/10/third_party/blink/tools/blinkpy/web_tests/port/base.py#264
,
Oct 23
Also, nit: the help for run_web_tests.py should say that this flag is added if iterations>1:
--reset-shell-between-tests
Resetting the shell between tests causes the tests to
take twice as long to run on average, but provides
more consistent results. This is automatically enabled
if --repeat-each or --gtest_repeat is specified
,
Oct 23
Thanks, I'll fix that as well.
,
Oct 23
Awesome, thanks for the link to the new CL. And for working on test flakiness in general - anything that improves flakiness is a great project!
,
Oct 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/932b3fc6770949c9751ad265995f940f6836ceaa commit 932b3fc6770949c9751ad265995f940f6836ceaa Author: Erik Chen <erikchen@chromium.org> Date: Wed Oct 24 17:50:44 2018 Restart content shell rather than renderers between layout tests. Restarting content shell between layout tests increases run time by 2-5x but produces more consistent results. Previously, this was implemented as a flag to content shell that would restart the renderer between tests. This CL changes the test runner to instead restart content shell between tests. This has comparable run time, with less complexity and even less probability of state carrying over between tests. The content shell flag --reset-shell-between-tests is now unused, and will be removed in a future CL. Bug: 894527 Change-Id: Idce9a74006000ded0d308affaff0568226a2cf7f Reviewed-on: https://chromium-review.googlesource.com/c/1296673 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Cr-Commit-Position: refs/heads/master@{#602388} [modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/controllers/layout_test_runner.py [modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/port/base.py [modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests.py
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/79be9703b942a785d97ff877483d0d39ee5ebe2e commit 79be9703b942a785d97ff877483d0d39ee5ebe2e Author: Erik Chen <erikchen@chromium.org> Date: Thu Oct 25 23:30:52 2018 Remove content shell flag reset-shell-between-tests. The test runner now restarts content shell instead of passing this flag to content shell when the behavior is desired. Bug: 894527 Change-Id: I28bc74239c74cbcadcdcbc927eb213a07cbed5c0 Reviewed-on: https://chromium-review.googlesource.com/c/1296674 Commit-Queue: Erik Chen <erikchen@chromium.org> Reviewed-by: Avi Drissman <avi@chromium.org> Cr-Commit-Position: refs/heads/master@{#602925} [modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/browser/layout_test/blink_test_controller.cc [modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/common/layout_test/layout_test_switches.cc [modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/common/layout_test/layout_test_switches.h
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/dc58887b8ee0fab682a0167e4e559a6289d5bd34 commit dc58887b8ee0fab682a0167e4e559a6289d5bd34 Author: erikchen <erikchen@chromium.org> Date: Tue Oct 30 14:05:18 2018 Simplify flags for restarting content shell in layout test runner. This CL removes the --batch-size argument, which is rarely [or never] used. It changes --restart-shell-between-tests to be a tri-state, with options 'always', 'never' and 'on_retry'. The last is the default. Bug: 894527 Change-Id: Ib15497545aa4e55e7ab77e987db7989755d3c97b Reviewed-on: https://chromium-review.googlesource.com/c/1300236 Reviewed-by: Dirk Pranke <dpranke@chromium.org> Commit-Queue: Erik Chen <erikchen@chromium.org> Cr-Commit-Position: refs/heads/master@{#603886} [modify] https://crrev.com/dc58887b8ee0fab682a0167e4e559a6289d5bd34/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests.py [modify] https://crrev.com/dc58887b8ee0fab682a0167e4e559a6289d5bd34/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests_unittest.py
,
Nov 26
,
Nov 26
This is basically fixed. See https://bugs.chromium.org/p/chromium/issues/detail?id=889036#c27. While individual tests are still flaky, we now have independent retries. This means that the test suite as a whole is almost never flaky. |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by erikc...@chromium.org
, Oct 115.1 KB
5.1 KB View Download