
Issue 894527

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 26
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Reduce flakiness in blink layout tests.

Project Member Reported by erikc...@chromium.org, Oct 11

Issue description

According to go/top-cq-flakes, blink layout tests are one of the top sources of CQ flakiness [around 1/3 of all flakes are caused by blink layout tests]. 

I've been reducing sources of non-determinism in the test runner in Issue 889036. One of the last remaining sources of non-determinism in blink layout tests is that the test runner reuses the content shell between tests. This causes pathological behavior with respect to retries. The following example is representative of most failures I've looked at:

A layout test T usually passes [flakiness << 10%]. But if it happens to flake, then:
 * It continues to fail on all retries.
 * When we run the test in isolation [retry count = 10], then the retries all pass or all fail, depending on whether the first run passes or fails. 

This is because we reuse content shell between tests, and some state is carrying through. If we reset the content shell between tests, then retries become independent, and state stops carrying between tests. The test now flakily fails with a fixed probability on subsequent retries. Example: https://bugs.chromium.org/p/chromium/issues/detail?id=875884#c6.
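To make the retry math concrete, here is a quick simulation (hypothetical numbers: a test that flakes 20% of the time, four retries) showing why independent retries almost never fail the suite, while carried-over state turns every first-run flake into a suite-level failure:

```python
import random

def run_suite(trials, fail_prob=0.2, retries=4, independent=True):
    """Count how often a flaky test causes a suite-level failure.

    With carried-over state (independent=False), a first-run failure
    repeats on every retry. With a clean content shell per run
    (independent=True), each retry fails with fail_prob on its own.
    """
    suite_failures = 0
    for _ in range(trials):
        if random.random() >= fail_prob:
            continue  # first run passed; retries never happen
        if independent:
            # The suite only fails if every retry also flakes.
            if all(random.random() < fail_prob for _ in range(retries)):
                suite_failures += 1
        else:
            # State carries over: all retries repeat the failure.
            suite_failures += 1
    return suite_failures

random.seed(0)
print(run_suite(100000, independent=False))  # roughly fail_prob * trials, ~20000
print(run_suite(100000, independent=True))   # roughly fail_prob**5 * trials, a few dozen
```

With shell reuse, a single flake reliably fails all retries; with a clean shell, four independent retries cut the suite-level failure rate by roughly a factor of fail_prob**4.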

The problem is that resetting content shell between tests significantly increases test run time. The cost is roughly 2X on Windows, 3X on Linux and 5X on macOS. 

In this CL: https://chromium-review.googlesource.com/c/chromium/src/+/1273810 I propose resetting content shell between tests, but only when not running the full suite. This gives us the benefit of cleaner test runs, while the cost of increased run time is negligible, since typically only a small set of tests is being run. 
 
Cc: liaoyuke@chromium.org dpranke@chromium.org st...@chromium.org nednguyen@chromium.org
Dirk and Emil are both concerned with this solution. 

Quoting Dirk:
https://chromium-review.googlesource.com/c/chromium/src/+/1273810#message-289f784fb08bec7a379a721e7b2e27e050f1dcad
"""
I'm not wild about this change; the fewer things that vary between different test configs, the better, and I think this behavior will catch people by surprise and also create more variance between things that fail when run as part of the full suite and when run individually. In particular, I think it's rare for devs to run the full suite locally these days, so they'd run a subset of tests and get this change, and then we'd see failures in the CQ that they wouldn't see locally.
"""

Quoting Emil:
https://chromium-review.googlesource.com/c/chromium/src/+/1273810#message-290318b389003a3e5838c387300ef3d4b0e7bd92
"""
While I'd love to see us switch to a model where each test is run in a "clean" content shell I do share Dirk's concerns about increasing the run time and introducing inconsistencies. 
...
Another option might be to move those tests to a separate test-suite and run that with this configuration. Then we'd be incentivise to "fix" those tests over time in order to reduce the overall layout test run time.
"""

There are two main concerns:
(1) Different configuration between <running all tests> vs <running a subset of tests> can cause confusion.
(2) If retries mask flakiness, then flaky tests [possibly representing real bugs] will go unfixed.


Addressing concern (2) first. 
* It's important that flaky tests get filed and fixed. 
* Whether test flakiness causes false rejects of CLs is orthogonal to the above point. False rejects cause developer pain without providing benefit.

The most common flow for a Chrome developer is: notice that a build failed. Check the build, notice that the test is unrelated to the change, hit the CQ button again. Most false rejects do not cause bugs to be filed. 

We absolutely need to make sure that flaky tests get filed and fixed, but that should not prevent us from reducing false rejects. +stgao, liaoyuke


Addressing concern (1) next.
The CL I proposed reduces, but does not eliminate, inconsistencies in test runs. 

Developers will typically run a small set of tests locally. Since content shell gets reused, this can mask real bugs, depending on the ordering of the local tests that get run.
Example 1: https://bugs.chromium.org/p/chromium/issues/detail?id=875884#c7 -- On my local machine, this test fails with ~20% probability when run with a clean content shell, and always passes when reusing a content shell. This means that if a developer added this test, but ran it with a set of other tests locally [such that this test was not run first], then it would pass with 100% probability, no matter how many times they ran the tests.

On the CQ, we currently run all tests and reuse content shell. This means that aside from the first test run, all subsequent tests will be using a reused content shell [minus the rare tests that cause a content shell restart]. This contrasts with the developer's local workflow, where if they're only running a single test, then the test will always be run in a clean content shell.

I had an extended discussion with +nednguyen and +dpranke on this subject [attaching conversation transcript]. Attaching the conclusion.
"""
erikchen:
I personally think the right solution is:
1) make sure that flaky tests get surfaced and prioritized appropriately 
2) make sure that flaky tests don't cause false rejects.
...
I think that (1) and (2) should be independent

dpranke:
I agree that that's a good direction

erikchen:
This CL purports to do (2), while leaving (1) unchanged. If we had (1), would you be okay with (2)?

dpranke:
stgao's team has (at my urging) spent a lot more time trying to make sure we could do (1) than (2)
yes
but we don't have (1) yet

erikchen:
the problem is, Not (2) doesn't give us (1). Developers mostly ignore flaky layout tests right now.

dpranke:
it's possible that doing (2) now and (1) later is still the right thing to do

erikchen:
Since most of the time, the flake occurs on a CL for a developer that doesn't know blink

dpranke:
agreed

erikchen:
Thanks -- I'm going to transcribe this conversation and post it as a txt document, then post a summary. I think you're OKing moving forward with this CL.

dpranke:
I'm not OK'ing moving forward with this CL without commitment to a larger plan or agreement w/ more people that we should land this, particularly since eae agreed w/ me
but transcribe/post is a good next step.
"""


Attachment: layout_test_discussion.txt (5.1 KB)
Cc: a...@chromium.org kbr@chromium.org dcheng@chromium.org atotic@chromium.org
+ atotic, dcheng, avi, kbr from https://chromium-review.googlesource.com/c/chromium/src/+/1273810
Three more examples of sets of "flaky" layout tests whose behavior changes depending on whether content shell is reset between tests:

https://bugs.chromium.org/p/chromium/issues/detail?id=894522
https://bugs.chromium.org/p/chromium/issues/detail?id=889952
https://bugs.chromium.org/p/chromium/issues/detail?id=889500

If we were resetting content shell between tests for developers running a subset of tests, there's a high probability that these bugs would have been caught prior to the test being added.
Hey, chiming in regarding (1) (Different configuration between <running all tests> vs <running a subset of tests> can cause confusion).

As a developer, I will only run a subset of layout tests, and there are only a few ways this can cause confusion in principle:
a. My test passes locally, but fails/flakes on the CQ
b. My test fails/flakes locally, but passes on the CQ.

(a) should not happen, because my local test runs use the same configuration as the CQ retries. If the test is flaky, I should be able to reproduce it in the same configuration as the CQ.

(b) is the real concern, but it is a minor one. A locally failing test is immediately actionable and debuggable.
To csharrison's point, this CL makes it _easier_ to reproduce CQ flakes locally.

Here are two examples of tests that frequently flake on the CQ but are hard to reproduce locally. With the CL patched in, the tests easily reproduce locally.

https://bugs.chromium.org/p/chromium/issues/detail?id=894522
https://bugs.chromium.org/p/chromium/issues/detail?id=875884#c7
Yeah, I didn't mention this, but I have always had a hard time reproducing flakes of layout tests locally. I even remember one time when I tried to simulate this behavior by running the test suite manually N times rather than passing --repeat.
csharrison@ and thakis@ independently suggested that I modify my patch to reset content shell if a user passes --gtest_repeat=X, where X > 1.

This way the behavior is consistent between local runs and CQ runs, but we get the benefit of resetting content shell between tests in the most important case, which is repeating the same test.

This also makes it easier for Blink developers to repro CQ failures locally when passing --gtest_repeat [which previously did not have the desired effect].
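A minimal sketch of that defaulting rule (function and attribute names are hypothetical; the real logic lives in blinkpy's run_web_tests.py / port code):

```python
from types import SimpleNamespace

def should_reset_shell_between_tests(args):
    """Reset content shell between tests only when a test is being
    repeated, so local --gtest_repeat runs behave like CQ retries.

    `args` is assumed to expose `repeat_each` and `gtest_repeat`
    counts mirroring the run_web_tests.py flags.
    """
    return args.repeat_each > 1 or args.gtest_repeat > 1

# A plain single run keeps the fast shell-reuse behavior...
print(should_reset_shell_between_tests(
    SimpleNamespace(repeat_each=1, gtest_repeat=1)))   # False
# ...while --gtest_repeat=10 opts into the consistent clean-shell mode.
print(should_reset_shell_between_tests(
    SimpleNamespace(repeat_each=1, gtest_repeat=10)))  # True
```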
Cc: chanli@chromium.org
Re #7: I like that plan, seems like a good compromise between predictability and runtime performance.
Hi, I just found my way here after discovering the --reset-shell-between-tests flag being added to my local runs. I read through the comments and understand the change, except for one thing. Would it be possible to add a way to NOT reset the shell between tests, if desired? Something like --reuse-shell-between-tests? Let me know if it already exists and I missed it.

I'm currently trying to reproduce some flakiness that happens on a trybot, but I'm running into comment 4, option a: it only flakes if run on a re-used content_shell. So I'd like to run with --iterations=100 but NOT reset the shell between tests.
Thanks for the feedback -- I'm in the process of changing the flags around, see comments on:
https://chromium-review.googlesource.com/c/chromium/src/+/1288974

Once that lands and I've refactored the current code to have the same behavior, I'll add a flag to disable this behavior. 

In the short term, you can modify port/base.py to not add the flag by default:
https://chromium-review.googlesource.com/c/chromium/src/+/1273810/10/third_party/blink/tools/blinkpy/web_tests/port/base.py#264
Also, nit: the help for run_web_tests.py should say that this flag is added if iterations>1:

    --reset-shell-between-tests
                        Resetting the shell between tests causes the tests to
                        take twice as long to run on average, but provides
                        more consistent results. This is automatically enabled
                        if --repeat-each or --gtest_repeat is specified

Thanks, I'll fix that as well.
Awesome, thanks for the link to the new CL. And for working on test flakiness in general - anything that improves flakiness is a great project!
Project Member

Comment 15 by bugdroid1@chromium.org, Oct 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/932b3fc6770949c9751ad265995f940f6836ceaa

commit 932b3fc6770949c9751ad265995f940f6836ceaa
Author: Erik Chen <erikchen@chromium.org>
Date: Wed Oct 24 17:50:44 2018

Restart content shell rather than renderers between layout tests.

Restarting content shell between layout tests increases run time by 2-5x but
produces more consistent results. Previously, this was implemented as a flag to
content shell that would restart the renderer between tests. This CL changes the
test runner to instead restart content shell between tests. This has comparable
run time, with less complexity and even less probability of state carrying over
between tests.

The content shell flag --reset-shell-between-tests is now unused, and will be
removed in a future CL.

Bug:  894527 
Change-Id: Idce9a74006000ded0d308affaff0568226a2cf7f
Reviewed-on: https://chromium-review.googlesource.com/c/1296673
Commit-Queue: Erik Chen <erikchen@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602388}
[modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/controllers/layout_test_runner.py
[modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/port/base.py
[modify] https://crrev.com/932b3fc6770949c9751ad265995f940f6836ceaa/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests.py

Project Member

Comment 16 by bugdroid1@chromium.org, Oct 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/79be9703b942a785d97ff877483d0d39ee5ebe2e

commit 79be9703b942a785d97ff877483d0d39ee5ebe2e
Author: Erik Chen <erikchen@chromium.org>
Date: Thu Oct 25 23:30:52 2018

Remove content shell flag reset-shell-between-tests.

The test runner now restarts content shell instead of passing this flag to
content shell when the behavior is desired.

Bug:  894527 
Change-Id: I28bc74239c74cbcadcdcbc927eb213a07cbed5c0
Reviewed-on: https://chromium-review.googlesource.com/c/1296674
Commit-Queue: Erik Chen <erikchen@chromium.org>
Reviewed-by: Avi Drissman <avi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602925}
[modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/browser/layout_test/blink_test_controller.cc
[modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/common/layout_test/layout_test_switches.cc
[modify] https://crrev.com/79be9703b942a785d97ff877483d0d39ee5ebe2e/content/shell/common/layout_test/layout_test_switches.h

Project Member

Comment 17 by bugdroid1@chromium.org, Oct 30

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dc58887b8ee0fab682a0167e4e559a6289d5bd34

commit dc58887b8ee0fab682a0167e4e559a6289d5bd34
Author: erikchen <erikchen@chromium.org>
Date: Tue Oct 30 14:05:18 2018

Simplify flags for restarting content shell in layout test runner.

This CL removes the --batch-size argument, which is rarely [or never] used. It
changes --restart-shell-between-tests to be a tri-state, with options 'always',
'never' and 'on_retry'. The last is the default.

Bug:  894527 
Change-Id: Ib15497545aa4e55e7ab77e987db7989755d3c97b
Reviewed-on: https://chromium-review.googlesource.com/c/1300236
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Erik Chen <erikchen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603886}
[modify] https://crrev.com/dc58887b8ee0fab682a0167e4e559a6289d5bd34/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests.py
[modify] https://crrev.com/dc58887b8ee0fab682a0167e4e559a6289d5bd34/third_party/blink/tools/blinkpy/web_tests/run_webkit_tests_unittest.py
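The resulting tri-state can be sketched as follows (a simplified, hypothetical illustration; the real decision lives in layout_test_runner.py and also has to account for the rare tests that force a shell restart on their own):

```python
def should_restart_shell(mode, is_retry):
    """Decide whether the runner restarts content shell before a test.

    `mode` mirrors the --restart-shell-between-tests tri-state:
    'always', 'never', or 'on_retry' (the default).
    """
    if mode == 'always':
        return True
    if mode == 'never':
        return False
    if mode == 'on_retry':
        return is_retry
    raise ValueError('unknown mode: %r' % mode)

# Default behavior: fast shell reuse on the first pass, clean shells
# (and therefore independent results) once a test is being retried.
print(should_restart_shell('on_retry', is_retry=False))  # False
print(should_restart_shell('on_retry', is_retry=True))   # True
```

The 'on_retry' default is the compromise discussed above: full-suite runs keep their speed, while retries become independent of any state carried over from earlier tests.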

Components: Blink>Infra
Owner: erikc...@chromium.org
Status: Fixed (was: Untriaged)
This is basically fixed. See https://bugs.chromium.org/p/chromium/issues/detail?id=889036#c27.

While individual tests are still flaky, we now have independent retries. This means that the test suite as a whole is almost never flaky.
