New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Make LayoutTests work in random order.

Project Member Reported by tansell@chromium.org, Apr 7 2016

Issue description

The webkit LayoutTests should work reliably when run in any order (including random order).

It would be a good idea to make "--order=random" the default run order (after I have done more thorough testing).

We should make this change *independently* of moving the LayoutTests to run on swarming. Doing so will help separate in people's mind problems caused by bad layout tests from the issues which might occur from running on the migration to swarming.

Context - https://groups.google.com/a/chromium.org/forum/#!topic/blink-infra/KFrL9laKyec
 
Showing comments 5 - 104 of 104 Older
The patch which adds the --order=random-seeded=XXX stuff can be found at https://codereview.chromium.org/2082653004/

Will try and get it landed.

Comment 6 by mar...@chromium.org, Aug 18 2016

Was there progress on this? It seems to be the only other significant blocker to run layout tests on Swarming.
Cc: -raikiri@google.com -dcampb@google.com
Summary: Make LayoutTests work in random order. (was: Make LayoutTests work in random order)
No progress on this recently; in discussions before, the possible next steps were:

 Choice 1. Use the method above noted in #3 to make a list of order-dependent or flaky tests, which will either be run separately or deleted. Any further flaky order-dependent tests would then be caught by turning on random order by default and watching the waterfalls for flaky tests.

 Choice 2. When running layout tests, start a new renderer process for every single test. This will cause the layout tests to be much slower in the short term, may cause some tests to start failing of course, so expectations would be updated until the tests reach a stable state after that.
Blockedon: 644380
Project Member

Comment 9 by bugdroid1@chromium.org, Sep 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fa5c50adb9b096701b9c0e1e9eee458665484bc5

commit fa5c50adb9b096701b9c0e1e9eee458665484bc5
Author: qyearsley <qyearsley@chromium.org>
Date: Mon Sep 12 17:40:17 2016

Support specifying a seed for pseudo-random test order.

This is a continuation of http://crrev.com/2082653004 with the following
differences:
 - --order=random-seeded is removed
 - The seed is given in a separate flag.

Once we find out which tests are order-dependent and will be flaky with random order,
then this can be changed so that "random" is the default order instead of "natural".

BUG= 601332 

Review-Url: https://codereview.chromium.org/2308283002
Cr-Commit-Position: refs/heads/master@{#417968}

[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/controllers/manager.py
[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/models/test_run_results.py
[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py
[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py
[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/views/printing.py
[modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/views/printing_unittest.py

Comment 10 by ojan@chromium.org, Sep 12 2016

Cc: qyears...@chromium.org
It sounds like we're moving forward with choice from from comment #7. I think that's the right choice, but I want to be clear on what "run separately" means. It's important to not break this up into two separate invocations of run-webkit-tests. Otherwise we're putting burden on developers to know which one they have to run for each test suite as we move them out of OrderDependentTests. Instead, a single invocation of run-webkit-tests should still run all the tests, it should just use OrderDependentTests in order to figure out sharding and test order.
Labels: -Pri-3 -OS-All Pri-2
Good point.

Yep, I believe we want to move forward with  choice 1; I was originally thinking this would mean running two separate invocations of run-webkit-tests (and two separate webkit_tests steps), but now I agree we'd definitely want to keep just one invocation to run all of the tests.
You should specify what you mean by "run separately". Will the tests be run in a fixed order at the same time as tests are running in random order? And, will the order-dependent tests be run in a single shard, or multiple shards?
Blockedon: 652357
Blockedon: 653709
Blockedon: 653722
Blockedon: 655831
Status update: 

I've been running the test suite in default & random order on my workstation. I put the data up in a scatterplot (y-axis = % failures in random order, x-axis = % failures in default order) here:
https://blink-order-dependency.appspot.com/

You can browse by directory here:
https://blink-order-dependency.appspot.com/by-directory

Based on the data I created a CL with a new RandomOrderExpectations file containing 128 tests. With that file, the suite ran 6/6 times without any failures in random order on my local machine.
https://codereview.chromium.org/2427613004

The next step is to add a chromium.fyi.randomorder bot that continuously runs the suite in random order to gather more data. Bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=652357
Project Member

Comment 18 by bugdroid1@chromium.org, Oct 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/62121cde5cfb54d37231f0dc271d06ef8a237853

commit 62121cde5cfb54d37231f0dc271d06ef8a237853
Author: jeffcarp <jeffcarp@chromium.org>
Date: Thu Oct 27 22:42:34 2016

Here's a first shot at getting random order runs green locally, by adding a RandomOrderExpectations file.

Run by adding: --additional-expectations=LayoutTests/RandomOrderExpectations

5 out of 5 local runs have passed with this file (minus 2 tests which I think might be a regression:  https://crbug.com/656777 ).

BUG= 601332 

Review-Url: https://codereview.chromium.org/2427613004
Cr-Commit-Position: refs/heads/master@{#428168}

[add] https://crrev.com/62121cde5cfb54d37231f0dc271d06ef8a237853/third_party/WebKit/LayoutTests/RandomOrderExpectations

Project Member

Comment 19 by bugdroid1@chromium.org, Nov 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/576a434b15d3240cd3c3ffd4f8897b4e60d81fe8

commit 576a434b15d3240cd3c3ffd4f8897b4e60d81fe8
Author: jeffcarp <jeffcarp@chromium.org>
Date: Thu Nov 03 22:10:33 2016

Add two fast/text/ellipsis tests to RandomOrderExpectations.

All the recent failures on WebKit Linux - RandomOrder have been from at least one of three tests. This adds two of them. The other test only appeared once in the failures so I'm not adding it now.

BUG= 601332 

Review-Url: https://codereview.chromium.org/2473613004
Cr-Commit-Position: refs/heads/master@{#429708}

[modify] https://crrev.com/576a434b15d3240cd3c3ffd4f8897b4e60d81fe8/third_party/WebKit/LayoutTests/RandomOrderExpectations

Blockedon: 663571
Blockedon: 663581
Blockedon: 663585
Blockedon: 663838
Blockedon: 663840
Blockedon: 663847
Blockedon: 663848
Blockedon: 663849
Blockedon: 663851
Blockedon: 663852
Blockedon: 663853
Blockedon: 663855
Blockedon: 663858
Blockedon: 663860
Blockedon: 663865
Blockedon: 663871
Blockedon: 663872
Blockedon: 663874
Blockedon: 663875
Blockedon: 663876
Blockedon: 663877
Blockedon: 663879
Blockedon: 664816
Blockedon: 664817
Blockedon: 664819
Project Member

Comment 45 by bugdroid1@chromium.org, Nov 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b3af257e85b3f5b3c46049db264353ba97a94def

commit b3af257e85b3f5b3c46049db264353ba97a94def
Author: jeffcarp <jeffcarp@chromium.org>
Date: Mon Nov 14 22:48:27 2016

Add corresponding bugs to RandomOrderExpectations

BUG= 601332 

Review-Url: https://codereview.chromium.org/2502773002
Cr-Commit-Position: refs/heads/master@{#431932}

[modify] https://crrev.com/b3af257e85b3f5b3c46049db264353ba97a94def/third_party/WebKit/LayoutTests/RandomOrderExpectations

Blockedon: 665605
Blockedon: 665610
Blockedon: -665610
Project Member

Comment 49 by bugdroid1@chromium.org, Nov 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8d653d9454503da0e54c90fc583e72fb025a7f23

commit 8d653d9454503da0e54c90fc583e72fb025a7f23
Author: jeffcarp <jeffcarp@chromium.org>
Date: Mon Nov 21 19:58:51 2016

Remove http/tests/inspector/network/har-content.html from RandomOrderExpectations

A CL was landed to fix the flakiness: https://crrev.com/2503573002

BUG= 601332 

Review-Url: https://codereview.chromium.org/2499153003
Cr-Commit-Position: refs/heads/master@{#433614}

[modify] https://crrev.com/8d653d9454503da0e54c90fc583e72fb025a7f23/third_party/WebKit/LayoutTests/RandomOrderExpectations

Cc: joelo@chromium.org jsb...@chromium.org
 Issue 502013  has been merged into this issue.
Blockedon: 667923
Blockedon: 664849
Blockedon: 669693
Blockedon: 669938
Blockedon: 669985
Blockedon: 670841
Blockedon: 670843
Blockedon: 670846
Blockedon: 670892
Blockedon: 671395
Blockedon: 671475
Blockedon: 671477
Blockedon: 671478
Blockedon: 671480
Blockedon: 671802
Blockedon: 671805
Blockedon: 672204
Blockedon: 673003
Blockedon: 674191
Blockedon: 652536
Blockedon: 674633
Blockedon: 675023
LayoutTests on Linux and Mac moved to random order by default today!! 🎉

Next order of business is rolling out Windows and Android.
@jeffcarp: It was my understanding that windows is now done. Is this bug out of date?
@mcgreevy: yes that is correct, I just didn't update this bug. The LayoutTests now work in random order but there are still a bunch of bugs blocking this one (they're tests that were disabled due to random order flakiness). Not sure if this bug should be marked Fixed or if we should update the title to be more specific (e.g. 'Resolve random order flakiness in LayoutTests').

Comment 76 by ojan@chromium.org, Feb 7 2017

As best I can tell, the change that's landed makes it so that we run test within a directory in random order, but not actually in a globally random order. This doesn't match what swarming will do since swarming will run them in a globally random order, right?

I think in order to get the behavior of swarming, you'd also have to run with --fully-parallel, which will increase flakiness a lot.
Thanks for catching that, I'll start investigating --fully-parallel.
Tim and I spoke in person about whether we want to set up new FYI bots for this. We think it makes the most sense to continue converting the RandomOrder FYI bots to swarming and catch flakiness issues there. This saves us time for two reasons:

1. If we tried to address --fully-parallel flakiness before moving to swarming, we'd still need to address (the likely) flakiness issues after moving to swarming. Even if a test is flaky only because it's running on swarming, we can't ignore it because swarming is our eventual goal.

2. Converting the RandomOrder FYI bots to swarming will decrease the iteration time for catching flakiness. Even if results need to be confirmed locally, we will run into issues faster and be able to confirm fixes faster on swarming.

Let me know if you disagree or see any problems with this approach.
Have you tried a test run with --fully-parallel to get an anecdotal sense of how significant the increase in flakiness would be?
Cc: -joelo@chromium.org
It seems to me like life would be much easier if we knew that fully random worked properly before trying to get swarming working, so it seems to me you should at least flip the switch and see what happens to see how bad things are first.

(Also, apologies for missing the fact that the sharding happened after the tests were shuffled before :().
Just to be clear, there are *two* levels of sharding going on here.

There is;

 (a) the swarming/top-level sharding
 (b) the multi-process sharding which splits things up for running in multiple processes on a single machine.

The code for (a) is taking the *complete* list of tests, shuffling them and then splitting them into equal chunks (which is what we want for swarming).

The code in (b) is then taking the chunk and regrouping them by directory before handing them to a local test runner.

The side effect of (b) is that currently tests are only running in random order *inside a directory*. Directories are still run in sorted order.

To "simulate" the effect that swarming will have on the output, we can do;
-------------------------
for i in 0 1 2 3 4 5; do
  GTEST_RANDOM_SEED=1 GTEST_SHARD_INDEX=$i GTEST_TOTAL_SHARDS=6 <layout test command>
done
-------------------------
While I agree that we probably want to separate "running on swarming" from "running in random order" issues, the fact that it takes ~1 hour to do the run locally and <10m to do a run on swarming means we can probably get some initial data using swarming directly first.
Here are my test results.

Without --fully-parallel:

Expected to timeout, but passed: (15)
Expected to fail, but passed: (17)
Regressions: Unexpected text-only failures (2)
Regressions: Unexpected image-only failures (4)

With --fully-parallel:

Expected to timeout, but passed: (15)
Expected to fail, but passed: (16)
Regressions: Unexpected text-only failures (2)
Regressions: Unexpected image-only failures (4)

Ran against Release build of Chromium@184394b on Ubuntu.  Run #2 failures ⊂ run #1 failures.

Certainly on Windows or under swarming conditions we'd expect more flakiness, but these results indicate that there isn't a catastrophic increase in flakiness with --fully-parallel (either that or the flag isn't working properly).

Also keep in mind that tests are re-tried 3 times individually after a test run. I believe this catches a lot of tests that would have failed otherwise (it looks like about 100 for each).

Comment 85 by ojan@chromium.org, Feb 8 2017

Is there a large increase in the number of tests retried? If so, that implies an increase in expected flakiness.

It seems to me like we don't need to argue about this. We can turn on --fully-parallel for 1-2 days, and then enable swarming.
> We can turn on --fully-parallel for 1-2 days, and then enable swarming.

+1. It's not like we're that close to enabling swarming anyway ...
@ojan: there were actually fewer tests retired under --fully-parallel, 96 with vs. 107 without iirc. Not sure why. Enabling the flag on more builders might shed more light. 

@ojan & dpranke: Do you mean turn on --fully-parallel on the FYI builders or on all builders? Either way that sounds like a good plan and I can do that on Friday when I get back. 

Comment 88 by ojan@chromium.org, Feb 8 2017

I meant the FYI builders. If the FYI builders look good after a couple days (low flakiness and good cycle time), there's no reason not to enable it on all builders.
@jeffcarp: How is --fully-parallel working out?
The RandomOrder FYI builders moved to --fully-parallel at 10:10am PST on 10/2/17. Since then:

Linux
- Looks relatively stable, with around the same number of failures
- https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Linux%20-%20RandomOrder/?limit=200

Mac
- Is having infra failures I'm looking into
- https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Mac%20-%20RandomOrder/?limit=200

Windows
- Has a number of flaky failing tests, listed below
- https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Win%20-%20RandomOrder/?limit=200

Out of 53 test runs on WebKit Win - RandomOrder:

fast/css-grid-layout/grid-align-justify-stretch-with-orthogonal-flows.html                           failures: 48
fast/css/fontfaceset-check-platform-fonts.html                                                       failures: 32
virtual/mojo-loading/http/tests/security/suborigins/suborigin-invalid-options.html                   failures: 30
http/tests/security/cross-frame-mouse-source-capabilities.html                                       failures: 22
virtual/mojo-loading/http/tests/security/cross-frame-mouse-source-capabilities.html                  failures: 9
fast/overflow/overflow-height-float-not-removed-crash3.html                                          failures: 5
fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks-2.html                         failures: 4
fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html                           failures: 4
virtual/mojo-loading/http/tests/webfont/font-display-intervention.html                               failures: 2
svg/text/text-selection-ws-01-t.svg                                                                  failures: 1
virtual/mojo-loading/http/tests/loading/preload-img-test.html                                        failures: 1
virtual/mojo-loading/http/tests/misc/client-hints-accept-meta-preloader.html                         failures: 1
virtual/mojo-loading/http/tests/webfont/font-display.html                                            failures: 1
fast/table/percent-height-overflow-visible-content-in-cell.html                                      failures: 1
http/tests/misc/client-hints-accept-meta-preloader.html                                              failures: 1
external/wpt/pointerevents/pointerevent_pointerout_received_once-manual.html                         failures: 1


Comment 92 by ojan@chromium.org, Feb 14 2017

1. --fully-parallel still doesn't shard virtual test suites. This is on purpose because switching between virtual tests suites needs to restart content_shell with different commandline arguments, so it slow. Is there a plan for how to handle this efficiently with swarming?
2. All the non-virtual tests have a "killing secondary driver" line right after it.
3. I suggested in comment 88 that you double check that cycle time didn't regress. You should still do that. As best I can tell, the webkit_tests step takes 3x longer now.

Comment 93 by ojan@chromium.org, Feb 14 2017

4. The amount of retrying seems to go from ~40-->~100 tests now.
Looks like we kill the secondary driver at the end of each test shard. That's not gonna work so well with randomly ordered tests ...
> Is there a plan for how to handle this efficiently with swarming?

Not that I'm aware of. What do you suggest?

> As best I can tell, the webkit_tests step takes 3x longer now.

I'll pull some numbers on cycle time + number of retries.

ojan + dpranke: do you think this means we'll need to do some work to change how tests are sharded on swarming? It's pretty clear now that this bug can't be closed simply by adding more lines to TestExpectations.
I think most of the performance/cycle time issues go away if we're not actually truly running things in random order.

So I think we're fine to proceed w/ swarming stuff, and should worry about flakiness concurrently w/ proceeding on swarming.

Comment 97 by ojan@chromium.org, Feb 15 2017

Cc: -ojan@chromium.org
FYI, I'm unsubscribing from this bug. So, if you want my feedback you'll need to contact me directly.
Tim and I met with Ojan on Tuesday and discussed this. The results basically echo Dirk in #96. Here's our game plan:

Problem: --fully-parallel restarts non-virtual content-shells every single test

- This is because --fully-parallel puts all non-virtual tests into a function LayoutTestRunner._shard_every_file and the content shell is restarted after every shard (where every shard is 1 test big). Tim filed a  bug 692866  to address this. 

Problem: Cycle time has regressed 3x

- Tim and I looked at a bunch of logs and determined that the cause behind the explosion in cycle time was the previous issue (restarting content-shell after every non-virtual test). It will be resolved with  bug 692866 . Since we will not be running under --fully-parallel on swarming, this should not be an issue, however we will keep a close eye on cycle time going forward.

Problem: The number of retries is higher with --fully-parallel

- This is something we'll also be keeping a close eye on as we move the FYI bots to swarming. With a quicker cycle time on swarming we should be able to address flakiness problems faster and get more data.

Problem: --fully-parallel doesn't shard virtual test suites

- This is not an issue (that's what I have from my notes but I can't recall the reasoning behind it)

I also pulled the numbers from #95 and Ojan's numbers in #92 and #93 are accurate.
We've observed failures of these two tests:
fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html
fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks-2.html
https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win7%20%28dbg%29/builds/9095
https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win7%20%28dbg%29/builds/9096

Could it be related to this issue?
Blockedon: 695378
I too just discovered that fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html is really flaky. It has nothing to do with this issue, though. Opening the test locally in content_shell or even Chrome, and then reloading a few times, you'll see that the test fails quite often (i.e. all the words end up on the same line).

I've reported  bug 695378  for this.
Blockedon: -695378
Status: Fixed (was: Started)
I'm going to close this bug as we now run Layout Tests in random order by default.

The Layout Tests are also running on swarming in this mode.


Blockedon: -652536
Showing comments 5 - 104 of 104 Older

Sign in to add a comment