Make LayoutTests work in random order. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Issue descriptionThe webkit LayoutTests should work reliably when run in any order (including random order). It would be a good idea to make "--order=random" the default run order (after I have done more thorough testing). We should make this change *independently* of moving the LayoutTests to run on swarming. Doing so will help separate in people's mind problems caused by bad layout tests from the issues which might occur from running on the migration to swarming. Context - https://groups.google.com/a/chromium.org/forum/#!topic/blink-infra/KFrL9laKyec ⛆ |
|
|
Showing comments 5 - 104
of 104
Older ›
,
Jun 21 2016
The patch which adds the --order=random-seeded=XXX stuff can be found at https://codereview.chromium.org/2082653004/ Will try and get it landed.
,
Aug 18 2016
Was there progress on this? It seems to be the only other significant blocker to run layout tests on Swarming.
,
Aug 18 2016
No progress on this recently; in discussions before, the possible next steps were: Choice 1. Use the method above noted in #3 to make a list of order-dependent or flaky tests, which will either be run separately or deleted. Any further flaky order-dependent tests would then be caught by turning on random order by default and watching the waterfalls for flaky tests. Choice 2. When running layout tests, start a new renderer process for every single test. This will cause the layout tests to be much slower in the short term, may cause some tests to start failing of course, so expectations would be updated until the tests reach a stable state after that.
,
Sep 6 2016
,
Sep 12 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fa5c50adb9b096701b9c0e1e9eee458665484bc5 commit fa5c50adb9b096701b9c0e1e9eee458665484bc5 Author: qyearsley <qyearsley@chromium.org> Date: Mon Sep 12 17:40:17 2016 Support specifying a seed for pseudo-random test order. This is a continuation of http://crrev.com/2082653004 with the following differences: - --order=random-seeded is removed - The seed is given in a separate flag. Once we find out which tests are order-dependent and will be flaky with random order, then this can be changed so that "random" is the default order instead of "natural". BUG= 601332 Review-Url: https://codereview.chromium.org/2308283002 Cr-Commit-Position: refs/heads/master@{#417968} [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/controllers/manager.py [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/models/test_run_results.py [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/views/printing.py [modify] https://crrev.com/fa5c50adb9b096701b9c0e1e9eee458665484bc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/views/printing_unittest.py
,
Sep 12 2016
It sounds like we're moving forward with choice from from comment #7. I think that's the right choice, but I want to be clear on what "run separately" means. It's important to not break this up into two separate invocations of run-webkit-tests. Otherwise we're putting burden on developers to know which one they have to run for each test suite as we move them out of OrderDependentTests. Instead, a single invocation of run-webkit-tests should still run all the tests, it should just use OrderDependentTests in order to figure out sharding and test order.
,
Sep 12 2016
Good point. Yep, I believe we want to move forward with choice 1; I was originally thinking this would mean running two separate invocations of run-webkit-tests (and two separate webkit_tests steps), but now I agree we'd definitely want to keep just one invocation to run all of the tests.
,
Sep 12 2016
You should specify what you mean by "run separately". Will the tests be run in a fixed order at the same time as tests are running in random order? And, will the order-dependent tests be run in a single shard, or multiple shards?
,
Oct 3 2016
,
Oct 6 2016
,
Oct 6 2016
,
Oct 13 2016
,
Oct 20 2016
Status update: I've been running the test suite in default & random order on my workstation. I put the data up in a scatterplot (y-axis = % failures in random order, x-axis = % failures in default order) here: https://blink-order-dependency.appspot.com/ You can browse by directory here: https://blink-order-dependency.appspot.com/by-directory Based on the data I created a CL with a new RandomOrderExpectations file containing 128 tests. With that file, the suite ran 6/6 times without any failures in random order on my local machine. https://codereview.chromium.org/2427613004 The next step is to add a chromium.fyi.randomorder bot that continuously runs the suite in random order to gather more data. Bug: https://bugs.chromium.org/p/chromium/issues/detail?id=652357
,
Oct 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/62121cde5cfb54d37231f0dc271d06ef8a237853 commit 62121cde5cfb54d37231f0dc271d06ef8a237853 Author: jeffcarp <jeffcarp@chromium.org> Date: Thu Oct 27 22:42:34 2016 Here's a first shot at getting random order runs green locally, by adding a RandomOrderExpectations file. Run by adding: --additional-expectations=LayoutTests/RandomOrderExpectations 5 out of 5 local runs have passed with this file (minus 2 tests which I think might be a regression: https://crbug.com/656777 ). BUG= 601332 Review-Url: https://codereview.chromium.org/2427613004 Cr-Commit-Position: refs/heads/master@{#428168} [add] https://crrev.com/62121cde5cfb54d37231f0dc271d06ef8a237853/third_party/WebKit/LayoutTests/RandomOrderExpectations
,
Nov 3 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/576a434b15d3240cd3c3ffd4f8897b4e60d81fe8 commit 576a434b15d3240cd3c3ffd4f8897b4e60d81fe8 Author: jeffcarp <jeffcarp@chromium.org> Date: Thu Nov 03 22:10:33 2016 Add two fast/text/ellipsis tests to RandomOrderExpectations. All the recent failures on WebKit Linux - RandomOrder have been from at least one of three tests. This adds two of them. The other test only appeared once in the failures so I'm not adding it now. BUG= 601332 Review-Url: https://codereview.chromium.org/2473613004 Cr-Commit-Position: refs/heads/master@{#429708} [modify] https://crrev.com/576a434b15d3240cd3c3ffd4f8897b4e60d81fe8/third_party/WebKit/LayoutTests/RandomOrderExpectations
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 9 2016
,
Nov 13 2016
,
Nov 13 2016
,
Nov 13 2016
,
Nov 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b3af257e85b3f5b3c46049db264353ba97a94def commit b3af257e85b3f5b3c46049db264353ba97a94def Author: jeffcarp <jeffcarp@chromium.org> Date: Mon Nov 14 22:48:27 2016 Add corresponding bugs to RandomOrderExpectations BUG= 601332 Review-Url: https://codereview.chromium.org/2502773002 Cr-Commit-Position: refs/heads/master@{#431932} [modify] https://crrev.com/b3af257e85b3f5b3c46049db264353ba97a94def/third_party/WebKit/LayoutTests/RandomOrderExpectations
,
Nov 15 2016
,
Nov 15 2016
,
Nov 15 2016
,
Nov 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8d653d9454503da0e54c90fc583e72fb025a7f23 commit 8d653d9454503da0e54c90fc583e72fb025a7f23 Author: jeffcarp <jeffcarp@chromium.org> Date: Mon Nov 21 19:58:51 2016 Remove http/tests/inspector/network/har-content.html from RandomOrderExpectations A CL was landed to fix the flakiness: https://crrev.com/2503573002 BUG= 601332 Review-Url: https://codereview.chromium.org/2499153003 Cr-Commit-Position: refs/heads/master@{#433614} [modify] https://crrev.com/8d653d9454503da0e54c90fc583e72fb025a7f23/third_party/WebKit/LayoutTests/RandomOrderExpectations
,
Nov 22 2016
,
Nov 22 2016
,
Nov 29 2016
,
Nov 29 2016
,
Nov 30 2016
,
Nov 30 2016
,
Dec 2 2016
,
Dec 2 2016
,
Dec 2 2016
,
Dec 2 2016
,
Dec 5 2016
,
Dec 6 2016
,
Dec 6 2016
,
Dec 6 2016
,
Dec 6 2016
,
Dec 6 2016
,
Dec 6 2016
,
Dec 7 2016
,
Dec 9 2016
,
Dec 14 2016
,
Dec 15 2016
,
Dec 15 2016
,
Dec 16 2016
,
Dec 17 2016
LayoutTests on Linux and Mac moved to random order by default today!! 🎉 Next order of business is rolling out Windows and Android.
,
Feb 6 2017
@jeffcarp: It was my understanding that windows is now done. Is this bug out of date?
,
Feb 6 2017
@mcgreevy: yes that is correct, I just didn't update this bug. The LayoutTests now work in random order but there are still a bunch of bugs blocking this one (they're tests that were disabled due to random order flakiness). Not sure if this bug should be marked Fixed or if we should update the title to be more specific (e.g. 'Resolve random order flakiness in LayoutTests').
,
Feb 7 2017
As best I can tell, the change that's landed makes it so that we run test within a directory in random order, but not actually in a globally random order. This doesn't match what swarming will do since swarming will run them in a globally random order, right? I think in order to get the behavior of swarming, you'd also have to run with --fully-parallel, which will increase flakiness a lot.
,
Feb 7 2017
Thanks for catching that, I'll start investigating --fully-parallel.
,
Feb 7 2017
Tim and I spoke in person about whether we want to set up new FYI bots for this. We think it makes the most sense to continue converting the RandomOrder FYI bots to swarming and catch flakiness issues there. This saves us time for two reasons: 1. If we tried to address --fully-parallel flakiness before moving to swarming, we'd still need to address (the likely) flakiness issues after moving to swarming. Even if a test is flaky only because it's running on swarming, we can't ignore it because swarming is our eventual goal. 2. Converting the RandomOrder FYI bots to swarming will decrease the iteration time for catching flakiness. Even if results need to be confirmed locally, we will run into issues faster and be able to confirm fixes faster on swarming. Let me know if you disagree or see any problems with this approach.
,
Feb 7 2017
Have you tried a test run with --fully-parallel to get an anecdotal sense of how significant the increase in flakiness would be?
,
Feb 7 2017
It seems to me like life would be much easier if we knew that fully random worked properly before trying to get swarming working, so it seems to me you should at least flip the switch and see what happens to see how bad things are first. (Also, apologies for missing the fact that the sharding happened after the tests were shuffled before :().
,
Feb 8 2017
Just to be clear, there are *two* levels of sharding going on here. There is; (a) the swarming/top-level sharding (b) the multi-process sharding which splits things up for running in multiple processes on a single machine. The code for (a) is taking the *complete* list of tests, shuffling them and then splitting them into equal chunks (which is what we want for swarming). The code in (b) is then taking the chunk and regrouping them by directory before handing them to a local test runner. The side effect of (b) is that currently tests are only running in random order *inside a directory*. Directories are still run in sorted order.
,
Feb 8 2017
To "simulate" the effect that swarming will have on the output, we can do; ------------------------- for i in 0 1 2 3 4 5; do GTEST_RANDOM_SEED=1 GTEST_SHARD_INDEX=$i GTEST_TOTAL_SHARDS=6 <layout test command> done -------------------------
,
Feb 8 2017
While I agree that we probably want to separate "running on swarming" from "running in random order" issues, the fact that it takes ~1 hour to do the run locally and <10m to do a run on swarming means we can probably get some initial data using swarming directly first.
,
Feb 8 2017
Here are my test results. Without --fully-parallel: Expected to timeout, but passed: (15) Expected to fail, but passed: (17) Regressions: Unexpected text-only failures (2) Regressions: Unexpected image-only failures (4) With --fully-parallel: Expected to timeout, but passed: (15) Expected to fail, but passed: (16) Regressions: Unexpected text-only failures (2) Regressions: Unexpected image-only failures (4) Ran against Release build of Chromium@184394b on Ubuntu. Run #2 failures ⊂ run #1 failures. Certainly on Windows or under swarming conditions we'd expect more flakiness, but these results indicate that there isn't a catastrophic increase in flakiness with --fully-parallel (either that or the flag isn't working properly). Also keep in mind that tests are re-tried 3 times individually after a test run. I believe this catches a lot of tests that would have failed otherwise (it looks like about 100 for each).
,
Feb 8 2017
Is there a large increase in the number of tests retried? If so, that implies an increase in expected flakiness. It seems to me like we don't need to argue about this. We can turn on --fully-parallel for 1-2 days, and then enable swarming.
,
Feb 8 2017
> We can turn on --fully-parallel for 1-2 days, and then enable swarming. +1. It's not like we're that close to enabling swarming anyway ...
,
Feb 8 2017
@ojan: there were actually fewer tests retired under --fully-parallel, 96 with vs. 107 without iirc. Not sure why. Enabling the flag on more builders might shed more light. @ojan & dpranke: Do you mean turn on --fully-parallel on the FYI builders or on all builders? Either way that sounds like a good plan and I can do that on Friday when I get back.
,
Feb 8 2017
I meant the FYI builders. If the FYI builders look good after a couple days (low flakiness and good cycle time), there's no reason not to enable it on all builders.
,
Feb 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/8b0d49e735c59bcef1d62a7e9f237a31bfc86315 commit 8b0d49e735c59bcef1d62a7e9f237a31bfc86315 Author: Jeff Carpenter <jeffcarp@chromium.org> Date: Fri Feb 10 18:06:18 2017 Add --fully-parallel to all RandomOrder FYI builders BUG= 601332 R=dpranke@chromium.org,qyearsley@chromium.org,tansell@chromium.org Change-Id: I71ebf5de295b0ee7770f80e7fe48bd07e02f478e Reviewed-on: https://chromium-review.googlesource.com/440448 Commit-Queue: Dirk Pranke <dpranke@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> [modify] https://crrev.com/8b0d49e735c59bcef1d62a7e9f237a31bfc86315/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Win___RandomOrder.json [modify] https://crrev.com/8b0d49e735c59bcef1d62a7e9f237a31bfc86315/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Linux___RandomOrder.json [modify] https://crrev.com/8b0d49e735c59bcef1d62a7e9f237a31bfc86315/scripts/slave/recipe_modules/chromium_tests/chromium_fyi.py [modify] https://crrev.com/8b0d49e735c59bcef1d62a7e9f237a31bfc86315/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Mac___RandomOrder.json
,
Feb 14 2017
@jeffcarp: How is --fully-parallel working out?
,
Feb 14 2017
The RandomOrder FYI builders moved to --fully-parallel at 10:10am PST on 10/2/17. Since then: Linux - Looks relatively stable, with around the same number of failures - https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Linux%20-%20RandomOrder/?limit=200 Mac - Is having infra failures I'm looking into - https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Mac%20-%20RandomOrder/?limit=200 Windows - Has a number of flaky failing tests, listed below - https://luci-milo.appspot.com/buildbot/chromium.fyi/WebKit%20Win%20-%20RandomOrder/?limit=200 Out of 53 test runs on WebKit Win - RandomOrder: fast/css-grid-layout/grid-align-justify-stretch-with-orthogonal-flows.html failures: 48 fast/css/fontfaceset-check-platform-fonts.html failures: 32 virtual/mojo-loading/http/tests/security/suborigins/suborigin-invalid-options.html failures: 30 http/tests/security/cross-frame-mouse-source-capabilities.html failures: 22 virtual/mojo-loading/http/tests/security/cross-frame-mouse-source-capabilities.html failures: 9 fast/overflow/overflow-height-float-not-removed-crash3.html failures: 5 fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks-2.html failures: 4 fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html failures: 4 virtual/mojo-loading/http/tests/webfont/font-display-intervention.html failures: 2 svg/text/text-selection-ws-01-t.svg failures: 1 virtual/mojo-loading/http/tests/loading/preload-img-test.html failures: 1 virtual/mojo-loading/http/tests/misc/client-hints-accept-meta-preloader.html failures: 1 virtual/mojo-loading/http/tests/webfont/font-display.html failures: 1 fast/table/percent-height-overflow-visible-content-in-cell.html failures: 1 http/tests/misc/client-hints-accept-meta-preloader.html failures: 1 external/wpt/pointerevents/pointerevent_pointerout_received_once-manual.html failures: 1
,
Feb 14 2017
1. --fully-parallel still doesn't shard virtual test suites. This is on purpose because switching between virtual tests suites needs to restart content_shell with different commandline arguments, so it slow. Is there a plan for how to handle this efficiently with swarming? 2. All the non-virtual tests have a "killing secondary driver" line right after it. 3. I suggested in comment 88 that you double check that cycle time didn't regress. You should still do that. As best I can tell, the webkit_tests step takes 3x longer now.
,
Feb 14 2017
4. The amount of retrying seems to go from ~40-->~100 tests now.
,
Feb 14 2017
Looks like we kill the secondary driver at the end of each test shard. That's not gonna work so well with randomly ordered tests ...
,
Feb 14 2017
> Is there a plan for how to handle this efficiently with swarming? Not that I'm aware of. What do you suggest? > As best I can tell, the webkit_tests step takes 3x longer now. I'll pull some numbers on cycle time + number of retries. ojan + dpranke: do you think this means we'll need to do some work to change how tests are sharded on swarming? It's pretty clear now that this bug can't be closed simply by adding more lines to TestExpectations.
,
Feb 14 2017
I think most of the performance/cycle time issues go away if we're not actually truly running things in random order. So I think we're fine to proceed w/ swarming stuff, and should worry about flakiness concurrently w/ proceeding on swarming.
,
Feb 15 2017
FYI, I'm unsubscribing from this bug. So, if you want my feedback you'll need to contact me directly.
,
Feb 16 2017
Tim and I met with Ojan on Tuesday and discussed this. The results basically echo Dirk in #96. Here's our game plan: Problem: --fully-parallel restarts non-virtual content-shells every single test - This is because --fully-parallel puts all non-virtual tests into a function LayoutTestRunner._shard_every_file and the content shell is restarted after every shard (where every shard is 1 test big). Tim filed a bug 692866 to address this. Problem: Cycle time has regressed 3x - Tim and I looked at a bunch of logs and determined that the cause behind the explosion in cycle time was the previous issue (restarting content-shell after every non-virtual test). It will be resolved with bug 692866 . Since we will not be running under --fully-parallel on swarming, this should not be an issue, however we will keep a close eye on cycle time going forward. Problem: The number of retries is higher with --fully-parallel - This is something we'll also be keeping a close eye on as we move the FYI bots to swarming. With a quicker cycle time on swarming we should be able to address flakiness problems faster and get more data. Problem: --fully-parallel doesn't shard virtual test suites - This is not an issue (that's what I have from my notes but I can't recall the reasoning behind it) I also pulled the numbers from #95 and Ojan's numbers in #92 and #93 are accurate.
,
Feb 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/852c6f0043c86ca5a66549cee5cbb0f54f371e04 commit 852c6f0043c86ca5a66549cee5cbb0f54f371e04 Author: Tim 'mithro' Ansell <tansell@chromium.org> Date: Thu Feb 16 23:32:34 2017 chromium.fyi: Remove --fully-parallel from RandomOrder bots. Also adds --seed=4 which should allow us to compare local run to the run on swarming. BUG= 692866 , 601332 Change-Id: Ide57008d2b15c9eb32c18afe351f0f1ff881f838 Reviewed-on: https://chromium-review.googlesource.com/444424 Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> Reviewed-by: smut <smut@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Tim 'mithro' Ansell <tansell@chromium.org> [modify] https://crrev.com/852c6f0043c86ca5a66549cee5cbb0f54f371e04/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Win___RandomOrder.json [modify] https://crrev.com/852c6f0043c86ca5a66549cee5cbb0f54f371e04/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Linux___RandomOrder.json [modify] https://crrev.com/852c6f0043c86ca5a66549cee5cbb0f54f371e04/scripts/slave/recipe_modules/chromium_tests/chromium_fyi.py [modify] https://crrev.com/852c6f0043c86ca5a66549cee5cbb0f54f371e04/scripts/slave/recipes/chromium.expected/full_chromium_fyi_WebKit_Mac___RandomOrder.json
,
Feb 22 2017
We've observed failures of these two tests: fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks-2.html https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win7%20%28dbg%29/builds/9095 https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win7%20%28dbg%29/builds/9096 Could it be related to this issue?
,
Feb 23 2017
I too just discovered that fast/block/float/float-should-dirty-line-when-adjacent-to-line-breaks.html is really flaky. It has nothing to do with this issue, though. Opening the test locally in content_shell or even Chrome, and then reloading a few times, you'll see that the test fails quite often (i.e. all the words end up on the same line). I've reported bug 695378 for this.
,
Feb 23 2017
,
May 2 2017
I'm going to close this bug as we now run Layout Tests in random order by default. The Layout Tests are also running on swarming in this mode.
,
Oct 11
Showing comments 5 - 104
of 104
Older ›
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||