run-webkit-tests times out but content_shell --run-layout-test doesn't |
||||||||
Issue descriptionWith a ToT chromium build from today, run-webkit-tests webaudio/BiquadFilter/biquadfilternode-basic.html will often time out; the run time is approximately 6 sec or so. However, doing content_shell --run-layout-test http://localhost:7000/webaudio/BiquadFilter/biquadfilternode-basic.html run the same test in well under one second. (The web server is a simple-http-server on port 7000 started in third_party/WebKit/LayoutTests.) I only see run-webkit-tests timing out locally on my Z620 linux machine. I don't see this on mac or windows. The bots seem perfectly happy too, and there are no reports of flakiness there. And it's not just me; I've had a few other people run the webaudio layout tests locally and they also get 30-40 tests timing out.
,
May 22 2017
,
Jun 20 2017
I reran my test suite with long timeout. All tests finish in either 11s, or 21s, as if something is throttling the events:
src> python run-webkit-tests media/W3C/video/events --timing --time-out-ms=25000 -f -v
Found 23 tests; running 23, skipping 0.
Running 23 content_shells in parallel.
[1/23] media/W3C/video/events/event_pause_manual.html passed 11.5700s
[2/23] media/W3C/video/events/event_play_manual.html passed 11.7764s
[3/23] media/W3C/video/events/event_progress_manual.html passed 11.7165s
[4/23] media/W3C/video/events/event_loadedmetadata_manual.html passed 11.0386s
[5/23] media/W3C/video/events/event_loadedmetadata.html passed 11.1401s
[6/23] media/W3C/video/events/event_order_loadstart_progress.html passed 11.5358s
[7/23] media/W3C/video/events/event_progress.html passed 11.5586s
[8/23] media/W3C/video/events/event_loadstart.html passed 11.6779s
[9/23] media/W3C/video/events/event_loadstart_manual.html passed 11.1886s
[10/23] media/W3C/video/events/event_order_canplay_canplaythrough.html passed 21.9149s
[11/23] media/W3C/video/events/event_order_canplay_playing.html passed 21.8309s
[12/23] media/W3C/video/events/event_canplay.html passed 21.7224s
[13/23] media/W3C/video/events/event_playing.html passed 21.3382s
[14/23] media/W3C/video/events/event_playing_manual.html passed 21.1872s
[15/23] media/W3C/video/events/event_loadeddata_manual.html passed 21.1322s
[16/23] media/W3C/video/events/event_loadeddata.html passed 21.1453s
[17/23] media/W3C/video/events/event_canplay_manual.html passed 21.0199s
[18/23] media/W3C/video/events/event_order_loadedmetadata_loadeddata.html passed 21.7323s
[19/23] media/W3C/video/events/event_canplaythrough_manual.html passed 21.7305s
[20/23] media/W3C/video/events/event_canplaythrough.html passed 21.5195s
[21/23] media/W3C/video/events/event_play.html passed 21.2370s
[22/23] media/W3C/video/events/event_timeupdate.html passed 21.9379s
[23/23] media/W3C/video/events/event_timeupdate_manual.html passed 21.4469s
,
Jun 20 2017
Maybe something related to wakeup-based budget throttling. It landed April/May which is about the right timeline: https://bugs.chromium.org/p/chromium/issues/detail?id=699541
,
Jun 22 2017
I've spent some time investigating. What I found is weird:
- running tests fully parallel is slower than sequential
- running tests fully parallel has strange timing gap for stderr.#EOF
- running tests from content_shell + command line is 2x faster
My test setup was to run:
python third_party/WebKit/Tools/Scripts/run-webkit-tests --target=Optimized --timing --fully-parallel -v --no-retry-failures media/W3C/video/events
When running fully parallel, all tests would time out.
When running sequential, tests would run fine, most under 2s.
Did a lot of tracing. What I was seeing when running 23 tests in parallel was:
Test starts:
1498093414.42 media/W3C/video/events/event_loadstart.html'--pixel-test
5s later we get output
1498093419.51 _read ('Content-Type: text/plain\n', None)
output completes immediatiely
1498093419.51 _read_stdout ('#EOF\n', None)
but getting error #EOF takes another 6 seconds
1498093426.14 _read_err #EOF
The time gap between stdout and stderr makes no sense, since the shell code prints both at the same time:
https://cs.chromium.org/chromium/src/content/shell/browser/layout_test/blink_test_controller.cc?dr=C&l=397
If running only two tests in parallel, this gap is smaller.
In real life, this happens even when tests are not being run fully parallel, all you need is lots of tests.
Also, running these tests from command line is much faster, both tests complete in < 3s.
Here are the timing runs:
Running 1 content_shell.
[1/2] media/W3C/video/events/event_loadeddata.html passed 4.1761s
[2/2] media/W3C/video/events/event_timeupdate.html passed 2.0652s
Running 2 content_shells in parallel.
[1/2] media/W3C/video/events/event_loadeddata.html passed 5.8611s
[2/2] media/W3C/video/events/event_timeupdate.html passed 7.8054s
content-shell --run-layout-test media/W3C/video/events/event_loadeddata.html
real 0m2.535s
user 0m0.404s
sys 0m0.352s
content-shell --run-layout-test media/W3C/video/events/event_timeupdate.html
real 0m2.476s
user 0m0.400s
sys 0m0.444s
Even running all 23 tests from command line was much faster.
,
Jun 22 2017
this makes it sound like there's a bug in the select() code reading from the pipes to the subprocesses, and maybe something isn't waking up promptly.
,
Jun 22 2017
I mucked around some more, and there is a 1-line change that makes everything work. It removes the env parameter from popen in server_process.py. I tried doing this because I thought the problem was buffering of stdin/err. One thing led to another... Removing env from popen makes test very fast. My fully parallel tests completes in 7s. Ideal solution would be to figure out what env variables are missing, and just add those, instead of popen using parent's env. I can now run the entire LayoutTests in < 10min. I still end up with 85 unexpected timeouts, almost all of them have media_source or media in their name. My standard, fully parallel test results: [1/23] media/W3C/video/events/event_loadedmetadata_manual.html passed 1.8928s [2/23] media/W3C/video/events/event_loadstart_manual.html passed 1.7858s [3/23] media/W3C/video/events/event_order_canplay_canplaythrough.html passed 1.7595s [4/23] media/W3C/video/events/event_timeupdate_manual.html passed 2.0621s [5/23] media/W3C/video/events/event_play_manual.html passed 1.8592s [6/23] media/W3C/video/events/event_loadeddata_manual.html passed 2.1456s [7/23] media/W3C/video/events/event_loadedmetadata.html passed 1.9456s [8/23] media/W3C/video/events/event_playing_manual.html passed 1.5783s [9/23] media/W3C/video/events/event_progress.html passed 0.6656s [10/23] media/W3C/video/events/event_order_loadstart_progress.html passed 1.3214s [11/23] media/W3C/video/events/event_loadstart.html passed 0.7237s [12/23] media/W3C/video/events/event_order_loadedmetadata_loadeddata.html passed 1.3355s [13/23] media/W3C/video/events/event_playing.html passed 1.2448s [14/23] media/W3C/video/events/event_play.html passed 1.3573s [15/23] media/W3C/video/events/event_loadeddata.html passed 1.2315s [16/23] media/W3C/video/events/event_order_canplay_playing.html passed 1.2831s [17/23] media/W3C/video/events/event_canplaythrough_manual.html passed 1.2938s [18/23] media/W3C/video/events/event_canplay_manual.html passed 0.9186s [19/23] media/W3C/video/events/event_canplaythrough.html passed 0.8826s [20/23] media/W3C/video/events/event_pause_manual.html passed 0.8970s [21/23] media/W3C/video/events/event_progress_manual.html passed 0.8955s [22/23] media/W3C/video/events/event_canplay.html passed 0.9114s [23/23] media/W3C/video/events/event_timeupdate.html passed 1.2381s
,
Jun 26 2017
FWIW, I tried out this patch and I can now run the webaudio tests to completion on my own machine. This hasn't been possible for a month or two, so this is fantastic!
,
Jul 15 2017
,
Jul 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e10fa04d83dcaabd776804603b6fd186e5f21fc5 commit e10fa04d83dcaabd776804603b6fd186e5f21fc5 Author: Aleks Totic <atotic@chromium.org> Date: Sat Jul 15 05:10:54 2017 Fix slow tests on Linux, add XDG_RUNTIME_DIR to content_shell environment. They symptom was really slow pipes between content_shell and test runner. I did not know what XDG_RUNTIME_DIR was, so I looked it up. Seems relevant to the symptom. Curious about what the root cause was. https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html XDG_RUNTIME_DIR defines the base directory relative to which user-specific non-essential runtime files and other file objects (such as sockets, named pipes, ...) should be stored. The directory MUST be owned by the user, and he MUST be the only one having read and write access to it. Its Unix access mode MUST be 0700. The lifetime of the directory MUST be bound to the user being logged in. It MUST be created when the user first logs in and if the user fully logs out the directory MUST be removed. If the user logs in more than once he should get pointed to the same directory, and it is mandatory that the directory continues to exist from his first login to his last logout on the system, and not removed in between. Files in the directory MUST not survive reboot or a full logout/login cycle. The directory MUST be on a local file system and not shared with any other system. The directory MUST by fully-featured by the standards of the operating system. More specifically, on Unix-like operating systems AF_UNIX sockets, symbolic links, hard links, proper permissions, file locking, sparse files, memory mapping, file change notifications, a reliable hard link count must be supported, and no restrictions on the file name character set should be imposed. Files in this directory MAY be subjected to periodic clean-up. To ensure that your files are not removed, they should have their access time timestamp modified at least once every 6 hours of monotonic time or the 'sticky' bit should be set on the file. If $XDG_RUNTIME_DIR is not set applications should fall back to a replacement directory with similar capabilities and print a warning message. Applications should use this directory for communication and synchronization purposes and should not place larger files in it, since it might reside in runtime memory and cannot necessarily be swapped out to disk. Bug: 725229 Change-Id: Ia34c349c94dd44357c135b98deab1a119fa557a6 Reviewed-on: https://chromium-review.googlesource.com/572640 Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> Commit-Queue: Aleks Totic <atotic@chromium.org> Cr-Commit-Position: refs/heads/master@{#486980} [modify] https://crrev.com/e10fa04d83dcaabd776804603b6fd186e5f21fc5/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/port/base.py
,
Jul 17 2017
I *think* that the change to preserve XDG_RUNTIME_DIR seems to have fixed the problem for me; on my workstation I can now run the test fully parallel and (on a Z840, 48 content_shells in parallel), all of the layout tests run in about 8 minutes and 33 of them time out unexpectedly (much less than the hundreds before). Is this consistent with what others see?
,
Jul 17 2017
I can confirm that this change works GREAT with the webaudio tests. Before this change, I would pretty consistently get about 40 out of 240 tests timing out. And it pretty much didn't matter how what timeout I used for run-webkit-tests; the tests still pretty much timed out. Now every test runs without timing out, and it only takes a 10-15 sec to run all of the webaudio tests. Hurray!
,
Jul 24 2017
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by dpranke@chromium.org
, May 22 2017