net_unittests is very slow when run with no options (on Linux)

Issue description
I noticed today that every 10th test in net_unittests seemed to take about 10 seconds when running with no arguments. I verified that I see the same behavior when running with --test-launcher-bot-mode. This means that net_unittests takes more than 10 minutes on my z620. If I run with --single-process-tests it takes a bit less than 10 minutes. This is very surprising, because when I run with no args, or with --test-launcher-bot-mode, I get a message about running 32 parallel jobs. Since each job is only single-threaded, I would certainly expect 32 jobs to beat out a single job.
If I add --test-launcher-batch-limit=100, it takes less than 2 minutes. This is a nice improvement over the single-process variant, though I'm surprised that it's only ~5x faster when we're running 32 parallel jobs. Is this behavior expected?
I'm guessing the long pauses result from finishing and starting a batch. Is that right? Should we consider a different default batch size? It seems like the default is 10 tests, which seems to result in the per-batch overhead dominating the runtime. Any thoughts?
10:47 ./out/Build/net_unittests
10:55 ./out/Build/net_unittests --test-launcher-bot-mode
9:21 ./out/Build/net_unittests --single-process-tests >& /tmp/log
1:53 ./out/Build/net_unittests --test-launcher-batch-limit=100
1:46 ./out/Build/net_unittests --test-launcher-batch-limit=1000
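A rough back-of-the-envelope model shows why a small batch size could let per-batch overhead dominate. All constants here (per-test time, per-batch startup cost) are illustrative guesses, not measurements; the point is only the trend, which matches the timings above (batch=10 slow, batch=100 and batch=1000 nearly identical):

```python
# Rough model of test-launcher runtime: N tests split into batches of
# size B, executed by J parallel jobs. Each batch pays a fixed startup
# cost on top of the per-test time. Perfect load balance is assumed.

def estimated_runtime(num_tests, batch_size, jobs,
                      per_test_s=0.05, per_batch_overhead_s=2.0):
    batches = -(-num_tests // batch_size)  # ceiling division
    total_work_s = num_tests * per_test_s + batches * per_batch_overhead_s
    return total_work_s / jobs

for batch in (10, 100, 1000):
    t = estimated_runtime(20_000, batch, 32)
    print(f"batch={batch:5d}: ~{t / 60:.1f} min")
```

With these assumed constants, batch=10 spends 4x more time on batch overhead than on the tests themselves, while batch=100 and batch=1000 are within seconds of each other, which is consistent with the measured 1:53 vs 1:46.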
Jun 14 2016
If this is on Linux, I wonder if you're also being bitten by the pki issues in bug 595504 ...
Jun 15 2016
Dirk, any idea how I would confirm that bug 595504 is affecting me?
Jun 15 2016
in https://bugs.chromium.org/p/chromium/issues/detail?id=595504#c33 the claim is made that renaming ~/.pki fixes the problem, so you could try that, I suppose.
Jun 15 2016
Ah, ok. So that definitely helped: it sped up the default configuration by about 2x. However, the limit of 100 is still much faster than the default, so I think there must be something else also going on:
5:47 ./out/Build/net_unittests
1:38 ./out/Build/net_unittests --test-launcher-batch-limit=100
Jun 17 2016
The Chromecast team has noticed that net_unittests is also very slow, especially on our older devices (so slow that our test runners actually time out and can't finish it). In the past we've increased --test-launcher-batch-limit from the default of 10 to 25, then to 50 (b/23156294). But as the number of tests in net_unittests has grown to over 20,000, we've been timing out again. Kind of an orthogonal request, but would the net OWNERS be willing to break up net_unittests into more discrete parts (maybe net_quic_unittests, net_spdy_unittests, etc.?) so that net_unittests isn't such a monolith?
Jun 17 2016
Definitely orthogonal, but breaking up the test seems plausible. That being said, I wonder if the timeout logic is too aggressive on these slow machines?
Jun 17 2016
Our current logic gives net_unittests 45 minutes to complete, which doesn't seem particularly aggressive for a unittest suite, but it is not enough time on our slowest devices (newer devices run the full suite in ~17 minutes). It's also possible that something about that type of device just isn't meshing well with what the net_unittests are trying to do, resulting in lots of overhead. I don't want to derail this bug, though, but increasing the timeout is something we'd consider if needed.
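A quick sanity check on those numbers (the ~20,000-test count comes from the earlier comment; this simple arithmetic ignores launcher parallelism and overhead, so it is only a ballpark):

```python
# Per-test time implied by the reported wall-clock numbers, assuming
# roughly 20,000 tests share the budget evenly.
num_tests = 20_000
timeout_budget_s = 45 * 60   # 45-minute limit that the slow devices exceed
newer_device_s = 17 * 60     # ~17 min observed full-suite time on newer devices

print(f"budget per test: {timeout_budget_s / num_tests * 1000:.0f} ms")
print(f"implied per test (newer devices): {newer_device_s / num_tests * 1000:.0f} ms")
```

That works out to a budget of about 135 ms per test, versus roughly 51 ms implied on the newer devices, so the slow devices only need to be ~3x slower per test to blow the 45-minute limit.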
Jun 20 2016
Can someone clarify what the infra issue is here? Is there a timeout or parallelism flag that needs to be increased on some bots?
Jun 20 2016
I think this is infra in the sense that phajdan.jr owns the test launcher code and it's not obvious what a better component for it would be.
Jun 24 2016
I've been able to roughly reproduce the numbers and the slowness. One idea could be to allow a test suite to customize the default value for batch size. I'm thinking about adding some tracing or other visualization to better understand how net_unittests gets parallelized, what the bottleneck is, and what could be improved. It seems there are several 30s+ long tests, which might make it tricky.
Jun 24 2016
I did some testing running multiple instances of net_unittests --single-process-tests in parallel. As you would expect, the more jobs are running, the longer it takes to complete. But if you divide the total time by the number of jobs, that gives us a good estimate for how long it would take to run N jobs, each running 1/Nth of the tests.

Total    Num Jobs    Scaled Time
9:47     1           9:47
12:00    10          1:12
14:32    20          0:43
17:37    32          0:33

When I run net_unittests --test-launcher-batch-limit=100, it takes 1:38. This is about 3x longer than it seems like it should take if I just divided the tests into 32 chunks and ran them all in parallel. I suspect this is an artifact of overhead that the test launcher is adding, but as you say, it could be a problem with batch imbalance. Where does the test launcher live? Is there an easy way I could add a printf to it when it starts/ends a job, to see if, for example, 31 jobs complete in ~30 seconds but one job takes ~1:30 (which would suggest imbalance)?
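The "scaled time" figures above are just the total wall-clock time divided by the job count; a quick script (with the totals copied from the measurements) confirms the arithmetic:

```python
# Verify the scaled-time estimates: total wall-clock time divided by
# the number of parallel jobs, formatted back to mm:ss.

def mmss_to_s(t):
    m, s = t.split(":")
    return int(m) * 60 + int(s)

def s_to_mmss(s):
    return f"{int(s) // 60}:{int(s) % 60:02d}"

for total, jobs in [("9:47", 1), ("12:00", 10), ("14:32", 20), ("17:37", 32)]:
    print(f"{jobs:2d} jobs: {s_to_mmss(mmss_to_s(total) / jobs)}")
```

This reproduces the 0:33 figure for 32 jobs, which is what makes the measured 1:38 at batch-limit=100 look ~3x worse than ideal parallelism.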
Jun 24 2016
The test launcher code is in base/test/launcher.
Jul 1 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/29ec06ac0501b79dc08de753ee9c52b5db7369d1

commit 29ec06ac0501b79dc08de753ee9c52b5db7369d1
Author: phajdan.jr <phajdan.jr@chromium.org>
Date: Fri Jul 01 14:11:04 2016

Add basic tracing support to the gtest launcher

R=primiano
TBR=thakis
BUG=619993
Review-Url: https://codereview.chromium.org/2112253002
Cr-Commit-Position: refs/heads/master@{#403449}

[modify] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/base.gyp
[modify] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/BUILD.gn
[modify] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/launcher/test_launcher.cc
[add] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/launcher/test_launcher_tracer.cc
[add] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/launcher/test_launcher_tracer.h
[modify] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/test_switches.cc
[modify] https://crrev.com/29ec06ac0501b79dc08de753ee9c52b5db7369d1/base/test/test_switches.h
Aug 16 2016
Any update? Could we change the default value of --test-launcher-batch-limit to something higher (like 1000), and if the bots need a smaller number, they could set one? (I'd happily send out a CL changing kDefaultTestBatchLimit from 10 to 1000.)
Jan 4 2017
My plan was to use the above tracing support to figure out the best next steps. Blindly changing the batch limit, or even tuning it somehow, isn't necessarily the best solution. Maybe we need to be smarter about how the tests are batched by the runner, or get rid of some very slow tests (or per-suite setups) in net_unittests.
Jan 11 2017
Do you have an ETA for when you expect to have a fix? Or perhaps when you expect to have some traces that will help identify the problem?
Mar 10 2017
Feel free to talk to my manager and TL about prioritizing this work.
Sep 6 2017
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by dpranke@chromium.org, Jun 14 2016