New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 619993 link

Starred by 3 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug



Sign in to add a comment

net_unittests is very slow when run with no options (on linux)

Project Member Reported by rch@chromium.org, Jun 14 2016

Issue description

I noticed today that every 10th net_unittest seemed to take about 10 seconds when running with no arguments. I verified that when run with --bot-launcher-mode, I see the same behavior. This means that net_unittests take more than 10 minutes on my z620. If I run with --single-process-tests it takes a bit less than 10 minutes. This is super surprising because when I run with no args, or with --bot-launcher-mode, I get a message about running 32 parallel jobs. Since each job is only single threaded I would sure expect 32 jobs to beat out a single job.

If I add --test-launcher-batch-limit=100, it takes less than 2 minutes. This is a nice improvement over the single process variant, though I'm surprised that it's only 5x when we're running 32x as many jobs. Is this behavior expected?

I'm guessing the long pauses result from finishing and starting a batch. Is that right? Should we consider a different default batch size? It seems like the default is 10 tests, which seems to result in the per-batch overhead dominating the runtime. Any thoughts?

   10:47  ./out/Build/net_unittests 
   10:55  ./out/Build/net_unittests --test-launcher-bot-mode 
    9:21  ./out/Build/net_unittests --single-process-tests >& /tmp/log
    1:53  ./out/Build/net_unittests --test-launcher-batch-limit=100
    1:46  ./out/Build/net_unittests --test-launcher-batch-limit=1000

 
Cc: phajdan.jr@chromium.org
Cc: tansell@chromium.org
If this is on Linux, I wonder if this might be related to I wonder if maybe you're also being bitten by the pki issues in bug 595504 ...

Comment 3 by rch@chromium.org, Jun 15 2016

Dirk, any idea how I would confirm that bug 595504 is affecting me?
in https://bugs.chromium.org/p/chromium/issues/detail?id=595504#c33 the claim is made that renaming ~/.pki fixes the problem, so you could try that, I suppose.

Comment 5 by rch@chromium.org, Jun 15 2016

Ah, ok. So that definitely helped.  It sped up the default configuration by a factor of 2x. However, the limit of 100 is still much faster than the default so I think there must be something else also going on:

   5:47  ./out/Build/net_unittests
   1:38  ./out/Build/net_unittests --test-launcher-batch-limit=100
Chromecast team has noticed the net_unittests are also very slow, especially on our older devices (so slow that our test runners actually timeout and can't finish them). In the past we've increased the --test-launcher-batch-limit from 10 (default) to 25 to 50 (b/23156294)

But as the number of tests in net_unittests has grown to over 20,000 we've been timing out again.

Kind of an orthogonal request, but would the net OWNERs be willing to break up net_unittests into more discrete parts (maybe net_quic_unittests, net_spdy_unittests, etc. ?) so that net_unittests isn't such a monolith?





Comment 7 by rch@chromium.org, Jun 17 2016

Definitely orthogonal, but breaking up the test seems plausible. That being said, I wonder if the timeout logic is too aggressive on these slow machines?
Our current logic gives net_unittests 45 minutes to complete, which doesn't seem particularly aggressive for a unittest suite, but is not enough time on our slowest devices (newer devices run the full suite in ~17mins). 

It's also possible something with that type of device just isn't meshing well with what the net_unittests are trying to do and resulting in lots of overhead or something like that.

I don't want to derail this bug, though, but increasing the timeout time is something we'd consider if needed.

Comment 9 by mmoss@chromium.org, Jun 20 2016

Can someone clarify what the infra issue is here? Is there a timeout or parallelism flag that needs to be increased on some bots?
I think this is infra in the sense that phajdan.jr owns the test launcher code and it's not obvious what a better component for it would be.

Comment 11 by mmoss@chromium.org, Jun 20 2016

Cc: -phajdan.jr@chromium.org
Owner: phajdan.jr@chromium.org
Status: Assigned (was: Untriaged)
Cc: phajdan@google.com
I've been able to roughly reproduce the numbers and slowness.

One idea could be to allow a test suite to customize default value for batch size.

I'm thinking about adding some tracing or other visualization to better understand how net_unittests get parallelized, what's the bottleneck and what could be improved. It seems there are several 30s+ long tests, which might make it tricky.

Comment 13 by rch@chromium.org, Jun 24 2016

I did some testing running multiple instances of net_unittests --single-process-tests in parallel. As you would expect, the more jobs are running, the longer it takes to complete. But if you divide the total time by the number of jobs, that gives us a good estimate for how long it would take to run N jobs, each running 1/Nth of the tests.

Total  Num Jobs   Scaled Time
 9:47  1          9:47
12:00  10         1:12
14:32  20         0:43
17:37  32         0:33

When I run net_unittests --test-launcher-batch-limit=100, it takes 1:38. This is about 3x longer than it seems like it would take if I just divided the tests into 32 chunks and ran them all in parallel. I suspect this is an artifact of overhead that the test launch is adding, but as you say, it could be a problem with batch imbalance. Where does the test launcher live? Is there an easy way I could add a printf to it when it starts/ends a job to see if, for example, 31 jobs complete in ~30 seconds, but one jobs takes ~1:30 (which would suggest imbalance).
Test launcher code is in base/test/launcher .
Components: -Infra Infra>Platform>Buildbot

Comment 17 by rch@chromium.org, Aug 16 2016

Any update? Could we possibly change the default value of --test-launcher-batch-limit to something higher (like 1000) and if the bots need to use a smaller number, they could do that? (I'd happily send out a CL which changed kDefaultTestBatchLimit from 10 -> 1000)
My plan was to use above tracing support to figure out best next steps.

Blindly changing batch limit or even tuning it somehow isn't necessarily the best solution. Maybe we need to be smarter about how the tests are batched by the runner, or get rid of some very slow tests (or per-suite setups) in net_unittests.

Comment 19 by rch@chromium.org, Jan 11 2017

Do you have an ETA for when you expect to have a fix? Or perhaps when you expect to have some traces that will help identify the problem?
Cc: phajdan.jr@chromium.org
Components: -Infra>Platform>Buildbot
Owner: ----
Status: Available (was: Assigned)
Feel free to talk to my manager and TL about prioritizing this work.
Cc: -tansell@chromium.org
Project Member

Comment 22 by sheriffbot@chromium.org, Sep 6

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Sign in to add a comment