New issue
Advanced search Search tips

Issue 913156 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 3
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-12-18
OS: Fuchsia
Pri: 1
Type: Bug

Blocking:
issue 884299



Sign in to add a comment

SSLClientSocketReadTest.Read* tests flaked on Fuchsia due to too many spawned TestServers

Project Member Reported by w...@chromium.org, Dec 8

Issue description

In run
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6364
we observed a number of SSLClientSocketReadTest.Read* tests failing, as well as TLS13DowngradeMetricsTest.Metrics/0.  These tests all failed due to StartTestServer failing, e.g:

[ RUN      ] TLS13DowngradeMetricsTest.Metrics/0
[780975:1091581343:1207/090603.555187:731998945:ERROR:remote_test_server_spawner_request.cc(141)] Spawner server returned bad status: HTTP/1.0 400 Invalid request, Too many test servers running
[780975:1548239809:1207/090603.556853:732000611:ERROR:ssl_client_socket_unittest.cc(830)] Could not start SpawnedTestServer
../../net/socket/ssl_client_socket_unittest.cc:4996: Failure
Value of: StartTestServer(ssl_options)
  Actual: false
Expected: true
Stack trace:
#00: testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop(int) at gtest.cc:?
#01: testing::internal::AssertHelper::operator=(testing::Message const&) const at gtest.cc:?
#02: net::TLS13DowngradeMetricsTest_Metrics_Test::TestBody() at ssl_client_socket_unittest.cc:?

[  FAILED  ] TLS13DowngradeMetricsTest.Metrics/0, where GetParam() = 16-byte object <00-01 00-00 00-00 00-00 00-00 00-00 FF-FF FF-FF> (591 ms)
[14484/30496] TLS13DowngradeMetricsTest.Metrics/0 (591 ms)

 
Project Member

Comment 1 by bugdroid1@chromium.org, Dec 8

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ce988da67294d10b86ecd030bbb54c2be40a3061

commit ce988da67294d10b86ecd030bbb54c2be40a3061
Author: Wez <wez@chromium.org>
Date: Sat Dec 08 01:50:10 2018

[Fuchsia] Filter some more flaking net_unittests.

Bug:  913155 ,  913156 
Change-Id: If5804fb6c8b035a7d220e8eb629f0d6c05a90e0d
Reviewed-on: https://chromium-review.googlesource.com/c/1369044
Commit-Queue: Wez <wez@chromium.org>
Commit-Queue: Fabrice de Gans-Riberi <fdegans@chromium.org>
Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#614908}
[modify] https://crrev.com/ce988da67294d10b86ecd030bbb54c2be40a3061/testing/buildbot/filters/fuchsia.net_unittests.filter

This error may indicate that some other tests didn't shutdown the test server or some test tried to start more than one test server. It's likely that this particular test not the culprit.
Hmmm, I wonder if we may have had one or more "hung" or long-running tests that happened to all be using test servers?
Cc: sergeyu@chromium.org fdegans@chromium.org
Bunch more tests failed due to this in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6519
and
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6519

I'll revert the test filter and dig into the code a bit...
Project Member

Comment 5 by bugdroid1@chromium.org, Dec 11

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d377ecc5c538b5e5b2d227158f1d3a9665789c87

commit d377ecc5c538b5e5b2d227158f1d3a9665789c87
Author: Wez <wez@chromium.org>
Date: Tue Dec 11 19:05:56 2018

Revert "[Fuchsia] Filter some more flaking net_unittests."

This reverts commit ce988da67294d10b86ecd030bbb54c2be40a3061.

Reason for revert: SpawnTestServer issue continues, but now affects different tests, suggesting that this is an issue with some earlier/parallel test e.g. leaking a TestServer instance, rather than an issue with the tests that flake as a result.

Original change's description:
> [Fuchsia] Filter some more flaking net_unittests.
> 
> Bug:  913155 ,  913156 
> Change-Id: If5804fb6c8b035a7d220e8eb629f0d6c05a90e0d
> Reviewed-on: https://chromium-review.googlesource.com/c/1369044
> Commit-Queue: Wez <wez@chromium.org>
> Commit-Queue: Fabrice de Gans-Riberi <fdegans@chromium.org>
> Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#614908}

TBR=wez@chromium.org,fdegans@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  913155 ,  913156 
Change-Id: I9af3748ebfab26a67f77faf011b7bea526e6a93d
Reviewed-on: https://chromium-review.googlesource.com/c/1372315
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#615614}
[modify] https://crrev.com/d377ecc5c538b5e5b2d227158f1d3a9665789c87/testing/buildbot/filters/fuchsia.net_unittests.filter

This is likely due to a handle leak in sshd. Tracked in bug US-563.
NextAction: 2018-12-18
Status: ExternalDependency (was: Started)
The errors happen immediately after this error is logged: "accept: No file descriptors available", typically after the 240th port is mapped.
Cc: -sergeyu@chromium.org
Owner: sergeyu@chromium.org
I can reproduce handle leak with the following command:
  run_net_unittests --test-launcher-jobs=1 --gtest_filter="SSLClientSocketTest.Connect" --gtest_repeat=1000

The test start failing after about 250 iterations, i.e. something is leaking one FD every time the test is executed.
Project Member

Comment 10 by bugdroid1@chromium.org, Dec 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6f5ba5e7287f80c56f2fe72022f623c39299e825

commit 6f5ba5e7287f80c56f2fe72022f623c39299e825
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Thu Dec 13 00:46:43 2018

[Fuchsia] Fix test runner to shutdown forwarding socket

The runner script starts ssh port forwarding for the test server with
`ssh -O forward -R 0:localhost:<host_port>`. Then to stop it would run
`ssh -O cancel -R <device_port>:localhost:<host_port>`. That last
command succeed with exit code 0, but doesn't really cancel forwarding.
It appears to be a bug in ssh on the client side - it doesn't send
cancel-tcpip-forward command to sshd 🤷 . But, it works properly with
`-R 0:localhost:<host_port>`.

Bug:  913156 
Change-Id: I0668258d21c60e3992e27486494bd8da26c674d0
Reviewed-on: https://chromium-review.googlesource.com/c/1374771
Reviewed-by: Kevin Marshall <kmarshall@chromium.org>
Commit-Queue: Kevin Marshall <kmarshall@chromium.org>
Cr-Commit-Position: refs/heads/master@{#616125}
[modify] https://crrev.com/6f5ba5e7287f80c56f2fe72022f623c39299e825/build/fuchsia/net_test_server.py

The last change didn't really fix the original issue ("Too many test servers running") - net_unittests is still flaky, e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6589
Status: Started (was: ExternalDependency)
Project Member

Comment 13 by bugdroid1@chromium.org, Dec 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b194ab9a36a8178ed2ef41da9e109b880d5ac62b

commit b194ab9a36a8178ed2ef41da9e109b880d5ac62b
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Thu Dec 13 22:58:15 2018

Make falures to stop RemoteTestServer fatal.

On Fuchsia test bots we observe that net_unittests flake sometimes
because they exceed the limit on number of test server instances. This
suggests that test server is not always stopped properly. This change
makes these issues more visible.

Bug:  913156 
Change-Id: Idc14956659a461bf8851f0456416275e11ccc544
Reviewed-on: https://chromium-review.googlesource.com/c/1376609
Reviewed-by: Matt Menke <mmenke@chromium.org>
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#616466}
[modify] https://crrev.com/b194ab9a36a8178ed2ef41da9e109b880d5ac62b/net/test/spawned_test_server/remote_test_server.cc

With the change above I see the following failures locally when running net_unittests with --test-launcher-jobs=40 

[172196:785625791:1213/221004.747934:60516254:ERROR:remote_test_server_spawner_request.cc(138)] request failed, error: net::ERR_EMPTY_RESPONSE
[172196:746302695:1213/221004.749318:60517654:FATAL:remote_test_server.cc(185)] Failed stopping RemoteTestServer

The NextAction date has arrived: 2018-12-18
 Issue 917074  has been merged into this issue.
Blocking: 884299
Project Member

Comment 18 by bugdroid1@chromium.org, Dec 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/29b1fbd64174e3b3cd764723ee3085defdd149d9

commit 29b1fbd64174e3b3cd764723ee3085defdd149d9
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Fri Dec 21 00:55:45 2018

[Fuchsia] Fix runner scripts to stop test server spawner

Previously the net_unittests was not stopping test server spawner as
expected.
This change should help debug the linked bug: now if a test server is
not stopped by one of the tests it will be reported by the
SpawningServer.Stop() in the log.

Bug:  913156 
Change-Id: If4ad371f4a03b4f44ca4f8cb51b2c8a2a328f647
Reviewed-on: https://chromium-review.googlesource.com/c/1387810
Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#618410}
[modify] https://crrev.com/29b1fbd64174e3b3cd764723ee3085defdd149d9/build/fuchsia/test_runner.py

Cc: -fdegans@chromium.org sergeyu@chromium.org
Owner: fdegans@chromium.org
Looks like the problem is in the TestLauncher. Sometimes it runs more jobs than the limit specified with --test-launcher-jobs. It passes the specified limit as max_tasks for blocking worker pool in TaskScheduler. Problem is that max_tasks doesn't actually limit number of parallel tasks. It limits number of tasks that "aren't blocked", see https://codesearch.chromium.org/chromium/src/base/task/task_scheduler/scheduler_worker_pool_params.h?type=cs&sq=package:chromium&g=0&l=16
Fabrice, your change to increase number of test server instances should make net_unittests stable, so let's land it until the issue in TaskLauncher is fixed.
Filed bug 917216 for the TestLauncher issue.
Project Member

Comment 21 by bugdroid1@chromium.org, Dec 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e4de981fd3c40d782c78854867654854332880cd

commit e4de981fd3c40d782c78854867654854332880cd
Author: Fabrice de Gans-Riberi <fdegans@chromium.org>
Date: Fri Dec 21 04:57:45 2018

[Fuchsia] Increase the max number of spawned test server.

The TestLauncher can launch more jobs than the limit specified with
--test-launcher-jobs. This increased the max number of spawned test
servers to twice that limit.

TBR=sergeyu@chromium.org

Bug:  913156 
Change-Id: I45d58f2c375689acd6ccf3e94b87c6a403a6ed60
Reviewed-on: https://chromium-review.googlesource.com/c/1387400
Commit-Queue: Fabrice de Gans-Riberi <fdegans@chromium.org>
Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#618460}
[modify] https://crrev.com/e4de981fd3c40d782c78854867654854332880cd/build/fuchsia/net_test_server.py

Status: Fixed (was: Started)
This hasn't flaked in a while, closing.

Sign in to add a comment