Issue metadata
Sign in to add a comment
|
SSLClientSocketReadTest.Read* tests flaked on Fuchsia due to too many spawned TestServers |
||||||||||||||||||||||
Issue descriptionIn run https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6364 we observed a number of SSLClientSocketReadTest.Read* tests failing, as well as TLS13DowngradeMetricsTest.Metrics/0. These tests all failed due to StartTestServer failing, e.g: [ RUN ] TLS13DowngradeMetricsTest.Metrics/0 [780975:1091581343:1207/090603.555187:731998945:ERROR:remote_test_server_spawner_request.cc(141)] Spawner server returned bad status: HTTP/1.0 400 Invalid request, Too many test servers running [780975:1548239809:1207/090603.556853:732000611:ERROR:ssl_client_socket_unittest.cc(830)] Could not start SpawnedTestServer ../../net/socket/ssl_client_socket_unittest.cc:4996: Failure Value of: StartTestServer(ssl_options) Actual: false Expected: true Stack trace: #00: testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop(int) at gtest.cc:? #01: testing::internal::AssertHelper::operator=(testing::Message const&) const at gtest.cc:? #02: net::TLS13DowngradeMetricsTest_Metrics_Test::TestBody() at ssl_client_socket_unittest.cc:? [ FAILED ] TLS13DowngradeMetricsTest.Metrics/0, where GetParam() = 16-byte object <00-01 00-00 00-00 00-00 00-00 00-00 FF-FF FF-FF> (591 ms) [14484/30496] TLS13DowngradeMetricsTest.Metrics/0 (591 ms)
,
Dec 10
This error may indicate that some other tests didn't shutdown the test server or some test tried to start more than one test server. It's likely that this particular test not the culprit.
,
Dec 10
Hmmm, I wonder if we may have had one or more "hung" or long-running tests that happened to all be using test servers?
,
Dec 11
Bunch more tests failed due to this in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6519 and https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6519 I'll revert the test filter and dig into the code a bit...
,
Dec 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d377ecc5c538b5e5b2d227158f1d3a9665789c87 commit d377ecc5c538b5e5b2d227158f1d3a9665789c87 Author: Wez <wez@chromium.org> Date: Tue Dec 11 19:05:56 2018 Revert "[Fuchsia] Filter some more flaking net_unittests." This reverts commit ce988da67294d10b86ecd030bbb54c2be40a3061. Reason for revert: SpawnTestServer issue continues, but now affects different tests, suggesting that this is an issue with some earlier/parallel test e.g. leaking a TestServer instance, rather than an issue with the tests that flake as a result. Original change's description: > [Fuchsia] Filter some more flaking net_unittests. > > Bug: 913155 , 913156 > Change-Id: If5804fb6c8b035a7d220e8eb629f0d6c05a90e0d > Reviewed-on: https://chromium-review.googlesource.com/c/1369044 > Commit-Queue: Wez <wez@chromium.org> > Commit-Queue: Fabrice de Gans-Riberi <fdegans@chromium.org> > Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org> > Cr-Commit-Position: refs/heads/master@{#614908} TBR=wez@chromium.org,fdegans@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 913155 , 913156 Change-Id: I9af3748ebfab26a67f77faf011b7bea526e6a93d Reviewed-on: https://chromium-review.googlesource.com/c/1372315 Reviewed-by: Wez <wez@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#615614} [modify] https://crrev.com/d377ecc5c538b5e5b2d227158f1d3a9665789c87/testing/buildbot/filters/fuchsia.net_unittests.filter
,
Dec 11
This is likely due to a handle leak in sshd. Tracked in bug US-563.
,
Dec 11
The errors happen immediately after this error is logged: "accept: No file descriptors available", typically after the 240th port is mapped.
,
Dec 12
,
Dec 12
I can reproduce handle leak with the following command: run_net_unittests --test-launcher-jobs=1 --gtest_filter="SSLClientSocketTest.Connect" --gtest_repeat=1000 The test start failing after about 250 iterations, i.e. something is leaking one FD every time the test is executed.
,
Dec 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6f5ba5e7287f80c56f2fe72022f623c39299e825 commit 6f5ba5e7287f80c56f2fe72022f623c39299e825 Author: Sergey Ulanov <sergeyu@chromium.org> Date: Thu Dec 13 00:46:43 2018 [Fuchsia] Fix test runner to shutdown forwarding socket The runner script starts ssh port forwarding for the test server with `ssh -O forward -R 0:localhost:<host_port>`. Then to stop it would run `ssh -O cancel -R <device_port>:localhost:<host_port>`. That last command succeed with exit code 0, but doesn't really cancel forwarding. It appears to be a bug in ssh on the client side - it doesn't send cancel-tcpip-forward command to sshd 🤷 . But, it works properly with `-R 0:localhost:<host_port>`. Bug: 913156 Change-Id: I0668258d21c60e3992e27486494bd8da26c674d0 Reviewed-on: https://chromium-review.googlesource.com/c/1374771 Reviewed-by: Kevin Marshall <kmarshall@chromium.org> Commit-Queue: Kevin Marshall <kmarshall@chromium.org> Cr-Commit-Position: refs/heads/master@{#616125} [modify] https://crrev.com/6f5ba5e7287f80c56f2fe72022f623c39299e825/build/fuchsia/net_test_server.py
,
Dec 13
The last change didn't really fix the original issue ("Too many test servers running") - net_unittests is still flaky, e.g. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/6589
,
Dec 13
,
Dec 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b194ab9a36a8178ed2ef41da9e109b880d5ac62b commit b194ab9a36a8178ed2ef41da9e109b880d5ac62b Author: Sergey Ulanov <sergeyu@chromium.org> Date: Thu Dec 13 22:58:15 2018 Make falures to stop RemoteTestServer fatal. On Fuchsia test bots we observe that net_unittests flake sometimes because they exceed the limit on number of test server instances. This suggests that test server is not always stopped properly. This change makes these issues more visible. Bug: 913156 Change-Id: Idc14956659a461bf8851f0456416275e11ccc544 Reviewed-on: https://chromium-review.googlesource.com/c/1376609 Reviewed-by: Matt Menke <mmenke@chromium.org> Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Cr-Commit-Position: refs/heads/master@{#616466} [modify] https://crrev.com/b194ab9a36a8178ed2ef41da9e109b880d5ac62b/net/test/spawned_test_server/remote_test_server.cc
,
Dec 13
With the change above I see the following failures locally when running net_unittests with --test-launcher-jobs=40 [172196:785625791:1213/221004.747934:60516254:ERROR:remote_test_server_spawner_request.cc(138)] request failed, error: net::ERR_EMPTY_RESPONSE [172196:746302695:1213/221004.749318:60517654:FATAL:remote_test_server.cc(185)] Failed stopping RemoteTestServer
,
Dec 18
The NextAction date has arrived: 2018-12-18
,
Dec 20
Issue 917074 has been merged into this issue.
,
Dec 20
,
Dec 21
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/29b1fbd64174e3b3cd764723ee3085defdd149d9 commit 29b1fbd64174e3b3cd764723ee3085defdd149d9 Author: Sergey Ulanov <sergeyu@chromium.org> Date: Fri Dec 21 00:55:45 2018 [Fuchsia] Fix runner scripts to stop test server spawner Previously the net_unittests was not stopping test server spawner as expected. This change should help debug the linked bug: now if a test server is not stopped by one of the tests it will be reported by the SpawningServer.Stop() in the log. Bug: 913156 Change-Id: If4ad371f4a03b4f44ca4f8cb51b2c8a2a328f647 Reviewed-on: https://chromium-review.googlesource.com/c/1387810 Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org> Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Cr-Commit-Position: refs/heads/master@{#618410} [modify] https://crrev.com/29b1fbd64174e3b3cd764723ee3085defdd149d9/build/fuchsia/test_runner.py
,
Dec 21
Looks like the problem is in the TestLauncher. Sometimes it runs more jobs than the limit specified with --test-launcher-jobs. It passes the specified limit as max_tasks for blocking worker pool in TaskScheduler. Problem is that max_tasks doesn't actually limit number of parallel tasks. It limits number of tasks that "aren't blocked", see https://codesearch.chromium.org/chromium/src/base/task/task_scheduler/scheduler_worker_pool_params.h?type=cs&sq=package:chromium&g=0&l=16 Fabrice, your change to increase number of test server instances should make net_unittests stable, so let's land it until the issue in TaskLauncher is fixed.
,
Dec 21
Filed bug 917216 for the TestLauncher issue.
,
Dec 21
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e4de981fd3c40d782c78854867654854332880cd commit e4de981fd3c40d782c78854867654854332880cd Author: Fabrice de Gans-Riberi <fdegans@chromium.org> Date: Fri Dec 21 04:57:45 2018 [Fuchsia] Increase the max number of spawned test server. The TestLauncher can launch more jobs than the limit specified with --test-launcher-jobs. This increased the max number of spawned test servers to twice that limit. TBR=sergeyu@chromium.org Bug: 913156 Change-Id: I45d58f2c375689acd6ccf3e94b87c6a403a6ed60 Reviewed-on: https://chromium-review.googlesource.com/c/1387400 Commit-Queue: Fabrice de Gans-Riberi <fdegans@chromium.org> Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org> Cr-Commit-Position: refs/heads/master@{#618460} [modify] https://crrev.com/e4de981fd3c40d782c78854867654854332880cd/build/fuchsia/net_test_server.py
,
Jan 3
This hasn't flaked in a while, closing. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by bugdroid1@chromium.org
, Dec 8