net_unittests flaky on android_n5x_swarming_rel (due to test server not starting)
Issue description

net_unittests is failing often on the android_n5x_swarming_rel tryserver, causing rejection of valid CLs. There have been previous bugs filed about this, but none seem to be actively investigated, nor is the severity properly indicated. At this point the failure rate appears to be ~10% (18 builds out of 200) on https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 ; I'm not sure how many of the passing jobs actually ran net_unittests, so the rate might be higher.

The failure modes are twofold. Either it fails many of the SSLClientSocketTest tests, like:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68292
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68287
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68280

with the following error:

SSLClientSocketTest.ConnectMismatched (run #3):
[ RUN ] SSLClientSocketTest.ConnectMismatched
[ERROR:spawner_communicator.cc(251)] request failed, error: -324
[ERROR:ssl_client_socket_unittest.cc(780)] Could not start SpawnedTestServer

or it fails many (~170) tests:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68281
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68275

There don't appear to be any failure logs for this situation.

This needs to be investigated right away. It's causing valid jobs to be kicked out of the CQ. Marking P0.
Nov 16 2016
+phajdan.jr@ and pcc@ who fixed a few recent testserver issues.
Nov 16 2016
The logs can be extracted by following the corresponding [trigger] task and hitting the link "Shard #0"
Nov 16 2016
+jbudorick as well.

Yeah, from #3 and looking at logs it looks like these are generally issues with spawning the test server. The majority of the code base has been converted over to EmbeddedTestServer, but not some of the SSL tests, and that's not going to be fixed in the timeframe for this. I'm assigning to eroman since he's the current person on network bug rotation - but Eric, feel free to rope in anyone else.
Nov 16 2016
re the SpawnedTestServer problem: this, from the log, looks relevant:
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
    method()
  File "/b/swarm_slave/w/iruYZ288/build/android/pylib/chrome_test_server_spawner.py", line 369, in do_POST
    self._StartTestServer()
  File "/b/swarm_slave/w/iruYZ288/build/android/pylib/chrome_test_server_spawner.py", line 324, in _StartTestServer
    assert not self.server.test_server_instance
AssertionError
...
I 1646.819s Thread-129 Test-server has died.
re the ~170 test failure: we currently lose logs in the event of a device shard timeout.
Nov 16 2016
I think I've flakily reproduced this locally and am seeing lock contention. Relevant thread traces attached.
Nov 16 2016
Re comment #7: Not sure that I see any evidence of a deadlock in that. The test server code is waiting for completion of something on the IO thread (by blocking execution of thread 11). Meanwhile, thread 16 (possibly the IO thread?) is running code in NetworkQualityEstimator::OnUpdatedRTTAvailable(), which is acquiring a different, independent lock (https://cs.chromium.org/chromium/src/net/nqe/network_quality_estimator.cc?sq=package:chromium&dr=CSs&rcl=1479284732&l=1386). Are you finding that thread 16 is unable to acquire that lock? That would certainly be interesting...
Nov 16 2016
Yeah, deadlock is the wrong way to describe it, but it does seem to be unable to acquire that lock. That blocks the spawner_communicator thread, and so the test hangs trying to shut down the SpawnedTestServer. I couldn't find any other thread doing anything relevant.
Nov 16 2016
Thanks for the data! I wonder if that may be specific to your local test setup, though.

What I am seeing from the failure logs is that the tests fail because the test server fails to start up. The test server failed to start up because it made a request of the form "http://127.0.0.1:PORT/path" and that failed with error -324 (ERR_EMPTY_RESPONSE). Those test failures are happening in under 1 second. While there are some timeouts on the waitable event, those are in the XX seconds range, so I don't believe the root of this problem is slow execution. We have experienced a number of problems with external test server unreliability, though, so I think that is more likely the root of the problem, but will keep digging.

If this truly is a P0 bug, we can temporarily disable all these tests on Android. Cheers.
Nov 16 2016
Right. I think what's happening here is:
- we hang and don't shut down the server properly in one test
- the test shard times out and gets rerun
- we later try to use the spawned test server again, but since it wasn't properly shut down, it already exists on the python side, resulting in the AssertionError in #6 and an EMPTY_RESPONSE on the client side.
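To make that sequence concrete, here is a minimal, runnable Python sketch of the spawner's bookkeeping. It only mirrors the field and assertion visible in the traceback in #6; the class and names here are illustrative stand-ins, not the actual chrome_test_server_spawner.py code.

class FakeSpawner(object):
  """Stand-in for the per-device state the spawner keeps."""
  def __init__(self):
    self.test_server_instance = None

  def start(self):
    # Mirrors `assert not self.server.test_server_instance`: if the previous
    # test's /kill never ran, this raises, the HTTP handler dies before
    # writing a response body, and the device-side C++ client sees
    # ERR_EMPTY_RESPONSE (-324).
    assert not self.test_server_instance
    self.test_server_instance = object()  # stand-in for the spawned server

  def kill(self):
    self.test_server_instance = None

spawner = FakeSpawner()
spawner.start()    # first run starts its test server
# ... the test hangs, the shard times out, kill() is never reached ...
try:
  spawner.start()  # the rerun tries to start again
except AssertionError:
  print('AssertionError: lingering server from the previous run')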
Nov 16 2016
This is still affecting a significant number of CQ jobs. Per https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 , 13 out of the last 200 builds have failed (including my colleague's tryjob https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68911 ), a 6.5% failure rate. Please treat this as urgent.
Nov 16 2016
#10: Let's temporarily disable the affected tests, as it sounds like this is having large-scale developer productivity issues.
Nov 17 2016
Summarizing how I think this code works, and the current hypotheses:

When the net unittests run on Android, they don't simply start the python net test server, due to lack of python support on the device. Instead, this net-unittest-specific test server is run on the non-android host, and the ports are forwarded so the device can access it on http://127.0.0.1:PORT.

The net-unittests test server has test-specific parameters and is started on a per-test basis. The way this is accomplished is that the C++ unit tests (running on the Android device) send a POST to http://127.0.0.1:XXX/start with the parameters for the net test server (port XXX is forwarded to the non-android host). The handler for /start spawns the python net-unittests test server and returns the resulting port number. The port for the spawner server (XXX) is obtained from a well-known file on the device's external storage; the non-android host starts the spawner and writes that file before running any tests.

Finally, when the C++ code is done with the net test server (SpawnedTestServer goes out of scope), it does a GET to http://127.0.0.1:XXX/kill, to hopefully kill the server that /start provisioned earlier (kill takes no parameters).

The assumption throughout is that there is a 1:1 correspondence between the spawning test server and the test server it spawns. Consequently:

* net_unittests cannot be run in parallel on the device, since there is only one spawning test server per device. I am assuming for now this has not changed, and is not the cause of flake.

* net_unittests can only have ONE instance of SpawnedTestServer (RemoteTestServer) at a time, because of how Android treats things. In effect, it needs to be treated as a singleton. This is easy enough to verify with some assertions, and I expect the code is already following this pattern, so it's unlikely to be the problem either.

* Unclean shutdown / termination of net_unittests may result in "/kill" not being called. It is unclear to me if there is another mechanism for cleaning things up before re-trying tests, or after completing net_unittests on the device. Certainly if the same instance of the spawning test server is being used on the next run, it could cause the symptoms we are seeing (comment #6), and would lead to every subsequent test that uses RemoteTestServer failing, up until /kill is called again. @jbudorick observes this could be the problem in comment #11. (It isn't obvious to me from reading the stdout exactly what steps are run prior to, or between, tests.)

So my concrete thoughts on the next steps are to:

(a) Change /start on the spawning test server to implicitly kill any lingering test server, so it can recover from an abandoned test server left by unclean test terminations or otherwise. This should hopefully be sufficient to fix the flakiness. (This does have some impact on future debugging of issues, since we will no longer fail on overlapping calls to /start.) A sketch of this is below.

(b) Add assertions to the C++ code to ensure that RemoteTestServer is only ever used serially (easy enough to do, may as well).

(c) Add some logging for failed calls to /kill on the C++ side (if /kill is failing it could also explain why /start fails later, although that doesn't explain why /kill failed).
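For (a), here is a self-contained sketch of the spawner protocol described above with the proposed /start behavior applied. It uses the same Python 2 BaseHTTPServer module that appears in the traceback in #6, but everything here (DummyTestServer, the ports, the handler layout) is illustrative; the real change would go in build/android/pylib/chrome_test_server_spawner.py, not this toy server.

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class DummyTestServer(object):
  """Illustrative stand-in for the spawned python net test server."""
  def __init__(self):
    self.port = 9000  # the real spawner reports the actual listening port

  def Kill(self):
    pass

class SpawnerRequestHandler(BaseHTTPRequestHandler):
  def do_POST(self):
    if self.path != '/start':
      self.send_error(404)
      return
    length = int(self.headers.getheader('content-length', 0))
    _params = self.rfile.read(length)  # test-server parameters (unused here)
    if self.server.test_server_instance:
      # Proposal (a): a previous test never reached /kill (hang, shard
      # timeout, crash), so reclaim the abandoned server instead of
      # asserting and dropping the connection with an empty response.
      self.server.test_server_instance.Kill()
      self.server.test_server_instance = None
    self.server.test_server_instance = DummyTestServer()
    self._Reply(str(self.server.test_server_instance.port))

  def do_GET(self):
    if self.path != '/kill':
      self.send_error(404)
      return
    if self.server.test_server_instance:
      self.server.test_server_instance.Kill()
      self.server.test_server_instance = None
    self._Reply('killed')

  def _Reply(self, body):
    self.send_response(200)
    self.send_header('Content-Length', str(len(body)))
    self.end_headers()
    self.wfile.write(body)

if __name__ == '__main__':
  # Port XXX in the description above; the real code writes it to a
  # well-known file on the device's external storage.
  spawner = HTTPServer(('127.0.0.1', 8001), SpawnerRequestHandler)
  spawner.test_server_instance = None  # the field seen in the traceback in #6
  spawner.serve_forever()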
Nov 17 2016
Please note that this is still causing valid CQ jobs to fail. https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/69198 is another example.
Nov 17 2016
@kbr: Can you identify any regression range for when things got more flaky?

Here is a possible fix; running some tryjobs on it: https://codereview.chromium.org/2508873003
Nov 17 2016
eroman@: sorry, no. I was only a victim of the existing flakiness and didn't look further back in the history of https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 . It seemed to be flaky as far back as visible in the 200 builds (which isn't a long time span).
Nov 17 2016
@jbudorick was able to go back through android_n5x_swarming_rel's history (the default view only shows 200 builds, which isn't enough) and identified the point when net_unittests got flaky as yesterday afternoon around 3 PST: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68188

That build implicates 5ce3e1f8fef0f96. Moreover, the trace in comment #7 also contains related code from NetworkQualityEstimator on thread 16, so we may be onto something. Will attempt a revert:

commit 5ce3e1f8fef0f96ae28a09ac0aba2dc49802f9ee
Author: mmenke <mmenke@chromium.org>
Date: Tue Nov 15 14:45:33 2016 -0800

    Add a NetworkQualityEstimator to TestURLRequestContext.

    This was inspired by a bug in exercised NetworkQualityEstimator NQE codepath,
    though this CL would not have caught the issue.

    BUG=663456
    Review-Url: https://codereview.chromium.org/2484323002
    Cr-Commit-Position: refs/heads/master@{#432277}
Nov 17 2016
Hrm... Comment 7 says lock contention, but one lock is an internal value in CalledOnValidThread, and the other is a condition variable in a completely different class... That seems rather weird, and a little concerning, if that's really what's happening. Anyhow, thanks for reverting.
Nov 17 2016
I may have misidentified it, but it certainly seemed to be unable to acquire that lock, for whatever reason, every time I was able to reproduce.
Nov 17 2016
That almost sounds like some UAF or other corruption messing up that lock. I haven't been able to reproduce locally, but if you can, you may want to try reverting that CL locally and see if it changes things for you. Thanks!
Nov 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/07dcb5fae952fa8e3343ed0db2d95b28edd416b5

commit 07dcb5fae952fa8e3343ed0db2d95b28edd416b5
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 05:10:57 2016

    Add assertions to ensure that only one instance of RemoteTestServer is used at a time.

    This applies to Android, where RemoteTestServer only supports one test server at a time.
    (Creating others while one is already alive will lead to failures).

    BUG= 665686
    Review-Url: https://codereview.chromium.org/2505023004
    Cr-Commit-Position: refs/heads/master@{#432756}

[modify] https://crrev.com/07dcb5fae952fa8e3343ed0db2d95b28edd416b5/net/test/spawned_test_server/remote_test_server.cc
Nov 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/726737874f2c77c0155872c6963bd86ccc971ff7

commit 726737874f2c77c0155872c6963bd86ccc971ff7
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 05:59:50 2016

    Revert of Add a NetworkQualityEstimator to TestURLRequestContext.
    (patchset #8 id:140001 of https://codereview.chromium.org/2484323002/ )

    Reason for revert: May be the cause of increase net_unittests flake, speculatively reverting (see comment #20)

    BUG= 665686

    Original issue's description:
    > Add a NetworkQualityEstimator to TestURLRequestContext.
    >
    > This was inspired by a bug in exercised NetworkQualityEstimator NQE codepath,
    > though this CL would not have caught the issue.
    >
    > BUG=663456
    >
    > Committed: https://crrev.com/5ce3e1f8fef0f96ae28a09ac0aba2dc49802f9ee
    > Cr-Commit-Position: refs/heads/master@{#432277}

    TBR=palmer@chromium.org,tbansal@chromium.org,tedchoc@chromium.org,mmenke@chromium.org
    # Not skipping CQ checks because original CL landed more than 1 days ago.
    BUG=663456

    Review-Url: https://codereview.chromium.org/2506303002
    Cr-Commit-Position: refs/heads/master@{#432777}

[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/chrome/browser/ssl/chrome_expect_ct_reporter_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/content/browser/android/url_request_content_job_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/http/http_transaction_test_util.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/http/http_transaction_test_util.h
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/nqe/network_quality_estimator_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/url_request/url_request_test_util.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/url_request/url_request_test_util.h
Nov 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/75b27526ecaf8fe51b180639e2c9a46b87613dd9

commit 75b27526ecaf8fe51b180639e2c9a46b87613dd9
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 06:48:04 2016

    Kill any lingering spawning test server on Android when handling /start

    BUG= 665686
    Review-Url: https://codereview.chromium.org/2508873003
    Cr-Commit-Position: refs/heads/master@{#432788}

[modify] https://crrev.com/75b27526ecaf8fe51b180639e2c9a46b87613dd9/build/android/pylib/chrome_test_server_spawner.py
Nov 17 2016
0 of the last 200 builds have a net_unittests failure: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200
Nov 17 2016
Excellent. Thank you Eric for eliminating this source of flakiness!
Nov 17 2016
Woot!
Nov 21 2016
Sheriff here, reopening this issue since net_unittests still seems to be flaking in a similar way. Bumping down to P1 since it seems to be much less severe than last time. See: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyOgsSBUZsYWtlIi9UcmFuc3BvcnRTZWN1cml0eVN0YXRlVGVzdC5PcHRpb25hbEhTVFNDZXJ0UGlucww Example failed build: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/71877
Nov 21 2016
#32: those are likely a separate issue and may be related to https://bugs.chromium.org/p/chromium/issues/detail?id=666821
Nov 21 2016
Re comment #32: From a quick perusal of those tests, they all have a dependency on the system clock and/or the build timestamp. This is a different issue, so I will move it to a different bug -- issue 667535.