
Issue 665686 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocking:
issue 644452




net_unittests flaky on android_n5x_swarming_rel (due to test server not starting)

Project Member Reported by kbr@chromium.org, Nov 16 2016

Issue description

net_unittests is failing often on the android_n5x_swarming_rel tryserver, causing rejection of valid CLs.

There have been previous bugs filed about this, but none seem to be actively investigated, nor is the severity properly reflected.

At this point, the failure rate appears to be ~10% (18 builds out of 200) on https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 ; I'm not sure how many of the passing jobs actually ran net_unittests, so the rate might be higher.

There are two failure modes. Either it fails many of the SSLClientSocketTest tests, like:

https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68292
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68287
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68280

with the following error:

SSLClientSocketTest.ConnectMismatched (run #3):
[ RUN      ] SSLClientSocketTest.ConnectMismatched
[ERROR:spawner_communicator.cc(251)] request failed, error: -324
[ERROR:ssl_client_socket_unittest.cc(780)] Could not start SpawnedTestServer


or it fails many (~170) tests:
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68281
https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68275

There don't appear to be any failure logs for this situation.

This needs to be investigated right away. It's causing valid jobs to be kicked out of the CQ. Marking P0.

 

Comment 1 by kbr@chromium.org, Nov 16 2016

Cc: cbentzel@chromium.org

Comment 2 by kbr@chromium.org, Nov 16 2016

Cc: eroman@chromium.org mattm@chromium.org
Cc: p...@chromium.org phajdan.jr@chromium.org
+phajdan.jr@ and pcc@ who fixed a few recent testserver issues.

Comment 4 by ti...@chromium.org, Nov 16 2016

The logs can be extracted by following the corresponding [trigger] task and hitting the link "Shard #0"
Cc: jbudorick@chromium.org
Owner: eroman@chromium.org
Status: Assigned (was: Untriaged)
+jbudorick as well

Yeah, from #3 and looking at the logs, it looks like these are generally issues with spawning the test server. The majority of the code base has been converted over to EmbeddedTestServer, but some of the SSL tests have not. That's not going to be fixed in the timeframe for this bug.

I'm assigning to eroman since he's the current person on network bug rotation - but Eric, feel free to rope in anyone else. 
re the SpawnedTestServer problem: this, from the log, looks relevant:

Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
    method()
  File "/b/swarm_slave/w/iruYZ288/build/android/pylib/chrome_test_server_spawner.py", line 369, in do_POST
    self._StartTestServer()
  File "/b/swarm_slave/w/iruYZ288/build/android/pylib/chrome_test_server_spawner.py", line 324, in _StartTestServer
    assert not self.server.test_server_instance
AssertionError
...
I 1646.819s Thread-129  Test-server has died.

re the ~170 test failures: we currently lose logs in the event of a device shard timeout.
I think I've flakily reproduced this locally and am seeing lock contention. Relevant thread traces attached.
deadlock2.txt
10.9 KB View Download

Comment 8 by eroman@chromium.org, Nov 16 2016

Re comment #7: Not sure that I see any evidence of a deadlock in that. The test server code is waiting for completion of something on the io thread (by blocking execution of thread 11).

Meanwhile thread 16 (possibly the io thread?) is running code in NetworkQualityEstimator::OnUpdatedRTTAvailable(), which is acquiring a different, independent lock: (https://cs.chromium.org/chromium/src/net/nqe/network_quality_estimator.cc?sq=package:chromium&dr=CSs&rcl=1479284732&l=1386)

Are you finding that thread 16 is unable to acquire that lock? That would certainly be interesting...
Yeah, deadlock is the wrong way to describe it, but it does seem to be unable to acquire that lock. That blocks the spawner_communicator thread, and so the test hangs trying to shut down the SpawnedTestServer.

I couldn't find any other thread doing anything relevant.
Thanks for the data! I wonder if that may be specific to your local test setup though.

What I am seeing from the failure logs is that the tests fail because the test server fails to start up.

The test server failed to start up because it made a request of the form "http://127.0.0.1:PORT/path", and that failed with error -324 (ERR_EMPTY_RESPONSE).

Those test failures are happening in under 1 second. While there are some timeouts on the waitable event, those are in the XX seconds range. So I don't believe the root of this problem is coming from slow execution.

We have experienced a number of problems with external test server unreliability, though, so I think that is more likely the root of the problem; will keep digging.

If this truly is a P0 bug, we can temporarily disable all these tests on Android.

Cheers.
Right. I think what's happening here is:
 - we hang and don't shut down the server properly in one test
 - the test shard times out and gets rerun
 - we later try to use the spawned test server again, but since it wasn't properly shut down, it already exists on the python side, resulting in the AssertionError in #6 and an EMPTY_RESPONSE on the client side.

Comment 12 by kbr@chromium.org, Nov 16 2016

This is still affecting a significant number of CQ jobs. Per https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 , 13 of the last 200 builds failed (including my colleague's tryjob https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68911 ), a 6.5% failure rate.

Please treat this as urgent.

Cc: davidben@chromium.org mmenke@chromium.org
 Issue 665957  has been merged into this issue.
#10: Let's temporarily disable them, as it sounds like this is causing large-scale developer productivity issues. 
Summarizing how I think this code works, and the current hypotheses:

When the net unittests run on Android, they don't simply start the python net test server, due to lack of python support on the device. Instead, this net-unittest specific test server is run on the non-android host, and the ports are forwarded so the device can access stuff on http://127.0.0.1:PORT.

The net-unittests test-server has test-specific parameters, and is started on a per-test basis. The way this is accomplished is the C++ unit-tests (running on Android device) send a post to http://127.0.0.1:XXX/start with the parameters for the net test server. (Port XXX is forwarded to the non-android host). The handler for /start spawns the python net-unittests test server, and returns the resulting port number.

The port for the spawner server (XXX) is obtained from a well-known file on the device's external storage. The non-android host starts this up and writes the file before running any tests.

Finally, when the C++ code is done with the net test server (SpawnedTestServer goes out of scope), it does a GET to http://127.0.0.1:XXX/kill to hopefully kill the server that /start provisioned earlier (/kill takes no parameters).

The assumption throughout is there is a 1:1 correspondence between the spawning test server, and the test server it spawns.
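As a rough illustration (not the actual implementation — the real handler lives in build/android/pylib/chrome_test_server_spawner.py, and the class and method names below are made up), the /start and /kill lifecycle above can be sketched as:

```python
# Hypothetical, simplified sketch of the spawner protocol described above.
# Class and method names here are illustrative only.

class TestServerInstance:
    """Stands in for the spawned python net test server."""
    def __init__(self, params):
        self.params = params
        self.port = 8000  # a real spawner would pick a free port
        self.alive = True

    def kill(self):
        self.alive = False


class Spawner:
    """Handles /start and /kill; embodies the 1:1 assumption that at most
    one spawned test server exists at a time."""
    def __init__(self):
        self.test_server_instance = None

    def handle_start(self, params):
        # Mirrors the assertion in _StartTestServer that fires when a
        # previous server was never cleaned up (the AssertionError in #6).
        assert not self.test_server_instance
        self.test_server_instance = TestServerInstance(params)
        return self.test_server_instance.port  # device connects here

    def handle_kill(self):
        # /kill takes no parameters: there is only one server to kill.
        if self.test_server_instance:
            self.test_server_instance.kill()
            self.test_server_instance = None
```

In this model, if a test hangs and never issues /kill, the very next /start trips the assertion — matching the AssertionError in the traceback above.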

Consequently:

* net_unittests cannot be run in parallel on the device, since there is only one spawning test server per device. I am assuming for now this has not changed, and is not the cause of flake.

* net_unittests can only have ONE instance of SpawnedTestServer (RemoteTestServer) at a time, because on Android there is a single spawning test server per device. In effect, RemoteTestServer needs to be treated as a singleton. This is easy enough to verify with some assertions, and I expect the code already follows this pattern, so it is unlikely to be the problem either.

* Unclean shutdown / termination of net_unittests may result in "/kill" not being called. It is unclear to me if there is another mechanism for cleaning things up before re-trying tests, or completing net-unittests on the device.

Certainly, if the same instance of the spawning test server is being used on the next run, it could cause the symptoms we are seeing (comment #6), and would cause every subsequent test that uses RemoteTestServer to fail, up until /kill is called again.

@jbudorick observes this could be the problem in comment #11. (It isn't obvious to me from reading the stdout exactly what steps are run prior to, or between tests).


So my concrete thoughts on the next steps are to:

(a) Change /start on the spawning test server to implicitly kill any lingering test server. This means it can recover from an abandoned test server, from unclean test terminations or otherwise. This should hopefully be sufficient to fix the flakiness. (this does have some impact on future debugging of issues, since we will no longer fail on overlapping calls to /start)

(b) Add assertions to the C++ code to ensure that RemoteTestServer is only ever used serially (easy enough to do, may as well)

(c) Add some logging for failed calls to /kill on the C++ side (if /kill is failing it could also explain why /start fails later, although doesn't explain why /kill failed)
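A minimal sketch of what (a) could look like on the python side (illustrative names only; the actual change would go in build/android/pylib/chrome_test_server_spawner.py):

```python
# Sketch of proposed fix (a): /start implicitly kills any lingering test
# server instead of asserting, so an abandoned server from an unclean test
# termination cannot poison every later /start.
# Names are illustrative, not the real chrome_test_server_spawner.py code.

import logging


class Spawner:
    def __init__(self):
        self.test_server_instance = None

    def handle_start(self, params):
        if self.test_server_instance:
            # A previous test never issued /kill (e.g. its shard timed
            # out). Log it for debugging, then recover.
            logging.warning('Killing lingering test server before /start')
            self.handle_kill()
        self.test_server_instance = dict(params)  # stand-in for spawning
        return 8000  # a real spawner would return the spawned server's port

    def handle_kill(self):
        self.test_server_instance = None
```

The trade-off noted in (a) still holds in this sketch: overlapping calls to /start no longer fail loudly, which makes some future misuse harder to spot, hence the warning log.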

Comment 16 by kbr@chromium.org, Nov 17 2016

Please note that this is still causing valid CQ jobs to fail. https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/69198 is another example.

Labels: cit-pm
@kbr: Can you identify any regression range when things got more flaky?

Here is a possible fix, running some tryjobs on it:
https://codereview.chromium.org/2508873003

Comment 19 by kbr@chromium.org, Nov 17 2016

eroman@: sorry, no. I was only a victim of the existing flakiness and didn't look further back in the history of https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200 . It seemed to be flaky as far back as visible in the 200 builds (which isn't a long duration).

@jbudorick was able to go back through android_n5x_swarming_rel's history (the default view only shows 200 builds, which isn't enough), and identified the point when net_unittests got flaky as yesterday afternoon around 3 PST:

https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/68188

That build implicates 5ce3e1f8fef0f96. Moreover, the trace in comment #7 also contains related code from NetworkQualityEstimator on thread 16, so we may be onto something. Will attempt a revert:


commit 5ce3e1f8fef0f96ae28a09ac0aba2dc49802f9ee
Author: mmenke <mmenke@chromium.org>
Date:   Tue Nov 15 14:45:33 2016 -0800

    Add a NetworkQualityEstimator to TestURLRequestContext.
    
    This was inspired by a bug in exercised NetworkQualityEstimator NQE codepath,
    though this CL would not have caught the issue.
    
    BUG=663456
    
    Review-Url: https://codereview.chromium.org/2484323002
    Cr-Commit-Position: refs/heads/master@{#432277}
Hrm... Comment 7 says lock contention, but one lock is an internal lock in CalledOnValidThread, and the other is a condition variable in a completely different class... That seems rather weird, and a little concerning, if that's really what's happening.

Anyhow, thanks for reverting.
I may have misidentified, but it certainly seemed to be unable to acquire that lock for whatever reason every time I was able to reproduce.
That almost sounds like some UAF or other corruption messing up that lock. I haven't been able to reproduce locally, but if you can, you may want to try reverting that CL locally to see if it changes things for you. Thanks!

Comment 24 by bugdroid1@chromium.org, Nov 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/07dcb5fae952fa8e3343ed0db2d95b28edd416b5

commit 07dcb5fae952fa8e3343ed0db2d95b28edd416b5
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 05:10:57 2016

Add assertions to ensure that only one instance of RemoteTestServer is
used at a time.

This applies to Android, where RemoteTestServer only supports one test
server at a time.

(Creating others while one is already alive will lead to failures).

BUG= 665686 

Review-Url: https://codereview.chromium.org/2505023004
Cr-Commit-Position: refs/heads/master@{#432756}

[modify] https://crrev.com/07dcb5fae952fa8e3343ed0db2d95b28edd416b5/net/test/spawned_test_server/remote_test_server.cc


Comment 25 by bugdroid1@chromium.org, Nov 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/726737874f2c77c0155872c6963bd86ccc971ff7

commit 726737874f2c77c0155872c6963bd86ccc971ff7
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 05:59:50 2016

Revert of Add a NetworkQualityEstimator to TestURLRequestContext. (patchset #8 id:140001 of https://codereview.chromium.org/2484323002/ )

Reason for revert:
May be the cause of increase net_unittests flake, speculatively reverting (see comment #20)

BUG= 665686 

Original issue's description:
> Add a NetworkQualityEstimator to TestURLRequestContext.
>
> This was inspired by a bug in exercised NetworkQualityEstimator NQE codepath,
> though this CL would not have caught the issue.
>
> BUG=663456
>
> Committed: https://crrev.com/5ce3e1f8fef0f96ae28a09ac0aba2dc49802f9ee
> Cr-Commit-Position: refs/heads/master@{#432277}

TBR=palmer@chromium.org,tbansal@chromium.org,tedchoc@chromium.org,mmenke@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=663456

Review-Url: https://codereview.chromium.org/2506303002
Cr-Commit-Position: refs/heads/master@{#432777}

[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/chrome/browser/ssl/chrome_expect_ct_reporter_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/content/browser/android/url_request_content_job_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/http/http_transaction_test_util.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/http/http_transaction_test_util.h
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/nqe/network_quality_estimator_unittest.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/url_request/url_request_test_util.cc
[modify] https://crrev.com/726737874f2c77c0155872c6963bd86ccc971ff7/net/url_request/url_request_test_util.h


Comment 26 by bugdroid1@chromium.org, Nov 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/75b27526ecaf8fe51b180639e2c9a46b87613dd9

commit 75b27526ecaf8fe51b180639e2c9a46b87613dd9
Author: eroman <eroman@chromium.org>
Date: Thu Nov 17 06:48:04 2016

Kill any lingering spawning test server on Android when handling /start

BUG= 665686 

Review-Url: https://codereview.chromium.org/2508873003
Cr-Commit-Position: refs/heads/master@{#432788}

[modify] https://crrev.com/75b27526ecaf8fe51b180639e2c9a46b87613dd9/build/android/pylib/chrome_test_server_spawner.py

Status: Fixed (was: Assigned)
0 of the last 200 builds have a net_unittests failure:

https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel?numbuilds=200

Comment 28 by kbr@chromium.org, Nov 17 2016

Excellent. Thank you Eric for eliminating this source of flakiness!

Woot!
Cc: svaldez@chromium.org
 Issue 666210  has been merged into this issue.
Cc: rsleevi@chromium.org
 Issue 666271  has been merged into this issue.
Labels: -Pri-0 Pri-1
Status: Assigned (was: Fixed)
Sheriff here, reopening this issue since net_unittests still seems to be flaking in a similar way.

Bumping down to P1 since it seems to be much less severe than last time.

See: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyOgsSBUZsYWtlIi9UcmFuc3BvcnRTZWN1cml0eVN0YXRlVGVzdC5PcHRpb25hbEhTVFNDZXJ0UGlucww

Example failed build: https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/71877
#32: those are likely a separate issue and may be related to https://bugs.chromium.org/p/chromium/issues/detail?id=666821
Status: Fixed (was: Assigned)
Summary: net_unittests flaky on android_n5x_swarming_rel (due to test server not starting) (was: net_unittests flaky on android_n5x_swarming_rel)
RE comment #32: From a quick perusal of those tests, they all have a dependency on the system clock, and/or the build timestamp.

This is a different issue, so I will move it to a different bug --  issue 667535 
