New issue
Advanced search Search tips

Issue 898938 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Dec 6
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 2
Type: Bug

Blocking:
issue 874155



Sign in to add a comment

HttpServiceTest.RequestWithHeaders times out on FYI bots

Project Member Reported by sergeyu@chromium.org, Oct 25

Issue description

I've been investigating this in the past few days. Here is what is happening:
The EmbeddedTestServer successfully sends all the data. It is then successfully read by the client. Then the client does one final "read" on the socket.
When the test is successful, the "read" returns 0 bytes read and that's the end of the message. When the test times out, the "read" returns -1, errno is set to EAGAIN, which gets translated to ERR_IO_PENDING in //net. Then, the socket is waited upon until it is readable again but that never happens.

Interestingly, this also appears in the log:
[00030.902] netstack: loopStreamRead got endpoint error: connection aborted (TODO)

I suspect there is something wrong in netstack at this point.
This does look like a bug in netstack. I remember fixing a similar issue that was resulting in the same error message in the log.

Does EmbeddedTestServer always close the socket after sending the response?
Yes, the socket is always closed after sending the response. How did you fix the issue back then?
Cc: -fdegans@chromium.org sergeyu@chromium.org
Labels: -Pri-3 Pri-2
Owner: fdegans@chromium.org
See https://fuchsia-review.googlesource.com/c/garnet/+/73803/3/go/src/netstack/fdio.go . It looks like we just need to update loopSocketRead() to handle ErrConnectionAborted . Can you please try that and see if that fixes the problem?
Unfortunately, that did not fix the problem. I think there is some deeper issue somewhere. The socket is always closed server-side before the client does its final read.
Oh nevermind, I was in the wrong directory, let me try again.
Tried it in the right directory this time, but that didn't improve things.
Blocking: 874155
Cc: scottmg@chromium.org
I've been mucking around with this today, but haven't had much luck at tracking down the cause either.

HttpServiceTest.MultipleRequests fails more often for me than HttpServiceTest.RequestWithHeaders which is what appears to be failing all the time on the bots.

One thing I noticed is that this case gets hit https://fuchsia.googlesource.com/garnet/+/master/go/src/netstack/socket_server.go#289 if I uncomment it. I don't know if that's really relevant or not though.

In .MultipleRequests I also noticed that reducing the number of attempts to as low as 5 will reproduce the hang relatively quickly with a gtest_repeat.

When it hangs, the embedded test server never even gets to Accept() for the hanging request. I'm not sure yet whether it's the client that's not completing the send properly, or if the test server side isn't servicing properly though.

(Ugly debug log here, but not too readable: https://gist.github.com/sgraham/5d3947dfd7da77d4ccdda018cb980435)
I also notice that making SocketPosix::Accept() a bit slow (i.e. adding a sleep(1) to it) makes HttpServiceRequest.MultipleRequests fail every time. I'm not sure why that is yet, but that seems probably not good.
I think the accept() case I've been staring at is basically the same as #c1 (accept goes EWOULDBLOCK, but the watcher never async signals a readable event again).

I thought for a while it might be our MessagePump, but I haven't been able to find anything wrong there (other than it's very complex stateful code :).

I've been poking around in musl too, but I'm still not sure if the client is actually sending the final thing-that-should-have-been-accept()d, or whether the server is missing it. I'm *pretty* sure it's the latter though as the timing around accept() is very temperamental.
I had a thought to look back at history to see if it correlated with an sdk roll, but http_service_tests has been flaky like this since it was added at https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/4903.

(I'm guessing we already knew that, I didn't realize until I checked. :)
Status: Assigned (was: Started)
I am putting this one on hold until NET-1796 is fixed on the Fuchsia side.
I am reasonably confident the problem is not on Chromium side at this point.
I noticed some interesting flakiness in net_unittests. Log below:

[ RUN      ] UDPSocketTest.JoinMulticastGroup
../../net/socket/udp_socket_unittest.cc:661: Failure
Value of: socket.Bind(bind_address)
Expected: net::OK
  Actual: -4, net::ERR_INVALID_ARGUMENT
Stack trace:
bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b)
bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903)
bt#02: pc 0x5729f14c7ad3 (app:/pkg/net_unittests__exec,0x18ffad3)
bt#03: end

../../net/socket/udp_socket_unittest.cc:662: Failure
Value of: socket.JoinGroup(group_ip)
Expected: net::OK
  Actual: -15, net::ERR_SOCKET_NOT_CONNECTED
Stack trace:
bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b)
bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903)
bt#02: pc 0x5729f14c7b7c (app:/pkg/net_unittests__exec,0x18ffb7c)
bt#03: end

../../net/socket/udp_socket_unittest.cc:665: Failure
Value of: socket.LeaveGroup(group_ip)
Expected: net::OK
  Actual: -15, net::ERR_SOCKET_NOT_CONNECTED
Stack trace:
bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b)
bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903)
bt#02: pc 0x5729f14c7cdd (app:/pkg/net_unittests__exec,0x18ffcdd)
bt#03: end

[  FAILED  ] UDPSocketTest.JoinMulticastGroup (21 ms)



The ERR_SOCKET_NOT_CONNECTED seems like it *could* be the same root issue.
Cc: fdegans@chromium.org
Owner: w...@chromium.org
Status: Started (was: Assigned)
OK, think I've found the problem, on the Chromium side. :P

FWIW the net_unittests failures were filed under issue 900709.

Project Member

Comment 17 by bugdroid1@chromium.org, Nov 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f130981cbd9032229906a0171bb87aa2c6fe2abb

commit f130981cbd9032229906a0171bb87aa2c6fe2abb
Author: Wez <wez@chromium.org>
Date: Sat Nov 03 05:42:24 2018

Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events.

WatchFileDescriptor(WATCH_READ) should result in
OnFileCanReadWithoutBlocking() if the descriptor is at end-of-stream
when it is invoked, to wake the caller to detect that by attempting to
read and receiving zero bytes.

Bug:  898938 , 706592, 884299
Change-Id: I77662fe77eb8fb988089640a75a60d470770b78f
Reviewed-on: https://chromium-review.googlesource.com/c/1316672
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Sergey Ulanov <sergeyu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#605154}
[modify] https://crrev.com/f130981cbd9032229906a0171bb87aa2c6fe2abb/base/message_loop/message_pump_fuchsia.cc

Woo! Make sure to ping NET-1796 so that Toshi knows.
Project Member

Comment 19 by bugdroid1@chromium.org, Nov 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/088e95617862c91da15e97a233e59811b798a95f

commit 088e95617862c91da15e97a233e59811b798a95f
Author: Wez <wez@chromium.org>
Date: Sat Nov 03 06:08:28 2018

Revert "Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events."

This reverts commit f130981cbd9032229906a0171bb87aa2c6fe2abb.

Reason for revert: Doesn't actually fix anything. >.<

Original change's description:
> Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events.
> 
> WatchFileDescriptor(WATCH_READ) should result in
> OnFileCanReadWithoutBlocking() if the descriptor is at end-of-stream
> when it is invoked, to wake the caller to detect that by attempting to
> read and receiving zero bytes.
> 
> Bug:  898938 , 706592, 884299
> Change-Id: I77662fe77eb8fb988089640a75a60d470770b78f
> Reviewed-on: https://chromium-review.googlesource.com/c/1316672
> Commit-Queue: Wez <wez@chromium.org>
> Reviewed-by: Sergey Ulanov <sergeyu@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#605154}

TBR=wez@chromium.org,sergeyu@chromium.org,fdegans@chromium.org

Change-Id: I1a98ece4b90de54c2c789dee64c226eb92bcda38
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  898938 , 706592, 884299
Reviewed-on: https://chromium-review.googlesource.com/c/1316675
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#605157}
[modify] https://crrev.com/088e95617862c91da15e97a233e59811b798a95f/base/message_loop/message_pump_fuchsia.cc

Labels: M-72
Owner: fdegans@chromium.org
Handing back to Fabrice to investigate further :-/
Darn, that sounded so promising!
NextAction: 2018-11-12
Status: ExternalDependency (was: Started)
An update on NET-1796, "third_party/netstack stops notifying an incoming connection after a while".
Let's wait for a fix on that one. Hopefully that was the root cause of all our network issues.
The NextAction date has arrived: 2018-11-12
NextAction: 2018-11-26
Unfortunately, not much more to report, the problem is definitely in netstack but it's not clear where it is exactly. Moving to after Thanksgiving.
The NextAction date has arrived: 2018-11-26
NextAction: 2018-12-03
Good news on this front. A fix is in code review in Garnet:
https://fuchsia-review.googlesource.com/c/garnet/+/227032

Punting for another week.
NextAction: 2018-12-10
We have rolled with the above fix. Unfortunately, it seems we have another issue in netstack, tracked in NET-1986. Adding another week.
Project Member

Comment 28 by bugdroid1@chromium.org, Dec 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dafc02d5b0e772289361d2caa9d55410a35cb841

commit dafc02d5b0e772289361d2caa9d55410a35cb841
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Tue Dec 04 21:30:05 2018

[Fuchsia] Disable failing tests in http_service_unittests

Bug:  898938 
Change-Id: I8209b8440639587ca200dd560f8ac2279f831882
Reviewed-on: https://chromium-review.googlesource.com/c/1361827
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#613702}
[modify] https://crrev.com/dafc02d5b0e772289361d2caa9d55410a35cb841/webrunner/net_http/http_service_unittest.cc

Project Member

Comment 29 by bugdroid1@chromium.org, Dec 6

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/878a10c54ca6d5f3f406d8c17e60b55272750f9c

commit 878a10c54ca6d5f3f406d8c17e60b55272750f9c
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Thu Dec 06 01:57:40 2018

[Fuchsia] Fix flake in http_service_tests

CheckResponseStream() wasn't running MessageLoop while blocking the
thread. Often MessageLoop would process stream end by the time
CheckResponseStream() is called, but it wasn't always happening.
In that case end-of-stream is never processed by the service and
so the tests would block forever. Updated CheckResponseStream()
to run MessageLoop when waiting for the socket.

Bug:  898938 
Change-Id: Icf59ad64859de1ddcdbb47890138ff83dc2d2a14
Reviewed-on: https://chromium-review.googlesource.com/c/1364290
Commit-Queue: Sergey Ulanov <sergeyu@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#614230}
[modify] https://crrev.com/878a10c54ca6d5f3f406d8c17e60b55272750f9c/webrunner/net_http/http_service_unittest.cc

Status: Verified (was: ExternalDependency)
sergeyu@ definitely fixed it, the test suite hasn't flaked since the fix landed. I'm going to enable the test suite on the main builders. Thanks again!
The NextAction date has arrived: 2018-12-10
NextAction: ----

Sign in to add a comment