HttpServiceTest.RequestWithHeaders times out on FYI bots |
|||||||||||||
Issue description
,
Oct 25
This does look like a bug in netstack. I remember fixing a similar issue that was resulting in the same error message in the log. Does EmbeddedTestServer always close the socket after sending the response?
,
Oct 25
Yes, the socket is always closed after sending the response. How did you fix the issue back then?
,
Oct 25
See https://fuchsia-review.googlesource.com/c/garnet/+/73803/3/go/src/netstack/fdio.go . It looks like we just need to update loopSocketRead() to handle ErrConnectionAborted . Can you please try that and see if that fixes the problem?
,
Oct 25
Unfortunately, that did not fix the problem. I think there is some deeper issue somewhere. The socket is always closed server-side before the client does its final read.
,
Oct 25
Oh nevermind, I was in the wrong directory, let me try again.
,
Oct 25
Tried it in the right directory this time, but that didn't improve things.
,
Oct 25
,
Oct 26
I've been mucking around with this today, but haven't had much luck at tracking down the cause either. HttpServiceTest.MultipleRequests fails more often for me than HttpServiceTest.RequestWithHeaders which is what appears to be failing all the time on the bots. One thing I noticed is that this case gets hit https://fuchsia.googlesource.com/garnet/+/master/go/src/netstack/socket_server.go#289 if I uncomment it. I don't know if that's really relevant or not though.
,
Oct 29
In .MultipleRequests I also noticed that reducing the number of attempts to as low as 5 will reproduce the hang relatively quickly with a gtest_repeat. When it hangs, the embedded test server never even gets to Accept() for the hanging request. I'm not sure yet whether it's the client that's not completing the send properly, or if the test server side isn't servicing properly though. (Ugly debug log here, but not too readable: https://gist.github.com/sgraham/5d3947dfd7da77d4ccdda018cb980435)
,
Oct 29
I also notice that making SocketPosix::Accept() a bit slow (i.e. adding a sleep(1) to it) makes HttpServiceRequest.MultipleRequests fail every time. I'm not sure why that is yet, but that seems probably not good.
,
Oct 29
I think the accept() case I've been staring at is basically the same as #c1 (accept goes EWOULDBLOCK, but the watcher never async signals a readable event again). I thought for a while it might be our MessagePump, but I haven't been able to find anything wrong there (other than it's very complex stateful code :). I've been poking around in musl too, but I'm still not sure if the client is actually sending the final thing-that-should-have-been-accept()d, or whether the server is missing it. I'm *pretty* sure it's the latter though as the timing around accept() is very temperamental.
,
Oct 30
I had a thought to look back at history to see if it correlated with an sdk roll, but http_service_tests has been flaky like this since it was added at https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/fuchsia-fyi-x64-rel/4903. (I'm guessing we already knew that, I didn't realize until I checked. :)
,
Oct 30
I am putting this one on hold until NET-1796 is fixed on the Fuchsia side. I am reasonably confident the problem is not on Chromium side at this point.
,
Oct 31
I noticed some interesting flakiness in net_unittests. Log below: [ RUN ] UDPSocketTest.JoinMulticastGroup ../../net/socket/udp_socket_unittest.cc:661: Failure Value of: socket.Bind(bind_address) Expected: net::OK Actual: -4, net::ERR_INVALID_ARGUMENT Stack trace: bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b) bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903) bt#02: pc 0x5729f14c7ad3 (app:/pkg/net_unittests__exec,0x18ffad3) bt#03: end ../../net/socket/udp_socket_unittest.cc:662: Failure Value of: socket.JoinGroup(group_ip) Expected: net::OK Actual: -15, net::ERR_SOCKET_NOT_CONNECTED Stack trace: bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b) bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903) bt#02: pc 0x5729f14c7b7c (app:/pkg/net_unittests__exec,0x18ffb7c) bt#03: end ../../net/socket/udp_socket_unittest.cc:665: Failure Value of: socket.LeaveGroup(group_ip) Expected: net::OK Actual: -15, net::ERR_SOCKET_NOT_CONNECTED Stack trace: bt#00: pc 0x5729f1abfd5b (app:/pkg/net_unittests__exec,0x1ef7d5b) bt#01: pc 0x5729f1abf903 (app:/pkg/net_unittests__exec,0x1ef7903) bt#02: pc 0x5729f14c7cdd (app:/pkg/net_unittests__exec,0x18ffcdd) bt#03: end [ FAILED ] UDPSocketTest.JoinMulticastGroup (21 ms) The ERR_SOCKET_NOT_CONNECTED seems like it *could* be the same root issue.
,
Nov 3
OK, think I've found the problem, on the Chromium side. :P FWIW the net_unittests failures were filed under issue 900709.
,
Nov 3
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f130981cbd9032229906a0171bb87aa2c6fe2abb commit f130981cbd9032229906a0171bb87aa2c6fe2abb Author: Wez <wez@chromium.org> Date: Sat Nov 03 05:42:24 2018 Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events. WatchFileDescriptor(WATCH_READ) should result in OnFileCanReadWithoutBlocking() if the descriptor is at end-of-stream when it is invoked, to wake the caller to detect that by attempting to read and receiving zero bytes. Bug: 898938 , 706592, 884299 Change-Id: I77662fe77eb8fb988089640a75a60d470770b78f Reviewed-on: https://chromium-review.googlesource.com/c/1316672 Commit-Queue: Wez <wez@chromium.org> Reviewed-by: Sergey Ulanov <sergeyu@chromium.org> Cr-Commit-Position: refs/heads/master@{#605154} [modify] https://crrev.com/f130981cbd9032229906a0171bb87aa2c6fe2abb/base/message_loop/message_pump_fuchsia.cc
,
Nov 3
Woo! Make sure to ping NET-1796 so that Toshi knows.
,
Nov 3
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/088e95617862c91da15e97a233e59811b798a95f commit 088e95617862c91da15e97a233e59811b798a95f Author: Wez <wez@chromium.org> Date: Sat Nov 03 06:08:28 2018 Revert "Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events." This reverts commit f130981cbd9032229906a0171bb87aa2c6fe2abb. Reason for revert: Doesn't actually fix anything. >.< Original change's description: > Add FDIO_EVT_PEER_CLOSED to WatchFileDescriptor(WATCH_READ) events. > > WatchFileDescriptor(WATCH_READ) should result in > OnFileCanReadWithoutBlocking() if the descriptor is at end-of-stream > when it is invoked, to wake the caller to detect that by attempting to > read and receiving zero bytes. > > Bug: 898938 , 706592, 884299 > Change-Id: I77662fe77eb8fb988089640a75a60d470770b78f > Reviewed-on: https://chromium-review.googlesource.com/c/1316672 > Commit-Queue: Wez <wez@chromium.org> > Reviewed-by: Sergey Ulanov <sergeyu@chromium.org> > Cr-Commit-Position: refs/heads/master@{#605154} TBR=wez@chromium.org,sergeyu@chromium.org,fdegans@chromium.org Change-Id: I1a98ece4b90de54c2c789dee64c226eb92bcda38 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 898938 , 706592, 884299 Reviewed-on: https://chromium-review.googlesource.com/c/1316675 Reviewed-by: Wez <wez@chromium.org> Commit-Queue: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#605157} [modify] https://crrev.com/088e95617862c91da15e97a233e59811b798a95f/base/message_loop/message_pump_fuchsia.cc
,
Nov 5
Handing back to Fabrice to investigate further :-/
,
Nov 5
Darn, that sounded so promising!
,
Nov 5
An update on NET-1796, "third_party/netstack stops notifying an incoming connection after a while". Let's wait for a fix on that one. Hopefully that was the root cause of all our network issues.
,
Nov 12
The NextAction date has arrived: 2018-11-12
,
Nov 12
Unfortunately, not much more to report, the problem is definitely in netstack but it's not clear where it is exactly. Moving to after Thanksgiving.
,
Nov 26
The NextAction date has arrived: 2018-11-26
,
Nov 26
Good news on this front. A fix is in code review in Garnet: https://fuchsia-review.googlesource.com/c/garnet/+/227032 Punting for another week.
,
Nov 30
We have rolled with the above fix. Unfortunately, it seems we have another issue in netstack, tracked in NET-1986. Adding another week.
,
Dec 4
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/dafc02d5b0e772289361d2caa9d55410a35cb841 commit dafc02d5b0e772289361d2caa9d55410a35cb841 Author: Sergey Ulanov <sergeyu@chromium.org> Date: Tue Dec 04 21:30:05 2018 [Fuchsia] Disable failing tests in http_service_unittests Bug: 898938 Change-Id: I8209b8440639587ca200dd560f8ac2279f831882 Reviewed-on: https://chromium-review.googlesource.com/c/1361827 Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org> Reviewed-by: Wez <wez@chromium.org> Cr-Commit-Position: refs/heads/master@{#613702} [modify] https://crrev.com/dafc02d5b0e772289361d2caa9d55410a35cb841/webrunner/net_http/http_service_unittest.cc
,
Dec 6
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/878a10c54ca6d5f3f406d8c17e60b55272750f9c commit 878a10c54ca6d5f3f406d8c17e60b55272750f9c Author: Sergey Ulanov <sergeyu@chromium.org> Date: Thu Dec 06 01:57:40 2018 [Fuchsia] Fix flake in http_service_tests CheckResponseStream() wasn't running MessageLoop while blocking the thread. Often MessageLoop would process stream end by the time CheckResponseStream() is called, but it wasn't always happening. In that case end-of-stream is never processed by the service and so the tests would block forever. Updated CheckResponseStream() to run MessageLoop when waiting for the socket. Bug: 898938 Change-Id: Icf59ad64859de1ddcdbb47890138ff83dc2d2a14 Reviewed-on: https://chromium-review.googlesource.com/c/1364290 Commit-Queue: Sergey Ulanov <sergeyu@chromium.org> Reviewed-by: Wez <wez@chromium.org> Reviewed-by: Fabrice de Gans-Riberi <fdegans@chromium.org> Cr-Commit-Position: refs/heads/master@{#614230} [modify] https://crrev.com/878a10c54ca6d5f3f406d8c17e60b55272750f9c/webrunner/net_http/http_service_unittest.cc
,
Dec 6
sergeyu@ definitely fixed it, the test suite hasn't flaked since the fix landed. I'm going to enable the test suite on the main builders. Thanks again!
,
Dec 10
The NextAction date has arrived: 2018-12-10
,
Dec 10
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by fdegans@chromium.org
, Oct 25