net_unittests are timing out on Fuchsia
Issue description

net_unittests are purple on the bot; see https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia/14473. It looks like random tests get blocked waiting on various network operations. The issue may be caused by a netstack bug. The log contains the following message that may be related:

netstack: loopSocketRead got endpoint error: operation timed out (TODO)
Mar 2 2018
This is happening on Linux Chromium OS ASan LSan Tests (1) as well: https://ci.chromium.org/buildbot/chromium.memory/Linux%20Chromium%20OS%20ASan%20LSan%20Tests%20%281%29/?limit=200.
Mar 2 2018
IIUC this only started for us on Fuchsia with the most recent SDK roll, right, sergeyu@? I remember seeing mention of a change to locking in the sockets impl, which may be related; it may be worth checking whether another roll helps. The other thing I noticed from one of the failing runs was that an earlier test had actually failed expectations - perhaps that is leaving things in a weird state? (The net_unittests suite has a single ScopedTaskEnvironment shared across all tests, so it is shared across each batch of ten tests, for example.)
Mar 2 2018
Yes, I'm pretty sure that the ChromeOS ASAN failure is a different problem. The logs are different. Also on Fuchsia the issue goes away when I revert the last Fuchsia SDK roll. I can repro the problem with a newer SDK build, i.e. the issue is not fixed there.
Mar 2 2018
Tagging this as Needed for M67. Sergey, do you have time to dig into this, or shall I take a look?
Mar 2 2018
I'm investigating. May be the same issue that causes https://bugs.chromium.org/p/crashpad/issues/detail?id=219
Mar 2 2018
This could likely be due to https://fuchsia.googlesource.com/zircon/+/4c23a1abda0dd15888bc1b94553983763bb7c2e2 . Do we know at what zircon revision this started failing? Is it possible to run these tests locally?
Mar 2 2018
It started failing after the recent SDK roll (see https://chromium-review.googlesource.com/c/chromium/src/+/932822). The previous roll was on Feb 14, so the problem is caused by something that landed in Fuchsia between Feb 14 and Feb 26. I'll try reverting 4c23a1abda0dd15888bc1b94553983763bb7c2e2. You can repro this locally - see https://chromium.googlesource.com/chromium/src/+/lkcr/docs/fuchsia_build_instructions.md for instructions. You can build the SDK from local sources with //tools/fuchsia/local-sdk.py
Mar 2 2018
Looks like the problem reproduces both with and without 4c23a1abda0dd15888bc1b94553983763bb7c2e2 .
Mar 2 2018
Callstack from the hanging thread:

#03: base::MessagePumpFuchsia::HandleEvents(unsigned long) at ??:?
#04: base::MessagePumpFuchsia::Run(base::MessagePump::Delegate*) at ??:?
#05: base::MessageLoop::Run(bool) at ??:?
#06: base::RunLoop::Run() at ??:?
#07: net::internal::TestCompletionCallbackBaseInternal::WaitForResult() at ??:?
#08: net::(anonymous namespace)::SSLClientSocketTest::CreateAndConnectSSLClientSocket(net::SSLConfig const&, int*) at ??:?
#09: net::SSLClientSocketReadTest_Read_WithFatalAlert_Test::TestBody() at ??:?

All other threads run SchedulerWorker, e.g.:

#05: base::ConditionVariable::Wait() at ??:?
#06: base::WaitableEvent::TimedWaitUntil(base::TimeTicks const&) at ??:?
#07: base::WaitableEvent::Wait() at ??:?
#08: base::internal::SchedulerWorker::Delegate::WaitForWork(base::WaitableEvent*) at ??:?
#09: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
Mar 2 2018
Re #10: Based on that stack, the two possibilities seem to be that (1) we're hanging trying to connect(), or (2) we're missing the RunLoop::Quit() somehow. (2) can happen if we mistakenly call RunLoop::Run() on the same MessageLoop from different threads (don't ask...), but I don't see how that could happen here. Does Fuchsia/POSIX non-blocking connect() have any built-in timeout?
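For reference, POSIX non-blocking connect() has no built-in timeout; the caller must impose its own deadline. A minimal Python sketch (not Chromium code; the function name and structure are illustrative) of the usual pattern - start the connect, poll for writability with a deadline, then check SO_ERROR:

```python
import errno
import select
import socket

def connect_with_deadline(addr, timeout_s):
    """Non-blocking connect() with a caller-imposed deadline."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex(addr)
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, "connect failed immediately")
    # Wait until the socket becomes writable (connect finished) or time out.
    _, writable, _ = select.select([], [sock], [], timeout_s)
    if not writable:
        sock.close()
        raise TimeoutError("connect did not complete within deadline")
    # SO_ERROR reports whether the deferred connect actually succeeded.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        raise OSError(err, "connect failed")
    return sock
```

Without such a deadline, a connect that the netstack silently stalls on simply hangs the caller, which matches the symptom in the stack above.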
Mar 5 2018
I can reproduce this issue by creating a listening socket and repeatedly connecting to it. Both accept() and connect() are non-blocking (I haven't verified whether that makes any difference). At some point connect() doesn't complete immediately as it should. Instead it waits for about 30 seconds and then completes with errno=0 (i.e. success). I see the following message in the console:

netstack: loopSocketRead got endpoint error: operation timed out (TODO)

accept() still waits and never accepts that connection. It looks like there are 2 problems:
1. netstack sometimes fails to connect TCP connections and then times out internally.
2. connect() doesn't handle ErrTimeout correctly.
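The repro loop described above can be sketched roughly as follows (a simplified Python approximation, not the original test code; it uses timeouts rather than fully non-blocking sockets so a hang surfaces as an exception instead of blocking forever):

```python
import socket

# One listening socket; clients connect to it repeatedly. On the buggy
# netstack, connect() occasionally stalls for ~30s and the matching
# accept() never completes; the timeouts below turn that into an error.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
listener.settimeout(5.0)
addr = listener.getsockname()

count = 0
for _ in range(50):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5.0)
    client.connect(addr)                    # stalls on the buggy stack
    server_side, _peer = listener.accept()  # never returns on the buggy stack
    client.close()
    server_side.close()
    count += 1
listener.close()
```

On a healthy stack the loop completes immediately; on the buggy netstack one of the iterations eventually raises a timeout.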
Mar 5 2018
With debug logs in netstack I see the following logs for the failed connect():
[00123.004] 03235.03591> netstack: send tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x02 ( S ) seqnum: 3079296080 ack: 0 win: 65535 xsum:0xdca3 options: {MSS:65495 WS:2 TS:true TSVal:3815551922 TSEcr:0}
[00123.005] 03235.03591> netstack: recv tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x02 ( S ) seqnum: 3079296080 ack: 0 win: 65535 xsum:0xdca3 options: {MSS:65495 WS:2 TS:true TSVal:3815551922 TSEcr:0}
[00123.005] 03235.03546> netstack: socketServer.fdio: op=21, len=140, arg=140, hcount=0
[00123.005] 03235.03591> netstack: send tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x12 ( S A ) seqnum: 3002720737 ack: 3079296081 win: 65535 xsum:0x1bf5 options: {MSS:65495 WS:2 TS:true TSVal:3255921072 TSEcr:3815551922}
[00123.005] 03235.03591> netstack: recv tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x12 ( S A ) seqnum: 3002720737 ack: 3079296081 win: 65535 xsum:0x1bf5 options: {MSS:65495 WS:2 TS:true TSVal:3255921072 TSEcr:3815551922}
[00123.005] 03235.03591> netstack: send tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x14 ( R A ) seqnum: 3079296081 ack: 3002720738 win: 0 xsum:0x42da options: {TS:true TSVal:3255921072 TSEcr:3815551922}
[00123.005] 03235.03591> netstack: recv tcp 127.0.0.1:26913 -> 127.0.0.1:26913 len:0 id:0000 flags:0x14 ( R A ) seqnum: 3079296081 ack: 3002720738 win: 0 xsum:0x42da options: {TS:true TSVal:3255921072 TSEcr:3815551922}
Notice that the source and destination port numbers are the same. It looks like the bug is that connect() may bind the outgoing socket to the port that's already used by the listening socket.
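The self-connect symptom is easy to probe for directly. A hedged Python sketch: on a healthy stack the ephemeral port chosen for the outgoing socket can never equal the listener's port, because the listener already holds it:

```python
import socket

# Connect to a local listener and compare the chosen ephemeral port with
# the listening port. The buggy behaviour described above would make them
# equal, producing the src == dst self-connect seen in the packet log.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
listen_port = listener.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", listen_port))
local_port = client.getsockname()[1]  # != listen_port on a healthy stack

client.close()
listener.close()
```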
Mar 5 2018
I can repro the problem in #13 with an older version of the SDK, so it may not be related to this net_unittests failure (there were also no recent changes in the relevant code, i.e. that bug has existed for a long time). Filed NET-551
Mar 7 2018
Filed https://fuchsia.atlassian.net/browse/NET-560 for what I think is the root cause of this bug. I can run net_unittests locally with a fix for NET-560. Still not sure why we didn't see this problem before.
Mar 7 2018
Pending fix for NET-560: https://fuchsia-review.googlesource.com/c/third_party/netstack/+/129750
Mar 7 2018
The tests were not affected by the bug mentioned above because the likelihood that a listening port is picked for an outgoing connection is low. The last SDK roll introduced bug NET-559. Due to that bug, listening sockets are never fully closed, which makes it much more probable that the corresponding port numbers are later reused for outgoing sockets. Essentially the issue with net_unittests is caused by 3 separate bugs:
1. NET-531: connect() may reuse a local port number that's reserved for a listening socket.
2. NET-560: When a local port is used for both an outgoing socket and a listening socket, all incoming traffic on the connected socket is routed to the listening socket.
3. NET-559: listening sockets are never cleaned up after being closed.
The first 2 bugs have existed for a long time. The last bug is new. It significantly increases the probability of the first two bugs causing trouble, which is why we started seeing this issue after the SDK roll. Fixing at least one of these 3 bugs will make net_unittests healthier.
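For contrast, the lifecycle that NET-559 breaks can be sketched in Python against a conventional stack: a listening socket that never accepted anything releases its port as soon as it is closed, so the port is immediately bindable again:

```python
import socket

# Open and close a listener, then rebind the same port. On the buggy
# netstack the closed listener would still hold the port reservation,
# inflating the pool of "in use" ports that connect() can collide with.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
listener.close()  # never accepted a connection, so no TIME_WAIT applies

reuse = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
reuse.bind(("127.0.0.1", port))  # succeeds once the port is released
rebound_port = reuse.getsockname()[1]
reuse.close()
```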
Mar 7 2018
Wow, nice bug-pile. It sounds like NET-560 and NET-559 are definitely bugs; NET-531 is more questionable, as we discussed offline, since the stack inherently has to cope with connected sockets whose local "name" has a port number on which the stack is also listening (since that is always the case for incoming TCP sockets)?
Mar 8 2018
I agree that NET-531 is less clear-cut. IMO it's a bug: reserved ports should not be used for outgoing connections. Consider the following scenario:
1. Process A creates socket sA and binds it to portA.
2. Process B creates socket sB and calls connect(sB, server).
3. Process A calls connect(sA, server) to connect to the same server.
Without fixing that bug it's possible that portA will be used for sB, and then (3) will fail (it will fail to register the endpoint address), which is unexpected for A. The following scenario is also not handled correctly (though it's much less likely to happen):
1. Two machines A and B open listening sockets on portA and portB.
2. Some process on A creates a socket and connects it to B:portB.
3. Some process on B creates a socket and connects it to A:portA.
Currently it's possible that portA and portB will be selected as local ports in (2) and (3), and then there is no way to distinguish between these two connections.
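The first scenario can be sketched in Python on a conventional stack, where it behaves as expected because explicitly bound ports are excluded from ephemeral-port selection (names here are illustrative, mirroring the scenario above):

```python
import socket

# A server to connect to.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_addr = server.getsockname()

# Process A: bind sA to a port first (reserving portA), connect later.
sA = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sA.bind(("127.0.0.1", 0))
portA = sA.getsockname()[1]

sA.connect(server_addr)       # step (3): connect from the reserved port
conn, peer = server.accept()  # the server sees portA as the source port

sA.close()
conn.close()
server.close()
```

With NET-531, portA could instead have been handed out to some other socket's connect() in the meantime, making step (3) fail.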
Mar 8 2018
It also times out a lot on "Linux Chromium OS ASan LSan Tests (1)": https://ci.chromium.org/buildbot/chromium.memory/Linux%20Chromium%20OS%20ASan%20LSan%20Tests%20%281%29/?limit=200 Is it the same cause? If not, mind filing a separate bug?
Mar 8 2018
Latest known timeout there: https://ci.chromium.org/buildbot/chromium.memory/Linux%20Chromium%20OS%20ASan%20LSan%20Tests%20%281%29/26477
Mar 8 2018
Re #19: We're talking about a listen()ing socket ending up sharing a port with connect()ed sockets, which is different from two connect()ed sockets sharing one, but yeah: unless we have SO_REUSEADDR set, we can't allow multiple sockets to bind() to a single port in general, not least since we don't know whether they will want to connect() vs listen().

Re #20/#21: These issues are specific to the Fuchsia OS. Timeouts under ASAN are not terrifically surprising; if you look at the runs that pass, they are taking ~50 minutes, which is getting perilously close to the 60-minute limit for each swarming shard. I'd suggest filing a separate bug for the infra folks to e.g. increase the number of shards, to address that.
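The "can't allow multiple sockets to bind() to a single port" rule is easy to observe on a conventional stack; a small Python sketch:

```python
import socket

# Without SO_REUSEADDR, a second bind() to a port already held by a
# listening socket fails with EADDRINUSE.
a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))
a.listen(1)
port = a.getsockname()[1]

b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))
    second_bind_ok = True
except OSError:
    second_bind_ok = False

b.close()
a.close()
```

NET-531 is the implicit-bind analogue of this rule: connect()'s automatic port selection should treat an explicitly bound port as taken, just as an explicit bind() does.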
Mar 8 2018
My point was that ports bound with bind() should not be used as ephemeral ports for connect(). The case where both processes call bind() is already handled correctly. I've modified the description of the second scenario to make the problem clearer:
1. Process A1 on machine A listens for connections on portA.
2. Process B1 on machine B listens for connections on portB.
3. Process A2 on A creates a socket and connects it to B:portB.
4. Process B2 on B creates a socket and connects it to A:portA.
Currently it's possible that portA is selected for A2 and portB for B2. When this happens, A2 and B2 will be connected to each other instead of to B1 and A1.
Mar 9 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/045f87da6588f2cfee731605539fb462c308d03e

commit 045f87da6588f2cfee731605539fb462c308d03e
Author: Sergey Ulanov <sergeyu@chromium.org>
Date: Fri Mar 09 02:21:15 2018

Roll Fuchsia SDK to c277a8a3c81b8012cfe03bff945cf15f945dfd63

The new SDK fixes NET-559. net_unittests should no longer timeout with this change.

Bug: 707030, 818046
Change-Id: Ie97df38a6bd02e1e9fcc22dce7f50888a3ad6f7e
Reviewed-on: https://chromium-review.googlesource.com/949970
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#541995}

[modify] https://crrev.com/045f87da6588f2cfee731605539fb462c308d03e/build/fuchsia/update_sdk.py