New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 774306 link

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 3
Type: Bug



Sign in to add a comment

re-evaluate net layer socket connection API: UDPClientSocket::Connect() vs UDPClientSocket::ConnectUsingDefaultNetwork

Project Member Reported by zhongyi@chromium.org, Oct 12 2017

Issue description

In a recent CL: https://chromium-review.googlesource.com/c/chromium/src/+/691103, I switched sockets to explicitly use default network to connect if NetworkHandle is supported when we create a socket(not in connection migration). Basically I switched from socket->Connect(addr) to socket->ConnectUsingDefaultNetwork(addr);

This change is not on by default, but controlled by a finch configuration.

However, when talking to lorenzo@, he mentioned using ConnectUsingDefaultNetwork API is risky in two ways:
- the implementation queries waht the default network is and bind the socket to that network explicitly. This is racy because the defaulet network can change in between due to synchronous call and async callbacks. 
- by calling this API, platform also assumes the socket is always bound to the particular network, and expects the app to listen to network change events.

We might want to re-evaluate the API, and figure out the pre-requisites, make sure the switch is safe. 
 
Description: Show this description

Comment 2 by jri@chromium.org, Oct 13 2017

I think we're safe in the world where we use ConnectUsingDefaultNetwork(), but you can tell me if I'm wrong somewhere. Responses to the two risks you mention:

- ConnectUsingDefaultNetwork actually manages this race by calling GetDefaultNetwork() and ConnectUsingNetwork() twice. See https://cs.chromium.org/chromium/src/net/socket/udp_client_socket.cc?type=cs&sq=package:chromium&l=58.

- QUIC now listens for notifications, and specifically closes sessions when IP_ADDRESS_CHANGED is received. So we should be good with handling notifications on these sockets.

Comment 3 by lorenzo@google.com, Oct 13 2017

I don't know what the threading model is. Can it happen that the notification races with ConnectUsingDefaultNetwork?

Particularly it would be bad if the following happened:

1. NetworkChangeNotifier::GetDefaultNetwork() returns cell data.
2. Device switches to wifi.
3. Network change notified, all QUIC sessions closed. This socket is not closed because it's not active yet.
4. socket_.Connect() called, connects on cell data.

As zhongyi@ says, explicitly bound sockets are special. A socket that was explicitly bound to cell data will not be closed by the OS when the device switches to wifi. If the QUIC session is backing a large data transfer such as a video stream or photo upload, then the socket will never be closed by the OS and the device could end up using lots of metered cell data.

So we should definitely be careful here. From past experience, it is very difficult to see this sort of data spike in metrics.
Does the kernel kill UDP sockets?  I only see mention of TCP in http://androidxref.com/8.0.0_r4/xref/system/netd/server/SockDiag.cpp

Comment 5 by lorenzo@google.com, Oct 13 2017

Sorry, "close" is the wrong word for UDP sockets. What we do is make the socket unroutable such that all writes return ENETUNREACH.

The kernel can support SOCK_DESTROY for UDP sockets but we don't do it because the impact on apps is unknown.
Are only !FwMark.explicitlySelected made unroutable and FwMark.explicitlySelected sockets remain routable?

Could we fix this by changing ConnectUsingDefaultNetwork() to logic like:
  Network network = GetDefaultNetwork();
  int ret = socket_.Connect();
  if (ret != OK)
    return ret;
  if (network != GetDefaultNetwork())
    return ERR_NETWORK_CHANGED;  // TODO:create another socket and retry
  network_ = network;
  return OK;

Comment 7 by lorenzo@google.com, Oct 13 2017

Yes: explicitly selected sockets are untouched. Non-explicitly-selected sockets are made unroutable, and if TCP, closed via SOCK_DESTROY.

The proposed code does look safer. It's still racy if the default network flips back and forth between the two networks, like so:

  Network network = GetDefaultNetwork();  // Returns wifi
  // Default network switches to cell
  int ret = socket_.Connect();  // Connects on cell data
  // Default network switches back to wifi
  if (network != GetDefaultNetwork())  // No change detected
  network_ = network;

... but even if there's a race, since the socket is not explicitly bound to a netId, the platform will properly block traffic on it when needed.

GetDefaultNetwork is non-atomic and can return null NetId.INVALID if the default network changes while it's running (see bug 774497), so you'll probably want to bail out if the first call to it returns NetId.INVALID.
Ya, I ignored the flip-back-and-forth situation as I hope connectivity logic isn't so fickle.

I'm not sure this is something we need to act on with high priority for a few reasons:
1. This may or may not be a bug, depending on when QUIC logic adds new sockets to the session pool (which is flushed on network change).  If we add the new socket to the session pool before returning to the message loop (at which point a network change could be processed), then we're fine (because the network change will kill the session/socket).
2. If this is a bug, the window for the race triggering this bug may be very short.
3. I should mention that for at least 2 years up until very recently, no QUIC sessions/sockets were closed upon network change, see Issue 544156.  I understand the sockets may have been made unroutable, but, Lorenzo can correct me, I don't think this was the case in Lollipop.

Comment 9 by lorenzo@google.com, Oct 13 2017

Before O, the potential unintended data usage was limited since cell data would be torn down when the linger timeout expired. In O, by default mobile data is always on, so a socket that's explicitly bound to cell data will not be closed by the system.

The sockets were made unroutable in a post-O update.
I think there is a simpler implementation for ConnectUsingDefaultNetwork()
  socket_.Connect()
  getsockopt(socket_, SOL_SOCKET, SO_MARK, &mark, &marksize)
  network_ = mark & 0xffff;
I think we can actually remove ConnectUsingDefaultNetwork() and just set |network_| like this in the normal Connect().
I greatly prefer this solution to my proposal in comment #6 because it avoids all chance of a race.
I tested calling getsockopt(..., SOL_SOCKET, SO_MARK, ...) on linux and Android and it worked great on both.
I wonder if this actually provides an easy way to find the default network on Lollipop: create UDP socket, connect to 8.8.8.8, read bound network.

Comment 11 by lorenzo@google.com, Oct 27 2017

That relies on implementation details that are not part of the API surface or verified by conformance tests and thus subject to change across releases and across OEMs/devices.

If we had an android_getsocknetwork NDK method we could do something like this. However, it would still not be entirely correct, because the network that is set in the mark is not necessarily the network that the connection is on.

For example: a connection to loopback (or to an IP address that the device itself owns) will be marked with the mark of the default network when connect() is called, but will not go out on that network. A connection established when a VPN is up will be marked as being on the default network, but will actually go over the VPN. These are the reasons why we never added android_getsocknetwork: it wouldn't be correct in all cases.

That said - if we passed the FD to netd, we could at least deal with the VPN case, because netd knows which VPNs apply to which users. Maybe we can deal with the other corner case (cases?) with better documentation.

I do prefer the solution in #7 though - it's much simpler, and even if it's not 100% correct in all cases, it fails safely, unlike the current code of ConnectUsingDefaultNetwork.
The problem with the solution in comment #6 and #7 lies within GetDefaultNetwork().  GetDefaultNetwork() is currently a cached operation in Chromium, so updates may be delayed as they percolate from one thread to another, so it may be out of date.  I could surface a synchronous GetDefaultNetwork() call but it may be extremely slow:
getDefaultNetId (https://cs.chromium.org/chromium/src/net/android/java/src/org/chromium/net/NetworkChangeNotifierAutoDetect.java?rcl=8bffda4cd007e678135806ec0861f046cb378e02&l=331) calls a variety of ConnectivityManager methods, getting all networks, getting all network capabilities, for VPNs it attempts to bind a UDP socket to them to see if they apply to our UID, getting the type of the default network, getting the type of all networks etc.

> For example: a connection to loopback (or to an IP address that the device itself owns) will be marked
> with the mark of the default network when connect() is called, but will not go out on that network.

This could be solved by creating an additional UDP socket, binding it to the network we think the original socket got bound to, then connect()ing it to the intended destination and seeing if it fails.  I'm not sure the solution in comment #6 and #7 handles this correctly either, e.g.  Issue 589598 , I think it just fails the request.

VPNs are more complicated because:
1. the getsockopt() method doesn't work
2. the solution in comment #6 and #7 doesn't actually bind the socket to the VPN, so if the VPN goes away I think the socket may keep working perhaps?


Sign in to add a comment