shill: re-enable IPv6 support for cellular connectivity |
|||||||||||
Issue descriptionIPv6 support for cellular connectivity is currently disabled in shill. We should re-enable it.
,
Aug 1 2017
,
Aug 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/5aded2080aa5a1e27a1ec31ca027a90a1db1a888 commit 5aded2080aa5a1e27a1ec31ca027a90a1db1a888 Author: Eric Caruso <ejcaruso@chromium.org> Date: Wed Aug 02 19:47:49 2017 linux-headers: Export SOCK_DESTROY macro This is used by the SOCK_DIAG netlink subsystem to kill connections. We want to be able to use this functionality from the Chrome OS userspace so we can work around the cellular IPv6 issues. It first appears in kernel 4.5 so once linux-headers is uprevved again, this patch will no longer be necessary. BUG=chromium:726815 TEST=emerge, check /build/${BOARD}/usr/include Change-Id: I2b3fc3ab5945b5a6a9f9cbfe1e7700d71363f039 Reviewed-on: https://chromium-review.googlesource.com/592069 Commit-Ready: Eric Caruso <ejcaruso@chromium.org> Tested-by: Eric Caruso <ejcaruso@chromium.org> Reviewed-by: Kevin Cernekee <cernekee@chromium.org> [rename] https://crrev.com/5aded2080aa5a1e27a1ec31ca027a90a1db1a888/sys-kernel/linux-headers/linux-headers-4.4-r6.ebuild [add] https://crrev.com/5aded2080aa5a1e27a1ec31ca027a90a1db1a888/sys-kernel/linux-headers/files/0017-BACKPORT-net-diag-Add-the-ability-to-destroy-a-socke.patch [modify] https://crrev.com/5aded2080aa5a1e27a1ec31ca027a90a1db1a888/sys-kernel/linux-headers/linux-headers-4.4.ebuild
,
Sep 1 2017
,
Oct 26 2017
Issue 724904 has been merged into this issue.
,
Nov 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/aosp/platform/system/connectivity/shill/+/67afba3f14870bdd59f06d1f6a2d62bdd700361c commit 67afba3f14870bdd59f06d1f6a2d62bdd700361c Author: Eric Caruso <ejcaruso@chromium.org> Date: Wed Nov 01 04:49:23 2017 shill: add DeviceId and expose on Cellular device objects This allows us to implement a quirks layer, and also allows users of Cellular-type Device objects to obtain device ID information without digging through ModemManager objects and sysfs themselves. CQ-DEPEND=CL:737965 BUG=b:37953355,chromium:726815 TEST=emerge, deploy, get Device properties Change-Id: Ie6e8d039fd521c4dfb553cc12f4637bbcd23ece6 Reviewed-on: https://chromium-review.googlesource.com/737932 Commit-Ready: Eric Caruso <ejcaruso@chromium.org> Tested-by: Eric Caruso <ejcaruso@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> [add] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/device_id.h [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/cellular/cellular_capability_universal.cc [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/shill.gyp [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/cellular/cellular.h [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/cellular/cellular.cc [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/device.cc [add] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/device_id.cc [modify] https://crrev.com/67afba3f14870bdd59f06d1f6a2d62bdd700361c/cellular/cellular_capability_universal.h
,
Jan 19 2018
Issue 213414 has been merged into this issue.
,
Jan 19 2018
,
Jan 30 2018
,
Sep 6
,
Sep 6
,
Oct 3
,
Oct 5
Taking over on this.
A quick update on where things are:
Enabling IPv6 for cellular on Nautilus leads to an issue that can be reproduced as follows:
1. Enable only LTE
2. Start a long-running network connection (e.g. start a Youtube video)
3. Put the device to sleep, then wake it up
4. Device will reconnect to LTE with a different IPv4 and IPv6 address
5. ~5 seconds later, device will disconnect from the network, then reconnect, again with a different IPv4 and IPv6 address
6. If step 5 has not occurred ~7 times, goto 5.
7. Finally establish a working connection.
We are assuming the issue is that existing sockets are using old IP addresses, which leads the network to kick us off (perhaps it thinks we are trying to perform an IP spoofing attack). Presumably a P-GW that uses static IP allocation would not have this issue, although I have not verified this.
Eric's second WIP CL serves to kill all sockets using the wwan interface when going to sleep in order to avoid the above scenario. As it stands, the issue still occurs (though looking at the logs and running `cat /proc/net/{tcp,udp}` does show that at least some sockets were closed when the device was put to sleep).
Open question on my end:
* Even without IPv6 enabled, putting the device to sleep and then waking up leads to getting a different IP address. Why can't we reproduce this issue without IPv6 if the root cause is the network seeing a mismatch between advertised and allocated source IP addresses?
,
Oct 17
,
Oct 17
Quickly talked with Ben about this issue today. If we decide the best way to deal with the new IPs is to kill old sockets, then +1 to using SOCK_DESTROY. Android uses SOCK_DESTROY to kill udp and tcp sockets bound to particular Android networks when those networks disconnect, and in some other scenarios. I cherry picked all the necessary kernel CLs to do that on 3.18, 4.4, 4.14. 3.14 status is unknown (I haven't checked) but since 3.14 passes Android N cts tests my expectation would be that SOCK_DESTROY TCP would work there at least. For reference, the Android code that does that is in netd in SockDiag.cpp: https://cs.corp.google.com/android/system/netd/server/SockDiag.cpp However, more generally anytime the IP addresses change we should consider this is a new network connection. Isn't there a event or msg we can send to Chrome to tell it that it has to close all its existing socket because the network has changed ? I haven't had time to take a look at Eric's WIP CLs yet, but will do. > reproduce this issue without IPv6 if the root cause is the network seeing a mismatch between advertised and allocated source IP addresses? It is possible that the LTE network anti spoofing code only kicks in for IPv6, maybe.
,
Oct 17
Re: "more generally anytime the IP addresses change we should consider this is a new network connection". That's not true, because it's common to have multiple addresses on the same network. One obvious example is privacy addresses: they rotate every 24 hours, and the previous ones remain valid.
,
Oct 17
Thank you for the info Hugo. I agree that using SOCK_DESTROY is the cleanest solution. Eric's second WIP CL properly kills all of the system's TCP sockets, and with small modifications kills all UDP and TCP sockets. When TCP sockets are killed using SOCK_DESTROY, however, it seems that a TCP termination handshake is performed, rather than simply destroying the socket. I have seen this behavior not only with Eric's CL, but also with a version of SockDiag modified to not use any Android dependencies and with `ss -ut -K`. Since the termination handshake is performed, a problem I am seeing is as follows: 1. Client disconnects from network before completing the termination handshake for all open TCP sockets 2. Client successfully connects to LTE network 3. Old TCP sockets in FIN-WAIT-1 state send FIN packets with old IP address. 4. Get kicked off network for "spoofing" Once we disconnect from the network in step 1, manually calling `ss -t K` does not kill the sockets in FIN-WAIT-1. The sockets disappear eventually (we have /proc/sys/net/ipv4/tcp_orphan_retries set to 0, which special-cases to 8 retries), but not before making the network unusable for a good period of time. Manually disconnecting and reconnecting to LTE only exacerbates the issue, as it simply adds more TCP sockets in the FIN-WAIT-1 state. On the other hand, connecting to a network that does not kick us off for spoofing (in my case connecting to the test network over ethernet) allows for all the termination handshakes to complete successfully and leads to the old TCP sockets closing in a timely manner. As a note, steps 1 and 2 do not need to occur due to sleeping and waking up. If we were connected to any network and then connected to LTE (even if we disconnected from LTE and reconnected), the problem can be reproduced. As such we can't rely on having sufficient time to perform all of the termination handshakes prior to the initial disconnect. > Isn't there a event or msg we can send to Chrome to tell it that it has to close all its existing socket because the network has changed ? I am not sure about this. Would such a mechanism allow for the closing of system and ARC++ sockets as well?
,
Oct 18
I took a look at https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/592690 I think the basis is good, but there are some things that need to be improved, most notably: - we should kill both v4 and v6, and both tcp and udp - we should ensure we are killing sockets for a specific interface more reliably. I suggest trying to use idiag_if field of struct inet_diag_sockid - error handling should be better For monitoring sockets getting killed I would suggest to use $ ss -a -6, or $ losf -i6. lsof has the -p option for filtering output on specific pid, and lsof has the -e option to show uid owner of sockets, so you can monitor just Chrome sockets.
,
Oct 18
Just for context, the tcp destroy handler is here: https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_diag.c#L147 which seems to go into tcp_abort: https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp.c#L3721 For the TCP termination handshake, I think Android works around that problem in disconnect scenarios by also tearing down the routing. We could definitely do that too, with an ip rule on iff lo oif rmnet0 that blackholes the originated traffic going out to cell. That unfortunately would leave the peer remote sockets "dangling" until they timeout. The important thing would be to give a slight grace period to send all FINs we need, after that ignoring the peers ACKs does't matter anyway, and we would just not ACK the peers FINs. However in a scenario where you quickly resume operations on cell, the race you mentioned could still happen. I was wondering if the spoofing detector could not just gracefully ignore and drop FIN and ACK packets with incorrect IPs. Have you actually seen 3. and 4. happen on a real lte network ? > Would such a mechanism allow for the closing of system and ARC++ sockets as well? For ARC++, if we send a "network disconnect" event via net.mojom, then Android's netd will do all the same operations of tearing down the routing and SOCK_DESTROYing the sockets. But all the possible scenarios as sligthly unclear to me as we have to consider existing ARC without multinetworking that only uses the default service. Anyway the tl;dr for ARC is that we just have to notify ARC that cell is going down, and the Android stack will take care of sockets for us.
,
Oct 18
Note that Android does not currently use SOCK_DESTROY for UDP sockets. Anecdotal evidence from upstream is that many applications, when receiving ECONNABORTED on a UDP socket, just retry the operation and get into a hot loop.
,
Oct 18
> Anecdotal evidence from upstream is that many applications, when receiving ECONNABORTED on a UDP socket, just retry the operation and get into a hot loop. Very good to know. Do you know if Android employs some other method to prevent applications using UDP sockets from sending packets with outdated source IP addresses? > ip rule on iff lo oif rmnet0 that blackholes the originated traffic going out to cell If I understand this correctly, this would prevent any traffic from going through the rmnet0 interface. It would be nice if we could set up the routing table to only blackhole traffic that doesn't match the interface's IP, but that seems overly complicated. Setting up a source NAT with `ip6tables -t nat -A POSTROUTING -o wwan0 -j MASQUERADE` has been effective in fixing this issue (FIN-WAIT-1 sockets must still timeout, but at least they don't cause us to get kicked off of the network). A concern I have with using a source NAT in the actual solution is future debugging scenarios involving outgoing traffic with multiple source IPs. I am unsure how common this is on client machines as opposed to routers, but since the NAT will occur prior to packets reaching tcpdump, such debugging may be more difficult without first removing the NAT. Any thoughts on this? > Have you actually seen 3. and 4. happen on a real lte network? Yes, we still get disconnects where the only outgoing IPv6 traffic in which the source IP doesn't match the interface IP is FIN packets from old SOCK_DESTROYed sockets. Since manually setting up a source NAT was effective in resolving this issue, it makes sense that the FIN packets are causing the problem.
,
Oct 18
Update: From an in-person discussion with @benchan, we decided that sending all wwan interface traffic to a blackhole and whitelisting only the source IPs assigned to the interface was the best option. Preliminary changes in shill seem to have solved this issue. CLs will be ready for review soon.
,
Oct 19
We do not do anything special to stop apps from using outdated IP addresses or addresses that are not appropriate for the network they're sending on. It's possible that an app using an unconnected UDP socket could still send a packet on cell with the wifi IP address. It's also possible for an app to bind() to the wifi IP address and send the packet on cell data. We've never used iptables to drop these packets because doing so would be racy. IPv6 addresses are added and removed all the time so it's impossible to keep iptables rules current without races. Please don't do NAT on the device as that breaks end-to-end connectivity. Additionally, if you NAT the Android container's IPv6 address you will fail the CTS test that checks for IPv6 end-to-end connectivity. ISTM that the root of the problem here is that the kernel is sending out packets with source addresses that are no longer assigned to any interface. Do you have proof that this is occurring? If so, that seems like a bug that should be fixed, period.
,
Oct 19
Whitelisting the IP addresses assigned to the interface has the problem that as soon as the network connects and an IP address is assigned, apps will try to use it, and will race with the iptables operation that is trying to add the whitelist. If the app wins the race then it will either get an error or the first SYN packet will be dropped, delaying the connection by about 1 second.
,
Oct 19
> ISTM that the root of the problem here is that the kernel is sending out packets with source addresses that are no longer assigned to any interface. Do you have proof that this is occurring? If so, that seems like a bug that should be fixed, period. Yes, that is the fundamental problem. It seems like even using SOCK_DESTROY would be unnecessary for this issue if the kernel will reject a packet whose source IP does not match the IPs of any interface. I could imagine that some packets might still get transmitted with an outdated source IP if they were placed on a queue somewhere in the network stack after the point where the kernel verifies the source IP. The FIN issue from #18, however, seems like the FIN-WAIT-1 sockets with old IPs are timing out and actively resending a FIN packet without any complaints from the kernel. > Whitelisting the IP addresses assigned to the interface has the problem that as soon as the network connects and an IP address is assigned, apps will try to use it, and will race with the iptables operation that is trying to add the whitelist. If the app wins the race then it will either get an error or the first SYN packet will be dropped, delaying the connection by about 1 second. The whitelisting in this case doesn't use iptables, but is performed when setting up the routing tables. Shill already does whitelisting for uids and interfaces. Honestly, I have to do more research to really understand our entire connection flow, but if this race does exist, then it exists in various scenarios with or without the addition of source IP whitelisting.
,
Nov 15
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/d1461724f25140885538f288e46833b22355978d commit d1461724f25140885538f288e46833b22355978d Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Thu Nov 15 10:16:32 2018 shill: net: Allow for the destruction of TCP and UDP sockets This change allows for sockets to be destroyed on the basis of protocol (either TCP or UDP) and source IP address. The NetlinkSockDiag implementation is largely unchanged from ejcaruso's original CL:592690, which was already functional (although it was unexpected that SOCK_DESTROY's behavior for TCP was to perform the TCP termination handshake rather than just to immediately destroy the socket). BUG=chromium:726815 TEST=Manual testing involving the destruction of TCP and UDP sockets using our interface's IPv4 and IPv6 addresses. `ss -tu` shows the immediate destruction of UDP sockets, and the forced initiation of the TCP termination handshake on TCP sockets. Change-Id: I868cf65aacec5dde85f173008b0b5db14e4bbf47 Reviewed-on: https://chromium-review.googlesource.com/1319982 Commit-Ready: Alex Khouderchah <akhouderchah@chromium.org> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Alex Khouderchah <akhouderchah@chromium.org> [add] https://crrev.com/d1461724f25140885538f288e46833b22355978d/shill/net/netlink_sock_diag.cc [add] https://crrev.com/d1461724f25140885538f288e46833b22355978d/shill/net/netlink_sock_diag.h [modify] https://crrev.com/d1461724f25140885538f288e46833b22355978d/shill/BUILD.gn
,
Dec 6
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/108bc08b5bf05295a846f510f0470968766aa27d commit 108bc08b5bf05295a846f510f0470968766aa27d Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Thu Dec 06 02:28:49 2018 shill: Add data structure to contain elements with finite lifetimes There are various scenarios in which we want to maintain a set of elements that have finite lifetimes, such as expiring routing table entries or dns mappings. The TimeoutSet class serves this purpose, taking care of removing elements from the set when they expire, and optionally informing the client when elements expire. BUG=chromium:726815 TEST=- Manual testing with integer and IPAddress element types - All unit tests are passing Change-Id: Ia38cada01343c364acf00d5161f8906aa01ce20a Reviewed-on: https://chromium-review.googlesource.com/1319983 Commit-Ready: Eric Caruso <ejcaruso@chromium.org> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Eric Caruso <ejcaruso@chromium.org> [add] https://crrev.com/108bc08b5bf05295a846f510f0470968766aa27d/shill/timeout_set.h [modify] https://crrev.com/108bc08b5bf05295a846f510f0470968766aa27d/shill/BUILD.gn [add] https://crrev.com/108bc08b5bf05295a846f510f0470968766aa27d/shill/timeout_set_test.cc
,
Dec 9
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/e4bbdcb6ebb216527cce02a8945b385ec66141ff commit e4bbdcb6ebb216527cce02a8945b385ec66141ff Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Sun Dec 09 08:48:17 2018 shill: net: Overload operator<< for IPAddress The IPAddress class already provides a ToString method. This change allows for the use of the << operator with the class. BUG=chromium:726815 TEST=Manual testing with calls to operator<< involving IPAddresses. Change-Id: I72469724fe048fdf1eae0953e44915e20ba6addb Reviewed-on: https://chromium-review.googlesource.com/1334109 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Eric Caruso <ejcaruso@chromium.org> [modify] https://crrev.com/e4bbdcb6ebb216527cce02a8945b385ec66141ff/shill/net/ip_address.h
,
Dec 15
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/c49bc931b6f2b9cd168aa6bec61e5272796d117d commit c49bc931b6f2b9cd168aa6bec61e5272796d117d Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Sat Dec 15 06:40:37 2018 shill: net: Allow for blacklisting of outgoing traffic using src IP This change adds functionality such that outgoing traffic may be blacklisted for a certain amount of time based on the source IP of packets being sent. BUG=chromium:726815 TEST=Manual testing with cellular, in which we blacklist existing IP addresses and are unable to connect. When we get a new set of IP addresses from the MNO, we can again connect to the internet. Change-Id: Icd5009f56ba0c573009ad944ab298cde7ccf4c2f Reviewed-on: https://chromium-review.googlesource.com/1319984 Commit-Ready: Alex Khouderchah <akhouderchah@chromium.org> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> Reviewed-by: Eric Caruso <ejcaruso@chromium.org> [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/connection.cc [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/connection.h [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/device.cc [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/device.h [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/ipconfig.cc [modify] https://crrev.com/c49bc931b6f2b9cd168aa6bec61e5272796d117d/shill/ipconfig.h
,
Dec 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/89a721bc626d812a9fc63a75e93d5670b6bd7c3d commit 89a721bc626d812a9fc63a75e93d5670b6bd7c3d Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Tue Dec 18 04:47:03 2018 shill: cellular: add IPv6 whitelist In the past, IPv6 has been unconditionally disabled on cellular due to issues with suspend/resume in which a device receives a new IPv6 address on resume but has applications which continue using the old IPv6 address. This change does not address that issue, but simply whitelists the L850-GL modem for use with IPv6. The change is unmodified from @ejcaruso's original CL:740577. BUG=chromium:726815 TEST=Deploying `USE=dhcpv6 emerge-nocturne shill` leads to the DUT getting an IPv6 device on the wwan interface. Putting the DUT to sleep and resuming indeed leads to a cycle of connecting and disconnecting to the LTE network for a few minutes before connecting. Change-Id: I7d71a42dc53b0f695692e4eab02b36fa34046b80 Reviewed-on: https://chromium-review.googlesource.com/1319981 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Alex Khouderchah <akhouderchah@chromium.org> [modify] https://crrev.com/89a721bc626d812a9fc63a75e93d5670b6bd7c3d/shill/cellular/cellular.cc [modify] https://crrev.com/89a721bc626d812a9fc63a75e93d5670b6bd7c3d/shill/cellular/cellular_capability_universal.h [modify] https://crrev.com/89a721bc626d812a9fc63a75e93d5670b6bd7c3d/shill/cellular/cellular_capability_universal.cc
,
Dec 18
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/cd70d338adc0af8b5f414b9de67a8b4dca2d7d31 commit cd70d338adc0af8b5f414b9de67a8b4dca2d7d31 Author: Alex Khouderchah <akhouderchah@chromium.org> Date: Tue Dec 18 08:42:44 2018 shill: cellular: Prevent outgoing traffic with stale source IPs As network operators do their part to improve the internet by deploying anti-spoofing systems, we see a connectivity issue come up as follows: * application connects using socket that gets bound to the appropriate current source IP address * system is provisioned new IP addresses * application continues to use socket, which results in the sending of packets with a source IP that we no longer have * network operator kicks us off the network for spoofing This issue is common when our MNO uses dynamic IP allocation, as we can expect to get a different IP address every time we connect to the network, regardless of when we were last provisioned a set of IP addresses. This CL employs a two-part solution to solve this issue. First, it uses NetlinkSockDiag to kill all of the TCP sockets on our interface prior to going to sleep. Secondly, it blackholes all outgoing traffic that does not match our interface's current IP addresses. BUG=chromium:726815 TEST=Manually attempted to reproduce the bug after applying this change. Previous repro steps no longer cause connectivity issues. Change-Id: Ib908237ec6355d3147655091ad3769100c6d15c7 Reviewed-on: https://chromium-review.googlesource.com/1319985 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Alex Khouderchah <akhouderchah@chromium.org> Reviewed-by: Eric Caruso <ejcaruso@chromium.org> [modify] https://crrev.com/cd70d338adc0af8b5f414b9de67a8b4dca2d7d31/shill/cellular/cellular.cc [modify] https://crrev.com/cd70d338adc0af8b5f414b9de67a8b4dca2d7d31/shill/cellular/cellular.h |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by bugdroid1@chromium.org
, Jul 26 2017