mysteriously invisible bridge
Issue description

This was found and reported by Simran at issue 609610. We're "fixing" that one by backing out the shill uprev.

These are the symptoms (from #11):
-------------------------------------
Here is the CL I suspect: https://android-review.googlesource.com/#/c/214451/

To summarize, the issue is:
* An init script creates a network bridge and restarts shill (twice).
* After everything is initialized, the network bridge is not listed under ifconfig.
* Trying to create a bridge with the same name after boot complains that it exists.
* Interestingly, if I create a bridge with a different name, it works.

On the same device 8282.0.0 is fine; if I flash to 8283.0.0 the problem occurs. The kernel is 3.14.0.
,
May 6 2016
,
May 6 2016
,
May 6 2016
Ramya needs to uprev cros shill to AOSP shill because some CLs that Ramya added are urgently needed. We are wondering if it would be terribly disruptive to revert Garret's change on AOSP. But first we should test and see if the problem goes away when reverting the change. Simran, can you help with that if I give you a shill binary without that change? Or can you help me test it? Thanks!
,
May 6 2016
The easiest way for us to test this would be:
* Revert Garret's change (if we're pretty sure it's his) from AOSP.
* Create the CL to do the uprev.
* Trybot the uprev CL for guado_moblab-paladin with the --hwtest flag passed in.
Let me know your thoughts.
,
May 6 2016
I'm not convinced that my change is the root of the issue yet. Simran/Luigi, when you're running ifconfig are you running `ifconfig -a`? Does `brctl show` list any bridges?
,
May 6 2016
Let me reload a bad build onto a device in the lab and I'll take a look and give you the hostname if you want to poke around.
,
May 6 2016
That'd be great, thanks. That change fixed a few impactful bugs in jetstream so I'm not keen to just revert it.
,
May 6 2016
ssh root@chromeos2-row5-rack10-host11.cros
localhost ~ # ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.18.186.227 netmask 255.255.254.0 broadcast 172.18.187.255
inet6 fe80::2e60:cff:fea9:6aa9 prefixlen 64 scopeid 0x20<link>
ether 2c:60:0c:a9:6a:a9 txqueuelen 1000 (Ethernet)
RX packets 2348 bytes 449047 (438.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 674 bytes 115126 (112.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 80:3f:5d:9f:73:5d txqueuelen 1000 (Ethernet)
RX packets 917 bytes 55862 (54.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 45 bytes 2964 (2.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 45 bytes 2964 (2.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
localhost ~ # ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.18.186.227 netmask 255.255.254.0 broadcast 172.18.187.255
inet6 fe80::2e60:cff:fea9:6aa9 prefixlen 64 scopeid 0x20<link>
ether 2c:60:0c:a9:6a:a9 txqueuelen 1000 (Ethernet)
RX packets 2114 bytes 429374 (419.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 568 bytes 101618 (99.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 80:3f:5d:9f:73:5d txqueuelen 1000 (Ethernet)
RX packets 864 bytes 49761 (48.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 45 bytes 2964 (2.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 45 bytes 2964 (2.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lxcbr0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 80:3f:5d:9f:73:5d txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2 bytes 140 (140.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlan0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether d8:fc:93:c6:2f:ff txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
localhost ~ # brctl show
bridge name     bridge id               STP enabled     interfaces
lxcbr0          8000.803f5d9f735d       no              eth1
OK, so interestingly it shows up when I pass -a. Prior to this it showed up if I just typed ifconfig, though.
My shill knowledge is minimal, so Garret, given the list of AOSP CLs here: https://chromium-review.googlesource.com/#/c/341463/ is there something else that stands out? We know there is a bad CL in this uprev, just not sure which...
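The difference between the two listings above comes down to one bit in the `flags=` value: eth0 reports 4163 (UP set) while lxcbr0 reports 4098 (UP clear), and plain `ifconfig` only lists interfaces that are up. A minimal sketch of that check follows; the helper names `flags_has_up` and `describe_iface` are made up for illustration, and the flag values are copied from the output above.

```shell
#!/bin/sh
# Decode the numeric "flags=" value printed by ifconfig.
# IFF_UP is bit 0x1 of the kernel's interface flags.
IFF_UP=1

# flags_has_up FLAGS -> exit status 0 if the UP bit is set.
# (Hypothetical helper, not part of any existing script.)
flags_has_up() {
    [ $(( $1 & IFF_UP )) -ne 0 ]
}

# describe_iface NAME FLAGS -> say which ifconfig invocation lists it.
describe_iface() {
    if flags_has_up "$2"; then
        echo "$1: up (listed by plain ifconfig)"
    else
        echo "$1: down (only listed by ifconfig -a)"
    fi
}

# Values copied from the ifconfig -a listing above.
describe_iface eth0 4163
describe_iface lxcbr0 4098
```

On a live device the same flags value should be readable (in hex) from /sys/class/net/lxcbr0/flags, which avoids parsing ifconfig output.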
,
May 6 2016
And here is the init script that is supposed to set everything up: https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf Essentially it creates a network bridge, launches dhcpd, then attaches the USB-Ethernet dongle to the bridge, with a couple of shill restarts along the way that blacklist the bridge and wireless.
,
May 6 2016
Nothing in that script brings the bridge interface up, which seems like a bug to me. Can you add an `ifconfig ${DHCPD_IFACE} up` at line 53 of that script and retry? I'm ssh'd into the machine, but I don't know if anyone else is using it right now.
,
May 6 2016
Or add "up" to the end of line 56, which is cleaner. :P
,
May 6 2016
There's a race between the moblab-network-bridge script and shill. Judging by the other 5s sleep in that script, it's not the first race that this script has had to deal with.
What follows is a patch that papers over the issue, and is generally a bad approach. Except for bringing up the bridge interface, that's a good idea. :P
diff --git a/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf b/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
index 347c3e7..3551ba2 100644
--- a/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
+++ b/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
@@ -46,6 +46,9 @@ script
logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted"
restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}
+ # Wait for shill to be on its feet before creating the bridge.
+ sleep 5
+
# Bring up the network bridge and set forward delay to 0.
logger -t "${UPSTART_JOB}" "Bringing up network bridge ${DHCPD_IFACE}"
brctl addbr ${DHCPD_IFACE}
@@ -53,7 +56,7 @@ script
# Configure server IP address with ${SERVER_ADDRESS}.
logger -t "${UPSTART_JOB}" "setting server IP address to ${SERVER_ADDRESS}"
- ifconfig ${DHCPD_IFACE} ${SERVER_ADDRESS} netmask ${SERVER_NETMASK}
+ ifconfig ${DHCPD_IFACE} ${SERVER_ADDRESS} netmask ${SERVER_NETMASK} up
# Start the dhcpd server on MobLab. It needs the DHCPD_IFACE piped in because
# on stumpy_moblab this value is not static. See moblab-network-init for more
,
May 6 2016
+pstew Is there a better way to wait for shill to be on its feet than the sleep?
,
May 6 2016
You can wait for it to export (over D-Bus) the device you're waiting to see, you can wait for it to claim its D-Bus name, etc. It is going to take some digging to figure out the exact nature of this race; my diff was just to prove that there is one.
,
May 6 2016
I agree that D-Bus is the best way. You can use dbus-send, talk to D-Bus directly from Python, or use the "list_devices" script from shill-testing to query shill's view of the device list.
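As a rough illustration of the D-Bus suggestion, here is a sketch that polls until shill answers instead of sleeping for a fixed 5 seconds. The `wait_for` and `shill_is_ready` helpers are hypothetical, and the bus name `org.chromium.flimflam`, object path `/`, and `Manager.GetProperties` method are assumptions about shill's D-Bus interface that should be checked on the target build.

```shell
#!/bin/sh
# wait_for LIMIT_SECS CMD [ARGS...]
# Re-run CMD about once per second until it succeeds or LIMIT_SECS elapse.
# Returns 0 on success, 1 on timeout. (Hypothetical helper.)
wait_for() {
    limit="$1"; shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "${elapsed}" -ge "${limit}" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Assumption: shill owns org.chromium.flimflam on the system bus and its
# Manager object at "/" responds to GetProperties once it is running.
shill_is_ready() {
    dbus-send --system --print-reply --dest=org.chromium.flimflam \
        / org.chromium.flimflam.Manager.GetProperties
}

# Hypothetical use in the init script, in place of the fixed "sleep 5":
#   wait_for 30 shill_is_ready || logger -t "${UPSTART_JOB}" "shill not ready"
```

This only waits for shill to respond at all. Waiting for a specific device to be exported, as suggested above, would need a stricter readiness check in place of `shill_is_ready`.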
,
May 7 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/17c0b8fe2abbaf242cf5251c023ea1bcfa006872

commit 17c0b8fe2abbaf242cf5251c023ea1bcfa006872
Author: Simran Basi <sbasi@google.com>
Date: Fri May 06 22:47:14 2016

moblab: Add short sleep after shill restart.

There appears to be a race between shill restarting and bringing up the network bridge. A short sleep alleviates this problem.

Also brings up the network bridge via ifconfig.

BUG= chromium:609852
TEST=local moblab setup.

Change-Id: Ia18516a274fa6902b3259e4666f2e4d6172282f9
Reviewed-on: https://chromium-review.googlesource.com/343230
Commit-Ready: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/17c0b8fe2abbaf242cf5251c023ea1bcfa006872/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
[rename] https://crrev.com/17c0b8fe2abbaf242cf5251c023ea1bcfa006872/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r30.ebuild
,
May 9 2016
Now that sbasi's change has landed, should we try the shill uprev again? https://chromium-review.googlesource.com/#/c/341463/ Eric or Kirtika, can you please do this since Luigi is out?
,
May 9 2016
I'll take care of it.
,
May 10 2016
I put up the reland (CL:343547) but it still seems to be having issues with moblab devices.
,
May 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/7acf6828f50ae8faacc1817b0a112a1939562bf4

commit 7acf6828f50ae8faacc1817b0a112a1939562bf4
Author: Simran Basi <sbasi@google.com>
Date: Tue May 10 22:10:18 2016

moblab: Stop & start shill around lxcbr0 initialization

Instead of restarting shill multiple times when initializing the moblab network bridge, simply stop it prior to the setup and start it afterwards.

BUG= chromium:609852
TEST=trybot run and local moblab test.

Change-Id: Ia3cc793c0ffdc41bae7abf7098ea63b69d0fe11a
Reviewed-on: https://chromium-review.googlesource.com/344030
Commit-Ready: Eric Caruso <ejcaruso@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Reviewed-by: Garret Kelly <gdk@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[rename] https://crrev.com/7acf6828f50ae8faacc1817b0a112a1939562bf4/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r31.ebuild
[modify] https://crrev.com/7acf6828f50ae8faacc1817b0a112a1939562bf4/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
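The ordering described in that commit message can be sketched as a dry run. This is not the committed script: the `run` wrapper only prints each command, and the address, netmask, and blacklist values are hypothetical placeholders. Only `lxcbr0`, the `BLACKLISTED_DEVICES` variable, and the `brctl`/`ifconfig` steps come from the thread above.

```shell
#!/bin/sh
# Dry-run sketch: "run" only prints each command so the ordering can be
# inspected on any machine. Swap the echo for "$@" to execute for real.
run() {
    echo "$*"
}

DHCPD_IFACE="lxcbr0"
SERVER_ADDRESS="192.0.2.1"          # hypothetical placeholder address
SERVER_NETMASK="255.255.255.0"      # hypothetical placeholder netmask
BLACKLISTED_DEVICES="lxcbr0,wlan0"  # hypothetical blacklist contents

setup_bridge() {
    # Stop shill first so it cannot race with bridge creation.
    run stop shill
    # Create the bridge and set its forward delay to 0.
    run brctl addbr "${DHCPD_IFACE}"
    run brctl setfd "${DHCPD_IFACE}" 0
    # Assign the server address and bring the bridge up in one step.
    run ifconfig "${DHCPD_IFACE}" "${SERVER_ADDRESS}" netmask "${SERVER_NETMASK}" up
    # Start shill once, after the bridge exists and is up.
    run start shill "BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}"
}

setup_bridge
```

Compared with the earlier sleep-based patch, this removes the window in which a restarting shill and the bridge setup run concurrently, rather than hoping the window has closed after a fixed delay.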
,
May 19 2016
,
Feb 20 2018
Bulk verify old 'fixed' bugs.
Comment 1 by semenzato@chromium.org, May 6 2016
Status: Available (was: Untriaged)