Issue 609852

Issue metadata

Status: Verified
Closed: May 2016
Pri: 1
Type: Bug

mysteriously invisible bridge

Project Member Reported by semenzato@chromium.org, May 6 2016

Issue description

This was found and reported by Simran in issue 609610. We're "fixing" that one by backing out the shill uprev.

These are the symptoms (from #11)
-------------------------------------

Here is the CL I suspect https://android-review.googlesource.com/#/c/214451/

To summarize, the issue is:
* An init script creates a network bridge and restarts shill (twice).
* After everything is initialized, the network bridge is not listed by ifconfig.
* Trying to create a bridge with the same name after boot complains that it already exists.
* Interestingly, if I create a bridge with a different name, it works.

On the same device, build 8282.0.0 is fine; after flashing 8283.0.0, the problem occurs.

The kernel is 3.14.0
 
Cc: kirtika@chromium.org snanda@chromium.org
Status: Available (was: Untriaged)
Cc: ejcaruso@chromium.org
Components: OS>Systems>Network
Labels: -Pri-2 M-52 Pri-1
Owner: gdk@chromium.org
Ramya needs to uprev cros shill to AOSP shill because it picks up some urgently needed CLs that Ramya added.  We are wondering whether it would be terribly disruptive to revert Garret's change on AOSP.  But first we should test whether the problem goes away with that change reverted.  Simran, can you help with that if I give you a shill binary without the change?  Or can you help me test it?  Thanks!

Comment 5 by sbasi@chromium.org, May 6 2016

The easiest way for us to test this would be:

* Revert Garret's change (if we're pretty sure it's his) from AOSP.
* Create the CL to do the uprev.
* Trybot the uprev CL for guado_moblab-paladin with --hwtest flag passed in.

Let me know your thoughts.

Comment 6 by gdk@chromium.org, May 6 2016

I'm not convinced that my change is the root of the issue yet.

Simran/Luigi, when you're running ifconfig are you running `ifconfig -a`?  Does `brctl show` list any bridges?

Comment 7 by sbasi@chromium.org, May 6 2016

Let me reload a bad build onto a device in the lab and I'll take a look and give you the hostname if you want to poke around.

Comment 8 by gdk@chromium.org, May 6 2016

That'd be great, thanks.  That change fixed a few impactful bugs in jetstream so I'm not keen to just revert it.

Comment 9 by sbasi@chromium.org, May 6 2016

 ssh root@chromeos2-row5-rack10-host11.cros

localhost ~ # ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.186.227  netmask 255.255.254.0  broadcast 172.18.187.255
        inet6 fe80::2e60:cff:fea9:6aa9  prefixlen 64  scopeid 0x20<link>
        ether 2c:60:0c:a9:6a:a9  txqueuelen 1000  (Ethernet)
        RX packets 2348  bytes 449047 (438.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 674  bytes 115126 (112.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 80:3f:5d:9f:73:5d  txqueuelen 1000  (Ethernet)
        RX packets 917  bytes 55862 (54.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 45  bytes 2964 (2.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 45  bytes 2964 (2.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

localhost ~ # ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.186.227  netmask 255.255.254.0  broadcast 172.18.187.255
        inet6 fe80::2e60:cff:fea9:6aa9  prefixlen 64  scopeid 0x20<link>
        ether 2c:60:0c:a9:6a:a9  txqueuelen 1000  (Ethernet)
        RX packets 2114  bytes 429374 (419.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 568  bytes 101618 (99.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 80:3f:5d:9f:73:5d  txqueuelen 1000  (Ethernet)
        RX packets 864  bytes 49761 (48.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 45  bytes 2964 (2.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 45  bytes 2964 (2.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lxcbr0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 80:3f:5d:9f:73:5d  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 140 (140.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether d8:fc:93:c6:2f:ff  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

localhost ~ # brctl show
bridge name     bridge id               STP enabled     interfaces
lxcbr0          8000.803f5d9f735d       no              eth1


OK, so interestingly it shows up when I pass -a. Before this regression it showed up with plain ifconfig, though.
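
The difference between the two listings above comes down to the interface flags: plain `ifconfig` lists only interfaces that are up, and lxcbr0 reports flags=4098<BROADCAST,MULTICAST>, with no UP. A minimal Python sketch of that check, assuming the net-tools output format shown in this comment (the helper name `down_interfaces` is made up for illustration):

```python
import re

# Matches the header line of each interface stanza in net-tools ifconfig
# output, e.g. "lxcbr0: flags=4098<BROADCAST,MULTICAST>  mtu 1500".
_HEADER = re.compile(r"^(?P<name>\S+): flags=\d+<(?P<flags>[^>]*)>", re.MULTILINE)


def down_interfaces(ifconfig_a_output):
    """Return the names of interfaces whose flag list lacks UP."""
    down = []
    for match in _HEADER.finditer(ifconfig_a_output):
        flags = match.group("flags").split(",")
        if "UP" not in flags:
            down.append(match.group("name"))
    return down


# Abbreviated copy of the `ifconfig -a` output from this comment.
SAMPLE = """\
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 80:3f:5d:9f:73:5d  txqueuelen 1000  (Ethernet)

lxcbr0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 80:3f:5d:9f:73:5d  txqueuelen 0  (Ethernet)

wlan0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether d8:fc:93:c6:2f:ff  txqueuelen 1000  (Ethernet)
"""

print(down_interfaces(SAMPLE))  # ['lxcbr0', 'wlan0']
```

So the bridge exists (which is why re-creating it fails) but is administratively down.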

My shill knowledge is minimal, so Garret, given the list of AOSP CLs here: https://chromium-review.googlesource.com/#/c/341463/ is there something else that stands out? We know there is a bad CL in this uprev, just not sure which one.
And here is the init script that is supposed to set everything up:

https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf

Essentially it creates a network bridge, launches dhcpd, then attaches the USB-Ethernet dongle to the bridge, with a couple of shill restarts along the way that blacklist the bridge and wireless.

Comment 11 by gdk@chromium.org, May 6 2016

Nothing in that script brings the bridge interface up, which seems like a bug to me.  Can you add an `ifconfig ${DHCPD_IFACE} up` at line 53 of that script and retry?  I'm ssh'd into the machine, but I don't know if anyone else is using it right now.

Comment 12 by gdk@chromium.org, May 6 2016

Or add "up" to the end of line 56, which is cleaner. :P

Comment 13 by gdk@chromium.org, May 6 2016

Cc: pleventis@chromium.org
Owner: sbasi@chromium.org
There's a race between the moblab-network-bridge script and shill.  Judging by the other 5s sleep in that script, it's not the first race this script has had to deal with.

What follows is a patch that papers over the issue, and is generally a bad approach.  Except for bringing up the bridge interface, that's a good idea. :P

diff --git a/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf b/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
index 347c3e7..3551ba2 100644
--- a/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
+++ b/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
@@ -46,6 +46,9 @@ script
   logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted"
   restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}
 
+  # Wait for shill to be on its feet before creating the bridge.
+  sleep 5
+
   # Bring up the network bridge and set forward delay to 0.
   logger -t "${UPSTART_JOB}" "Bringing up network bridge ${DHCPD_IFACE}"
   brctl addbr ${DHCPD_IFACE}
@@ -53,7 +56,7 @@ script
 
   # Configure server IP address with ${SERVER_ADDRESS}.
   logger -t "${UPSTART_JOB}" "setting server IP address to ${SERVER_ADDRESS}"
-  ifconfig ${DHCPD_IFACE} ${SERVER_ADDRESS} netmask ${SERVER_NETMASK}
+  ifconfig ${DHCPD_IFACE} ${SERVER_ADDRESS} netmask ${SERVER_NETMASK} up
 
   # Start the dhcpd server on MobLab. It needs the DHCPD_IFACE piped in because
   # on stumpy_moblab this value is not static. See moblab-network-init for more

+pstew

Is there a better way to wait for shill to be on its feet than the sleep?

Comment 15 by gdk@chromium.org, May 6 2016

You can wait for it to export (over D-Bus) the device you're waiting to see, you can wait for it to claim its D-Bus name, etc.

It's going to take some digging to figure out the exact nature of this race; my diff was just to prove that there is one.
I agree that D-Bus is the best way.  You can use dbus-send, talk to D-Bus directly from Python, or use the "list_devices" script from shill-testing to query shill's view of the device list.
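
As a rough illustration of the "wait for it to claim its D-Bus name" option, here is a minimal Python sketch that polls the bus daemon instead of sleeping for a fixed time. It assumes shill's well-known name is org.chromium.flimflam and that `dbus-send` is available on the device; the helper names `shill_owns_bus_name` and `wait_until` are made up for this example, and this is not what the landed fix does.

```python
import subprocess
import time

# Assumption: shill claims this well-known name on the system bus.
SHILL_BUS_NAME = "org.chromium.flimflam"


def shill_owns_bus_name(run=subprocess.run):
    """Ask the bus daemon whether SHILL_BUS_NAME currently has an owner."""
    result = run(
        ["dbus-send", "--system", "--print-reply",
         "--dest=org.freedesktop.DBus", "/org/freedesktop/DBus",
         "org.freedesktop.DBus.NameHasOwner", "string:" + SHILL_BUS_NAME],
        capture_output=True, text=True)
    # dbus-send prints the reply value as "boolean true" or "boolean false".
    return result.returncode == 0 and "boolean true" in result.stdout


def wait_until(predicate, timeout=10.0, interval=0.25,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would use `wait_until(shill_owns_bus_name)` where the script currently sleeps and bail out if it returns False. Owning the bus name only shows that shill has started; it does not prove shill has finished enumerating devices, so waiting for the specific device to be exported is the stricter check.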

Comment 17 by bugdroid1@chromium.org, May 7 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/17c0b8fe2abbaf242cf5251c023ea1bcfa006872

commit 17c0b8fe2abbaf242cf5251c023ea1bcfa006872
Author: Simran Basi <sbasi@google.com>
Date: Fri May 06 22:47:14 2016

moblab: Add short sleep after shill restart.

There appears to be a race between shill restarting and bringing up the
network bridge. A short sleep alleviates this problem.

Also brings up the network bridge via ifconfig.

BUG= chromium:609852 
TEST=local moblab setup.

Change-Id: Ia18516a274fa6902b3259e4666f2e4d6172282f9
Reviewed-on: https://chromium-review.googlesource.com/343230
Commit-Ready: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/17c0b8fe2abbaf242cf5251c023ea1bcfa006872/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
[rename] https://crrev.com/17c0b8fe2abbaf242cf5251c023ea1bcfa006872/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r30.ebuild

Now that sbasi's change has landed, should we try the shill uprev again?
https://chromium-review.googlesource.com/#/c/341463/

Eric or Kirtika, can you please do this since Luigi is out?

I'll take care of it.

I put up the reland (CL:343547), but it still seems to be having issues with moblab devices.

Comment 21 by bugdroid1@chromium.org, May 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/7acf6828f50ae8faacc1817b0a112a1939562bf4

commit 7acf6828f50ae8faacc1817b0a112a1939562bf4
Author: Simran Basi <sbasi@google.com>
Date: Tue May 10 22:10:18 2016

moblab: Stop & start shill around lxcbr0 initialization

Instead of restarting shill multiple times when initializing
the moblab network bridge, simply stop it prior to the setup
and start it afterwards.

BUG= chromium:609852 
TEST=trybot run and local moblab test.

Change-Id: Ia3cc793c0ffdc41bae7abf7098ea63b69d0fe11a
Reviewed-on: https://chromium-review.googlesource.com/344030
Commit-Ready: Eric Caruso <ejcaruso@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>
Reviewed-by: Garret Kelly <gdk@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[rename] https://crrev.com/7acf6828f50ae8faacc1817b0a112a1939562bf4/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r31.ebuild
[modify] https://crrev.com/7acf6828f50ae8faacc1817b0a112a1939562bf4/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
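
The ordering this commit message describes can be sketched as below. This is not the contents of the landed script: the commands are pieced together from the earlier diff in this thread, the address, netmask, and blacklist values are hypothetical placeholders, and the dhcpd and dongle-attachment steps are left out. The point is only that shill is stopped for the whole setup, so nothing races with the bridge creation.

```python
import subprocess


def bridge_setup_commands(bridge, address, netmask, blacklisted):
    """Return the commands, in order, for a stop/setup/start sequence."""
    return [
        ["stop", "shill"],                # upstart: stop shill before setup
        ["brctl", "addbr", bridge],       # create the bridge
        ["brctl", "setfd", bridge, "0"],  # set forward delay to 0
        ["ifconfig", bridge, address, "netmask", netmask, "up"],
        ["start", "shill", "BLACKLISTED_DEVICES=" + ",".join(blacklisted)],
    ]


def run_all(commands, run=subprocess.check_call):
    """Run each command in order; check_call raises on the first failure."""
    for command in commands:
        run(command)


# Hypothetical values, for illustration only.
COMMANDS = bridge_setup_commands(
    "lxcbr0", "192.0.2.1", "255.255.255.0", ["lxcbr0", "wlan0"])
```

Passing a different `run` callable (for example `print`) gives a dry run of the sequence without touching the network.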

Comment 22 by sbasi@chromium.org, May 19 2016

Status: Fixed (was: Available)
Status: Verified (was: Fixed)
Bulk verify old 'fixed' bugs.
