
Issue 591097


Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




MobLab Failures in the CQ: dhcpd is not running. Crashing on shill restart

Project Member Reported by sbasi@chromium.org, Mar 1 2016

Issue description

03/01 07:23:48.579 DEBUG|          ssh_host:0153| Running (ssh) 'pgrep dhcpd'
03/01 07:23:48.747 WARNI|              test:0543| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 537, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 747, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled AutoservError: Moblab process: dhcpd is not running.
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 741, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 398, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 284, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 313, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 136, in run_once
    force_full_update=force)
  File "/usr/local/autotest/server/afe_utils.py", line 192, in machine_install_and_update_labels
    image_name, host_attributes = host.machine_install(*args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 815, in machine_install
    self.verify_software()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 211, in verify_software
    self._verify_moblab_services()
  File "/usr/local/autotest/server/hosts/moblab_host.py", line 229, in _verify_moblab_services
    % process)
AutoservError: Moblab process: dhcpd is not running.


This has occurred several times; I suspect a regression has recently been introduced into MobLab:

Take a look at /var/log/messages collected after the failure:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/55535842-chromeos-test/chromeos2-row5-rack10-host11/crashinfo.chromeos2-row5-rack10-host11/?debugUI=DEVELOPERS

2016-03-01T15:22:07.343159+00:00 NOTICE moblab-network-bridge-init[707]: restarting shill with lxcbr0,wlan0 blacklisted
2016-03-01T15:22:07.347484+00:00 WARNING kernel: [    7.361129] init: moblab-network-bridge-init main process (704) terminated with status 1

From the init script:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf

  logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted"
  restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}
  # Bring up the network bridge and set forward delay to 0.
  logger -t "${UPSTART_JOB}" "Bringing up network bridge ${DHCPD_IFACE}"


Paul, can the shill team provide any insight here? This is causing CQ flake.
 

Comment 1 by pstew@chromium.org, Mar 1 2016

2016-03-01T15:23:10.797162+00:00 NOTICE dhcpcd[15089]: dhcpcd is running with reduced privileges
2016-03-01T15:23:10.797827+00:00 INFO dhcpcd[15089]: status changed to Init
2016-03-01T15:23:10.798022+00:00 INFO shill[841]: [INFO:dhcp_config.cc(158)] Init DHCP Proxy: eth1 at :1.32
2016-03-01T15:23:10.816727+00:00 INFO dhcpcd[15089]: status changed to Discover
2016-03-01T15:23:10.816917+00:00 INFO dhcpcd[15089]: eth1: soliciting a DHCP lease
2016-03-01T15:23:10.817052+00:00 INFO dhcpcd[15089]: eth1: sending DISCOVER (xid 0x6d16888e), next in 4.4 seconds
2016-03-01T15:23:15.252605+00:00 INFO dhcpcd[15089]: eth1: sending DISCOVER (xid 0x6d16888e), next in 7.6 seconds
2016-03-01T15:23:22.820634+00:00 INFO dhcpcd[15089]: eth1: sending DISCOVER (xid 0x6d16888e), next in 15.5 seconds
2016-03-01T15:23:38.337426+00:00 INFO dhcpcd[15089]: eth1: sending DISCOVER (xid 0x6d16888e), next in 32.6 seconds
2016-03-01T15:23:40.796618+00:00 ERR shill[841]: [ERROR:dhcp_config.cc(297)] Timed out waiting for DHCP lease on eth1 (after 30 seconds).
2016-03-01T15:23:40.797023+00:00 INFO shill[841]: [INFO:service.cc(343)] Disconnecting from service 2: OnIPConfigFailure
2016-03-01T15:23:40.804638+00:00 INFO dhcpcd[15089]: status changed to Release
2016-03-01T15:23:40.805108+00:00 INFO shill[841]: [INFO:dhcp_config.cc(223)] Stopping 15089 (ReleaseIP)
2016-03-01T15:23:40.805189+00:00 INFO dhcpcd[15089]: received SIGTERM, stopping
2016-03-01T15:23:40.805205+00:00 INFO dhcpcd[15089]: eth1: removing interface
2016-03-01T15:23:40.805215+00:00 INFO dhcpcd[15089]: status changed to Release
2016-03-01T15:23:40.805408+00:00 INFO dhcpcd[15089]: dhcpcd exited

It's not a crash.  Just a timeout.  We do retry shortly after:

2016-03-01T15:23:40.817160+00:00 NOTICE dhcpcd[15611]: dhcpcd is running with reduced privileges
2016-03-01T15:23:40.818553+00:00 INFO dhcpcd[15611]: status changed to Init
2016-03-01T15:23:40.818724+00:00 INFO shill[841]: [INFO:dhcp_config.cc(158)] Init DHCP Proxy: eth1 at :1.33
2016-03-01T15:23:41.826554+00:00 INFO dhcpcd[15611]: status changed to Discover
2016-03-01T15:23:41.826815+00:00 INFO dhcpcd[15611]: eth1: soliciting a DHCP lease
2016-03-01T15:23:41.826831+00:00 INFO dhcpcd[15611]: eth1: sending DISCOVER (xid 0x502d8273), next in 4.0 seconds
2016-03-01T15:23:45.814739+00:00 INFO dhcpcd[15611]: eth1: sending DISCOVER (xid 0x502d8273), next in 8.6 seconds

and so on. It's not a bug in shill that, when it is unable to retrieve a lease, you can't find a DHCP client running on an interface at every instant in time.
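In other words, a single point-in-time `pgrep dhcpd` (as in the log at the top of the issue) can race with the client's retry cycle. A minimal sketch of a more tolerant check — this is an illustration, not the actual autotest verification code; the function name and parameters are made up here:

```shell
#!/bin/sh
# Sketch (assumption, not autotest's implementation): poll for the
# process a few times before declaring it dead, since a DHCP client may
# legitimately be absent for an instant between retries.
wait_for_process() {
  # $1: process name, $2: number of attempts, $3: seconds between attempts
  i=0
  while [ "$i" -lt "$2" ]; do
    if pgrep "$1" >/dev/null 2>&1; then
      return 0          # found it; process is (currently) running
    fi
    i=$((i + 1))
    sleep "$3"
  done
  return 1              # never observed within the polling window
}
```

A caller would use something like `wait_for_process dhcpd 5 2` instead of a one-shot `pgrep dhcpd`.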

Comment 2 by sbasi@chromium.org, Mar 1 2016

Adding the build in case we think this might be a bad CL.

https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab%20paladin/builds/1836

Comment 3 by sbasi@chromium.org, Mar 1 2016

Pstew, where did you get those log messages?

If you look at the init script, it's not completing what we need done for MobLab...

moblab-network-bridge-init restarts shill blacklisting a network bridge and the wireless device.

This restart is what is causing moblab-network-bridge-init to terminate...

Then it calls `start moblab-dhcpd-init`

And if you look at the logs, there is no mention of "moblab-dhcpd-init", even though we should see logging per https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-dhcpd-init.conf


This code path:
  BLACKLISTED_DEVICES=${DHCPD_IFACE},wlan0
  logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted"
  restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}

Should not crash the init script IMO.

Do we need to do the restart differently? Or ignore the exit status?

If it's just a timeout, did the devices get blacklisted as we wanted?
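One option raised above — ignoring the exit status — could look like the sketch below. This is a hypothetical rewrite of the `restart shill` line, not the shipped fix; the `|| logger` fallback and the wrapper function are assumptions for illustration:

```shell
#!/bin/sh
# Sketch (assumption): tolerate a nonzero exit from `restart shill` so
# the rest of moblab-network-bridge-init still runs (bridge setup,
# `start moblab-dhcpd-init`, etc.).
restart_shill_tolerant() {
  # $1: comma-separated BLACKLISTED_DEVICES value, e.g. "lxcbr0,wlan0"
  if ! restart shill BLACKLISTED_DEVICES="$1"; then
    # Log and keep going instead of letting upstart kill the job.
    logger -t "${UPSTART_JOB:-moblab-network-bridge-init}" \
      "restart shill exited nonzero; continuing"
  fi
  return 0
}
```

The trade-off: this masks real restart failures, so it only makes sense if the blacklist is known to take effect anyway — which is exactly the open question in this comment.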

Comment 4 by pstew@chromium.org, Mar 1 2016

I got the log messages out of var/log/net.log from the pantheon link you provided above.

Comment 5 by sbasi@chromium.org, Mar 1 2016

Can you tell me if the blacklist took effect?

Because if the answer is yes, then I just need to ensure moblab-network-bridge-init doesn't fail when the `restart shill` line exits 1.

Comment 6 by sbasi@chromium.org, Mar 1 2016

Labels: -Pri-1 Pri-0
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab%20paladin/builds/1838

Again caused the CQ to fail.
Cc: englab-sys-cros@google.com
The server did not pick up any request at 2016-03-01T15:23:40.797023+00:00. The last DHCP request was at 12:28 PM PST.

dhcp server log here
https://drive.google.com/open?id=0B4twyQSOwbhENVhPY09rYmxhamc

host chromeos2-row5-rack10-host11 {
hardware ethernet 2c:60:0c:a9:6a:a9;
fixed-address 172.18.186.227;
ddns-hostname "chromeos2-row5-rack10-host11";
option host-name "chromeos2-row5-rack10-host11";
}

Mar  1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254

I meant to include all of today's requests from that DUT...

Mar  1 07:16:39 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:16:39 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:25:58 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:25:58 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:27:04 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 07:27:04 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:08:52 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:08:52 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254
Mar  1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254

Cc: jrbarnette@chromium.org bfreed@chromium.org sosa@chromium.org
Pstew and I met; we have a few theories:

1) A problem with upstart when restarting shill.

Looking at /var/log/messages, restart is called at:
2016-03-01T15:22:07.343159+00:00

moblab-network-bridge-init exited 1 at:
2016-03-01T15:22:07.347484+00:00

Shill actually started at:
2016-03-01T15:22:07.387672+00:00

That's after the original script ended. Also, looking at the shill logs, wlan0 was not blacklisted, meaning that variable did not stick.

2) Race condition with recover_duts. Pstew pointed me to this script. Our theory: the `restart` called stop; before it hit start, recover_duts launched shill; the start issued by moblab-network-bridge-init then failed, and the blacklist did not stick.

A race condition would explain the flakiness of this failure (though not why it only recently manifested).

Maybe we should not install recover_duts on MobLab?
Note that you can temporarily inhibit recover_duts by taking a lock (read the script again), which would probably be the best of all worlds.  Even if recover_duts isn't the root cause of this problem, you probably should do so anyway.
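The lock idea (which is what the fix in comment 19 ultimately does — "grab shill-start lock when restarting shill") could be sketched like this. The lock path and `flock` usage here are placeholders for illustration; the real lock file and the mechanism recover_duts actually checks are defined by recover_duts itself, not by this sketch:

```shell
#!/bin/sh
# Sketch (assumption): hold a lock file while restarting shill so that
# recover_duts -- assumed to honor the same lock -- cannot race us
# between the stop and start halves of `restart`.
LOCK_PATH="${LOCK_PATH:-/run/lock/shill-start.lock}"   # placeholder path

restart_shill_locked() {
  # $1: comma-separated BLACKLISTED_DEVICES value
  (
    flock 9 || exit 1                                  # serialize with recover_duts
    restart shill BLACKLISTED_DEVICES="$1"
  ) 9>"${LOCK_PATH}"                                   # lock released when fd 9 closes
}
```

Because the lock is tied to file descriptor 9, it is released automatically when the subshell exits, even if the restart fails partway through.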

Owner: sbasi@chromium.org
Status: Assigned (was: Untriaged)

Comment 15 by sheriffbot@chromium.org, Apr 22 2016

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Components: Infra>Client>ChromeOS
Labels: -Infra-ChromeOS

Comment 17 by sheriffbot@chromium.org, May 6 2016

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 19 by bugdroid1@chromium.org, May 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/6883d7507791b62b95545d4b7487f9aee5406737

commit 6883d7507791b62b95545d4b7487f9aee5406737
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Apr 07 18:41:25 2016

moblab: grab shill-start lock when restarting shill

As of chromium:530791, recover_duts supports a lock to allow tests to
stop shill intentionally. In the moblab case, recover_duts may be racing
with us restarting shill, so let's grab the lock before restarting
shill.

BUG=chromium:591097
TEST=trybots

Change-Id: Iea22c544c63001daf2ac69bd6c62480c08f9df11
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/337703
Commit-Ready: Simran Basi <sbasi@chromium.org>
Tested-by: Simran Basi <sbasi@chromium.org>

[rename] https://crrev.com/6883d7507791b62b95545d4b7487f9aee5406737/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r33.ebuild
[modify] https://crrev.com/6883d7507791b62b95545d4b7487f9aee5406737/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf

Cc: briannorris@chromium.org
Labels: -Pri-0 Pri-1
Status: Started (was: Assigned)
Made the requested changes to grab the lock. I don't know if this is fixed in practice, though; I guess we wait and see whether there are any more sheriff reports about this, and whether they manage to find this bug.

Comment 21 by sbasi@chromium.org, May 18 2016

I'll keep checking failures here regularly to see if this reoccurs:
https://chromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin
Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
Labels: -akeshet-pending-downgrade Pri-2
ChromeOS Infra P1 Bugscrub.

Issue untouched in a week after previous message. Downgrading to P2.
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you.
