MobLab Failures in the CQ: dhcpd is not running. Crashing on shill restart |
|||||||||
Issue description
03/01 07:23:48.579 DEBUG| ssh_host:0153| Running (ssh) 'pgrep dhcpd'
03/01 07:23:48.747 WARNI| test:0543| Autotest caught exception when running test:
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/test.py", line 537, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 747, in _call_test_function
raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled AutoservError: Moblab process: dhcpd is not running.
Traceback (most recent call last):
File "/usr/local/autotest/client/common_lib/test.py", line 741, in _call_test_function
return func(*args, **dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 398, in execute
dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 284, in _call_run_once_with_retry
postprocess_profiled_run, args, dargs)
File "/usr/local/autotest/client/common_lib/test.py", line 313, in _call_run_once
self.run_once(*args, **dargs)
File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 136, in run_once
force_full_update=force)
File "/usr/local/autotest/server/afe_utils.py", line 192, in machine_install_and_update_labels
image_name, host_attributes = host.machine_install(*args, **dargs)
File "/usr/local/autotest/server/hosts/cros_host.py", line 815, in machine_install
self.verify_software()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 211, in verify_software
self._verify_moblab_services()
File "/usr/local/autotest/server/hosts/moblab_host.py", line 229, in _verify_moblab_services
% process)
AutoservError: Moblab process: dhcpd is not running.
This has occured several times, I suspect a regression has been introduced into MobLab recently:
Take a look at /var/log/messages collected after the failure:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/55535842-chromeos-test/chromeos2-row5-rack10-host11/crashinfo.chromeos2-row5-rack10-host11/?debugUI=DEVELOPERS
2016-03-01T15:22:07.343159+00:00 NOTICE moblab-network-bridge-init[707]: restarting shill with lxcbr0,wlan0 blacklisted
2016-03-01T15:22:07.347484+00:00 WARNING kernel: [ 7.361129] init: moblab-network-bridge-init main process (704) terminated with status 1
From the init script:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted"
restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES}
# Bring up the network bridge and set forward delay to 0.
logger -t "${UPSTART_JOB}" "Bringing up network bridge ${DHCPD_IFACE}"
Paul, can the shill team provide any insight here? This is causing CQ flake.
,
Mar 1 2016
Adding the build in case we think this might be a bad CL. https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab%20paladin/builds/1836
,
Mar 1 2016
Pstew where did you get those log messages from? If you look at the init script, its not completing what we need done for MobLab... moblab-network-bridge-init restarts shill blacklisting a network bridge and the wireless device. This restart is what is causing moblab-network-bridge-init to terminate... Then it calls `start moblab-dhcpd-init` And if you look at the logs there is no mention of "moblab-dhcpd-init" and we should see logging as per https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git/+/master/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-dhcpd-init.conf This code path: BLACKLISTED_DEVICES=${DHCPD_IFACE},wlan0 logger -t "${UPSTART_JOB}" "restarting shill with ${BLACKLISTED_DEVICES} blacklisted" restart shill BLACKLISTED_DEVICES=${BLACKLISTED_DEVICES} Should not crash the init script IMO. Do we need to do the restart differently? Or ignore the exit status? If its just a timeout, did the devices getting blacklisted as we wanted?
,
Mar 1 2016
I got the log messages out of var/log/net.log from the pantheon link you provided above.
,
Mar 1 2016
Can you tell me if the blacklist took effect? Cause if the answer is yes then I just need to ensure moblab-network-init doesn't fail if the restart shill line exits 1.
,
Mar 1 2016
https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab%20paladin/builds/1838 Again caused the CQ to fail.
,
Mar 1 2016
,
Mar 1 2016
The server did not pick up any request on 2016-03-01T15:23:40.797023+00:00. Last dhcp request was on 12:28PM PST dhcp server log here https://drive.google.com/open?id=0B4twyQSOwbhENVhPY09rYmxhamc host chromeos2-row5-rack10-host11 { hardware ethernet 2c:60:0c:a9:6a:a9; fixed-address 172.18.186.227; ddns-hostname "chromeos2-row5-rack10-host11"; option host-name "chromeos2-row5-rack10-host11"; } Mar 1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
,
Mar 1 2016
I meant to include all of today's request from that DUT.... Mar 1 07:16:39 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:16:39 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:22:15 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:25:58 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:25:58 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:27:04 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 07:27:04 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:08:52 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:08:52 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 10:14:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:18:16 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPDISCOVER from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPOFFER on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 (172.17.40.23) from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:24:50 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:27:37 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPREQUEST for 172.18.186.227 from 2c:60:0c:a9:6a:a9 via 172.18.187.254 Mar 1 12:28:31 chromeos2-dhcp1 dhcpd: DHCPACK on 172.18.186.227 to 2c:60:0c:a9:6a:a9 via 172.18.187.254
,
Mar 1 2016
Pstew and I met we have a few theories: 1) A problem with upstart when restarting shill. Looking at var log messages restart is called: 2016-03-01T15:22:07.343159+00:00 moblab-network-init exited 1: 2016-03-01T15:22:07.347484+00:00 Shill actually started: 2016-03-01T15:22:07.387672+00:00 Thats after the original script ended. Also looking at the shill logs wlan0 was not blacklisted meaning that variable did not stick. 2) Race condition with recover_duts. Pstew pointed me out to this script. Our theory is the restart called stop, before it hit start, recover_duts launched shill, the start for moblab-network-init failed and the blacklist did not stick. Being a race condition that would explain the flakiness of this failure (but not why it recently manifested). Maybe we should not install recover_duts on MobLab?
,
Mar 1 2016
Note that you can temporarily inhibit recover_duts by taking a lock (read the script again), which would probably be the best of all worlds. Even if recover_duts isn't the root cause of this problem, you probably should do so anyway.
,
Mar 7 2016
,
Mar 10 2016
Happened on guado_moblab-paladin today : https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/1931
,
Apr 7 2016
Happend on guado_moblab-paladin again: https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2206
,
Apr 22 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 26 2016
,
May 6 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 13 2016
Happened on guado_moblab-paladin again: https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/2609
,
May 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/6883d7507791b62b95545d4b7487f9aee5406737 commit 6883d7507791b62b95545d4b7487f9aee5406737 Author: Brian Norris <briannorris@chromium.org> Date: Thu Apr 07 18:41:25 2016 moblab: grab shill-start lock when restarting shill As of chromium:530791, recover_duts supports a lock to allow tests to stop shill intentionally. In the moblab case, recover_duts may be racing with us restarting shill, so let's grab the lock before restarting shill. BUG=chromium:591097 TEST=trybots Change-Id: Iea22c544c63001daf2ac69bd6c62480c08f9df11 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/337703 Commit-Ready: Simran Basi <sbasi@chromium.org> Tested-by: Simran Basi <sbasi@chromium.org> [rename] https://crrev.com/6883d7507791b62b95545d4b7487f9aee5406737/project-moblab/chromeos-base/chromeos-bsp-moblab/chromeos-bsp-moblab-0.0.5-r33.ebuild [modify] https://crrev.com/6883d7507791b62b95545d4b7487f9aee5406737/project-moblab/chromeos-base/chromeos-bsp-moblab/files/moblab-network-bridge-init.conf
,
May 18 2016
Made the requested changes for grabbing the lock. Don't know if this is fixed in practice though. I guess wait to see if there are any more sheriff reports about this, and if they manage to find this bug?
,
May 18 2016
I'll keep checking failures here regularly to see if this reoccurs: https://chromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin
,
Sep 27 2016
,
Oct 26 2016
Probably happened again: https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/4103
,
Jul 17 2017
ChromeOS Infra P1 Bugscrub. P1 Bugs in this component should be important enough to get weekly status updates. Is this already fixed? -> Fixed Is this no longer relevant? -> Archived or WontFix Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority. Is this a Feature Request rather than a bug? Type -> Feature Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign. Does this bug have the wrong owner? -> reassign. Bugs that remain in this state next week will be downgraded to P2.
,
Jul 24 2017
ChromeOS Infra P1 Bugscrub. Issue untouched in a week after previous message. Downgrading to P2.
,
Jun 8 2018
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by pstew@chromium.org
, Mar 1 2016