Issue 691829

Issue metadata

Status: Archived
Closed: Mar 2018
OS: Chrome
Pri: 3
Type: Feature


heavy-duty repair could be a simple power cycle

Reported by semenzato@chromium.org (Project Member), Feb 14 2017

Issue description

chromeos2-row4-rack7-host8 failed this build:

https://uberchromegw.corp.google.com/i/chromeos/builders/daisy-release/builds/850

because the DUT could not be reached:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101019567-chromeos-test/chromeos2-row4-rack7-host8/debug/

autoserv.DEBUG:

02/13 05:15:59.005 DEBUG|          ssh_host:0285| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/13 05:17:02.451 ERROR|        base_utils:0280| [stderr] ssh: connect to host chromeos2-row4-rack7-host8 port 22: Connection timed out

I ran dut-status -f -d 12 chromeos2-row4-rack7-host8; here's the relevant snippet:

    2017-02-13 05:34:29  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101019412-chromeos-test/
    2017-02-13 05:34:02  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack7-host8/1785896-reset/
    2017-02-13 05:25:16  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack7-host8/1785846-repair/
    2017-02-13 05:22:44  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack7-host8/1785831-reset/
    2017-02-13 05:14:41  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101019567-chromeos-test/
    2017-02-13 05:14:16  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack7-host8/1785790-reset/
    2017-02-13 05:13:17  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101019322-chromeos-test/
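For reference, each line of that output is just a timestamp, a pass/fail flag, and a logs URL. A minimal parsing sketch, assuming the fixed-width layout shown above (the regex is my guess at the format, not taken from the dut-status source):

    import re
    from datetime import datetime

    # Matches lines like:
    #   2017-02-13 05:34:02  OK http://cautotest/tko/retrieve_logs.cgi?job=...
    LINE_RE = re.compile(
        r'^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(OK|--)\s+(\S+)\s*$')

    def parse_dut_status(text):
        """Yield (timestamp, passed, logs_url) tuples from dut-status output."""
        for line in text.splitlines():
            m = LINE_RE.match(line)
            if m:
                when = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
                yield when, m.group(2) == 'OK', m.group(3)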

Looks like the repair at 0525 worked.  That repair doesn't have logs (maybe it attempted log collection before the repair but not after it?).  However, the following test has the syslog, and here's the interesting part:

2017-02-13T13:14:34.295004+00:00 NOTICE ag[5473]: autotest server[stack::verify|verify_software|check_diskspace] -> ssh_run(df -PB 1000000 /mnt/stateful_partition/encrypted | tail -1)
2017-02-13T13:14:34.543412+00:00 NOTICE ag[5478]: autotest server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)
2017-02-13T13:14:34.771939+00:00 NOTICE ag[5482]: autotest server[stack::verify|verify_software|_is_factory_image] -> ssh_run([ -f /root/.factory_test ])
2017-02-13T13:14:35.003909+00:00 NOTICE ag[5484]: autotest server[stack::_verify_host|verify|verify_software] -> ssh_run(update_engine_client --status)
2017-02-13T13:14:35.259514+00:00 NOTICE ag[5488]: autotest server[stack::verify_software|verify_cros_version_label|get_release_version] -> ssh_run(cat "/etc/lsb-release")
2017-02-13T13:14:35.305119+00:00 INFO sshd[4994]: Received disconnect from 172.25.90.3 port 49557:11: disconnected by user
2017-02-13T13:14:35.305199+00:00 INFO sshd[4994]: Disconnected from 172.25.90.3 port 49557

Looks like we have a disconnect at 0514 (the syslog timestamps are UTC, so 13:14 there is 05:14 server time), followed by fruitless attempts at reconnecting:

2017-02-13T13:18:20.704011+00:00 NOTICE check_ethernet.hook[6382]: Attempting recovery method "ensure_connection_manager_is_running"
2017-02-13T13:18:20.724779+00:00 NOTICE check_ethernet.hook[6386]: Attempting recovery method "toggle_ethernet_interfaces"
2017-02-13T13:18:20.765045+00:00 INFO avahi-daemon[1562]: Interface eth0.IPv4 no longer relevant for mDNS.
2017-02-13T13:18:20.765071+00:00 INFO avahi-daemon[1562]: Leaving mDNS multicast group on interface eth0.IPv4 with address 100.115.227.5.
2017-02-13T13:18:20.767366+00:00 ERR dhcpcd[1413]: eth0: dhcp if_readrawpacket: Network is down
2017-02-13T13:18:20.767710+00:00 INFO avahi-daemon[1562]: Withdrawing address record for 100.115.227.5 on eth0.
2017-02-13T13:18:20.770697+00:00 INFO avahi-daemon[1562]: Joining mDNS multicast group on interface eth0.IPv4 with address 100.115.227.5.
2017-02-13T13:18:20.771158+00:00 INFO avahi-daemon[1562]: New relevant interface eth0.IPv4 for mDNS.
2017-02-13T13:18:20.771183+00:00 INFO avahi-daemon[1562]: Registering new address record for 100.115.227.5 on eth0.IPv4.
2017-02-13T13:18:20.828659+00:00 INFO kernel: [  375.903408] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
2017-02-13T13:18:20.829863+00:00 INFO avahi-daemon[1562]: Withdrawing address record for 100.115.227.5 on eth0.
2017-02-13T13:18:20.829947+00:00 INFO avahi-daemon[1562]: Leaving mDNS multicast group on interface eth0.IPv4 with address 100.115.227.5.
2017-02-13T13:18:20.830018+00:00 INFO avahi-daemon[1562]: Interface eth0.IPv4 no longer relevant for mDNS.
2017-02-13T13:18:21.009922+00:00 INFO avahi-daemon[1562]: Server startup complete. Host name is bc7aadd91b37ebe1d206290e58a1b21c.local. Local service cookie is 914252924.
2017-02-13T13:18:21.009958+00:00 INFO avahi-daemon[1562]: Changing host name to 'bc7aadd91b37ebe1d206290e58a1b21c'.
2017-02-13T13:18:21.021098+00:00 NOTICE cros-machine-id-regen[6430]: Regenerated /var/lib/dbus/machine-id (reason: network).
2017-02-13T13:18:50.906147+00:00 ERR shill[1234]: [ERROR:dhcp_config.cc(295)] Timed out waiting for DHCP lease on eth0 (after 30 seconds).
2017-02-13T13:19:08.964021+00:00 NOTICE check_ethernet.hook[6558]: Attempting recovery method "reload_usb_ethernet_devices"
2017-02-13T13:19:09.005520+00:00 NOTICE /usr/sbin/reload_network_device[6571]: Restarting network interface eth0
2017-02-13T13:19:09.029185+00:00 INFO kernel: [  424.099614] usbcore: deregistering interface driver cdc_ether
2017-02-13T13:19:09.029212+00:00 INFO kernel: [  424.099705] cdc_ether 2-1:2.0 eth0: unregister 'cdc_ether' usb-xhci-hcd.2.auto-1, CDC Ethernet Device
2017-02-13T13:19:09.029739+00:00 ERR dhcpcd[6520]: eth0: dhcp if_readrawpacket: Network is down
2017-02-13T13:19:09.035673+00:00 ERR shill[1234]: [ERROR:object_proxy.cc(582)] Failed to call method: org.chromium.dhcpcd.Release: object_path= /org/chromium/dhcpcd: org.freedesktop.DBus.Error.NoReply: Message did not receive a reply (timeout by message bus)
2017-02-13T13:19:09.035812+00:00 ERR shill[1234]: [ERROR:dbus_method_invoker.h(111)] CallMethodAndBlockWithTimeout(...): Domain=dbus, Code=org.freedesktop.DBus.Error.NoReply, Message=Message did not receive a reply (timeout by message bus)
2017-02-13T13:19:10.083599+00:00 ERR kernel: [  425.156312] usb 2-1: can't re-read device descriptor for authorization: -71
2017-02-13T13:19:10.100430+00:00 NOTICE /usr/sbin/reload_network_device[6591]: Restarting network interface eth1
2017-02-13T13:19:10.129561+00:00 INFO kernel: [  425.199984] usbcore: deregistering interface driver smsc95xx
2017-02-13T13:19:10.129570+00:00 INFO kernel: [  425.202217] smsc95xx 3-1.1:1.0 eth1: unregister 'smsc95xx' usb-12110000.usb-1.1, smsc95xx USB 2.0 Ethernet
2017-02-13T13:19:10.153244+00:00 ERR shill[1234]: [ERROR:device.cc(1274)] IP flag write failed: 1 to /proc/sys/net/ipv6/conf/eth1/disable_ipv6
2017-02-13T13:19:10.153451+00:00 ERR shill[1234]: [ERROR:device.cc(1274)] IP flag write failed: 1 to /proc/sys/net/ipv6/conf/eth1/disable_ipv6
2017-02-13T13:19:11.173587+00:00 INFO kernel: [  426.246077] usb 3-1.1: authorized to connect
2017-02-13T13:19:11.183642+00:00 INFO kernel: [  426.255862] smsc95xx v1.0.4
2017-02-13T13:19:11.233887+00:00 INFO kernel: [  426.308513] smsc95xx 3-1.1:1.0 eth0: register 'smsc95xx' at usb-12110000.usb-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:15:87:d6
2017-02-13T13:19:11.233914+00:00 INFO kernel: [  426.308679] usbcore: registered new interface driver smsc95xx
2017-02-13T13:19:11.299970+00:00 INFO kernel: [  426.372674] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
2017-02-13T13:19:12.932055+00:00 ERR mtpd[1487]: LIBMTP_Detect_Raw_Devices failed with 5
2017-02-13T13:19:50.267149+00:00 NOTICE check_ethernet.hook[6717]: Attempting recovery method "restart_connection_manager"
2017-02-13T13:19:50.444013+00:00 ERR wpa_supplicant[628]: rmdir[ctrl_interface=/var/run/wpa_supplicant]: Permission denied
2017-02-13T13:19:50.732258+00:00 ERR shill[6733]: [ERROR:crypto_des_cbc.cc(103)] Unable to load key matter from /var/lib/whitelist/owner.key
2017-02-13T13:19:50.752376+00:00 ERR shill[6733]: [ERROR:crypto_des_cbc.cc(103)] Unable to load key matter from /var/lib/whitelist/owner.key
2017-02-13T13:19:50.754293+00:00 ERR shill[6733]: [ERROR:crypto_des_cbc.cc(103)] Unable to load key matter from /var/lib/whitelist/owner.key
2017-02-13T13:19:50.824575+00:00 INFO kernel: [  465.897506] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
2017-02-13T13:19:50.878679+00:00 INFO kernel: [  465.948983] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
2017-02-13T13:19:50.984396+00:00 INFO avahi-daemon[1562]: Server startup complete. Host name is 762769eaa3ea26b1f6910abf58a1b276.local. Local service cookie is 914252924.
2017-02-13T13:19:50.984430+00:00 INFO avahi-daemon[1562]: Changing host name to '762769eaa3ea26b1f6910abf58a1b276'.
2017-02-13T13:19:50.993048+00:00 NOTICE cros-machine-id-regen[6758]: Regenerated /var/lib/dbus/machine-id (reason: network).
2017-02-13T13:20:29.740852+00:00 NOTICE check_ethernet.hook[6868]: Restart failed; will retry recovery for ~15 minutes

etc. etc.  Finally the DUT gives up and reboots:

2017-02-13T13:33:21.468202+00:00 NOTICE check_ethernet.hook[9224]: All efforts to recover ethernet have been exhausted. Rebooting.
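The behavior visible in the log is an escalation loop: try each recovery method in order, keep retrying for roughly 15 minutes, then reboot as a last resort. The real check_ethernet.hook is a shell script on the test image; the following Python sketch only illustrates that control flow (the method names are taken from the log above, while the timings, the gateway address, and the idea that each method is a standalone command are my assumptions):

    import subprocess
    import time

    # Recovery methods in escalation order, as seen in the syslog above.
    RECOVERY_METHODS = [
        'ensure_connection_manager_is_running',
        'toggle_ethernet_interfaces',
        'reload_usb_ethernet_devices',
        'restart_connection_manager',
    ]

    RETRY_WINDOW_SECS = 15 * 60  # "will retry recovery for ~15 minutes"

    def ethernet_is_up():
        # Hypothetical reachability check against a made-up gateway address;
        # the real hook has its own notion of a healthy connection.
        return subprocess.call(
            ['ping', '-c', '1', '-W', '2', '100.115.227.254']) == 0

    def recover_ethernet():
        deadline = time.time() + RETRY_WINDOW_SECS
        while time.time() < deadline:
            for method in RECOVERY_METHODS:
                print('Attempting recovery method "%s"' % method)
                subprocess.call([method])  # assumed to be a command on PATH
                if ethernet_is_up():
                    return True
            time.sleep(30)
        # "All efforts to recover ethernet have been exhausted. Rebooting."
        subprocess.call(['reboot'])
        return False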

This seems to happen in the middle of the repair job started at 0525.  I guess there is no point in overanalyzing that job, but we see this:

02/13 05:32:43.634 DEBUG|          ssh_host:0285| Running (ssh) 'true'
02/13 05:32:43.651 INFO |      abstract_ssh:0795| Master ssh connection to chromeos2-row4-rack7-host8 is down.
02/13 05:32:43.651 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
02/13 05:32:43.651 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
02/13 05:32:43.652 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0ka0FJssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:32:43.653 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_0ka0FJssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:32:48.764 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/13 05:32:52.874 WARNI|        base_utils:0912| run process timeout (3.59999990463) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_0ka0FJssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
02/13 05:32:57.939 WARNI|        base_utils:0912| run process timeout (3.59999990463) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_0ka0FJssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
02/13 05:32:58.944 DEBUG|          ssh_host:0219| retry 2: restarting master connection
02/13 05:32:58.945 DEBUG|      abstract_ssh:0744| Restarting master ssh connection
02/13 05:32:58.945 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
02/13 05:32:59.951 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
02/13 05:32:59.952 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_e3LB4qssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:32:59.953 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_e3LB4qssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:33:30.212 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
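What the repair log shows here is autotest's master-ssh handling: a long-lived ControlMaster process multiplexes all ssh traffic, and when a probe fails the master job is killed ("nuked") and restarted. A rough sketch of that pattern, using the ssh flags from the log (the function itself is illustrative, not autotest's actual abstract_ssh API):

    import os
    import subprocess
    import tempfile
    import time

    MASTER_SSH_OPTS = [
        '-a', '-x', '-N',
        '-o', 'ControlMaster=yes',
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'UserKnownHostsFile=/dev/null',
        '-o', 'BatchMode=yes',
        '-o', 'ConnectTimeout=30',
        '-o', 'ServerAliveInterval=900',
        '-o', 'ServerAliveCountMax=3',
        '-o', 'ConnectionAttempts=4',
    ]

    def start_master_ssh(hostname, user='root', port=22):
        """Start a ControlMaster connection; return (process, control socket)."""
        tmpdir = tempfile.mkdtemp(suffix='ssh-master')
        socket = os.path.join(tmpdir, 'socket')
        cmd = (['/usr/bin/ssh'] + MASTER_SSH_OPTS +
               ['-o', 'ControlPath=%s' % socket,
                '-l', user, '-p', str(port), hostname])
        proc = subprocess.Popen(cmd)
        # Autotest waits a few seconds for the control socket to appear,
        # then logs "Timed out waiting for master-ssh connection" and lets
        # subsequent commands fall back to fresh connections.
        for _ in range(5):
            if os.path.exists(socket):
                break
            time.sleep(1)
        return proc, socket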

Back in the repair log, the connection finally comes back:

02/13 05:33:46.253 DEBUG|          ssh_host:0285| Running (ssh) 'power_supply_info'
02/13 05:33:46.267 INFO |      abstract_ssh:0795| Master ssh connection to chromeos2-row4-rack7-host8 is down.
02/13 05:33:46.267 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
02/13 05:33:47.273 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
02/13 05:33:47.274 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_yKvh3Vssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:33:47.274 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_yKvh3Vssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row4-rack7-host8'
02/13 05:33:48.267 DEBUG|        base_utils:0280| [stdout] Device: Line Power
02/13 05:33:48.269 DEBUG|        base_utils:0280| [stdout]   path:                    /sys/class/power_supply/tps65090-ac
02/13 05:33:48.269 DEBUG|        base_utils:0280| [stdout]   online:                  yes
02/13 05:33:48.270 DEBUG|        base_utils:0280| [stdout]   type:                    Mains

... the command seems to succeed and everything works.

The issue is that the repair took 8.5 minutes, but it could have taken less than one minute if we had just tried a power cycle first (a double 5-second press to trigger a normal shutdown, then an EC reset).
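Concretely, the proposal is to order repair actions by cost and try the cheapest first. A hedged sketch of that ordering (the servo method names, the host.wait_up call, and the full_repair stand-in are all illustrative assumptions, not the actual repair code):

    def power_cycle_repair(host, servo):
        """Hypothetical cheap repair: clean shutdown, then EC reset."""
        for _ in range(2):
            servo.power_key(press_secs=5)  # double 5s press -> normal shutdown
        servo.cold_reset()                 # EC reset to guarantee a fresh boot
        return host.wait_up(timeout=60)

    def repair(host, servo, full_repair):
        """Try repair actions cheapest-first; stop at the first success."""
        actions = [
            lambda: power_cycle_repair(host, servo),  # well under a minute
            lambda: full_repair(host),                # the current ~8.5-minute path
        ]
        return any(action() for action in actions)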

Labels: OS-Chrome
Or maybe check_ethernet.hook should be more aggressive?  There are probably other issues with it.

BTW, in this case a simple reboot fixed the problem, but I have cases in which the ethernet dongle was wedged and multiple reboots didn't unwedge it.  Only a power cycle did.

Comment 4 by aut...@google.com, Feb 15 2017

Labels: -current-issue -Type-Bug -Pri-2 Hotlist-Fixit Pri-3 Type-Feature
Status: Archived (was: Untriaged)

Good idea, moving to FixIt list.
