Kevin provision failure due to network interface being reset |
|||||
Issue descriptionThe following build had a provision failure which could have easily doomed the entire CQ run. It appears to be the result of some bad interaction with the provision waiting for the DUT to update and the network interface being restarted for some reason: https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1129 https://viceroy.corp.google.com/chromeos/build_details?build_id=1530680 platform_DMVerityCorruption first time failure is actually due to provision failure: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1530680 https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host8/493885-provision/20171905130222/autoupdate_logs/CrOS_update_100.115.231.27_15614.log?_ga=1.223394899.97383318.1495227418 2017/05/19 13:21:23.010 DEBUG| cros_update:0193| Remote device 100.115.231.27 is connected 2017/05/19 13:21:23.011 DEBUG| cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp06qYJa/testing_rsa root@100.115.231.27 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update' 2017/05/19 13:23:26.140 ERROR| remote_access:0873| Error connecting to device 100.115.231.27 2017/05/19 13:23:26.141 DEBUG| cros_update:0253| Error happens in CrOS auto-update: SSHConnectionError('ssh: connect to host 100.115.231.27 port 22: Connection timed out\r\n',) From DUT logs at that time, ethernet was reset: https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host8/493885-provision/20171905130222/crashinfo.chromeos2-row8-rack8-host8/messages?_ga=1.211923292.97383318.1495227418 2017-05-19T20:22:42.161067+00:00 NOTICE check_ethernet.hook[4463]: Attempting recovery method "reload_usb_ethernet_devices" 2017-05-19T20:22:42.188004+00:00 NOTICE /usr/sbin/reload_network_device[4472]: Restarting network interface eth0 2017-05-19T20:22:42.215521+00:00 INFO kernel: [ 714.045370] usbcore: deregistering interface driver smsc95xx 2017-05-19T20:22:42.215573+00:00 INFO kernel: [ 714.045562] smsc95xx 5-1.2.1:1.0 eth0: unregister 'smsc95xx' usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet 2017-05-19T20:22:42.215582+00:00 INFO kernel: [ 714.045605] smsc95xx 5-1.2.1:1.0 eth0: hardware isn't capable of remote wakeup Not sure of owner for this bug..
,
May 19 2017
> 2017-05-19T20:22:42.161067+00:00 NOTICE check_ethernet.hook[4463]: Attempting recovery method "reload_usb_ethernet_devices" This happens because the DUT believed that it had no working network interface. When that happens, it tries a series of local steps designed to restart the network, in an attempt to get back online. IIRC, this handling doesn't kick in until at least 10 minutes after boot.
,
May 19 2017
,
May 19 2017
4 failures out of 24 in the last 72, but we have some known infra issues which I think probably caused a few of those. Full dut-status for those 72 hours included.
,
May 19 2017
> 4 failures out of 24 in the last 72, but we have > some known infra issues which I think probably caused > a few of those. Full dut-status for those 72 hours included. I went through all provision failures for kevin in the bvt and cq pools in the last 24 hours. Except for the single failure on chromeos2-row8-rack8-host8 cited here, all of the failures were attributable to the ongoing RPC problems in bug 723645.
,
May 19 2017
Looking at the failure, I see that provision failed because the DUT was offline at the outset. The provision task was able to gather the crashinfo because the DUT rebooted during provisioning in time to report the data. The reboot happened (as noted earlier) because the DUT knew it was offline, and was trying to get back online. I don't know why the DUT went offline in the first place.
,
May 19 2017
> Looking at the failure, I see that provision failed because the > DUT was offline at the outset. Actually, no, it seems more complicated than that. The provisioning code retried. The first attempt failed because the DUT never got online after updating; that triggered the retry. The second attempt failed because the DUT was still offline at the start. Logs were gathered from the DUT because it successfully got itself back online before the failure handling went looking for the data.
,
May 19 2017
I went through the logs again, looking at the history. I see an
anomaly.
On the first boot, you see messages like this:
2017-05-19T20:10:56.606673+00:00 INFO kernel: [ 2.530674] smsc95xx 5-1.2.1:1.0 eth0: register 'smsc95xx' at usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:15:88:14
The message occurs multiple times, probably because 'check_ethernet.hook'
was busy trying to restart the network.
Looking at the logs after reboot that succeeded, you see these messages:
2017-05-19T13:32:01.966059-07:00 INFO kernel: [ 2.294875] cdc_ether 6-1.4:2.0 eth0: register 'cdc_ether' at usb-xhci-hcd.2.auto-1.4, CDC Ethernet Device, 14:91:82:3b:ba:b6
2017-05-19T13:32:02.242153-07:00 INFO kernel: [ 2.571642] smsc95xx 5-1.2.1:1.0 eth1: register 'smsc95xx' at usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:15:88:14
The 'smsc95xx' device is the ethernet interface from the servo V3. It's not
connected to anything, thus, the DUT couldn't get online.
The 'cdc_ether' device is the ethernet USB dongle that's actually connected
to the network, but it wasn't detected on the first boot.
We've got a similar complaint about cdc_ether devices plugged into samus,
see bug 724454.
,
May 19 2017
I just talked with Harpreet@ and confirmed: o smsc9512/14 is builtin to servo v3 (which started rolling out in 2015) o servo v3 in most cases will be replaced with v4 in the next quarter or two. I've tried to look for any other cases of smcs95xx driver/device causing link problems but too many hits for "smsc95xx" in our bug tracker.
,
May 19 2017
> o smsc9512/14 is builtin to servo v3 (which started rolling out in 2015) The problem has nothing to do with the smsc network interface on the servo V3; that interface isn't plugged into anything, isn't supposed to be used, and isn't expected to work. The problem is with the cdc_ether network interface on the USB dongle. That dongle wasn't detected on the first boot, which eventually caused the DUT to reboot. On reboot, the DUT _did_ detect the network interface, and everything worked fine. So, the problem is 'cdc_ether' device not detected on USB. ... which sounds a lot like bug 724454.
,
May 19 2017
Richard - understood - I think you are correct and one of the bugs should be closed as a dup. Which ever one stays open, I'd like to add some "color". >:|
,
Nov 10 2017
,
May 18 2018
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by davidri...@chromium.org
, May 19 2017