New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 724642 link

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Kevin provision failure due to network interface being reset

Project Member Reported by davidri...@chromium.org, May 19 2017

Issue description

The following build had a provision failure which could have easily doomed the entire CQ run.  It appears to be the result of some bad interaction with the provision waiting for the DUT to update and the network interface being restarted for some reason:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/1129
https://viceroy.corp.google.com/chromeos/build_details?build_id=1530680

platform_DMVerityCorruption first time failure is actually due to provision failure:
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1530680

https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host8/493885-provision/20171905130222/autoupdate_logs/CrOS_update_100.115.231.27_15614.log?_ga=1.223394899.97383318.1495227418
2017/05/19 13:21:23.010 DEBUG|       cros_update:0193| Remote device 100.115.231.27 is connected
2017/05/19 13:21:23.011 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmp06qYJa/testing_rsa root@100.115.231.27 -- mkdir -p /mnt/stateful_partition/unencrypted/preserve/cros-update '&&' mktemp -d '--tmpdir=/mnt/stateful_partition/unencrypted/preserve/cros-update'
2017/05/19 13:23:26.140 ERROR|     remote_access:0873| Error connecting to device 100.115.231.27
2017/05/19 13:23:26.141 DEBUG|       cros_update:0253| Error happens in CrOS auto-update: SSHConnectionError('ssh: connect to host 100.115.231.27 port 22: Connection timed out\r\n',)

From DUT logs at that time, ethernet was reset:
https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos2-row8-rack8-host8/493885-provision/20171905130222/crashinfo.chromeos2-row8-rack8-host8/messages?_ga=1.211923292.97383318.1495227418
2017-05-19T20:22:42.161067+00:00 NOTICE check_ethernet.hook[4463]: Attempting recovery method "reload_usb_ethernet_devices"
2017-05-19T20:22:42.188004+00:00 NOTICE /usr/sbin/reload_network_device[4472]: Restarting network interface eth0
2017-05-19T20:22:42.215521+00:00 INFO kernel: [  714.045370] usbcore: deregistering interface driver smsc95xx
2017-05-19T20:22:42.215573+00:00 INFO kernel: [  714.045562] smsc95xx 5-1.2.1:1.0 eth0: unregister 'smsc95xx' usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet
2017-05-19T20:22:42.215582+00:00 INFO kernel: [  714.045605] smsc95xx 5-1.2.1:1.0 eth0: hardware isn't capable of remote wakeup

Not sure of owner for this bug..
 
Cc: cywang@chromium.org
> 2017-05-19T20:22:42.161067+00:00 NOTICE check_ethernet.hook[4463]: Attempting recovery method "reload_usb_ethernet_devices"

This happens because the DUT believed that it had no working network
interface.  When that happens, it tries a series of local steps designed
to restart the network, in an attempt to get back online.  IIRC, this
handling doesn't kick in until at least 10 minutes after boot.

Comment 3 by snanda@chromium.org, May 19 2017

Cc: grundler@chromium.org
Cc: cernekee@chromium.org
4 failures out of 24 in the last 72, but we have some known infra issues which I think probably caused a few of those.  Full dut-status for those 72 hours included.
chromeos2-row8-rack8-host8-dut-status.txt
27.2 KB View Download
> 4 failures out of 24 in the last 72, but we have
> some known infra issues which I think probably caused
> a few of those.  Full dut-status for those 72 hours included.

I went through all provision failures for kevin in the bvt and cq
pools in the last 24 hours.  Except for the single failure on
chromeos2-row8-rack8-host8 cited here, all of the failures were
attributable to the ongoing RPC problems in bug 723645.

Looking at the failure, I see that provision failed because the
DUT was offline at the outset.  The provision task was able to
gather the crashinfo because the DUT rebooted during provisioning
in time to report the data.  The reboot happened (as noted earlier)
because the DUT knew it was offline, and was trying to get back
online.

I don't know why the DUT went offline in the first place.

> Looking at the failure, I see that provision failed because the
> DUT was offline at the outset.

Actually, no, it seems more complicated than that.  The provisioning
code retried.  The first attempt failed because the DUT never got
online after updating; that triggered the retry.  The second attempt
failed because the DUT was still offline at the start.  Logs were
gathered from the DUT because it successfully got itself back online
before the failure handling went looking for the data.

I went through the logs again, looking at the history.  I see an
anomaly.

On the first boot, you see messages like this:
    2017-05-19T20:10:56.606673+00:00 INFO kernel: [    2.530674] smsc95xx 5-1.2.1:1.0 eth0: register 'smsc95xx' at usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:15:88:14

The message occurs multiple times, probably because 'check_ethernet.hook'
was busy trying to restart the network.

Looking at the logs after reboot that succeeded, you see these messages:
    2017-05-19T13:32:01.966059-07:00 INFO kernel: [    2.294875] cdc_ether 6-1.4:2.0 eth0: register 'cdc_ether' at usb-xhci-hcd.2.auto-1.4, CDC Ethernet Device, 14:91:82:3b:ba:b6
    2017-05-19T13:32:02.242153-07:00 INFO kernel: [    2.571642] smsc95xx 5-1.2.1:1.0 eth1: register 'smsc95xx' at usb-xhci-hcd.2.auto-1.2.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:15:88:14

The 'smsc95xx' device is the ethernet interface from the servo V3.  It's not
connected to anything, thus, the DUT couldn't get online.

The 'cdc_ether' device is the ethernet USB dongle that's actually connected
to the network, but it wasn't detected on the first boot.

We've got a similar complaint about cdc_ether devices plugged into samus,
see bug 724454.

Comment 9 by grundler@google.com, May 19 2017

I just talked with Harpreet@ and confirmed:
o smsc9512/14 is builtin to servo v3 (which started rolling out in 2015)
o servo v3 in most cases will be replaced with v4 in the next quarter or two.

I've tried to look for any other cases of smcs95xx driver/device causing link problems but too many hits for "smsc95xx" in our bug tracker.
> o smsc9512/14 is builtin to servo v3 (which started rolling out in 2015)

The problem has nothing to do with the smsc network interface
on the servo V3; that interface isn't plugged into anything,
isn't supposed to be used, and isn't expected to work.

The problem is with the cdc_ether network interface on the USB
dongle.  That dongle wasn't detected on the first boot, which
eventually caused the DUT to reboot.  On reboot, the DUT _did_
detect the network interface, and everything worked fine.

So, the problem is 'cdc_ether' device not detected on USB.
... which sounds a lot like bug 724454.

Richard - understood - I think you are correct and one of the bugs should be closed as a dup. Which ever one stays open, I'd like to add some "color". >:|
Components: Infra>Client>ChromeOS
Status: Archived (was: Untriaged)

Sign in to add a comment