
Issue 753120

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




ASIX USB Dongles breaking lab DUTs.

Project Member Reported by dgarr...@chromium.org, Aug 7 2017

Issue description

This DUT has been failing provisions and tests for a while, but keeps being put back into service because it passes verify.

This caused about 12 hours worth of CQ run failures.

 
Representative failure:

START	----	provision	timestamp=1502136962	localtime=Aug 07 13:16:02	
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1502136975	localtime=Aug 07 13:16:15	
		START	----	----	timestamp=1502137196	localtime=Aug 07 13:19:56	
			GOOD	----	sysinfo.before	timestamp=1502137196	localtime=Aug 07 13:19:56	
		END GOOD	----	----	timestamp=1502137196	localtime=Aug 07 13:19:56	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1502137303	localtime=Aug 07 13:21:43	Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row8-rack10-host22: SSHConnectionError: Connection to 100.115.199.107 timed out while waiting to read
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 113, in run_once
      force_full_update=force)
    File "/usr/local/autotest/server/afe_utils.py", line 118, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 802, in machine_install_by_devserver
      force_original=force_original)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2219, in auto_update
      error_msg % (host_name, error_list[0]))
  DevServerException: CrOS auto-update failed for host chromeos4-row8-rack10-host22: SSHConnectionError: Connection to 100.115.199.107 timed out while waiting to read
	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1502137303	localtime=Aug 07 13:21:43	
END FAIL	----	provision	timestamp=1502137303	localtime=Aug 07 13:21:43	
INFO	----	----	timestamp=1502137303	job_abort_reason=	localtime=Aug 07 13:21:43	
It seems it did pass repair sometimes. Here is a log from a "successful" repair:

START	----	repair	timestamp=1502136386	localtime=Aug 07 13:06:26	
	FAIL	----	verify.servo_ssh	timestamp=1502136454	localtime=Aug 07 13:07:34	No answer to ping from chromeos4-row8-rack10-host22-servo
	START	----	repair.rpm	timestamp=1502136454	localtime=Aug 07 13:07:34	
		FAIL	----	repair.rpm	timestamp=1502136454	localtime=Aug 07 13:07:34	Power cycling chromeos4-row8-rack10-host22-servo failed: Client call exception: <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:('Could not determine POE hostname for %s. Please check the servo-interface mapping file.', 'chromeos4-row8-rack10-host22-servo')">
	END FAIL	----	repair.rpm	timestamp=1502136454	localtime=Aug 07 13:07:34	
	GOOD	----	verify.ssh	timestamp=1502136477	localtime=Aug 07 13:07:57	
	GOOD	----	verify.power	timestamp=1502136477	localtime=Aug 07 13:07:57	
	GOOD	----	verify.fwstatus	timestamp=1502136477	localtime=Aug 07 13:07:57	
	GOOD	----	verify.good_au	timestamp=1502136477	localtime=Aug 07 13:07:57	
	GOOD	----	verify.devmode	timestamp=1502136478	localtime=Aug 07 13:07:58	
	GOOD	----	verify.writable	timestamp=1502136478	localtime=Aug 07 13:07:58	
	GOOD	----	verify.tpm	timestamp=1502136478	localtime=Aug 07 13:07:58	
	GOOD	----	verify.ext4	timestamp=1502136479	localtime=Aug 07 13:07:59	
	GOOD	----	verify.rwfw	timestamp=1502136480	localtime=Aug 07 13:08:00	
	GOOD	----	verify.python	timestamp=1502136480	localtime=Aug 07 13:08:00	
	GOOD	----	verify.cros	timestamp=1502136483	localtime=Aug 07 13:08:03	
	GOOD	----	verify.hwid	timestamp=1502136484	localtime=Aug 07 13:08:04	
	GOOD	----	verify.PASS	timestamp=1502136484	localtime=Aug 07 13:08:04	
	INFO	----	repair	timestamp=1502136484	localtime=Aug 07 13:08:04	Can't repair label 'cros-version:sentry-paladin/R62-9818.0.0-rc1'.
	INFO	----	repair	timestamp=1502136484	localtime=Aug 07 13:08:04	Can't repair label 'hw_video_acc_h264'.
	INFO	----	repair	timestamp=1502136484	localtime=Aug 07 13:08:04	Can't repair label 'pool:cq'.
	INFO	----	repair	timestamp=1502136484	localtime=Aug 07 13:08:04	Can't repair label 'board:sentry'.
END GOOD	----	repair	timestamp=1502136484	localtime=Aug 07 13:08:04	chromeos4-row8-rack10-host22 repaired successfully
Summary: chromeos4-row8-rack10-host22 Fails Provision Repeatedly, but Passes Verify/Repair (was: chromeos4-row8-rack10-host22 Fails Provision Repeatedly, but Passes Verify)
PS: I have locked the DUT, so it shouldn't be affecting live builds any more.
Cc: jrbarnette@chromium.org
Owner: grundler@chromium.org
grundler@ is looking at the DUT to see if it reproduces crbug.com/751379.

Passing this bug to him to decide whether or not to dup it.

We do need to be certain to reboot/unlock the DUT after the examination is finished.
chromeos4-row1-rack3-host11 appears to be in the same state.
This is using ASIX USB 2.0 GigE. If a reboot is "all" it takes to recover, then we should be able to extend check_ethernet.hook to perform all the same steps. It already tries several of them.

localhost ~ # lsusb
Bus 001 Device 006: ID 0424:ec00 Standard Microsystems Corp. SMSC9512/9514 Fast Ethernet Adapter
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 005: ID 0424:9514 Standard Microsystems Corp. SMC9514 Hub
Bus 001 Device 004: ID 1bcf:2c82 Sunplus Innovation Technology Inc. 
Bus 001 Device 003: ID 0b95:1780 ASIX Electronics Corp. AX88178
Bus 001 Device 002: ID 8087:0a2a Intel Corp. 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
localhost ~ # ifconfig -a
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 100.115.92.1  netmask 255.255.255.252  broadcast 100.115.92.3
        inet6 fe80::709a:d1ff:feed:5a2f  prefixlen 64  scopeid 0x20<link>
        ether 72:9a:d1:ed:5a:2f  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 536 (536.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 94:eb:2c:15:80:19  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 100.115.199.107  netmask 255.255.254.0  broadcast 100.115.199.255
        inet6 fe80::20e:c6ff:fe89:a1ad  prefixlen 64  scopeid 0x20<link>
        ether 00:0e:c6:89:a1:ad  txqueuelen 1000  (Ethernet)
        RX packets 72083394  bytes 36260186088 (33.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 8898338802 (8.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 8789427  bytes 36168818579 (33.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8789427  bytes 36168818579 (33.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether dc:a9:71:d9:8b:81  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

localhost ~ # ethtool -i eth0
driver: smsc95xx
version: 22-Aug-2005
firmware-version: smsc95xx USB 2.0 Ethernet
bus-info: usb-0000:00:14.0-9.1
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
localhost ~ # ethtool -i eth1
driver: asix
version: 22-Dec-2011
firmware-version: ASIX AX88178 USB 2.0 Ethernet
bus-info: usb-0000:00:14.0-5
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
localhost ~ # ip link show eth0
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
    link/ether 94:eb:2c:15:80:19 brd ff:ff:ff:ff:ff:ff
localhost ~ # ip link show eth1
5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 00:0e:c6:89:a1:ad brd ff:ff:ff:ff:ff:ff
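
If a reboot really is all it takes to recover, a cheaper variant of the same idea is to power-cycle just the USB NIC's driver binding rather than the whole DUT. The sketch below is a hypothetical illustration of what an extended check_ethernet.hook step could do; the interface name, sysfs layout, and function names are assumptions, not the actual hook code:

```python
# Hypothetical sketch: rebind the USB NIC's kernel driver instead of
# rebooting the DUT. NOT the real check_ethernet.hook; paths and names
# are assumptions for illustration.
import os

SYSFS_NET = "/sys/class/net"


def recovery_commands(iface):
    """Return shell-equivalent steps to power-cycle a USB NIC's driver.

    /sys/class/net/<iface>/device is a symlink to the USB interface;
    writing its name to the driver's unbind/bind files re-initializes
    the dongle without a full reboot.
    """
    dev = os.path.join(SYSFS_NET, iface, "device")
    return [
        "ip link set %s down" % iface,
        "echo $(basename $(readlink %s)) > %s/driver/unbind" % (dev, dev),
        "echo $(basename $(readlink %s)) > %s/driver/bind" % (dev, dev),
        "ip link set %s up" % iface,
    ]


if __name__ == "__main__":
    for cmd in recovery_commands("eth1"):
        print(cmd)
```

On the DUT above the affected interface is eth1 (the AX88178 dongle), so the hook would target that interface after confirming it lost connectivity.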

Here is another sentry DUT that looks similar to me. I'm also locking it away.

chromeos4-row8-rack9-host2/
Labels: -Pri-3 Pri-1
The CQ has been down for over a day, and the primary (but not only) driver has been sentry paladins.

I think we need to prioritize some solution, even if it's to paper over this by having repair reboot more aggressively.
Issue 753344 has been merged into this issue.
All right, here's the inventory of reported problems:

chromeos4-row8-rack9-host2
chromeos4-row8-rack10-host22
    Both of these hosts have ASIX USB dongles.  Those dongles
    are known to suffer from bug 695366.  The fix is to file
    a ticket with englab-sys-cros@ and request replacement of
    the dongles ASAP.  Until then, rebooting the DUTs will
    clear the symptom, if debug is done and they need to go
    back into service.

The reported symptom in  bug 751379  appears to be the same as
bug 695366, but it's not proven to be a duplicate.  If we see
that symptom again on any network adapter that _isn't_ an ASIX
adapter, we need to debug to determine if it's a duplicate.

chromeos4-row1-rack3-host11
    This DUT appears to have an unrelated problem; at first blush,
    I think it's a bad SSD.

Also, it's unclear to me why we have problems with the CQ:  If a
single problem DUT is enough to push the sentry paladin to fail,
we need to add an extra DUT (or two) to reduce the risk.  Failing
that, we can also make the paladin temporarily experimental, while
we sort out the problem.

I don't think this is a USB dongle issue. At least not firmware. It stinks a lot more like a switch issue or something higher level in the networking stack.

I do see one TX accounting bug with the ASIX driver, but the general symptom is happening with different Chromebooks and with different Ethernet dongles.
When "ssh timeout" occurs, could you please try
1) "ping" from corresponding autotest server to DUT
2) check on the DUT console if "ethtool" reports the NIC has link
3) check from DUT console if MAC address has changed (compare with previous entries in /var/log/messages)
4) ssh from workstation to the DUT
5) ssh from autotest server to the DUT

1 will tell us if the link is working (ICMP != TCP).
2 will tell us if the DUT thinks it has a link or not. If 2 is yes, then 1 should work.
3, 4, and 5 are to investigate whether something related to mismanaged ARP table entries is happening. Routers use ARP to determine which device owns which port, and then associate an IP address with that port. It is possible this doesn't match the routing tables if those are maintained manually.
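
Checks 2 and 3 above are easy to automate. This is a rough sketch, not existing lab tooling; the `ethtool` output format is standard, but the log-line pattern for check 3 is an assumption for illustration:

```python
# Sketch of automating checks 2 and 3: parse `ethtool <iface>` output for
# link state, and scan log text for the NIC's MAC address changing.
# The log format matched by mac_changed() is a hypothetical example.
import re


def has_link(ethtool_output):
    """Check 2: does ethtool report that the NIC has link?"""
    m = re.search(r"Link detected:\s*(\w+)", ethtool_output)
    return bool(m and m.group(1) == "yes")


def mac_changed(log_text, iface):
    """Check 3: did the interface's MAC change across log entries?

    Collects every MAC that appears on a line mentioning the interface
    (e.g. from /var/log/messages) and flags more than one distinct value.
    """
    macs = re.findall(
        r"%s:.*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})" % re.escape(iface),
        log_text)
    return len(set(macs)) > 1
```

Checks 1, 4, and 5 (ping and ssh from different vantage points) have to run from the autotest server or a workstation, so they can't be folded into a DUT-local script the same way.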

Let's clean up our bugs a bit.  This bug should be about two failed
DUTs with a problem known to be associated with ASIX dongles.  That
symptom is bug 695366.  The fix for this bug is for the primary infra
deputy to work with englab-sys-cros@ to get the dongles replaced.

We'll use  bug 751379  to discuss whether the known problem with ASIX
dongles is also seen with the adapter in servo V4.

I have filed this bug with the Englab Team. http://b/64485631
Owner: dgarr...@chromium.org
Taking this back, since it's associated with two DUTs that are bad for hardware reasons.
Status: Fixed (was: Untriaged)
The DUTs have new dongles and are back in the suites pool.
Summary: sentry duts had bad USB dongles (was: chromeos4-row8-rack10-host22 Fails Provision Repeatedly, but Passes Verify/Repair)
Status: Started (was: Fixed)
Summary: ASIX USB Dongles breaking lab DUTs. (was: sentry duts had bad USB dongles)
chromeos4-row11-rack7-host11 appears to be another example of this.
Also chromeos4-row11-rack6-host2
And chromeos2-row4-rack2-host2
And chromeos4-row11-rack4-host17
Those DUTs were added to b/36252619.
The work should be tracked there now.
Status: Fixed (was: Started)
