ASIX USB Dongles breaking lab DUTs. |
|||||||||
Issue descriptionThis DUT has been failing provisions and tests for a while, but keeps being put back into service because it passes verify. This caused about 12 hours worth of CQ run failures.
,
Aug 7 2017
It seems it did pass repair sometimes. Here is a log from a "successful" repair:
START ---- repair timestamp=1502136386 localtime=Aug 07 13:06:26
FAIL ---- verify.servo_ssh timestamp=1502136454 localtime=Aug 07 13:07:34 No answer to ping from chromeos4-row8-rack10-host22-servo
START ---- repair.rpm timestamp=1502136454 localtime=Aug 07 13:07:34
FAIL ---- repair.rpm timestamp=1502136454 localtime=Aug 07 13:07:34 Power cycling chromeos4-row8-rack10-host22-servo failed: Client call exception: <Fault 1: "<class 'rpm_infrastructure_exception.RPMInfrastructureException'>:('Could not determine POE hostname for %s. Please check the servo-interface mapping file.', 'chromeos4-row8-rack10-host22-servo')">
END FAIL ---- repair.rpm timestamp=1502136454 localtime=Aug 07 13:07:34
GOOD ---- verify.ssh timestamp=1502136477 localtime=Aug 07 13:07:57
GOOD ---- verify.power timestamp=1502136477 localtime=Aug 07 13:07:57
GOOD ---- verify.fwstatus timestamp=1502136477 localtime=Aug 07 13:07:57
GOOD ---- verify.good_au timestamp=1502136477 localtime=Aug 07 13:07:57
GOOD ---- verify.devmode timestamp=1502136478 localtime=Aug 07 13:07:58
GOOD ---- verify.writable timestamp=1502136478 localtime=Aug 07 13:07:58
GOOD ---- verify.tpm timestamp=1502136478 localtime=Aug 07 13:07:58
GOOD ---- verify.ext4 timestamp=1502136479 localtime=Aug 07 13:07:59
GOOD ---- verify.rwfw timestamp=1502136480 localtime=Aug 07 13:08:00
GOOD ---- verify.python timestamp=1502136480 localtime=Aug 07 13:08:00
GOOD ---- verify.cros timestamp=1502136483 localtime=Aug 07 13:08:03
GOOD ---- verify.hwid timestamp=1502136484 localtime=Aug 07 13:08:04
GOOD ---- verify.PASS timestamp=1502136484 localtime=Aug 07 13:08:04
INFO ---- repair timestamp=1502136484 localtime=Aug 07 13:08:04 Can't repair label 'cros-version:sentry-paladin/R62-9818.0.0-rc1'.
INFO ---- repair timestamp=1502136484 localtime=Aug 07 13:08:04 Can't repair label 'hw_video_acc_h264'.
INFO ---- repair timestamp=1502136484 localtime=Aug 07 13:08:04 Can't repair label 'pool:cq'.
INFO ---- repair timestamp=1502136484 localtime=Aug 07 13:08:04 Can't repair label 'board:sentry'.
END GOOD ---- repair timestamp=1502136484 localtime=Aug 07 13:08:04 chromeos4-row8-rack10-host22 repaired successfully
,
Aug 7 2017
,
Aug 7 2017
PS: I have locked the DUT, so it shouldn't be affecting live builds any more.
,
Aug 7 2017
grundler@ is looking at the DUT to see if it reproduces crbug.com/751379 . Passing him this bug to dup or not. Do need to be certain to reboot/unlock the DUT after examination is finished.
,
Aug 8 2017
chromeos4-row1-rack3-host11 appears to be in the same state.
,
Aug 8 2017
This is using ASIX USB 2.0 GigE. If a reboot is "all" it takes to recover, then we should be able to extend check_ethernet.hook to perform all the same steps. It already tries several of them.
localhost ~ # lsusb
Bus 001 Device 006: ID 0424:ec00 Standard Microsystems Corp. SMSC9512/9514 Fast Ethernet Adapter
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 005: ID 0424:9514 Standard Microsystems Corp. SMC9514 Hub
Bus 001 Device 004: ID 1bcf:2c82 Sunplus Innovation Technology Inc.
Bus 001 Device 003: ID 0b95:1780 ASIX Electronics Corp. AX88178
Bus 001 Device 002: ID 8087:0a2a Intel Corp.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
localhost ~ # ifconfig -a
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 100.115.92.1 netmask 255.255.255.252 broadcast 100.115.92.3
inet6 fe80::709a:d1ff:feed:5a2f prefixlen 64 scopeid 0x20<link>
ether 72:9a:d1:ed:5a:2f txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8 bytes 536 (536.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 94:eb:2c:15:80:19 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 100.115.199.107 netmask 255.255.254.0 broadcast 100.115.199.255
inet6 fe80::20e:c6ff:fe89:a1ad prefixlen 64 scopeid 0x20<link>
ether 00:0e:c6:89:a1:ad txqueuelen 1000 (Ethernet)
RX packets 72083394 bytes 36260186088 (33.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 8898338802 (8.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 8789427 bytes 36168818579 (33.6 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8789427 bytes 36168818579 (33.6 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlan0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether dc:a9:71:d9:8b:81 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
localhost ~ # ethtool -i eth0
driver: smsc95xx
version: 22-Aug-2005
firmware-version: smsc95xx USB 2.0 Ethernet
bus-info: usb-0000:00:14.0-9.1
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
localhost ~ # ethtool -i eth1
driver: asix
version: 22-Dec-2011
firmware-version: ASIX AX88178 USB 2.0 Ethernet
bus-info: usb-0000:00:14.0-5
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
localhost ~ # ip link show eth0
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
link/ether 94:eb:2c:15:80:19 brd ff:ff:ff:ff:ff:ff
localhost ~ # ip link show eth1
5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 00:0e:c6:89:a1:ad brd ff:ff:ff:ff:ff:ff
,
Aug 8 2017
Here is another sentry dut that looks similar to me. I'm also locking it away. chromeos4-row8-rack9-host2/
,
Aug 8 2017
The CQ has been down for over a day, and the primary (but not only) driver has been sentry paladins. I think we need to prioritize some solution, even if it's to paper over this by having repair reboot more aggressively.
,
Aug 8 2017
Issue 753344 has been merged into this issue.
,
Aug 8 2017
All right, here's the inventory of reported problems:
chromeos4-row8-rack9-host2
chromeos4-row8-rack10-host22
Both of these hosts have ASIX USB dongles. Those dongles
are known to suffer from bug 695366. The fix is to file
a ticket with englab-sys-cros@ and request replacement of
the dongles ASAP. Until then, rebooting the DUTs will
clear the symptom, if debug is done and they need to go
back into service.
The reported symptom in bug 751379 appears to be the same as
bug 695366, but it's not proven to be a duplicate. If we see
that symptom again on any network adapter that _isn't_ an ASIX
adapter, we need to debug to determine if it's a duplicate.
chromeos4-row1-rack3-host11
This DUT appears to have an unrelated problem; at first blush,
I think it's a bad SSD.
Also, it's unclear to me why we have problems with the CQ: If a
single problem DUT is enough to push the sentry paladin to fail,
we need to add an extra DUT (or two) to reduce the risk. Failing
that, we can also make the paladin temporarily experimental, while
we sort out the problem.
,
Aug 8 2017
I don't think this is a USB dongle issue. At least not firmware. It stinks a lot more like a switch issue or something higher level in the networking stack. I do see one TX accounting bug with ASIX driver, but the general symptom is happening with different chromebooks and with different ethernet dongles.
,
Aug 8 2017
When "ssh timeout" occurs, could you please try 1) "ping" from corresponding autotest server to DUT 2) check on the DUT console if "ethtool" reports the NIC has link 3) check from DUT console if MAC address has changed (compare with previous entries in /var/log/messages) 4) ssh from workstation to the DUT 5) ssh from autotest server to the DUT 1 will tell us of link is working (ICMP != TCP) 2 will tell us if the DUT thinks it has a link or not. If 2 is yes, then 1 should work. 3,4,5 are to investigate if something related to mismanaged ARP table entries is happening. Routers use ARP to determine which device owns which port and then associates an IP address with that port. It is possible this doesn't match the routing tables if those are manually maintained.
,
Aug 8 2017
Let's clean up our bugs a bit. This bug should be about two failed DUTs with a problem known to be associated with ASIX dongles. That symptom is bug 695366. The for this bug is for the primary infra deputy to work with englab-sys-cros@ to get the dongles replaced. We'll use bug 751379 to discuss whether the known problem with ASIX dongles is also seen with the adapter in servo V4.
,
Aug 8 2017
I have filed this bug with the Englab Team. http://b/64485631
,
Aug 8 2017
Taking this back, since it's associated with two DUTs that are bad for hardware reasons.
,
Aug 8 2017
The DUTs have new dongles and are back in the suites pool.
,
Aug 9 2017
,
Aug 10 2017
,
Aug 10 2017
chromeos4-row11-rack7-host11 appears to be another example of this.
,
Aug 10 2017
Also chromeos4-row11-rack6-host2
,
Aug 10 2017
And chromeos2-row4-rack2-host2
,
Aug 10 2017
And chromeos4-row11-rack4-host17
,
Aug 10 2017
Those DUTs were added too b/36252619
,
Aug 17 2017
The work should be tracked there now.
,
Aug 17 2017
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by dgarr...@chromium.org
, Aug 7 2017Representative failure: START ---- provision timestamp=1502136962 localtime=Aug 07 13:16:02 START provision_AutoUpdate provision_AutoUpdate timestamp=1502136975 localtime=Aug 07 13:16:15 START ---- ---- timestamp=1502137196 localtime=Aug 07 13:19:56 GOOD ---- sysinfo.before timestamp=1502137196 localtime=Aug 07 13:19:56 END GOOD ---- ---- timestamp=1502137196 localtime=Aug 07 13:19:56 FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1502137303 localtime=Aug 07 13:21:43 Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row8-rack10-host22: SSHConnectionError: Connection to 100.115.199.107 timed out while waiting to read Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function return func(*args, **dargs) File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute dargs) File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry postprocess_profiled_run, args, dargs) File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once self.run_once(*args, **dargs) File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 113, in run_once force_full_update=force) File "/usr/local/autotest/server/afe_utils.py", line 118, in machine_install_and_update_labels *args, **dargs) File "/usr/local/autotest/server/hosts/cros_host.py", line 802, in machine_install_by_devserver force_original=force_original) File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2219, in auto_update error_msg % (host_name, error_list[0])) DevServerException: CrOS auto-update failed for host chromeos4-row8-rack10-host22: SSHConnectionError: Connection to 100.115.199.107 timed out while waiting to read END FAIL provision_AutoUpdate provision_AutoUpdate timestamp=1502137303 localtime=Aug 07 13:21:43 END FAIL ---- provision timestamp=1502137303 localtime=Aug 07 13:21:43 INFO ---- ---- timestamp=1502137303 job_abort_reason= localtime=Aug 07 13:21:43