daisy_spring: chromeos6-row2-rack16-host16 failing to provision |
||||||||||
Issue descriptionThis daisy_spring DUT is failing a lot lately with SSH timeout https://wmatrix.googleplex.com/failures/paygen_au_dev?platforms=daisy_spring&builds=R62-9799.0.0 https://wmatrix.googleplex.com/failures/paygen_au_canary?platforms=daisy_spring&builds=R62-9798.0.0 Is there something we can do to the DUT to fix this?
,
Aug 2 2017
Assigning to this week's primary deputy.
,
Aug 2 2017
Looking at history, chromeos6-row2-rack16-host16 has failed all 64 of
the provision tasks that it's run in the past 48 hours:
$ dut-status -d 48 -f chromeos6-row2-rack16-host16 | grep provision | grep OK
$ dut-status -d 48 -f chromeos6-row2-rack16-host16 | grep provision | wc -l
64
However, history also shows it's running and passing tests:
$ dut-status -d 8 -f chromeos6-row2-rack16-host16
chromeos6-row2-rack16-host16
2017-08-02 07:28:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132185223-chromeos-test/
2017-08-02 07:28:29 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158109-reset/
2017-08-02 07:16:13 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132178955-chromeos-test/
2017-08-02 07:15:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158057-reset/
2017-08-02 07:04:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132178123-chromeos-test/
2017-08-02 07:03:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158035-reset/
2017-08-02 06:52:11 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132171247-chromeos-test/
2017-08-02 06:51:50 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157992-reset/
2017-08-02 06:39:33 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168136-chromeos-test/
2017-08-02 06:39:11 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157931-reset/
2017-08-02 06:26:52 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168130-chromeos-test/
2017-08-02 06:26:32 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157871-reset/
2017-08-02 06:13:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168127-chromeos-test/
2017-08-02 06:12:43 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157783-reset/
2017-08-02 06:00:16 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132167745-chromeos-test/
2017-08-02 05:59:48 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157679-reset/
2017-08-02 05:47:45 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132167750-chromeos-test/
2017-08-02 05:47:20 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157583-reset/
2017-08-02 01:52:13 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132146257-chromeos-test/
2017-08-02 01:51:42 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156679-reset/
2017-08-02 01:47:34 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156644-repair/
2017-08-02 01:37:01 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156586-provision/
2017-08-02 01:25:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132132527-chromeos-test/
2017-08-02 01:24:45 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156537-reset/
2017-08-02 01:12:41 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132132097-chromeos-test/
2017-08-02 01:11:57 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156438-reset/
2017-08-02 00:57:24 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156310-reset/
2017-08-02 00:44:00 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156197-reset/
,
Aug 2 2017
Regarding this snippet of the history:
2017-08-02 01:52:13 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132146257-chromeos-test/
2017-08-02 01:51:42 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156679-reset/
2017-08-02 01:47:34 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156644-repair/
2017-08-02 01:37:01 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156586-provision/
After provision failed, the repair task found the DUT was working just
fine. The test job that followed was this one:
https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=132146257
That was part of paygen testing for R61-9765.15.0.
Paygen tests are AU tests, which skip provisioning, which explains how
the test ran. However, AU tests do the same thing as provisioning,
and, true to form, the paygen test failed.
So, this DUT is unable to provision.
,
Aug 2 2017
From a devserver in the lab with the DUT:
chromeos-test@chromeos6-devserver2:~$ ssh chromeos6-row2-rack16-host16
ssh: connect to host chromeos6-row2-rack16-host16 port 22: Connection timed out
From my workstation:
: jrbarnette ~; ssh chromeos6-row2-rack16-host16
localhost ~ # logout
Connection to chromeos6-row2-rack16-host16 closed.
That explains the "provision failed, repair did nothing"
phenomenon.
This symptom is also well-known; it's caused by something
in the network stack on the DUT. Previously, it had been
blamed on the ASIX USB dongles, but there's no ASIX dongle
on this DUT. As best I can tell, the ethernet it's using is
the servo V4 ethernet.
,
Aug 2 2017
The symptom doesn't completely match the known problem with ASIX dongles. In that case, the problem could be temporarily eliminated by rebooting the DUT. For this DUT, reboot doesn't seem to help.
,
Aug 2 2017
> The symptom doesn't completely match the known problem with ASIX
> dongles. In that case, the problem could be temporarily eliminated
> by rebooting the DUT. For this DUT, reboot doesn't seem to help.
I take it back: Reboot did help.
chromeos-test@chromeos6-devserver2:~$ ssh root@chromeos6-row2-rack16-host16
Warning: Permanently added 'chromeos6-row2-rack16-host16,100.115.130.214' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos6-row2-rack16-host16,100.115.130.214' (ED25519) to the list of known hosts.
localhost ~ # logout
Connection to chromeos6-row2-rack16-host16 closed.
chromeos-test@chromeos6-devserver2:~$
,
Aug 2 2017
Assuming this problem is like the problem known problem with ASIX dongles, it's a product bug. Assuming the problem can be reproduced using the Servo V4 ethernet connection, then we have to debug it and fix it. ihf@ and cernekee@ will remember the debug the last time we saw this, and may have suggestions on next steps.
,
Aug 2 2017
Grant, could you PTAL?
,
Aug 2 2017
The ASIX issue was bug 695366. That involved "Pluggable" branded USB 2.0 dongles that were cascaded through a Type C HUB on Caroline. The dongle would get into a funny state where the activity LED was constantly blinking, and larger packets could not be received. It isn't yet clear to me how much overlap this problem has with bug 695366. First thing I'd check is whether the LED is blinking nonstop. And possibly run some of the ping experiments.
,
Aug 2 2017
,
Aug 2 2017
There is currently no ASIX device on chromeos6-row2-rack16-host16. Was this recently changed or am I confused about which host I should be looking at?
$ ssh root@chromeos6-row2-rack16-host16.cros
localhost ~ # ifconfig -a
eth0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 94:b9:b4:04:32:1a txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 100.115.130.214 netmask 255.255.254.0 broadcast 100.115.131.255
inet6 fe80::96eb:2cff:fe15:89fc prefixlen 64 scopeid 0x20<link>
ether 94:eb:2c:15:89:fc txqueuelen 1000 (Ethernet)
RX packets 2730 bytes 611526 (597.1 KiB)
RX errors 0 dropped 5 overruns 0 frame 0
TX packets 1388 bytes 265914 (259.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 11835 bytes 296060742 (282.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 11835 bytes 296060742 (282.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
mlan0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether d0:e7:82:eb:12:37 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
localhost ~ # lsusb -vt
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=exynos-ohci/3p, 12M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=s5p-ehci/3p, 480M
|__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
|__ Port 2: Dev 3, If 0, Class=Communications, Driver=cdc_ether, 480M
|__ Port 2: Dev 3, If 1, Class=CDC Data, Driver=cdc_ether, 480M
|__ Port 2: Dev 3, If 2, Class=Communications, Driver=cdc_acm, 480M
|__ Port 2: Dev 3, If 3, Class=CDC Data, Driver=cdc_acm, 480M
|__ Port 3: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
|__ Port 1: Dev 5, If 0, Class=Vendor Specific Class, Driver=r8152, 480M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
|__ Port 1: Dev 2, If 0, Class=Video, Driver=uvcvideo, 480M
|__ Port 1: Dev 2, If 1, Class=Video, Driver=uvcvideo, 480M
localhost ~ # ethtool -i eth0
driver: cdc_ether
version: 22-Aug-2005
firmware-version: CDC Ethernet Device
bus-info: usb-12110000.usb-2.2
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
localhost ~ # ethtool -i eth1
driver: r8152
version: v1.06.0 (2014/03/03)
firmware-version:
bus-info: usb-12110000.usb-2.3.1
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
,
Aug 8 2017
Note crbug.com/753120#c11 : This bug appears to be the same as bug 695366, but it's not proven to be a duplicate. If we see the symptom again on any network adapter that _isn't_ an ASIX adapter, we need to debug to determine if it's a duplicate. The key symptom to be investigated is this: * ssh to the DUT from outside of the lab succeeds. * ssh to the DUT from a devserver in the same lab as the DUT times out.
,
Aug 8 2017
Comment copied from bug 753120 : > I don't think this is a USB dongle issue. At least not firmware. It stinks a lot more like a switch issue or something higher level in the networking stack. > > I do see one TX accounting bug with ASIX driver, but the general symptom is happening with different chromebooks and with different ethernet dongles. Here's what we (well, I) learned from bug 695366: * The problem (till now) has only been seen from ASIX dongles. Replacing the dongle made the problem go away. * Rebooting the DUT made the problem go away. So did re-plugging the dongle. That is, apparently, renegotiating DHCP would make the problem go away. I'm agnostic as to whether the problem lies in the switch or in the dongle. The folks who debugged bug 695366 seemed to suspect the dongle more than the switch, but I don't believe the reasoning is in the bug report. It would be good to find out what they knew, and get it reported here.
,
Aug 8 2017
yes, please include the reasoning why it's a dongle issue. I need to understand that. I thought check_ethernet.hook forces DHCP. But am rewriting that and will post my rewrite for review shortly (untested!). I can double check the behavior on a local host. I thought DHCP also forces ARP since dhclient thinks it has no IP address and has to use ARP to request DHCPD service.
,
Nov 21 2017
,
Feb 14 2018
Has chromeos6-row2-rack16-host16 recently failed provisioning? The wmatrix link in the original comment isn't working any more and I'm not sure what the equivalent is using stainless.corp. I'm suspecting this is "mostly" fixed since loads of related code has been updated.
,
Feb 14 2018
+ deputies
,
Feb 14 2018
A good way to check this is using the new stainless dashboard. Use "custom view". Select Type "List". Put Host "chromeos6-row2-rack16-host16". It is very powerful. You get the result below. Searching for "provision" seems clean. https://stainless.corp.google.com/search?view=list&row=model&col=build&first_date=2018-01-18&last_date=2018-02-14&hostname=chromeos6-row2-rack16-host16&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
,
Feb 14 2018
Thanks Ilja!
I rewrote that to "Board = daisy_spring AND status = FAILED".
For some reason I thought this was related to provisioning but re-reading the bug it seems like the timeouts were a problem in general.
Back to using stainless: Listed Timeouts for "FAILED" do not appear to be related to SSH timeouts.
Reason LIKE "% Timeout encountered: /usr/bin/ssh %" is always associated with wifi_perf and I know that's related to netperf-2.7 upgrade (WIP).
131 "Unhandled TimeoutException: Timed out while waiting"
24 "90s for IsJavaScriptExpressionTrue."
40 "60s for py_utils.WaitFor(lambda: not self._GaiaWebviewContext(), 60)."
62 "20s for IsJavaScriptExpressionTrue."
and 7 more for something else that I didn't feel like searching for. :)
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by dhadd...@chromium.org
, Aug 2 2017