
Issue 751379

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




daisy_spring: chromeos6-row2-rack16-host16 failing to provision

Project Member Reported by dhadd...@chromium.org, Aug 2 2017

Issue description

Labels: -Pri-3 OS-Chrome Pri-2
Components: -Infra>Labs Infra>Client>ChromeOS
Owner: shuqianz@chromium.org
Status: Assigned (was: Untriaged)
Assigning to this week's primary deputy.

Looking at history, chromeos6-row2-rack16-host16 has failed all 64 of
the provision tasks that it's run in the past 48 hours:

    $ dut-status -d 48 -f chromeos6-row2-rack16-host16 | grep provision | grep OK
    $ dut-status -d 48 -f chromeos6-row2-rack16-host16 | grep provision | wc -l  
    64
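The two pipelines above can be folded into one summary over the same history. A sketch of that check, using stand-in sample lines in place of real dut-status output (the `history` variable here is a hypothetical fixture, not actual lab data):

```shell
# Count provision attempts and passes in dut-status-style history.
# "OK" marks a passing task, "--" a failure, as in the listing above.
# The sample lines below are stand-ins for real dut-status output.
history='2017-08-02 01:37:01  -- .../1156586-provision/
2017-08-02 01:24:45  OK .../1156537-reset/
2017-08-01 23:10:02  -- .../1156000-provision/'

provision=$(printf '%s\n' "$history" | grep provision)
total=$(printf '%s\n' "$provision" | grep -c provision)
ok=$(printf '%s\n' "$provision" | grep -c ' OK ')
echo "provision attempts: $total, passed: $ok"
```

On the sample fixture this prints `provision attempts: 2, passed: 0`; on the real host's history it would report 64 attempts, 0 passed.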

However, history also shows it's running and passing tests:
    $ dut-status -d 8 -f chromeos6-row2-rack16-host16
    chromeos6-row2-rack16-host16
        2017-08-02 07:28:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132185223-chromeos-test/
        2017-08-02 07:28:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158109-reset/
        2017-08-02 07:16:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132178955-chromeos-test/
        2017-08-02 07:15:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158057-reset/
        2017-08-02 07:04:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132178123-chromeos-test/
        2017-08-02 07:03:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1158035-reset/
        2017-08-02 06:52:11  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132171247-chromeos-test/
        2017-08-02 06:51:50  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157992-reset/
        2017-08-02 06:39:33  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168136-chromeos-test/
        2017-08-02 06:39:11  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157931-reset/
        2017-08-02 06:26:52  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168130-chromeos-test/
        2017-08-02 06:26:32  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157871-reset/
        2017-08-02 06:13:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132168127-chromeos-test/
        2017-08-02 06:12:43  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157783-reset/
        2017-08-02 06:00:16  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132167745-chromeos-test/
        2017-08-02 05:59:48  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157679-reset/
        2017-08-02 05:47:45  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132167750-chromeos-test/
        2017-08-02 05:47:20  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1157583-reset/
        2017-08-02 01:52:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132146257-chromeos-test/
        2017-08-02 01:51:42  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156679-reset/
        2017-08-02 01:47:34  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156644-repair/
        2017-08-02 01:37:01  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156586-provision/
        2017-08-02 01:25:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132132527-chromeos-test/
        2017-08-02 01:24:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156537-reset/
        2017-08-02 01:12:41  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132132097-chromeos-test/
        2017-08-02 01:11:57  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156438-reset/
        2017-08-02 00:57:24  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156310-reset/
        2017-08-02 00:44:00  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156197-reset/

Regarding this snippet of the history:
        2017-08-02 01:52:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/132146257-chromeos-test/
        2017-08-02 01:51:42  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156679-reset/
        2017-08-02 01:47:34  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156644-repair/
        2017-08-02 01:37:01  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack16-host16/1156586-provision/

After provision failed, the repair task found the DUT was working just
fine.  The test job that followed was this one:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=132146257

That was part of paygen testing for R61-9765.15.0.

Paygen tests are AU (autoupdate) tests, which skip provisioning; that
explains how the test ran.  However, AU tests do the same thing as
provisioning, and, true to form, the paygen test failed.

So, this DUT is unable to provision.

From a devserver in the lab with the DUT:

    chromeos-test@chromeos6-devserver2:~$ ssh chromeos6-row2-rack16-host16
    ssh: connect to host chromeos6-row2-rack16-host16 port 22: Connection timed out


From my workstation:
    : jrbarnette ~; ssh chromeos6-row2-rack16-host16
    localhost ~ # logout
    Connection to chromeos6-row2-rack16-host16 closed.

That explains the "provision failed, repair did nothing"
phenomenon.

This symptom is also well known; it's caused by something
in the network stack on the DUT.  Previously, it had been
blamed on the ASIX USB dongles, but there's no ASIX dongle
on this DUT.  As best I can tell, the Ethernet it's using is
the Servo v4 Ethernet port.

Cc: cernekee@chromium.org ihf@chromium.org
Components: -Infra>Client>ChromeOS OS>Kernel
The symptom doesn't completely match the known problem with ASIX
dongles.  In that case, the problem could be temporarily eliminated
by rebooting the DUT.  For this DUT, reboot doesn't seem to help.

> The symptom doesn't completely match the known problem with ASIX
> dongles.  In that case, the problem could be temporarily eliminated
> by rebooting the DUT.  For this DUT, reboot doesn't seem to help.

I take it back:  Reboot did help.
    chromeos-test@chromeos6-devserver2:~$ ssh root@chromeos6-row2-rack16-host16
    Warning: Permanently added 'chromeos6-row2-rack16-host16,100.115.130.214' (ED25519) to the list of known hosts.
    Warning: Permanently added 'chromeos6-row2-rack16-host16,100.115.130.214' (ED25519) to the list of known hosts.
    localhost ~ # logout
    Connection to chromeos6-row2-rack16-host16 closed.
    chromeos-test@chromeos6-devserver2:~$ 

Cc: akes...@chromium.org snanda@chromium.org
Owner: ----
Status: Available (was: Assigned)
Assuming this problem is like the known problem with ASIX
dongles, it's a product bug.  Assuming the problem can be
reproduced using the Servo v4 Ethernet connection, then we
have to debug it and fix it.

ihf@ and cernekee@ will remember the debugging from the last time we
saw this, and may have suggestions on next steps.

Owner: grundler@chromium.org
Grant, could you PTAL?
The ASIX issue was bug 695366.  That involved "Pluggable"-branded USB 2.0 dongles that were cascaded through a Type-C hub on Caroline.  The dongle would get into a funny state where the activity LED was constantly blinking and larger packets could not be received.

It isn't yet clear to me how much overlap this problem has with bug 695366.  First thing I'd check is whether the LED is blinking nonstop.  And possibly run some of the ping experiments.
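The "ping experiments" mentioned above can be sketched as a payload-size sweep. The host name and the size list are assumptions, and the commands are printed rather than executed so the sketch stays inert; `-M do` forbids IP fragmentation, so each probe tests the full packet size on the wire:

```shell
# Sweep ICMP payload sizes to check whether large packets get through
# (bug 695366's symptom was that larger packets could not be received).
# 1472 is the biggest payload that fits a 1500-byte MTU unfragmented
# (1500 - 20 IP header - 8 ICMP header); 2048 should fail with -M do.
dut=chromeos6-row2-rack16-host16
for size in 64 512 1024 1472 2048; do
  cmd="ping -c 3 -M do -s $size $dut"
  echo "$cmd"   # print rather than run; copy-paste onto the devserver
done
```

If small pings succeed but the larger sizes lose every packet, that would match the bug 695366 behavior.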

Comment 11 by jkwang@google.com, Aug 2 2017

Cc: jkwang@chromium.org

There is currently no ASIX device on chromeos6-row2-rack16-host16. Was this recently changed or am I confused about which host I should be looking at?

$ ssh root@chromeos6-row2-rack16-host16.cros
localhost ~ # ifconfig -a
eth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 94:b9:b4:04:32:1a  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 100.115.130.214  netmask 255.255.254.0  broadcast 100.115.131.255
        inet6 fe80::96eb:2cff:fe15:89fc  prefixlen 64  scopeid 0x20<link>
        ether 94:eb:2c:15:89:fc  txqueuelen 1000  (Ethernet)
        RX packets 2730  bytes 611526 (597.1 KiB)
        RX errors 0  dropped 5  overruns 0  frame 0
        TX packets 1388  bytes 265914 (259.6 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 11835  bytes 296060742 (282.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11835  bytes 296060742 (282.3 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

mlan0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether d0:e7:82:eb:12:37  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

localhost ~ # lsusb -vt
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=exynos-ohci/3p, 12M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=s5p-ehci/3p, 480M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/3p, 480M
        |__ Port 2: Dev 3, If 0, Class=Communications, Driver=cdc_ether, 480M
        |__ Port 2: Dev 3, If 1, Class=CDC Data, Driver=cdc_ether, 480M
        |__ Port 2: Dev 3, If 2, Class=Communications, Driver=cdc_acm, 480M
        |__ Port 2: Dev 3, If 3, Class=CDC Data, Driver=cdc_acm, 480M
        |__ Port 3: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 1: Dev 5, If 0, Class=Vendor Specific Class, Driver=r8152, 480M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Video, Driver=uvcvideo, 480M
    |__ Port 1: Dev 2, If 1, Class=Video, Driver=uvcvideo, 480M
localhost ~ # ethtool -i eth0
driver: cdc_ether
version: 22-Aug-2005
firmware-version: CDC Ethernet Device
bus-info: usb-12110000.usb-2.2
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
localhost ~ # ethtool -i eth1
driver: r8152
version: v1.06.0 (2014/03/03)
firmware-version: 
bus-info: usb-12110000.usb-2.3.1
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

Note crbug.com/753120#c11: This bug appears to be the same as
bug 695366, but it's not proven to be a duplicate.  If we see
the symptom again on any network adapter that _isn't_ an ASIX
adapter, we need to debug to determine if it's a duplicate.

The key symptom to be investigated is this:
  * ssh to the DUT from outside of the lab succeeds.
  * ssh to the DUT from a devserver in the same lab as the DUT
    times out.
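A minimal probe for that symptom, assuming bash is available (it uses the `/dev/tcp` pseudo-device) along with the coreutils `timeout` utility. Run it once from a devserver in the DUT's lab and once from outside the lab, and compare the results:

```shell
# probe HOST PORT: succeed iff a TCP connection opens within 5 seconds.
# Per the symptom above, port 22 on the DUT should connect from outside
# the lab but time out from a devserver in the same lab.
probe() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

dut=chromeos6-row2-rack16-host16
if probe "$dut" 22; then
  echo "$dut:22 reachable from here"
else
  echo "$dut:22 NOT reachable from here (timeout or refused)"
fi
```

This only checks the TCP handshake, not ssh authentication, which is enough to distinguish "times out" from "connects".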

Comment copied from bug 753120:
> I don't think this is a USB dongle issue. At least not firmware. It stinks a lot more like a switch issue or something higher level in the networking stack.
>
> I do see one TX accounting bug with ASIX driver, but the general symptom is happening with different chromebooks and with different ethernet dongles.

Here's what we (well, I) learned from bug 695366:
  * The problem (till now) has only been seen with ASIX dongles.
    Replacing the dongle made the problem go away.
  * Rebooting the DUT made the problem go away.  So did re-plugging
    the dongle.  That is, it appears that renegotiating DHCP clears
    the problem.

I'm agnostic as to whether the problem lies in the switch or in the
dongle.  The folks who debugged bug 695366 seemed to suspect the dongle
more than the switch, but I don't believe the reasoning is in the bug
report.  It would be good to find out what they knew, and get it reported
here.

Yes, please include the reasoning for why it's a dongle issue. I need to understand that.

I thought check_ethernet.hook forces DHCP, but I am rewriting that script and will post the rewrite for review shortly (untested!). I can double-check the behavior on a local host.

I thought DHCP also forces ARP traffic, since dhclient starts out thinking it has no IP address and has to broadcast to reach the DHCP server.
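The renegotiation being discussed can be sketched for a generic Linux host. The interface name (`eth1`, the r8152 dongle in the ifconfig listing above) and the use of dhclient are assumptions: ChromeOS itself manages DHCP through shill, so this mirrors the idea rather than the lab hook's exact mechanism. By default it only prints the commands instead of touching the network:

```shell
# Force a DHCP renegotiation on the dongle interface, the effect the
# comments attribute to rebooting or re-plugging the dongle.  dhclient
# starts the rebind with no lease, so the DISCOVER/REQUEST exchange is
# broadcast, which also generates fresh ARP traffic on the segment.
iface=${IFACE:-eth1}

run() {
  # Dry-run by default; set DO_IT=1 (and be root) to actually execute.
  if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi
}

run dhclient -r "$iface"   # release the current lease
run dhclient "$iface"      # broadcast DISCOVER and rebind
```

If the symptom clears after the rebind, that would support the "renegotiating DHCP makes it go away" theory without a full reboot.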
Summary: daisy_spring: chromeos6-row2-rack16-host16 failing to provision (was: chromeos6-row2-rack16-host16 is timing out all the time)
Has chromeos6-row2-rack16-host16 recently failed provisioning?

The wmatrix link in the original comment isn't working any more and I'm not sure what the equivalent is using stainless.corp.

I suspect this is "mostly" fixed, since loads of related code has been updated.
Cc: -xixuan@chromium.org pho...@chromium.org dgarr...@chromium.org
+ deputies

Comment 19 by ihf@chromium.org, Feb 14 2018

Cc: nya@chromium.org
A good way to check this is using the new stainless dashboard.
Use "custom view".
Select Type "List".
Put Host "chromeos6-row2-rack16-host16".
It is very powerful. You get the result below. Searching for "provision" seems clean.

https://stainless.corp.google.com/search?view=list&row=model&col=build&first_date=2018-01-18&last_date=2018-02-14&hostname=chromeos6-row2-rack16-host16&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
Status: Fixed (was: Available)
Thanks Ilja!

I rewrote that to "Board = daisy_spring AND status = FAILED".

For some reason I thought this was related to provisioning but re-reading the bug it seems like the timeouts were a problem in general.

Back to using stainless: Listed Timeouts for "FAILED" do not appear to be related to SSH timeouts.


Reason LIKE "% Timeout encountered: /usr/bin/ssh %" is always associated with wifi_perf and I know that's related to netperf-2.7 upgrade (WIP).

131 "Unhandled TimeoutException: Timed out while waiting"
     24  "90s for IsJavaScriptExpressionTrue."
     40  "60s for py_utils.WaitFor(lambda: not self._GaiaWebviewContext(), 60)."
     62  "20s for IsJavaScriptExpressionTrue."

and 7 more for something else that I didn't feel like searching for. :)  
