New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 758781 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Feature

Blocking:
issue 677572
issue 768542
issue 777920



Sign in to add a comment

Make check_ethernet.hook reboot aggressively in known failure modes

Project Member Reported by semenzato@chromium.org, Aug 25 2017

Issue description

For instance it could run ethtool to collect more information about the NIC failure.


 
Blocking: 777920 768542
Labels: -Pri-2 Pri-1
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Summary: Make check_ethernet.hook reboot aggressively in known failure modes (was: check_ethernet.hook should grab more failure data)
Changing the ask to collect some known failure modes via either ethtool, or by just grepping /var/log/messages and rebooting aggressively when a known failure mode is detected.

The two we could start with are:
issue 777920: /var/log/messages contains:
2017-10-23T12:13:30.933745-07:00 ERR kernel: [    1.552491] usb 3-1: can't set config #1, error -71

issue 768542: /var/log/messages contains:
2017-10-23T12:13:30.933745-07:00 ERR kernel: [    1.552491] usb 3-1: can't set config #1, error -71
The second error messages should have been:
2017-10-24T02:48:58.612695+00:00 ERR kernel: [    3.651830] usb 2-5: device not accepting address 3, error -71
Components: -Test Infra>Client>ChromeOS
Labels: -Type-Bug Type-Feature
Note that grundler@ has changes along this line already...
Most especially, I think this CL is the one with the required
change:
    crosreview.com/611391
Blocking: 677572
> Most especially, I think this CL is the one with the required
> change:
>     crosreview.com/611391

More specifically, I believe that's the change that will make
a DUT reboot more quickly when there's a network problem.

However, w.r.t. the "usb 3-1: can't set config #1" - Is that a
problem that's known to get better if you reboot repeatedly?
Just because we can detect the problem on the DUT, that doesn't
mean that the DUT can do anything for the fix...

Owner: grundler@chromium.org
Re #5: No, I'm not sure it helps. I've asked on issue 768542 for someone to repro this locally. They should be able to answer this question in one of the two cases.
We really want to know if the DUT found a ethernet adapter or not. Scraping /var/log/messages isn't a good idea in the event the log has stale (from previous reboots) output. I think CL Richard pointed at has code that determines there is no ethernet device and we should immediately reboot. It's been a few weeks since I've looked at this.
We can still scrape /var/log/messages and consider only messages after the last boot.

We could also scrape /dev/kmsg but I don't know if it could have wrapped around by then.
Incidentally, a warm reboot does not appear to fix error -71 on reef boards.  The dongle needs to be power-cycled, which is done through the servo in the repair step.

I was informed that there is no way of cycling the USB power bus in software on some boards.  But it is possible on other boards, so maybe check_ethernet.hook should try that right away.

Cc: -grundler@chromium.org semenzato@chromium.org tbroch@chromium.org
I've uploaded a different implementation of flock() for autotest (in do_suspend()) but am pretty sure it won't work right:
   https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/653661

Todd, can someone on your team help me rewrite this side of the autotest so check_ethernet.hook knows when any test that invokes do_suspend (in a loop) is mutually exclusive with check_ethernet.hook?

They should take a look at this group of patches:
   https://chromium-review.googlesource.com/q/topic:check_ethernet_v2+(status:open+OR+status:merged)

Can't we just stop recover_duts?


Todd, sorry - I missed your question. Short answer: No.

Longer answer from Luigi's email:
> If you call "stop recover_duts" and then the test dies, who is going
> to restart recover_duts?  The advantage of flock() is that the file
> is unlocked when the process holding the lock exits.

If you have more thoughts on use of flock, CL 653661 would be a better place to post those:
   https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/653661
Here's a case of toggle_ethernet_interfaces working.

Quick provisioning happening from chromeos2-devserver9 to chromeos4-row2-rack12-host9 (kip-release/R64-10159.0.0).  Notable logs from the three files:

2017/12/12 10:12:05.115 DEBUG|       cros_update:0276| Start CrOS quick provision process...

Device becomes unresponsive midtransfer:
196608K ........ ........ ........ ........ 57% 13.8M 9s
229376K ........ ........ ........ ........ 65% 32.0M 7s
262144K ........ ........ ......
2017/12/12 10:14:47.527 DEBUG|    cros_build_lib:0644| (stderr):

Actual last modification of the log file is 41 seconds earlier:
localhost log # stat /var/log/quick-provision.log
  File: '/var/log/quick-provision.log'
  Size: 5317      	Blocks: 16         IO Block: 4096   regular file
Device: fd01h/64769d	Inode: 215         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-12-12 10:12:05.122000968 -0800
Modify: 2017-12-12 10:14:06.767003417 -0800
Change: 2017-12-12 10:14:06.767003417 -0800

toggle_ethernet_interfaces recovery:
2017-12-12T18:16:42.862042+00:00 NOTICE check_ethernet.hook[3702]: Attempting recovery method "ensure_connection_manager_is_running"
2017-12-12T18:16:42.882949+00:00 NOTICE check_ethernet.hook[3706]: Attempting recovery method "toggle_ethernet_interfaces"
2017-12-12T18:16:42.908905+00:00 INFO kernel: [  364.977925] smsc95xx 1-2.2.1:1.0 eth0: hardware isn't capable of remote wakeup
2017-12-12T18:16:42.947911+00:00 INFO kernel: [  365.016199] smsc95xx 1-2.2.1:1.0 eth0: hardware isn't capable of remote wakeup
2017-12-12T18:16:42.947936+00:00 INFO kernel: [  365.016406] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
2017-12-12T18:16:42.949741+00:00 INFO avahi-daemon[1799]: Interface eth1.IPv4 no longer relevant for mDNS.
2017-12-12T18:16:42.949764+00:00 INFO avahi-daemon[1799]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105.
2017-12-12T18:16:42.953546+00:00 INFO avahi-daemon[1799]: Withdrawing address record for 100.115.193.105 on eth1.
2017-12-12T18:16:42.956911+00:00 ERR dhcpcd[2672]: eth1: dhcp if_readrawpacket: Network is down
2017-12-12T18:16:42.957428+00:00 ERR dhcpcd[2672]: eth1: if_sendrawpacket: Network is down
2017-12-12T18:16:42.957538+00:00 ERR dhcpcd[2672]: eth1: dhcp if_readrawpacket: Network is down
2017-12-12T18:16:42.995884+00:00 INFO avahi-daemon[1799]: Joining mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105.
2017-12-12T18:16:42.996355+00:00 INFO avahi-daemon[1799]: New relevant interface eth1.IPv4 for mDNS.
2017-12-12T18:16:42.996386+00:00 INFO avahi-daemon[1799]: Registering new address record for 100.115.193.105 on eth1.IPv4.
2017-12-12T18:16:42.996923+00:00 INFO avahi-daemon[1799]: Withdrawing address record for 100.115.193.105 on eth1.
2017-12-12T18:16:42.996958+00:00 INFO avahi-daemon[1799]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105.
2017-12-12T18:16:42.997123+00:00 INFO avahi-daemon[1799]: Interface eth1.IPv4 no longer relevant for mDNS.
2017-12-12T18:16:43.002899+00:00 WARNING kernel: [  365.071884] smsc75xx 1-1:1.0 eth1: device_set_wakeup_enable error -22
2017-12-12T18:16:43.894934+00:00 INFO avahi-daemon[1799]: Server startup complete. Host name is 09f3aaa8f8ab8568cf26595d5a301d0b.local. Local service cookie is 3935027636.
2017-12-12T18:16:43.894965+00:00 INFO avahi-daemon[1799]: Changing host name to '09f3aaa8f8ab8568cf26595d5a301d0b'.
2017-12-12T18:16:43.901059+00:00 NOTICE cros-machine-id-regen[3748]: Regenerated /var/lib/dbus/machine-id (reason: network).
2017-12-12T18:16:46.574045+00:00 INFO kernel: [  368.647843] smsc75xx 1-1:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC5E1
2017-12-12T18:16:46.578027+00:00 INFO kernel: [  368.652402] smsc75xx 1-1:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC5E1
2017-12-12T18:16:56.637098+00:00 INFO avahi-daemon[1799]: Joining mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105.
2017-12-12T18:16:56.637538+00:00 INFO avahi-daemon[1799]: New relevant interface eth1.IPv4 for mDNS.
2017-12-12T18:16:56.637587+00:00 INFO avahi-daemon[1799]: Registering new address record for 100.115.193.105 on eth1.IPv4.
2017-12-12T18:17:05.149238+00:00 INFO sshd[3774]: Accepted publickey for root from 100.115.245.201 port 59914 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-12-12T18:17:05.246388+00:00 INFO sshd[3774]: Received disconnect from 100.115.245.201 port 59914:11: disconnected by user
2017-12-12T18:17:05.246447+00:00 INFO sshd[3774]: Disconnected from user root 100.115.245.201 port 59914
2017-12-12T18:17:32.071445+00:00 NOTICE check_ethernet.hook[3831]: Recovery method "toggle_ethernet_interfaces" successful

One thing I'd like to verify is the host we're using for verification.  I'm okay using a remote server for initial detection, but at a minimum I think we should be trying a number of other hosts (upstream router, local devserver, an unrelated corporate device) as some information gathering before we tear down things.

From the logs it looks as if we lost connectivity  with the LAN so it does seem like it's likely DUT related, and we should spawn off some bugs to find out what actually happened.  (ie, try sending lots of data to the kip and see if the same issue occurs).
messages-chromeos4-row2-rack12-host9
106 KB View Download
quick-provision.log-chromeos4-row2-rack12-host9
5.2 KB Download
chromeos4-row2-rack12-host9_342388.log
9.3 KB View Download
Project Member

Comment 14 by bugdroid1@chromium.org, Feb 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crostestutils/+/6b10c012a9fa5272ffa2cad15b153a29652664c5

commit 6b10c012a9fa5272ffa2cad15b153a29652664c5
Author: Grant Grundler <grundler@chromium.org>
Date: Fri Feb 09 07:42:19 2018

check_ethernet: try each recovery method only once

Reboot has proven to be the most effective "recovery" method.
15 minutes was too long to get to reboot.

Restructured the main() recovery loop with that in mind:
1) execute each method only once before rebooting.
2) drop methods which have not proven effective.
3) look for PAUSE_FILE before trying any method.
4) Stop avoiding cdc_ether driver. This is the
   NIC driver for Linksys USB3GIGV1 on older kernels.

For (2) provision jobs logs from bvt/cq pools was used to determine which
methods were effective for recovery. "Survivorship bias" was considered
and in fact a primary reason some methods were dropped.

I believe (4) was added to avoid bouncing 3G/4G/LTE modems.
Tests that depend on LTE and bring down the link to lab network will need
to use the open(PAUSE_ETHERNET_HOOK_FILE)/flock() also used by sys_power.py
to disable these checks.

BUG= chromium:758781 
TEST=copy script to /tmp on DUT
       $ stop recover_duts
    Disconnect LAN cable. Then:
       $ /tmp/check_ethernet.hook
    And verify the DUT tries each method and then reboots.

Change-Id: I296d5565d264442a9033d6416a8262b1865961af
Reviewed-on: https://chromium-review.googlesource.com/611391
Commit-Ready: Grant Grundler <grundler@chromium.org>
Tested-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Ben Chan <benchan@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/6b10c012a9fa5272ffa2cad15b153a29652664c5/recover_duts/hooks/check_ethernet.hook

Status: Fixed (was: Assigned)
Wow. What a long journey. So far the sky hasn't fallen. Please message me directly if this becomes necessary (but don't hesitate to revert - just do it).

Sign in to add a comment