Make check_ethernet.hook reboot aggressively in known failure modes |
||||||
Issue descriptionFor instance it could run ethtool to collect more information about the NIC failure.
,
Oct 24 2017
The second error messages should have been: 2017-10-24T02:48:58.612695+00:00 ERR kernel: [ 3.651830] usb 2-5: device not accepting address 3, error -71
,
Oct 24 2017
Note that grundler@ has changes along this line already...
Most especially, I think this CL is the one with the required
change:
crosreview.com/611391
,
Oct 24 2017
,
Oct 24 2017
> Most especially, I think this CL is the one with the required > change: > crosreview.com/611391 More specifically, I believe that's the change that will make a DUT reboot more quickly when there's a network problem. However, w.r.t. the "usb 3-1: can't set config #1" - Is that a problem that's known to get better if you reboot repeatedly? Just because we can detect the problem on the DUT, that doesn't mean that the DUT can do anything for the fix...
,
Oct 24 2017
Re #5: No, I'm not sure it helps. I've asked on issue 768542 for someone to repro this locally. They should be able to answer this question in one of the two cases.
,
Nov 5 2017
We really want to know if the DUT found a ethernet adapter or not. Scraping /var/log/messages isn't a good idea in the event the log has stale (from previous reboots) output. I think CL Richard pointed at has code that determines there is no ethernet device and we should immediately reboot. It's been a few weeks since I've looked at this.
,
Nov 6 2017
We can still scrape /var/log/messages and consider only messages after the last boot. We could also scrape /dev/kmsg but I don't know if it could have wrapped around by then.
,
Nov 9 2017
Incidentally, a warm reboot does not appear to fix error -71 on reef boards. The dongle needs to be power-cycled, which is done through the servo in the repair step. I was informed that there is no way of cycling the USB power bus in software on some boards. But it is possible on other boards, so maybe check_ethernet.hook should try that right away.
,
Nov 14 2017
I've uploaded a different implementation of flock() for autotest (in do_suspend()) but am pretty sure it won't work right: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/653661 Todd, can someone on your team help me rewrite this side of the autotest so check_ethernet.hook knows when any test that invokes do_suspend (in a loop) is mutually exclusive with check_ethernet.hook? They should take a look at this group of patches: https://chromium-review.googlesource.com/q/topic:check_ethernet_v2+(status:open+OR+status:merged)
,
Nov 15 2017
Can't we just stop recover_duts?
,
Nov 28 2017
Todd, sorry - I missed your question. Short answer: No. Longer answer from Luigi's email: > If you call "stop recover_duts" and then the test dies, who is going > to restart recover_duts? The advantage of flock() is that the file > is unlocked when the process holding the lock exits. If you have more thoughts on use of flock, CL 653661 would be a better place to post those: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/653661
,
Dec 12 2017
Here's a case of toggle_ethernet_interfaces working. Quick provisioning happening from chromeos2-devserver9 to chromeos4-row2-rack12-host9 (kip-release/R64-10159.0.0). Notable logs from the three files: 2017/12/12 10:12:05.115 DEBUG| cros_update:0276| Start CrOS quick provision process... Device becomes unresponsive midtransfer: 196608K ........ ........ ........ ........ 57% 13.8M 9s 229376K ........ ........ ........ ........ 65% 32.0M 7s 262144K ........ ........ ...... 2017/12/12 10:14:47.527 DEBUG| cros_build_lib:0644| (stderr): Actual last modification of the log file is 41 seconds earlier: localhost log # stat /var/log/quick-provision.log File: '/var/log/quick-provision.log' Size: 5317 Blocks: 16 IO Block: 4096 regular file Device: fd01h/64769d Inode: 215 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2017-12-12 10:12:05.122000968 -0800 Modify: 2017-12-12 10:14:06.767003417 -0800 Change: 2017-12-12 10:14:06.767003417 -0800 toggle_ethernet_interfaces recovery: 2017-12-12T18:16:42.862042+00:00 NOTICE check_ethernet.hook[3702]: Attempting recovery method "ensure_connection_manager_is_running" 2017-12-12T18:16:42.882949+00:00 NOTICE check_ethernet.hook[3706]: Attempting recovery method "toggle_ethernet_interfaces" 2017-12-12T18:16:42.908905+00:00 INFO kernel: [ 364.977925] smsc95xx 1-2.2.1:1.0 eth0: hardware isn't capable of remote wakeup 2017-12-12T18:16:42.947911+00:00 INFO kernel: [ 365.016199] smsc95xx 1-2.2.1:1.0 eth0: hardware isn't capable of remote wakeup 2017-12-12T18:16:42.947936+00:00 INFO kernel: [ 365.016406] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready 2017-12-12T18:16:42.949741+00:00 INFO avahi-daemon[1799]: Interface eth1.IPv4 no longer relevant for mDNS. 2017-12-12T18:16:42.949764+00:00 INFO avahi-daemon[1799]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105. 2017-12-12T18:16:42.953546+00:00 INFO avahi-daemon[1799]: Withdrawing address record for 100.115.193.105 on eth1. 2017-12-12T18:16:42.956911+00:00 ERR dhcpcd[2672]: eth1: dhcp if_readrawpacket: Network is down 2017-12-12T18:16:42.957428+00:00 ERR dhcpcd[2672]: eth1: if_sendrawpacket: Network is down 2017-12-12T18:16:42.957538+00:00 ERR dhcpcd[2672]: eth1: dhcp if_readrawpacket: Network is down 2017-12-12T18:16:42.995884+00:00 INFO avahi-daemon[1799]: Joining mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105. 2017-12-12T18:16:42.996355+00:00 INFO avahi-daemon[1799]: New relevant interface eth1.IPv4 for mDNS. 2017-12-12T18:16:42.996386+00:00 INFO avahi-daemon[1799]: Registering new address record for 100.115.193.105 on eth1.IPv4. 2017-12-12T18:16:42.996923+00:00 INFO avahi-daemon[1799]: Withdrawing address record for 100.115.193.105 on eth1. 2017-12-12T18:16:42.996958+00:00 INFO avahi-daemon[1799]: Leaving mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105. 2017-12-12T18:16:42.997123+00:00 INFO avahi-daemon[1799]: Interface eth1.IPv4 no longer relevant for mDNS. 2017-12-12T18:16:43.002899+00:00 WARNING kernel: [ 365.071884] smsc75xx 1-1:1.0 eth1: device_set_wakeup_enable error -22 2017-12-12T18:16:43.894934+00:00 INFO avahi-daemon[1799]: Server startup complete. Host name is 09f3aaa8f8ab8568cf26595d5a301d0b.local. Local service cookie is 3935027636. 2017-12-12T18:16:43.894965+00:00 INFO avahi-daemon[1799]: Changing host name to '09f3aaa8f8ab8568cf26595d5a301d0b'. 2017-12-12T18:16:43.901059+00:00 NOTICE cros-machine-id-regen[3748]: Regenerated /var/lib/dbus/machine-id (reason: network). 2017-12-12T18:16:46.574045+00:00 INFO kernel: [ 368.647843] smsc75xx 1-1:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC5E1 2017-12-12T18:16:46.578027+00:00 INFO kernel: [ 368.652402] smsc75xx 1-1:1.0 eth1: link up, 1000Mbps, full-duplex, lpa 0xC5E1 2017-12-12T18:16:56.637098+00:00 INFO avahi-daemon[1799]: Joining mDNS multicast group on interface eth1.IPv4 with address 100.115.193.105. 2017-12-12T18:16:56.637538+00:00 INFO avahi-daemon[1799]: New relevant interface eth1.IPv4 for mDNS. 2017-12-12T18:16:56.637587+00:00 INFO avahi-daemon[1799]: Registering new address record for 100.115.193.105 on eth1.IPv4. 2017-12-12T18:17:05.149238+00:00 INFO sshd[3774]: Accepted publickey for root from 100.115.245.201 port 59914 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-12-12T18:17:05.246388+00:00 INFO sshd[3774]: Received disconnect from 100.115.245.201 port 59914:11: disconnected by user 2017-12-12T18:17:05.246447+00:00 INFO sshd[3774]: Disconnected from user root 100.115.245.201 port 59914 2017-12-12T18:17:32.071445+00:00 NOTICE check_ethernet.hook[3831]: Recovery method "toggle_ethernet_interfaces" successful One thing I'd like to verify is the host we're using for verification. I'm okay using a remote server for initial detection, but at a minimum I think we should be trying a number of other hosts (upstream router, local devserver, an unrelated corporate device) as some information gathering before we tear down things. From the logs it looks as if we lost connectivity with the LAN so it does seem like it's likely DUT related, and we should spawn off some bugs to find out what actually happened. (ie, try sending lots of data to the kip and see if the same issue occurs).
,
Feb 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/crostestutils/+/6b10c012a9fa5272ffa2cad15b153a29652664c5 commit 6b10c012a9fa5272ffa2cad15b153a29652664c5 Author: Grant Grundler <grundler@chromium.org> Date: Fri Feb 09 07:42:19 2018 check_ethernet: try each recovery method only once Reboot has proven to be the most effective "recovery" method. 15 minutes was too long to get to reboot. Restructured the main() recovery loop with that in mind: 1) execute each method only once before rebooting. 2) drop methods which have not proven effective. 3) look for PAUSE_FILE before trying any method. 4) Stop avoiding cdc_ether driver. This is the NIC driver for Linksys USB3GIGV1 on older kernels. For (2) provision jobs logs from bvt/cq pools was used to determine which methods were effective for recovery. "Survivorship bias" was considered and in fact a primary reason some methods were dropped. I believe (4) was added to avoid bouncing 3G/4G/LTE modems. Tests that depend on LTE and bring down the link to lab network will need to use the open(PAUSE_ETHERNET_HOOK_FILE)/flock() also used by sys_power.py to disable these checks. BUG= chromium:758781 TEST=copy script to /tmp on DUT $ stop recover_duts Disconnect LAN cable. Then: $ /tmp/check_ethernet.hook And verify the DUT tries each method and then reboots. Change-Id: I296d5565d264442a9033d6416a8262b1865961af Reviewed-on: https://chromium-review.googlesource.com/611391 Commit-Ready: Grant Grundler <grundler@chromium.org> Tested-by: Grant Grundler <grundler@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Ben Chan <benchan@chromium.org> Reviewed-by: Grant Grundler <grundler@chromium.org> [modify] https://crrev.com/6b10c012a9fa5272ffa2cad15b153a29652664c5/recover_duts/hooks/check_ethernet.hook
,
Feb 12 2018
Wow. What a long journey. So far the sky hasn't fallen. Please message me directly if this becomes necessary (but don't hesitate to revert - just do it). |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by pprabhu@chromium.org
, Oct 24 2017Labels: -Pri-2 Pri-1
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Summary: Make check_ethernet.hook reboot aggressively in known failure modes (was: check_ethernet.hook should grab more failure data)