New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 780600 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Downgrade NETDEV WATCHDOG warning

Project Member Reported by diand...@chromium.org, Nov 1 2017

Issue description

The following kernel warning seems like it really doesn't need a full warn-on splat:

 WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
           dev->name, netdev_drivername(dev), i);

Specifically I see this splat quite often when the system is overloaded and I've always just ignored it.

Does anyone know a reason why we shouldn't downgrade this to just a "pr_warn" so we doing get a full splat?  Right now the splat is not only confusing (it makes it seem like the problem is more serious than it apparently is) but it also contributes to our crash metrics.  I'd rather the metrics track things that actually matter.


Presumably we could try to downgrade this even upstream, but I don't know if they'd take it.  If not, then I guess a CHROMIUM patch?
 
Overload is one reason, but information on the web suggests that it is often seen in association with a real network driver hang. Are you sure this is just noise ? My concern is that the overload may cause a real hang, for example if the driver has bad error recovery.

Agreed with c#1. IIRC the timeout is 4 seconds. That's a very long time for the system to stay overloaded so it's likely indicative of issues in the NIC / driver.

If we are seeing this quite often from the field, the wifi folks will probably want to take a closer look and characterize it better.
I believe I've seen this splat on USB ethernet too, not just Wifi.
Cc: grundler@chromium.org benzh@chromium.org
Agreed, I have seen this on all types of NICs.  It is often caused by a hardware issue.  So the backtrace itself might not be all that helpful (other than to identify the driver that was involved), but keeping metrics on how often this happens on a board / wifi chipset / CrOS version is helpful.

I agree with Doug that we don't want to log spurious timeouts.  So the interesting cases are ones where TX gets wedged "forever" (like until the interface is brought down + back up).  The uninteresting cases are ones where I2C, NICs, TPM, eMMC, etc. are all timing out because something else drove the system load average up to 80, and everything returns to normal once that is fixed.

If we can find examples of both cases, maybe we can teach crcrash how to differentiate them.  Or maybe we can make the net/core code smarter.
I'm trying to remember where exactly I've seen this in the past...  I filed this bug because I went to "http://go/cros-crash" and dug down to "MTBF (Kernel Crashes) by OS Version" for Kevin beta channel, which you can find with:

  https://goto.google.com/cr780600_crashes

The 2nd crash from the top is "c64a73e9-nknown-function", which is this crash (yeah, need to fix that signature).

When I looked at the warning, it tickled something in my mind that said: "oh yeah, I always ignore this"...  So now I'm trying to remember why.


I'm nearly certain that I've seen this mostly when doing suspend_stress_test and with USB Ethernet.  I don't think I dug into it too much, but presumably at suspend/resume time then a bunch of code stops running but the watchdog doesn't.

I _might_ have also seen this when doing memory stress testing too.  4 seconds of unresponsiveness, unfortunately, seems pretty common when memory is being hammered.

===

Unfortunately the logs are almost useless.  The total content of one is:

c64a73e9-nknown-function
/mnt/host/source/src/third_party/kernel/v4.4/net/sched/sch_generic.c:306
2017-11-03T00:11:05.899764+09:00 WARNING kernel: [  750.732550] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_TCPMSS ip6table_mangle veth cmac rfcomm btusb btrtl btbcm btintel uinput mwifiex_pcie uvcvideo mwifiex videobuf2_vmalloc xt_nat zram bridge stp llc bluetooth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat fuse xt_mark cfg80211 ip6table_filter joydev
2017-11-03T00:11:05.899793+09:00 WARNING kernel: [  750.732717] 
2017-11-03T00:11:05.899799+09:00 WARNING kernel: [  750.732730] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.79-11650-ge987f76b729a #1
2017-11-03T00:11:05.899806+09:00 WARNING kernel: [  750.732738] Hardware name: Google Kevin (DT)
2017-11-03T00:11:05.899811+09:00 WARNING kernel: [  750.732748] task: ffffffc0ef02b400 ti: ffffffc0ef044000 task.ti: ffffffc0ef044000
2017-11-03T00:11:05.899817+09:00 WARNING kernel: [  750.732766] PC is at dev_watchdog+0x1ec/0x2a4
2017-11-03T00:11:05.899823+09:00 WARNING kernel: [  750.732776] LR is at dev_watchdog+0x1ec/0x2a4
2017-11-03T00:11:05.899828+09:00 WARNING kernel: [  750.732783] pc : [<ffffffc000845980>] lr : [<ffffffc000845980>] pstate: 60000145
2017-11-03T00:11:05.899833+09:00 WARNING kernel: [  750.732791] sp : ffffffc0ef047b50
2017-11-03T00:11:05.899839+09:00 WARNING kernel: [  750.732798] x29: ffffffc0ef047b50 x28: ffffffc0ef047c70 
2017-11-03T00:11:05.899844+09:00 WARNING kernel: [  750.732812] x27: ffffffc0f7b438c8 x26: 00000000ffffffff 
2017-11-03T00:11:05.899849+09:00 WARNING kernel: [  750.732826] x25: 0000000000000280 x24: 0000000000000003 
2017-11-03T00:11:05.899855+09:00 WARNING kernel: [  750.732839] x23: ffffffc0bbe3b800 x22: ffffffc0bbe38420 
2017-11-03T00:11:05.899860+09:00 WARNING kernel: [  750.732853] x21: 0000000000000004 x20: ffffffc001149000 
2017-11-03T00:11:05.899866+09:00 WARNING kernel: [  750.732867] x19: ffffffc0bbe38000 x18: 0000000000000000 
2017-11-03T00:11:05.899872+09:00 WARNING kernel: [  750.732880] x17: 0000000000000000 x16: 0000000000000000 
2017-11-03T00:11:05.899877+09:00 WARNING kernel: [  750.732896] x15: 0000000000000000 x14: 0ffffffffffffffe 
2017-11-03T00:11:05.899882+09:00 WARNING kernel: [  750.732909] x13: 0000000000000010 x12: 0101010101010101 
2017-11-03T00:11:05.899888+09:00 WARNING kernel: [  750.732922] x11: ffffffffff7fff7f x10: fefefebefb1cef13 
2017-11-03T00:11:05.899893+09:00 WARNING kernel: [  750.732935] x9 : 7f7f7f7f7f7f7f7f x8 : ffffffc000444da0 
2017-11-03T00:11:05.899898+09:00 WARNING kernel: [  750.732948] x7 : ffffffc0010d01d0 x6 : 0000000000000051 
2017-11-03T00:11:05.899904+09:00 WARNING kernel: [  750.732961] x5 : 0000000000000008 x4 : 0000000000000000 
2017-11-03T00:11:05.899909+09:00 WARNING kernel: [  750.732974] x3 : 0000000000000000 x2 : cb88537fdc8ba64f 
2017-11-03T00:11:05.899914+09:00 WARNING kernel: [  750.732987] x1 : cb88537fdc8ba64f x0 : 0000000000000041 
...
... some register dumps ...
...
2017-11-03T00:11:05.900931+09:00 WARNING kernel: [  750.736977] ---[ end trace a909a413a01e8893 ]---

===

So it seems like:

1. Should fix the signature in the crash reports.

2. Should see if we can make the crash reports contain more useful info in this case.

3. Should see if s2r really hits this more often (and confirm it's not useful there) and then find some way to disable it.

4. Possibly might want to lower this to not be a WARN_ON after 4 seconds.  Maybe do a lower print after 4 seconds and a WARN_ON when it goes even longer??
> 1. Should fix the signature in the crash reports.

=>  bug #781343 : nknown => unknown
=>  bug #781384 : fix the signature

===

> 2. Should see if we can make the crash reports contain more useful info in this case.

Newer kernel warning reports also include a "log" which is that last 100 lines of dmesg.  That manages to gather a bit of info from _after_ the warning but not from before.  See <http://crosreview.com/719339>.  We can probably extend that if needed, but I guess let's start with that

Status: WontFix (was: Untriaged)
OK, so I searched through kevin crashes a bit more and it looks likely that the large number of netdev watchdog warnings there are real issues and not false positives.

I say that mostly because it's easy to find instances of mwifiex crashes on crash.corp (due to another bug, b/64074778) and each of those has a NETDEV WATCHDOG warning in them.  ...so the large number of NETDEV WATCHDOG warnings makes sense.

Specifically in those cases, it looks something like this:

<6>[ 1834.733618] NETDEV WATCHDOG: mlan0 (mwifiex_pcie): transmit queue 1 timed out
<4>[ 1834.733645] ------------[ cut here ]------------
<4>[ 1834.733650] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/net/sched/sch_generic.c:306
<4>[ 1834.733653] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_TCPMSS ip6table_mangle veth nls_iso8859_1 nls_cp437 vfat fat rfcomm btusb btrtl btbcm btintel uinput uvcvideo videobuf2_vmalloc mwifiex_pcie mwifiex bluetooth xt_nat bridge stp llc zram ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_mark fuse cfg80211 ip6table_filter joydev
<4>[ 1834.733732] 
<4>[ 1834.733738] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.4.86-11909-g72d0ea281b0b #1
<4>[ 1834.733742] Hardware name: Google Kevin (DT)
<4>[ 1834.733747] task: ffffffc0ef034e00 ti: ffffffc0ef050000 task.ti: ffffffc0ef050000
<4>[ 1834.733756] PC is at dev_watchdog+0x15c/0x16c
<4>[ 1834.733760] LR is at dev_watchdog+0x15c/0x16c
...
...
<4>[ 1834.736347] ---[ end trace 9bfe19801ed4e493 ]---
<0>[ 1834.736351] Call trace:
<4>[ 1834.736356] [<ffffffc0007e7424>] dev_watchdog+0x15c/0x16c
<4>[ 1834.736364] [<ffffffc0002851a8>] call_timer_fn+0xb4/0x194
<4>[ 1834.736369] [<ffffffc000284608>] run_timer_softirq+0x164/0x214
<4>[ 1834.736374] [<ffffffc0002008dc>] __do_softirq+0x17c/0x324
<4>[ 1834.736379] [<ffffffc0002231ac>] irq_exit+0xb4/0xd4
<4>[ 1834.736385] [<ffffffc000270c8c>] __handle_domain_irq+0x78/0xb0
<4>[ 1834.736389] [<ffffffc000200704>] gic_handle_irq+0x120/0x17c
<4>[ 1834.736392] Exception stack(0xffffffc0ef053da0 to 0xffffffc0ef053ed0)
<4>[ 1834.736397] 3da0: 0000000000000000 00000000ffffffff 0000000000000000 0000000000000030
<4>[ 1834.736401] 3dc0: 000000000000000c 0000000000000000 ffffffc0060a7ae8 0000000000000000
<4>[ 1834.736405] 3de0: 000000000009f6f1 00000000000003e8 000e8b27713ff700 0000000029aaaaab
<4>[ 1834.736408] 3e00: 0000000000000018 0000000000000002 0000000000000384 00000000a56b7100
<4>[ 1834.736413] 3e20: 0000000000000002 0000000000000000 0000000000000000 ffffffc0ee163c00
<4>[ 1834.736416] 3e40: 0000000000000000 000001ab2ce154ce 000001ab2cd75ddd 0000000000000000
<4>[ 1834.736420] 3e60: ffffffc00108c000 ffffffc00108ca50 ffffffc00106b490 ffffffc001147000
<4>[ 1834.736424] 3e80: ffffffc00108c000 ffffffc0ef053f00 ffffffc0006fe8b0 ffffffc0ef053ed0
<4>[ 1834.736428] 3ea0: ffffffc0006fe8d0 0000000080000145 0000000000000000 ffffffc0ee163c00
<4>[ 1834.736431] 3ec0: ffffffffffffffff ffffffc0006fe8b0
<4>[ 1834.736435] [<ffffffc000203700>] el1_irq+0x80/0xf8
<4>[ 1834.736441] [<ffffffc0006fea98>] cpuidle_enter+0x30/0x40
<4>[ 1834.736445] [<ffffffc00025f380>] call_cpuidle+0x44/0x74
<4>[ 1834.736449] [<ffffffc00025f188>] cpu_startup_entry+0x174/0x24c
<4>[ 1834.736454] [<ffffffc00020d2f8>] secondary_start_kernel+0x144/0x150
<4>[ 1834.736457] [<0000000000400bfc>] 0x400bfc
<6>[ 1834.736495] mwifiex_pcie 0000:01:00.0: 4296502002 : Tx timeout(#1), bss_type-num = 0-0
<6>[ 1844.733635] mwifiex_pcie 0000:01:00.0: 4296512000 : Tx timeout(#2), bss_type-num = 0-0
<6>[ 1854.733635] mwifiex_pcie 0000:01:00.0: 4296522000 : Tx timeout(#3), bss_type-num = 0-0
<6>[ 1864.733631] mwifiex_pcie 0000:01:00.0: 4296532000 : Tx timeout(#4), bss_type-num = 0-0
<6>[ 1874.733623] mwifiex_pcie 0000:01:00.0: 4296542000 : Tx timeout(#5), bss_type-num = 0-0
<6>[ 1884.733630] mwifiex_pcie 0000:01:00.0: 4296552000 : Tx timeout(#6), bss_type-num = 0-0
<6>[ 1909.725632] mwifiex_pcie 0000:01:00.0: 4296576992 : Tx timeout(#1), bss_type-num = 0-0
<6>[ 1914.733650] mwifiex_pcie 0000:01:00.0: 4296582000 : Tx timeout(#2), bss_type-num = 0-0
<6>[ 1924.733650] mwifiex_pcie 0000:01:00.0: 4296592000 : Tx timeout(#3), bss_type-num = 0-0
<6>[ 1934.733661] mwifiex_pcie 0000:01:00.0: 4296602000 : Tx timeout(#4), bss_type-num = 0-0
<6>[ 1944.733640] mwifiex_pcie 0000:01:00.0: 4296612000 : Tx timeout(#5), bss_type-num = 0-0
<6>[ 1954.733625] mwifiex_pcie 0000:01:00.0: 4296622000 : Tx timeout(#6), bss_type-num = 0-0
<6>[ 1964.733644] mwifiex_pcie 0000:01:00.0: 4296632000 : Tx timeout(#7), bss_type-num = 0-0
<6>[ 1964.733657] mwifiex_pcie 0000:01:00.0: tx_timeout_cnt exceeds threshold.	Triggering card reset!
<6>[ 1964.799266] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:00:00:00:00:04: reason code 3
<6>[ 1964.799366] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799376] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799405] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799412] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799420] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799426] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799434] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799440] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799447] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799454] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799461] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.799468] mwifiex_pcie 0000:01:00.0: deleting the crypto keys
<6>[ 1964.799511] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
<6>[ 1964.802785] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
<6>[ 1964.825608] cfg80211: World regulatory domain updated:
...
...
<6>[ 1965.043734] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 175792 bytes
<6>[ 1967.132425] mwifiex_pcie 0000:01:00.0: info: FW download over, size 620800 bytes
<6>[ 1967.838665] mwifiex_pcie 0000:01:00.0: WLAN FW is active
<6>[ 1967.936778] mwifiex_pcie 0000:01:00.0: Unknown api_id: 3
<6>[ 1967.936798] mwifiex_pcie 0000:01:00.0: Unknown api_id: 4
<6>[ 1967.936807] mwifiex_pcie 0000:01:00.0: Unknown GET_HW_SPEC TLV type: 0x217
<6>[ 1967.953240] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p133) 
<6>[ 1967.953252] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (16.68.1.p133) 
<6>[ 1967.957764] cfg80211: Regulatory domain changed to country: US
...
...
<6>[ 1968.006919] IPv6: ADDRCONF(NETDEV_UP): mlan0: link is not ready
<6>[ 1969.764524] mwifiex_pcie 0000:01:00.0: info: trying to associate to '<redacted>' bssid 00:00:00:00:00:01
<6>[ 1969.771239] mwifiex_pcie 0000:01:00.0: info: associated to bssid 00:00:00:00:00:01 successfully
<6>[ 1969.771444] IPv6: ADDRCONF(NETDEV_CHANGE): mlan0: link becomes ready

===

So it looks like we get this netdev watchdog warning every time WiFi times out.  At the moment we actually recover OK until the next s2r.  ...and once we fix b/64074778 then we'll even be OK across the next s2r.  ...but it still seems sane to keep the netdev watchdog warnings to help track this.  (it would be great to fix this too, but not sure we're going to get WiFi firmware crashes down to 0 ever--we can just minimize them and try to recover).

I'll presume that, in the real world, the number of transitory netdev watchdog errors is pretty small.  If that turns out to be wrong, we can re-open.  Otherwise, closing this bug.  

Sign in to add a comment