network_WiFi_Reset: failing reliably on Oak family - 'mmc2: error -16 whilst initialising SDIO card'
Issue description
The new network_WiFi_Reset test ensures that the mwifiex driver can reset the Wifi card, bring the interface back up, and ping a router successfully. It seems to fail 100% of the time on Oak family boards. See the GoldenEye health dashboard:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?testName=network_WiFi_Reset&suite=wifi_matfunc&daysBack=30&board=&architecture=&boardFamily=&buildConfig=&reason=&version=&milestone=&dut=
I'm not sure if this is a Wifi driver problem, or if the MTK SDIO driver itself is behaving weirdly on this platform.
Attaching longer syslog snippets from one failure:
2017-08-14T08:28:14.030541-07:00 INFO kernel: [11179.578854] mtk-msdc 11260000.mmc: phase: [map:8bffffff] [maxlen:26] [final:8]
2017-08-14T08:28:14.030568-07:00 INFO kernel: [11179.580655] mtk-msdc 11260000.mmc: phase: [map:fffcffff] [maxlen:16] [final:5]
2017-08-14T08:28:14.034503-07:00 INFO kernel: [11179.582537] mtk-msdc 11260000.mmc: phase: [map:819fffff] [maxlen:21] [final:7]
2017-08-14T08:28:14.034521-07:00 ERR kernel: [11179.582580] mmc2: error -16 whilst initialising SDIO card
2017-08-14T08:28:14.066293-07:00 INFO sshd[24474]: Accepted publickey for root from 127.0.0.1 port 39867 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-08-14T08:28:14.130575-07:00 INFO kernel: [11179.680380] mtk-msdc 11260000.mmc: phase: [map:8bffffff] [maxlen:26] [final:8]
2017-08-14T08:28:14.134565-07:00 INFO kernel: [11179.682277] mtk-msdc 11260000.mmc: phase: [map:ffffffff] [maxlen:32] [final:10]
2017-08-14T08:28:14.134596-07:00 INFO kernel: [11179.684032] mtk-msdc 11260000.mmc: phase: [map:819fffff] [maxlen:21] [final:7]
2017-08-14T08:28:14.134606-07:00 ERR kernel: [11179.684074] mmc2: error -16 whilst initialising SDIO card
,
Aug 15 2017
It's a new test.
,
Nov 28 2017
Issue 788969 has been merged into this issue.
,
Mar 24 2018
,
Mar 24 2018
If I get cycles someday, maybe I'll look at this. But anyone is free to.
,
Mar 26 2018
,
Mar 26 2018
I poked at this a little today on my Elm PVT, and interestingly, it seems like we always fail after the 6th reset. After a failure, I can unbind and rebind the entire SDIO controller to recover the device:
echo 11260000.mmc > /sys/bus/platform/drivers/mtk-msdc/unbind
echo 11260000.mmc > /sys/bus/platform/drivers/mtk-msdc/bind
I tried increasing the delay between MMC detach/re-attach in the mwifiex driver, but that had no noticeable effect.
I looked into the schematics a bit, and it looks like the SDIO 3.3V rail is fixed (always on), while the 1.8V rails are (partially?) covered by VGP3_PMU. It seems to my somewhat inexpert eye that we probably aren't completely resetting the Wifi module, then, and so it doesn't necessarily come up correctly after a few power-cycle attempts.
Does that make sense to anyone who worked on this platform? Is there any chance that could be helped? Do we know what's the difference between mmc_remove_host()/mmc_add_host() and unbinding/binding the entire MMC driver?
Related: what's the chance that a user manages to crash their Wifi firmware 6 times in a row?
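[Editor's sketch] The unbind/bind recovery described in the comment above could be wrapped in a small helper. This is only an illustration, not something from the thread: the function name rebind_platform_device and its optional second argument (the driver directory, overridable so the helper can be tried against a scratch directory) are assumptions. The device name and the default sysfs path are the ones quoted in the comment.

```shell
# Sketch: detach a platform device from its driver and re-probe it via sysfs.
# $1 = device name (e.g. 11260000.mmc)
# $2 = driver directory (hypothetical override; defaults to the mtk-msdc path
#      quoted in the comment above)
rebind_platform_device() {
    dev="$1"
    driver_dir="${2:-/sys/bus/platform/drivers/mtk-msdc}"

    # Writing the device name to 'unbind' detaches the driver from the device.
    printf '%s\n' "$dev" > "$driver_dir/unbind" || return 1

    # Writing it to 'bind' asks the driver to probe the device again.
    printf '%s\n' "$dev" > "$driver_dir/bind" || return 1
}
```

On a device this would be run as root, for example: rebind_platform_device 11260000.mmc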
,
Mar 26 2018
BTW, this test is really a depressing sea of red [1]. We have 0 Marvell-based devices that can reliably pass this test. We just have to cross our fingers and hope that their firmware isn't too crashy...
[1] https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?testName=network_WiFi_Reset&suite=wifi_matfunc&daysBack=30&board=&architecture=&boardFamily=&buildConfig=&reason=&version=&milestone=&dut=
https://stainless.corp.google.com/search?view=matrix&row=model&col=build&first_date=2018-02-27&last_date=2018-03-26&test=%5Enetwork%5C_WiFi%5C_Reset%24&status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Mar 26 2018
,
Mar 27 2018
,
Mar 27 2018
,
Mar 27 2018
Jerry, can you comment on the hardware aspects of Comment #7?
,
Mar 27 2018
BTW: any chance that the new version of this function in later kernels somehow makes it more reliable? A quick check in the 3.18 tree (it's wireless, not wireless-4.2 right?) shows that mwifiex_sdio_card_reset_work() does:
mmc_remove_host(target);
/* 200ms delay is based on experiment with sdhci controller */
mdelay(200);
target->rescan_entered = 0; /* rescan non-removable cards */
mmc_add_host(target);
...but this is totally changed in, for instance, the 4.14 tree:
mwifiex_shutdown_sw(adapter);
/* power cycle the adapter */
sdio_claim_host(func);
mmc_hw_reset(func->card->host);
sdio_release_host(func);
/* Previous save_adapter won't be valid after this. We will cancel
* pending work requests.
*/
clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);
ret = mwifiex_reinit_sw(adapter);
,
Mar 27 2018
Brian, when the mwifiex driver resets the Wifi card, can you confirm that the AUD_DAT_MOSI (GPIO) from AP pin AJ37 to PDn pin of Wifi module is being asserted high? (It gets inverted before it reaches the module.) If so, can you tell how long it goes high? Do any of the power rails go low during reset? (I'd expect 1.8V and 3.3V to remain valid. There are two 3.3V rails; one for the SD interface and one for the rest of the module.) If none of the rails go low, it's probably not rail sequencing.
,
Mar 27 2018
@Doug:
> BTW: any chance that the new version of this function in later kernels somehow makes it more reliable?
Possibly. I believe the main difference would be whether mmc_hw_reset() is somehow better than mmc_{remove,add}_host(). The rest of the logic was supposed to be more or less the same (from a HW standpoint) IIUC.
I could possibly try 4.4, since I ported most of that reset stuff there, though I never really got a chance to retest SDIO thoroughly with it.
> (it's wireless, not wireless-4.2 right?)
Yes. The latter is only for gale (jetstream products).
@jwp:
> can you confirm that the AUD_DAT_MOSI (GPIO) from AP pin AJ37 to PDn pin of Wifi module is being asserted high?
I'm not super keen on trying to tear my unit apart and probe it for things like this yet...but I did realize I was looking at the Oak schematics, not the Elm ones. Not sure if that mattered.
But looking closer at the PDn pin (on Elm now), it seems like the PDn control on the AP side is actually stubbed out -- the signal is just pulled to PP3300_DX_WLAN. So I don't think this could be anything but "high"?
> Do any of the power rails go low during reset?
I believe the core code is trying to power off both vmmc and vqmmc. I'm not familiar enough with this board to know if that maps to anything useful on this board. The device tree says the former is 'pio 85' and the latter is 'ldo_vgp3' (on the mt6397 PMIC).
,
Mar 27 2018
I'm looking at the Elm PVT schematic and the PDn pin on the Wifi module is connected to the PDN_L signal that is the inverted version of WIFI_PDN from the AP. (Don't let the asterisk on R301 fool you; it is actually a shunt, not a resistor, so it is actually connected.) A better way to reset would be to assert the PDn pin on the module. (Is that what mmc_hw_reset() does?)
,
Mar 27 2018
> (Don't let the asterisk on R301 fool you; it is actually a shunt, not a resistor, so it is actually connected.)
Wow really? So I really do have to throw out all knowledge of schematics every time I look at new projects...
> A better way to reset would be to assert the PDn pin on the module. (Is that what mmc_hw_reset() does?)
Now that I've actually found a MT8173 datasheet...I think GPIO85 is AUD_DAT_MOSI, which means that we should already be toggling that (in both the remove/add_host() case and the mmc_hw_reset() case). The GPIO debugfs API confirms that this is 'low' in active use cases, but goes 'hi' during reset. (That inverter was confusing me a bit too. Seems like the AP-side signal shouldn't be called "WIFI_PDN" -- which would imply that "low" means Powered Down.)
---
Also, I tried out kernel 4.4 (where we already have the 'mmc_hw_reset()' solution), and it does seem to recover from more than 6 resets. The full test still doesn't pass end-to-end (I see other test timeouts), but I can't yet tell if that's because of other unrelated reasons (e.g., because 4.4 isn't officially supported on our Mediatek devices).
So it's possible that the mmc_hw_reset() method does a better job here. I guess I need to tease apart the actual differences there.
,
Mar 27 2018
If we're asserting PDn, then we probably don't need to cycle the power rails. If we're doing both, then the PDn should be released last.
> So I really do have to throw out all knowledge of schematics every time I look at new projects...
Asterisk still means empty/no stuff. But... If you see a resistor that says "short" then it means the actual component got replaced with an etch pattern that shorts the pads. This is a cost savings scheme to eliminate 0 ohm resistors. The reason why it shows up as empty/no stuff is to keep it out of the bill of materials (BOM) as no component needs to be stuffed at that location.
,
Mar 30 2018
@Jerry: Thanks for the tips. TIL. (Or, 2 days ago I learned.)
I did a simple trace of the regulator and GPIO frameworks here, to see what's really happening with the power sequencing here, and I see:
# grep -e gpio -e 3V3 -e vcamaf /sys/kernel/debug/tracing/trace
kworker/2:1-92 [002] ...1 248.046820: regulator_disable: name=3V3
kworker/2:1-92 [002] ...1 248.046824: gpio_value: 462 set 1
kworker/2:1-92 [002] ...1 248.046830: regulator_disable_complete: name=3V3
kworker/2:1-92 [002] ...1 248.046831: regulator_disable: name=vcamaf
kworker/2:1-92 [002] ...1 248.046838: regulator_disable_complete: name=vcamaf
kworker/2:1-92 [002] ...1 248.256512: regulator_enable: name=3V3
kworker/2:1-92 [002] ...1 248.256517: gpio_value: 462 set 0
kworker/2:1-92 [002] ...1 248.256522: regulator_enable_delay: name=3V3
kworker/2:1-92 [002] ...1 248.256524: regulator_enable_complete: name=3V3
kworker/2:1-92 [002] ...1 248.269794: regulator_enable: name=vcamaf
kworker/2:1-92 [002] ...1 248.269802: regulator_enable_delay: name=vcamaf
kworker/2:1-92 [002] ...1 248.270125: regulator_enable_complete: name=vcamaf
kworker/2:1-92 [002] ...1 248.353662: regulator_disable: name=3V3
kworker/2:1-92 [002] ...1 248.353667: gpio_value: 462 set 1
kworker/2:1-92 [002] ...1 248.353672: regulator_disable_complete: name=3V3
kworker/2:1-92 [002] ...1 248.353674: regulator_disable: name=vcamaf
kworker/2:1-92 [002] ...1 248.353680: regulator_disable_complete: name=vcamaf
kworker/u8:2-110 [002] ...1 248.354730: regulator_enable: name=3V3
kworker/u8:2-110 [002] ...1 248.354734: gpio_value: 462 set 0
kworker/u8:2-110 [002] ...1 248.354739: regulator_enable_delay: name=3V3
kworker/u8:2-110 [002] ...1 248.354740: regulator_enable_complete: name=3V3
kworker/u8:2-110 [002] ...1 248.369804: regulator_enable: name=vcamaf
kworker/u8:2-110 [002] ...1 248.369810: regulator_enable_delay: name=vcamaf
kworker/u8:2-110 [002] ...1 248.370136: regulator_enable_complete: name=vcamaf
That means we're getting a toggle off/on off/on. It also isn't following the sequence that Jerry suggested. And perhaps most importantly, there's only about 1ms between the last off/on toggle. That's probably not long enough?
It looks like the last off/on is because of runtime PM -- the device is briefly allowed to runtime suspend, which causes another power cycle.
I'm not sure if there's an easy way to get the device to avoid runtime suspending in there, but it's probably partly an artifact of us faking what VMMC is (PDn is not really VMMC, per my understanding). At any rate, I think the mmc_hw_reset() approach would be more reliable here. Unfortunately, it requires a lot more driver refactoring to get there...
I'll probably see if there's a simpler way to fix the power sequencing here, or else just see if I can upgrade this part of the driver.
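[Editor's sketch] The "about 1ms" figure in the comment above can be checked mechanically from the trace text. The helper below is an illustration only: the name regulator_off_gaps is made up, and it assumes the trace line layout shown in the comment (a "seconds.microseconds:" timestamp field followed by the event name and "name=<regulator>"). It prints, per regulator, the time in milliseconds between each regulator_disable_complete and the next regulator_enable.

```shell
# Sketch: report how long each regulator stayed off, in milliseconds.
# Reads trace text from the files given as arguments, or from stdin.
regulator_off_gaps() {
    awk '
    {
        ev = ""; name = ""; ts = ""
        # Find the timestamp field (e.g. "248.046830:"); the event name and
        # its first argument are the two fields after it.
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                ts = $i;       sub(/:$/, "", ts)
                ev = $(i + 1); sub(/:$/, "", ev)
                name = $(i + 2); sub(/^name=/, "", name)
                break
            }
        }
        if (ev == "regulator_disable_complete") {
            # Remember when this regulator finished turning off.
            off[name] = ts
        } else if (ev == "regulator_enable" && (name in off)) {
            # Next enable request for the same regulator: print the off time.
            printf "%s %.3f\n", name, (ts - off[name]) * 1000
            delete off[name]
        }
    }' "$@"
}
```

For example: regulator_off_gaps /sys/kernel/debug/tracing/trace. Fed the trace lines quoted above, this gives roughly 210 ms for the first 3V3 off period and roughly 1 ms for the second, which matches the observation in the comment.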
,
Oct 31
Comment 1 by wnhuang@chromium.org, Aug 15 2017
Status: Assigned (was: Untriaged)