
Issue 755329 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




network_WiFi_Reset: failing reliably on Oak family - 'mmc2: error -16 whilst initialising SDIO card'

Project Member Reported by briannorris@chromium.org, Aug 14 2017

Issue description

The new network_WiFi_Reset test ensures that the mwifiex driver can reset the Wifi card and then bring up the interface again and ping a router successfully. This fails 100% reliably on Oak family boards, it seems. See the GoldenEye health dashboard:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?testName=network_WiFi_Reset&suite=wifi_matfunc&daysBack=30&board=&architecture=&boardFamily=&buildConfig=&reason=&version=&milestone=&dut=

I'm not sure if this is a Wifi driver problem, or if the MTK SDIO driver itself is behaving weirdly on this platform.

Attaching longer syslog snippets from one failure:

2017-08-14T08:28:14.030541-07:00 INFO kernel: [11179.578854] mtk-msdc 11260000.mmc: phase: [map:8bffffff] [maxlen:26] [final:8]
2017-08-14T08:28:14.030568-07:00 INFO kernel: [11179.580655] mtk-msdc 11260000.mmc: phase: [map:fffcffff] [maxlen:16] [final:5]
2017-08-14T08:28:14.034503-07:00 INFO kernel: [11179.582537] mtk-msdc 11260000.mmc: phase: [map:819fffff] [maxlen:21] [final:7]
2017-08-14T08:28:14.034521-07:00 ERR kernel: [11179.582580] mmc2: error -16 whilst initialising SDIO card
2017-08-14T08:28:14.066293-07:00 INFO sshd[24474]: Accepted publickey for root from 127.0.0.1 port 39867 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-08-14T08:28:14.130575-07:00 INFO kernel: [11179.680380] mtk-msdc 11260000.mmc: phase: [map:8bffffff] [maxlen:26] [final:8]
2017-08-14T08:28:14.134565-07:00 INFO kernel: [11179.682277] mtk-msdc 11260000.mmc: phase: [map:ffffffff] [maxlen:32] [final:10]
2017-08-14T08:28:14.134596-07:00 INFO kernel: [11179.684032] mtk-msdc 11260000.mmc: phase: [map:819fffff] [maxlen:21] [final:7]
2017-08-14T08:28:14.134606-07:00 ERR kernel: [11179.684074] mmc2: error -16 whilst initialising SDIO card
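
As a reading aid for the log above: error -16 is -EBUSY on Linux, and in these lines the `maxlen` value appears to be the length of the longest run of set bits in `map`. That second point is inferred from the log lines themselves, not from the mtk-msdc source, so treat it as an assumption. A minimal sketch that checks it (the helper names `longest_run`, `check_phase_line`, and `errno_name` are made up for illustration):

```python
import errno
import re

# Matches the tuning record printed by the driver, e.g.
# "phase: [map:8bffffff] [maxlen:26] [final:8]"
PHASE_RE = re.compile(r"\[map:([0-9a-f]+)\] \[maxlen:(\d+)\] \[final:(\d+)\]")


def longest_run(bitmap: int, width: int = 32) -> int:
    """Length of the longest run of consecutive 1 bits (no wrap-around)."""
    best = run = 0
    for bit in range(width):
        if (bitmap >> bit) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def check_phase_line(line: str) -> bool:
    """True if the logged maxlen equals the longest run of set bits in map.

    Assumption: this relationship is inferred from the log, not the driver.
    """
    m = PHASE_RE.search(line)
    if m is None:
        raise ValueError("no phase record in line")
    bitmap, maxlen = int(m.group(1), 16), int(m.group(2))
    return longest_run(bitmap) == maxlen


def errno_name(code: int) -> str:
    """Symbolic name for a kernel-style negative error code."""
    return errno.errorcode[abs(code)]
```

Running `check_phase_line()` over the `phase:` lines in a syslog is a quick way to see whether the tuning windows look the same before and after a failed reset.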

 
Attachment: syslog-trimmed.txt (9.3 KB)
Owner: wnhuang@chromium.org
Status: Assigned (was: Untriaged)
Looks like a regression to me, we used to be able to pass the entire wifi_matfunc.

It's a new test.
Cc: kirtika@chromium.org harpreet@chromium.org dsunk...@chromium.org briannorris@google.com
 Issue 788969  has been merged into this issue.

Comment 4 by tbroch@chromium.org, Mar 24 2018

Owner: ----
Status: Available (was: Assigned)
Owner: briannorris@chromium.org
Status: Assigned (was: Available)
If I get cycles someday, maybe I'll look at this. But anyone is free to.
Cc: djkurtz@chromium.org briannorris@chromium.org
 Issue 825945  has been merged into this issue.
Cc: diand...@chromium.org
I poked at this a little today on my Elm PVT, and interestingly, it seems like we always fail after the 6th reset. After a failure, I can unbind the entire SDIO controller:

echo 11260000.mmc > /sys/bus/platform/drivers/mtk-msdc/unbind
echo 11260000.mmc > /sys/bus/platform/drivers/mtk-msdc/bind

to recover the device.
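
For scripting that recovery (for example from a test harness), here is a rough Python equivalent of the two echo commands above. It is only a sketch: the `rebind` helper and the `MTK_MSDC_DRIVER` constant are names invented here, the device and driver path are the ones from the commands above, and it has to run as root on the DUT.

```python
from pathlib import Path

# Driver directory taken from the echo commands above.
MTK_MSDC_DRIVER = "/sys/bus/platform/drivers/mtk-msdc"


def rebind(device: str, driver_dir: str = MTK_MSDC_DRIVER) -> None:
    """Unbind and then re-bind a platform device from its driver via sysfs.

    Hypothetical helper: it writes the device name to the driver's
    'unbind' file and then to its 'bind' file, like the two echo commands.
    """
    driver = Path(driver_dir)
    (driver / "unbind").write_text(device + "\n")
    (driver / "bind").write_text(device + "\n")
```

Usage on the device would be `rebind("11260000.mmc")`. The `driver_dir` parameter exists only so the helper can be pointed at a scratch directory when trying it off-device.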

I tried increasing the delay between MMC detach/re-attach in the mwifiex driver, but that had no noticeable effect.

I looked into the schematics a bit, and it looks like the SDIO 3.3V rail is fixed (always on), while the 1.8V rails are (partially?) covered by VGP3_PMU.

It seems to my somewhat inexpert eye here that we probably aren't completely resetting the Wifi module, then, and so it doesn't necessarily come up correctly after a few power-cycle attempts.

Does that make sense to anyone who worked on this platform? Is there any chance that could be helped? Do we know what the difference is between mmc_remove_host()/mmc_add_host() and unbinding/binding the entire MMC driver?

Related: what's the chance that a user manages to crash their Wifi firmware 6 times in a row?
Cc: -briannorris@google.com
Cc: minghsiu...@mediatek.com yong....@mediatek.com oak-mtk@chromium.org
Cc: marvell-wifi@chromium.org
Cc: jwp@chromium.org
Jerry, can you comment on the hardware aspects of Comment #7?
BTW: any chance that the new version of this function in later kernels somehow makes it more reliable?  A quick check in the 3.18 tree (it's wireless, not wireless-4.2 right?) shows that mwifiex_sdio_card_reset_work() does:

        mmc_remove_host(target);
        /* 200ms delay is based on experiment with sdhci controller */
        mdelay(200);
        target->rescan_entered = 0; /* rescan non-removable cards */
        mmc_add_host(target);

...but this is totally changed in, for instance, the 4.14 tree:

        mwifiex_shutdown_sw(adapter);

        /* power cycle the adapter */
        sdio_claim_host(func);
        mmc_hw_reset(func->card->host);
        sdio_release_host(func);

        /* Previous save_adapter won't be valid after this. We will cancel
         * pending work requests.
         */
        clear_bit(MWIFIEX_IFACE_WORK_DEVICE_DUMP, &card->work_flags);
        clear_bit(MWIFIEX_IFACE_WORK_CARD_RESET, &card->work_flags);

        ret = mwifiex_reinit_sw(adapter);

Comment 14 by jwp@google.com, Mar 27 2018

Brian, when the mwifiex driver resets the Wifi card, can you confirm that the AUD_DAT_MOSI (GPIO) from AP pin AJ37 to PDn pin of Wifi module is being asserted high? (It gets inverted before it reaches the module.) If so, can you tell how long it goes high?

Do any of the power rails go low during reset? (I'd expect 1.8V and 3.3V to remain valid. There are two 3.3V rails; one for the SD interface and one for the rest of the module.) If none of the rails go low, it's probably not rail sequencing.
@Doug:
> BTW: any chance that the new version of this function in later kernels somehow makes it more reliable?

Possibly. I believe the main difference would be whether mmc_hw_reset() is somehow better than mmc_{remove,add}_host(). The rest of the logic was supposed to be more or less the same (from a HW standpoint) IIUC.

I could possibly try 4.4, since I ported most of that reset stuff there, though I never really got a chance to retest SDIO thoroughly with it.

> (it's wireless, not wireless-4.2 right?)

Yes. The latter is only for gale (jetstream products).

@jwp:
> can you confirm that the AUD_DAT_MOSI (GPIO) from AP pin AJ37 to PDn pin of Wifi module is being asserted high?

I'm not super keen on trying to tear my unit apart and probe it for things like this yet...but I did realize I was looking at the Oak schematics, not the Elm ones. Not sure if that mattered.

But looking closer at the PDn pin (on Elm now), it seems like the PDn control on the AP side is actually stubbed out -- the signal is just pulled to PP3300_DX_WLAN. So I don't think this could be anything but "high"?

> Do any of the power rails go low during reset?

I believe the core code is trying to power off both vmmc and vqmmc. I'm not familiar enough with this board to know if that maps to anything useful on this board. The device tree says the former is 'pio 85' and the latter is 'ldo_vgp3' (on the mt6397 PMIC).

Comment 16 by jwp@google.com, Mar 27 2018

I'm looking at the Elm PVT schematic and the PDn pin on the Wifi module is connected to the PDN_L signal that is the inverted version of WIFI_PDN from the AP. (Don't let the asterisk on R301 fool you; it is actually a shunt, not a resistor, so it is actually connected.)

A better way to reset would be to assert the PDn pin on the module. (Is that what mmc_hw_reset() does?)
> (Don't let the asterisk on R301 fool you; it is actually a shunt, not a resistor, so it is actually connected.)

Wow really? So I really do have to throw out all knowledge of schematics every time I look at new projects...

> A better way to reset would be to assert the PDn pin on the module. (Is that what mmc_hw_reset() does?)

Now that I've actually found a MT8173 datasheet...I think GPIO85 is AUD_DAT_MOSI, which means that we should already be toggling that (in both the remove/add_host() case and the mmc_hw_reset() case). The GPIO debugfs API confirms that this is 'low' in active use cases, but goes 'hi' during reset.

(That inverter was confusing me a bit too. Seems like the AP-side signal shouldn't be called "WIFI_PDN" -- which would imply that "low" means Powered Down.)

---

Also, I tried out kernel 4.4 (where we already have the 'mmc_hw_reset()' solution), and it does seem to recover from more than 6 resets. The full test still doesn't pass end-to-end (I see other test timeouts), but I can't yet tell if that's because of other unrelated reasons (e.g., because 4.4 isn't officially supported on our Mediatek devices). So it's possible that the mmc_hw_reset() method does a better job here. I guess I need to tease apart the actual differences there.

Comment 18 by jwp@google.com, Mar 27 2018

If we're asserting PDn, then we probably don't need to cycle the power rails. If we're doing both, then the PDn should be released last.

>So I really do have to throw out all knowledge of schematics every time I look at new projects...

Asterisk still means empty/no stuff. But...

If you see a resistor that says "short" then it means the actual component got replaced with an etch pattern that shorts the pads. This is a cost savings scheme to eliminate 0 ohm resistors. The reason why it shows up as empty/no stuff is to keep it out of the bill of materials (BOM) as no component needs to be stuffed at that location.
@Jerry: Thanks for the tips. TIL. (Or, 2 days ago I learned.)

I did a simple trace of the regulator and GPIO frameworks to see what's really happening with the power sequencing, and I see:

# grep -e gpio -e 3V3 -e vcamaf /sys/kernel/debug/tracing/trace                                                           
     kworker/2:1-92    [002] ...1   248.046820: regulator_disable: name=3V3
     kworker/2:1-92    [002] ...1   248.046824: gpio_value: 462 set 1
     kworker/2:1-92    [002] ...1   248.046830: regulator_disable_complete: name=3V3
     kworker/2:1-92    [002] ...1   248.046831: regulator_disable: name=vcamaf
     kworker/2:1-92    [002] ...1   248.046838: regulator_disable_complete: name=vcamaf
     kworker/2:1-92    [002] ...1   248.256512: regulator_enable: name=3V3
     kworker/2:1-92    [002] ...1   248.256517: gpio_value: 462 set 0
     kworker/2:1-92    [002] ...1   248.256522: regulator_enable_delay: name=3V3
     kworker/2:1-92    [002] ...1   248.256524: regulator_enable_complete: name=3V3
     kworker/2:1-92    [002] ...1   248.269794: regulator_enable: name=vcamaf
     kworker/2:1-92    [002] ...1   248.269802: regulator_enable_delay: name=vcamaf
     kworker/2:1-92    [002] ...1   248.270125: regulator_enable_complete: name=vcamaf
     kworker/2:1-92    [002] ...1   248.353662: regulator_disable: name=3V3
     kworker/2:1-92    [002] ...1   248.353667: gpio_value: 462 set 1
     kworker/2:1-92    [002] ...1   248.353672: regulator_disable_complete: name=3V3
     kworker/2:1-92    [002] ...1   248.353674: regulator_disable: name=vcamaf
     kworker/2:1-92    [002] ...1   248.353680: regulator_disable_complete: name=vcamaf
    kworker/u8:2-110   [002] ...1   248.354730: regulator_enable: name=3V3
    kworker/u8:2-110   [002] ...1   248.354734: gpio_value: 462 set 0
    kworker/u8:2-110   [002] ...1   248.354739: regulator_enable_delay: name=3V3
    kworker/u8:2-110   [002] ...1   248.354740: regulator_enable_complete: name=3V3
    kworker/u8:2-110   [002] ...1   248.369804: regulator_enable: name=vcamaf
    kworker/u8:2-110   [002] ...1   248.369810: regulator_enable_delay: name=vcamaf
    kworker/u8:2-110   [002] ...1   248.370136: regulator_enable_complete: name=vcamaf

That means we're getting two toggles: off/on, then off/on again. It also isn't following the sequence that Jerry suggested. And perhaps most importantly, there's only about 1ms between the off and the on of the last toggle. That's probably not long enough?
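
The off-times can be pulled out of the trace mechanically. A small sketch, assuming the line layout shown in the trace above (the `off_durations` helper is a name made up here, not an existing tool):

```python
import re

# Matches only the plain regulator_disable / regulator_enable events;
# the *_complete and *_delay events are skipped because the event name
# must be followed directly by ": name=".
EVENT_RE = re.compile(
    r"\s(\d+\.\d+): (regulator_disable|regulator_enable): name=(\S+)"
)


def off_durations(trace_lines, name="3V3"):
    """Seconds between each regulator_disable and the next regulator_enable."""
    gaps = []
    off_at = None
    for line in trace_lines:
        m = EVENT_RE.search(line)
        if m is None or m.group(3) != name:
            continue
        timestamp, event = float(m.group(1)), m.group(2)
        if event == "regulator_disable":
            off_at = timestamp
        elif off_at is not None:
            gaps.append(timestamp - off_at)
            off_at = None
    return gaps
```

Fed the trace above, this gives roughly 210 ms for the first off/on of 3V3 (consistent with the 200ms mdelay) and roughly 1 ms for the second.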

It looks like the last off/on is because of runtime PM -- the device is briefly allowed to runtime suspend, which causes another power cycle.

I'm not sure if there's an easy way to get the device to avoid runtime suspending in there, but it's probably partly an artifact of us faking what VMMC is (PDn is not really VMMC, per my understanding). At any rate, I think the mmc_hw_reset() approach would be more reliable here. Unfortunately, it requires a lot more driver refactoring to get there...

I'll probably see if there's a simpler way to fix up the power sequencing here, or else just see if I can upgrade this part of the driver.
Labels: wifi-test-failures
