New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 922232 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 918190
issue 904025



Sign in to add a comment

StP2: Enable iwlwifi remove / rescan workaround more widely

Project Member Reported by briannorris@chromium.org, Jan 15

Issue description

We have a long history of issues with Intel's Stone Peak 2 WiFi on a variety of platforms. There are a variety of potential or proven root causes, but suffice to say, the overriding behavior is that the PCIe device becomes essentially non-responsive. Our understanding of these issues is that the PCI device loses its state (and so, e.g., its BARs are reset) due to a power glitch, and memory-mapped accesses stop doing anything useful (e.g., reads return 0xffffffff).

Some sample bugs:

https://issuetracker.google.com/35648315 (Cyan)
https://issuetracker.google.com/35648315 (Gnawty)
crbug.com/808058 (Enguarde)
crbug.com/904025 (Glimmer)
https://issuetracker.google.com/119271221 (Glimmer and others)
crbug.com/918190 (Eve)


Sample bugs from other (non-ChromeOS) platforms:

https://bugzilla.kernel.org/show_bug.cgi?id=91171
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1537183



Sample log symptoms:

Something like this, where we see all 1's on the bus, which represents a typical PCIe failure mode:

  Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)

Other modes (in more-recent iwlwifi driver drops) include, during attempts to start() the WiFi link:

  iwlwifi 0000:01:00.0: Error, can not clear persistence bit

In older driver versions, link start() failed later in the process, with:

  iwlwifi 0000:01:00.0: Could not load the [0] uCode section
  iwlwifi 0000:01:00.0: Failed to start INIT ucode: -5
  iwlwifi 0000:01:00.0: Failed to run INIT ucode: -5
  iwlwifi 0000:01:00.0: Failed to start RT ucode: -5

There are likely other failure modes as well.


Typical user experience: WiFi is non-functional and disabled, and when the user tries to toggle the WiFi switch in the UI, it turns itself back off.


Solutions:
At best, we can employ some workarounds. For one, rebooting the system can sometimes recover. But we've also implemented automated workarounds, that work roughly like this:

1. iwlwifi driver detects the above symptoms (e.g., timeout waiting for hardware access)
2. iwlwifi emits a EVENT=INACCESSIBLE event [1] and removes the WiFi PCI device from the bus
3. udev rules [2] track the events of #2, and force the PCI bus to rescan, reenumerating the device

Based on our understanding of the HW failures described above, re-initializing the PCI and WiFi driver state should be sufficient to get us back to a working state.

We have already enabled this workaround on Cyan, Candy, and Gnawty (see iwlwifi.remove_when_gone=1 in the kernel command line, and the dependency on the net-wireless/iwlwifi_rescan package) and are tracking its behavior via Chrome metrics. See these metrics:

Platform.WiFiDisapppearedFromPCI [3]
Platform.WiFiStatusAfterForcedPCIRescan

e.g.:

https://uma.googleplex.com/p/chrome/histograms?endDate=20190113&dayCount=7&histograms=Platform.WiFiDisapppearedFromPCI%2CPlatform.WiFiStatusAfterForcedPCIRescan&fixupData=true&uniqueUsers=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

It appears that the workaround is successful (defined as: shill establishes wlan0 within a reasonable timeout after the removal/rescan) in a large percentage (~90%, or higher depending on the board) of cases.

Since we're aware of this issue occurring on more boards, we'd like to extend the workaround. This bug will serve as a tracker for that work.

The primary factor we'd want to consider is: is this harmful?
Secondarily, we'd also evaluate whether it is effective.

The main risk: if the removal path in iwlwifi isn't reliable, then we could potentially hose the system (e.g., OOPS / panic). Browsing through crash stats hasn't shown anything too concerning on those platforms where the workaround is currently active.

Effectiveness: per the above, the workaround has a high success rate per that definition. Some side notes:

* Cherry-picking some field reports that include the rescan workaround: in some cases, this workaround gets triggered every few minutes. While it is temporarily successful, that doesn't seem like a great recipe for long-term stability. It's not clear if the workaround could be improved to better handle such systems, or those systems have irreparable hardware

* Some failure modes do *not* get detected properly by iwlwifi, so we don't attempt the workaround see:

  https://issuetracker.google.com/119271221
  iwlwifi 0000:01:00.0: Error, can not clear persistence bit

We may try to address these improvements separately.


---

Proposal (this is open to suggestions):
Enable the remove/rescan workaround on most/all boards that are
(a) known to potentially have this problem (it's even better if we have, e.g., crash stats) and
(b) are not subject to further design fixes (i.e., are in mass production)

Kirtika forwarded me this doc, with stats she collected at some point. I don't recall the exact methodology she used for collecting it:

https://docs.google.com/spreadsheets/d/1SOH8B3UMPyaJz_uwPBOKdVf873lP-Kb_Jd_IaUtolbI/edit#gid=0

I think one could get similar stats by search for crash signatures related to this:

  iwl_trans_pcie_grab_nic_access

For starters, I'd probably go with all Baytrail systems, as well as Eve. I'll probably follow up with either some more specifics in the comments, or else just a CL where we can hash this out.



[1] https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6da702e55181b14d48213227702de195767c8503/drivers/net/wireless/iwl7000/iwlwifi/pcie/trans.c#2004
[2] https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/ef8cee7e2221486cd891e34919a92f8db77e8f45/net-wireless/iwlwifi_rescan/files/60-iwlwifi.rules
[3] Note that this was only recently enabled on the upload server:
https://chromium-review.googlesource.com/c/1406152
 
Blocking: 904025
For the record, here's the CL for enabling the workaround on more boards:

https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1413169

It's still somewhat of a WIP.

(There are some other CLs in flight in preparation, to make that CL easier to do.)
Project Member

Comment 3 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/c9cd4cdad323651209dd549bdcd640947e6f7272

commit c9cd4cdad323651209dd549bdcd640947e6f7272
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 16 09:46:43 2019

cyan, gnawty, candy: drop iwlwifi cmdline params

These are going into /etc/modprobe.d/ instead.

For Cyan: removing the disablevmx=off line too, since that's in the
parent profile (baseboard-strago) and it was only added here because
we're overriding modify_kernel_command_line(). See chromium:873437 /
CL:1173311.

CQ-DEPEND=CL:1412451
BUG=chromium:922232
TEST=check /sys/module/iwlwifi/parameters/remove_when_gone is still 'Y'
TEST=check /proc/cmdline still has disablevmx=off on cyan

Change-Id: Ifa94f2f4925462d17bb435c370d671a10fb4b459
Reviewed-on: https://chromium-review.googlesource.com/1413162
Commit-Ready: Brian Norris <briannorris@chromium.org>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-gnawty/scripts/build_kernel_image.sh
[delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-candy/scripts/build_kernel_image.sh
[delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-cyan/scripts/build_kernel_image.sh

Project Member

Comment 4 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/5d83ad12e2487b5781f94108bd38c8c5064c5f45

commit 5d83ad12e2487b5781f94108bd38c8c5064c5f45
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 16 09:46:45 2019

cyan, gnawty, candy: utilize USE=iwlwifi_rescan

We may start deploying this rescan workaround more widely, and it's
easier to deal with a USE flag than shuffling RDEPENDs for various BSP
ebuilds.

This should have no net effect on the build.

CQ-DEPEND=CL:1413611
BUG=chromium:922232
TEST=see net-wireless/iwlwifi_rescan still installed on image

Change-Id: Ic2e7f9b5a658a14b17a74141e8bfc0cb7ea023a3
Reviewed-on: https://chromium-review.googlesource.com/1413168
Commit-Ready: Brian Norris <briannorris@chromium.org>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>

[rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/chromeos-base/chromeos-bsp-cyan/chromeos-bsp-cyan-0.0.2-r15.ebuild
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/chromeos-base/chromeos-bsp-candy/chromeos-bsp-candy-0.0.1.ebuild
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/profiles/base/make.defaults
[rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/chromeos-base/chromeos-bsp-gnawty/chromeos-bsp-gnawty-0.0.1-r18.ebuild
[rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/chromeos-base/chromeos-bsp-candy/chromeos-bsp-candy-0.0.1-r17.ebuild
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/profiles/base/make.defaults
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/profiles/base/make.defaults
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/chromeos-base/chromeos-bsp-gnawty/chromeos-bsp-gnawty-0.0.1.ebuild
[modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/chromeos-base/chromeos-bsp-cyan/chromeos-bsp-cyan-0.0.2.ebuild

Project Member

Comment 5 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/97e7cef3ee4142b862fd15700207caa2838abd55

commit 97e7cef3ee4142b862fd15700207caa2838abd55
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 16 09:46:44 2019

iwlwifi_rescan: add iwlwifi module params

For now, we set this for a few boards directly in the kernel command
line, in the relevant overlays. But this is difficult and a bit
redundant, because we also have to pull in the
net-wireless/iwlwifi_rescan dependency. Also, iwlwifi is never built-in
(it's always a module), so we can rely on modprobe.d to set up the
parameter for us.

Make this easier to manage by specifying the parameter in only one
place.

BUG=chromium:922232
TEST=check /sys/module/iwlwifi/parameters/remove_when_gone on glimmer
     after installing this file

Change-Id: Ief6b54837405e0269c10094e61796510d518c408
Reviewed-on: https://chromium-review.googlesource.com/1412451
Commit-Ready: Brian Norris <briannorris@chromium.org>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>

[rename] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/iwlwifi_rescan-0.0.1-r4.ebuild
[modify] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/iwlwifi_rescan-0.0.1.ebuild
[add] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/files/modprobe.d/iwlwifi.conf

Project Member

Comment 6 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/f863963c17ad4615c4597dffb46f3a3661517f8f

commit f863963c17ad4615c4597dffb46f3a3661517f8f
Author: Brian Norris <briannorris@chromium.org>
Date: Wed Jan 16 09:46:44 2019

target-chromium-os: add iwlwifi_rescan USE flag

Support for removing/rescanning the Intel WiFi device is provided by the
net-wireless/iwlwifi_rescan ebuild, as a workaround for some known
hardware mis-behavior. It can legitimately be enabled on any platform,
as its effects are only seen when the iwlwifi driver is present and a
known PCI ID is seen in the system. But we enable it selectively
(a) to save space
(b) to apply the workaround where we have best understood that it is
    needed (and have no chance of effecting a hardware change instead)

Currently, it's only available on 3 boards, but we plan to widen that
scope. Rather than adding new dependencies to BSP ebuilds all over the
place, I find it cleaner to control this via USE flag.

BUG=chromium:922232
TEST=precq

Change-Id: I2892059868aa656ee7fcd2b2f53465c54d45d390
Reviewed-on: https://chromium-review.googlesource.com/1413611
Commit-Ready: Brian Norris <briannorris@chromium.org>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Rajat Jain <rajatja@chromium.org>

[rename] https://crrev.com/f863963c17ad4615c4597dffb46f3a3661517f8f/virtual/target-chromium-os/target-chromium-os-1-r116.ebuild
[modify] https://crrev.com/f863963c17ad4615c4597dffb46f3a3661517f8f/virtual/target-chromium-os/target-chromium-os-1.ebuild

Sign in to add a comment