StP2: Enable iwlwifi remove / rescan workaround more widely |
|
Issue descriptionWe have a long history of issues with Intel's Stone Peak 2 WiFi on a variety of platforms. There are a variety of potential or proven root causes, but suffice to say, the overriding behavior is that the PCIe device becomes essentially non-responsive. Our understanding of these issues is that the PCI device loses its state (and so, e.g., its BARs are reset) due to a power glitch, and memory-mapped accesses stop doing anything useful (e.g., reads return 0xffffffff). Some sample bugs: https://issuetracker.google.com/35648315 (Cyan) https://issuetracker.google.com/35648315 (Gnawty) crbug.com/808058 (Enguarde) crbug.com/904025 (Glimmer) https://issuetracker.google.com/119271221 (Glimmer and others) crbug.com/918190 (Eve) Sample bugs from other (non-ChromeOS) platforms: https://bugzilla.kernel.org/show_bug.cgi?id=91171 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1537183 Sample log symptoms: Something like this, where we see all 1's on the bus, which represents a typical PCIe failure mode: Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff) Other modes (in more-recent iwlwifi driver drops) include, during attempts to start() the WiFi link: iwlwifi 0000:01:00.0: Error, can not clear persistence bit In older driver versions, link start() failed later in the process, with: iwlwifi 0000:01:00.0: Could not load the [0] uCode section iwlwifi 0000:01:00.0: Failed to start INIT ucode: -5 iwlwifi 0000:01:00.0: Failed to run INIT ucode: -5 iwlwifi 0000:01:00.0: Failed to start RT ucode: -5 There are likely other failure modes as well. Typical user experience: WiFi is non-functional and disabled, and when the user tries to toggle the WiFi switch in the UI, it turns itself back off. Solutions: At best, we can employ some workarounds. For one, rebooting the system can sometimes recover. But we've also implemented automated workarounds, that work roughly like this: 1. iwlwifi driver detects the above symptoms (e.g., timeout waiting for hardware access) 2. iwlwifi emits a EVENT=INACCESSIBLE event [1] and removes the WiFi PCI device from the bus 3. udev rules [2] track the events of #2, and force the PCI bus to rescan, reenumerating the device Based on our understanding of the HW failures described above, re-initializing the PCI and WiFi driver state should be sufficient to get us back to a working state. We have already enabled this workaround on Cyan, Candy, and Gnawty (see iwlwifi.remove_when_gone=1 in the kernel command line, and the dependency on the net-wireless/iwlwifi_rescan package) and are tracking its behavior via Chrome metrics. See these metrics: Platform.WiFiDisapppearedFromPCI [3] Platform.WiFiStatusAfterForcedPCIRescan e.g.: https://uma.googleplex.com/p/chrome/histograms?endDate=20190113&dayCount=7&histograms=Platform.WiFiDisapppearedFromPCI%2CPlatform.WiFiStatusAfterForcedPCIRescan&fixupData=true&uniqueUsers=true&showMax=true&filters=platform%2Ceq%2CC%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial It appears that the workaround is successful (defined as: shill establishes wlan0 within a reasonable timeout after the removal/rescan) in a large percentage (~90%, or higher depending on the board) of cases. Since we're aware of this issue occurring on more boards, we'd like to extend the workaround. This bug will serve as a tracker for that work. The primary factor we'd want to consider is: is this harmful? Secondarily, we'd also evaluate whether it is effective. The main risk: if the removal path in iwlwifi isn't reliable, then we could potentially hose the system (e.g., OOPS / panic). Browsing through crash stats hasn't shown anything too concerning on those platforms where the workaround is currently active. Effectiveness: per the above, the workaround has a high success rate per that definition. Some side notes: * Cherry-picking some field reports that include the rescan workaround: in some cases, this workaround gets triggered every few minutes. While it is temporarily successful, that doesn't seem like a great recipe for long-term stability. It's not clear if the workaround could be improved to better handle such systems, or those systems have irreparable hardware * Some failure modes do *not* get detected properly by iwlwifi, so we don't attempt the workaround see: https://issuetracker.google.com/119271221 iwlwifi 0000:01:00.0: Error, can not clear persistence bit We may try to address these improvements separately. --- Proposal (this is open to suggestions): Enable the remove/rescan workaround on most/all boards that are (a) known to potentially have this problem (it's even better if we have, e.g., crash stats) and (b) are not subject to further design fixes (i.e., are in mass production) Kirtika forwarded me this doc, with stats she collected at some point. I don't recall the exact methodology she used for collecting it: https://docs.google.com/spreadsheets/d/1SOH8B3UMPyaJz_uwPBOKdVf873lP-Kb_Jd_IaUtolbI/edit#gid=0 I think one could get similar stats by search for crash signatures related to this: iwl_trans_pcie_grab_nic_access For starters, I'd probably go with all Baytrail systems, as well as Eve. I'll probably follow up with either some more specifics in the comments, or else just a CL where we can hash this out. [1] https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6da702e55181b14d48213227702de195767c8503/drivers/net/wireless/iwl7000/iwlwifi/pcie/trans.c#2004 [2] https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/ef8cee7e2221486cd891e34919a92f8db77e8f45/net-wireless/iwlwifi_rescan/files/60-iwlwifi.rules [3] Note that this was only recently enabled on the upload server: https://chromium-review.googlesource.com/c/1406152
,
Jan 16
For the record, here's the CL for enabling the workaround on more boards: https://chromium-review.googlesource.com/c/chromiumos/overlays/board-overlays/+/1413169 It's still somewhat of a WIP. (There are some other CLs in flight in preparation, to make that CL easier to do.)
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/c9cd4cdad323651209dd549bdcd640947e6f7272 commit c9cd4cdad323651209dd549bdcd640947e6f7272 Author: Brian Norris <briannorris@chromium.org> Date: Wed Jan 16 09:46:43 2019 cyan, gnawty, candy: drop iwlwifi cmdline params These are going into /etc/modprobe.d/ instead. For Cyan: removing the disablevmx=off line too, since that's in the parent profile (baseboard-strago) and it was only added here because we're overriding modify_kernel_command_line(). See chromium:873437 / CL:1173311. CQ-DEPEND=CL:1412451 BUG=chromium:922232 TEST=check /sys/module/iwlwifi/parameters/remove_when_gone is still 'Y' TEST=check /proc/cmdline still has disablevmx=off on cyan Change-Id: Ifa94f2f4925462d17bb435c370d671a10fb4b459 Reviewed-on: https://chromium-review.googlesource.com/1413162 Commit-Ready: Brian Norris <briannorris@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Rajat Jain <rajatja@chromium.org> Reviewed-by: Dylan Reid <dgreid@chromium.org> [delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-gnawty/scripts/build_kernel_image.sh [delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-candy/scripts/build_kernel_image.sh [delete] https://crrev.com/a132a1631087360e5587cdc96c4298cc3f2012b1/overlay-cyan/scripts/build_kernel_image.sh
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/5d83ad12e2487b5781f94108bd38c8c5064c5f45 commit 5d83ad12e2487b5781f94108bd38c8c5064c5f45 Author: Brian Norris <briannorris@chromium.org> Date: Wed Jan 16 09:46:45 2019 cyan, gnawty, candy: utilize USE=iwlwifi_rescan We may start deploying this rescan workaround more widely, and it's easier to deal with a USE flag than shuffling RDEPENDs for various BSP ebuilds. This should have no net effect on the build. CQ-DEPEND=CL:1413611 BUG=chromium:922232 TEST=see net-wireless/iwlwifi_rescan still installed on image Change-Id: Ic2e7f9b5a658a14b17a74141e8bfc0cb7ea023a3 Reviewed-on: https://chromium-review.googlesource.com/1413168 Commit-Ready: Brian Norris <briannorris@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Rajat Jain <rajatja@chromium.org> [rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/chromeos-base/chromeos-bsp-cyan/chromeos-bsp-cyan-0.0.2-r15.ebuild [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/chromeos-base/chromeos-bsp-candy/chromeos-bsp-candy-0.0.1.ebuild [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/profiles/base/make.defaults [rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/chromeos-base/chromeos-bsp-gnawty/chromeos-bsp-gnawty-0.0.1-r18.ebuild [rename] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/chromeos-base/chromeos-bsp-candy/chromeos-bsp-candy-0.0.1-r17.ebuild [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/profiles/base/make.defaults [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-candy/profiles/base/make.defaults [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-gnawty/chromeos-base/chromeos-bsp-gnawty/chromeos-bsp-gnawty-0.0.1.ebuild [modify] https://crrev.com/5d83ad12e2487b5781f94108bd38c8c5064c5f45/overlay-cyan/chromeos-base/chromeos-bsp-cyan/chromeos-bsp-cyan-0.0.2.ebuild
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/97e7cef3ee4142b862fd15700207caa2838abd55 commit 97e7cef3ee4142b862fd15700207caa2838abd55 Author: Brian Norris <briannorris@chromium.org> Date: Wed Jan 16 09:46:44 2019 iwlwifi_rescan: add iwlwifi module params For now, we set this for a few boards directly in the kernel command line, in the relevant overlays. But this is difficult and a bit redundant, because we also have to pull in the net-wireless/iwlwifi_rescan dependency. Also, iwlwifi is never built-in (it's always a module), so we can rely on modprobe.d to set up the parameter for us. Make this easier to manage by specifying the parameter in only one place. BUG=chromium:922232 TEST=check /sys/module/iwlwifi/parameters/remove_when_gone on glimmer after installing this file Change-Id: Ief6b54837405e0269c10094e61796510d518c408 Reviewed-on: https://chromium-review.googlesource.com/1412451 Commit-Ready: Brian Norris <briannorris@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Rajat Jain <rajatja@chromium.org> [rename] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/iwlwifi_rescan-0.0.1-r4.ebuild [modify] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/iwlwifi_rescan-0.0.1.ebuild [add] https://crrev.com/97e7cef3ee4142b862fd15700207caa2838abd55/net-wireless/iwlwifi_rescan/files/modprobe.d/iwlwifi.conf
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/f863963c17ad4615c4597dffb46f3a3661517f8f commit f863963c17ad4615c4597dffb46f3a3661517f8f Author: Brian Norris <briannorris@chromium.org> Date: Wed Jan 16 09:46:44 2019 target-chromium-os: add iwlwifi_rescan USE flag Support for removing/rescanning the Intel WiFi device is provided by the net-wireless/iwlwifi_rescan ebuild, as a workaround for some known hardware mis-behavior. It can legitimately be enabled on any platform, as its effects are only seen when the iwlwifi driver is present and a known PCI ID is seen in the system. But we enable it selectively (a) to save space (b) to apply the workaround where we have best understood that it is needed (and have no chance of effecting a hardware change instead) Currently, it's only available on 3 boards, but we plan to widen that scope. Rather than adding new dependencies to BSP ebuilds all over the place, I find it cleaner to control this via USE flag. BUG=chromium:922232 TEST=precq Change-Id: I2892059868aa656ee7fcd2b2f53465c54d45d390 Reviewed-on: https://chromium-review.googlesource.com/1413611 Commit-Ready: Brian Norris <briannorris@chromium.org> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Rajat Jain <rajatja@chromium.org> [rename] https://crrev.com/f863963c17ad4615c4597dffb46f3a3661517f8f/virtual/target-chromium-os/target-chromium-os-1-r116.ebuild [modify] https://crrev.com/f863963c17ad4615c4597dffb46f3a3661517f8f/virtual/target-chromium-os/target-chromium-os-1.ebuild |
|
►
Sign in to add a comment |
|
Comment 1 by briannorris@chromium.org
, Jan 16