Kernel should panic if rootfs device becomes inaccessible
Issue description

If the controller or disk holding the rootfs becomes inaccessible (e.g. bug 765937), the kernel retries forever. This is undesirable for several reasons:

1) It's a bad UX; users often need to hold down the power button to get the machine back into a usable state.
2) Sometimes we luck out and get a hung task panic after a few minutes. In this case, the logs are full of retries and ramoops doesn't go far enough back to show the root cause. The root cause might be something interesting, like a failure to resume from suspend.
3) Absent a hung task panic, we never find out about the problem at all, and therefore it's not reflected in our crash stats.

I propose panicking the kernel if a fatal SATA or eMMC error is detected on the device holding the rootfs.
May 14 2018
Looking at 85424681744, console-ramoops reports errors on mmc0 (where the eMMC is located) from even before 1635.881612 until 2019.118076, while trying to suspend. Then the i2c controller (at [ 2019.219077]) and the mmc1 controller (probably the SD card) report errors when the suspend is aborted and the system tries to resume. In all cases, we cannot read the PCIe-mapped memory for these controllers; every read returns -1:

[ 1635.981744] sdhci: =========== REGISTER DUMP (mmc0)===========
[ 1635.981779] sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
[ 1635.981809] sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
[ 1635.981842] sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
[ 1635.981875] sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
...

console-ramoops is too small to show the origin of the error. The same data appears for 85421797253, 85409922036 and 85414123523.

Gwendal.
May 14 2018
Back to the original problem: the block layer does not know whether a device is the root device, and the filesystem layer does not care either. Note that in #1, from the kernel's point of view, the device never formally went away; the controller became unresponsive.

One way to address the issue would be for the block layer to send a message [udev] to user space so it can take action. However, that may not always succeed, for example if some code or script needs to be loaded from disk, or if memory has been swapped out.

The crash has to come from the kernel, so user space needs to mark the device(s) as mandatory earlier. A failure message must be propagated from a failed device to all its descendants, and if one of them has been marked as critical earlier, the kernel would trigger a panic.
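The propagation idea can be sketched as a small userspace model. This is only an illustration of the logic, not kernel code: the `Device` class, `propagate_failure`, and the device names are hypothetical, and returning the list of affected critical devices stands in for the point where the kernel would call panic().

```python
class Device:
    """Hypothetical node in a device hierarchy (controller -> disk -> partition)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.critical = False  # set early by user space, e.g. for the rootfs
        self.failed = False
        if parent is not None:
            parent.children.append(self)


def propagate_failure(device):
    """Mark device and all its descendants as failed.

    Returns the names of failed devices that were marked critical.
    A non-empty result is where a kernel implementation would panic.
    """
    critical_hit = []
    stack = [device]
    while stack:
        current = stack.pop()
        current.failed = True
        if current.critical:
            critical_hit.append(current.name)
        stack.extend(current.children)
    return critical_hit


# Illustrative hierarchy; names are made up for the example.
emmc_host = Device("mmc0")
emmc_disk = Device("mmcblk0", parent=emmc_host)
rootfs_part = Device("mmcblk0p3", parent=emmc_disk)
sd_host = Device("mmc1")
sd_disk = Device("mmcblk1", parent=sd_host)

rootfs_part.critical = True  # user space marks the rootfs partition early
```

With this marking, a failure of the SD card host affects no critical device, while a failure of the eMMC host reaches the rootfs partition and would trigger the panic.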
May 14 2018
> In all cases, we can not read the PCIe mapped memory for these controllers

I believe these are on-chip SoC devices, not PCIe devices. Although maybe it's connected through a bridge... (These platforms are having PCIe issues too, FWIW.)

> console-ramoops is too small to report the origin of the error.

Sometimes the system comes back out of suspend in this state, and is (temporarily) semi-functional as long as it can read everything it needs from the buffer cache rather than accessing the SSD. We've looked at /var/log/messages in that case, and saw thousands of lines of log spew. Some photos here: https://drive.google.com/corp/drive/u/0/folders/1rNL4hzXTRWs5vg76-Qe0PVlKw5yNM-Fz

> The crash has to come from the kernel, user space needs to mark device(s) as mandatory earlier.

Maybe one option is to write a kernel module that is called via the notifier interface, with a string representing the device name, whenever a SATA device hits "COMRESET failed" or an MMC device encounters something deemed to be a fatal error. The module can be configured by userspace to match the device name on which the rootfs resides.
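A rough userspace model of the notifier idea follows. It is a sketch only: `FatalErrorChain`, `RootfsWatcher`, and the device names are hypothetical and do not correspond to any existing kernel interface, and returning the string "panic" stands in for the module actually panicking the kernel.

```python
class FatalErrorChain:
    """Hypothetical stand-in for a notifier chain that drivers call on fatal errors."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def notify(self, device_name, reason):
        # Call every registered callback and collect what each one decided.
        return [callback(device_name, reason) for callback in self._callbacks]


class RootfsWatcher:
    """Hypothetical module: userspace tells it which device holds the rootfs."""

    def __init__(self):
        self.rootfs_device = None  # unconfigured until userspace sets it

    def configure(self, device_name):
        self.rootfs_device = device_name

    def on_fatal_error(self, device_name, reason):
        # Only a fatal error on the configured rootfs device is escalated.
        if self.rootfs_device is not None and device_name == self.rootfs_device:
            return "panic"
        return "ignore"


chain = FatalErrorChain()
watcher = RootfsWatcher()
chain.register(watcher.on_fatal_error)
```

Until userspace configures a device name, every notification is ignored; afterwards only errors on the matching device are escalated, so a failing SD card would not take the system down.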
May 15 2018
May 17 2018
For the problem at hand, I am proposing a simpler, Chrome OS-only approach: if the sdhci controller is not accessible through PCI during error recovery (register dumps return only -1), I crash the kernel with BUG_ON().
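The "register dumps return only -1" check can be modelled in userspace as follows. This is a sketch of the detection logic only, not the proposed patch: the function names are hypothetical, the values are copied from the mmc0 dump quoted earlier in this thread, and the register widths are an assumption inferred from how the dump prints them. Note that -1 shows up as 0xffff or 0xff for the narrower registers, so the comparison has to be width-aware.

```python
def reads_as_all_ones(value, width_bits):
    """True if a register of the given width read back with every bit set."""
    return value == (1 << width_bits) - 1


def controller_unreachable(register_reads):
    """register_reads: iterable of (value, width_bits) pairs.

    Returns True only when at least one register was read and every one of
    them came back as all-ones, i.e. -1 at its own width.
    """
    reads = list(register_reads)
    return bool(reads) and all(reads_as_all_ones(v, w) for v, w in reads)


# Values from the mmc0 register dump; widths are assumed from the dump format.
DUMP_MMC0 = [
    (0xFFFFFFFF, 32),  # Sys addr
    (0x0000FFFF, 16),  # Version
    (0x0000FFFF, 16),  # Blk size
    (0x0000FFFF, 16),  # Blk cnt
    (0xFFFFFFFF, 32),  # Argument
    (0x0000FFFF, 16),  # Trn mode
    (0xFFFFFFFF, 32),  # Present
    (0x000000FF, 8),   # Host ctl
]
```

In the proposal above, a True result during error recovery is the condition that would be passed to BUG_ON(). Requiring every register to read as all-ones reduces the chance of crashing on a controller that legitimately has one register at its maximum value.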
May 17 2018
I would prefer to solve the general problem if possible. AFAICT we really have no data on how often these sorts of failures are happening.
May 17 2018
I suggest that we implement the suggestion in #6, as it is straightforward and still logical (with or without the general solution) because it addresses a separate problem. Once we can come up with a generic solution, we should implement that too.
Aug 3
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
Nov 13
Comment 1 by cernekee@chromium.org
, May 14 2018