Should kernel hard lockups lead to panic?
Issue description

Question: $subject

Discuss.

Follow-up (if answer is yes): why don't we do this on non-ARM32 kernels?

Other info: Documentation/lockup-watchdogs.txt
"""
A 'hardlockup' is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds (see "Implementation" below for details), without letting other interrupts have a chance to run.
"""

Related question: Is there any chance of false positives with this?

---

Particularly, we see that CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 only in 32-bit rockchip configs (i.e., rk3288) -- all others use =0. That means that for non-ARM32 kernels, it's possible to get hard lockup warnings, but never get formally panicked and rebooted. Usually this would eventually yield a HW watchdog reset, if everything locks up naturally. But otherwise, I suppose it's possible we limp along with some of the system locked up?

One benefit of forcing the panic: we can trigger panic-time hooks, which can (for instance) use arch-specific debug hardware to try to dump some useful state of the CPUs before we crash.

It's highly likely we will set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 on v4.4/arm64 for the benefit of http://crosbug.com/p/58229 . The question then becomes, why don't we do this everywhere?

Assigned to me, while I work out our immediate needs, but I'll release it if someone else feels like they have a better universal answer.
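To compare what different kernel configs set for CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE, something like the following could be used. This is only a rough sketch: the function name and the sample config text are made up for illustration, and it assumes the usual `KEY=value` config file layout.

```python
import re

# Option name taken from the issue description above.
OPTION = "CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE"


def hardlockup_panic_value(config_text):
    """Return the integer the config assigns to OPTION, or None if it is absent.

    Hypothetical helper: expects plain 'KEY=value' lines, one per line.
    """
    pattern = r"^" + re.escape(OPTION) + r"=(\d+)[ \t]*$"
    match = re.search(pattern, config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Invented sample text, not a real Chrome OS config.
    sample = "CONFIG_LOCKUP_DETECTOR=y\nCONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1\n"
    print(hardlockup_panic_value(sample))
```

A result of 0 or None would match the "warning only, no panic" behavior described above for the non-ARM32 configs.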
,
Oct 12 2016
BTW, Doug gave me this cool tip for trying out (forced) hard lockups: echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
,
Oct 12 2016
IMHO we should test this and do it on as many platforms as we can. If we're in kernel mode for 10 seconds we want to reboot. We've been doing this on arm32 for years and I have no idea why we never applied it to other platforms...
,
Oct 12 2016
I thought the platform_KernelErrorPaths test does test for hardlockup too? Agreed on enabling it on as many platforms as we can. https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/platform_KernelErrorPaths/platform_KernelErrorPaths.py
,
Oct 12 2016
I forgot to note that there is a sysctl for this too; maybe we're using that on non-ARM systems?

I noticed this curiosity:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/platform_KernelErrorPaths/platform_KernelErrorPaths.py#240
---
if lkdtm == "HARDLOCKUP":
    # ARM systems do not (presently) have NMI, so skip them for now.
    arch = self.client.get_arch()
    if arch.startswith('arm'):
        logging.info("Skipping %s on architecture %s.", trigger, arch)
        return
---
Maybe that's from before whenever ARM figured out a non-NMI method for detecting lockups, and so we've arranged to use the sysctl on non-ARM systems, and then later resorted to the Kconfig for ARM only (and never fixed up the test)? </speculation>

I haven't actually tested an x86 system to see.
,
Oct 12 2016
Hmm, actually that sysctl is awfully new, and I see no uses of it in our repos. So that's probably not involved here.
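For anyone who wants to check the runtime setting on a device, here is a minimal sketch. It assumes the sysctl in question is kernel.hardlockup_panic, exposed at /proc/sys/kernel/hardlockup_panic; the thread does not name it, so treat that path as an assumption. The helper name is made up.

```python
from pathlib import Path

# Assumed location of the sysctl; not stated anywhere in this thread.
DEFAULT_SYSCTL_PATH = "/proc/sys/kernel/hardlockup_panic"


def read_hardlockup_panic(path=DEFAULT_SYSCTL_PATH):
    """Return the sysctl value as an int, or None if it is missing or unreadable.

    Hypothetical helper: older kernels may simply not have this file.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(read_hardlockup_panic())
```

A None result on an older kernel would be consistent with the sysctl being too new to be involved here.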
,
Oct 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/bc5158561b95fc1786893de3b6a5a7b901f521dc

commit bc5158561b95fc1786893de3b6a5a7b901f521dc
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Oct 11 23:46:21 2016

CHROMIUM: config: arm64: panic on hardlockup detect

We've had this set to panic-on-hardlockup for a long time for rk3288/arm32. Let's do that for rk3399/arm64 too. This helps for debugging and collecting crash information, since otherwise a hardlockup is not guaranteed to trigger the panic handlers (which we recently instrumented for rk3399 to use ARM debug infrastructure to dump info about recently-retired instruction addresses).

BUG=chromium:654931, chrome-os-partner:58229
TEST=`echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT`; see panic, instead of indefinite hang

Change-Id: I45f407539d59ff6404e47ec774078874b4c63153
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/396396
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Kevin Cernekee <cernekee@chromium.org>

[modify] https://crrev.com/bc5158561b95fc1786893de3b6a5a7b901f521dc/chromeos/config/arm64/common.config
,
Oct 12 2016
Looks like Oak/Elm is failing the platform_KernelErrorPaths.HARDLOCKUP test in wmatrix:
---
10/11 22:23:39.308 INFO |platform_KernelErr:0055| KernelErrorPaths: executing 'echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT' on chromeos2-row7-rack8-host5
10/11 22:23:39.308 DEBUG| ssh_host:0180| Running (ssh) 'sh -c "sync; sleep 1; echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT" >/dev/null 2>&1 &'
10/11 22:23:40.297 INFO | server_job:0153| START ---- reboot timestamp=1476249820 localtime=Oct 11 22:23:40
10/11 22:23:40.298 DEBUG| abstract_ssh:0621| Host chromeos2-row7-rack8-host5 pre-shutdown boot_id is e898c45d-e8c8-4085-9446-e8225b94941b
10/11 22:23:40.298 DEBUG| ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
10/11 22:23:41.266 DEBUG| base_utils:0280| [stdout] e898c45d-e8c8-4085-9446-e8225b94941b
10/11 22:23:42.269 DEBUG| ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
...
10/11 22:24:31.787 INFO | server_job:0153| ABORT ---- reboot.verify timestamp=1476249871 localtime=Oct 11 22:24:31 shut down failed
10/11 22:24:31.787 INFO | server_job:0153| END FAIL ---- reboot timestamp=1476249871 localtime=Oct 11 22:24:31 Host did not shut down
Traceback (most recent call last):
File "/usr/local/autotest/server/server_job.py", line 917, in run_op
op_func()
File "/usr/local/autotest/server/hosts/remote.py", line 218, in op_func
super(RemoteHost, self).wait_for_restart(timeout=timeout, **dargs)
File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 310, in wait_for_restart
raise error.AutoservShutdownError("Host did not shut down")
AutoservShutdownError: Host did not shut down
---
But x86 platforms (e.g., I checked falco) don't fail.
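The log above shows the test polling boot_id until it gives up with "Host did not shut down". Roughly, that kind of check can be sketched as below. This is not autotest's actual implementation: the function, the exception class, and the default timeout and interval are all invented for illustration.

```python
import time


class HostDidNotShutDown(Exception):
    """Hypothetical stand-in for the AutoservShutdownError seen in the log."""


def wait_for_shutdown(read_boot_id, old_boot_id, timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the host goes away or reports a different boot_id.

    read_boot_id is any callable returning the current boot_id string; it is
    expected to raise OSError when the host is unreachable. Raises
    HostDidNotShutDown if the same boot_id is still visible at the deadline.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            current = read_boot_id()
        except OSError:
            return  # host unreachable: it went down
        if current != old_boot_id:
            return  # new boot_id: it already rebooted
        sleep(interval)
    raise HostDidNotShutDown("Host did not shut down")
```

On a system that only warns about the hard lockup, the boot_id never changes, so a check like this would time out the way the Oak/Elm run did.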
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ah, I see the problem:
build_kernel_image.sh adds kernel cmdline 'nmi_watchdog=panic,lapic' only for x86/amd64:
---
if [[ "${FLAGS_arch}" = "x86" || "${FLAGS_arch}" = "amd64" ]]; then
# Legacy BIOS will use the kernel in the rootfs (via syslinux), as will
# standard EFI BIOS (via grub, from the EFI System Partition). Chrome OS
# BIOS will use a separate signed kernel partition, which we'll create now.
cat <<EOF >> "${FLAGS_working_dir}/config.txt"
add_efi_memmap
boot=local
noresume
noswap
i915.modeset=1
tpm_tis.force=1
tpm_tis.interrupts=0
nmi_watchdog=panic,lapic
---
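As a quick way to see whether a given kernel command line carries this setting, a sketch like the following could help. The helper name is made up, and the parsing is deliberately simple (split on whitespace, use the last nmi_watchdog= occurrence); the kernel's own parsing may differ.

```python
def nmi_watchdog_options(cmdline):
    """Return the comma-separated options of the last nmi_watchdog= parameter.

    Hypothetical helper; returns an empty list if the parameter is absent.
    """
    options = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nmi_watchdog" and sep:
            options = value.split(",")
    return options


if __name__ == "__main__":
    # Sample built from the parameters quoted above.
    sample = "boot=local noresume noswap nmi_watchdog=panic,lapic"
    print("panic" in nmi_watchdog_options(sample))
```

On a device, the input would typically come from /proc/cmdline.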
And we've carved out an exception in cros-signing/security_test_baselines/ensure_secure_kernelparams.config (i.e., these aren't required for non-x86):
---
# Common x86 parameters.
required_kparams_x86=(
add_efi_memmap
boot=local
i915.modeset=1
nmi_watchdog=panic,lapic
noresume
noswap
tpm_tis.force=1
tpm_tis.interrupts=0
)
---
Perhaps we should:
1. fix up build_kernel_image.sh to stop doing arch-specific exceptions for this (it's a generic kernel param, though it will only be supported in certain kernel configurations), then
2. drop the cros-signing and autotest exceptions
3. (optional) stop setting the Kconfig to =1 on all systems, to be consistent? I guess it doesn't hurt to be double-sure
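To illustrate what dropping the arch-specific exception in step 2 might look like, here is a rough sketch of a baseline check that applies one required-parameter list to every architecture. The function name and the shared list are hypothetical; this is not how ensure_secure_kernelparams.config is actually structured.

```python
# Hypothetical shared baseline; the entry is taken from the x86 list quoted above.
COMMON_REQUIRED_KPARAMS = ("nmi_watchdog=panic,lapic",)


def missing_kparams(cmdline, required=COMMON_REQUIRED_KPARAMS):
    """Return the required parameters that do not appear verbatim in cmdline."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


if __name__ == "__main__":
    # Invented command lines for illustration only.
    print(missing_kparams("boot=local noswap nmi_watchdog=panic,lapic"))
    print(missing_kparams("console=ttyS2 noswap"))
```

An empty result means the command line meets the baseline; any architecture whose image lacks the parameter would show up in the returned list.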
,
Oct 18 2016
Dropping priority and marking 'Available'. There's info there to tackle if anyone wants it (e.g., for Oak?). Or maybe I'll get to it "eventually."
,
Oct 19 2016
Patch for chromeos-3.18: https://chromium-review.googlesource.com/#/c/400125/
,
Oct 19 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e829a835422fbbe9aed7d2a1cb44a9a2998089ae

commit e829a835422fbbe9aed7d2a1cb44a9a2998089ae
Author: Daniel Kurtz <djkurtz@chromium.org>
Date: Wed Oct 19 12:29:23 2016

CHROMIUM: config: arm64: panic on hardlockup detect

We've had this set to panic-on-hardlockup for a long time for arm32. Let's do that for arm64 too. This helps for debugging and collecting crash information, since otherwise a hardlockup is not guaranteed to trigger the panic handlers.

Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>

From chromeos-4.4: https://chromium-review.googlesource.com/396396

BUG=chromium:654931
TEST=On elm: `echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT` => see panic instead of indefinite hang
TEST=platform_KernelErrorPaths.HARDLOCKUP => test passes

Change-Id: If994b18b40b546aeeb3ca3da20f911416a46fe1c
Reviewed-on: https://chromium-review.googlesource.com/400125
Commit-Ready: Daniel Kurtz <djkurtz@chromium.org>
Tested-by: Daniel Kurtz <djkurtz@chromium.org>
Reviewed-by: Nicolas Boichat <drinkcat@chromium.org>

[modify] https://crrev.com/e829a835422fbbe9aed7d2a1cb44a9a2998089ae/chromeos/config/arm64/common.config
,
Oct 19 2016
OK, well I guess we've fixed the relevant platforms via Kconfig. Still seems like we should work this out consistently across all systems eventually, but probably not critical.
,
Oct 20 2017
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 1
Comment 1 by briannorris@chromium.org, Oct 12 2016