
Issue 654931

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




Should kernel hard lockups lead to panic?

Project Member Reported by briannorris@chromium.org, Oct 11 2016

Issue description

Question: $subject

Discuss.

Follow-up (if answer is yes): why don't we do this on non-ARM32 kernels?

Other info: Documentation/lockup-watchdogs.txt

"""
A 'hardlockup' is defined as a bug that causes the CPU to loop in
kernel mode for more than 10 seconds (see "Implementation" below for
details), without letting other interrupts have a chance to run.
"""

Related question: Is there any chance of false positives with this?

---

In particular, CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 is set only in the 32-bit Rockchip configs (i.e., rk3288); all others use =0. That means that on non-ARM32 kernels it's possible to get hard lockup warnings but never actually panic and reboot. Usually this would eventually yield a HW watchdog reset, if everything locks up naturally. But otherwise, I suppose it's possible we limp along with some of the system locked up?
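
For anyone who wants to audit this across config fragments, here is a minimal sketch of a checker. The function name and the sample config strings are made up for illustration (they are not copied from our tree); only the option name comes from the discussion above.

```python
import re

OPTION = "CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE"


def hardlockup_panic_value(config_text, option=OPTION):
    """Return the integer assigned to `option` in .config-style text.

    Returns None when the option is never assigned. If the option appears
    more than once, the last assignment wins.
    """
    pattern = re.compile(r"^%s=(\d+)$" % re.escape(option))
    value = None
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            value = int(match.group(1))
    return value


# Illustrative fragments only (assumed, not real config files).
rk3288_like = "CONFIG_LOCKUP_DETECTOR=y\nCONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1\n"
other_like = "CONFIG_LOCKUP_DETECTOR=y\nCONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0\n"
unset_like = "# CONFIG_LOCKUP_DETECTOR is not set\n"

for name, text in [("rk3288-like", rk3288_like),
                   ("other", other_like),
                   ("unset", unset_like)]:
    print("{}: {}".format(name, hardlockup_panic_value(text)))
```

A value of 1 means a detected hard lockup panics by default; 0 or None means we would only get the warning.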

One benefit of forcing the panic: we can trigger panic-time hooks, which can (for instance) use arch-specific debug hardware to try to dump some useful state of the CPUs before we crash.

It's highly likely we will set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 on v4.4/arm64 for the benefit of http://crosbug.com/p/58229 . The question then becomes, why don't we do this everywhere?

Assigned to me, while I work out our immediate needs, but I'll release it if someone else feels like they have a better universal answer.
 
BTW, Doug gave me this cool tip for trying out (forced) hard lockups:

echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

IMHO we should test this and do it on as many platforms as we can. If we're in kernel mode for 10 seconds we want to reboot. We've been doing this on arm32 for years and I have no idea why we never applied it to other platforms...

Comment 4 by snanda@chromium.org, Oct 12 2016

I thought the platform_KernelErrorPaths test does test for hardlockup too?  Agreed on enabling it on as many platforms as we can.


https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/platform_KernelErrorPaths/platform_KernelErrorPaths.py

I forgot to note that there is a sysctl for this too; maybe we're using that on non-ARM systems?

I noticed this curiosity:

https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/platform_KernelErrorPaths/platform_KernelErrorPaths.py#240

---
        if lkdtm == "HARDLOCKUP":
            # ARM systems do not (presently) have NMI, so skip them for now.
            arch = self.client.get_arch()
            if arch.startswith('arm'):
                logging.info("Skipping %s on architecture %s.",
                             trigger, arch)
                return
---

Maybe that dates from before ARM gained a non-NMI method for detecting lockups, so we arranged to use the sysctl on non-ARM systems and only later resorted to the Kconfig for ARM (and never fixed up the test)? </speculation>

I haven't actually tested an x86 system to see.

Hmm, actually that sysctl is awfully new, and I see no uses of it in our repos. So that's probably not involved here.
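
If someone does want to check a running device, a rough sketch follows. It assumes the sysctl in question is kernel.hardlockup_panic, exposed at /proc/sys/kernel/hardlockup_panic; the thread never names it, so treat that path as an assumption. The helper name is made up.

```python
def read_hardlockup_panic(path="/proc/sys/kernel/hardlockup_panic"):
    """Return the sysctl value as an int, or None if it can't be read.

    None covers kernels that don't expose the file (too old, or no
    hard lockup detector built in) as well as unparsable contents.
    """
    try:
        with open(path) as sysctl_file:
            return int(sysctl_file.read().strip())
    except (IOError, OSError, ValueError):
        return None


if __name__ == "__main__":
    print("hardlockup_panic = {}".format(read_hardlockup_panic()))
```

The path is a parameter mainly so the helper can be pointed at a scratch file when testing off-device.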
Project Member

Comment 7 by bugdroid1@chromium.org, Oct 12 2016

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/bc5158561b95fc1786893de3b6a5a7b901f521dc

commit bc5158561b95fc1786893de3b6a5a7b901f521dc
Author: Brian Norris <briannorris@chromium.org>
Date: Tue Oct 11 23:46:21 2016

CHROMIUM: config: arm64: panic on hardlockup detect

We've had this set to panic-on-hardlockup for a long time for
rk3288/arm32. Let's do that for rk3399/arm64 too.

This helps for debugging and collecting crash information, since
otherwise a hardlockup is not guaranteed to trigger the panic handlers
(which we recently instrumented for rk3399 to use ARM debug
infrastructure to dump info about recently-retired instruction
addresses).

BUG=chromium:654931, chrome-os-partner:58229
TEST=`echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT`; see
     panic, instead of indefinite hang

Change-Id: I45f407539d59ff6404e47ec774078874b4c63153
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/396396
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Kevin Cernekee <cernekee@chromium.org>

[modify] https://crrev.com/bc5158561b95fc1786893de3b6a5a7b901f521dc/chromeos/config/arm64/common.config

Cc: djkurtz@chromium.org
Labels: Kernel-3.18
Looks like Oak/Elm is failing the platform_KernelErrorPaths.HARDLOCKUP test in wmatrix:

---
10/11 22:23:39.308 INFO |platform_KernelErr:0055| KernelErrorPaths: executing 'echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT' on chromeos2-row7-rack8-host5
10/11 22:23:39.308 DEBUG|          ssh_host:0180| Running (ssh) 'sh -c "sync; sleep 1; echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT" >/dev/null 2>&1 &'
10/11 22:23:40.297 INFO |        server_job:0153| 	START	----	reboot	timestamp=1476249820	localtime=Oct 11 22:23:40	
10/11 22:23:40.298 DEBUG|      abstract_ssh:0621| Host chromeos2-row7-rack8-host5 pre-shutdown boot_id is e898c45d-e8c8-4085-9446-e8225b94941b
10/11 22:23:40.298 DEBUG|          ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
10/11 22:23:41.266 DEBUG|        base_utils:0280| [stdout] e898c45d-e8c8-4085-9446-e8225b94941b
10/11 22:23:42.269 DEBUG|          ssh_host:0180| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'

...

10/11 22:24:31.787 INFO |        server_job:0153| 		ABORT	----	reboot.verify	timestamp=1476249871	localtime=Oct 11 22:24:31	shut down failed
10/11 22:24:31.787 INFO |        server_job:0153| 	END FAIL	----	reboot	timestamp=1476249871	localtime=Oct 11 22:24:31	Host did not shut down
  Traceback (most recent call last):
    File "/usr/local/autotest/server/server_job.py", line 917, in run_op
      op_func()
    File "/usr/local/autotest/server/hosts/remote.py", line 218, in op_func
      super(RemoteHost, self).wait_for_restart(timeout=timeout, **dargs)
    File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 310, in wait_for_restart
      raise error.AutoservShutdownError("Host did not shut down")
  AutoservShutdownError: Host did not shut down
---

But x86 platforms (e.g., I checked falco) don't fail.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ah, I see the problem:

build_kernel_image.sh adds kernel cmdline 'nmi_watchdog=panic,lapic' only for x86/amd64:
---
if [[ "${FLAGS_arch}" = "x86" || "${FLAGS_arch}" = "amd64" ]]; then
  # Legacy BIOS will use the kernel in the rootfs (via syslinux), as will
  # standard EFI BIOS (via grub, from the EFI System Partition). Chrome OS
  # BIOS will use a separate signed kernel partition, which we'll create now.
  cat <<EOF >> "${FLAGS_working_dir}/config.txt"
add_efi_memmap
boot=local
noresume
noswap
i915.modeset=1
tpm_tis.force=1
tpm_tis.interrupts=0
nmi_watchdog=panic,lapic
---

And we've carved out an exception in cros-signing/security_test_baselines/ensure_secure_kernelparams.config (i.e., these aren't required for non-x86):

---
# Common x86 parameters.
required_kparams_x86=(
    add_efi_memmap
    boot=local
    i915.modeset=1
    nmi_watchdog=panic,lapic
    noresume
    noswap
    tpm_tis.force=1
    tpm_tis.interrupts=0
)
---

Perhaps we should:

1. fix up build_kernel_image.sh so it stops making an arch-specific exception for this (it's a generic kernel param, though it will only be supported in certain kernel configurations), then
2. drop the cros-signing and autotest exceptions
3. (optional) stop setting the Kconfig to =1 on all systems, to be consistent? I guess it doesn't hurt to be double-sure
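
As a rough illustration of what an arch-independent check could look like once the exceptions are dropped, here is a small sketch. The function name, the "required everywhere" list, and the sample command lines are all hypothetical; the real check lives in the cros-signing baselines quoted above and is not reproduced here.

```python
def missing_kparams(cmdline, required):
    """Return the required kernel parameters absent from `cmdline`.

    Parameters are compared as whole whitespace-separated tokens, so
    'nmi_watchdog=panic' does not satisfy 'nmi_watchdog=panic,lapic'.
    """
    present = set(cmdline.split())
    return [param for param in required if param not in present]


# Hypothetical: the parameter from the x86 list, required on every arch.
REQUIRED_EVERYWHERE = ["nmi_watchdog=panic,lapic"]

# Made-up command lines for illustration.
x86_like = "add_efi_memmap boot=local noresume noswap nmi_watchdog=panic,lapic"
arm_like = "console=tty1 noresume noswap"

print(missing_kparams(x86_like, REQUIRED_EVERYWHERE))
print(missing_kparams(arm_like, REQUIRED_EVERYWHERE))
```

An empty list means the command line passes; anything else names the parameters that would have to be added in build_kernel_image.sh.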
Components: OS>Kernel
Labels: -Pri-1 Pri-2
Status: Available (was: Assigned)
Dropping priority and marking 'Available'. There's info there to tackle if anyone wants it (e.g., for Oak?). Or maybe I'll get to it "eventually."

Patch for chromeos-3.18:
https://chromium-review.googlesource.com/#/c/400125/
Project Member

Comment 11 by bugdroid1@chromium.org, Oct 19 2016

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e829a835422fbbe9aed7d2a1cb44a9a2998089ae

commit e829a835422fbbe9aed7d2a1cb44a9a2998089ae
Author: Daniel Kurtz <djkurtz@chromium.org>
Date: Wed Oct 19 12:29:23 2016

CHROMIUM: config: arm64: panic on hardlockup detect

We've had this set to panic-on-hardlockup for a long time for arm32.
Let's do that for arm64 too.

This helps for debugging and collecting crash information, since
otherwise a hardlockup is not guaranteed to trigger the panic handlers.

Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>

From chromeos-4.4: https://chromium-review.googlesource.com/396396

BUG=chromium:654931
TEST=On elm:
  `echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT`
 => see panic instead of indefinite hang
TEST=platform_KernelErrorPaths.HARDLOCKUP
  => test passes

Change-Id: If994b18b40b546aeeb3ca3da20f911416a46fe1c
Reviewed-on: https://chromium-review.googlesource.com/400125
Commit-Ready: Daniel Kurtz <djkurtz@chromium.org>
Tested-by: Daniel Kurtz <djkurtz@chromium.org>
Reviewed-by: Nicolas Boichat <drinkcat@chromium.org>

[modify] https://crrev.com/e829a835422fbbe9aed7d2a1cb44a9a2998089ae/chromeos/config/arm64/common.config

Labels: -Pri-2 Pri-3
OK, well I guess we've fixed the relevant platforms via Kconfig. Still seems like we should work this out consistently across all systems eventually, but probably not critical.
Project Member

Comment 13 by sheriffbot@chromium.org, Oct 20 2017

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Assigned (was: Untriaged)
