New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 789027 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Chrome
Pri: 1
Type: Bug-Regression



Sign in to add a comment

KAISER breaks eve-kvm in dev mode

Project Member Reported by smbar...@chromium.org, Nov 28 2017

Issue description

KAISER landed in kernel 4.4 with 10112.0.0, and this was the first version to fail to boot dev mode eve-kvm.

I can boot successfully by any one of the following:
1) Removing disablevmx=off from the kernel command line
2) Setting CONFIG_KAISER=n
3) Entering verified mode with signed eve-kvm

I'm seeing the same problem on my fizz, too. I can't get any useful debug logs from either machine, though.
 
Cc: smbar...@chromium.org

Comment 2 by sonny...@google.com, Nov 28 2017

Same issue on caroline using 4.4 -- although interestingly if VMX extensions aren't enabled then we don't hit this crash

Comment 3 by groeck@chromium.org, Nov 28 2017

Interesting - I thought there was a boot time option to disable kaiser. Looks like that was dropped. jorgelo@, do you know what happened with it ?

'disablevmx' is very chromeos specific; it may interfer with kaiser operation and with KVM operation in general (after all, it is an abbreviation for "Virtual Machine Extensions". We can look into it further, but does disablevmx in conjunction with kvm really make sense ?

Comment 4 by vapier@chromium.org, Nov 28 2017

keep in mind that the options is "disablevmx=off" as in it's being used to *enable* vmx support (since our default behavior in CrOS is to disable vmx during early boot).  on eve-kvm boards, we are currently leaving vmx *enabled*.
Cc: adurbin@chromium.org
Cc: dtor@chromium.org
Shall we bring the nokaiser option in as a workaround?

The bigger problem here is that we need to find a way to make this work, we cannot disable KAISER on release builds.

Comment 8 by groeck@chromium.org, Nov 28 2017

#7: Sure, we can not disable kaiser on release builds, but having nokaiser available is a better workaround for an immediate problem than CONFIG_KAISER=n.

Comment 9 by groeck@chromium.org, Nov 28 2017

Testing in caroline:

- USE="kvmhost" is not needed to trigger the problem. Loading an image with disablevmx=off followed by a cold boot is sufficient.
- The crash happens early enough to not reach or touch pstore. After rebooting from flash drive, I see the _old_ pstore content from before enabling VMX.
- disablevmx=off actively tries to _enable_ vmx. By taking that out (ie _not_ enabling VMX with disablevmx=off) the system boots. This is without USE=kvmhost.
- With USE=kvmhost, the problem is back. Again, pstore content is the old content, and there is no log message 

Ah so this only fails on dev mode but not on verified mode? That's weird.
The kernel is located at a different physical address in dev mode:

--- eve.iomem.normal    2017-11-28 16:03:25.854647580 -0800
+++ eve.iomem.dev       2017-11-28 16:04:40.836207693 -0800
@@ -17,9 +17,9 @@
   000f0000-000fffff : PCI Bus 0000:00
     000f0000-000fffff : System ROM
 00100000-7a989fff : System RAM
-  39600000-39d0b7d0 : Kernel code
-  39d0b7d1-3a31e4ff : Kernel data
-  3a45b000-3a557fff : Kernel bss
+  06a00000-0710b7d0 : Kernel code
+  0710b7d1-0771e4ff : Kernel data
+  0785b000-07957fff : Kernel bss
 7a98a000-7affffff : reserved
   7a99b000-7aa9afff : GOOG9999:00
   7aabf000-7aac6fff : BOOT0000:00

That seems odd, I'm not sure why it is different.

The only real difference in coreboot behavior between the two modes is that the Intel option rom to enable video is not executed in normal mode.

Also in dev mode depthcharge will wipe memory regions. (as part of vboot I think, to ensure that any memory from normal mode is clear)

+Wipe memory regions:
+       [0x00000000001000, 0x000000000a0000)
+       [0x00000000100000, 0x00000002000000)
+       [0x0000000273ea60, 0x0000007a98a000)
+       [0x00000100000000, 0x0000047f000000)

It also then uses the video that was initialized in coreboot to display screens.
Hrmm. I thought KASLR does both phsyical and virtual randomizatio now. I'd expect every boot to change physical locations. Am I mis-remembering? 

Comment 13 by dlaurie@google.com, Nov 29 2017

Oh I guess it could be, I only checked one set of boots.  That would at least explain it..
Cc: -dtor@chromium.org za...@chromium.org
Cc: -smbar...@chromium.org dtor@chromium.org
Cc: smbar...@chromium.org
I just got access to some patches that implement a nokaiser kernel cmdline option that we could use as a workaround on 4.4. However, somebody with more kernel knowledge than me needs to figure out how dev mode is impacting things here.
Owner: slavamn@chromium.org
> However, somebody with more kernel knowledge than me needs to figure out how dev mode is impacting things here.

Challenge accepted.
Per binary searching the problem on KVM initialization path and discussing results with dtor@, we have a promising lead and a quick fix. Details below.

The ChromeOS-specific function in cpu_control_vmx() is trying to either disable or enable VMX extensions in the following way:

void cpu_control_vmx(int cpu)
{
...
    if (disablevmx)
        bits &= ~(FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX|
            FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX);
    else
        bits |= FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX|
            FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
    bits |= FEATURE_CONTROL_LOCKED;

    /* If nothing to do, do nothing. */
    if (msr == bits)
        return;

    /* if it's locked, there's really nothing we can do. */
    if ((msr & FEATURE_CONTROL_LOCKED) && (msr != bits)) {
        /* But only warn them if it's not what we want. */
         pr_warn("can not %s VMX on CPU%d (already locked)\n",
            disablevmx ? "disable" : "enable", cpu);
        return;
    }

    pr_info("%s VMX on cpu %d\n",
            disablevmx ? "Disabling" : "Enabling", cpu);
    ret = wrmsrl_safe(MSR_IA32_FEATURE_CONTROL, bits);
...
}

With disablevmx=off, it tries to enable VMX both in and outside of TXT mode (Linux calls it SMX). The chip in Eve does not support TXT:

localhost ~ # cat /proc/cpuinfo 
...
model name      : Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz


From https://ark.intel.com/products/95452/Intel-Core-i5-7Y54-Processor-4M-Cache-up-to-3_20-GHz:
...
IntelĀ® Trusted Execution Technology No
...

This suggests that FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX is reserved and will #GP when trying to set it. The cpu_control_vmx() is using "safe" version of wrmsrl. The way wrmsrl_safe() is implemented is through setting up a fixup secton that traps #GP and makes it return non-zero value instead of crashing the kernel.

In the post-Kaiser world kernel memory mapping is done differently. In particular, it is quite possible that kernel mapping cannot tolerate #GPs that early in the boot. That would explain the issue.

The quick fix is to tweak the code to stop setting FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX. I tried it on EVT Eve, and was able to enable KVM that way without crashing.

I still want to get to the bottom of it to ensure the theory is correct and to check whether other pieces of kernel startup may expect to be able to tolerate #GP.
The 'echo c > /proc/sysrq-trigger' on EVT Eve with test image results in expected exception output reflected in /dev/pstore/console-ramoops-0 after boot.

So at some point in boot Kaiserized kernel learns how to handle it's own faults. It just does it too late.
The Kaiser code creates a shadow PGD for a process and exception entry point is expecting one to be present. I wonder if entry code gets confused if no shadow PGD was created because no processes were born at the time of the fault.

KAISER experts on CC, mind sharing your opinion on that?
The kasierised code, whole-kernel CR3 is at 8KB boundary while user-space-plus-minimal-kernel CR3 is at 4K offset from 8K boundary. I suspect that in early boot kernel runs with CR3 that does not follow this rule. Depending on various factors (e.g. whether it is dev or release image), the initial CR3 kernel uses during boot may end up at either 8K or 8K+4K alignment. In latter case, exception entry would get confused as it'd assume exception happened in the userspace - which does not exist yet. That would crash the kernel before it gets a chance to complain to console-ramoops.
FWIW arch/x86/boot/compressed/head_64.S puts initial PGD at offset 0x1000, which is unfortunate because it is 4K away from 8K boundary, so it will confuse the kaiser exception entrypoint.
I printed out CR3 in cpu_control_vmx() and they were 8K-aligned:

    pr_info("%s VMX on cpu %d CR3 %lx\n",
            disablevmx ? "Disabling" : "Enabling", cpu, native_read_cr3());


[    0.000000] Enabling VMX on cpu 0 CR3 b410000
[    0.055461] Enabling VMX on cpu 1 CR3 b410000
[    0.064135] Enabling VMX on cpu 2 CR3 b410000
[    0.072830] Enabling VMX on cpu 3 CR3 b410000

So the theory may not be entirely correct, but I nevertheless think we are on the right path: something is kaiser-unfriendly in early kernel boot so it cannot tolerate exceptions.
Cc: keescook@chromium.org hughd@google.com
Thanks Slava! I'm adding Hugh and Kees to this thread to see if they have any thoughts.

Hugh: can you take a look starting at comment #18 and see if this makes sense to you?

Symptom: KAISER-enabled kernels break on the Pixelbook, *but only in developer mode*.

Comment 26 by hughd@google.com, Dec 15 2017

#19 sounds very plausible to me.

But #21 less so: the kernel normally handles exceptions from kernel mode, and shouldn't be expecting anything of userspace or shadow pgds at that point.

#22 is a very good idea, but I thought was satisfactorily handled by the NEXT_PGD_PAGE 2*PAGE_SIZE alignments in kernel/head_64.S.

I wouldn't expect #23's compressed/head_64.S pgd alignment to matter: shouldn't that all have been replaced by the time exceptions are handled by kaiser-knowing code?  (But I'm no expert on bootup sequence, Slava will know better.)

There are some odd places (kexec and EFI come to mind) which set up their own pgd hierarchies, which is certainly confusing for kaiser (see my bizarre WARN_ON_ONCE in kaiser_set_shadow_pgd()): there the pgd would risk misalignment.  IIRC interrupts are disabled throughout the EFI business; but of course not exceptions.  Might you have something like that in play, with its own pgds, but expected to handle exceptions?
hughd@, thank you for your comments.

Turns out, to make the problem go away it is sufficient to mark general_protection fault as paranoid (patch below). I tried it on Eve and verified that #GPs do indeed happen based on -5 return from wrmsrl_safe:

[    0.000000] Enabling VMX on cpu 0
[    0.000000] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.000000] Memory: 8055060K/8283296K available (8229K kernel code, 1220K rwdata, 3016K rodata, 1276K init, 1024K bss, 228236K reserved, 0K cma-reserved)

This isn't a production fix of course, but a good lead toward the root cause.

---------

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 8e22d8a778e7..9af9d1147883 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1045,7 +1045,7 @@ idtentry xen_int3         do_int3                 has_error_code=0
 idtentry xen_stack_segment     do_stack_segment        has_error_code=1
 #endif
 
-idtentry general_protection    do_general_protection   has_error_code=1
+idtentry general_protection    do_general_protection   has_error_code=1 paranoid=2
 trace_idtentry page_fault      do_page_fault           has_error_code=1
 
 #ifdef CONFIG_KVM_GUEST

The x86_cr3_pcid_noflush may not be initialized yet when the wrmsr_safe is executed in cpu_control_vmx(). The standard non-paranoid generic_protection unconditionally switches to kernel CR3 using this:

.macro _SWITCH_TO_KERNEL_CR3 reg
movq %cr3, \reg
andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg
orq  x86_cr3_pcid_noflush, \reg
movq \reg, %cr3
.endm


The x86_cr3_pcid_noflush produces different values for different CPUs, suggesting it wasn't initialized:

void cpu_control_vmx(int cpu)
{
...
    pr_info("%s VMX on cpu %d CR3 %lx MASK %lx\n",
            disablevmx ? "Disabling" : "Enabling", cpu, native_read_cr3(), x86_cr3_pcid_noflush);


[    0.000000] Enabling VMX on cpu 0 CR3 34210000 MASK 0
[    0.000000] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.055438] Enabling VMX on cpu 1 CR3 34210000 MASK 0
[    0.055463] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.064134] Enabling VMX on cpu 2 CR3 34210000 MASK 8000000000000000
[    0.064141] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.072837] Enabling VMX on cpu 3 CR3 34210000 MASK 8000000000000000
[    0.072843] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5


Comment 29 by hughd@google.com, Dec 16 2017

Well observed, but that should not be a problem in itself.

x86_cr3_pcid_noflush indeed starts out as 0, and or'ing it in to cr3 has 0 effect, but the cr3 switch will flush TLB unnecessarily on entering the kernel.

Later it gets sets to 0x8000000000000000, once we've established that PCIDs are supported by the cpu, so that bit in cr3 will not cause a #GP.

Or am I misunderstanding you: do you think setting that bit is causing a #GP, that the cpu claims to support it but actually does not?

Oh, could it be that I'm setting that bit before all the cpus have enabled CR4.PCID - I think that would cause #GP on them?  They wouldn't get out to userspace before CR4.PCID were set up, but could come this way, through the (exceptional!) exception path.
hugh@:

> Or am I misunderstanding you: do you think setting that bit is causing a #GP, that the cpu claims to support it but actually does not?

Shouldn't be the case. I checked /proc/cpuinfo on Eve and all 4 logical CPUs support PCID.

> Oh, could it be that I'm setting that bit before all the cpus have enabled CR4.PCID - I think that would cause #GP on them?  They wouldn't get out to userspace before CR4.PCID were set up, but could come this way, through the (exceptional!) exception path.

Yes, I was thinking about scenario when CR4.PCIDE != 0 for whatever reason and we are trying to load CR3 with high bit set. That I think should triple-fault the machine. However, I do not see how this scenario could be possible.

At this point we know the difference is (1) KAISER-related, and (2) it is something that paranoid entries do differently compared to non-paranoid entries. The most obvious candidate is unconditional switching to kernel CR3 in error_entry, but it seems to be harmless unless x86_cr3_pcid_noflush contains bogus value. I will continue running experiments (sampling CR4 in cpu_control_vmx, etc). I'd appreciate suggestions on what else I could try next too.
Correction to comment #30: should've said "CR4.PCIDE == 0"
Hmm, CR4.PCIDE is not set when faults take place:

#define X86_CR4_PCIDE_BIT   17 /* enable PCID support */
#define X86_CR4_PCIDE       _BITUL(X86_CR4_PCIDE_BIT)

    pr_info("%s VMX on cpu %d CR3 %lx CR4 %lx MASK %lx\n",
            disablevmx ? "Disabling" : "Enabling", cpu, native_read_cr3(), native_read_cr4(), x86_cr3_pcid_noflush);


[    0.000000] Enabling VMX on cpu 0 CR3 19a10000 CR4 406b0 MASK 0
[    0.000000] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.055290] Enabling VMX on cpu 1 CR3 19a10000 CR4 406a0 MASK 0
[    0.055298] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.063963] Enabling VMX on cpu 2 CR3 19a10000 CR4 406a0 MASK 8000000000000000
[    0.063969] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
[    0.072655] Enabling VMX on cpu 3 CR3 19a10000 CR4 406a0 MASK 8000000000000000
[    0.072661] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5

Eve does set CR4.PCIDE eventually, looking at console-ramoops output triggered by sysrq-trigger:

[  747.403459] CR2: 0000000000000000 CR3: 0000000235cc8000 CR4: 00000000003606e0

So this is a race between setting CR4.PCIDE, kaiser_setup_pcid() and function triggering "safe" general protection fault. I wonder if it might effect non-ChromeOS kernels as well.
The initalization of CR4.PCIDE takes this path:

cr4_set_bits(X86_CR4_PCIDE)
setup_pcid()
identify_cpu()
identify_boot_cpu() | identify_secondary_cpu()

identify_boot_cpu()
check_bugs()
start_kernel()

identify_secondary_cpu()
smp_store_cpu_info()
smp_callin()
start_secondary()
IPI

The cpu_control_vmx() is called from cpu_init(), and is therefore called before identify_cpu() gets a chance to set CR4.PCIDE.

The kaiser_init() is called from mm_init() which is called ahead of check_bugs() so here, too, identify_cpu() doesn't get a chance to initialize CR4.PCIDE in time.
On the other hand, kaiser_setup_pcid() is called from setup_pcid() after cr4_set_bits(X86_CR4_PCIDE) is called. Something does not add up.
Perhaps the reason is that x86_cr3_pcid_noflush is shared variable, so if one CPU just finished setting both CR4.PCIDE and called kaiser_setup_pcid(), but the other one didn't do either, then second one will triple-fault when trying to load CR3 based on updated x86_cr3_pcid_noflush but not not consistent with its private CR4?

The IPI latencies between boot CPU and secondary CPUs are theoretically large enough to enable this race.
Labels: -Pri-3 OS-Linux Pri-1
Owner: dgreid@chromium.org
Status: Started (was: Untriaged)
I think the right fix is to split kaiser_setup_pcid() into kaiser_setup_pcid_user() and kaiser_setup_pcid_kernel(). The kaiser_setup_pcid_user() would set x86_cr3_pcid_user only, and would be called from setup_pcid(). It is safe to do because x86_cr3_pcid_user is per-CPU.

The kaiser_setup_pcid_kernel() would be called by boot CPU only, and only after all secondary CPUs finished initializing.

I am attaching unfinished patch. It is unfinished because I haven't found a good place to call kaiser_setup_pcid_kernel(). It is obviously a performance issue, but I verified that Eve at least boots with this patch.

dgreid@, as I am going away for some time, could you please check my reasoning and if you agree finish the fix for ChromeOS?

hughd@, could you please coordinate with dgreid@ regarding fixing this issue on the server side?


kaiser.patch
2.3 KB Download
OK, just gettign spun up on this. What is the cost of setting x86_cr3_pcid_noflush later? Is it performance only?

Can it be set after all the cpus are ready, but after they may start servicing interrupts?

Comment 39 by hughd@google.com, Dec 19 2017

I'm considering this at the moment.  x86_cr3_pcid_noflush can be set later, yes. But I need to consider the sequence in 5 or more backports, and then in their 5 or more nokaisers, which use ALTERNATIVE instead of x86_cr3_pcid_noflush.

Also to be considered in this connection: Andy Lutomirski's c7ad5ad297e6 "x86/mm/64: Initialize CR4.PCIDE early", but I'll more probably choose minimal change, that can be easily applied to multiple trees - if possible.

If you have a good workaround at the Eve dev-mode end, I'd recommend that you go with that: the bug in our kaiser kernel is real, and great that you've found and analysed it; but nobody else has hit it yet - so relying on such exception handling so early in boot might be asking for trouble.

Will report back when I have good compromise patch in mind.
Thanks hughd. For now I'll merge something like slava's patch, with x86_cr3_pcid_noflush set at a time after all cpus are initialized.

We can always pull it back out and merge a future solution you come up with.

Thanks!
Project Member

Comment 41 by bugdroid1@chromium.org, Dec 22 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/2a44f056fc377e3aeabe8bba3f80b4786ee50759

commit 2a44f056fc377e3aeabe8bba3f80b4786ee50759
Author: Dylan Reid <dgreid@chromium.org>
Date: Fri Dec 22 04:33:03 2017

CHROMIUM: disablevmx: delay call to control_vmx

In the post kaiser world the kernel can't handle a potential fault from
wrmsrl_safe so early. In order to avoid this, do cpu_control_vmx later
in the cpu boot up process.

BUG= 789027 
TEST=boot eve and samus in dev mode with "disablevmx=off", start a VM.
boot eve without "disablevmx=off" check that vmx is restricted and not
in /proc/cpuinfo.

Change-Id: I6850f41a0c74e068a9dbb2b48fbdaf29f7273a3d
Signed-off-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/2a44f056fc377e3aeabe8bba3f80b4786ee50759/arch/x86/kernel/cpu/common.c

Thanks Slava, Hugh and Dylan for getting this fixed.
Project Member

Comment 43 by bugdroid1@chromium.org, Dec 23 2017

Labels: merge-merged-chromeos-4.14
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/c73fe89c08b28fba1afc90c354e17295d1dbee16

commit c73fe89c08b28fba1afc90c354e17295d1dbee16
Author: Dylan Reid <dgreid@chromium.org>
Date: Sat Dec 23 04:21:03 2017

CHROMIUM: disablevmx: delay call to control_vmx

In the post kaiser world the kernel can't handle a potential fault from
wrmsrl_safe so early. In order to avoid this, do cpu_control_vmx later
in the cpu boot up process.

BUG= 789027 
TEST=boot eve and samus in dev mode with "disablevmx=off", start a VM.
boot eve without "disablevmx=off" check that vmx is restricted and not
in /proc/cpuinfo.

Change-Id: I6850f41a0c74e068a9dbb2b48fbdaf29f7273a3d
Signed-off-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/c73fe89c08b28fba1afc90c354e17295d1dbee16/arch/x86/kernel/cpu/common.c

Status: Fixed (was: Started)
Components: OS>Systems>Containers

Sign in to add a comment