
Issue 1674


Issue metadata

Status: Fixed
Closed: Sep 26

gVisor reuses pagetables across levels without paging-structure invalidation

Project Member Reported by jannh@google.com, Sep 26

Issue description

On August 14th, I reported the following bug in gVisor. Because I
couldn't confirm at the time that this behavior was actually
exploitable, I didn't file a bugtracker bug. Now that I have a proper
reproducer, I'm retroactively filing one in our bugtracker.

This has already been fixed in commit a7a8d07d7d6b, so I'm directly
filing this without view restriction.

gVisor uses per-process page table caches. When a page table needs to
be allocated, it is taken from the per-process cache if possible. When
a page table is freed, it is always placed in the per-process cache.
Note that, for some reason, the per-process cache operates as a FIFO
rather than as a LIFO, even though a LIFO would normally be chosen to
keep recently used tables hot in the CPU caches.

When PageTables::Unmap() runs, both page table deallocations
(RuntimeAllocator::FreePTEs) and page table allocations
(RuntimeAllocator::NewPTEs) can be triggered. Deallocations are
triggered when a page table is detected to have become empty;
allocations happen when 1G or 2M hugepages have to be shattered
because the specified address range ends in the middle of a hugepage.
If the page table cache is empty on entry to PageTables::Unmap(), and
the software page walk first deallocates an L2 pagetable (because a 2M
hugepage is unmapped and the L2 pagetable referencing it thereby
becomes empty) and afterwards allocates an L1 pagetable (to shatter
another 2M hugepage), then the former L2 pagetable is reused as an L1
pagetable and filled with PTEs, even though no paging-structure cache
shootdown has been performed yet.

As the Intel SDM, volume 3A, chapter "4.10.3 Paging-Structure Caches"
(https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf#page=140)
describes, another thread that uses the same pagetables is allowed to
cache the L3 pagetable entry referencing the L2 pagetable that is
being deallocated. After the former L2 pagetable has then been
reallocated as an L1 pagetable and has been filled with PTEs, that
thread's CPU is still allowed to start page walks inside the former
L2 pagetable, at L2, resulting in a transient paging level confusion
that treats guest userspace memory as an L1 pagetable.
I don't think there is public documentation from Intel on how widely
used the paging-structure caches actually are on real hardware, and
what their structure is, but the VUSec researchers have
reverse-engineered some of it for their paper "Reverse Engineering
Hardware Page Table Caches Using Side-Channel Attacks on the MMU":
https://www.cs.vu.nl/~herbertb/download/papers/revanc_ir-cs-77.pdf .
According to this paper, decently-sized caches for L2 pagetable
entries exist across Intel, AMD and ARM; most high-end Intel
processors also have small caches for L3 entries (containing around
2-6 entries), which are (probably) required for this attack.

I tested this on a Broadwell Xeon system.

To reproduce:

 - Check out commit bbee911179aaf925f58051de4392502743539802 from
   https://github.com/google/gvisor (directly before the fix commit).
 - Apply the attached patch to allow the guest to see its CR3 value.
   This makes it easier to test.
 - Build gVisor.
 - Run an Ubuntu docker image with runsc, with gcc and similar tools installed.
 - Copy the attached tsx_thing.c into the container.
 - In the container:

============================================================
root@d8d6b324a7d6:/# gcc -o /tsx/tsx_thing /tsx/tsx_thing.c -O1 -pthread && /tsx/tsx_thing
cheated CR3: 0xc520b6e000
successfully wrote bad PTE!
runtime: newstack at runtime.printlock+0x76 sp=0xc42052f9f8 stack=[0xc4207f0000, 0xc4207f2000]
        morebuf={pc:0x4293d3 sp:0xc42052fa00 lr:0x0}
        sched={pc:0x429d66 sp:0xc42052f9f8 lr:0x0 ctxt:0x0}
runtime.throw(0xbaa9ab, 0x1a)
        GOROOT/src/runtime/panic.go:610 +0x13 fp=0xc42052fa20 sp=0xc42052fa00 pc=0x4293d3
gvisor.googlesource.com/gvisor/pkg/sentry/platform/kvm.bluepillHandler(0xc42052fac0)
        pkg/sentry/platform/kvm/bluepill_unsafe.go:150 +0x1f7 fp=0xc42052fab0 sp=0xc42052fa20 pc=0x86d0e7
gvisor.googlesource.com/gvisor/pkg/sentry/platform/kvm.sighandler(0x7, 0x0, 0xc420528000, 0x0, 0x8000, 0x0, 0x1, 0xc420040070, 0x216, 0x156000, ...)
        pkg/sentry/platform/kvm/bluepill_amd64.s:79 +0x24 fp=0xc42052fac0 sp=0xc42052fab0 pc=0x87c3d4
runtime: unexpected return pc for runtime.sigreturn called from 0x7
stack: frame={sp:0xc42052fac0, fp:0xc42052fac8} stack=[0xc4207f0000,0xc4207f2000)

runtime.sigreturn(0x0, 0xc420528000, 0x0, 0x8000, 0x0, 0x1, 0xc420040070, 0x216, 0x156000, 0x0, ...)
        bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_pure_stripped/stdlib%/src/runtime/sys_linux_amd64.s:444 fp=0xc42052fac8 sp=0xc42052fac0 pc=0x457360
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
        pkg/sentry/kernel/task_start.go:258 +0x100
fatal error: runtime: stack split at bad time

runtime stack:
runtime.throw(0xbafc4f, 0x20)
        GOROOT/src/runtime/panic.go:616 +0x81 fp=0xc42010fe48 sp=0xc42010fe28 pc=0x429441
runtime.newstack()
        GOROOT/src/runtime/stack.go:954 +0xb61 fp=0xc42010ffd8 sp=0xc42010fe48 pc=0x442841
runtime.morestack()
        bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_pure_stripped/stdlib%/src/runtime/asm_amd64.s:480 +0x89 fp=0xc42010ffe0 sp=0xc42010ffd8 pc=0x453689

goroutine 202 [syscall, locked to thread]:
runtime.throw(0xbaa9ab, 0x1a)
        GOROOT/src/runtime/panic.go:610 +0x13 fp=0xc42052fa20 sp=0xc42052fa00 pc=0x4293d3
gvisor.googlesource.com/gvisor/pkg/sentry/platform/kvm.bluepillHandler(0xc42052fac0)
        pkg/sentry/platform/kvm/bluepill_unsafe.go:150 +0x1f7 fp=0xc42052fab0 sp=0xc42052fa20 pc=0x86d0e7
gvisor.googlesource.com/gvisor/pkg/sentry/platform/kvm.sighandler(0x7, 0x0, 0xc420528000, 0x0, 0x8000, 0x0, 0x1, 0xc420040070, 0x216, 0x156000, ...)
        pkg/sentry/platform/kvm/bluepill_amd64.s:79 +0x24 fp=0xc42052fac0 sp=0xc42052fab0 pc=0x87c3d4
runtime: unexpected return pc for runtime.sigreturn called from 0x7
stack: frame={sp:0xc42052fac0, fp:0xc42052fac8} stack=[0xc4207f0000,0xc4207f2000)

runtime.sigreturn(0x0, 0xc420528000, 0x0, 0x8000, 0x0, 0x1, 0xc420040070, 0x216, 0x156000, 0x0, ...)
        bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_pure_stripped/stdlib%/src/runtime/sys_linux_amd64.s:444 fp=0xc42052fac8 sp=0xc42052fac0 pc=0x457360
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
        pkg/sentry/kernel/task_start.go:258 +0x100
[...]
============================================================

Note the "pkg/sentry/platform/kvm/bluepill_unsafe.go:150" caller of
panic() - that's `throw("physical address not valid")` in the
_KVM_EXIT_MMIO handler.
 
0001-modifications-to-simplify-paging-structure-cache-att.patch
2.7 KB Download
tsx_thing.c
4.7 KB View Download
