PaX: reference count overflow mitigation can be bypassed on x86 by racing

Project Member Reported by jannh@google.com, Jun 27 2016
PaX contains a mitigation for reference count overflows that is intended to prevent atomic_t variables from reaching 0x80000000 and, more importantly, wrapping around to zero. A documented special case on x86 (https://forums.grsecurity.net/viewtopic.php?f=7&t=4173#APPENDA) is that, because "atomically increment unless current value is X" can't be implemented without a cmpxchg loop, the code instead increments the counter, checks for an overflow and, if an overflow happened, immediately decrements the counter back. It is believed that this race is too narrow to hit in practice; however, in my tests, the race is usually won in a few seconds at most.

My test setup: In a VirtualBox VM, I installed Linux 4.5.7 with a config based on Debian's distro config, patched with grsecurity-3.1-4.5.7-201606262019.patch, with grsecurity configured as follows:

Configuration Method (Automatic)
Usage Type (Desktop)
Virtualization Type (Guest)
Virtualization Hardware (EPT/RVI Processor Support)
Virtualization Software (VirtualBox)
Required Priorities (Security)

I applied a kernel patch (attached as overflowme.patch) to introduce an atomic_t that starts at 0x7fffffff and can be incremented by the user. After booting the resulting kernel, I compiled (gcc -o test_refcount test_refcount.c -std=gnu99 -Wall) and ran the following test program:

======
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define CONCURRENCY 10

int efd;

void make_child(void) {
  pid_t child = fork();
  if (child == -1)
    err(1, "fork");
  if (child != 0)
    return;
  uint64_t ctr = 4242;
  for (int i=0; i<1000; i++)
    write(efd, &ctr, 8);
  exit(0);
}

unsigned int get_value() {
  char buf[4242];
  if (read(efd, buf, 4242) != 8)
    err(1, "efd read");
  return *(uint64_t*)buf;
}

int main(void) {
  setbuf(stdout, NULL);
  efd = eventfd(0, 0);
  if (efd == -1)
    err(1, "eventfd");
  printf("counter: 0x%x\n", get_value());
  for (int i=0; i<CONCURRENCY; i++)
    make_child();
  while (1) {
    int status;
    if (wait(&status) == -1)
      err(1, "wait");
    unsigned int value = get_value();
    if (value == 0x7fffffff) {
      putchar('.');
    } else if (value == 0xffffffff) {
      putchar('#');
    } else {
      printf("\nnew status: 0x%x\n", value);
      if (value > 30 && value < 0x10000000)
        return 0;
    }
    make_child();
  }
}
======

The program output looks like this (dots stand for failed race attempts):

======
counter: 0x7fffffff
....................................................
new status: 0x80000017
new status: 0x8000006f
new status: 0x8000157d
new status: 0x800015c6
[...]
======

Before the race is won, PaX will repeatedly complain about the overflowing reference counters and terminate the offending processes:

======
Message from syslogd@debian at Jun 27 16:34:29 ...
kernel:[ 5699.705439] PAX: refcount overflow detected in: test_refcount:26921, uid/euid: 1000/1000

Message from syslogd@debian at Jun 27 16:34:29 ...
kernel:[ 5699.705563] PAX: refcount overflow detected in: test_refcount:26925, uid/euid: 1000/1000

Message from syslogd@debian at Jun 27 16:34:29 ...
kernel:[ 5699.706708] PAX: refcount overflow detected in: test_refcount:26919, uid/euid: 1000/1000
[...]
======

The program then continues to increment the reference counter for a while before it reaches 0xffffffff, at which point the second overflow back to small numbers requires the same race to be won again.

Some notes:
- I haven't tested this outside VirtualBox - being in a VM might increase the probability of hitting the race.
- The VM was configured to have 6 processor cores. Because any two of the six tasks can race with each other, I think the birthday paradox applies and significantly raises the probability of success.
- The kernel was not configured with CONFIG_PREEMPT (only CONFIG_PREEMPT_VOLUNTARY), so only six tasks raced with each other at a time; on a system with CONFIG_PREEMPT, forcibly preempted tasks might, depending on where the race occurs, also be able to participate in races.

For reference counters, a nice way to fix this would be to just replace the "jno" instructions in the x86-specific assembly code with "jns". This would preserve the property that atomic_t can overstep the 0x7fffffff threshold for a short moment, but would get rid of the race permitting permanent increments beyond that threshold. This fix would have the advantage that the code size doesn't change.

Sadly, since some users of atomic_t (such as i_writecount) aren't actually reference counters and can legitimately have negative values, this doesn't work directly and would require changing those users of atomic_t to atomic_unchecked_t or so. If removing the refcounting protection from fields that can take negative values is undesirable, it might be necessary to increase the code size and e.g. perform an addition with an immediate value prior to checking the overflow flag.

This bug is subject to a 90 day disclosure deadline. If 90 days elapse without a broadly available patch, then the bug report will automatically become visible to the public.
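To make the inc/check/dec-back pattern and the suggested "jns" change easier to follow, here is a minimal userspace sketch of both variants. This is illustrative only, not PaX's actual kernel code; fake_atomic_t and the function names are made up for this example.

======
#include <stdio.h>

/* stand-in for atomic_t's ->counter field */
typedef struct { int counter; } fake_atomic_t;

/* "jno" variant: the pattern described above - increment first, then undo
 * the increment if the overflow flag was set. Between the incl and the
 * decl, the already-overflowed value is visible to other CPUs, and a racing
 * increment that itself does not overflow is never undone. */
static void checked_inc_jno(fake_atomic_t *v) {
  asm volatile("lock incl %0\n\t"
               "jno 1f\n\t"
               /* race window: an overflowed value is visible to other CPUs here */
               "lock decl %0\n\t"
               /* (the real mitigation would report/kill here) */
               "1:"
               : "+m" (v->counter));
}

/* "jns" variant: also undo the increment whenever the result is negative,
 * so increments beyond 0x7fffffff cannot accumulate permanently (but users
 * of atomic_t with legitimately negative values would trip it). */
static void checked_inc_jns(fake_atomic_t *v) {
  asm volatile("lock incl %0\n\t"
               "jns 1f\n\t"
               "lock decl %0\n\t"
               "1:"
               : "+m" (v->counter));
}

int main(void) {
  fake_atomic_t v = { 0x7fffffff };
  checked_inc_jno(&v);   /* single-threaded: incremented, then decremented back */
  printf("after jno variant: 0x%x\n", (unsigned)v.counter);
  checked_inc_jns(&v);   /* same result when no other CPU races */
  printf("after jns variant: 0x%x\n", (unsigned)v.counter);
  return 0;
}
======

In the "jno" variant, a single won race makes all later increments stick; in the "jns" variant, any increment that leaves the counter negative is undone, so overstepping the threshold cannot become permanent.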
Jul 8 2016
Lifting view restriction. The following update was provided by PaX team:
----
in general, we believe that there's no need to keep any of this secret, so feel
free to open it up to the public. in fact we believe that this is a publicly known
and documented issue (in more than one way, as you'll see below), so we're not sure
what the point of all this was, but whatever.
regarding your particular claims:
1. we find the claim that "the race is usually won in a few seconds at most" is
misleading since your test already sets up the refcount precisely to INT_MAX
which otherwise takes time and is part of the attack time. you also created
a refcount read-back feature that we believe is not typically available to an
attacker unless he's able to control the precise count (depends on the bug
and circumstances).
2. in my experiments (quad core SNB at around 2.5GHz, both real and vmware)
i could reproduce the (artificially shortened) race itself fine in a few
seconds (depends somewhat on the number of cores, where one trades off control
over the precise value of the refcount for speed). in my experience 2 cores
don't do much worse than 4 cores.
the next step in the attack is to reach a precise refcount value of 0 which
here takes about 2.5-3 mins (real) and 5-7 mins (vmware). note that there's
a constant workfactor involved here; the syscall path you chose is relatively
fast. back in 2008, when the motivating example for REFCOUNT appeared, the
refcount overflow took more than 20 minutes to trigger in vmware.
i also experimented with disabling IRQs around the refcount increment but it
seemed to play little if any role so the race doesn't seem to be won because
some interrupt arbitrarily extended the inc/dec window (this also means that
preemption would have no useful effect for an attack).
we're not sure why you described this part as needing to win the race again, as
there're no checks and thus no saturation attempts at this point. perhaps you
meant that an attack does have to care about precision as reaching any value
other than 0 will mean the whole attack attempt will have to be redone (so we
are talking about several minutes per attack attempt). now i didn't think too
much about how to precisely control the overflowed refcount value, so it's not
clear whether this second step can be controlled with 100% certainty or not.
3. as for solving the problem, we have a few comments:
- as the blog already notes, preventing operations on negative refcount (well,
atomic_t) values doesn't work in general since some real use cases expect
to work in that range already. a better solution is to saturate at a value
that is at least NR_CPUS below the overflow threshold, this way the refcount
cannot overflow into the negative range even if all cores win (in practice
we'd need a bigger safety margin, as much as the number of increments we
can expect while the interrupted saturation code sits in an irq/scheduling
window). the problem with this approach is again that not all atomic users
use single increments on the value, so this would require analysis and special
cases, something we try to avoid.
as for your suggestion about adding an immediate before checking for overflow,
i don't see how it'd work, do you have a code example perhaps? e.g., to what
value would you add the constant? the value before or after the atomic op?
and how would you acquire either while avoiding the very race you're trying
to defend against?
- as we discussed it in private already, we had practical reasons to not enable
the reaction mechanism in grsec (PaX doesn't have much beyond reporting, by
design). one is the existence of non-refcount atomic*_t false positives that,
thanks to some work based on static analysis in recent months, we are just
about to eradicate with reasonable confidence. Brad has in fact jumped the
gun already and enabled the reaction mechanism; i'm still holding out as the
static analysis isn't complete enough to my satisfaction.
another reason for not enabling the attack reaction mechanism was that for
refcount overflows attribution may not be easy if the refcount in question
can also be used by innocent parties (e.g., task or mm struct). in such cases
an attacker could DoS other users though he'd still lose the bug thanks to
the logging (which we already consider a strong deterrent since any refcount
overflow attack will generate logs).
not sure what the conclusion here is, but here are some parting thoughts anyway:
- you can publish this without delay, preferably with our comments (or at least
let us comment on it once published),
- we consider the issue as publicly known and we'd been actively working on it
before your report already,
- we question how close your PoC is to real life bugs and exploits.
----
Jul 8 2016
> regarding your particular claims:
>
> 1. we find the claim that "the race is usually won in a few seconds at most" is
> misleading since your test already sets up the refcount precisely to INT_MAX
> which otherwise takes time and is part of the attack time.
Perhaps I wasn't clear enough about what I intended to refer to with "the race".
"The race" was meant to refer to attacking the race condition in the refcounting
mitigation, not the full attack against a reference counting vulnerability.
In other words, I was talking about the added attack time compared to exploiting
such an issue on a kernel without this mitigation.
> you also created
> a refcount read-back feature that we believe is not typically available to an
> attacker unless he's able to control the precise count (depends on the bug
> and circumstances).
The read-back feature was mostly meant to make it easier to see what's going on
and verify that the counter really is overflowing.
However, for a reference counter with a value that is unknown to the attacker but
isn't changed by things outside the attacker's control during the attack, I
believe that it is actually possible to determine the value of the counter in the
middle of the attack and then keep track of it.
When an attacker increments a reference counter and the calling process is killed
by the kernel, this discloses that the current reference counter value is
0x7fffffff. When the attacker then attacks the reference counter overflow, the
reference counter is incremented by the number of processes that are not killed.
This means that an attacker should be able to increment without parallel
execution until the refcounting mitigation triggers (0x7fffffff has been
reached) and then attack the race condition until the mitigation stops
triggering (>=0x80000000 has been reached).
At this point, the new reference count is known to be 0x7fffffff+N, where N is
the number of processes that didn't die after triggering the race condition.
Now the attacker can perform increments again, keeping track of the counter value.
As long as nothing else on the system changes the reference count shortly after
the second race, this should allow an attacker to exploit reference count
overflows as if a counter read primitive was available.
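As a rough illustration of this bookkeeping (not a working exploit), the strategy could be structured as sketched below. trigger_increment() is just a placeholder for whatever kernel interface actually bumps the target refcount once, and the racing phase from the PoC above is only indicated as a comment.

======
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Placeholder: one increment of the target refcount through whatever kernel
 * interface the attacker actually has. With this empty stub, the loop below
 * obviously never terminates - it only illustrates the bookkeeping. */
static void trigger_increment(void) { }

/* Perform one increment in a child and report whether the child was killed,
 * i.e. whether the refcount overflow reaction fired. */
static int increment_got_killed(void) {
  pid_t child = fork();
  if (child == -1)
    err(1, "fork");
  if (child == 0) {
    trigger_increment();
    exit(0);
  }
  int status;
  if (waitpid(child, &status, 0) == -1)
    err(1, "waitpid");
  return WIFSIGNALED(status);
}

int main(void) {
  uint64_t increments = 0;

  /* Phase 1: single increments until the mitigation kills a child; at that
   * point the counter is known to be exactly 0x7fffffff. */
  while (!increment_got_killed())
    increments++;

  /* Phase 2: race parallel increments (as in the PoC above) until the
   * mitigation stops triggering; the counter is then 0x7fffffff + N, where
   * N is the number of racing tasks that were not killed. */

  /* Phase 3: continue with single increments while keeping count, as if a
   * counter read primitive were available. */

  printf("increments needed to reach 0x7fffffff: %llu\n",
         (unsigned long long)increments);
  return 0;
}
======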
> 2. in my experiments (quad core SNB at around 2.5GHz, both real and vmware)
> i could reproduce the (artificially shortened) race itself fine in a few
> seconds (depends somewhat on the number of cores, where one trades off control
> over the precise value of the refcount for speed). in my experience 2 cores
> don't do much worse than 4 cores.
Ah, I wouldn't have expected that.
> we're not sure why you described this part as needing to win the race again, as
> there're no checks and thus no saturation attempts at this point. perhaps you
> meant that an attack does have to care about precision as reaching any value
> other than 0 will mean the whole attack attempt will have to be redone (so we
> are talking about several minutes per attack attempt). now i didn't think too
> much about how to precisely control the overflowed refcount value, so it's not
> clear whether this second step can be controlled with 100% certainty or not.
Sorry, please ignore that part. I was confused about the circumstances under which
the overflow flag is set.
> 3. as for solving the problem, we have a few comments:
> - as the blog already notes, preventing operations on negative refcount (well,
> atomic_t) values doesn't work in general since some real use cases expect
> to work in that range already. a better solution is to saturate at a value
> that is at least NR_CPUS below the overflow threshold, this way the refcount
> cannot overflow into the negative range even if all cores win (in practice
> we'd need a bigger safety margin, as much as the number of increments we
> can expect while the interrupted saturation code sits in an irq/scheduling
> window). the problem with this approach is again that not all atomic users
> use single increments on the value, so this would require analysis and special
> cases, something we try to avoid.
Ah, true. It looks like there are a lot of atomic_t variables that are used for
memory usage tracking and similar purposes and that can be incremented in
relatively big steps.
However, ignoring them in the mitigation design wouldn't introduce bugs; it would
just mean that the mitigation might be bypassable for these atomic counters, right?
So changing the current mitigation to check against the safety margin instead would
make the protection reliable for normal reference counting in exchange for
potentially making it bypassable for some cases that mostly aren't refcounting.
(Or you could do both checks - but of course that would increase the code size by
2 bytes or so for every atomic operation.)
> as for your suggestion about adding an immediate before checking for overflow,
> i don't see how it'd work, do you have a code example perhaps? e.g., to what
> value would you add the constant? the value before or after the atomic op?
> and how would you acquire either while avoiding the very race you're trying
> to defend against?
It was a stupid idea. I was thinking of something like:
lock addl <increment>, <counter>
mov <counter>, eax
addl $<safety margin>, eax
followed by the existing code that checks for the overflow flag. But of course
that doesn't actually make much sense given that x86 can directly compare a memory
argument to an immediate.
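For illustration, a margin-based check using such a direct memory-to-immediate comparison could look roughly like the following. This is purely a userspace sketch, not PaX code; SATURATION_POINT, its value and fake_atomic_t are made up for the example.

======
#include <stdio.h>

/* same stand-in for atomic_t's ->counter field as in the earlier sketch */
typedef struct { int counter; } fake_atomic_t;

/* made-up value: some safety margin below 0x7fffffff */
#define SATURATION_POINT 0x7fffff00

/* Increment, then compare the counter in memory directly against an
 * immediate; undo the increment once the value is at or above the margin.
 * The unsigned "jb" also treats already-negative values as "too high", so
 * this variant would still misfire for atomic_t users with legitimately
 * negative values. */
static void margin_checked_inc(fake_atomic_t *v) {
  asm volatile("lock incl %0\n\t"
               "cmpl %1, %0\n\t"
               "jb 1f\n\t"
               "lock decl %0\n\t"
               /* (a real mitigation would report/kill here) */
               "1:"
               : "+m" (v->counter)
               : "i" (SATURATION_POINT));
}

int main(void) {
  fake_atomic_t v = { SATURATION_POINT - 1 };
  margin_checked_inc(&v);  /* reaches the margin, gets decremented back */
  printf("counter: 0x%x\n", (unsigned)v.counter);
  return 0;
}
======

Note that the cmpl re-reads the counter instead of using the flags of the increment itself; that is only acceptable because the safety margin is supposed to absorb whatever increments concurrent racers can add in the meantime.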
Nov 10 2016
Marking as fixed since grsecurity triggers account lockout when the attacker loses the race, so it should be very hard to get past the mitigation now. (This change was made shortly after the bug was reported.)
Nov 10 2016
actually, we do even better now: https://forums.grsecurity.net/viewtopic.php?f=7&t=4173#p16722