New issue
Advanced search Search tips

Issue 701645 link

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 3
Type: Bug



Sign in to add a comment

Breakpad failing to retrigger abort() via tgkill inside a PID namespace?

Project Member Reported by thestig@chromium.org, Mar 15 2017

Issue description

In  bug 679353 , my PPAPI process hit an OOM, which abort()s and I ended up with 2 crash IDs:

b241fe1820000000 - SIGABRT
	0x00007f66fdfefc37	(libc-2.19.so -raise.c:56 )	raise
	0x00007f66fdff3027	(libc-2.19.so -abort.c:89 )	abort
	0x00007f6706c28d91	(chrome -debugger_posix.cc:251 )	base::debug::BreakDebugger()
	0x00007f6706c3b792	(chrome -logging.cc:759 )	logging::LogMessage::~LogMessage()
	0x00007f6706c581ad	(chrome -memory_linux.cc:35 )	base::(anonymous namespace)::OnNoMemory()

3cbe311820000000 - SIGSEGV
	0x00007f66fdff3177	(libc-2.19.so -abort.c:125 )	abort
	0x00007f6706c28d91	(chrome -debugger_posix.cc:251 )	base::debug::BreakDebugger()
	0x00007f6706c3b792	(chrome -logging.cc:759 )	logging::LogMessage::~LogMessage()
	0x00007f6706c581ad	(chrome -memory_linux.cc:35 )	base::(anonymous namespace)::OnNoMemory()

It looks like in Breakpad's ExceptionHandler::SignalHandler(), it tries to call tgkill(1, 1, 6) and succeeds, but the kill signal doesn't get delivered somehow? Is the tgid/tid wrong because we are in a PID namespace, and tgkill() should take real tgid/tid? Or is the seccomp sandbox getting in the way maybe?
 
By looking at the code in ExceptionHandler::SignalHandler():
----
if (info->si_code <= 0 || sig == SIGABRT) {
    // This signal was triggered by somebody sending us the signal with kill().
    // In order to retrigger it, we have to queue a new signal by calling
    // kill() ourselves.  The special case (si_pid == 0 && sig == SIGABRT) is
    // due to the kernel sending a SIGABRT from a user request via SysRQ.
    if (tgkill(getpid(), syscall(__NR_gettid), sig) < 0) {
      // If we failed to kill ourselves (e.g. because a sandbox disallows us
      // to do so), we instead resort to terminating our process. This will
      // result in an incorrect exit code.
      _exit(1);
    }
  } 
----

Hmm my reading of the situation is that either:
 Theory1: Breakpad fails to restore the previous signal handler after having handled the signal AND _exit(1) is not really exiting (this is IMHO an extremely weird combination of events, and I'd rule this out)
 Theory2: There is an "outer" breakpad: The 2nd crash report (the one with SIGSEGV) is uploaded by another instance of breakpad which is running in the system and that handles the propagation of the sig handler (should be able to tell this by the metdata in the crash if it is the case?, maybe?)
 Theory3: Something registers "two" breakpad exception handlers, so the breakpad signal handlers invokes two exception handlers back-to-back

I don't think the problem is with tgkill. If tgkill is broken the only effect (I think) should be failing to propagate the crash to the next handler (Typically some system-level debugger daemon, or coredump()). In this sense the _exit(1) there is a safety belt. It should never be hit, unless the next handler in chain is doing something silly and just returning. So, if tgkill fails, this wouldn't explain how we get the second crash dump.

IIRC everything correctly the expected sequence of events is:
- The signal is handled by breakpad
- breakpad restores the previous signal handler (whatever it was before breakpad got initialized)
- breakpad generates a minidump
- breakpad does a tgkill to trigger the next signal handler (The one that has just been restored in the step above).
At this point either:
 - (typical scenario) The next signal handler dumps something on stdout (or coredump) and murders the process 
 - (uncommon, but possible) the next signal handler does something silly and just returns
   In this case, however, we would go back to breakpad and _exit(1) in that code snippet

I re-reasoned on this after getting a coffee. I didn't see the "if" part.
You are right, if the tgkill "pretends" to work, we don't _exit(1).

Still, the thing that I don't still explain is that we would have restored the handler after handling the first signal.
Ahh, now that I look more:
 // Upon returning from this signal handler, sig will become unmasked and then
  // it will be retriggered. If one of the ExceptionHandlers handled it
  // successfully, restore the default handler. Otherwise, restore the
  // previously installed handler. Then, when the signal is retriggered, it will
  // be delivered to the appropriate handler.
  if (handled) {
    InstallDefaultHandler(sig);
  } else {
    RestoreHandlersLocked();
  }

So, if the exception handler returns false (this is really up now to the code in chrome in //components/crash), we install the default handler only for the specific signal (SIGABRT) in this case, but would still be sensitive to SEGV.

Ok, so I just had a big brainfart here. Your theory is perfectly valid.
I think that tgkill is not properly working as we intend.

FWIW I checked libc sources, since the first crash is in abort.c:89 and the 2nd in abort.c:125
abort.c:89 is the first attempt of raise(SIGABRT)
abort.c:125 is the last attempt which does a asm volatile("hlt") (which, I checked, ends up in a SEGV@0x0)

Comment 3 by rsesek@chromium.org, Mar 30 2017

If seccomp were involved, I'd expect the resulting error to be SIGSYS. __NR_tgkill is permitted if the target pid is the current pid (https://cs.chromium.org/chromium/src/sandbox/linux/seccomp-bpf-helpers/baseline_policy.cc?sq=package:chromium&dr=C&l=217). However, how that plays with a PID namespace, I'm not entirely sure... but since the error is not SIGSYS, I don't think that code is in play.

Sign in to add a comment