New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 670466 link

Starred by 3 users

Issue metadata

Status: Started
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

Crash not reported for recent renderer crash

Project Member Reported by scottmg@chromium.org, Dec 1 2016

Issue description

https://bugs.chromium.org/p/chromium/issues/detail?id=670243 caused widespread crashyiness.

I'm seeing it locally, but not getting any crash report in chrome://crashes on my Canary, but I don't know why.

I'm on 57.0.2938.2 canary (64-bit) compiler: clang.

Seems similar to what Will reported a while back, and Prudhvi was going to investigate on his end too to try to repro.
 
Cc: mark@chromium.org
So far I've confirmed that Crashpad will catch __asm ud2 crashes in Crashpad standalone tests (illegal instruction exception) which is apparently what CHECK is in WebKit.

I've also put a breakpoint in the exception handler process and it's not getting signaled by the renderer. So I guess we don't have a UnhandledExceptionFilter, or it's failing somehow?
Labels: -Pri-3 M-57 Pri-2
I am reproducing this crash on Win7/10 for Latest Canary#59.0.2938.0 just signing into gmail account and refresh the tab if you don't see crash. However chrome://crashes doesn't report any crash related info.

Thank you!

Comment 4 by mark@chromium.org, Dec 1 2016

Summary: Crash not reported for recent renderer crash (clang on Windows) (was: Crash not reported for recent renderer crash)
What else do we know about these clang builds and how they’ve been configured?
Owner: scottmg@chromium.org
Status: Started (was: Unconfirmed)
OK, since Prudhvi can repro on .0, clang is a red-herring and I can't blame them.

I'm syncing back to just before the CL was reverted for https://bugs.chromium.org/p/chromium/issues/detail?id=669968 and will try to reproduce there. I definitely wasn't getting the UEF() signal in the handler process but it's difficult to determine if we were hitting the in-process part of that on Canary where I'm unable to change code. (When the debugger's attached it breaks at the ud2 and won't pass it on to the registered handler if any).

Comment 6 by mark@chromium.org, Dec 1 2016

Summary: Crash not reported for recent renderer crash (was: Crash not reported for recent renderer crash (clang on Windows))
OK, readjusting the title then.

Crash reporter triggering can be one of the hardest things to work with in a debugger…
A local debug build captures the crash as expected. :(
Release build doesn't, and behaves the same as Canary. So I have an only slightly-painful repro now. Debugging...
Totally baffled. Everything looks fine except that the UnhandledExceptionFilter() just ... doesn't get triggered. Still digging.

Comment 10 by mark@chromium.org, Dec 2 2016

Bet you’ve already barked up all of these trees, but here are some ideas.

Is anyone else calling SetUnhandledExceptionFilter()? If you call SetUnhandledExceptionFilter() again just prior to a known intentional crash to peek at the previous handler, is it still ours? Are we doing something to “break” further calls to SetUnhandledExceptionFilter(), and if so, could that be having an effect? Are SEH or VEH factors here?

Comment 11 by mark@chromium.org, Dec 2 2016

Oh, and…

Do you see the ud2 in disassembly in both Debug and Release, or is something “smart” transforming it into something else in some configuration?
Yeah, that's what I've been trying to divine. SetUnhandledExceptionFilter() may be getting overrwritten by someone, but I don't quite understand how.

about:crash always works, it's only this particular crash (or some category of crashes) that we're not getting. But I don't know what's different or unusual about these crashes yet.

We do attempt to disable SetUEF early on so that after we're installed SetUEF shouldn't do anything. That also means that I can't easily just call it again to see if we're still installed, unless I disable the disabling, but then I won't really be testing the same thing.

The ud2 is only in Release builds and it's in a weird place, right after what looks like a normally ordered function epilog ending in a ret. So I'm not quite sure how execution getting there.

I'm currently going back a couple days and then patching in the crashy webkit code to see if it's more likely a recent change (I'm suspecting my user data dir switch somehow?) or that this crash is really unusually structured somehow. (It's a bit of waiting for builds, but debugging didn't seem to be getting me too far anyway.)
Cc: ananta@chromium.org
More findings in brain-dump format, but not a full explanation yet.

In !OFFICIAL_BUILD, content_main_runner.cc calls base::debug::EnableInProcessStackDumping() to print a stack. That does a SetUEF that overwrites Crashpad's.

There's 3 calls to SetUEF in normal startup:
1. from __scrt_common_main_seh in the C runtime
2. from CrashpadHandler::SetHandlerIPCPipe
3. from base::debug::EnableInProcessStackDumping

#3 succeeds, though we (I) didn't expect it to for too long. I believe that's because https://cs.chromium.org/chromium/src/chrome_elf/crash/crash_helper.cc?rcl=0&l=97 just early-outs when it's called early from chrome_elf_main https://cs.chromium.org/chromium/src/chrome_elf/chrome_elf_main.cc?rcl=0&l=60. (This is surely my bug from when we had to quickly put crashpad startup back into chrome.exe when we had the Enterprise stable regression on 54 -- I moved the init back to a signal from WinMain() but I didn't think through how the patching worked, so it became a no-op).

In any case, in this build config it ought to still be working and getting crashes semi-normally because EnableInProcessStackDumping uses a "well-behaved" handler that chains to the previous one, which is the Crashpad one. But it doesn't. I can't explain that yet.

It also doesn't explain why in an OFFICIAL_BUILD (Canary) we don't capture crashes, because EnableInProcessStackDumping isn't used there. I confirmed that SetUEF() does indeed return the Crashpad function just before BreakDebugger() is called in OFFICIAL_BUILD.

My current guess would be that the process or stack is getting very hosed, and so for whatever reason, it isn't able to get to the filter function. This would fit with the surprising ud2 I see on Canary that I don't really understand how we got to. But this last paragraph is all speculation, still digging into a more OFFICIAL-like build to try to confirm that.
I don't think there's any major corruption happening. __debugbreak() in ~LogMessage() doesn't go to the UEF though, and I still don't know why yet.


Peripherally and it's own "fun" discovery, I tried CHECK(false) at the bottom of content_main_runner.cc::CommonSubprocessInit() (I was trying to test if CHECK(false) worked anywhere at all because I was starting to think I was going crazy.)

This made a bunch of processes crash at the same time (they got sent to Crashpad).

However that in turn caused crashpad_handler to crash, with a stack of:

 	0000000000000000()	Unknown
>	chrome.exe!crashpad::ProcessReaderWin::ReadThreadData<crashpad::process_types::internal::Traits64>(bool is_64_reading_32) Line 355	C++
 	chrome.exe!crashpad::ProcessReaderWin::Threads() Line 311	C++
 	chrome.exe!crashpad::internal::ExceptionSnapshotWin::Initialize(crashpad::ProcessReaderWin * process_reader, unsigned long thread_id, unsigned __int64 exception_pointers_address) Line 79	C++
 	chrome.exe!crashpad::ProcessSnapshotWin::Initialize(void * process, crashpad::ProcessSuspensionState suspension_state, unsigned __int64 exception_information_address, unsigned __int64 debug_critical_section_address) Line 73	C++
 	chrome.exe!crashpad::CrashReportExceptionHandler::ExceptionHandlerServerException(void * process, unsigned __int64 exception_information_address, unsigned __int64 debug_critical_section_address) Line 56	C++
 	chrome.exe!crashpad::ExceptionHandlerServer::OnCrashDumpEvent(void * ctx, unsigned char __formal) Line 548	C++

It's just called crashpad::NtOpenThread<crashpad::process_types::internal::Traits64>() from ReadThreadData.

Here's the code of that function https://gist.github.com/sgraham/4ab9260ef73a468dd54412773e8edf58 . The C looks simple enough, but looking at the disassembly it's clear it writes the static-guard value before writing the function pointer, so the second entrant might get an uninitialized function pointer. More generally, we don't enable thread-safe statics so it's clearly racy. static for speed, I suppose.

Done poking at the house of cards for tonight. Let's make tomorrow a better day!




Comment 15 by mark@chromium.org, Dec 2 2016

I filed the “peripherally” discovery from comment 14 as  bug crashpad:141 .
Ananta suggested a couple other things to check:
- setting a (different, local) SetUEF right before the __debugbreak(): That also doesn't get called. This points to some sort of corruption again.
- setting a vectored exception handler immediately before the __debugbreak(): That _does_ see the crash, and the ExceptionRecord->ExceptionCode = 0x80000003 (i.e debugbreak as expected, not heap corruption).

My best guess now is that it's due to being inside a v8 jit stack frame. It doesn't seem like we've massively corrupted anything, but maybe it's sufficient for it to abort and fall back to watson if it's unable to walk to the top of the stack? windbg gets to the v8 jit and can't get "through" it back to C code. So if the runtime can't get back to check if there's a __try/__catch set up, maybe it bails? (That's complete handwaving, I don't know enough about how exceptions are dispatched. Need to try to find something to read up on that...)

It'd really be nice to be able to step through what happens after the debugbreak, but I don't know of a way to do that.

Comment 17 by mark@chromium.org, Dec 2 2016

Is this one of those things where we (V8, really) should be using the RtlAddFunctionTable()/RtlInstallFunctionTableCallback() family?

I guess that this failure-to-catch problem is x86_64-only and doesn’t happen on 32-bit x86?
Yeah, possibly. I thought v8 did that? https://cs.chromium.org/chromium/src/components/crash/content/app/crashpad_win.cc?rcl=0&l=300 But maybe it doesn't always happen, or always happen correctly.

I haven't tried rebuilding as x86 yet, but I'll do that, as it would also narrow things down a bit.

This might be a reason to do bug crashpad:133 after all, even if it only catches a very narrow band of things.
https://cs.chromium.org/chromium/src/chrome/common/v8_breakpad_support_win.cc?rcl=0&l=16 sigh. Those are in chrome_elf, not chrome.exe, and because of silently swallowing that, no one notices. :(
I wonder why that code's even in crashpad/breakpad_win.cc, I think the code that's doing the stuff that requires it should really just own that bit of code.
Looking up the RtlAddFunctionTable() helper in the right module fixes catching this crash in x64.

Do we actually have valid uses for SEH? (like ... page probing, or something?) I wonder if we should be considering using AddVectoredExceptionHandler after all, and at least catching known ExceptionCodes in there. It looks like Firefox considered this too, but I haven't read that whole thread yet https://bugzilla.mozilla.org/show_bug.cgi?id=844196 .

Comment 22 by mark@chromium.org, Dec 2 2016

Yesss! The broken backtrace in the debugger was the smoking gun.

I don’t know if we care about SEH, but some system modules might. Probably do, even.

I’d still like to understand *why* not having unwind tables broke this. My understanding of SEH is that the TIB has a linked list of exception handlers that get pushed and popped as code enters and leaves __try blocks. It certainly doesn’t require an unwind to navigate that list. Perhaps there’s something on that list that needs to do an unwind? The only thing that comes to mind would be calling C++ destructors, but I didn’t think that this happened with /EHsc, which is what I think Chrome uses.
Ah, I see jochen commented there, and this really is a groundhog day bug https://bugs.chromium.org/p/v8/issues/detail?id=3598.

I guess for now I'll just fix the GetProcAddress() and be a bit sad.
My understanding is the same as yours. I don't see why it should care about the stack for the UnhandledExceptionFilter.

One reference is https://msdn.microsoft.com/en-us/library/ft9x1kdx.aspx

"""For dynamically generated functions [JIT compilers], the runtime to support these functions must either use RtlInstallFunctionTableCallback or RtlAddFunctionTable to provide this information to the operating system. Failure to do so will result in unreliable exception handling and debugging of processes."""

which I guess gives it leeway to terminate unceremoniously, if it sort of vaguely feels uncomfortable with the stack.
Project Member

Comment 25 by bugdroid1@chromium.org, Dec 2 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/56c90a89889693bd6e1a7d7bb4b8f49c69bab2fd

commit 56c90a89889693bd6e1a7d7bb4b8f49c69bab2fd
Author: scottmg <scottmg@chromium.org>
Date: Fri Dec 02 21:52:49 2016

win64: fix lookup of RtlAddFunctionTable() helper for v8

This helper function moved from chrome.exe to chrome_elf. The failure to
look up is unfortunately silent because the same code is used in test
binaries. As a result, RtlAddFunctionTable() wasn't called, so any
crashes with v8 jit on the stack on Win x64 caused program termination
without giving our crash handler a chance to catch it.

R=thakis@chromium.org, jochen@chromium.org
BUG=670466, 656800 ,604923,v8:3598

Review-Url: https://codereview.chromium.org/2548653004
Cr-Commit-Position: refs/heads/master@{#436041}

[modify] https://crrev.com/56c90a89889693bd6e1a7d7bb4b8f49c69bab2fd/chrome/common/v8_breakpad_support_win.cc

Should be fixed now, but we should really have an integration test for this, so I won't close for now.
(I came across this old issue while researching whether V8 registers any unwind info on x86-64 Windows.)

@mark: "My understanding of SEH is that the TIB has a linked list of exception handlers that get pushed and popped as code enters and leaves __try blocks. It certainly doesn’t require an unwind to navigate that list."

Your understanding is correct for x86-32 but incorrect for x86-64.  On x86-64, SEH and __try blocks use "zero-cost" exception handling: __try blocks don't push and pop entries onto a list of exception handlers.  Instead, SEH uses stack unwinding to find the handlers.

NaCl ran into the same issue of Breakpad's UnhandledExceptionFilter function not being called, in 2011: https://bugs.chromium.org/p/nativeclient/issues/detail?id=2237

Windows' stack unwinder used to have some rather dodgy behaviour when unwinding through through a function that doesn't have unwind info (and maybe it still does): https://lackingrhoticity.blogspot.com/2011/11/stack-unwinding-risks-on-64-bit-windows.html

Sign in to add a comment