Issue metadata
Sign in to add a comment
|
nassh app crash in nacl-loader process (WER dialog) when exiting with threads |
||||||||||||||||||||
Issue descriptionChrome Version: 59.0.3067.2 (Official Build) canary (64-bit) (cohort: Clang-64) OS: Windows 10 10.0.10586 What steps will reproduce the problem? (1) Start secure shell (2) (3) Per wfh: this is pretty easy to repro, using Secure Shell extension: 1. SSH into a machine 2. kill the connected sshd from another terminal 3. attempt to type into the first SSH Unfortunately I can't bisect behind M56 since the latest secure shell extension requires Chrome 56. What is the expected result? Starts without crash What happens instead? Crash and a WER dialog. See attached image. Please use labels and text to provide additional information. chrome.exe --type=nacl-loader --service-request-channel-token=E08E0A2C9CF354C40EE37B4ADFB2054C --mojo-platform-channel-handle=24292 --ignored=" --type=renderer " /prefetch:8 For graphics-related bugs, please copy/paste the contents of the about:gpu page at the end of this report.
,
Apr 10 2017
A dump (Google only), at the exception: https://drive.google.com/a/google.com/file/d/0BzKr4yVrnrZNT3ZBcUtnd0RZc3c/view?usp=sharing The exception seems to occur in thread #9 0:009> k # Child-SP RetAddr Call Site 00 000000e7`fc5ff688 00007ff8`3572aadf ntdll!NtWaitForSingleObject+0x14 01 000000e7`fc5ff690 00007ff8`06d01b94 KERNELBASE!WaitForSingleObjectEx+0x8f 02 000000e7`fc5ff730 00007ff8`06cffb4a chrome_child!NaCl::ConditionVariable::TimedWaitRel+0x84 [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\condition_variable.cc @ 77] 03 000000e7`fc5ff7b0 00007ff8`0514ac63 chrome_child!NaClCondVarWait+0x1a [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\nacl_sync_win.cc @ 130] 04 000000e7`fc5ff7e0 00007ff8`051447bc chrome_child!NaClSysFutexWaitAbs+0xf3 [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\sys_futex.c @ 209] 05 000000e7`fc5ff860 00007ff8`051420c3 chrome_child!NaClSysFutexWaitAbsDecoder+0x4c [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_syscall_list.c @ 109] 06 000000e7`fc5ff8b0 00007ff8`078a6010 chrome_child!NaClSyscallCSegHook+0xe3 [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_syscall_hook.c @ 149] 07 000000e7`fc5ff910 00007ff8`0514e72e chrome_child!NaClSwitchSavingStackPtr+0x20 08 000000e7`fc5ff980 00007ff8`0514014e chrome_child!NaClStartThreadInApp+0x5e [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\arch\x86_64\nacl_switch_to_app_64.c @ 85] 09 000000e7`fc5ff9b0 00007ff8`078e47d0 chrome_child!NaClAppThreadLauncher+0x18e [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_app_thread.c @ 80] 0a (Inline Function) --------`-------- chrome_child!invoke_thread_procedure+0xd [d:\rs1\minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 91] 0b 000000e7`fc5ff9f0 00007ff8`36e18102 chrome_child!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x50 [d:\rs1\minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 115] 0c 000000e7`fc5ffa20 00007ff8`38ebc5b4 KERNEL32!BaseThreadInitThunk+0x22 0d 000000e7`fc5ffa50 00000000`00000000 ntdll!RtlUserThreadStart+0x34
,
Apr 12 2017
I can repro this bug, I just need to use Secure Shell to ssh into one of my machine and do a reboot on this machine, this close the connection and I get the following message in Secure Shell (as well as the WER popup): NaCl plugin exited with status code 255. (R)econnect, (C)hoose another connection, or E(x)it?
,
Apr 12 2017
hmm I just had this happen, and I have an entry in ChildProcess.Crashed2 histogram for 13 (PPAPI_PLUGIN_PROCESS) so perhaps something is not correctly catching these exceptions. looking in UMA there doesn't appear to be a spike for this though...
,
Apr 20 2017
this is pretty easy to repro, using Secure Shell extension: 1. SSH into a machine 2. kill the connected sshd from another terminal 3. attempt to type into the first SSH Unfortunately I can't bisect behind M56 since the latest secure shell extension requires Chrome 56. vapier -> is it possible to get a copy of an old version of SSH extension so I can try and bisect before this?
,
Apr 20 2017
the M56 requirement is just for CrOS/sftp integration. you should be able to safely change it back to like 49 everywhere. i could probably change it back in the source actually ... if you run M55 or older, everything except sftp will work fine.
,
Apr 20 2017
I couldn't change it because loading unpacked extension invalided the key and changed the extension id and then it lost access to some private API it needed to work :(
,
Apr 20 2017
you should be able to copy & paste the key: https://chromium.googlesource.com/apps/libapps/+/master/nassh/manifest.json i load that unpacked all the time and it works fine under Linux & CrOS (and i have to assume Windows would be the same) that said, wrt private APIs, you could still validate things ... you'd lose access to: - terminalPrivate is only for crosh - NaCl socket APIs aren't used if you go through a relay server (--config=google) - fileSystemProvider is only for CrOS
,
Apr 20 2017
I can repro this all the way back to 51.0.2704.0 and I don't think there's much point going further back, because I only think I've been seeing this WER dialog in the past month or so. Which means something else in the system has changed, perhaps a windows update? I can try on other OS, I suppose. scottmg might have ideas?
,
Apr 20 2017
,
Apr 20 2017
adding a few more brains since I'm out of ideas.
,
Apr 20 2017
What *is* the exception? The crash dump from comment #2 doesn't contain an exception record. It only contains a single thread and that thread is sitting at a 'ret' instruction really shouldn't be crashable. Given the location of the crash I would guess that it is an invalid-handle exception, but I don't actually know. It looks like the handle value is 0x2dc but windbg knows nothing about that handle and ~handle displays a suspiciously short list of handles so I don't know if that means anything. Running under windbg -g -G -o should let us catch the exception and whatever output is printed at the time of the exception, which seems like a good next step. The repro steps for Will and Seb seem quite different - maybe there are multiple bugs?
,
Apr 21 2017
When running under windbg following #c5, I see this: (5d50.b4fc): Illegal instruction - code c000001d (first chance) (5d50.b4fc): Unknown exception - code c0000096 (first chance) (5d50.b4fc): Unknown exception - code c0000096 (!!! second chance !!!) 00007ffb`70ef0024 f4 hlt https://www.dropbox.com/s/5b4k48q9m2132rh/nacl-loader.dmp?dl=0 At the hlt there's no stack backtrace unfortunately. The code looks like this before there: 00007ffb`70eefffe ?? ??? 00007ffb`70eeffff ?? ??? 00007ffb`70ef0000 50 push rax 00007ffb`70ef0001 51 push rcx 00007ffb`70ef0002 52 push rdx 00007ffb`70ef0003 ba0c000000 mov edx,0Ch 00007ffb`70ef0008 b904000000 mov ecx,4 00007ffb`70ef000d 65488b042558000000 mov rax,qword ptr gs:[58h] 00007ffb`70ef0016 488b04c8 mov rax,qword ptr [rax+rcx*8] 00007ffb`70ef001a 8b0402 mov eax,dword ptr [rdx+rax] 00007ffb`70ef001d 85c0 test eax,eax 00007ffb`70ef001f 7504 jne 00007ffb`70ef0025 00007ffb`70ef0021 4831e4 xor rsp,rsp 00007ffb`70ef0024 f4 hlt 00007ffb`70ef0025 5a pop rdx 00007ffb`70ef0026 59 pop rcx 00007ffb`70ef0027 58 pop rax 00007ffb`70ef0028 fc cld which I'm guessing is some generic "bail" routine (??). I managed to get a breakpoint on 00007ffb`70ef0000 in that process hoping that the stack was getting barfed just by the "xor rsp, rsp", but apparently not, still no stack when that breakpoint gets hit. (different TID, different run from above) 4:069> ?? @rsp unsigned int64 0x000003d5`fed6b128 4:069> dd @rsp 000003d5`fed6b128 00000000 00000000 00032d80 00000000 000003d5`fed6b138 00000012 00000000 00000000 00000000 000003d5`fed6b148 00000000 00000000 00000000 00000000 000003d5`fed6b158 00000000 00000000 00000000 00000000 000003d5`fed6b168 00000000 00000000 0010005f 00001fa0 000003d5`fed6b178 002b0033 0053002b 002b002b 00010202 000003d5`fed6b188 00000000 00000000 00000000 00000000 000003d5`fed6b198 00000000 00000000 00000000 00000000 4:069> k5 # Child-SP RetAddr Call Site 00 000003d5`fed6b128 00000000`00000000 0x00007ffb`70ef0000 /shrug sorry, dunno without a lot more effort. How's about that wasm?
,
Apr 21 2017
,
Apr 21 2017
It looks like some exception handling code runs - maybe just looking for an exception handler. This exception handling code then hits c0000096 (UNHANDLED EXCEPTION PRIV_INSTRUCTION?) and crashes. So, the final crash location is not the root cause. > (5d50.b4fc): Illegal instruction - code c000001d (first chance) > (5d50.b4fc): Unknown exception - code c0000096 (first chance) > (5d50.b4fc): Unknown exception - code c0000096 (!!! second chance !!!) It is possible to configure windbg to halt immediately on exceptions but, as usual with windbg, it is painful and horrible to do do (Debug-> Event Filters-> ??? -> Profit!) and it's not clear whether it applies to all processes or just the current one, so, pain?
,
Apr 21 2017
Still no stack. :( 0000060b`001b686b ba12000000 mov edx,12h 0000060b`001b6870 6880681b00 push 1B6880h 0000060b`001b6875 e9c6c3e7ff jmp 0000060b`00032c40 0000060b`001b687a 660f1f440000 nop word ptr [rax+rax] 0000060b`001b6880 0f0b ud2 <<---- 0000060b`001b6882 6666666666662e0f1f840000000000 nop word ptr cs:[rax+rax] 0000060b`001b6891 6666666666662e0f1f840000000000 nop word ptr cs:[rax+rax] 4:109> r rax=0000000000000012 rbx=0000000000000000 rcx=0000000000032d80 rdx=0000000000000000 rsi=000000003effc324 rdi=00000000feec0588 rip=0000060b001b6880 rsp=0000060bfed6b870 rbp=0000060bfed70028 r8=00000000fed6b580 r9=0000000000000000 r10=0000000000000002 r11=0000060b001b6880 r12=0000000000000000 r13=00000000fecde6c8 r14=00000000000000ff r15=0000060b00000000 iopl=0 nv up ei pl nz na pe nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202 0000060b`001b6880 0f0b ud2 4:109> k5 # Child-SP RetAddr Call Site 00 0000060b`fed6b870 00000000`00000000 0x0000060b`001b6880
,
Apr 21 2017
(Or symbols more importantly.) I assume this must be just a call to __builtin_trap() or something. Either having symbols for the nexe, or perhaps there's some sort of nacl logging we can enable (if it's say hitting something like this https://cs.chromium.org/chromium/src/native_client/src/untrusted/nacl/nacl_irt.c?rcl=aac1de245362df94cb40042e5fbc9059b0d0874b&l=97 )? One of those seem like the next steps rather than feeling around in the dark.
,
Apr 21 2017
I found the logging settings for nacl (https://developer.chrome.com/native-client/devguide/devcycle/debugging ) which spray many things. But didn't prove too useful after all. In the attached log, sshd is killed at 30800,22952:21:25:50.462000 I don't see anything interesting (in particular nothing like a debug log that we might learn something from.)
,
Apr 21 2017
,
Apr 21 2017
https://bugs.chromium.org/p/nativeclient/issues/detail?id=2237 seems to agree that if the stack is mangled then top level exception handler/WER ends up being called. I wonder what changed that would cause this, and why the test added isn't failing, also why I'm unable to bisect this... is nacl downloaded on demand or something?
,
Apr 21 2017
I made a great personal sacrifice, I set enable_nacl=true and rebuilt for x86. On x86 we're going through Breakpad per https://cs.chromium.org/chromium/src/chrome/nacl/nacl_exe_win_64.cc?rcl=36728e56fac466656a8504cbcbd573576b9c0bdb&l=36 so it's at least nothing specific to Crashpad vs. Breakpad. (I think now that we're using Crashpad on x64 because it's a "plain" chrome child, vs. on x86 chrome running in Wow64 we have to use nacl64.exe which is still using Breakpad. Maybe. Rock solid engineering all around in any case.) So I think that points to the stack corruption causing the WER dispatch like you say Will. Another thought I had was some having it be caused by a double fault, i.e. if the ud2 causes an exception, and then during the exception handler it hlt's? I feel like I should be able to codesearch-find-that if that were the case, and I haven't been able to do so.
,
Apr 21 2017
Before I dive in too deeply, is this still cause unknown (that's how I'm reading it right now)? Oh, and if you want to enable all exceptions in the land of cdb/ntsd/windbg, just do sxe*
,
Apr 21 2017
Confirmed that we do indeed only use Breakpad on x86 in nacl64, not in x64. I think that's probably sorta good. Ish. I also confirmed that "normal" uses of NaclAbort() (which does a __halt()) work as expected, either vectoring to Crashpad, or breaks with a sane stack trace in the debugger. But I still don't know what's up with this particular crash.
,
May 2 2017
I consistently hit the Privileged Instruction breakpoint today using the Secure Shell app on Windows 10 with Chrome 60.3087. I have the .dmps if they'd be useful to anyone.
,
May 29 2017
i added an 'error' handler to the object to catch crashes, so i've started noticing a lot more of these on Linux & CrOS. i think an easy way to reproduce this is just add invalid ssh command line args like -asdf. debugging this a bit on the ssh side, the scenario is: - we're using the pepper_49 toolchain - we're using pnacl which means newlib - JS sends a StartSession message to the ppapi plugin - ppapi plugin main thread calls pthread_create to run the ssh code (which calls ssh's main() func). - the ssh code, when it wants to exit, calls exit() - our plugin implements exit() with a call to abort() - nacl's abort() calls __builtin_trap() https://chromium.googlesource.com/native_client/src/native_client/+/master/src/untrusted/nacl/abort.c - __builtin_trap() triggers a crashdump i tried changing our plugin's exit() to call nacl's exit helpers, but it still crashes. looks like it's because that too calls trap: https://chromium.googlesource.com/native_client/src/native_client/+/master/src/untrusted/nacl/_exit.c
,
Jun 30 2017
Issue 738491 has been merged into this issue.
,
Jun 30 2017
|
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by wfh@chromium.org
, Apr 10 2017