New issue
Advanced search Search tips

Issue 710252 link

Starred by 5 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug-Regression



Sign in to add a comment

nassh app crash in nacl-loader process (WER dialog) when exiting with threads

Project Member Reported by wfh@chromium.org, Apr 10 2017

Issue description

Chrome Version: 59.0.3067.2 (Official Build) canary (64-bit) (cohort: Clang-64)
OS: Windows 10 10.0.10586

What steps will reproduce the problem?
(1) Start secure shell
(2)
(3)

Per wfh:
this is pretty easy to repro, using Secure Shell extension:

1. SSH into a machine
2. kill the connected sshd from another terminal
3. attempt to type into the first SSH

Unfortunately I can't bisect behind M56 since the latest secure shell extension requires Chrome 56.

What is the expected result?

Starts without crash

What happens instead?

Crash and a WER dialog. See attached image.

Please use labels and text to provide additional information.

chrome.exe --type=nacl-loader --service-request-channel-token=E08E0A2C9CF354C40EE37B4ADFB2054C --mojo-platform-channel-handle=24292 --ignored=" --type=renderer " /prefetch:8

For graphics-related bugs, please copy/paste the contents of the about:gpu
page at the end of this report.

 
wer-crash.png
17.9 KB View Download

Comment 1 by wfh@chromium.org, Apr 10 2017

00 000000e7`fb3fde08 00007ff8`3572aadf ntdll!NtWaitForSingleObject+0x14
01 000000e7`fb3fde10 00007ff8`06d01b94 KERNELBASE!WaitForSingleObjectEx+0x8f
02 000000e7`fb3fdeb0 00007ff8`06cffb4a chrome_child!NaCl::ConditionVariable::TimedWaitRel+0x84 [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\condition_variable.cc @ 77]
03 000000e7`fb3fdf30 00007ff8`06cff899 chrome_child!NaClCondVarWait+0x1a [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\nacl_sync_win.cc @ 130]
04 000000e7`fb3fdf60 00007ff8`051481bf chrome_child!NaClXCondVarWait+0x9 [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\nacl_sync_checked.c @ 81]
05 000000e7`fb3fdf90 00007ff8`0513e5ef chrome_child!NaClWaitForMainThreadToExit+0x5f [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\sel_ldr_standard.c @ 756]
06 (Inline Function) --------`-------- chrome_child!StartApp+0x122 [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\sel_main_chrome.c @ 372]
07 000000e7`fb3fdfe0 00007ff8`0513bb7f chrome_child!NaClChromeMainStart+0x42f [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\sel_main_chrome.c @ 392]
08 000000e7`fb3fe0a0 00007ff8`0513b53c chrome_child!NaClListener::OnStart+0x5ff [C:\b\build\slave\win64-clang\build\src\components\nacl\loader\nacl_listener.cc @ 421]
09 (Inline Function) --------`-------- chrome_child!base::DispatchToMethodImpl+0xa [C:\b\build\slave\win64-clang\build\src\base\tuple.h @ 77]
0a (Inline Function) --------`-------- chrome_child!base::DispatchToMethod+0xa [C:\b\build\slave\win64-clang\build\src\base\tuple.h @ 84]
0b (Inline Function) --------`-------- chrome_child!IPC::DispatchToMethod+0xa [C:\b\build\slave\win64-clang\build\src\ipc\ipc_message_templates.h @ 26]
0c 000000e7`fb3fe2f0 00007ff8`0513b147 chrome_child!IPC::MessageT<NaClProcessMsg_Start_Meta, std::tuple<nacl::NaClStartParams>, void>::Dispatch<NaClListener,NaClListener,void,void (NaClListener::*)(const nacl::NaClStartParams &)>+0x11c [C:\b\build\slave\win64-clang\build\src\ipc\ipc_message_templates.h @ 120]
0d 000000e7`fb3fe470 00007ff8`05381337 chrome_child!NaClListener::OnMessageReceived+0x157 [C:\b\build\slave\win64-clang\build\src\components\nacl\loader\nacl_listener.cc @ 236]
0e 000000e7`fb3fe550 00007ff8`05264d50 chrome_child!IPC::ChannelProxy::Context::OnDispatchMessage+0x27 [C:\b\build\slave\win64-clang\build\src\ipc\ipc_channel_proxy.cc @ 329]
0f (Inline Function) --------`-------- chrome_child!base::Callback<void (), base::internal::CopyMode::MoveOnly, base::internal::RepeatMode::Once>::Run+0x1f [C:\b\build\slave\win64-clang\build\src\base\callback.h @ 91]
10 000000e7`fb3fe590 00007ff8`0521b36b chrome_child!base::debug::TaskAnnotator::RunTask+0x230 [C:\b\build\slave\win64-clang\build\src\base\debug\task_annotator.cc @ 57]
11 000000e7`fb3fe750 00007ff8`0521b57b chrome_child!base::MessageLoop::RunTask+0x31b [C:\b\build\slave\win64-clang\build\src\base\message_loop\message_loop.cc @ 423]
12 000000e7`fb3fe8a0 00007ff8`0521baa6 chrome_child!base::MessageLoop::DeferOrRunPendingTask+0x2b [C:\b\build\slave\win64-clang\build\src\base\message_loop\message_loop.cc @ 434]
13 000000e7`fb3fe8e0 00007ff8`052662cb chrome_child!base::MessageLoop::DoWork+0x136 [C:\b\build\slave\win64-clang\build\src\base\message_loop\message_loop.cc @ 527]
14 000000e7`fb3fea90 00007ff8`05265714 chrome_child!base::MessagePumpForIO::DoRunLoop+0x14b [C:\b\build\slave\win64-clang\build\src\base\message_loop\message_pump_win.cc @ 475]
15 000000e7`fb3feb30 00007ff8`05239df1 chrome_child!base::MessagePumpWin::Run+0x64 [C:\b\build\slave\win64-clang\build\src\base\message_loop\message_pump_win.cc @ 58]
16 000000e7`fb3feb90 00007ff8`0513afb4 chrome_child!base::RunLoop::Run+0x81 [C:\b\build\slave\win64-clang\build\src\base\run_loop.cc @ 38]
17 000000e7`fb3fec40 00007ff8`0513d357 chrome_child!NaClListener::Listen+0x214 [C:\b\build\slave\win64-clang\build\src\components\nacl\loader\nacl_listener.cc @ 228]
18 000000e7`fb3fece0 00007ff8`051da7df chrome_child!NaClMain+0xe7 [C:\b\build\slave\win64-clang\build\src\components\nacl\loader\nacl_main.cc @ 59]
19 000000e7`fb3fef90 00007ff8`051dad64 chrome_child!content::RunNamedProcessTypeMain+0x11f [C:\b\build\slave\win64-clang\build\src\content\app\content_main_runner.cc @ 450]
1a 000000e7`fb3ff100 00007ff8`05ccc454 chrome_child!content::ContentMainRunnerImpl::Run+0x134 [C:\b\build\slave\win64-clang\build\src\content\app\content_main_runner.cc @ 729]
1b 000000e7`fb3ff1a0 00007ff8`051da62e chrome_child!service_manager::Main+0x144 [C:\b\build\slave\win64-clang\build\src\services\service_manager\embedder\main.cc @ 179]
1c 000000e7`fb3ff210 00007ff8`046310b5 chrome_child!content::ContentMain+0x3e [C:\b\build\slave\win64-clang\build\src\content\app\content_main.cc @ 19]
1d 000000e7`fb3ff2a0 00007ff6`9a503d2e chrome_child!ChromeMain+0xb5 [C:\b\build\slave\win64-clang\build\src\chrome\app\chrome_main.cc @ 123]
1e 000000e7`fb3ff360 00007ff6`9a50173f chrome!MainDllLoader::Launch+0x12e [C:\b\build\slave\win64-clang\build\src\chrome\app\main_dll_loader_win.cc @ 204]
1f 000000e7`fb3ff450 00007ff6`9a59a263 chrome!wWinMain+0x73f [C:\b\build\slave\win64-clang\build\src\chrome\app\chrome_exe_main_win.cc @ 271]
20 (Inline Function) --------`-------- chrome!invoke_main+0x21 [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 113]
21 000000e7`fb3ff830 00007ff8`36e18102 chrome!__scrt_common_main_seh+0x117 [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 253]
22 000000e7`fb3ff870 00007ff8`38ebc5b4 KERNEL32!BaseThreadInitThunk+0x22
23 000000e7`fb3ff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x34

Comment 2 by wfh@chromium.org, Apr 10 2017

A dump (Google only), at the exception:

https://drive.google.com/a/google.com/file/d/0BzKr4yVrnrZNT3ZBcUtnd0RZc3c/view?usp=sharing

The exception seems to occur in thread #9

0:009> k
 # Child-SP          RetAddr           Call Site
00 000000e7`fc5ff688 00007ff8`3572aadf ntdll!NtWaitForSingleObject+0x14
01 000000e7`fc5ff690 00007ff8`06d01b94 KERNELBASE!WaitForSingleObjectEx+0x8f
02 000000e7`fc5ff730 00007ff8`06cffb4a chrome_child!NaCl::ConditionVariable::TimedWaitRel+0x84 [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\condition_variable.cc @ 77]
03 000000e7`fc5ff7b0 00007ff8`0514ac63 chrome_child!NaClCondVarWait+0x1a [C:\b\build\slave\win64-clang\build\src\native_client\src\shared\platform\win\nacl_sync_win.cc @ 130]
04 000000e7`fc5ff7e0 00007ff8`051447bc chrome_child!NaClSysFutexWaitAbs+0xf3 [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\sys_futex.c @ 209]
05 000000e7`fc5ff860 00007ff8`051420c3 chrome_child!NaClSysFutexWaitAbsDecoder+0x4c [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_syscall_list.c @ 109]
06 000000e7`fc5ff8b0 00007ff8`078a6010 chrome_child!NaClSyscallCSegHook+0xe3 [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_syscall_hook.c @ 149]
07 000000e7`fc5ff910 00007ff8`0514e72e chrome_child!NaClSwitchSavingStackPtr+0x20
08 000000e7`fc5ff980 00007ff8`0514014e chrome_child!NaClStartThreadInApp+0x5e [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\arch\x86_64\nacl_switch_to_app_64.c @ 85]
09 000000e7`fc5ff9b0 00007ff8`078e47d0 chrome_child!NaClAppThreadLauncher+0x18e [C:\b\build\slave\win64-clang\build\src\native_client\src\trusted\service_runtime\nacl_app_thread.c @ 80]
0a (Inline Function) --------`-------- chrome_child!invoke_thread_procedure+0xd [d:\rs1\minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 91]
0b 000000e7`fc5ff9f0 00007ff8`36e18102 chrome_child!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x50 [d:\rs1\minkernel\crts\ucrt\src\appcrt\startup\thread.cpp @ 115]
0c 000000e7`fc5ffa20 00007ff8`38ebc5b4 KERNEL32!BaseThreadInitThunk+0x22
0d 000000e7`fc5ffa50 00000000`00000000 ntdll!RtlUserThreadStart+0x34

I can repro this bug, I just need to use Secure Shell to ssh into one of my machine and do a reboot on this machine, this close the connection and I get the following message in Secure Shell (as well as the WER popup):
NaCl plugin exited with status code 255.
(R)econnect, (C)hoose another connection, or E(x)it?

Comment 4 by wfh@chromium.org, Apr 12 2017

Cc: sebmarchand@chromium.org
hmm I just had this happen, and I have an entry in ChildProcess.Crashed2 histogram for 13 (PPAPI_PLUGIN_PROCESS) so perhaps something is not correctly catching these exceptions.

looking in UMA there doesn't appear to be a spike for this though...

Comment 5 by wfh@chromium.org, Apr 20 2017

Cc: vapier@chromium.org
this is pretty easy to repro, using Secure Shell extension:

1. SSH into a machine
2. kill the connected sshd from another terminal
3. attempt to type into the first SSH

Unfortunately I can't bisect behind M56 since the latest secure shell extension requires Chrome 56.

vapier -> is it possible to get a copy of an old version of SSH extension so I can try and bisect before this?

Comment 6 by vapier@chromium.org, Apr 20 2017

the M56 requirement is just for CrOS/sftp integration.  you should be able to safely change it back to like 49 everywhere.

i could probably change it back in the source actually ... if you run M55 or older, everything except sftp will work fine.

Comment 7 by wfh@chromium.org, Apr 20 2017

I couldn't change it because loading unpacked extension invalided the key and changed the extension id and then it lost access to some private API it needed to work :(

Comment 8 by vapier@chromium.org, Apr 20 2017

you should be able to copy & paste the key:
  https://chromium.googlesource.com/apps/libapps/+/master/nassh/manifest.json

i load that unpacked all the time and it works fine under Linux & CrOS (and i have to assume Windows would be the same)

that said, wrt private APIs, you could still validate things ... you'd lose access to:
- terminalPrivate is only for crosh
- NaCl socket APIs aren't used if you go through a relay server (--config=google)
- fileSystemProvider is only for CrOS

Comment 9 by wfh@chromium.org, Apr 20 2017

Cc: scottmg@chromium.org
I can repro this all the way back to 51.0.2704.0 and I don't think there's much point going further back, because I only think I've been seeing this WER dialog in the past month or so.

Which means something else in the system has changed, perhaps a windows update? I can try on other OS, I suppose. scottmg might have ideas?
Cc: -vapier@chromium.org

Comment 11 by wfh@chromium.org, Apr 20 2017

Cc: robliao@chromium.org brucedaw...@chromium.org
adding a few more brains since I'm out of ideas.
What *is* the exception? The crash dump from comment #2 doesn't contain an exception record. It only contains a single thread and that thread is sitting at a 'ret' instruction really shouldn't be crashable.

Given the location of the crash I would guess that it is an invalid-handle exception, but I don't actually know. It looks like the handle value is 0x2dc but windbg knows nothing about that handle and ~handle displays a suspiciously short list of handles so I don't know if that means anything.

Running under windbg -g -G -o should let us catch the exception and whatever output is printed at the time of the exception, which seems like a good next step.

The repro steps for Will and Seb seem quite different - maybe there are multiple bugs?

When running under windbg following #c5, I see this:

(5d50.b4fc): Illegal instruction - code c000001d (first chance)
(5d50.b4fc): Unknown exception - code c0000096 (first chance)
(5d50.b4fc): Unknown exception - code c0000096 (!!! second chance !!!)
00007ffb`70ef0024 f4              hlt

https://www.dropbox.com/s/5b4k48q9m2132rh/nacl-loader.dmp?dl=0

At the hlt there's no stack backtrace unfortunately. The code looks like this before there:

00007ffb`70eefffe ??              ???
00007ffb`70eeffff ??              ???
00007ffb`70ef0000 50              push    rax
00007ffb`70ef0001 51              push    rcx
00007ffb`70ef0002 52              push    rdx
00007ffb`70ef0003 ba0c000000      mov     edx,0Ch
00007ffb`70ef0008 b904000000      mov     ecx,4
00007ffb`70ef000d 65488b042558000000 mov   rax,qword ptr gs:[58h]
00007ffb`70ef0016 488b04c8        mov     rax,qword ptr [rax+rcx*8]
00007ffb`70ef001a 8b0402          mov     eax,dword ptr [rdx+rax]
00007ffb`70ef001d 85c0            test    eax,eax
00007ffb`70ef001f 7504            jne     00007ffb`70ef0025
00007ffb`70ef0021 4831e4          xor     rsp,rsp
00007ffb`70ef0024 f4              hlt
00007ffb`70ef0025 5a              pop     rdx
00007ffb`70ef0026 59              pop     rcx
00007ffb`70ef0027 58              pop     rax
00007ffb`70ef0028 fc              cld

which I'm guessing is some generic "bail" routine (??).

I managed to get a breakpoint on 00007ffb`70ef0000 in that process hoping that the stack was getting barfed just by the "xor rsp, rsp", but apparently not, still no stack when that breakpoint gets hit. (different TID, different run from above)

4:069> ?? @rsp
unsigned int64 0x000003d5`fed6b128
4:069> dd @rsp
000003d5`fed6b128  00000000 00000000 00032d80 00000000
000003d5`fed6b138  00000012 00000000 00000000 00000000
000003d5`fed6b148  00000000 00000000 00000000 00000000
000003d5`fed6b158  00000000 00000000 00000000 00000000
000003d5`fed6b168  00000000 00000000 0010005f 00001fa0
000003d5`fed6b178  002b0033 0053002b 002b002b 00010202
000003d5`fed6b188  00000000 00000000 00000000 00000000
000003d5`fed6b198  00000000 00000000 00000000 00000000
4:069> k5
 # Child-SP          RetAddr           Call Site
00 000003d5`fed6b128 00000000`00000000 0x00007ffb`70ef0000


/shrug sorry, dunno without a lot more effort. How's about that wasm?
Description: Show this description
It looks like some exception handling code runs - maybe just looking for an exception handler. This exception handling code then hits c0000096 (UNHANDLED EXCEPTION PRIV_INSTRUCTION?) and crashes. So, the final crash location is not the root cause.

> (5d50.b4fc): Illegal instruction - code c000001d (first chance)
> (5d50.b4fc): Unknown exception - code c0000096 (first chance)
> (5d50.b4fc): Unknown exception - code c0000096 (!!! second chance !!!)

It is possible to configure windbg to halt immediately on exceptions but, as usual with windbg, it is painful and horrible to do do (Debug-> Event Filters-> ??? -> Profit!) and it's not clear whether it applies to all processes or just the current one, so, pain?

Still no stack. :(

0000060b`001b686b ba12000000      mov     edx,12h
0000060b`001b6870 6880681b00      push    1B6880h
0000060b`001b6875 e9c6c3e7ff      jmp     0000060b`00032c40
0000060b`001b687a 660f1f440000    nop     word ptr [rax+rax]
0000060b`001b6880 0f0b            ud2        <<----
0000060b`001b6882 6666666666662e0f1f840000000000 nop word ptr cs:[rax+rax]
0000060b`001b6891 6666666666662e0f1f840000000000 nop word ptr cs:[rax+rax]


4:109> r
rax=0000000000000012 rbx=0000000000000000 rcx=0000000000032d80
rdx=0000000000000000 rsi=000000003effc324 rdi=00000000feec0588
rip=0000060b001b6880 rsp=0000060bfed6b870 rbp=0000060bfed70028
 r8=00000000fed6b580  r9=0000000000000000 r10=0000000000000002
r11=0000060b001b6880 r12=0000000000000000 r13=00000000fecde6c8
r14=00000000000000ff r15=0000060b00000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
0000060b`001b6880 0f0b            ud2
4:109> k5
 # Child-SP          RetAddr           Call Site
00 0000060b`fed6b870 00000000`00000000 0x0000060b`001b6880

(Or symbols more importantly.)

I assume this must be just a call to __builtin_trap() or something. Either having symbols for the nexe, or perhaps there's some sort of nacl logging we can enable 
 (if it's say hitting something like this https://cs.chromium.org/chromium/src/native_client/src/untrusted/nacl/nacl_irt.c?rcl=aac1de245362df94cb40042e5fbc9059b0d0874b&l=97 )?

One of those seem like the next steps rather than feeling around in the dark.
I found the logging settings for nacl (https://developer.chrome.com/native-client/devguide/devcycle/debugging ) which spray many things. But didn't prove too useful after all. In the attached log, sshd is killed at 

30800,22952:21:25:50.462000

I don't see anything interesting (in particular nothing like a debug log that we might learn something from.)
nacl.log
18.1 MB Download

Comment 20 by wfh@chromium.org, Apr 21 2017

https://bugs.chromium.org/p/nativeclient/issues/detail?id=2237 seems to agree that if the stack is mangled then top level exception handler/WER ends up being called. I wonder what changed that would cause this, and why the test added isn't failing, also why I'm unable to bisect this... is nacl downloaded on demand or something?
I made a great personal sacrifice, I set enable_nacl=true and rebuilt for x86.

On x86 we're going through Breakpad per

https://cs.chromium.org/chromium/src/chrome/nacl/nacl_exe_win_64.cc?rcl=36728e56fac466656a8504cbcbd573576b9c0bdb&l=36

so it's at least nothing specific to Crashpad vs. Breakpad. (I think now that we're using Crashpad on x64 because it's a "plain" chrome child, vs. on x86 chrome running in Wow64 we have to use nacl64.exe which is still using Breakpad. Maybe. Rock solid engineering all around in any case.)

So I think that points to the stack corruption causing the WER dispatch like you say Will. Another thought I had was some having it be caused by a double fault, i.e. if the ud2 causes an exception, and then during the exception handler it hlt's? I feel like I should be able to codesearch-find-that if that were the case, and I haven't been able to do so.


Before I dive in too deeply, is this still cause unknown (that's how I'm reading it right now)?

Oh, and if you want to enable all exceptions in the land of cdb/ntsd/windbg, just do

sxe*
Confirmed that we do indeed only use Breakpad on x86 in nacl64, not in x64. I think that's probably sorta good. Ish.

I also confirmed that "normal" uses of NaclAbort() (which does a __halt()) work as expected, either vectoring to Crashpad, or breaks with a sane stack trace in the debugger.

But I still don't know what's up with this particular crash.
I consistently hit the Privileged Instruction breakpoint today using the Secure Shell app on Windows 10 with Chrome 60.3087. I have the .dmps if they'd be useful to anyone.
SecureShellCrash.png
38.8 KB View Download
Cc: vapier@chromium.org
i added an 'error' handler to the object to catch crashes, so i've started noticing a lot more of these on Linux & CrOS.  i think an easy way to reproduce this is just add invalid ssh command line args like -asdf.

debugging this a bit on the ssh side, the scenario is:
- we're using the pepper_49 toolchain
- we're using pnacl which means newlib
- JS sends a StartSession message to the ppapi plugin
- ppapi plugin main thread calls pthread_create to run the ssh code (which calls ssh's main() func).
- the ssh code, when it wants to exit, calls exit()
- our plugin implements exit() with a call to abort()
- nacl's abort() calls __builtin_trap()
https://chromium.googlesource.com/native_client/src/native_client/+/master/src/untrusted/nacl/abort.c
- __builtin_trap() triggers a crashdump

i tried changing our plugin's exit() to call nacl's exit helpers, but it still crashes.  looks like it's because that too calls trap:
https://chromium.googlesource.com/native_client/src/native_client/+/master/src/untrusted/nacl/_exit.c
Issue 738491 has been merged into this issue.
Components: Platform>Apps>Default>Hterm
Summary: nassh app crash in nacl-loader process (WER dialog) when exiting with threads (was: app crash in nacl-loader process (WER dialog))

Comment 28 Deleted

Sign in to add a comment