ContextLostIntegrationTest.GpuCrash_GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash breaking on GPU Win7 bots.
Issue description
Win7 Release (NVIDIA) and Win7 Debug (NVIDIA) have started failing on this test. Here are the builds on both where the breakage started:
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/55976
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/75902
,
Oct 23 2017
From debug bot:
Received fatal exception EXCEPTION_ACCESS_VIOLATION
Backtrace:
viz::GpuServiceImpl::Crash [0x0A9251E9+457]
viz::mojom::GpuServiceStubDispatch::Accept [0x0AEFD196+6294]
viz::mojom::GpuServiceStub<mojo::RawPtrImplRefTraits<viz::mojom::GpuService> >::Accept [0x0A93D7D9+89]
mojo::InterfaceEndpointClient::HandleValidatedMessage [0x07A217DD+1917]
mojo::InterfaceEndpointClient::HandleIncomingMessageThunk::Accept [0x07A21040+32]
mojo::FilterChain::Accept [0x07A1EFE9+489]
mojo::InterfaceEndpointClient::HandleIncomingMessage [0x07A24A54+260]
mojo::internal::MultiplexRouter::ProcessIncomingMessage [0x07A48905+1909]
mojo::internal::MultiplexRouter::Accept [0x07A47BBC+732]
mojo::FilterChain::Accept [0x07A1EFE9+489]
mojo::Connector::ReadSingleMessage [0x07A0D5A7+1015]
mojo::Connector::ReadAllAvailableMessages [0x07A0EB3D+125]
mojo::Connector::OnHandleReadyInternal [0x07A0E883+291]
mojo::Connector::OnWatcherHandleReady [0x07A0E74D+29]
base::internal::FunctorTraits<void (__thiscall mojo::Connector::*)(unsigned int),void>::Invoke<mojo::Connector *,unsigned int> [0x07A11962+66]
base::internal::InvokeHelper<0,void>::MakeItSo<void (__thiscall mojo::Connector::*const &)(unsigned int),mojo::Connector *,unsigned int> [0x07A11859+105]
base::internal::Invoker<base::internal::BindState<void (__thiscall mojo::Connector::*)(unsigned int),base::internal::UnretainedWrapper<mojo::Connector> >,void __cdecl(unsigned int)>::RunImpl<void (__thiscall mojo::Connector::*const &)(unsigned int),std::t [0x07A117BF+111]
base::internal::Invoker<base::internal::BindState<void (__thiscall mojo::Connector::*)(unsigned int),base::internal::UnretainedWrapper<mojo::Connector> >,void __cdecl(unsigned int)>::Run [0x07A11673+83]
base::RepeatingCallback<void __cdecl(unsigned int)>::Run [0x067BB6D6+86]
mojo::SimpleWatcher::DiscardReadyState [0x067BB54A+42]
base::internal::FunctorTraits<void (__cdecl*)(base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSignalsState const &),void>::Invoke<base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSig [0x067C1B36+102]
base::internal::InvokeHelper<0,void>::MakeItSo<void (__cdecl*const &)(base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSignalsState const &),base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo: [0x067C1A14+116]
base::internal::Invoker<base::internal::BindState<void (__cdecl*)(base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSignalsState const &),base::RepeatingCallback<void __cdecl(unsigned int)> >,void __cdecl(unsigned int,mo [0x067C1972+130]
base::internal::Invoker<base::internal::BindState<void (__cdecl*)(base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSignalsState const &),base::RepeatingCallback<void __cdecl(unsigned int)> >,void __cdecl(unsigned int,mo [0x067C180E+110]
base::RepeatingCallback<void __cdecl(unsigned int,mojo::HandleSignalsState const &)>::Run [0x067CC612+114]
mojo::SimpleWatcher::OnHandleReady [0x067CC28A+474]
mojo::SimpleWatcher::Context::Notify [0x067CC86B+347]
mojo::SimpleWatcher::Context::CallNotify [0x067CA8AB+107]
mojo::edk::WatcherDispatcher::InvokeWatchCallback [0x0E0D8781+225]
mojo::edk::Watch::InvokeCallback [0x0E0D7567+167]
mojo::edk::RequestContext::~RequestContext [0x0E0C73BE+734]
mojo::edk::NodeChannel::OnChannelMessage [0x0E080E58+7656]
mojo::edk::Channel::OnReadComplete [0x0E043648+1832]
base::MessagePumpForIO::IOHandler::~IOHandler [0x0E049824+516]
base::MessagePumpForIO::IOHandler::IOHandler [0x0E0460EC+3644]
base::MessagePumpForIO::WaitForIOCompletion [0x10189D64+276]
base::MessagePumpForIO::WaitForWork [0x10189F18+312]
base::MessagePumpForIO::DoRunLoop [0x10189BF8+280]
base::MessagePumpWin::Run [0x10187601+193]
base::MessageLoop::Run [0x1017A6BA+266]
base::RunLoop::Run [0x1026E94E+270]
base::Thread::Run [0x10362FF8+424]
base::Thread::ThreadMain [0x10363B4E+1438]
base::PlatformThread::GetCurrentThreadPriority [0x103310AD+1181]
BaseThreadInitThunk [0x7566336A+18]
RtlInitializeExceptionChain [0x77CD9902+99]
RtlInitializeExceptionChain [0x77CD98D5+54]
Release:
Received fatal exception EXCEPTION_ACCESS_VIOLATION
Backtrace:
viz::GpuServiceImpl::Crash [0x675DC9AF+223]
viz::mojom::GpuServiceStubDispatch::Accept [0x65C11368+1056]
viz::mojom::GpuServiceStub<mojo::RawPtrImplRefTraits<viz::mojom::GpuService> >::Accept [0x675DF4E3+19]
mojo::InterfaceEndpointClient::HandleValidatedMessage [0x6687A36C+608]
mojo::FilterChain::Accept [0x6687D643+129]
mojo::InterfaceEndpointClient::HandleIncomingMessage [0x6687B0CA+104]
mojo::internal::MultiplexRouter::ProcessIncomingMessage [0x66871D7E+694]
mojo::internal::MultiplexRouter::Accept [0x6687188F+295]
mojo::FilterChain::Accept [0x6687D643+129]
mojo::Connector::ReadSingleMessage [0x66877C5A+376]
mojo::Connector::ReadAllAvailableMessages [0x6687830D+85]
mojo::Connector::OnHandleReadyInternal [0x668781E3+135]
base::internal::Invoker<base::internal::BindState<void (__thiscall IPC::SyncChannel::ReceivedSyncMsgQueue::NestedSendDoneWatcher::*)(base::WaitableEvent *),base::internal::UnretainedWrapper<IPC::SyncChannel::ReceivedSyncMsgQueue::NestedSendDoneWatcher> >, [0x66B46741+17]
mojo::SimpleWatcher::DiscardReadyState [0x66A9C5E1+41]
base::internal::Invoker<base::internal::BindState<void (__cdecl*)(base::RepeatingCallback<void __cdecl(unsigned int)> const &,unsigned int,mojo::HandleSignalsState const &),base::RepeatingCallback<void __cdecl(unsigned int)> >,void __cdecl(unsigned int,mo [0x65C9E6AD+21]
mojo::SimpleWatcher::OnHandleReady [0x66883A09+229]
mojo::SimpleWatcher::Context::Notify [0x66883B5E+184]
mojo::SimpleWatcher::Context::CallNotify [0x66883100+28]
mojo::edk::WatcherDispatcher::InvokeWatchCallback [0x676874A5+93]
mojo::edk::Watch::InvokeCallback [0x6768D47B+71]
mojo::edk::RequestContext::~RequestContext [0x6767AF09+415]
mojo::edk::NodeChannel::OnChannelMessage [0x6768F958+2332]
mojo::edk::Channel::OnReadComplete [0x67691456+406]
mojo::edk::Channel::Create [0x67692E8B+3147]
base::MessagePumpForIO::WaitForIOCompletion [0x6684CD3E+334]
base::MessagePumpForIO::WaitForWork [0x6684CE6A+250]
base::MessagePumpForIO::DoRunLoop [0x6684CBBD+125]
base::MessagePumpWin::Run [0x6684BF3C+108]
base::MessageLoop::Run [0x667E3387+151]
base::RunLoop::Run [0x667D9A2E+110]
base::Thread::Run [0x667EECB2+162]
base::Thread::ThreadMain [0x667EEF67+663]
base::PlatformThread::SetCurrentThreadPriority [0x667E7BF3+515]
BaseThreadInitThunk [0x7675338A+18]
RtlInitializeExceptionChain [0x77409902+99]
RtlInitializeExceptionChain [0x774098D5+54]
,
Oct 23 2017
,
Oct 23 2017
Oh wait, that looks like an intentional crash (also cf test name).
,
Oct 23 2017
khushalsagar, can you paste the part of the test output that explains what exactly failed?
,
Oct 23 2017
I think it's this:
[6/6] gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.GpuCrash_GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash failed unexpectedly 39.1940s:
<snip>
AssertionError: Timed out waiting for a gpu process crash
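For readers unfamiliar with how a "timed out waiting" assertion like this typically arises, here is a minimal Python sketch of a poll-with-deadline check. It is not the telemetry harness code: `wait_for`, `assert_gpu_process_crashed` and the `get_crash_count` callable are hypothetical names used only for illustration. The point is that the assertion fires when the harness never observes the crash within the deadline, which is separate from whether the GPU process actually faulted.

```python
import time


def wait_for(condition, timeout_s, poll_interval_s=0.1):
    """Poll `condition` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


def assert_gpu_process_crashed(get_crash_count, expected, timeout_s):
    # `get_crash_count` is a hypothetical callable standing in for however
    # the real harness observes GPU process crashes.
    if not wait_for(lambda: get_crash_count() >= expected, timeout_s):
        raise AssertionError("Timed out waiting for a gpu process crash")
```

If the crash is never reported (for example because the process that should record it dies first), the condition stays false and the assertion is raised once the deadline passes.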
,
Oct 23 2017
Ah thanks.
void GpuServiceImpl::Crash() {
  DCHECK(io_runner_->BelongsToCurrentThread());
  DVLOG(1) << "GPU: Simulating GPU crash";
  // Good bye, cruel world.
  volatile int* it_s_the_end_of_the_world_as_we_know_it = NULL;
  *it_s_the_end_of_the_world_as_we_know_it = 0xdead;
}
I bet clang just optimizes this away somehow.
,
Oct 23 2017
Seems like a plausible theory. Can you check this out Nico?
,
Oct 23 2017
Looks like the testers download builds from somewhere. Can you link to the builder, so I can look at its args.gn?
,
Oct 23 2017
I think this is where the bots pull the build from: https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.gpu%2FGPU_Win_Builder%2F72281%2F%2B%2Frecipes%2Fsteps%2Fgenerate_build_files%2F0%2Fstdout
,
Oct 23 2017
dcheck_always_on = true
ffmpeg_branding = "Chrome"
goma_dir = "E:\\b\\c\\goma_client"
is_component_build = false
is_debug = false
proprietary_codecs = true
strip_absolute_paths_from_debug_symbols = true
symbol_level = 1
target_cpu = "x86"
use_goma = true
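To check a configuration like this programmatically (for instance, to confirm that a failing builder is a 32-bit release build), a small Python sketch along these lines could be used. It is not GN's own parser: `parse_gn_args` is a hypothetical helper that only understands simple `name = value` assignments with boolean, integer and quoted-string values, which is all that appears in the arguments above.

```python
import re

# The builder's arguments as pasted in the thread.
ARGS_GN = (
    r'dcheck_always_on = true ffmpeg_branding = "Chrome" '
    r'goma_dir = "E:\\b\\c\\goma_client" is_component_build = false '
    r'is_debug = false proprietary_codecs = true '
    r'strip_absolute_paths_from_debug_symbols = true symbol_level = 1 '
    r'target_cpu = "x86" use_goma = true'
)

# name = "quoted string" | bare token (true, false, integers)
_ASSIGNMENT = re.compile(r'(\w+)\s*=\s*("(?:[^"\\]|\\.)*"|\S+)')


def parse_gn_args(text):
    """Parse simple scalar GN assignments into a dict (illustrative only)."""
    args = {}
    for key, raw in _ASSIGNMENT.findall(text):
        if raw in ("true", "false"):
            args[key] = raw == "true"
        elif raw.startswith('"'):
            # Quotes are stripped; escape sequences are left untouched.
            args[key] = raw[1:-1]
        elif raw.isdigit():
            args[key] = int(raw)
        else:
            args[key] = raw
    return args


args = parse_gn_args(ARGS_GN)
is_32bit_release = args["target_cpu"] == "x86" and not args["is_debug"]
```

Because the pattern treats any whitespace as a separator, the same helper works whether the assignments sit on one line or one per line.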
,
Oct 23 2017
That's correct. FWIW it does currently appear that this issue is x86/32-bit only.
,
Oct 23 2017
(Short update: the write to 0 is in the disassembly, and on 2nd thought we do get the crash stack that I pasted above, so that part probably works. Now waiting for my local build, then I'll try to figure out how to run the test on swarming. I don't have a Win box at home, so I'm a bit limited in what I can do, but I'll keep at it some more using the win cross build.)
,
Oct 24 2017
Thanks.. it's pretty bad to leave the GPU bots broken for more than a few hours. Do you think you'll be able to fix it today?
,
Oct 24 2017
khushal, maybe we could disable this test on Windows temporarily.
,
Oct 24 2017
Disabling the test while the fix is in progress: https://chromium-review.googlesource.com/c/chromium/src/+/734760
,
Oct 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/e577590bdc2ef38a6cf3ff9d5c99503108ca8916
commit e577590bdc2ef38a6cf3ff9d5c99503108ca8916
Author: Khushal <khushalsagar@chromium.org>
Date: Tue Oct 24 02:57:20 2017
gpu_test: Disable GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash.
Disable this test on Win until fix is in progress.
R=zmo@chromium.org
Bug: 777579
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: I9c05ffd84fcbd210ac42729b6fb5401b15375f15
Reviewed-on: https://chromium-review.googlesource.com/734760
Reviewed-by: Zhenyao Mo <zmo@chromium.org>
Commit-Queue: Khushal <khushalsagar@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511022}
[modify] https://crrev.com/e577590bdc2ef38a6cf3ff9d5c99503108ca8916/content/test/gpu/gpu_tests/context_lost_expectations.py
,
Oct 24 2017
Thanks for disabling the test. I asked for help on running the tests on swarming at https://groups.google.com/a/chromium.org/forum/#!topic/graphics-dev/3GzLx2SVn1A but so far no replies. Hans, do you have access to a Win box? Could you take a look at EMEA hours if so?
,
Oct 24 2017
Looking now..
,
Oct 24 2017
,
Oct 24 2017
Notes to self. To build and run:
ninja -C out\release -j300 telemetry_gpu_integration_test
python content\test\gpu\run_gpu_integration_test.py context_lost --test-filter=CrashesExactlyOnce
,
Oct 24 2017
On my first run, it just sat there forever. On my second try, I got the attached wall of text.
Symbolized minidump:
Last event: 6770.6340: Access violation - code c0000005 (first/second chance not available)
debugger time: Tue Oct 24 02:13:08.281 2017 (UTC - 7:00)
ChildEBP RetAddr Args to Child
WARNING: Frame IP not in any known module. Following frames may be wrong.
0797f55c 0118cc2c 000001d8 c0000005 0797f588 0x8b55ff8b
0797f56c 01187a7d 000001d8 c0000005 01187a40 chrome!crashpad::SafeTerminateProcess+0x10
0797f588 77c61034 0563b178 00000000 db9c4433 chrome!crashpad::ExceptionHandlerServer::OnCrashDumpEvent+0x3d
0797f650 77c6934a 0797f7b4 059435e0 05945478 ntdll!RtlDeleteTimerQueueEx+0x214
0797f6bc 77c692aa 00000000 05935838 77c69220 ntdll!TpReleaseTimer+0x20a
0797f6dc 77c6cd42 0797f7b4 05945568 05945478 ntdll!TpReleaseTimer+0x16a
*** ERROR: Symbol file could not be found. Defaulted to export symbols for kernel32.dll -
0797f894 779138f4 05935838 779138d0 0bc4aee7 ntdll!EtwNotificationRegister+0x732
0797f8a8 77ca5e13 05935838 db9c4a93 00000000 kernel32!BaseThreadInitThunk+0x24
0797f8f0 77ca5dde ffffffff 77ccb80e 00000000 ntdll!RtlUnicodeStringToInteger+0x253
0797f900 00000000 77c6c6e0 05935838 00000000 ntdll!RtlUnicodeStringToInteger+0x21e
That looks suspicious... crashpad::SafeTerminateProcess is supposed to call the win32 function TerminateProcess().
SafeTerminateProcess() is a naked function with inline assembly. Maybe we get that wrong?
Yes! If I drop the naked attribute and replace the body with a call to TerminateProcess() the test seems to pass. This correlates with the problem being 32-bit x86 only.
?SafeTerminateProcess@crashpad@@YA_NPAXI@Z (bool __cdecl crashpad::SafeTerminateProcess(void *,unsigned int)):
00000000: A1 00 00 00 00 mov eax,dword ptr [__imp__TerminateProcess@8]
00000005: 55 push ebp
00000006: 89 E5 mov ebp,esp
00000008: FF 75 0C push dword ptr [ebp+0Ch]
0000000B: FF 75 08 push dword ptr [ebp+8]
0000000E: FF 10 call dword ptr [eax]
00000010: 85 C0 test eax,eax
00000012: 0F 95 C0 setne al
00000015: 89 EC mov esp,ebp
00000017: 5D pop ebp
00000018: C3 ret
Functionally that seems fine, but perhaps something gets confused by the "mov eax" instruction before the prologue.. Hmm.
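A side observation on the listing and minidump above, offered as an assumption since nobody in the thread states it: the faulting address 0x8b55ff8b looks like code bytes read as a pointer. In the listing, `mov eax,dword ptr [__imp__TerminateProcess@8]` already loads a value from the import slot, and `call dword ptr [eax]` then performs a second memory read through that value. If eax held the function's address, the call target would be the first four bytes of the function's own code. The Python sketch below only checks the arithmetic: it assumes the callee starts with the byte sequence 8B FF 55 8B EC (`mov edi,edi; push ebp; mov ebp,esp`), which is a common prologue in 32-bit Windows system DLLs but is not shown anywhere in the thread.

```python
def as_le_dword(code_bytes):
    """Interpret the first four bytes of `code_bytes` as a little-endian dword."""
    return int.from_bytes(code_bytes[:4], "little")


# Assumed callee prologue (hypothetical; not taken from the thread):
#   8B FF   mov edi,edi
#   55      push ebp
#   8B EC   mov ebp,esp
ASSUMED_PROLOGUE = bytes.fromhex("8bff558bec")

# The top frame reported as "Frame IP not in any known module" in the minidump.
FAULTING_IP = 0x8B55FF8B

matches = as_le_dword(ASSUMED_PROLOGUE) == FAULTING_IP
```

Under that assumption the two values match, which would be consistent with an extra level of indirection in the generated call. The thread itself leaves the root cause open and tracks it in a separate compiler bug.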
,
Oct 24 2017
Oh wow, if it's that, I broke this just yesterday, minutes before the switch, in https://chromium-review.googlesource.com/732030
We could revert that, at the cost of breaking cross builds again. I looked at the assembly and figured it's identical, apparently it isn't?
,
Oct 24 2017
I'll make a CL to do something like https://chromium-review.googlesource.com/c/crashpad/crashpad/+/734102 for SafeTerminateProcess() instead. (I don't understand why my change didn't work though.)
,
Oct 24 2017
> I looked at the assembly and figured it's identical, apparently it isn't?
Yes, it looks functionally identical. There must be something special happening, like the call to TerminateProcess causing unwinding perhaps. I thought maybe the "mov eax" before the prologue was a problem, but tricking the compiler to put that after the prologue didn't help. I also changed it to call through a trampoline that did the dllimport call, but that didn't help either. Getting more paranoid, if something is indeed trying to unwind, maybe the naked function causes us to emit weird FPO data or something, whereas the .asm file didn't have that and so just worked.
> I'll make a CL to do something like https://chromium-review.googlesource.com/c/crashpad/crashpad/+/734102 for SafeTerminateProcess() instead
Sounds good.
,
Oct 24 2017
,
Oct 24 2017
Attaching the object files built with Clang and MSVC for comparison.
,
Oct 24 2017
Oops no, looks like I had Nico's fix applied while building those.
,
Oct 24 2017
I happen to have the .obj files at hand. Not sure if things are already bad here, or if the linker needs to do relocations first.
,
Oct 24 2017
Looks like this repros in crashpad's unit tests, which is probably easier to work with. Except they're gyp-only. I'm making a CL to port them to gn so that they can be built in a regular chromium checkout. Preview:
C:\src\chrome\src>out\gncl\crashpad_util_test.exe
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from SafeTerminateProcess
[ RUN ] SafeTerminateProcess.PatchBadly
[ OK ] SafeTerminateProcess.PatchBadly (0 ms)
[ RUN ] SafeTerminateProcess.TerminateChild
[ OK ] SafeTerminateProcess.TerminateChild (23 ms)
[----------] 2 tests from SafeTerminateProcess (34 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (42 ms total)
[ PASSED ] 2 tests.
C:\src\chrome\src>ninja -C out\gn crashpad_util_test
ninja: Entering directory `out\gn'
[1/1] Regenerating ninja files
[11/12] LIB obj/base/base.lib
[12/12] LINK crashpad_util_test.exe
C:\src\chrome\src>out\gn\crashpad_util_test.exe
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from SafeTerminateProcess
[ RUN ] SafeTerminateProcess.PatchBadly
unknown file: error: SEH exception with code 0xc0000005 thrown in ïM≡Φh■.
[ FAILED ] SafeTerminateProcess.PatchBadly (3 ms)
[ RUN ] SafeTerminateProcess.TerminateChild
unknown file: error: SEH exception with code 0xc0000005 thrown in ïM≡Φh■.
[ FAILED ] SafeTerminateProcess.TerminateChild (29 ms)
[----------] 2 tests from SafeTerminateProcess (43 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (55 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] SafeTerminateProcess.PatchBadly
[ FAILED ] SafeTerminateProcess.TerminateChild
2 FAILED TESTS
C:\src\chrome\src>
,
Oct 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/af5f31ed615bdbf0ad246cff643a2802a1e0d702
commit af5f31ed615bdbf0ad246cff643a2802a1e0d702
Author: Nico Weber <thakis@chromium.org>
Date: Tue Oct 24 13:15:02 2017
crashpad: Revert inlin-asm-for-SafeTerminateProcess(); call TerminateProcess() in cross builds instead.
The inline asm doesn't work in clang builds for reasons I don't yet understand. Since we just switched to clang by default, lets get back to a good state first.
TBR=mark@chromium.org
Bug: 777579, 777741, 762167
Change-Id: I928ebaeaca033da69eaf1d832f477fb3fdca0283
Reviewed-on: https://chromium-review.googlesource.com/735619
Commit-Queue: Hans Wennborg <hans@chromium.org>
Reviewed-by: Hans Wennborg <hans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511127}
[modify] https://crrev.com/af5f31ed615bdbf0ad246cff643a2802a1e0d702/build/secondary/third_party/crashpad/crashpad/util/BUILD.gn
[add] https://crrev.com/af5f31ed615bdbf0ad246cff643a2802a1e0d702/third_party/crashpad/crashpad/util/win/safe_terminate_process.asm
[delete] https://crrev.com/5ebcc615a08792ecd11ef69119a6a51b3ad615c3/third_party/crashpad/crashpad/util/win/safe_terminate_process.cc
[add] https://crrev.com/af5f31ed615bdbf0ad246cff643a2802a1e0d702/third_party/crashpad/crashpad/util/win/safe_terminate_process_broken.cc
,
Oct 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/6bf9ae4987bf596685189f5281404631dae43311
commit 6bf9ae4987bf596685189f5281404631dae43311
Author: Nico Weber <thakis@chromium.org>
Date: Tue Oct 24 14:25:55 2017
Revert "gpu_test: Disable GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash."
This reverts commit e577590bdc2ef38a6cf3ff9d5c99503108ca8916.
Reason for revert: This might work again.
Original change's description:
> gpu_test: Disable GPUProcessCrashesExactlyOncePerVisitToAboutGpuCrash.
>
> Disable this test on Win until fix is in progress.
>
> R=zmo@chromium.org
>
> Bug: 777579
> Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
> Change-Id: I9c05ffd84fcbd210ac42729b6fb5401b15375f15
> Reviewed-on: https://chromium-review.googlesource.com/734760
> Reviewed-by: Zhenyao Mo <zmo@chromium.org>
> Commit-Queue: Khushal <khushalsagar@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#511022}
TBR=zmo@chromium.org,khushalsagar@chromium.org
Change-Id: Ic576f227555913897e7d544abad04181519febd3
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 777579
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Reviewed-on: https://chromium-review.googlesource.com/735839
Reviewed-by: Nico Weber <thakis@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511138}
[modify] https://crrev.com/6bf9ae4987bf596685189f5281404631dae43311/content/test/gpu/gpu_tests/context_lost_expectations.py
,
Oct 24 2017
Test got reenabled here https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/75952 , bot stayed green. I filed bug 777924 for figuring out the clang-cl compiler bug.
,
Oct 24 2017
Thanks Nico and Hans for getting to the bottom of this, and Khusal for helping triage.
,
Oct 24 2017
Khushal* sorry
Comment 1 by thakis@chromium.org
, Oct 23 2017