Chrome_Mac: Crash Report - [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid
Issue descriptionProduct name: Chrome_Mac Magic Signature: [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid Current link: https://crash.corp.google.com/browse?q=product.name%3D'Chrome_Mac'%20AND%20product.version%3D'56.0.2906.0'%20AND%20cpu.Architecture%3D'amd64'%20AND%20custom_data.ChromeCrashProto.ptype%3D'gpu-process'%20AND%20ReportID%3D'0c427d4700000000'%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D'%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid'&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#3 Search properties: product.name: Chrome_Mac product.version: 56.0.2906.0 cpu.architecture: amd64 custom_data.chromecrashproto.ptype: gpu-process reportid: 0c427d4700000000 Metadata : Product Name: Chrome_Mac Product Version: 56.0.2906.0 Report ID: 0c427d4700000000 Report Time: Tue, 01 Nov 2016 21:30:51 GMT Uptime: 3682000 ms Cumulative Uptime: 0 ms User Email: OS Name: Mac OS X OS Version: 10.11.6 15G1004 CPU Architecture: amd64 CPU Info: family 6 model 70 stepping 1 Stack Trace: Thread 0 MAGIC SIGNATURE THREAD Stack Quality91%Show frame trust levels 0x00007fff9c833f72 (libsystem_kernel.dylib + 0x00010f72 ) 0x00007fff9d14cc20 (IOKit + 0x00065c20 ) io_connect_method 0x00007fff9d0ed12f (IOKit + 0x0000612f ) IOConnectCallMethod 0x00007fff9c5e3876 (IOAccelerator + 0x00003876 ) IOAccelResourceCreate 0x00007fff9e02bdfd (libGPUSupportMercury.dylib + 0x00008dfd ) gpusGetKernelTexture 0x000000011973bc3e (AMDRadeonX4000GLDriver + 0x00085c3e ) 0x00000001197386c3 (AMDRadeonX4000GLDriver + 0x000826c3 ) 0x00007fff9e027a64 (libGPUSupportMercury.dylib + 0x00004a64 ) gldLoadFramebuffer 0x00007fff8f07250e (GLEngine + 0x0011a50e ) gleCheckFramebufferStatus 0x00007fff8efadf19 (GLEngine + 0x00055f19 ) glCheckFramebufferStatusEXT_Exec 0x000000010ef80517 (Google Chrome Framework -gles2_cmd_decoder.cc:4248 ) gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid(gpu::gles2::Framebuffer*, 
unsigned int, unsigned int, char const*) 0x000000010ef85ae9 (Google Chrome Framework -gles2_cmd_decoder.cc:4295 ) gpu::gles2::GLES2DecoderImpl::GetHelper(unsigned int, int*, int*) 0x000000010ef64ea0 (Google Chrome Framework -gles2_cmd_decoder.cc:6763 ) gpu::gles2::GLES2DecoderImpl::HandleGetIntegerv(unsigned int, void const volatile*) 0x000000010ef8381f (Google Chrome Framework -gles2_cmd_decoder.cc:5137 ) gpu::error::Error gpu::gles2::GLES2DecoderImpl::DoCommandsImpl<false>(unsigned int, void const volatile*, int, int*) 0x000000010ef416a4 (Google Chrome Framework -cmd_parser.cc:53 ) <name omitted> 0x000000010ef420f8 (Google Chrome Framework -command_executor.cc:61 ) gpu::CommandExecutor::PutChanged() 0x000000010f080b19 (Google Chrome Framework -gpu_command_buffer_stub.cc:783 ) gpu::GpuCommandBufferStub::OnAsyncFlush(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&) 0x000000010f08095a (Google Chrome Framework -tuple.h:144 ) bool IPC::MessageT<GpuCommandBufferMsg_AsyncFlush_Meta, std::__1::tuple<int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > >, void>::Dispatch<gpu::GpuCommandBufferStub, gpu::GpuCommandBufferStub, void, void (gpu::GpuCommandBufferStub::*)(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&)>(IPC::Message const*, gpu::GpuCommandBufferStub*, gpu::GpuCommandBufferStub*, void*, void (gpu::GpuCommandBufferStub::*)(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&)) 0x000000010f07f4e0 (Google Chrome Framework -gpu_command_buffer_stub.cc:243 ) gpu::GpuCommandBufferStub::OnMessageReceived(IPC::Message const&) 0x000000010f079106 (Google Chrome Framework -gpu_channel.cc:802 ) gpu::GpuChannel::HandleMessageHelper(IPC::Message const&) 0x000000010f079099 (Google Chrome Framework -gpu_channel.cc:782 ) gpu::GpuChannel::HandleMessage(scoped_refptr<gpu::GpuChannelMessageQueue> 
const&) 0x000000010e415498 (Google Chrome Framework -callback.h:47 ) base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) 0x000000010e438af5 (Google Chrome Framework -message_loop.cc:413 ) base::MessageLoop::RunTask(base::PendingTask*) 0x000000010e438dcb (Google Chrome Framework -message_loop.cc:422 ) base::MessageLoop::DeferOrRunPendingTask(base::PendingTask) 0x000000010e439112 (Google Chrome Framework -message_loop.cc:515 ) base::MessageLoop::DoWork() 0x000000010e43b79c (Google Chrome Framework -message_pump_mac.mm:330 ) base::MessagePumpCFRunLoopBase::RunWork() 0x000000010e42dee9 (Google Chrome Framework + 0x0184eee9 ) base::mac::CallWithEHFrame(void () block_pointer) 0x000000010e43b1b3 (Google Chrome Framework -message_pump_mac.mm:306 ) base::MessagePumpCFRunLoopBase::RunWorkSource(void*) 0x00007fff89937880 (CoreFoundation + 0x000aa880 ) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ 0x00007fff89916fbb (CoreFoundation + 0x00089fbb ) __CFRunLoopDoSources0 0x00007fff899164de (CoreFoundation + 0x000894de ) __CFRunLoopRun 0x00007fff89915ed7 (CoreFoundation + 0x00088ed7 ) CFRunLoopRunSpecific 0x000000010e43bb7e (Google Chrome Framework -message_pump_mac.mm:554 ) base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*) 0x000000010e43b5fb (Google Chrome Framework -message_pump_mac.mm:238 ) base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) 0x000000010e456802 (Google Chrome Framework -run_loop.cc:35 ) base::RunLoop::Run() 0x0000000110bc3cc4 (Google Chrome Framework -gpu_main.cc:288 ) content::GpuMain(content::MainFunctionParams const&) 0x000000010dfc527c (Google Chrome Framework -content_main_runner.cc:776 ) content::ContentMainRunnerImpl::Run() 0x000000010dfc4505 (Google Chrome Framework -content_main.cc:20 ) content::ContentMain(content::ContentMainParams const&) 0x000000010cbe1bab (Google Chrome Framework -chrome_main.cc:97 ) ChromeMain 0x000000010c9a8d69 (Google Chrome Helper -chrome_exe_main_mac.c:85 ) main 
0x00007fff97e565ac (libdyld.dylib + 0x000035ac ) 0x00007fff97e565ac (libdyld.dylib + 0x000035ac ) Thread 2 CRASHED [EXC_BAD_ACCESS / KERN_INVALID_ADDRESS @ 0x00000000 ] Stack Quality90%Show frame trust levels 0x000000010f0858a3 (Google Chrome Framework -gpu_watchdog_thread.cc:377 ) gpu::GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang() 0x000000010e415498 (Google Chrome Framework -callback.h:47 ) base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) 0x000000010e438af5 (Google Chrome Framework -message_loop.cc:413 ) base::MessageLoop::RunTask(base::PendingTask*) 0x000000010e438dcb (Google Chrome Framework -message_loop.cc:422 ) base::MessageLoop::DeferOrRunPendingTask(base::PendingTask) 0x000000010e4392fc (Google Chrome Framework -message_loop.cc:554 ) base::MessageLoop::DoDelayedWork(base::TimeTicks*) 0x000000010e43b7b8 (Google Chrome Framework -message_pump_mac.mm:334 ) base::MessagePumpCFRunLoopBase::RunWork() 0x000000010e42dee9 (Google Chrome Framework + 0x0184eee9 ) base::mac::CallWithEHFrame(void () block_pointer) 0x000000010e43b1b3 (Google Chrome Framework -message_pump_mac.mm:306 ) base::MessagePumpCFRunLoopBase::RunWorkSource(void*) 0x00007fff89937880 (CoreFoundation + 0x000aa880 ) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ 0x00007fff89916fbb (CoreFoundation + 0x00089fbb ) __CFRunLoopDoSources0 0x00007fff899164de (CoreFoundation + 0x000894de ) __CFRunLoopRun 0x00007fff89915ed7 (CoreFoundation + 0x00088ed7 ) CFRunLoopRunSpecific 0x000000010e43bb7e (Google Chrome Framework -message_pump_mac.mm:554 ) base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*) 0x000000010e43b5fb (Google Chrome Framework -message_pump_mac.mm:238 ) base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) 0x000000010e456802 (Google Chrome Framework -run_loop.cc:35 ) base::RunLoop::Run() 0x000000010e480f31 (Google Chrome Framework -thread.cc:333 ) base::Thread::ThreadMain() 0x000000010e47c926 (Google Chrome Framework 
-platform_thread_posix.cc:71 ) base::(anonymous namespace)::ThreadFunc(void*) 0x00007fff9857d99c (libsystem_pthread.dylib + 0x0000399c ) _pthread_body 0x00007fff9857d919 (libsystem_pthread.dylib + 0x00003919 ) _pthread_start 0x00007fff9857b350 (libsystem_pthread.dylib + 0x00001350 ) thread_start 0x000000010e47c8cf (Google Chrome Framework + 0x0189d8cf ) This crash is seen in latest builds as below 56.0.2911.0 0.03% 3 56.0.2910.0 0.08% 7 from 6 different client Ids 56.0.2909.0 0.23% 21 56.0.2908.0 0.08% 7 56.0.2907.0 0.27% 24 56.0.2906.0 0.54% 48 from 38 different client Ids 56.0.2905.0 0.08% 7 56.0.2904.0 0.03% 3 56.0.2903.0 0.26% 23 56.0.2900.0 0.02% 2 56.0.2899.0 0.01% 1 56.0.2897.0 0.01% 1 56.0.2895.0 0.01% 1 56.0.2891.0 0.02% 2 56.0.2890.0 0.01% 1 56.0.2887.0 0.02% 2 56.0.2886.0 0.01% 1 55.0.2883.35 0.09% 8 from 3 different client Ids 55.0.2883.28 0.19% 17 55.0.2883.21 0.10% 9 55.0.2883.11 0.01% 1 55.0.2883.4 0.01% 1 55.0.2882.0 0.02% 2 55.0.2881.0 0.03% 3 55.0.2880.0 0.02% 2 55.0.2879.0 0.03% 3 55.0.2875.0 0.02% 2 55.0.2873.4 0.01% 1 55.0.2868.0 0.02% 2 55.0.2867.0 0.03% 3 55.0.2865.0 0.01% 1 55.0.2860.0 0.01% 1 55.0.2859.0 0.03% 3 55.0.2858.0 0.02% 2 55.0.2857.0 0.02% 2 55.0.2853.0 0.04% 4 55.0.2850.0 0.01% 1 55.0.2849.0 0.02% 2 55.0.2848.0 0.01% 1 55.0.2847.0 0.01% 1 55.0.2845.0 0.01% 1 55.0.2844.0 0.02% 2 55.0.2843.0 0.01% 1 54.0.2840.87 0.20% 18 from 18 different client Ids 54.0.2840.71 5.63% 504 54.0.2840.59 0.38% 34 54.0.2840.50 0.08% 7 54.0.2840.41 0.03% 3 54.0.2840.34 0.08% 7 54.0.2840.27 0.04% 4 54.0.2840.16 0.08% 7 54.0.2840.8 0.01% 1 Link to the builds: 
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000 CL : https://chromium.googlesource.com/chromium/src/+log/56.0.2905.0..56.0.2906.0?pretty=fuller&n=10000 Possible suspect : https://codereview.chromium.org/2456823002 Please reassign if this is not related to your change
,
Nov 7 2016
,
Nov 7 2016
This is a bug on Mac. But my CL only takes effect on Linux AMD. Unassigning me.
,
Nov 7 2016
Users experienced this crash on the following builds: Mac Dev 56.0.2906.0 - 2.57 CPM, 30 reports, 22 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Nov 9 2016
This crash is seen on latest builds as below 56.0.2913.3 0.05% 5 56.0.2912.0 0.08% 7 56.0.2906.0 0.72% 66 from 46 different client Ids 55.0.2883.35 0.14% 13 54.0.2840.87 0.36% 33
,
Nov 9 2016
Adding some Mac gurus. This seems to be the case of querying IMPLEMENTATION_COLOR_READ_FORMAT/TYPE: we first check the completeness of the bound FBO, and that check is where it crashed in the kernel.
,
Nov 9 2016
+ericrk@ ericrk@ - your cl https://codereview.chromium.org/2382573002 was listed in the suspected change log. Would you mind taking a look to see if it could be the cause of this crash?
,
Nov 9 2016
My change is unrelated to GL command issuing (and the code only runs if you are tracing), so it's probably not that. The suspected change log seems suspicious, as we seem to have an increased rate as early as 56.0.2903.0. I looked through the changes for 2902-2903, but didn't see anything obvious. I'll keep looking a bit more.
,
Nov 9 2016
Thank you for taking a look!
,
Nov 15 2016
Just to update, Latest Dev(56.0.2914.3) on Mac has reported 44 crashes from 25 clients till now. Friendly ping to get an update on this.
,
Nov 15 2016
This is all over the OS versions and GPUs (Intel, AMD, NVidia). It really looks like a MacOSX driver issue that we should file to Apple.
,
Nov 18 2016
shrike@/ericrk@: Based on C#11 shall we mark this as External Dependency then?
,
Nov 18 2016
ericrk@, ccameron@ - should this be marked an External Dependency (and a bug filed with Apple)?
,
Nov 18 2016
Well, this signature -- gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid -- isn't particularly meaningful. We're causing a hang somewhere deeper in the Apple driver, and the signatures there should vary somewhat (the one in the top of the bug is in an AMD driver). If we're going to discuss [GPU hang] issues with Apple, it shouldn't be in the context of this particular chrome signature, but rather in the context of "here are the top N signatures in Apple code". WRT spikes in this, it's probably just a shuffling from other places. Unless core profile is to be implicated.
,
Nov 19 2016
Looked at this some more - it appears that many/most of our [GPU Hang] type crashes spiked in Dev (and in canary, but it's harder to see) between the 2902.0 Dev and 2906.0 Dev. See the following links:
CheckFramebufferValid: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
glFence::Create: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gl%3A%3AGLFence%3A%3ACreate%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
HandleFlush: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3AHandleFlush%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
and a number of others which are less dramatic.
This makes me feel that somewhere in 56.0.2902.0 to 56.0.2906.0 we landed something which increased the number of GPU hangs overall. This makes me a bit more hesitant to call this an Apple bug, as it doesn't seem to correspond with a specific API call. It feels like a more systemic thing, but it doesn't seem to correspond to the two major changes I could think of (GPU raster and GL Core Profile)... From Dev, it seems like the range to look in is definitely 56.0.2902.0 to 56.0.2906.0, and from Canary, it seems very likely that it is 56.0.2902.0 to 56.0.2903.0.
,
Nov 19 2016
My assumption in #15 was that hangs in a number of different Chrome callstacks meant that this wasn't an Apple issue. However, I took a look at the Apple code being invoked from chrome in each case, and it all seems very uniform. If we look at hangs that have symbols from the dev build after the spike (56.0.2906.0), we see that 88% of all crashes end with the following few calls:
(libsystem_kernel.dylib ) mach_msg_trap
(IOKit ) io_connect_method
(IOKit ) IOConnectCallMethod
44% of these callstacks continue with:
(libsystem_kernel.dylib ) mach_msg_trap
(IOKit ) io_connect_method
(IOKit ) IOConnectCallMethod
(IOKit ) IOConnectCallStructMethod
(IOAccelerator ) IOAccelContextSubmitDataBuffersExt
This seems to indicate that there may be an IOKit issue in play here. Although it still seems like we did something in the 56.0.2906.0 timeframe to aggravate it. I'll file a radar for the above callstacks.
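The percentages above come from grouping hang reports by their innermost frames. A minimal sketch of that kind of grouping, assuming each report is available as a list of frame names ordered innermost first. The `share_with_prefix` helper and the sample reports are hypothetical and are not the crash server's API; only the frame names are taken from the stacks quoted in this comment.

```python
from typing import Iterable, Sequence

# Innermost-first frame prefixes taken from the call stacks quoted above.
IOKIT_PREFIX = ("mach_msg_trap", "io_connect_method", "IOConnectCallMethod")
SUBMIT_PREFIX = IOKIT_PREFIX + (
    "IOConnectCallStructMethod",
    "IOAccelContextSubmitDataBuffersExt",
)


def share_with_prefix(stacks: Iterable[Sequence[str]], prefix: Sequence[str]) -> float:
    """Return the fraction of stacks whose innermost frames equal `prefix`."""
    stacks = list(stacks)
    if not stacks:
        return 0.0
    wanted = tuple(prefix)
    matching = sum(1 for s in stacks if tuple(s[: len(wanted)]) == wanted)
    return matching / len(stacks)


# Made-up sample data, only to show the call shape (not real crash reports).
sample_reports = [
    list(SUBMIT_PREFIX) + ["gpu::gles2::GLES2DecoderImpl::HandleFlush"],
    list(IOKIT_PREFIX) + ["IOAccelResourceCreate", "gpusGetKernelTexture"],
    ["some_other_frame", "gl::GLFence::Create"],
    list(SUBMIT_PREFIX) + ["gl::GLFence::Create"],
]
```

Calling `share_with_prefix(sample_reports, IOKIT_PREFIX)` gives the share of all reports that end in the IOKit calls. To reproduce a "share of those" figure, filter the reports to the first prefix before measuring the second.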
,
Nov 20 2016
Users experienced this crash on the following builds: Mac Canary 57.0.2925.0 - 1.85 CPM, 5 reports, 5 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Nov 21 2016
Users experienced this crash on the following builds: Mac Beta 55.0.2883.52 - 0.15 CPM, 4 reports, 4 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Nov 24 2016
This crash has high impact on Chrome's stability. Signature: [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid. Channel: canary. Platform: mac. Labeling issue 662802 with ReleaseBlock-Dev. If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Nov 24 2016
Just to update, latest crash rates on all channels are as below 57.0.2929.4 0.01% 1 latest canary 56.0.2924.3 0.02% 2 latest dev 55.0.2883.59 0.01% 1 latest beta 54.0.2840.98 3.10% 341 latest stable Link to the list of builds https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000 Thanks!
,
Nov 28 2016
Below are crash rates on all channels 57.0.2934.0 0.06% 7 latest canary 56.0.2924.3 0.60% 67 latest dev 55.0.2883.59 0.04% 4 latest beta 54.0.2840.98 4.06% 455 latest stable https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000 Could any one from dev team please look into this issue. Thanks,
,
Nov 28 2016
We need to sort this out before M56 hits stable; the GPU crash counts are still high for M56 for the stacks mentioned in #15.
,
Nov 28 2016
Could anything have changed with crash reporting in this timeframe? I notice that the renderer crashes show an almost identical pattern of crash rate increase to GPU process, which makes me feel like this might be a more general issue. GPU CPM for several versions in the problem area: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=a5b303cfeb4c12dc6e6ac6119da2ebb9 Renderer CPM: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=bb0634f9bb03f7e2a5a170d45ee528b8
,
Nov 28 2016
Re #16, given that this increase in crashes doesn't track any OS change / etc... I'm actually still not sure a radar is appropriate. It seems like something we did. I'm committing an UMA which should help us gauge whether this is a real hang, or whether something is wrong with our hang monitoring. Also putting in a potential mitigation for a memory stomp issue that we've seen in other GPU hang crashes in the waterfall. Will try to land these today and hopefully get some more insight into this bug.
,
Nov 29 2016
Interesting that this seems to be showing up in the wild. Issue 609252 has tracked GPU process watchdog firings due to corruption of the MessageLoop's task observer list. Blocking that bug on this one.
,
Nov 30 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1d9e17fe08c0dd44fa7e525662569f921e241bfc commit 1d9e17fe08c0dd44fa7e525662569f921e241bfc Author: ericrk <ericrk@chromium.org> Date: Wed Nov 30 01:51:28 2016 Move GPU proc message loop to heap We are experiencing what appear to be memory-stomp issues in the GPU process. These issues seem to be impacting the message loop and listeners registered to it, such as the GPU watchdog thread. This change moves the message loop from the stack to a heap object as an experiment to see if it improves things. BUG= 662802 Review-Url: https://codereview.chromium.org/2540513002 Cr-Commit-Position: refs/heads/master@{#435117} [modify] https://crrev.com/1d9e17fe08c0dd44fa7e525662569f921e241bfc/content/gpu/gpu_main.cc
,
Nov 30 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f4eade7013d8327ad0bf168dc82b7e481dd33975 commit f4eade7013d8327ad0bf168dc82b7e481dd33975 Author: ericrk <ericrk@chromium.org> Date: Wed Nov 30 21:09:03 2016 Add UMA to track the duration of CheckFrambufferValid We are getting a lot of GPU watchdog hangs in this function. Add an UMA to confirm that this function is long-running, and that we don't have an issue with the watchdog. Also removes ProgramManager UMAs which were added for a similar bug that has now been resolved. BUG= 662802 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2539443003 Cr-Commit-Position: refs/heads/master@{#435415} [modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/gpu/command_buffer/service/gles2_cmd_decoder.cc [modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/gpu/command_buffer/service/program_manager.cc [modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/tools/metrics/histograms/histograms.xml
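The idea behind the commit above (time one suspect call and record how long it took) can be sketched as follows. This is a rough Python model, not the UMA macros or the code in gles2_cmd_decoder.cc: `DurationHistogram`, `bucket_for`, and the bucket edges are all hypothetical names and values chosen for illustration.

```python
import time
from collections import Counter
from contextlib import contextmanager

# Upper bucket edges in seconds (arbitrary, for illustration only).
# Anything at or above the last edge lands in an overflow bucket.
BUCKET_EDGES_S = (0.001, 0.01, 0.1, 1.0, 10.0)


def bucket_for(duration_s: float) -> str:
    """Map a duration to a coarse histogram bucket label."""
    for edge in BUCKET_EDGES_S:
        if duration_s < edge:
            return f"<{edge}s"
    return f">={BUCKET_EDGES_S[-1]}s"


class DurationHistogram:
    """Counts how many timed sections fell into each duration bucket."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests need not really wait
        self.counts = Counter()

    @contextmanager
    def timed(self):
        start = self._clock()
        try:
            yield
        finally:
            # Record the sample even if the timed section raised.
            self.counts[bucket_for(self._clock() - start)] += 1
```

Wrapping the suspect call in `with histogram.timed():` is enough to tell a genuinely long-running call apart from a monitoring problem: if the overflow bucket stays empty while the watchdog still fires, the watchdog is the more likely culprit.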
,
Dec 2 2016
Stability sheriff here--I need this assigned to someone since it looks like we want to be actively working on this. I'm going to flip this back to untriaged so the GPU triager gets this assigned. It looks like GPU hangs in CheckFramebufferValid haven't improved in 57.0.2938.0 canary which includes r435117 (and r435415). At the moment there are 15 Chrome_Mac amd64 gpu-process crashes and four of them (from four unique clients) are [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid. (FWIW, for 57.0.2937.0 that rate was 12.5%.) Here are the crashes: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27%20AND%20product.version%20%3D%20%2757.0.2938.0%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=
,
Dec 5 2016
From looking at the UMA which landed, it does appear that some number of calls to CheckFramebufferValid do take >10s, which would trip the watchdog timer. So it seems like this isn't an issue of us misreporting (as I had hoped)... Does anyone have ideas on how to investigate this? There are some command buffer changes in the blame range, but the range is quite large and it's hard to pinpoint without blindly reverting them...
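For readers unfamiliar with why a single >10s call ends in a crash report: a watchdog expects the watched thread to check in regularly and deliberately terminates the process when it does not. The sketch below models only that timing rule, with the 10-second figure taken from the comment above. `Watchdog` and `FakeClock` are hypothetical names, and this is not how `gpu::GpuWatchdogThread` is actually implemented.

```python
class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds

    def __call__(self):
        return self.now


class Watchdog:
    """Flags a hang when the watched thread has not checked in within `timeout_s`."""

    def __init__(self, timeout_s, clock):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_check_in = clock()

    def check_in(self):
        """Called by the watched thread each time it finishes a task."""
        self._last_check_in = self._clock()

    def is_hung(self):
        """Polled by the watchdog; True once more than `timeout_s` has elapsed."""
        return self._clock() - self._last_check_in > self._timeout_s


clock = FakeClock()
watchdog = Watchdog(timeout_s=10.0, clock=clock)

clock.advance(3.0)       # a normal task takes 3 s
watchdog.check_in()
hung_after_short_task = watchdog.is_hung()

clock.advance(11.0)      # one blocking driver call takes 11 s, no check-in
hung_after_long_call = watchdog.is_hung()
```

Under this model the watched thread is never "wrong" about the hang: any single call that blocks longer than the timeout is indistinguishable from a deadlock, which matches the observation that the reports are not a misreporting artifact.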
,
Dec 7 2016
Friendly ping from stability sheriff. Whoever is responsible for triaging bugs under component "Internals>GPU", can you please triage this bug? We need to find an owner.
,
Dec 7 2016
I've filed a bug with Apple, bug ID 29547096. This bug reports the Apple portion of the callstacks seen in this issue. While there may be an Apple bug, it appears that something we did has increased the frequency of the issue. The range where the rate appears to increase is:
https://chromium.googlesource.com/chromium/src/+log/56.0.2902.0..56.0.2906.0?pretty=fuller&n=10000
A CL which seems to line up with the initial spike is:
https://codereview.chromium.org/2447423002 - Handle CompressedTex{Sub}Image{2|3}D interaction with PBO.
zmo@, I know you said this looks like an Apple issue, but I'm curious whether anything in the CL listed above might have changed our GL calling pattern in a way that could cause us to hit the Apple bug or get backed up in the GL driver more frequently?
Assuming this is unrelated, other CLs that seem like they could potentially be related are:
https://codereview.chromium.org/2458943002 - Support 2D texture sub-source uploads from HTMLImageElement. - kbr@
https://codereview.chromium.org/2461023002 - command buffer: audit validation of ES3 commands (part 2) - kainino@
https://codereview.chromium.org/2458523005 - command buffer: audit validation of ES3 commands (part 1) - kainino@
And even less likely (but still GPU related and in the right range):
https://codereview.chromium.org/2461003003 - Reduce GPU mailbox size to 16 bytes - piman@
https://codereview.chromium.org/2456823002 - Remove invariant for input in fragment shader - qiankun.miao@intel.com
https://codereview.chromium.org/2454153002 - mac: Offscreen Canvas sets texture wrap to CLAMP_TO_EDGE explicitly - dongseong.hwang@intel.com
https://codereview.chromium.org/2453283002 - Allow nested state restorers in DrawingBuffer - ccameron@
,
Dec 7 2016
Interesting. CompressedTex{Sub}Image{2|3}D interaction with PBO is left untested in ES3 dEQP / WebGL2 conformance tests. Let me add some test cases and see if Mac drivers handle it fine.
That said, I doubt this function (uploading compressed textures through PBO) is used at the moment.
,
Dec 8 2016
I'm unrestricting access to this bug in order to get more eyes on it.
,
Dec 9 2016
,
Dec 13 2016
Latest crash rates on all channels are as below 57.0.2949.0 0.10% 13 56.0.2924.21 1.38% 189 56.0.2924.18 0.60% 82 55.0.2883.87 0.04% 5 Link to the list of builds https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000
,
Dec 14 2016
Thanks for the investigation. FYI: Your bug is labelled as Stable Release Block, please make sure to land the fix and get it merged into the release branch ASAP so we can take it for next week's Beta release for Desktop. Thank you!
,
Dec 14 2016
I really think the root cause is a MacOSX driver bug. That said, it seems many crashes are caused by CheckFramebufferStatus from GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE). I don't feel they were triggered by WebGL because most WebGL apps don't do such queries. However, there is a place in Skia (GrGLCaps::readPixelsSupported) where such queries are made: https://cs.chromium.org/chromium/src/third_party/skia/src/gpu/gl/GrGLCaps.cpp?rcl=1481717748&l=976 +bsalomon
,
Dec 14 2016
I think the root cause will still be there until Apple can have a fix for it, but we could probably reduce the crash rate by removing the above mentioned use case.
,
Dec 14 2016
+vmiura
,
Dec 15 2016
I looked at the crash data with erikchen@. It seems that in canary this crash began to show up in M54, when we began the rasterization finch, and it first spiked around Oct 27, when we went from 40% to 90% on canary. Also, as I explained above, this crash is definitely coming from Skia, and most likely from Rasterization using Skia. I can definitely help rewrite the GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) handling to see if we can reduce the crash rates a bit, but I am afraid this won't solve the root problem.
,
Dec 15 2016
Eric, can I assign this back to you? At this point, I am pretty confident that Rasterization is triggering this crash/hang spike.
,
Dec 15 2016
bsalomon@: I wonder if we could speculatively make GrGLGpu::readPixelsSupported(GrRenderTarget* target, GrPixelConfig readConfig) use GrGLGpu::readPixelsSupported(GrPixelConfig rtConfig, GrPixelConfig readConfig), so that we make the query on a temporary render target instead of whatever the requester target may be? The theory is we may be making a query against an IOSurface that's already locked for writing.
,
Dec 15 2016
Sure, I can make that change.
,
Dec 15 2016
Issue 672365 has been merged into this issue.
,
Dec 15 2016
The following revision refers to this bug: https://skia.googlesource.com/skia.git/+/625cd9e0c9379b45c7f3100677eefcf5e241d032 commit 625cd9e0c9379b45c7f3100677eefcf5e241d032 Author: Brian Salomon <bsalomon@google.com> Date: Thu Dec 15 14:35:19 2016 Workaround freeze on Mac Chrome when checking read pixel config support. Chromium may ask us to read back from locked IOSurfaces. Calling the command buffer's glGetIntegerv() with GL_IMPLEMENTATION_COLOR_READ_FORMAT/_TYPE causes the command buffer to make a call to check the framebuffer status which can hang the driver. So in Mac Chromium we always use a temporary surface to test for glReadPixels format/type support. BUG= chromium:662802 Change-Id: I034e24faf3d780b6243f95af66d03dd68e12633c Reviewed-on: https://skia-review.googlesource.com/6113 Reviewed-by: Robert Phillips <robertphillips@google.com> Commit-Queue: Brian Salomon <bsalomon@google.com> [modify] https://crrev.com/625cd9e0c9379b45c7f3100677eefcf5e241d032/src/gpu/gl/GrGLGpu.cpp
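The shape of this workaround (answer "is this read format supported?" by probing a scratch surface rather than the caller's possibly locked one) can be sketched as below. This is a simplified Python model under stated assumptions, not the code in GrGLGpu.cpp: `ReadPixelsSupportCache` and `query_on_temp_target` are hypothetical names, and caching the answer per config pair is an assumption of the sketch.

```python
class ReadPixelsSupportCache:
    """Answers read-format support per (render-target config, read config) pair."""

    def __init__(self, query_on_temp_target):
        # `query_on_temp_target(rt_config, read_config) -> bool` is expected to
        # create a scratch render target of `rt_config`, run the format query
        # against it, and destroy it. It never binds the caller's own target,
        # so a locked IOSurface is never involved in the query.
        self._query = query_on_temp_target
        self._cache = {}

    def supported(self, target_config, read_config):
        """Return whether reading `read_config` from a `target_config` target works."""
        key = (target_config, read_config)
        if key not in self._cache:
            # Only the configs are passed on, never the target object itself.
            self._cache[key] = self._query(target_config, read_config)
        return self._cache[key]
```

The design point is that the result depends only on the two configs, so the specific target the caller holds is irrelevant to the answer and can be left untouched.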
,
Dec 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/23354840600e490ec876ed952a7e305fa8538df1 commit 23354840600e490ec876ed952a7e305fa8538df1 Author: skia-deps-roller <skia-deps-roller@chromium.org> Date: Thu Dec 15 16:58:00 2016 Roll src/third_party/skia/ ebccb8268..625cd9e0c (6 commits). https://skia.googlesource.com/skia.git/+log/ebccb82680fc..625cd9e0c937 $ git log ebccb8268..625cd9e0c --date=short --no-merges --format='%ad %ae %s' 2016-12-15 bsalomon Workaround freeze on Mac Chrome when checking read pixel config support. 2016-12-15 bsalomon Rename NVPR batch->op and sk_sp'ify 2016-12-14 raftias Added optimized sRGB/2.2 gamma stages into A2B color xform 2016-12-15 robertphillips Add a deferred copy surface (take 3) 2016-12-15 caryclark speculative pointer to member fix 2016-12-14 bsalomon Even more batch->op and sk_sp'ification. BUG= 662802 ,674047 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel TBR=msarett@google.com Review-Url: https://codereview.chromium.org/2583513002 Cr-Commit-Position: refs/heads/master@{#438850} [modify] https://crrev.com/23354840600e490ec876ed952a7e305fa8538df1/DEPS
,
Dec 15 2016
Thanks Brian for the quick action. This needs to be merged back to M56 to see if the crash rate with this signature goes down.
,
Dec 15 2016
Sure, requesting the merge.
,
Dec 15 2016
[Automated comment] DEPS changes referenced in bugdroid comments, needs manual review.
,
Dec 15 2016
I ended up adding additional logging to Chrome to find out how frequently we were calling the crashing code above. Interestingly, it looks like:
- We almost never hit this code from Skia - I added a log statement where Skia calls this code, and was unable to hit it after browsing a good number of sites (with GPU raster enabled or disabled).
- We much more frequently hit the code from the display compositor - at least 2 times per page load, from https://cs.chromium.org/chromium/src/components/display_compositor/gl_helper_readback_support.cc?rcl=0&l=92
Interestingly, we actually call glCheckFramebufferStatus a *lot* more than this (multiple times per page update), but through other paths than GetIntegerv/GetHelper. What's interesting here is that the callstacks are almost exclusively showing the usage through GetHelper/GetIntegerv. This makes me think that the display compositor use pointed out above is somehow more problematic than other uses of the same call (maybe it's operating on an IOSurface while the others are on intermediate GL textures?).
,
Dec 15 2016
Eric, thanks for the digging. Code search failed to identify the display compositor use case. In this situation, I think the immediate step is still for me to change the command buffer GetIntegerv handling to avoid triggering CheckFramebufferStatus() and merge that back to M56. At the same time, someone who's familiar with the display compositor and Mac should keep digging.
,
Dec 16 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a11bcfb22a66f56d8a885efd89e7979d60d638b4 commit a11bcfb22a66f56d8a885efd89e7979d60d638b4 Author: zmo <zmo@chromium.org> Date: Fri Dec 16 22:00:59 2016 Change GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) behavior. 1) Clean up a bit mess with framebuffer target 2) On desktop GL, no longer query the driver for these two enums. The drivers won't provide the answers anyway. Instead, use internal logic to determine the format/type BUG= 662802 TEST=gpu_unittests,webgl_conformance R=vmiura@chromium.org,ericrk@chromium.org NOTRY=true CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2577293002 Cr-Commit-Position: refs/heads/master@{#439206} [modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/content/test/gpu/gpu_tests/webgl2_conformance_expectations.py [modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/gpu/command_buffer/service/gles2_cmd_decoder.cc [modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/gpu/command_buffer/service/gles2_cmd_decoder_unittest_framebuffers.cc
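The second point of the commit message above (use internal logic instead of a driver query) can be sketched roughly as follows. This is a minimal illustration, not the code from the CL: `ChooseReadFormatAndType`, `ReadFormatAndType`, and the `k...` constants are hypothetical names, the constant values are copied from the standard GL headers, and the mapping itself (RGBA/FLOAT for a float color buffer, RGBA/UNSIGNED_BYTE otherwise) is an assumption about what such logic might return.

```cpp
#include <cstdint>

// GL enum values as defined in the standard GL headers.
// The names are hypothetical stand-ins so the sketch is self-contained.
constexpr uint32_t kRGBA = 0x1908;          // GL_RGBA
constexpr uint32_t kUnsignedByte = 0x1401;  // GL_UNSIGNED_BYTE
constexpr uint32_t kFloat = 0x1406;         // GL_FLOAT
constexpr uint32_t kRGB8 = 0x8051;          // GL_RGB8
constexpr uint32_t kRGBA8 = 0x8058;         // GL_RGBA8
constexpr uint32_t kRGBA32F = 0x8814;       // GL_RGBA32F

struct ReadFormatAndType {
  uint32_t format;
  uint32_t type;
};

// Derive a read format/type purely from the internal format of the bound
// read buffer, without calling into the driver (so nothing here can
// trigger a framebuffer status check).
inline ReadFormatAndType ChooseReadFormatAndType(uint32_t internal_format) {
  switch (internal_format) {
    case kRGBA32F:
      return {kRGBA, kFloat};
    case kRGBA8:
    case kRGB8:
    default:
      // Assumed fallback for normalized fixed-point color buffers.
      return {kRGBA, kUnsignedByte};
  }
}
```

The point of the design is that the answer depends only on state the decoder already tracks, so answering the query never needs to reach the driver.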
,
Dec 16 2016
I want to merge the CL that landed in #53 to see if this can reduce the crash rate.
,
Dec 16 2016
,
Dec 16 2016
[Automated comment] DEPS changes referenced in bugdroid comments, needs manual review.
,
Dec 16 2016
Sorry, we only want to merge the change in #53, so there is no DEPS change
,
Dec 19 2016
This is the #1 crash on Mac beta 56.0.2924.28. The latest crash rates are below.
57.0.2953.0 0.16% 23
57.0.2950.4 0.48% 70
56.0.2924.28 1.35% 198
55.0.2883.95 0.47% 69
Link to the list of builds: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000
,
Dec 19 2016
Crash not seen in 57.0.2954.0+, after the change in #53 landed. Unfortunately, other hang rates still seem elevated... We'll need to wait a few days to see if the overall GPU proc CPM goes down, or if patching the issue here has just moved crashes to other sites.
,
Dec 20 2016
Taking myself off assignment since I'm not actively looking at this, but feel free to put me back on if there is something to do on the Skia side to mitigate this.
,
Dec 21 2016
This change meets the bar and is approved for merge into M56
,
Dec 22 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820 commit ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820 Author: Zhenyao Mo <zmo@chromium.org> Date: Thu Dec 22 00:23:44 2016 Change GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) behavior. 1) Clean up a bit mess with framebuffer target 2) On desktop GL, no longer query the driver for these two enums. The drivers won't provide the answers anyway. Instead, use internal logic to determine the format/type BUG= 662802 TEST=gpu_unittests,webgl_conformance R=vmiura@chromium.org,ericrk@chromium.org NOTRY=true CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2577293002 Cr-Commit-Position: refs/heads/master@{#439206} (cherry picked from commit a11bcfb22a66f56d8a885efd89e7979d60d638b4) Review-Url: https://codereview.chromium.org/2594233002 . Cr-Commit-Position: refs/branch-heads/2924@{#592} Cr-Branched-From: 3a87aecc31cd1ffe751dd72c04e5a96a1fc8108a-refs/heads/master@{#433059} [modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/content/test/gpu/gpu_tests/webgl2_conformance_expectations.py [modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/gpu/command_buffer/service/gles2_cmd_decoder.cc [modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/gpu/command_buffer/service/gles2_cmd_decoder_unittest_framebuffers.cc
,
Jan 3 2017
An update on this crash on the latest channels:
57.0.2950.4 3.38% 590 - Latest Dev
56.0.2924.28 9.15% 1599 - Latest Beta
Latest beta has 1599 crashes from 423 unique client IDs. Latest dev has 590 crashes from 124 unique client IDs.
Link to the list of builds: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000
zmo@ - Could you please take a look at this issue? Thanks!
,
Jan 3 2017
+jbauman@, +stanisc@ Re #15: Stability sheriff here, there are also spikes on gpu watchdog kills on Windows around the 2902-2906 timeframe (issue 617977). Perhaps something changed that directly affects the GPU watchdog? Queries: Mac: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20OMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName%3D%27gpu%3A%3AGpuWatchdogThread%3A%3ADeliberatelyTerminateToRecoverFromHang()%27)%20%3D%200&ignore_case=false&enable_rewrite=true&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=gpu%3A%3AGpuWatchdogThread%3A%3ADeliberatelyTerminateToRecoverFromHang()&omit_field_opt=%3D#-samplereports:5,productversion:1000 Windows: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20OMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName%3D%27gpu%3A%3AGpuWatchdogThread%3A%3ADeliberatelyTerminateToRecoverFromHang()%27)%20%3D%200&ignore_case=false&enable_rewrite=true&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=gpu%3A%3AGpuWatchdogThread%3A%3ADeliberatelyTerminateToRecoverFromHang()&omit_field_opt=%3D#-samplereports:5,productversion:1000
,
Jan 3 2017
My CL merge to beta (#62) hasn't made it out into the wild yet, so I'll wait until it gets into a beta release and see how it works.
,
Jan 3 2017
Regarding comment #64, I think the reason for the spike that appears in those queries is the signature change, https://chromium.googlesource.com/chromium/src/+/2fb7e150d959023d9a793a74e30cea42798c56b4 , reaching the stable branch. Here is another query that gives a different picture and shows that the number of crashes has started to decline: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20OMIT%20RECORD%20IF%20SUM(custom_data.ChromeCrashProto.magic_signature_1.name%20CONTAINS%20%27GPU%20hang%27)%20%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=gpu%3A%3AGpuWatchdogThread%3A%3ADeliberatelyTerminateToRecoverFromHang()&omit_field_opt=%3D
,
Jan 3 2017
Re #66: Ah, that makes sense, thanks for explaining it. The spike in December seems to be just the stable promotion of that signature change.
,
Jan 3 2017
I'm removing this from the stability sheriff queue as there's progress being made by zmo@ (comment #65).
,
Jan 9 2017
The crash rate has dropped significantly in the latest M56, and the crash has not been reported in canary after 57.0.2973.0 (a 4-day-old build). The data is below.
56.0.2924.51 0.02% 3
56.0.2924.28 11.27% 2079
If there is no pending work, can we tag this as fixed?
,
Jan 9 2017
,
Mar 23 2017
Comment 1 by tkonch...@chromium.org
, Nov 7 2016