
Issue 662802 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug-Regression

Blocking:
issue 609252
issue 622813




Chrome_Mac: Crash Report - [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid

Project Member Reported by tkonch...@chromium.org, Nov 7 2016

Issue description

Product name: Chrome_Mac
Magic Signature: [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid

Current link:
https://crash.corp.google.com/browse?q=product.name%3D'Chrome_Mac'%20AND%20product.version%3D'56.0.2906.0'%20AND%20cpu.Architecture%3D'amd64'%20AND%20custom_data.ChromeCrashProto.ptype%3D'gpu-process'%20AND%20ReportID%3D'0c427d4700000000'%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D'%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid'&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#3


Search properties:
product.name: Chrome_Mac
product.version: 56.0.2906.0
cpu.architecture: amd64
custom_data.chromecrashproto.ptype: gpu-process
reportid: 0c427d4700000000

Metadata :
Product Name: Chrome_Mac
Product Version: 56.0.2906.0
Report ID: 0c427d4700000000
Report Time: Tue, 01 Nov 2016 21:30:51 GMT
Uptime: 3682000 ms
Cumulative Uptime: 0 ms
User Email: 
OS Name: Mac OS X
OS Version: 10.11.6 15G1004
CPU Architecture: amd64
CPU Info: family 6 model 70 stepping 1

Stack Trace:
Thread 0 MAGIC SIGNATURE THREAD
Stack Quality: 91%
0x00007fff9c833f72	(libsystem_kernel.dylib + 0x00010f72 )	
0x00007fff9d14cc20	(IOKit + 0x00065c20 )	io_connect_method
0x00007fff9d0ed12f	(IOKit + 0x0000612f )	IOConnectCallMethod
0x00007fff9c5e3876	(IOAccelerator + 0x00003876 )	IOAccelResourceCreate
0x00007fff9e02bdfd	(libGPUSupportMercury.dylib + 0x00008dfd )	gpusGetKernelTexture
0x000000011973bc3e	(AMDRadeonX4000GLDriver + 0x00085c3e )	
0x00000001197386c3	(AMDRadeonX4000GLDriver + 0x000826c3 )	
0x00007fff9e027a64	(libGPUSupportMercury.dylib + 0x00004a64 )	gldLoadFramebuffer
0x00007fff8f07250e	(GLEngine + 0x0011a50e )	gleCheckFramebufferStatus
0x00007fff8efadf19	(GLEngine + 0x00055f19 )	glCheckFramebufferStatusEXT_Exec
0x000000010ef80517	(Google Chrome Framework -gles2_cmd_decoder.cc:4248 )	gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid(gpu::gles2::Framebuffer*, unsigned int, unsigned int, char const*)
0x000000010ef85ae9	(Google Chrome Framework -gles2_cmd_decoder.cc:4295 )	gpu::gles2::GLES2DecoderImpl::GetHelper(unsigned int, int*, int*)
0x000000010ef64ea0	(Google Chrome Framework -gles2_cmd_decoder.cc:6763 )	gpu::gles2::GLES2DecoderImpl::HandleGetIntegerv(unsigned int, void const volatile*)
0x000000010ef8381f	(Google Chrome Framework -gles2_cmd_decoder.cc:5137 )	gpu::error::Error gpu::gles2::GLES2DecoderImpl::DoCommandsImpl<false>(unsigned int, void const volatile*, int, int*)
0x000000010ef416a4	(Google Chrome Framework -cmd_parser.cc:53 )	<name omitted>
0x000000010ef420f8	(Google Chrome Framework -command_executor.cc:61 )	gpu::CommandExecutor::PutChanged()
0x000000010f080b19	(Google Chrome Framework -gpu_command_buffer_stub.cc:783 )	gpu::GpuCommandBufferStub::OnAsyncFlush(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&)
0x000000010f08095a	(Google Chrome Framework -tuple.h:144 )	bool IPC::MessageT<GpuCommandBufferMsg_AsyncFlush_Meta, std::__1::tuple<int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > >, void>::Dispatch<gpu::GpuCommandBufferStub, gpu::GpuCommandBufferStub, void, void (gpu::GpuCommandBufferStub::*)(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&)>(IPC::Message const*, gpu::GpuCommandBufferStub*, gpu::GpuCommandBufferStub*, void*, void (gpu::GpuCommandBufferStub::*)(int, unsigned int, std::__1::vector<ui::LatencyInfo, std::__1::allocator<ui::LatencyInfo> > const&))
0x000000010f07f4e0	(Google Chrome Framework -gpu_command_buffer_stub.cc:243 )	gpu::GpuCommandBufferStub::OnMessageReceived(IPC::Message const&)
0x000000010f079106	(Google Chrome Framework -gpu_channel.cc:802 )	gpu::GpuChannel::HandleMessageHelper(IPC::Message const&)
0x000000010f079099	(Google Chrome Framework -gpu_channel.cc:782 )	gpu::GpuChannel::HandleMessage(scoped_refptr<gpu::GpuChannelMessageQueue> const&)
0x000000010e415498	(Google Chrome Framework -callback.h:47 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
0x000000010e438af5	(Google Chrome Framework -message_loop.cc:413 )	base::MessageLoop::RunTask(base::PendingTask*)
0x000000010e438dcb	(Google Chrome Framework -message_loop.cc:422 )	base::MessageLoop::DeferOrRunPendingTask(base::PendingTask)
0x000000010e439112	(Google Chrome Framework -message_loop.cc:515 )	base::MessageLoop::DoWork()
0x000000010e43b79c	(Google Chrome Framework -message_pump_mac.mm:330 )	base::MessagePumpCFRunLoopBase::RunWork()
0x000000010e42dee9	(Google Chrome Framework + 0x0184eee9 )	base::mac::CallWithEHFrame(void () block_pointer)
0x000000010e43b1b3	(Google Chrome Framework -message_pump_mac.mm:306 )	base::MessagePumpCFRunLoopBase::RunWorkSource(void*)
0x00007fff89937880	(CoreFoundation + 0x000aa880 )	__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
0x00007fff89916fbb	(CoreFoundation + 0x00089fbb )	__CFRunLoopDoSources0
0x00007fff899164de	(CoreFoundation + 0x000894de )	__CFRunLoopRun
0x00007fff89915ed7	(CoreFoundation + 0x00088ed7 )	CFRunLoopRunSpecific
0x000000010e43bb7e	(Google Chrome Framework -message_pump_mac.mm:554 )	base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*)
0x000000010e43b5fb	(Google Chrome Framework -message_pump_mac.mm:238 )	base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*)
0x000000010e456802	(Google Chrome Framework -run_loop.cc:35 )	base::RunLoop::Run()
0x0000000110bc3cc4	(Google Chrome Framework -gpu_main.cc:288 )	content::GpuMain(content::MainFunctionParams const&)
0x000000010dfc527c	(Google Chrome Framework -content_main_runner.cc:776 )	content::ContentMainRunnerImpl::Run()
0x000000010dfc4505	(Google Chrome Framework -content_main.cc:20 )	content::ContentMain(content::ContentMainParams const&)
0x000000010cbe1bab	(Google Chrome Framework -chrome_main.cc:97 )	ChromeMain
0x000000010c9a8d69	(Google Chrome Helper -chrome_exe_main_mac.c:85 )	main
0x00007fff97e565ac	(libdyld.dylib + 0x000035ac )	
0x00007fff97e565ac	(libdyld.dylib + 0x000035ac )	
Thread 2 CRASHED [EXC_BAD_ACCESS / KERN_INVALID_ADDRESS @ 0x00000000 ]
Stack Quality: 90%
0x000000010f0858a3	(Google Chrome Framework -gpu_watchdog_thread.cc:377 )	gpu::GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang()
0x000000010e415498	(Google Chrome Framework -callback.h:47 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
0x000000010e438af5	(Google Chrome Framework -message_loop.cc:413 )	base::MessageLoop::RunTask(base::PendingTask*)
0x000000010e438dcb	(Google Chrome Framework -message_loop.cc:422 )	base::MessageLoop::DeferOrRunPendingTask(base::PendingTask)
0x000000010e4392fc	(Google Chrome Framework -message_loop.cc:554 )	base::MessageLoop::DoDelayedWork(base::TimeTicks*)
0x000000010e43b7b8	(Google Chrome Framework -message_pump_mac.mm:334 )	base::MessagePumpCFRunLoopBase::RunWork()
0x000000010e42dee9	(Google Chrome Framework + 0x0184eee9 )	base::mac::CallWithEHFrame(void () block_pointer)
0x000000010e43b1b3	(Google Chrome Framework -message_pump_mac.mm:306 )	base::MessagePumpCFRunLoopBase::RunWorkSource(void*)
0x00007fff89937880	(CoreFoundation + 0x000aa880 )	__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
0x00007fff89916fbb	(CoreFoundation + 0x00089fbb )	__CFRunLoopDoSources0
0x00007fff899164de	(CoreFoundation + 0x000894de )	__CFRunLoopRun
0x00007fff89915ed7	(CoreFoundation + 0x00088ed7 )	CFRunLoopRunSpecific
0x000000010e43bb7e	(Google Chrome Framework -message_pump_mac.mm:554 )	base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*)
0x000000010e43b5fb	(Google Chrome Framework -message_pump_mac.mm:238 )	base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*)
0x000000010e456802	(Google Chrome Framework -run_loop.cc:35 )	base::RunLoop::Run()
0x000000010e480f31	(Google Chrome Framework -thread.cc:333 )	base::Thread::ThreadMain()
0x000000010e47c926	(Google Chrome Framework -platform_thread_posix.cc:71 )	base::(anonymous namespace)::ThreadFunc(void*)
0x00007fff9857d99c	(libsystem_pthread.dylib + 0x0000399c )	_pthread_body
0x00007fff9857d919	(libsystem_pthread.dylib + 0x00003919 )	_pthread_start
0x00007fff9857b350	(libsystem_pthread.dylib + 0x00001350 )	thread_start
0x000000010e47c8cf	(Google Chrome Framework + 0x0189d8cf )	

This crash is seen in latest builds as below
56.0.2911.0	0.03%	3	
56.0.2910.0	0.08%	7  from 6 different client Ids
56.0.2909.0	0.23%	21	
56.0.2908.0	0.08%	7	
56.0.2907.0	0.27%	24	
56.0.2906.0	0.54%	48 from 38 different client Ids	
56.0.2905.0	0.08%	7	
56.0.2904.0	0.03%	3	
56.0.2903.0	0.26%	23	
56.0.2900.0	0.02%	2	
56.0.2899.0	0.01%	1	
56.0.2897.0	0.01%	1	
56.0.2895.0	0.01%	1	
56.0.2891.0	0.02%	2	
56.0.2890.0	0.01%	1	
56.0.2887.0	0.02%	2	
56.0.2886.0	0.01%	1	
55.0.2883.35	0.09%	8 from 3 different client Ids	
55.0.2883.28	0.19%	17	
55.0.2883.21	0.10%	9	
55.0.2883.11	0.01%	1	
55.0.2883.4	0.01%	1	
55.0.2882.0	0.02%	2	
55.0.2881.0	0.03%	3	
55.0.2880.0	0.02%	2	
55.0.2879.0	0.03%	3	
55.0.2875.0	0.02%	2	
55.0.2873.4	0.01%	1	
55.0.2868.0	0.02%	2	
55.0.2867.0	0.03%	3	
55.0.2865.0	0.01%	1	
55.0.2860.0	0.01%	1	
55.0.2859.0	0.03%	3	
55.0.2858.0	0.02%	2	
55.0.2857.0	0.02%	2	
55.0.2853.0	0.04%	4	
55.0.2850.0	0.01%	1	
55.0.2849.0	0.02%	2	
55.0.2848.0	0.01%	1	
55.0.2847.0	0.01%	1	
55.0.2845.0	0.01%	1	
55.0.2844.0	0.02%	2	
55.0.2843.0	0.01%	1	
54.0.2840.87	0.20%	18   from 18 different client Ids	
54.0.2840.71	5.63%	504	
54.0.2840.59	0.38%	34	
54.0.2840.50	0.08%	7	
54.0.2840.41	0.03%	3	
54.0.2840.34	0.08%	7	
54.0.2840.27	0.04%	4	
54.0.2840.16	0.08%	7	
54.0.2840.8	0.01%	1	

Link to the builds:
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000

CL : https://chromium.googlesource.com/chromium/src/+log/56.0.2905.0..56.0.2906.0?pretty=fuller&n=10000

Possible suspect : https://codereview.chromium.org/2456823002

Please reassign if this is not related to your change


 
Cc: piman@chromium.org
Cc: kainino@chromium.org
Cc: qiankun....@intel.com
Owner: ----
Status: Available (was: Assigned)
This is a bug on Mac. But my CL only takes effect on Linux AMD. Unassigning me.
Project Member

Comment 4 by sheriffbot@chromium.org, Nov 7 2016

Labels: FoundIn-M-56 Fracas
Users experienced this crash on the following builds:

Mac Dev 56.0.2906.0 -  2.57 CPM, 30 reports, 22 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
This crash is seen on latest builds as below

56.0.2913.3	0.05%	5	
56.0.2912.0	0.08%	7	
56.0.2906.0	0.72%	66	from 46 different client Ids
55.0.2883.35	0.14%	13	
54.0.2840.87	0.36%	33


Comment 6 by zmo@chromium.org, Nov 9 2016

Cc: erikc...@chromium.org ccameron@chromium.org
Adding some Mac gurus.

This seems to happen when querying IMPLEMENTATION_COLOR_READ_FORMAT/TYPE: we first check the bound FBO's completeness, and that check is where it got stuck in the kernel.
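
The call path described in this comment can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the Chromium code: `FakeDriver`, `Decoder`, and `GetReadFormat` are hypothetical stand-ins, and the returned `GL_RGBA` is a placeholder answer. The only part taken from the stack trace is the ordering: the decoder checks framebuffer completeness, which calls into the driver, before it answers the query.

```cpp
#include <cassert>

// Enum values as defined in the OpenGL ES headers.
constexpr unsigned kGLFramebufferComplete = 0x8CD5;  // GL_FRAMEBUFFER_COMPLETE
constexpr unsigned kGLRGBA = 0x1908;                 // GL_RGBA

// Hypothetical stand-in for the platform GL driver.
struct FakeDriver {
  int status_checks = 0;
  unsigned status = kGLFramebufferComplete;

  // In the reported hangs, the real driver call blocked at this point.
  unsigned CheckFramebufferStatus() {
    ++status_checks;
    return status;
  }
};

// Hypothetical decoder that validates the bound framebuffer before answering.
class Decoder {
 public:
  explicit Decoder(FakeDriver* driver) : driver_(driver) {}

  // Writes a read format to *format and returns true, or returns false if
  // the bound framebuffer is not complete.
  bool GetReadFormat(unsigned* format) {
    assert(format != nullptr);
    if (driver_->CheckFramebufferStatus() != kGLFramebufferComplete)
      return false;
    *format = kGLRGBA;  // placeholder answer for this sketch
    return true;
  }

 private:
  FakeDriver* driver_;
};
```

Every call to `GetReadFormat` costs one driver round trip in this model, which is why a frequent caller of the query would show up at the top of the hang reports.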
Cc: ericrk@chromium.org
+ericrk@

ericrk@ - your cl https://codereview.chromium.org/2382573002 was listed in the suspected change log. Would you mind taking a look to see if it could be the cause of this crash?

My change is unrelated to GL command issuing (and the code is only run if you are tracing), so probably not that. The suspected change log itself seems questionable, as we seem to have an increased rate as early as 56.0.2903.0. I looked through the changes for 2902-2903 but didn't see anything obvious. I'll keep looking a bit more.
Thank you for taking a look!

Comment 10 by ajha@chromium.org, Nov 15 2016

Just to update: the latest Dev (56.0.2914.3) on Mac has reported 44 crashes from 25 clients so far.

Friendly ping to get an update on this.

Comment 11 by zmo@chromium.org, Nov 15 2016

This is all over the OS versions and GPUs (Intel, AMD, NVidia). It really looks like a MacOSX driver issue that we should file to Apple.

Comment 12 by ajha@chromium.org, Nov 18 2016

shrike@/ericrk@: Based on comment 11, shall we mark this as External Dependency then?
ericrk@, ccameron@ - should this be marked an External Dependency (and a bug filed with Apple)?
Well, this signature -- gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid -- isn't particularly meaningful. We're causing a hang somewhere deeper in the Apple driver, and the signatures there should vary somewhat (the one in the top of the bug is in an AMD driver).

If we're going to discuss [GPU hang] issues with Apple, it shouldn't be in the context of this particular chrome signature, but rather in the context of "here are the top N signatures in Apple code".

WRT spikes in this, it's probably just a shuffling from other places. Unless core profile is to be implicated.
Looked at this some more. It appears that many or most of our [GPU hang] crashes spiked in Dev (and in Canary, but it's harder to see there) between 2902.0 Dev and 2906.0 Dev. See the following links:

CheckFramebufferValid: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

glFence::Create - https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gl%3A%3AGLFence%3A%3ACreate%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

HandleFlush: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20%20AND%20(product.Version%20CONTAINS%20%2755.0.%27%20OR%20product.Version%20CONTAINS%20%2756.0.%27)%20AND%20custom_data.ChromeCrashProto.channel%3D%27dev%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3AHandleFlush%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

and a number of others which are less dramatic.

This makes me feel that somewhere in 56.0.2902.0 to 56.0.2906.0 we landed something which increased the number of GPU hangs overall. This makes me a bit more hesitant to call this an Apple bug, as it doesn't seem to correspond with a specific API call. It feels like a more systemic thing, but it doesn't seem to correspond to the two major changes I could think of (GPU raster and GL Core Profile)...

From Dev, it seems like the range to look in is definitely 56.0.2902.0 to 56.0.2906.0, and from Canary, it seems very likely that it is 56.0.2902.0 to 56.0.2903.0.
My assumption in #15 was that hangs in a number of different Chrome callstacks meant that this wasn't an Apple issue. However, I took a look at the Apple code being invoked from chrome in each case, and it all seems very uniform. 

If we look at hangs that have symbols from the dev build after the spike (56.0.2906.0), we see that 88% of all crashes end with the following few calls:

(libsystem_kernel.dylib )       mach_msg_trap
(IOKit )                        io_connect_method
(IOKit )                        IOConnectCallMethod

44% of these callstacks continue with:

(libsystem_kernel.dylib )  mach_msg_trap
(IOKit )                   io_connect_method
(IOKit )                   IOConnectCallMethod
(IOKit )                   IOConnectCallStructMethod
(IOAccelerator )           IOAccelContextSubmitDataBuffersExt

This seems to indicate that there may be an IOKit issue in play here. Although it still seems like we did something in the 56.0.2906.0 timeframe to aggravate it. I'll file a radar for the above callstacks.
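
The percentages quoted above come from grouping crash stacks by their innermost frames. A minimal sketch of that grouping is below; it assumes each stack is available as a list of frame names ordered innermost first, and `Stack`, `StartsWith`, and `FractionMatching` are hypothetical names, not part of any crash tooling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Stack = std::vector<std::string>;  // frame names, innermost first

// Returns true if `stack` begins with every frame in `prefix`, in order.
bool StartsWith(const Stack& stack, const Stack& prefix) {
  if (stack.size() < prefix.size()) return false;
  for (std::size_t i = 0; i < prefix.size(); ++i) {
    if (stack[i] != prefix[i]) return false;
  }
  return true;
}

// Fraction (0.0 to 1.0) of `stacks` whose innermost frames match `prefix`.
// Returns 0.0 for an empty input.
double FractionMatching(const std::vector<Stack>& stacks, const Stack& prefix) {
  if (stacks.empty()) return 0.0;
  std::size_t hits = 0;
  for (const Stack& s : stacks) {
    if (StartsWith(s, prefix)) ++hits;
  }
  return static_cast<double>(hits) / static_cast<double>(stacks.size());
}
```

Calling `FractionMatching` once with the three-frame prefix and once with the five-frame prefix would give the two ratios discussed in the comment, provided the second is computed over the stacks that matched the first.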
Project Member

Comment 17 by sheriffbot@chromium.org, Nov 20 2016

Labels: FoundIn-M-57
Users experienced this crash on the following builds:

Mac Canary 57.0.2925.0 -  1.85 CPM, 5 reports, 5 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Project Member

Comment 18 by sheriffbot@chromium.org, Nov 21 2016

Labels: FoundIn-M-55
Users experienced this crash on the following builds:

Mac Beta 55.0.2883.52 -  0.15 CPM, 4 reports, 4 clients (signature [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Project Member

Comment 19 by sheriffbot@chromium.org, Nov 24 2016

Labels: ReleaseBlock-Dev
This crash has high impact on Chrome's stability.
Signature: [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid.
Channel: canary. Platform: mac.
Labeling  issue 662802  with ReleaseBlock-Dev.


If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Labels: -ReleaseBlock-Dev Stability-Sheriff-Desktop
We need to sort this out before M56 hits stable; the GPU crash counts are still high for M56 for the stacks mentioned in #15.
Could anything have changed with crash reporting in this timeframe? I notice that the renderer crashes show an almost identical pattern of crash rate increase to GPU process, which makes me feel like this might be a more general issue.

GPU CPM for several versions in the problem area: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=a5b303cfeb4c12dc6e6ac6119da2ebb9

Renderer CPM: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=bb0634f9bb03f7e2a5a170d45ee528b8

Re #16, given that this increase in crashes doesn't track any OS change, I'm actually still not sure a radar is appropriate. It seems like something we did.

I'm committing an UMA which should help us gauge whether this is a real hang, or whether something is wrong with our hang monitoring. I'm also putting in a potential mitigation for a memory stomp issue that we've seen in other GPU hang crashes in the waterfall. I will try to land these today and hopefully get some more insight into this bug.

Comment 25 by kbr@chromium.org, Nov 29 2016

Blocking: 609252
Interesting that this seems to be showing up in the wild.  Issue 609252  has tracked GPU process watchdog firings due to corruption of the MessageLoop's task observer list. Blocking that bug on this one.

Project Member

Comment 26 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1d9e17fe08c0dd44fa7e525662569f921e241bfc

commit 1d9e17fe08c0dd44fa7e525662569f921e241bfc
Author: ericrk <ericrk@chromium.org>
Date: Wed Nov 30 01:51:28 2016

Move GPU proc message loop to heap

We are experiencing what appear to be memory-stomp issues in the GPU
process. These issues seem to be impacting the message loop and
listeners registered to it, such as the GPU watchdog thread. This
change moves the message loop from the stack to a heap object as an
experiment to see if it improves things.

BUG= 662802 

Review-Url: https://codereview.chromium.org/2540513002
Cr-Commit-Position: refs/heads/master@{#435117}

[modify] https://crrev.com/1d9e17fe08c0dd44fa7e525662569f921e241bfc/content/gpu/gpu_main.cc
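
The shape of the change described in this commit message can be illustrated as follows. This is only a sketch: `FakeMessageLoop` and `CreateMainLoopOnHeap` are hypothetical names, and the reasoning in the comments (that a heap allocation is less exposed to stray writes into the stack frame) is an assumption about the experiment's premise, not something the commit states.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the GPU process's main message loop.
struct FakeMessageLoop {
  explicit FakeMessageLoop(std::string n) : name(std::move(n)) {}
  std::string name;
};

// Before (stack): the loop object shares a stack frame with other locals.
//   FakeMessageLoop loop("gpu_main");
//
// After (heap): the loop object gets its own allocation, and only the owning
// pointer stays in the stack frame.
std::unique_ptr<FakeMessageLoop> CreateMainLoopOnHeap() {
  return std::unique_ptr<FakeMessageLoop>(new FakeMessageLoop("gpu_main"));
}
```

Ownership stays with the caller through `std::unique_ptr`, so the loop's lifetime is the same as with the stack object; only its storage location changes.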

Project Member

Comment 27 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f4eade7013d8327ad0bf168dc82b7e481dd33975

commit f4eade7013d8327ad0bf168dc82b7e481dd33975
Author: ericrk <ericrk@chromium.org>
Date: Wed Nov 30 21:09:03 2016

Add UMA to track the duration of CheckFrambufferValid

We are getting a lot of GPU watchdog hangs in this function. Add an UMA
to confirm that this function is long-running, and that we don't have an
issue with the watchdog.

Also removes ProgramManager UMAs which were added for a similar bug that
has now been resolved.

BUG= 662802 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2539443003
Cr-Commit-Position: refs/heads/master@{#435415}

[modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/gpu/command_buffer/service/program_manager.cc
[modify] https://crrev.com/f4eade7013d8327ad0bf168dc82b7e481dd33975/tools/metrics/histograms/histograms.xml
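
A duration measurement like the one this commit describes is commonly done with a scoped timer. The sketch below shows the general pattern only: `ScopedDurationReporter` and its `Report` callback are hypothetical, and Chromium's actual change records into a UMA histogram rather than calling a callback.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Measures how long a scope lives and reports the elapsed time on exit.
// `Report` is a hypothetical sink standing in for a histogram.
class ScopedDurationReporter {
 public:
  using Clock = std::chrono::steady_clock;
  using Report = std::function<void(std::chrono::milliseconds)>;

  explicit ScopedDurationReporter(Report report)
      : report_(std::move(report)), start_(Clock::now()) {
    assert(report_ != nullptr);  // a sink is required
  }

  ~ScopedDurationReporter() {
    Clock::duration elapsed = Clock::now() - start_;
    report_(std::chrono::duration_cast<std::chrono::milliseconds>(elapsed));
  }

  ScopedDurationReporter(const ScopedDurationReporter&) = delete;
  ScopedDurationReporter& operator=(const ScopedDurationReporter&) = delete;

 private:
  Report report_;
  Clock::time_point start_;
};
```

Placing one instance at the top of the function being measured reports the duration on every return path, including early returns.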

Status: Untriaged (was: Available)
Stability sheriff here--I need this assigned to someone since it looks like we want to be actively working on this. I'm going to flip this back to untriaged so the GPU triager gets this assigned.

It looks like GPU hangs in CheckFramebufferValid haven't improved in 57.0.2938.0 canary, which includes r435117 (and r435415). At the moment there are 15 Chrome_Mac amd64 gpu-process crashes and four of them (from four unique clients) are [GPU hang] gpu::gles2::GLES2DecoderImpl::CheckFramebufferValid. (FWIW, for 57.0.2937.0 that rate was 12.5%.)

Here's the crashes: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27%20AND%20product.version%20%3D%20%2757.0.2938.0%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=
From looking at the UMA which landed, it does appear that some number of calls to CheckFramebufferValid do take >10s, which would trip the watchdog timer. So it seems like this isn't an issue of us misreporting (as I had hoped)...

Does anyone have ideas on how to investigate this? There are some command buffer changes in the blame range, but the range is quite large and it's hard to pinpoint without blindly reverting them...
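
For readers unfamiliar with the watchdog mentioned above, here is a minimal single-threaded model of the timeout check. It is a sketch under assumptions: `HangWatchdog` is a hypothetical class, time is passed in explicitly so the logic is deterministic, and the real GPU watchdog runs on its own thread. The 10-second threshold is the figure quoted in the comment.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of a hang watchdog. The watched thread calls
// Acknowledge() when it makes progress; the watchdog polls IsHung().
class HangWatchdog {
 public:
  using Seconds = std::chrono::seconds;

  HangWatchdog(Seconds timeout, Seconds now)
      : timeout_(timeout), last_ack_(now) {
    assert(timeout.count() > 0);
  }

  void Acknowledge(Seconds now) { last_ack_ = now; }

  // True once more than `timeout_` has passed since the last acknowledgement.
  bool IsHung(Seconds now) const { return now - last_ack_ > timeout_; }

 private:
  Seconds timeout_;
  Seconds last_ack_;
};
```

In this model, a single call that blocks the watched thread for longer than the timeout is indistinguishable from a true deadlock, which matches what the UMA data suggests is happening here.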
Friendly ping from stability sheriff.

Whoever is responsible for triaging bugs under component "Internals>GPU", can you please triage this bug?  We need to find an owner.
Owner: zmo@chromium.org
Status: Assigned (was: Untriaged)
I've filed a bug with Apple, bug ID 29547096. This bug reports the Apple portion of the callstacks seen in this issue.

While there may be an Apple bug, it appears that something we did has increased the frequency of the issue. The range where the rate appears to increase is:
https://chromium.googlesource.com/chromium/src/+log/56.0.2902.0..56.0.2906.0?pretty=fuller&n=10000

A CL which seems to line up with the initial spike is:
https://codereview.chromium.org/2447423002 - Handle CompressedTex{Sub}Image{2|3}D interaction with PBO.

zmo@, I know you said this looks like an Apple issue, but I'm curious whether anything in the CL listed above might have changed our GL calling pattern in a way that could cause us to hit the Apple bug or get backed up in the GL driver more frequently?

Assuming this is unrelated, other CLs that could potentially be related are:
https://codereview.chromium.org/2458943002 - Support 2D texture sub-source uploads from HTMLImageElement. - kbr@
https://codereview.chromium.org/2461023002 - command buffer: audit validation of ES3 commands (part 2) - kainino@
https://codereview.chromium.org/2458523005 - command buffer: audit validation of ES3 commands (part 1) - kainino@

And even less likely (but still GPU related and in the right range):
https://codereview.chromium.org/2461003003 - Reduce GPU mailbox size to 16 bytes - piman@
https://codereview.chromium.org/2456823002 - Remove invariant for input in fragment shader - qiankun.miao@intel.com
https://codereview.chromium.org/2454153002 - mac: Offscreen Canvas sets texture wrap to CLAMP_TO_EDGE explicitly - dongseong.hwang@intel.com
https://codereview.chromium.org/2453283002 - Allow nested state restorers in DrawingBuffer - ccameron@

Comment 32 by zmo@chromium.org, Dec 7 2016

Interesting.  CompressedTex{Sub}Image{2|3}D interaction with PBO is left untested in ES3 dEQP / WebGL2 conformance tests.  Let me add some test cases and see if Mac drivers handle it fine.

That said, I doubt this function (uploading compressed textures through PBO) is used at the moment.

Comment 33 by kbr@chromium.org, Dec 8 2016

Labels: -Restrict-View-Google
I'm unrestricting access to this bug in order to get more eyes on it.

Cc: yunchao...@intel.com
Thanks for the investigation.

FYI: Your bug is labelled as Stable Release Block, please make sure to land the fix and get it merged into the release branch ASAP so we can take it for next week's Beta release for Desktop. Thank you!

Comment 37 by zmo@chromium.org, Dec 14 2016

Cc: bsalomon@chromium.org
I really think the root cause is a MacOSX driver bug.  That said, it seems many crashes are caused by CheckFramebufferStatus from GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE).

I don't feel they were triggered by WebGL because most WebGL apps don't do such queries.  However, there is a place in Skia (GrGLCaps::readPixelsSupported) where such queries are made:

https://cs.chromium.org/chromium/src/third_party/skia/src/gpu/gl/GrGLCaps.cpp?rcl=1481717748&l=976

+bsalomon

Comment 38 by zmo@chromium.org, Dec 14 2016

I think the root cause will still be there until Apple can have a fix for it, but we could probably reduce the crash rate by removing the above mentioned use case.

Comment 39 by zmo@chromium.org, Dec 14 2016

Cc: vmi...@chromium.org
+vmiura

Comment 40 Deleted

Comment 41 by zmo@chromium.org, Dec 15 2016

I looked at the crash data with erikchen@.  It seems that in canary this crash began to show in M54, when we began the rasterization Finch experiment, and it first spiked around Oct 27, when we went from 40% to 90% on canary.

Also, as I explained above, this crash is definitely coming from Skia, and most likely from Rasterization using Skia.

I can definitely help rewrite the GetInteger(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) handling to see if we can reduce the crash rates a bit, but I am afraid this won't solve the root problem.

Comment 42 by zmo@chromium.org, Dec 15 2016

Owner: ericrk@chromium.org
Eric, can I assign this back to you?  At this point, I am pretty confident that Rasterization is triggering this crash/hang spike.
bsalomon@: I wonder if we could speculatively make GrGLGpu::readPixelsSupported(GrRenderTarget* target, GrPixelConfig readConfig) use GrGLGpu::readPixelsSupported(GrPixelConfig rtConfig, GrPixelConfig readConfig), so that we make the query on a temporary render target instead of whatever the requester target may be?

The theory is we may be making a query against an IOSurface that's already locked for writing.
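
The suggestion above can be sketched as follows. None of these are Skia types: `RenderTarget`, `Gpu`, and the method names are hypothetical, and the support rule inside `QueryWithTargetBound` is a placeholder. The sketch only shows the difference between querying against the caller's target and querying against a temporary with the same config.

```cpp
#include <cassert>

// Hypothetical render target description.
struct RenderTarget {
  int config;          // pixel config of the target
  bool may_be_locked;  // e.g. backed by an IOSurface in use elsewhere
};

// Hypothetical GPU wrapper that counts queries made on risky targets.
struct Gpu {
  int risky_queries = 0;

  // Stand-in for a driver query made with `target` bound.
  bool QueryWithTargetBound(const RenderTarget& target, int read_config) {
    if (target.may_be_locked) ++risky_queries;
    return target.config == read_config;  // placeholder support rule
  }

  // Current shape: query against whatever target the caller passed in.
  bool ReadPixelsSupportedDirect(const RenderTarget& target, int read_config) {
    return QueryWithTargetBound(target, read_config);
  }

  // Suggested shape: query against a fresh temporary with the same config.
  bool ReadPixelsSupportedViaTemp(const RenderTarget& target, int read_config) {
    RenderTarget temp{target.config, /*may_be_locked=*/false};
    return QueryWithTargetBound(temp, read_config);
  }
};
```

Both paths return the same answer in this model because support depends only on the config, which is the property the suggestion relies on; only the temporary path avoids touching a possibly locked surface.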
Owner: bsalomon@chromium.org
Sure, I can make that change.
Issue 672365 has been merged into this issue.
Project Member

Comment 46 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://skia.googlesource.com/skia.git/+/625cd9e0c9379b45c7f3100677eefcf5e241d032

commit 625cd9e0c9379b45c7f3100677eefcf5e241d032
Author: Brian Salomon <bsalomon@google.com>
Date: Thu Dec 15 14:35:19 2016

Workaround freeze on Mac Chrome when checking read pixel config support.

Chromium may ask us to read back from locked IOSurfaces. Calling the command buffer's
glGetIntegerv() with GL_IMPLEMENTATION_COLOR_READ_FORMAT/_TYPE causes the command buffer
to make a call to check the framebuffer status which can hang the driver. So in Mac Chromium
we always use a temporary surface to test for glReadPixels format/type support.

BUG= chromium:662802 

Change-Id: I034e24faf3d780b6243f95af66d03dd68e12633c
Reviewed-on: https://skia-review.googlesource.com/6113
Reviewed-by: Robert Phillips <robertphillips@google.com>
Commit-Queue: Brian Salomon <bsalomon@google.com>

[modify] https://crrev.com/625cd9e0c9379b45c7f3100677eefcf5e241d032/src/gpu/gl/GrGLGpu.cpp

Project Member

Comment 47 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/23354840600e490ec876ed952a7e305fa8538df1

commit 23354840600e490ec876ed952a7e305fa8538df1
Author: skia-deps-roller <skia-deps-roller@chromium.org>
Date: Thu Dec 15 16:58:00 2016

Roll src/third_party/skia/ ebccb8268..625cd9e0c (6 commits).

https://skia.googlesource.com/skia.git/+log/ebccb82680fc..625cd9e0c937

$ git log ebccb8268..625cd9e0c --date=short --no-merges --format='%ad %ae %s'
2016-12-15 bsalomon Workaround freeze on Mac Chrome when checking read pixel config support.
2016-12-15 bsalomon Rename NVPR batch->op and sk_sp'ify
2016-12-14 raftias Added optimized sRGB/2.2 gamma stages into A2B color xform
2016-12-15 robertphillips Add a deferred copy surface (take 3)
2016-12-15 caryclark speculative pointer to member fix
2016-12-14 bsalomon Even more batch->op and sk_sp'ification.

BUG= 662802 ,674047

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
TBR=msarett@google.com

Review-Url: https://codereview.chromium.org/2583513002
Cr-Commit-Position: refs/heads/master@{#438850}

[modify] https://crrev.com/23354840600e490ec876ed952a7e305fa8538df1/DEPS

Comment 48 by zmo@chromium.org, Dec 15 2016

Thanks Brian for the quick action.  This needs to be merged back to M56 to see if the crash rate with this signature goes down.
Labels: Merge-Request-56
Sure, requesting the merge.

Comment 50 by dimu@chromium.org, Dec 15 2016

Labels: -Merge-Request-56 Merge-Review-56 Hotlist-Merge-Review
[Automated comment] DEPS changes referenced in bugdroid comments, needs manual review.
I ended up adding additional logging to Chrome to find out how frequently we were calling the crashing code above. Interestingly, it looks like:

- We almost never hit this code from Skia - I added a log statement where Skia calls this code, and was unable to hit it after browsing a good number of sites (with GPU raster enabled or disabled). 

- We much more frequently hit the code from the display compositor - at least 2 times per page load, from https://cs.chromium.org/chromium/src/components/display_compositor/gl_helper_readback_support.cc?rcl=0&l=92

Interestingly, we actually call glCheckFramebufferStatus a *lot* more than this (multiple times per page update), but through other paths than GetIntegerv/GetHelper. What's interesting here is that the callstacks are almost exclusively showing the usage through GetHelper/GetIntegerv. This makes me think that the display compositor use pointed out above is somehow more problematic than other uses of the same call (maybe it's operating on an IOSurface while the others are on intermediate GL textures?).

Comment 52 by zmo@chromium.org, Dec 15 2016

Eric, thanks for the digging.  Code search failed to identify the display compositor use case.  In this situation, I think the immediate step is still for me to change the command buffer GetIntegerv handling to avoid triggering CheckFramebufferStatus() and merge back to M56. At the same time, someone who's familiar with the display compositor and Mac should keep digging.

Comment 53 by bugdroid1@chromium.org, Dec 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a11bcfb22a66f56d8a885efd89e7979d60d638b4

commit a11bcfb22a66f56d8a885efd89e7979d60d638b4
Author: zmo <zmo@chromium.org>
Date: Fri Dec 16 22:00:59 2016

Change GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) behavior.

1) Clean up a bit mess with framebuffer target
2) On desktop GL, no longer query the driver for these two enums. The drivers
   won't provide the answers anyway.  Instead, use internal logic to determine
   the format/type

BUG= 662802 
TEST=gpu_unittests,webgl_conformance
R=vmiura@chromium.org,ericrk@chromium.org
NOTRY=true
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2577293002
Cr-Commit-Position: refs/heads/master@{#439206}

[modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/content/test/gpu/gpu_tests/webgl2_conformance_expectations.py
[modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/a11bcfb22a66f56d8a885efd89e7979d60d638b4/gpu/command_buffer/service/gles2_cmd_decoder_unittest_framebuffers.cc
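
The idea in point 2 of the commit message can be sketched roughly as follows. This is not the code from gles2_cmd_decoder.cc: every type and function name below is a hypothetical stand-in, and the mapping only mirrors the format/type pairs that OpenGL ES 3.0 ReadPixels always accepts for each class of color buffer. The point is that the answer comes from state the service already tracks, so no driver query (and no framebuffer status check) is needed.

```cpp
#include <cassert>

// Hypothetical stand-ins for GL enums; real code would use GLenum values.
enum class InternalFormat { kRGBA8, kRGBA32F, kRGBA32I, kRGBA32UI };
enum class ReadFormat { kRGBA, kRGBAInteger };
enum class ReadType { kUnsignedByte, kFloat, kInt, kUnsignedInt };

struct ReadFormatType {
  ReadFormat format;
  ReadType type;
};

// Hypothetical service-side record of the bound read framebuffer.
struct TrackedReadFramebuffer {
  bool has_color_attachment;
  InternalFormat color_attachment_format;
};

// Answers the IMPLEMENTATION_COLOR_READ_FORMAT/TYPE query from tracked
// state only. Returns false if there is no color attachment to read from.
bool GetImplementationColorReadFormatType(const TrackedReadFramebuffer& fb,
                                          ReadFormatType* out) {
  if (!fb.has_color_attachment)
    return false;
  switch (fb.color_attachment_format) {
    case InternalFormat::kRGBA8:  // normalized fixed-point
      *out = ReadFormatType{ReadFormat::kRGBA, ReadType::kUnsignedByte};
      return true;
    case InternalFormat::kRGBA32F:  // floating-point
      *out = ReadFormatType{ReadFormat::kRGBA, ReadType::kFloat};
      return true;
    case InternalFormat::kRGBA32I:  // signed integer
      *out = ReadFormatType{ReadFormat::kRGBAInteger, ReadType::kInt};
      return true;
    case InternalFormat::kRGBA32UI:  // unsigned integer
      *out = ReadFormatType{ReadFormat::kRGBAInteger, ReadType::kUnsignedInt};
      return true;
  }
  return false;
}
```

A real implementation would cover many more internal formats; the four cases here are only enough to show the shape of the lookup.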

Comment 54 by zmo@chromium.org, Dec 16 2016

Labels: -Merge-Review-56 Merge-Request-56
I want to merge the CL that landed in #53 to see if this can reduce the crash rate.

Comment 55 by zmo@chromium.org, Dec 16 2016

Labels: -Hotlist-Merge-Review

Comment 56 by dimu@chromium.org, Dec 16 2016

Labels: -Merge-Request-56 Merge-Review-56 Hotlist-Merge-Review
[Automated comment] DEPS changes referenced in bugdroid comments, needs manual review.

Comment 57 by zmo@chromium.org, Dec 16 2016

Sorry, we only want to merge the change in #53, so there is no DEPS change.
Crash not seen in 57.0.2954.0+, after the change in #53 landed.

Unfortunately, other hang rates still seem elevated... We'll need to wait a few days to see if the overall GPU proc CPM goes down, or if patching the issue here has just moved crashes to other sites.
Owner: zmo@chromium.org
Taking myself off assignment since I'm not actively looking at this, but feel free to put me back on if there is something to do on the Skia side to mitigate this.
Labels: -Merge-Review-56 Merge-Approved-56
This change meets the bar and is approved for merge into M56

Comment 62 by bugdroid1@chromium.org, Dec 22 2016

Labels: -merge-approved-56 merge-merged-2924
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820

commit ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820
Author: Zhenyao Mo <zmo@chromium.org>
Date: Thu Dec 22 00:23:44 2016

Change GetIntegerv(IMPLEMENTATION_COLOR_READ_FORMAT/TYPE) behavior.

1) Clean up a bit mess with framebuffer target
2) On desktop GL, no longer query the driver for these two enums. The drivers
   won't provide the answers anyway.  Instead, use internal logic to determine
   the format/type

BUG= 662802 
TEST=gpu_unittests,webgl_conformance
R=vmiura@chromium.org,ericrk@chromium.org
NOTRY=true
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2577293002
Cr-Commit-Position: refs/heads/master@{#439206}
(cherry picked from commit a11bcfb22a66f56d8a885efd89e7979d60d638b4)

Review-Url: https://codereview.chromium.org/2594233002 .
Cr-Commit-Position: refs/branch-heads/2924@{#592}
Cr-Branched-From: 3a87aecc31cd1ffe751dd72c04e5a96a1fc8108a-refs/heads/master@{#433059}

[modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/content/test/gpu/gpu_tests/webgl2_conformance_expectations.py
[modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/ca5a6fb3e2be3f4da8329090e906c0d5fa1b7820/gpu/command_buffer/service/gles2_cmd_decoder_unittest_framebuffers.cc

Update on the current behavior of this crash on the latest channels:

57.0.2950.4	3.38%	590	- Latest Dev
56.0.2924.28	9.15%	1599	- Latest Beta

Latest beta has 1599 crashes from 423 unique client ids.
Latest dev has 590 crashes from 124 unique client ids.

Link to the list of builds:
---------------------------
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20cpu.Architecture%3D%27amd64%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27gpu-process%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BGPU%20hang%5D%20gpu%3A%3Agles2%3A%3AGLES2DecoderImpl%3A%3ACheckFramebufferValid%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000

zmo@ - Could you please take a look at this issue?

Thanks!

Comment 65 by zmo@chromium.org, Jan 3 2017

My CL merged to beta (#62) hasn't shipped in a beta release yet, so I'll wait until it does and see how it works.

Comment 67 by lfg@chromium.org, Jan 3 2017

Re #66: Ah, that makes sense, thanks for explaining it. The spike in December seems to be just the stable promotion of that signature change.

Comment 68 by lfg@chromium.org, Jan 3 2017

Labels: -Stability-Sheriff-Desktop
I'm removing this from the stability sheriff queue as there's progress being made by zmo@ (comment #65).

The crash rate has dropped significantly in the latest M56 build, and the crash has not been reported in canary after 57.0.2973.0 (a 4-day-old build). Data below:

56.0.2924.51	0.02%	3	
56.0.2924.28	11.27%	2079

If there is no pending work, can we mark this as fixed?

Comment 70 by zmo@chromium.org, Jan 9 2017

Status: Fixed (was: Assigned)

Comment 71 by kbr@chromium.org, Mar 23 2017

Blocking: 622813
