
Issue 611805

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug

Blocking:
issue 516072
issue 581777
issue 601608



Back Pressure for WebGL using CA compositor.

Project Member Reported by erikc...@chromium.org, May 13 2016

Issue description

On Mac, we currently don't have any good way of determining when the Window Server has consumed the content we have passed it, and finished executing GL commands to display/transform that content. I've done some research into the topic, but haven't made much progress. We should probably just ask Apple engineers.
https://bugs.chromium.org/p/chromium/issues/detail?id=603320

In the GL compositor, back pressure is created by waiting on the GPU process's main thread for a glFinish(). When the application is GPU bound, this will delay the SwapBuffer Ack, which will eventually delay RequestAnimationFrame from firing.

The reason that glFinish() ends up waiting for WebGL content [which is in a separate GL context], is that on Mac, GPU drivers serialize all drawing commands. The GL compositor needs to do some blits, which cannot be performed until the WebGL commands have been executed.

In the CA compositor, the glFinish() still exists, but no drawing commands are issued on the compositor's GL context any more, so the glFinish() does nothing. This removes all back pressure and causes GPU-bound WebGL content to choke.
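To make the effect of losing back pressure concrete, here is a toy timing model (a sketch, not Chromium code). It assumes a producer that needs `issue_ms` of CPU time per frame and a GPU that needs `gpu_ms` per frame; the function name, parameters, and the 10 ms issue time are hypothetical, and only the ~50 ms draw time comes from the measurements in this report.

```python
def simulate(num_frames, issue_ms, gpu_ms, max_in_flight=None):
    """Return the number of unfinished frames seen each time a frame is issued.

    max_in_flight=None models no back pressure: the producer never waits.
    Otherwise the producer blocks until fewer than max_in_flight frames
    are still pending on the GPU before it issues the next one.
    """
    now = 0.0            # producer clock, in ms
    gpu_free_at = 0.0    # time at which the GPU finishes its queued work
    finish_times = []    # completion time of every frame issued so far
    depths = []
    for _ in range(num_frames):
        now += issue_ms  # CPU time spent producing this frame
        if max_in_flight is not None:
            pending = sorted(t for t in finish_times if t > now)
            if len(pending) >= max_in_flight:
                # Block until only max_in_flight - 1 frames remain pending.
                now = pending[len(pending) - max_in_flight]
        start = max(now, gpu_free_at)   # the GPU runs frames one at a time
        gpu_free_at = start + gpu_ms
        finish_times.append(gpu_free_at)
        depths.append(sum(1 for t in finish_times if t > now))
    return depths


no_pressure = simulate(20, issue_ms=10, gpu_ms=50)
with_pressure = simulate(20, issue_ms=10, gpu_ms=50, max_in_flight=1)
print(no_pressure[-1], max(with_pressure))
```

In this model the unthrottled producer keeps piling up work (17 frames pending by the twentieth issue), while the throttled one never has more than one frame outstanding. That is the role the glFinish() played in the GL compositor.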


 
piman's suggestion was to add a fence from the WebGL context, and have the compositor wait on that fence. I tried this out, and it does do the same job as the glFinish() in terms of creating back pressure, but it also uses a CPU busy wait!

I've attached some traces to illustrate the different behaviors.

Chrome Canary (or something close to it), no CHROMIUM image (GL compositor):
Using QuartzDebug to measure FPS, it starts at around 28, and drops to 20 during difficult frames. The glFinish wall duration is ~50ms. Using Activity Monitor, CPU usage of GPU process sits at around 20% (I forget the exact percentage, but it's less than 50).

Chrome Canary (or something close to it), CHROMIUM image (CA compositor):
FPS starts at around ~25, drops to 5 over the course of 10 seconds. The glFinish wall duration is < 0.1ms. We know that it takes around 50ms to draw a frame, but the Renderer main thread is issuing frames faster than that.

Hacked Chrome Canary, CHROMIUM image with fence (CA compositor):
FPS starts at around ~30, drops to ~25 during difficult frames. The fence wall duration is ~40ms. Activity Monitor shows CPU usage of the GPU process at 100%!!


Attachments:
trace_chromium_image_canary.json.gz (930 KB)
trace_chromium_image_fence.json.gz (1.5 MB)
no_chromium_image_trace.json.gz (929 KB)

piman's suggestion of a fence wait doesn't seem great because of the CPU busy wait.

I made the suggestion: "why not just do a no-op draw command in the CA compositor to force the glFinish to wait for all draw commands?" ccameron replied that it might work. He was worried that it might affect power usage in low-power-fullscreen.

The last alternative I can think of is to do something similar to piman's suggestion, but use a glFinish instead of a fence.
An alternative to the "no-op draw command" is to somehow track "is it necessary to wait on any GPU work" (sort of like "is there any fence at all") and only then do the no-op draw and glFinish.
Also of note: if we have good work-tracking, we can skip going to the GPU process to do the Finish when we know that there is no work to finish. I'm curious whether that will help power or not.

Comment 5 by kbr@chromium.org, May 13 2016

Since it seems most necessary to do this for WebGL, it would be possible to modify DrawingBuffer::prepareMailbox (src/third_party/WebKit/Source/platform/graphics/gpu/DrawingBuffer.cpp) to drop in a server-side glFinish just before the ProduceTextureDirectCHROMIUM. We tried this with the existing client-side glFinish, and it worked to supply back-pressure, but also blocked the renderer process which was undesirable. The primitive that would be needed would be something like glAsyncFinishCHROMIUM, where the client-side returns immediately, but the service side executes the glFinish.

I like #5 as well.
I was under the impression that piman@ did not like #5 because he wanted the back pressure to be applied at a point where it directly delayed the swapbuffers ack [and hence RAF], rather than indirectly delaying the swapbuffers ack by clogging up the GPU process. 

That being said, I'll let him speak for himself. 

Comment 8 by piman@chromium.org, May 16 2016

Right, I would prefer if the glFinish happened at presentation time rather than webgl time, to prevent offscreen content from slowing down everything else. Really, glFinish is a very big hammer. Unless we can do it on a separate thread?

Comment 9 by kbr@chromium.org, May 16 2016

Separate thread on the client side or service side? Assume you meant service side. Would have to do a prototype to see if it would have the desired effect. If it did, that would be the simplest solution.

For simple WebGL content rendering into an overlay, is there any good interposition point where we could issue that glFinish at presentation time? I thought everything was under Core Animation's control in that scenario.

Comment 10 by piman@chromium.org, May 16 2016

@#9: yes I meant service side
re: presentation time, I mean the time we send it to CA. We can track the GLImages that webgl rendered to (similar to how Erik's patch would do it for using fences instead), as well as the context. Before sending it to CA, we would make that context current and issue the glFinish.

Other question: on Mac, there are two APIs we can use for fences: GL_ARB_sync and GL_APPLE_fence. We default to GL_ARB_sync because it allows server-side waits, whereas GL_APPLE_fence only allows client-side waits. For this use case we only need client-side waits, so it might be worth checking whether GL_APPLE_fence also spins. See GLFence::Create for the logic that picks one or the other.
GL_APPLE_fence is only available with the AppleGL implementation. WebGL 2 uses Core Profile. 

If we do choose to use GL_APPLE_fence, the function TestFenceAPPLE is non-blocking, so we could always implement a non-spin wait using sleep and TestFenceAPPLE.
piman's suggestion of tracking the context on the GLImage, and finishing that context at the presentation layer is reasonably simple, and will make piman@ happy, so I suggest we move forward with that? 

Comment 13 by piman@chromium.org, May 17 2016

@#11: FYI, you can also do that with GL_ARB_sync, see GLFenceARB::HasCompleted

Comment 14 by kbr@chromium.org, May 17 2016

To clarify: GL_APPLE_fence is only available in the OpenGL compatibility profile on OS X. The Core Profile doesn't expose it any more since sync objects are built in to the core API.

Here are the extensions on my machine when Chrome's run with --enable-unsafe-es3-apis:

GL_ARB_blend_func_extended GL_ARB_draw_buffers_blend GL_ARB_draw_indirect GL_ARB_ES2_compatibility GL_ARB_explicit_attrib_location GL_ARB_gpu_shader_fp64 GL_ARB_gpu_shader5 GL_ARB_instanced_arrays GL_ARB_internalformat_query GL_ARB_occlusion_query2 GL_ARB_sample_shading GL_ARB_sampler_objects GL_ARB_separate_shader_objects GL_ARB_shader_bit_encoding GL_ARB_shader_subroutine GL_ARB_shading_language_include GL_ARB_tessellation_shader GL_ARB_texture_buffer_object_rgb32 GL_ARB_texture_cube_map_array GL_ARB_texture_gather GL_ARB_texture_query_lod GL_ARB_texture_rgb10_a2ui GL_ARB_texture_storage GL_ARB_texture_swizzle GL_ARB_timer_query GL_ARB_transform_feedback2 GL_ARB_transform_feedback3 GL_ARB_vertex_attrib_64bit GL_ARB_vertex_type_2_10_10_10_rev GL_ARB_viewport_array GL_EXT_debug_label GL_EXT_debug_marker GL_EXT_framebuffer_multisample_blit_scaled GL_EXT_texture_compression_s3tc GL_EXT_texture_filter_anisotropic GL_EXT_texture_sRGB_decode GL_APPLE_client_storage GL_APPLE_container_object_shareable GL_APPLE_flush_render GL_APPLE_object_purgeable GL_APPLE_rgb_422 GL_APPLE_row_bytes GL_APPLE_texture_range GL_ATI_texture_mirror_once GL_NV_texture_barrier

Comment 15 by piman@chromium.org, May 17 2016

Right, what I meant is, the way you suggested to poll using sleep+TestFenceAPPLE, you can also poll using sleep+glClientWaitSync(, 0) (or sleep+glGetSynciv, which should work now since we don't support 10.7 any more).
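The sleep-and-poll wait discussed above can be sketched independently of GL. In this minimal sketch, `has_completed` is a hypothetical stand-in for any non-blocking query (TestFenceAPPLE, or glClientWaitSync with a zero timeout); `wait_with_sleep`, `make_fake_fence`, and the timeout handling are assumptions added for illustration, not Chromium APIs.

```python
import time


def wait_with_sleep(has_completed, sleep_s, timeout_s=1.0):
    """Poll has_completed() until it returns True, sleeping between polls.

    Returns the number of polls made. Raises TimeoutError if the query
    still reports False once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if has_completed():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("fence did not signal before the timeout")
        # Yield the CPU instead of spinning on the query.
        time.sleep(sleep_s)


def make_fake_fence(duration_s):
    """Hypothetical fence that reports completion after duration_s seconds."""
    done_at = time.monotonic() + duration_s
    return lambda: time.monotonic() >= done_at


polls = wait_with_sleep(make_fake_fence(0.02), sleep_s=0.002)
print("polls:", polls)
```

The sleep interval is the trade-off knob: a shorter sleep notices completion sooner but wakes the CPU more often.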
I tested three mechanisms for backpressure:
1) glFence + client wait.
  - 100% CPU usage, 1 idle wakeup, "300" energy impact
2) glFence + sleeping client wait (100us sleep)
  - 25% CPU usage, 4000 idle wakeups, "300" energy impact
3) glFinish (using gl compositor)
  - 25% CPU usage, 200 idle wakeups, "300" energy impact

Both "idle wakeups" and "energy impact" are in one sense totally arbitrary, and not reflective of real performance. On the other hand, OS X uses these numbers to shame processes and calls these out to users.  



Comment 17 by piman@chromium.org, May 17 2016

It sounds like 2 with a 2ms sleep would be on par with glFinish? (I may be willing to bet it's what glFinish does under the hood - that's also how often we poll fences when idle).
Yup, a 2ms sleep does the trick. 

Comment 19 by piman@chromium.org, May 17 2016

Excellent. Then I think the fence version is better, because you may get more cpu/gpu overlap, especially for things that are not rAF-throttled, and/or if there are non-webgl updates as well.

Comment 20 by kbr@chromium.org, May 18 2016

2 ms sounds like a really long time to sleep. That's 1/8 of the frame time at 60 Hz. Can this be adjusted?
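The arithmetic behind the sleep-interval trade-off can be checked with a few lines. This sketch assumes one wakeup per poll and continuous polling, so the wakeup figure is only an upper bound; the function and key names are hypothetical.

```python
def polling_cost(sleep_ms, refresh_hz=60.0):
    """Rough cost of polling a fence every sleep_ms milliseconds."""
    frame_ms = 1000.0 / refresh_hz
    return {
        # Upper bound on wakeups if the process polls all the time.
        "max_wakeups_per_s": 1000.0 / sleep_ms,
        # Worst-case extra delay before completion is noticed,
        # expressed as a fraction of one frame.
        "latency_fraction_of_frame": sleep_ms / frame_ms,
    }


print(polling_cost(0.1))  # the 100us sleep tested earlier
print(polling_cost(2.0))  # the 2ms sleep
```

With a 2 ms sleep the bound is 500 wakeups per second and the worst-case added latency is 0.12 of a 60 Hz frame, which is the "1/8" figure above. With a 100 us sleep the bound is 10,000 wakeups per second, consistent with the roughly 4000 measured earlier in the thread.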

Agree that a 2ms poll seems unacceptably long.

I still advocate for kbr's suggestion in #5 (the service-side glFinish).

We should start with the simple solution, and only add more complicated things if that appears insufficient.

Comment 22 by piman@chromium.org, May 18 2016

from #16 it's very likely that glFinish has the same 2ms granularity.

Comment 23 by kbr@chromium.org, May 18 2016

Blocking: 601608 581777

Comment 24 by kbr@chromium.org, May 18 2016

Blocked a couple of bugs on this one.

At this point, GPU-bound WebGL content is slower in Chrome on Mac than it was in previous releases which rendered into textures. How can we avoid this regression in M52? The branch point is tomorrow. Can or should we revert the default enabling of CHROMIUM_image for WebGL on the branch?

Comment 25 by kbr@chromium.org, May 18 2016

Cc: seththompson@chromium.org
CC'ing seththompson@; Seth, fixing this bug is very important for Unity's WebGL-exported content to run well on OS X.

Cc: vmi...@chromium.org sunn...@chromium.org
Today I talked with piman@, ccameron@, vmiura@, and sunnyps@ about this subject. At some point, I'm going to put together a doc that captures all the findings.

There is a medium-complexity solution which ccameron@, vmiura@ and sunnyps@ seem happy with. It doesn't block the GPU process main thread, so I expect piman@ to be happy. The details still need to be specced out.

I was hoping to up-sell piman@ on doing a service-side glFinish on the WebGL context as a short-term solution [that is a slightly better version of what the GLRenderer does today]. But given that the branch point is tomorrow, I think it will be simpler to just turn off CHROMIUM image and fall back to the GLRenderer for M52.

Comment 27 by bugdroid1@chromium.org, Jun 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7843491fb012d3d3a37f77355b92d999da732bca

commit 7843491fb012d3d3a37f77355b92d999da732bca
Author: erikchen <erikchen@chromium.org>
Date: Fri Jun 10 18:17:43 2016

WebGL: Three small fixes to Image CHROMIUM logic.

1. Use the newly added method DescheduleUntilFinishedCHROMIUM() to add back
pressure.
2. Update the color mask appropriately before calling clearFramebuffers().
3. Plumb gpuMemoryBufferId through to the mailbox.

BUG= 611805 ,  617249 ,  607130 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.win:win_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2053983002
Cr-Commit-Position: refs/heads/master@{#399231}

[modify] https://crrev.com/7843491fb012d3d3a37f77355b92d999da732bca/third_party/WebKit/Source/modules/webgl/WebGLRenderingContextBase.cpp
[modify] https://crrev.com/7843491fb012d3d3a37f77355b92d999da732bca/third_party/WebKit/Source/platform/graphics/gpu/DrawingBuffer.cpp
[modify] https://crrev.com/7843491fb012d3d3a37f77355b92d999da732bca/third_party/WebKit/Source/platform/graphics/gpu/DrawingBuffer.h

Status: Fixed (was: Assigned)
This was fixed by: https://bugs.chromium.org/p/chromium/issues/detail?id=617249#c19
Blocking: 516072
Status: Assigned (was: Fixed)
It would appear that the fix in c#29 does not work for all devices. 

Comment 31 by kbr@chromium.org, Jun 22 2016

Do you have examples of devices on which it does and does not work? Perhaps about:gpu from each?

It succeeds on the MBA with Intel HD 5000, and fails on my MBP with NVIDIA GeForce GT 750M. I've already tested that my fixes to DescheduleUntilFinishedCHROMIUM successfully fix the problem.

Comment 33 by bugdroid1@chromium.org, Jun 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4acb5991cddf0dd2a0efb392c9a5691dfd5af2d6

commit 4acb5991cddf0dd2a0efb392c9a5691dfd5af2d6
Author: erikchen <erikchen@chromium.org>
Date: Thu Jun 23 05:38:12 2016

Implement new behavior for DescheduleUntilFinishedCHROMIUM.

Previously, the command immediately descheduled the command executor until all
previously issued work had completed. This prevents pipelining of CPU-bound
decoding and GPU-bound drawing.

The new behavior deschedules the command executor until all work issued prior to
the previous call to DescheduleUntilFinishedCHROMIUM has completed.

BUG= 611805 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2096503002
Cr-Commit-Position: refs/heads/master@{#401544}

[modify] https://crrev.com/4acb5991cddf0dd2a0efb392c9a5691dfd5af2d6/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_deschedule.txt
[modify] https://crrev.com/4acb5991cddf0dd2a0efb392c9a5691dfd5af2d6/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/4acb5991cddf0dd2a0efb392c9a5691dfd5af2d6/gpu/command_buffer/service/gles2_cmd_decoder_unittest.cc
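The behavior change described in the commit message above can be illustrated with a toy timing model (a sketch, not the decoder implementation). It assumes each frame costs `cpu_ms` of command decoding followed by `gpu_ms` of drawing, and that DescheduleUntilFinishedCHROMIUM is called once per frame; `total_time`, `lag`, and the 40 ms / 50 ms figures are hypothetical.

```python
def total_time(num_frames, cpu_ms, gpu_ms, lag):
    """Time to process num_frames under a simple two-stage model.

    lag=0: each call waits for all work issued so far (old behavior).
    lag=1: each call waits only for work issued before the previous
           call (new behavior), so decoding can overlap with drawing.
    """
    cpu_now = 0.0       # command executor clock, in ms
    gpu_free_at = 0.0   # time at which the GPU finishes its queued work
    gpu_done = []       # completion time of each frame's GPU work
    for i in range(num_frames):
        cpu_now += cpu_ms                      # decode frame i
        start = max(cpu_now, gpu_free_at)      # the GPU runs frames in order
        gpu_free_at = start + gpu_ms
        gpu_done.append(gpu_free_at)
        wait_for = i - lag                     # newest frame that must be done
        if wait_for >= 0:
            # Executor stays descheduled until that frame has finished.
            cpu_now = max(cpu_now, gpu_done[wait_for])
    return max(cpu_now, gpu_free_at)


print(total_time(10, cpu_ms=40, gpu_ms=50, lag=0))  # 900.0
print(total_time(10, cpu_ms=40, gpu_ms=50, lag=1))  # 540.0
```

In this model the old behavior serializes decoding and drawing (90 ms per frame), while the new behavior lets them overlap so throughput approaches the slower stage (50 ms per frame), with at most one extra frame in flight.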


Comment 34 by bugdroid1@chromium.org, Jun 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8007bef4b45fcf90c81b1300509acfbda42cae47

commit 8007bef4b45fcf90c81b1300509acfbda42cae47
Author: erikchen <erikchen@chromium.org>
Date: Thu Jun 23 18:35:39 2016

Add an Invalidate method to GLFence.

The destructor of GLFence subclasses assumes that the context is current. This
is not necessarily the case during destruction of the GLES2DecoderImpl.

BUG= 611805 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2087403003
Cr-Commit-Position: refs/heads/master@{#401657}

[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence.cc
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence.h
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_apple.cc
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_apple.h
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_arb.cc
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_arb.h
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_nv.cc
[modify] https://crrev.com/8007bef4b45fcf90c81b1300509acfbda42cae47/ui/gl/gl_fence_nv.h


Comment 35 by bugdroid1@chromium.org, Jun 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/eb44a4aee7dc6e03ba4670509b08d038fbad2d48

commit eb44a4aee7dc6e03ba4670509b08d038fbad2d48
Author: erikchen <erikchen@chromium.org>
Date: Thu Jun 23 19:45:17 2016

Add a call to DescheduleUntilFinishedCHROMIUM to WebGL.

It was recently removed in https://codereview.chromium.org/2062813003/ because I
thought that the new implementation to
ImageTransportSurfaceOverlayMac::ClientWait correctly waited for the WebGL
context's work to finish. That is only the case on some macOS/gpu
configurations.

BUG= 611805 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2093533002
Cr-Commit-Position: refs/heads/master@{#401688}

[modify] https://crrev.com/eb44a4aee7dc6e03ba4670509b08d038fbad2d48/gpu/ipc/service/gpu_command_buffer_stub.cc
[modify] https://crrev.com/eb44a4aee7dc6e03ba4670509b08d038fbad2d48/third_party/WebKit/Source/platform/graphics/gpu/DrawingBuffer.cpp

Status: Fixed (was: Assigned)

Comment 37 Deleted
