New issue
Advanced search Search tips

Issue 603407 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android , Chrome
Pri: 3
Type: Bug



Sign in to add a comment

GPU process spends large amount of GPU rasterization time in redundant MakeCurrent state restoring when using virtualized GL contexts

Project Member Reported by kkinnu...@nvidia.com, Apr 14 2016

Issue description

GPU process spends large amount of time in MakeCurrent state restoring when using virtualized GL contexts. Often this state restore is redundant for GPU rasterization, as Skia resets its expected GL state for each tile.

Biggest bottleneck is binding the framebuffer. While optimized, it is a rather expensive operation in GL drivers.

The MakeCurrent is called always when command buffer starts to execute the command stream or when command buffer runs "idle work" callbacks.

The MakeCurrent of virtualized GL contexts changes the real GL state to match what's stored in ContextState.

Considering the case of CommandBufferStub::PerformWork running MakeCurrent:
The task runs decoder idle work callbacks. These callbacks do the readpixels finishing and inspect the running queries. Most of the restored state is not needed.


Considering the case of GPU rasterization:
When the Skia calls start to execute in the GPU process, CommandBufferStub will call MakeCurrent on the context and first reset all GL state. 

Since GrContext::resetContext is called before each tile, Skia will reset its internal state to GL, overwriting the state that MakeCurrent just set.

Even if this would not be called, the first thing Skia would do is to bind the framebuffer that is associated to the Skia render target of the tile to be rendered to.

If the GL state caching could be somehow changed in decoder, one could potentially not write the redundant state.

Attaching a report of a case of GPU rasterizing a testcase that triggers MSAA codepath, such as the tiger svg replicated 4 times. The testcase can update 31 FPS on this Android device. The MSAA problem described in  bug 594928  is fixed for this run.

Attached is a picture of sampling profiler run on Android. You can observe that BindFramebuffer consumes 8% of the runtime. Not all of this can be removed, but maybe some can.

Attached is a picture of chromium tracing I'm basing the deductions on. You can observe the state restore for 
CommandBufferStub::PerformWork, which consumes majority of the work that CommandBufferStub::PerformWork does. Note: in practice CommandBufferStub::PerformWork probably is not a bottleneck, as this is run because tracing is being done. When running without tracing, CommandBufferStub::PerformWork will be called seldomly if ever. Note: the above sampling profiler profile does not have samples due to CommandBufferStub::PerformWork, since the profile was done without tracing.

Attached is the trace, should the reader be extremely interested in this.

 
Screenshot from 2016-04-14 09:52:15.png
226 KB View Download
Screenshot from 2016-04-14 09:54:17.png
38.6 KB View Download
trace_virtualized-gl-context-gpu-rasterization-redundant-state-restore.json.zip
3.3 MB Download
Attached is a picture of chromium tracing:
You can also observe the second state restore, due to CommandBufferStub being about to call OnAsyncFlush. The state restored during this operation is being overwritten by some other state set, beginning with BindFramebuffer, FramebufferTexture2D
Cc: zmo@chromium.org piman@chromium.org
As I understand this part of the system:

Currently the decoder works so that when GL state set comes in, it will do two things:
1) Set the actual GL state
2) Store the state to ContextState

I believe this is to make the system more robust to implement with respect to the GL errors reported by the actual driver. This way the GL error that can happen due to the call will happen during correct time and the client gets the actual error correctly.

Alternative would be to defer the call until needed. In deferred system, a call failing with an error would be confusing, if it was done as a response to a seemingly unrelated call.

Some command buffer GL calls need complex sequence of real GL calls. For example, some emulated features. In these cases, decoder will modify the GL state to implement the use-case. Afterwards, the decoder will use ContextState to inspect the expected state and restore that.

In order for MakeCurrent induced calls such as BindFramebuffer to be deferred one would need to, at minimum:
1) Add the deferred state somewhere
2) check the deferred state in all the call sites that would need that state and flush the state if the state is dirty.
3) Make MakeCurrent modify the deferred state (and possibly ContextState)
4) Make all the state set calls modify also the deferred state (in addition of modifying GL state and possibly current ContextState).

At maximum, one would add the deferring to the real client induced calls as opposed to the system induced calls:
5) Make all the state set calls not modify the real GL state

The GL calls induced by the decoder seem to be spread across different places, in managers and such. Some of these take the ContextState as a parameter, but probably not all. It seems rather big shift to current design to use ContextState to do this deferred state management.

OTOH, it'd be rather confusing if ContextState would not be the place where the deferred state would be stored, as it is a place that stores the context state.

Any thoughts on the solution, whether it makes sense to pursue or not? Any ideas on how to achieve the effect?


Comment 3 by piman@chromium.org, Apr 14 2016

Cc: vmi...@chromium.org sunn...@chromium.org
+vmiura who spent some time on looking at context switch cost.

It's always a difficult tradeoff of how much is the validation cost ("is the bound framebuffer state dirty?", which we'd need to do on every draw call, even though it's mostly false - a case of death by a 1000 cuts when i-cache effects come in) vs the cost of doing the more costly thing (reset the state) less often. That tradeoff is of course dependent on the system (cpu, driver performance) and the use case. Probably not that different to what you have to do in the driver itself.

If there's good data that shows this is generally good across a good range of use cases and devices, we should do it. I agree that ContextState is where this should be. The question is picking the states that makes a difference.

(of course there's also a code complexity cost).


Hopefully the GPU scheduler would help reducing the context switches, which also tilts the balance.

Comment 4 by vmi...@chromium.org, Apr 14 2016

I do think it would be worth pursuing.  We switch context sometimes only to encounter a WaitSyncTokenCHROMIUM for example.

FWIW an idea I had some time ago was:
 - Deffer sub-sets of ContextState::Restore*.
 - The virtual MakeCurrent would just mark all sub-sets dirty.
 - Commands would be tagged with the state sub-sets they depend on in gpu/command_buffer/build_gles2_cmd_buffer.py.
 - Restoring necessary state could be a cheap "&" check in the command decoder.

However, we should keep in mind that we're working towards switching GL state and command validation to MANGLE.  Optimizations in Chrome's command_buffer may be short lived, and MANGLE does some dirty-bits and lazy state optimizations.  If we do something it should maybe be directed there?
Yeah, maybe MANGLE could fix things if it is optimized for apps that do redundant state changes, compared against drivers optimized for apps that do not do redundant state changes...

I'll work on this in command buffer when I have the energy
It However can be "Refreshed", but cannot be removed, unless most if not all needs some caching.
Project Member

Comment 7 by bugdroid1@chromium.org, Apr 26 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a7238d6dd542777f652a6ebd223a744c52e94e83

commit a7238d6dd542777f652a6ebd223a744c52e94e83
Author: kkinnunen <kkinnunen@nvidia.com>
Date: Tue Apr 26 12:42:34 2016

command_buffer: Check draw FBO for path rendering draw functions

Check draw FBO validity for path rendering draw functions.

All other draw functions call the method too.

BUG= 603407 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review URL: https://codereview.chromium.org/1913813002

Cr-Commit-Position: refs/heads/master@{#389757}

[modify] https://crrev.com/a7238d6dd542777f652a6ebd223a744c52e94e83/gpu/command_buffer/service/gles2_cmd_decoder.cc

Project Member

Comment 8 by bugdroid1@chromium.org, Apr 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cb5c7ba15f5c0844f46d16158b5fb94735dea7de

commit cb5c7ba15f5c0844f46d16158b5fb94735dea7de
Author: kkinnunen <kkinnunen@nvidia.com>
Date: Wed Apr 27 07:30:16 2016

command_buffer: Remove dead code for gl_begin_gl_end_on_fbo_change_to_backbuffer workaround

Remove dead code for gl_begin_gl_end_on_fbo_change_to_backbuffer
workaround.

Does not remove the gpu bug workarounds entry, so that the list stays
consistent with the numbering.

This is a work aiming to improve decoder MakeCurrent performance by
making more of the state updated in lazy manner. This works towards
updating the logic for FBOs. The removed hunk was requiring the GL
binding to be up-to-date.

BUG= 603407 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review URL: https://codereview.chromium.org/1919803002

Cr-Commit-Position: refs/heads/master@{#390012}

[modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/command_buffer/service/gles2_cmd_decoder.cc
[modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/config/gpu_driver_bug_list_json.cc
[modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/config/gpu_driver_bug_workaround_type.h
[modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/ui/gl/gl_surface.cc
[modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/ui/gl/gl_surface.h

Project Member

Comment 9 by bugdroid1@chromium.org, Apr 28 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/50d4774a32f18c10e2cc5285fc26e187d088fcab

commit 50d4774a32f18c10e2cc5285fc26e187d088fcab
Author: kkinnunen <kkinnunen@nvidia.com>
Date: Thu Apr 28 09:03:14 2016

command_buffer: Remove unneeded FBO binding restore during CopyTextureCHROMIUM

Remove unneeded restoring of FBO bindings after creating
CopyTextureCHROMIUM helper instance.

CopyTextureCHROMIUMResourceManager::Initialize() does not change the FBO
bindings at all, so the removed call
GLES2DecoderImpl::RestoreCurrentFramebufferBindings should not be needed.

This is a work aiming to improve decoder MakeCurrent performance by
making more of the state updated in lazy manner. This works towards
updating the logic for FBOs. Removing the extra FBO restore simplifies
this.

BUG= 603407 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1927503002
Cr-Commit-Position: refs/heads/master@{#390333}

[modify] https://crrev.com/50d4774a32f18c10e2cc5285fc26e187d088fcab/gpu/command_buffer/service/gles2_cmd_decoder.cc

Labels: -Needs-Feedback
Owner: kkinnu...@nvidia.com
Status: WontFix (was: Untriaged)
From related email discussion.

On 05/06/2016 01:57 AM, Zhenyao Mo wrote:
> To me, upgrading command buffer to ES3 and getting MANGLE in place
> and stabilizing them are priorities. I agree with Antoine that
> optimizing for Android is as important. I just don't feel
> implementing dirty bits in command buffer is the right way to go.
> 
> Even without considering the required efforts and hours, I honestly
> don't feel command buffer is the right place for dirty bits. It
> should be lower and ANGLE (or driver) is a much better place for that
> in design. Doing it in command buffer will be a really "dirty" and
> intrusive implementation, code all over places, and easy to miss a
> few hooks (for example, Chrome extensions, which are heavily used
> internally and definitely under-tested) and de-stabilize Chrome GPU.

Sign in to add a comment