GPU process spends large amount of GPU rasterization time in redundant MakeCurrent state restoring when using virtualized GL contexts |
||||
Issue descriptionGPU process spends large amount of time in MakeCurrent state restoring when using virtualized GL contexts. Often this state restore is redundant for GPU rasterization, as Skia resets its expected GL state for each tile. Biggest bottleneck is binding the framebuffer. While optimized, it is a rather expensive operation in GL drivers. The MakeCurrent is called always when command buffer starts to execute the command stream or when command buffer runs "idle work" callbacks. The MakeCurrent of virtualized GL contexts changes the real GL state to match what's stored in ContextState. Considering the case of CommandBufferStub::PerformWork running MakeCurrent: The task runs decoder idle work callbacks. These callbacks do the readpixels finishing and inspect the running queries. Most of the restored state is not needed. Considering the case of GPU rasterization: When the Skia calls start to execute in the GPU process, CommandBufferStub will call MakeCurrent on the context and first reset all GL state. Since GrContext::resetContext is called before each tile, Skia will reset its internal state to GL, overwriting the state that MakeCurrent just set. Even if this would not be called, the first thing Skia would do is to bind the framebuffer that is associated to the Skia render target of the tile to be rendered to. If the GL state caching could be somehow changed in decoder, one could potentially not write the redundant state. Attaching a report of a case of GPU rasterizing a testcase that triggers MSAA codepath, such as the tiger svg replicated 4 times. The testcase can update 31 FPS on this Android device. The MSAA problem described in bug 594928 is fixed for this run. Attached is a picture of sampling profiler run on Android. You can observe that BindFramebuffer consumes 8% of the runtime. Not all of this can be removed, but maybe some can. Attached is a picture of chromium tracing I'm basing the deductions on. You can observe the state restore for CommandBufferStub::PerformWork, which consumes majority of the work that CommandBufferStub::PerformWork does. Note: in practice CommandBufferStub::PerformWork probably is not a bottleneck, as this is run because tracing is being done. When running without tracing, CommandBufferStub::PerformWork will be called seldomly if ever. Note: the above sampling profiler profile does not have samples due to CommandBufferStub::PerformWork, since the profile was done without tracing. Attached is the trace, should the reader be extremely interested in this.
,
Apr 14 2016
As I understand this part of the system: Currently the decoder works so that when GL state set comes in, it will do two things: 1) Set the actual GL state 2) Store the state to ContextState I believe this is to make the system more robust to implement with respect to the GL errors reported by the actual driver. This way the GL error that can happen due to the call will happen during correct time and the client gets the actual error correctly. Alternative would be to defer the call until needed. In deferred system, a call failing with an error would be confusing, if it was done as a response to a seemingly unrelated call. Some command buffer GL calls need complex sequence of real GL calls. For example, some emulated features. In these cases, decoder will modify the GL state to implement the use-case. Afterwards, the decoder will use ContextState to inspect the expected state and restore that. In order for MakeCurrent induced calls such as BindFramebuffer to be deferred one would need to, at minimum: 1) Add the deferred state somewhere 2) check the deferred state in all the call sites that would need that state and flush the state if the state is dirty. 3) Make MakeCurrent modify the deferred state (and possibly ContextState) 4) Make all the state set calls modify also the deferred state (in addition of modifying GL state and possibly current ContextState). At maximum, one would add the deferring to the real client induced calls as opposed to the system induced calls: 5) Make all the state set calls not modify the real GL state The GL calls induced by the decoder seem to be spread across different places, in managers and such. Some of these take the ContextState as a parameter, but probably not all. It seems rather big shift to current design to use ContextState to do this deferred state management. OTOH, it'd be rather confusing if ContextState would not be the place where the deferred state would be stored, as it is a place that stores the context state. Any thoughts on the solution, whether it makes sense to pursue or not? Any ideas on how to achieve the effect?
,
Apr 14 2016
+vmiura who spent some time on looking at context switch cost.
It's always a difficult tradeoff of how much is the validation cost ("is the bound framebuffer state dirty?", which we'd need to do on every draw call, even though it's mostly false - a case of death by a 1000 cuts when i-cache effects come in) vs the cost of doing the more costly thing (reset the state) less often. That tradeoff is of course dependent on the system (cpu, driver performance) and the use case. Probably not that different to what you have to do in the driver itself.
If there's good data that shows this is generally good across a good range of use cases and devices, we should do it. I agree that ContextState is where this should be. The question is picking the states that makes a difference.
(of course there's also a code complexity cost).
Hopefully the GPU scheduler would help reducing the context switches, which also tilts the balance.
,
Apr 14 2016
I do think it would be worth pursuing. We switch context sometimes only to encounter a WaitSyncTokenCHROMIUM for example. FWIW an idea I had some time ago was: - Deffer sub-sets of ContextState::Restore*. - The virtual MakeCurrent would just mark all sub-sets dirty. - Commands would be tagged with the state sub-sets they depend on in gpu/command_buffer/build_gles2_cmd_buffer.py. - Restoring necessary state could be a cheap "&" check in the command decoder. However, we should keep in mind that we're working towards switching GL state and command validation to MANGLE. Optimizations in Chrome's command_buffer may be short lived, and MANGLE does some dirty-bits and lazy state optimizations. If we do something it should maybe be directed there?
,
Apr 15 2016
Yeah, maybe MANGLE could fix things if it is optimized for apps that do redundant state changes, compared against drivers optimized for apps that do not do redundant state changes... I'll work on this in command buffer when I have the energy
,
Apr 17 2016
It However can be "Refreshed", but cannot be removed, unless most if not all needs some caching.
,
Apr 26 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a7238d6dd542777f652a6ebd223a744c52e94e83 commit a7238d6dd542777f652a6ebd223a744c52e94e83 Author: kkinnunen <kkinnunen@nvidia.com> Date: Tue Apr 26 12:42:34 2016 command_buffer: Check draw FBO for path rendering draw functions Check draw FBO validity for path rendering draw functions. All other draw functions call the method too. BUG= 603407 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel Review URL: https://codereview.chromium.org/1913813002 Cr-Commit-Position: refs/heads/master@{#389757} [modify] https://crrev.com/a7238d6dd542777f652a6ebd223a744c52e94e83/gpu/command_buffer/service/gles2_cmd_decoder.cc
,
Apr 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cb5c7ba15f5c0844f46d16158b5fb94735dea7de commit cb5c7ba15f5c0844f46d16158b5fb94735dea7de Author: kkinnunen <kkinnunen@nvidia.com> Date: Wed Apr 27 07:30:16 2016 command_buffer: Remove dead code for gl_begin_gl_end_on_fbo_change_to_backbuffer workaround Remove dead code for gl_begin_gl_end_on_fbo_change_to_backbuffer workaround. Does not remove the gpu bug workarounds entry, so that the list stays consistent with the numbering. This is a work aiming to improve decoder MakeCurrent performance by making more of the state updated in lazy manner. This works towards updating the logic for FBOs. The removed hunk was requiring the GL binding to be up-to-date. BUG= 603407 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel Review URL: https://codereview.chromium.org/1919803002 Cr-Commit-Position: refs/heads/master@{#390012} [modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/command_buffer/service/gles2_cmd_decoder.cc [modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/config/gpu_driver_bug_list_json.cc [modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/gpu/config/gpu_driver_bug_workaround_type.h [modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/ui/gl/gl_surface.cc [modify] https://crrev.com/cb5c7ba15f5c0844f46d16158b5fb94735dea7de/ui/gl/gl_surface.h
,
Apr 28 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/50d4774a32f18c10e2cc5285fc26e187d088fcab commit 50d4774a32f18c10e2cc5285fc26e187d088fcab Author: kkinnunen <kkinnunen@nvidia.com> Date: Thu Apr 28 09:03:14 2016 command_buffer: Remove unneeded FBO binding restore during CopyTextureCHROMIUM Remove unneeded restoring of FBO bindings after creating CopyTextureCHROMIUM helper instance. CopyTextureCHROMIUMResourceManager::Initialize() does not change the FBO bindings at all, so the removed call GLES2DecoderImpl::RestoreCurrentFramebufferBindings should not be needed. This is a work aiming to improve decoder MakeCurrent performance by making more of the state updated in lazy manner. This works towards updating the logic for FBOs. Removing the extra FBO restore simplifies this. BUG= 603407 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/1927503002 Cr-Commit-Position: refs/heads/master@{#390333} [modify] https://crrev.com/50d4774a32f18c10e2cc5285fc26e187d088fcab/gpu/command_buffer/service/gles2_cmd_decoder.cc
,
May 6 2016
From related email discussion. On 05/06/2016 01:57 AM, Zhenyao Mo wrote: > To me, upgrading command buffer to ES3 and getting MANGLE in place > and stabilizing them are priorities. I agree with Antoine that > optimizing for Android is as important. I just don't feel > implementing dirty bits in command buffer is the right way to go. > > Even without considering the required efforts and hours, I honestly > don't feel command buffer is the right place for dirty bits. It > should be lower and ANGLE (or driver) is a much better place for that > in design. Doing it in command buffer will be a really "dirty" and > intrusive implementation, code all over places, and easy to miss a > few hooks (for example, Chrome extensions, which are heavily used > internally and definitely under-tested) and de-stabilize Chrome GPU. |
||||
►
Sign in to add a comment |
||||
Comment 1 by kkinnu...@nvidia.com
, Apr 14 2016