Excessive latency in GPU rasterization GL command execution
Issue description

There appears to be significant latency between the time GPU rasterization executes a GL command and the time the command is actually sent to the GPU.

The problem at a high level: GPU rasterization and GL execution of a particular frame are done serially, not in parallel. This contributes to poor graphics performance. It also tends to make it hard to benefit from optimizations done at the Skia and GL level, as the optimizations seem to just add to the idle time of the processes.

The root cause at a high level: GPU rasterization appends GL commands to the command stream, splitting each tile with an ordering barrier. The GPU process will not execute these commands, since nothing triggers a flush as long as there is space in the command buffer. The GL command stream is flushed by a seemingly unrelated actor, the compositor, doing an ordering barrier in TileTaskManagerImpl::ScheduleTasks -> GpuRasterBufferProvider::OrderingBarrier -> CommandBufferProxyImpl::OrderingBarrier. If I understand correctly, the command stream is flushed at this point because the contexts are different. This happens after all the tiles have been rasterized.

The question of this bug: Is this by design? Is it up for improvement? If improvement can be done, what class of improvements is in scope?

I can think of small things like:

1) Somehow (shallow?) flush after each tile. Not ideal, but this would be a practical way to reduce latency (a rough sketch follows this comment). It increases overhead in the GPU process, because flush processing causes MakeCurrent, and the MakeCurrent overhead is out of scope of fixing. It also quantizes the GPU process according to the tile raster length, so in the example below the improvement is not very big: the second tile takes, say, 13ms, and the GPU process would still idle between the first and the second tile. The MakeCurrent overhead might be mitigated by the scheduler work (AFAICT?), but I have not seen that work.

2) Somehow flush after N commands. Works around the long raster durations.

Big things would be like:

3) Not using Chrome IPC for GL IPC. Rather, have some sort of lower-overhead queue, run each context in its own thread, wait on some sort of sync variable, and process the GL commands as fast as they are written to the queue. This is probably out of scope, as it would need changes in the synchronization model.

As an example:
The device: NV Shield TV (Android)
The testcase: svg-anim.html?l,chalkboard

The testcase changes the background color with rAF as fast as it can. The actual content is the MS "chalkboard" image, filling roughly 2/3 of the view's height. The testcase animates at ~40FPS, both with and without the NV_path_rendering GL extension.

In the attached trace, you can observe the time a rasterization starts in the compositor raster thread. I have added an event flow indicating the point where the decoder starts to execute the commands. In the other attached trace, you can observe the reason why the GPU process starts to execute the particular GL stream: the AsyncFlush processing starts as a result of TileTaskManagerImpl::ScheduleTasks -> GpuRasterBufferProvider::OrderingBarrier -> CommandBufferProxyImpl::OrderingBarrier.

From the traces, you can observe that the first GL command processing starts 20-30ms after GPU rasterization starts. This is rather unfortunate, as the scene can be rendered fully in 25ms with standalone Skia.
If the scene can be rendered faster than the time it takes to even get the first GL command executed, our attempts to optimize Skia, optimize the GL driver, and utilize GL extensions like NV_path_rendering seem a bit futile.

The context for this work: this particular scene, for example, can be rendered with standalone Skia like so:
MSAA4 with NV_path_rendering: 22ms
MSAA4 without NV_path_rendering: 87ms

I'm trying to understand why improvements at the Skia level are not reflected as improvements at the Chromium level. I'm observing that in many cases, skipping work at the GL level or even optimizing at the Skia level does not improve Chromium. Any insights regarding this would be helpful:
- Have you seen similar observations with some testcase (maybe this one?), e.g. that with this testcase or many other interesting testcases, Blink is the bottleneck?
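For clarity, here is a minimal, hypothetical sketch of idea (1) above, a per-tile shallow flush. The class name, member names, and the simplified Playback signature are illustrative anchors only, not the actual Chromium code; ShallowFlushCHROMIUM() is the existing GLES2 client call that kicks pending commands to the service side without blocking.

// Hypothetical sketch of idea (1): flush the raster context once per tile so
// the GPU process can start decoding before the compositor's end-of-frame
// OrderingBarrier. Names and the signature are illustrative.
void RasterBufferImpl::Playback(/* raster source, rects, settings, ... */) {
  gpu::gles2::GLES2Interface* gl = worker_context_provider_->ContextGL();

  // ... existing playback: record the tile's Skia/GL work into the command
  //     buffer for this raster context ...

  // New: a shallow flush makes the recorded commands visible to the GPU
  // process immediately. It does not wait for them to execute, so the raster
  // thread keeps going; the cost is the extra flush/MakeCurrent processing in
  // the GPU process discussed above.
  gl->ShallowFlushCHROMIUM();
}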
Jun 3 2016
Jun 3 2016
It's a known issue. Originally, for threaded GPU rasterization, we didn't want to run a Flush for each tile's work, as in cases with many rasterized tiles the Flush IPC costs can add up: say 30 tiles at roughly 0.2ms CPU overhead per Flush on Android would add up to 6ms. Instead, we had the compositor thread trigger the flush via an OrderingBarrier on the compositor's context about once per frame. The expectation is that the GPU would run pipelined, so we increase latency but not throughput. If you optimize path rendering I'd expect to see an improvement. However, we do have some limits on the number of outstanding frames, so it's possible that is throttling us.

In your example it definitely looks like we should flush earlier to reduce latency. Each tile is taking up to 15ms, so we may want to trigger flushes at finer granularity than once per tile in this case. We have a PeriodicFlushCheck on desktop, but it is disabled on Android. Maybe we should enable it on Android: https://cs.chromium.org/chromium/src/gpu/command_buffer/client/cmd_buffer_helper.cc?q=PeriodicFlushCheck&sq=package:chromium&l=199

btw, I'm surprised to see ProxyImpl::ScheduledActionActivateSyncTree of ~15ms.
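For context, a schematic of what the client-side periodic flush check does, simplified from the cmd_buffer_helper.cc linked above; the member names, the ifdef placement, and the exact update of the last-flush time are assumptions from memory, not a verbatim copy.

// Schematic only. With CMD_HELPER_PERIODIC_FLUSH_CHECK enabled, the helper
// checks (every kCommandsPerFlushCheck commands issued) whether more than
// kPeriodicFlushDelayInMicroseconds has passed since the last flush, and if
// so flushes on its own instead of waiting for the command buffer to fill.
void CommandBufferHelper::PeriodicFlushCheck() {
#if defined(CMD_HELPER_PERIODIC_FLUSH_CHECK)
  base::TimeTicks current_time = base::TimeTicks::Now();
  if (current_time - last_flush_time_ >
      base::TimeDelta::FromMicroseconds(kPeriodicFlushDelayInMicroseconds)) {
    Flush();
    last_flush_time_ = current_time;
  }
#endif
}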
Jun 3 2016
Usually we draw the frame soon after ProxyImpl::NotifyReadyToActivate, which triggers the flush. In this example I believe we're in "high latency" mode, so we don't draw until the start of the next frame. In between we have an unusually long ProxyImpl::ScheduledActionActivateSyncTree, as well as a ProxyImpl::ScheduledActionCommit, since the main thread was already waiting to commit. If we add an OrderingBarrier after TileTaskManagerImpl::CheckForCompletedTasks() in TileManager::CheckAndIssueSignals(), then the tiles' work would be flushed before those Activation & Commit actions.

Follow ups:
1) Try adding an OrderingBarrier earlier in TileManager::CheckAndIssueSignals(), before NotifyReadyToActivate() or NotifyReadyToDraw() (a sketch follows this comment).
2) Try enabling CMD_HELPER_PERIODIC_FLUSH_CHECK on Android.
3) Look into why activation is so slow here.
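A minimal sketch of what follow-up (1) might look like. Only the call sites named above come from this discussion; the surrounding code and the member used to reach the raster buffer provider are assumptions.

// Hypothetical placement for follow-up (1): flush the raster work as soon as
// completed tile tasks are collected, instead of relying on the compositor's
// OrderingBarrier in ScheduleTasks on a later frame. Member names are
// illustrative.
void TileManager::CheckAndIssueSignals() {
  tile_task_manager_->CheckForCompletedTasks();

  // New: issue the ordering barrier here so the GPU process can start
  // executing the tiles' GL commands before the (possibly slow) activation
  // and commit actions.
  raster_buffer_provider_->OrderingBarrier();

  // ... existing checks that may end up calling NotifyReadyToActivate() or
  //     NotifyReadyToDraw() ...
}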
Jun 3 2016
Jun 5 2016
I tried out "svg-anim.html?1,chalkboard" on my MBP, Chrome 53.0.2757.0 with GPU rasterization. I'm not seeing slow activation. Note in this trace we're calling CommandBufferProxyImpl::Flush very often, as often as every 100us. CMD_HELPER_PERIODIC_FLUSH_CHECK is set to flush every 300us. I guess the command buffer is constantly getting full.
Jun 6 2016
Is your trace still using SW rasterization?
Jun 6 2016
I had GPU rasterization forced on. Here is a trace running on Pixel C, which uses the path rendering extension. I enabled full command tracing (Service.TraceCmd) which makes everything ~50% slower, but the bottlenecks don't change. It's spending all the time in the path rendering extension.

kStencilThenCoverFillPathCHROMIUM  198.291 ms  198.227 ms  0.000 ms  0.000 ms  2735
kDeletePathsCHROMIUM                55.143 ms   52.846 ms  0.000 ms  0.000 ms  2628
kPathCommandsCHROMIUM               25.958 ms   26.445 ms  0.000 ms  0.000 ms  2628
kDeleteBuffersImmediate              1.310 ms    1.312 ms  0.000 ms  0.000 ms     3
kDrawElements                        1.208 ms    1.250 ms  0.000 ms  0.000 ms    99
kScissor                             0.697 ms    0.719 ms  0.000 ms  0.000 ms   144
Jun 6 2016
In kimmo's traces, one of them has short activations and one long. I just reproduced this difference locally as well. On both Chromium and Webview activation is getting progressively slower over time. I'm not able to confirm this on OS X. Test for easy repro: http://vmiura.github.io/svg-anim/svg-anim.html?1,tiger
Jun 6 2016
chrishtr@: fyi, tree activation getting slower the longer the Renderer is running may be related to the performance alerts you've been seeing. I reproduced the slow activation on ToT r398134 as well (though the SVG rendering on Pixel C is broken, all black).
Jun 7 2016
Jun 7 2016
> Here is a trace running on Pixel C, which uses the path rendering extension.
> It's spending all the time in the path rendering extension.

Thanks for looking into Pixel C. I'll try to verify your observations. So due to the black screen regression I assume you're using a somewhat older Chromium? In your trace you can see the "delete paths" commands, which are a sign of path cache thrashing. I've fixed this since, so you would need to try a newer Chromium on Pixel C to confirm the slowness.

> I enabled full command tracing (Service.TraceCmd) which makes everything ~50% slower, but the bottlenecks don't change.

In general, kStencilThenCoverFillPathCHROMIUM should be and will be the function taking most of the time, since it's the one doing the actual work; it's similar to DrawArrays. However, getting numbers with Service.TraceCmd and/or --enable-gpu-service-tracing (with increased log level) typically seems to skew the results for me, up to the point where no conclusions can be made with any confidence.

The "easiest" way to blame path rendering is to first measure with path rendering (default Chromium flags) and then measure with the following setting:

build/android/adb_chrome_public_command_line --disable-gl-extensions=GL_NV_path_rendering

However, this is unfair, as Chromium uses Skia in a suboptimal way that penalizes the (non-path-rendering) MSAA rasterization. I've been trying to fix this too, but I'm getting nowhere with the reviews. So in order to really blame path rendering, you'd need to apply the patch that fixes MSAA tiling.

Note, the test case should have a lower-case letter L, as in "log", not the number one:
http://vmiura.github.io/svg-anim/svg-anim.html?l,tiger
http://vmiura.github.io/svg-anim/svg-anim.html?l,chalkboard
This way you can observe the rAF FPS from the console.
Jun 7 2016
I filed a new issue 617869 for the Pixel C investigation.
Jun 7 2016
Thanks kkinnunen@. I'm curious if you can try the latency fixing ideas in #4. The CL crrev.com/2044803002 should fix the slow activation issue.
Jun 7 2016
I'll try them, thank you for the suggestions. I'm also available (as a tester, for example) if you ever decide to work on the root causes, e.g. on better GL IPC and getting rid of the MakeCurrent overhead.
Jun 10 2016
Latency fixing ideas from comment #4:

> Follow ups:
> 1) Try adding an OrderingBarrier earlier in TileManager::CheckAndIssueSignals() before NotifyReadyToActivate() or NotifyReadyToDraw().

I'm looking at CheckAndIssueSignals, which takes 1.5ms, and at the tile rasterization, which takes 50ms. I cannot see how moving the barrier at most 1.5ms earlier would fix the latency, so I did not pursue this further.

> 2) Try enabling CMD_HELPER_PERIODIC_FLUSH_CHECK on Android.

This is a regression. I'm attaching a trace, but it's not terribly interesting. By my layman reasoning, this is even worse than flushing after each tile.
Jun 10 2016
>> 2) Try enabling CMD_HELPER_PERIODIC_FLUSH_CHECK on Android.
>
> This is a regression. I'm attaching a trace, but it's not terribly interesting. By my
> layman reasoning, this is even worse than flushing after each tile.

This may require tuning kCommandsPerFlushCheck down and kPeriodicFlushDelayInMicroseconds up on Android. What regressed, the framerate? Looking at the trace, the latency does look improved. I'm not sure in what way this is even worse than flushing after each tile?
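For illustration, this is roughly what that tuning could look like in gpu/command_buffer/client/cmd_buffer_helper.cc. The Android values are placeholders chosen only to show the direction of the change (check more often, allow a longer delay before flushing); they are not measured recommendations, and the non-Android defaults are partly assumed.

// Illustrative only: per-platform tuning of the periodic flush knobs named
// above. Android values are made-up placeholders, not recommendations.
#if defined(OS_ANDROID)
// Check for a pending periodic flush more often...
const int kCommandsPerFlushCheck = 50;
// ...but allow more time to pass before actually flushing, to limit the
// flush/MakeCurrent overhead in the GPU process.
const int kPeriodicFlushDelayInMicroseconds = 500;
#else
const int kCommandsPerFlushCheck = 100;            // assumed default
const int kPeriodicFlushDelayInMicroseconds = 300; // per comment #7
#endif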
Jun 13 2016
> This may require tuning kCommandsPerFlushCheck down and kPeriodicFlushDelayInMicroseconds up on Android.

To me, that does not really sound like potential for a major breakthrough. In today's code, to decrease latency you need to increase the flushes. It is not obvious to me why it would *really* matter which knob you turn and by how much. It's still the same fundamental limitation -- MakeCurrent will reduce the perf. If MakeCurrent is 10-20%, the steps of the knob must be quite coarse. Also, I have a feeling that Skia GL commands are not distributed evenly in time, and thus a time-based cutoff might be hard to apply for really consistent results: Skia tries to generate resources in the beginning, does its processing, and then issues commands at the end. I have not measured this, though.

> What regressed, framerate?

Yeah, the rAF framerate drops. As a rough example: 24 to 20.

> Looking at the trace the latency does look improved.

Personally, I cannot read the trace timestamps to the degree that I could make that claim.

> I'm not sure in what way this is even worse than flushing after each tile?

At least flushing after each tile partitions the work so that no pending work is left after flushes. Flushing mid-tile leaves work pending, but still costs the MakeCurrent overhead. Flushing after each tile is not a progression either.
Jun 13 2016
Understood, thanks for the update. In the traces in #1 it seems there are 3 to 4 MakeCurrent calls per frame, which is about expected. It is unfortunate that we run MakeCurrent 2 times for the DisplayCompositor; the first time we run up to a WaitSyncTokenCHROMIUM. That said, optimizing away one MakeCurrent would remove ~0.2ms from the GPU thread, but this won't make a difference as long as the raster task time is >22ms. Do you know what is taking the time on the Skia side?
Jun 13 2016
> Do you know what is taking the time in Skia side?

I kind of know what is taking time, but I don't know why :) (The profile is in the Skia issue linked below.)

The strange thing is that there seems to be quite a difference between standalone Skia and the Chromium raster thread. I cannot really say whether the difference is within the expected range of variance and of differences in the scenarios, or whether there is some small but fundamental difference between the environments that causes the performance difference.

I've mostly been looking at the path rendering code path. There the majority of the time is spent in path cache lookup, and then there's a lot of plain malloc+free. Some of the investigation is dumped here: https://bugs.chromium.org/p/skia/issues/detail?id=5357

For the non-path-rendering codepath the profile is a bit different, but there too a major share of the samples is in malloc+free of internal abstractions.

I would suppose the idea is that as long as raster tasks run back to back without the thread idling, there's no point in optimizing the GPU process side (from a perf perspective). Still, when I remove work from the Skia side to simulate optimization, I don't see much benefit. Hence the inquiries in this bug.

One thing I keep seeing is back-to-back raster tasks in Chromium traces. Looking at the traces, not all of the tasks are actually running; sometimes the tasks have a significant delta between "wall" and "cpu" time spent. In the profiler, I've never seen the ("optimized") raster thread execute at 100%; rather, there keep being these gaps where the thread is runnable but not running. I wonder if anybody has seen this or has an idea why this happens?

So to the point about improving GPU process efficiency: one perspective here is that if one reduces the amount of redundant work done in the GPU process, more CPU time is left for the raster thread and thus rasterization would become faster. I don't know how realistic an argument this is.
Dec 8 2016
Dec 11 2017
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Dec 15 2017
From a quick read it sounds like viz work to remove use of command buffer will help.
Dec 17
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Dec 17
With parallel OOP-R recording, we can probably make this better.