Evaluate enabling native GMB with one-copy on rockchip.
Issue description
On rockchip we're currently using the shared memory implementation of GpuMemoryBuffers, which means we do a texture upload for every new tile in the GPU process. David tried adding the switch --enable-native-gpu-memory-buffer on kevin with one-copy and took a look at a few traces. His first analysis was that it was not a win overall: more time was now spent rasterizing content with Skia into GpuMemoryBuffers than was saved by skipping the upload in the GPU process. We should investigate if/why this is the case and whether we have any option to make dmabuf perform better. We currently have gpu/perftests/texture_upload_perftest.cc in Chrome, which measures texture upload costs and could be refactored to check GpuMemoryBuffer costs. Alternatively, we could just write CrOS-specific perf tests.
,
Mar 13 2017
I've noticed that dma-buf mmapping is very slow on rockchip compared to Intel. Some of the drm-tests are failing on rockchip because of it. Repro commands: stop ui, then mmap_test. We could add some performance metrics to the test, but even without metrics the lag is obvious.
,
Mar 13 2017
We usually map buffers once in the constructor of ClientNativePixmapDmaBuf when we import them in the renderer and keep them mapped, so I don't think that the added mmap cost is what David noticed.
There is also a trace for how long mmap takes: TRACE_EVENT0("drm", "ClientNativePixmapDmaBuf")
I think David was talking about the time it takes to play back the Skia image into the buffer. That one regressed significantly.
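As a rough illustration of the "map once and keep it mapped" pattern described above, here is a minimal C sketch. It is not the ClientNativePixmapDmaBuf code; the names mapped_buffer, mapped_buffer_init and mapped_buffer_destroy are hypothetical, and a negative fd falls back to anonymous memory purely so the sketch can run without a real dma-buf.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: the buffer is mapped once at init time and the
 * mapping is reused for every later draw until the wrapper is destroyed. */
struct mapped_buffer {
    void *data;
    size_t size;
};

/* fd >= 0: shared mapping of that fd (what a dma-buf import would use).
 * fd <  0: anonymous memory, a stand-in (assumption) for running the sketch. */
int mapped_buffer_init(struct mapped_buffer *buf, int fd, size_t size)
{
    assert(buf != NULL && size > 0);
    int flags = (fd >= 0) ? MAP_SHARED : (MAP_PRIVATE | MAP_ANONYMOUS);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    buf->data = p;
    buf->size = size;
    return 0;
}

/* Unmap exactly once, when the buffer is no longer needed. */
void mapped_buffer_destroy(struct mapped_buffer *buf)
{
    assert(buf != NULL);
    if (buf->data != NULL)
        munmap(buf->data, buf->size);
    buf->data = NULL;
    buf->size = 0;
}
```

With this shape the mmap cost is paid once per buffer rather than once per frame, which is why it is unlikely to explain a per-tile rasterization regression.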
,
Mar 13 2017
I agree mmap()/unmap() is not the problem. With this patch, the test doesn't mmap()/unmap() on every frame: https://chromium-review.googlesource.com/c/453981/ However, the test still takes a long time on Kevin, leading me to believe STEP_DRAW is the problem. That seems to imply rockchip's page swapping mechanism is slow compared to Intel.
,
Mar 13 2017
FWIW, the laser pointer in Chrome OS is now using mmap and GMBs for low-latency single buffer updates. I haven't seen any performance problems in this case on Kevin, but it only writes to the buffer using a memcpy-like pattern and the updates are usually relatively small.
,
Mar 14 2017
The test draws to the entire screen (2400x1600) multiple times. I'm not sure if that's the use case Chrome cares about, but I noticed some very big performance differences when I added timing:
Cyan: mmap_time: 0.001541 s, fault_time: 0.000397 s, flip_time: 0.021028 s, draw_time: 7.984037 s
Kevin: mmap_time: 0.006817 s, fault_time: 0.000204 s, flip_time: 0.008645 s, draw_time: 192.220471 s
Mediatek also exhibits the same slowness, so I'm adding djkurtz@ in case he knows anything. One ARM-based family (Nyan) passes the test without timing out: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/testDetails?testName=graphics_Drm.mmap_test&bypassOAuth=true#
Let's try to fix whatever issue we're seeing with the test and hopefully this will help the Chrome one-copy use case.
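For reference, per-phase numbers like the ones above (where Kevin's draw_time is roughly 24 times Cyan's) can be collected with a monotonic clock around each step. The following is only a sketch of that idea, not the timing code that was actually added to mmap_test; elapsed_seconds and time_phase are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#include <assert.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run one phase (for example the mmap, flip or draw step) and return how
 * long it took. CLOCK_MONOTONIC is used so wall-clock adjustments do not
 * distort the measurement. */
double time_phase(void (*phase)(void *), void *arg)
{
    struct timespec start, end;
    assert(phase != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    phase(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_seconds(start, end);
}
```

Accumulating the returned value per phase across all frames would yield totals in the same shape as the mmap_time, fault_time, flip_time and draw_time figures.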
,
Mar 14 2017
The leading theory is that rockchip mappings are uncached, unlike on Intel/Nyan (pre-faulting isn't used very much on any GPU family). Playing around with the page-protection flags during map/unmap may make a difference...
,
Mar 21 2017
What kind of drawing is performed by that mmap test? If it exhibits random access patterns and/or multiple overwrites of the same area, it's going to be slow indeed. I'd expect this kind of behavior when rasterizing tiles with Skia too. However, for a simple one-time read or write (as David mentioned regarding the laser pointer), it shouldn't be so horribly bad, although (as our other tests with V4L2 on Veyron indicated) there is still some benefit from caching.
,
Mar 21 2017
It's a linear access pattern (https://chromium.googlesource.com/chromiumos/platform/drm-tests/+/master/mmap_test.c). Also WiP CL if Chrome folks would like to test: https://chromium-review.googlesource.com/c/457176/
,
Mar 22 2017
The test case reads back from the framebuffer: *ptr |= (frame_index % 0x100) << 8; which is going to be extremely slow on uncached and write-combined maps. Rockchip gives us write-combined maps.
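To make the read-back concrete, the sketch below contrasts the test's read-modify-write pattern with one possible alternative: doing the read-modify-write in an ordinary cached shadow copy and pushing the result to the mapping with a write-only memcpy. The function names and the shadow buffer are assumptions for illustration; nothing in this thread measures whether the extra copy is a net win on rockchip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pattern from the test: each pixel is read from the mapping and written
 * back. On an uncached or write-combined mapping every read is slow. */
void draw_read_modify_write(uint32_t *fb, size_t pixels, unsigned frame_index)
{
    assert(fb != NULL);
    for (size_t i = 0; i < pixels; i++)
        fb[i] |= (uint32_t)(frame_index % 0x100) << 8;
}

/* Hypothetical alternative: keep the read-modify-write in a cached shadow
 * buffer, then touch the mapping only with sequential writes. */
void draw_via_shadow(uint32_t *fb, uint32_t *shadow, size_t pixels,
                     unsigned frame_index)
{
    assert(fb != NULL && shadow != NULL);
    for (size_t i = 0; i < pixels; i++)
        shadow[i] |= (uint32_t)(frame_index % 0x100) << 8;
    memcpy(fb, shadow, pixels * sizeof(*fb));
}
```

Both functions leave the same pixel values in fb as long as shadow starts out identical to fb; the difference is only in how the mapped memory is accessed. The second form resembles the memcpy-like, write-only access pattern mentioned earlier for the laser pointer.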
,
Mar 22 2017
Does Skia also read back from the framebuffer when rasterizing content? If not, the issue we're experiencing with the test may be different.
,
Mar 22 2017
Yes, Skia will behave like this when blending, for example.
,
Apr 12 2017
We're experiencing some artifacts (see attachment) with --enable-native-gpu-memory-buffers on Kevin (even without the caching patches). reveman@, Daniele mentioned you may know how to fix this. If so, how do we get rid of these artifacts? We're asking because with caching enabled we notice some additional artifacts, and we're wondering if getting rid of the "normal" artifacts will get rid of the additional ones as well.
,
Apr 13 2017
This patch will fix artifacts: https://codereview.chromium.org/2394833002 We can't land it as it is since it will regress on devices without native GMBs enabled.
,
Apr 14 2017
We can fix the synchronization issues by landing https://codereview.chromium.org/2446523002 and the set of patches that that change depends on.
,
Apr 14 2017
https://codereview.chromium.org/2394833002 fixes the normal artifacts, but not the ones introduced by caching. I'll make sure to test again whenever https://codereview.chromium.org/2446523002 is rebased to see if there is any improvement.
,
Jun 27 2017
When we worked on Intel chips, we tested smoothness.tough_texture_upload_cases, especially background_color_animation.html: https://bugs.chromium.org/p/chromium/issues/detail?id=475633#c46
We can run it with:
tools/perf/run_benchmark smoothness.tough_texture_upload_cases --browser=release --use-live-sites --also-run-disabled-tests --story-filter=background_color_animation.html
It's the same HTML as on my herokuapp: http://browsertests.herokuapp.com/perf/background_color_animation.html
The FPS counter showed us significantly different FPS before:
--show-fps-counter --enable-logging=stderr --vmodule="head*=1"
,
Jun 27 2017
The write-combine page-protection flag caused slowness on IA before, because the major client of the memory is Skia, which reads and writes random points a lot. It may be inherently slow on ARM. On Intel, both the CPU and GPU use shmem, and the last-level cache guarantees cache coherence between the CPU and GPU. Does the GPU on rockchip and ARM use system memory?
,
Jun 27 2017
@dongseong.hwang, I believe the FPS counter has been removed in current versions of Telemetry. Dongseong, can you run the smoothness.tough_texture_upload_cases on an Intel board (preferably 3.18 or 3.14, since 4.4 is unstable) with "--enable-native-gpu-memory-buffers" (this is the default with mmap bug fixed) and "!--enable-native-gpu-memory-buffers" and report the results? I want to see if the smoothness.tough_texture_upload_cases are still valid for measuring performance.
,
Jun 28 2017
My comment about the FPS counter was confusing, sorry. I meant Chrome with the FPS counter enabled. Load http://browsertests.herokuapp.com/perf/background_color_animation.html in Chrome and it shows a huge FPS difference. Currently, GPU rasterization rasters the given test. To see the difference, the following configurations are needed:
--disable-native-gpu-memory-buffers --disable-gpu-rasterization --enable-zero-copy
--enable-native-gpu-memory-buffers --disable-gpu-rasterization --enable-zero-copy
I'll do it on IA and report back soon. Right now smoothness.tough_texture_upload_cases doesn't work on my machine for some reason.
,
Jun 28 2017
Cool, thanks for the info. Don't test with --enable-zero-copy, since we're only considering --enable-native-gpu-memory-buffers for rockchip.
,
Jul 10 2017
As this seems to be a performance-related issue for rockchip, adding 'TE-NeedsTriageHelp' for further investigation. Thanks!
,
Jul 14 2017
dongseong.hwang@, any updates on the testing? Having a Telemetry test that measures differences with --enable-native-gpu-memory-buffers would be very useful to evaluate one-copy texture upload.
,
Nov 15 2017
I am not actively working on this since we should probably do GPU rasterization instead of the one-copy with cached mappings approach. For the mmap_test, we'll probably decrease the size of buffers it draws to so it can pass everywhere.
,
Nov 15 2017
Comment 1 by dcasta...@chromium.org
, Mar 13 2017