
Issue 785415

Starred by 3 users

Issue metadata

Status: Verified
Owner: ----
Closed: Sep 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 616554
issue 770381
issue 828135




WebGL 2.0 getBufferSubData is slow

Project Member Reported by aleksand...@intel.com, Nov 15 2017

Issue description

Logging this one as I plan to investigate it further.
I'm especially interested in transform feedback as it seems like the best choice to get back computed data from GPU and feed to physics engine (ammo.js).

The issue is known and related to Chrome's graphics pipeline implementation (see the Issue 616554 description).
To mitigate this, the WEBGL_get_buffer_sub_data_async extension (Issue 616554) has been implemented. It is still a draft. TODO: investigate how it behaves with complex scenes.

Given that the extension is a draft, the bug here is about the synchronous call, getBufferSubData. As the impact is severe, it makes sense to spend some more time trying to optimize this.

Attached is a simple app used to measure performance - open the tf_demo.html and get the elapsed time printed in console. 
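For reference, the measurement in the attached demo presumably boils down to timing a synchronous readback after transform feedback; a minimal sketch (function and buffer names here are illustrative, not taken from tf_demo.html):

```javascript
// Time one synchronous getBufferSubData readback from a transform
// feedback buffer and print the elapsed wall time to the console.
function measureReadback(gl, buffer, view) {
  gl.bindBuffer(gl.TRANSFORM_FEEDBACK_BUFFER, buffer);
  const t0 = performance.now();
  // This call may block until the GPU work is done and the data is copied.
  gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, view);
  const elapsed = performance.now() - t0;
  console.log(`getBufferSubData took ${elapsed.toFixed(2)} ms`);
  return elapsed;
}
```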

The issue is most severe on Windows (Windows 10, Intel® Core™ i7-7700HQ CPU @ 2.80GHz × 8, GeForce GTX 1060/PCIe/SSE2). The numbers are similar with the integrated card.

On Windows, getBufferSubData with 640x480 transform feedback (the attached test) takes:
Chrome: 2-40ms, occasionally >60ms; mostly >10ms.
Firefox: 1-3ms.

On Ubuntu 16.04, on the same hardware, the numbers are: Chrome: ~6ms, Firefox: <0.9ms.

On Chrome, similar code with more complex scenes almost always takes around 60ms; it doesn't make a difference whether transform feedback mode ended in the same or a previous frame.


In another experiment, gl.flush() and gl.finish() are inserted just before getBufferSubData (tf_demo.html), and the time spent in getBufferSubData is measured:
Chrome: the same, 2-40ms, mostly >10ms.
Firefox: 0.3ms.
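The flush/finish variant can be sketched like this (illustrative names; forcing the GPU work to complete first should leave only the data-transfer cost inside getBufferSubData itself):

```javascript
// Time getBufferSubData after explicitly draining the GPU pipeline,
// so the measured interval excludes pending GPU work.
function measureAfterFinish(gl, buffer, view) {
  gl.bindBuffer(gl.TRANSFORM_FEEDBACK_BUFFER, buffer);
  gl.flush();   // submit any pending commands
  gl.finish();  // block until the GPU has executed them
  const t0 = performance.now();
  gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, view);
  return performance.now() - t0;
}
```

If getBufferSubData is still slow after gl.finish(), as reported above for Chrome, the remaining time is being spent in the readback path itself rather than waiting for GPU work.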




 
 
tf_demo.html (5.6 KB)
Two more notes:
- the time is spent in WaitForCmd() in MapBufferRange implementation [1].

- adding Issue 770381 as a dependency. From the Khronos mailing list (see Issue 770381), it looks like the promotion of WEBGL_get_buffer_sub_data_async out of draft state depends on a sync- or query-object based API. I might be wrong there.

[1]
https://cs.chromium.org/chromium/src/gpu/command_buffer/client/gles2_implementation.cc?rcl=9e66f6a2648e6dd60e292806c011a45d179b76d5&l=5193
Blockedon: 770381 616554
Components: Internals>GPU>ANGLE
Status: Available (was: Untriaged)
As far as I know, there will be no practical way to make getBufferSubData particularly fast in Chrome, due to its fundamental architectural difference (Chrome executes GPU calls in another process).

That said, the numbers you're seeing are worse than I would have expected. There may be a bad interaction going on with ANGLE. I think that we incur a shadow copy in some cases. +ANGLE.

The WebGL working group has been making steady progress on the design of WEBGL_get_buffer_sub_data_async. I think we can expect it to provide better performance even in other browsers (with a latency tradeoff). The next draft will be announced on the public_webgl Khronos mailing list, hopefully quite soon.
There are multiple GpuChannelHost::InternalFlush(uint32_t flush_id) calls issued every frame, from:
unique_notifier.cc:32
task_queue_manager.cc:408
display_scheduler.cc:460
and especially from CommandBufferHelper::InsertToken().
Significant time is spent, within the GPU service, after receiving GpuCommandBufferMsg_WaitForGetOffsetInRange until it gets executed; the GPU process performs multiple tasks for different clients.
As you pointed out, it would be difficult to change that.

I would like to propose another approach:
A web developer could use WebGLSync, knowing that once it is signaled, the subsequent call to getBufferSubData would be fast; or, without WebGLSync, as specified, try again after some time in the hope of hitting getBufferSubData with no blocking.

It could be implemented by bookkeeping the offset of the latest call (command ring buffer state offset) and changing the state of the buffer on the client side, within GLES2Implementation and CommandBufferHelper.
I don't think it would be necessary to use CommandBufferHelper::InsertToken(), like the one used for asynchronous PBO readback in GLES2Implementation::MapBufferCHROMIUM.
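From the web developer's side, the fence-based pattern proposed above could look roughly like this (a sketch of the usage pattern only, not of the Chrome-internal bookkeeping; names are illustrative):

```javascript
// Insert a fence after the transform feedback work, poll it without
// blocking, and only call getBufferSubData once the fence has signaled,
// so the readback call itself should not have to wait on the GPU.
function readWhenReady(gl, buffer, view, onData) {
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush(); // make sure the fence command is actually submitted
  function poll() {
    const status = gl.clientWaitSync(sync, 0, 0); // timeout 0: non-blocking check
    if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
      gl.deleteSync(sync);
      gl.bindBuffer(gl.TRANSFORM_FEEDBACK_BUFFER, buffer);
      gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, view);
      onData(view);
    } else {
      setTimeout(poll, 0); // not signaled yet; try again later
    }
  }
  poll();
}
```

The proposal is essentially that Chrome's client side could guarantee that, on this path, the getBufferSubData after a signaled fence completes without the WaitForCmd() stall.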


>That said, the numbers you're seeing are worse than I would have expected. There may be a bad interaction going on with ANGLE. I think that we incur a shadow copy in some cases. +ANGLE.

The numbers are volatile, changing within the same session and also depending on system load. Further investigation is needed to know whether it is ANGLE-related.

>The WebGL working group has been making steady progress on the design of WEBGL_get_buffer_sub_data_async. I think we can expect it to provide better performance even in other browsers (with a latency tradeoff). The next draft will be announced on the public_webgl Khronos mailing list, hopefully quite soon.

I have verified that WEBGL_get_buffer_sub_data_async fixes the performance issue here, and the latency doesn't look higher than for the synchronous call (I didn't investigate much, though). However, the elapsed time for the asynchronous roundtrip is noticeably uneven: in one frame it is 2ms, in the very next one it could be 25ms, then 2ms, 15ms, 3ms, 30ms. It would be nice to make this more even, and I believe the proposed approach above would fix it.
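Assuming the draft extension's getBufferSubDataAsync entry point returns a Promise (as in the draft spec at the time; the signature may have changed between drafts), the per-roundtrip measurement might look like:

```javascript
// Measure one asynchronous readback roundtrip with the draft
// WEBGL_get_buffer_sub_data_async extension. The Promise resolves once
// the data has been copied, so the main thread never blocks on the GPU.
function readAsync(gl, buffer, view) {
  const ext = gl.getExtension('WEBGL_get_buffer_sub_data_async');
  if (!ext) return Promise.reject(new Error('extension unavailable'));
  gl.bindBuffer(gl.TRANSFORM_FEEDBACK_BUFFER, buffer);
  const t0 = performance.now();
  return ext.getBufferSubDataAsync(gl.TRANSFORM_FEEDBACK_BUFFER, 0, view)
    .then(() => performance.now() - t0); // elapsed roundtrip time in ms
}
```

Logging the resolved elapsed time per frame is how the 2ms/25ms/15ms unevenness described above would show up.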

I plan to start with the transform feedback case and later add the getBufferSubData PBO case (the async readPixels via getBufferSubData use case).


Blockedon: 828135
Status: Fixed (was: Available)
Verified that this is fixed by the Issue 828135 fix, using the attached example. Great work, kainino@.
tf_demo.html (6.4 KB)
Status: Verified (was: Fixed)
Great to hear, thanks for verifying!

I'm still curious why the numbers were so bad. I wonder if Chrome's IPC is particularly slow on that machine.
Checked the numbers; they are an order of magnitude higher than with Firefox, on Windows and Linux, both with integrated and discrete GPUs.
On Firefox, one would observe measurements in the 0-2ms range for several minutes; on Chrome it could be any value between, e.g., 3 and 40ms. Often, for several seconds there wouldn't be a single measurement under 10ms.

This latency isn't affected by the fix for Issue 828135 - it takes about the same time until the fence sync gets signaled - but with the 828135 fix, getBufferSubData doesn't block after it.

I'll need to check this further.
