WebGL 2.0 getBufferSubData is slow
Issue description

Logging this one as I plan to investigate it further. I'm especially interested in transform feedback, as it seems like the best choice for getting computed data back from the GPU and feeding it to a physics engine (ammo.js).

The issue is known and related to Chrome's graphics pipeline implementation (see the Issue 616554 description). To mitigate it, the WEBGL_get_buffer_sub_data_async extension (also discussed in Issue 616554) has been implemented; it is still a draft. TODO: investigate how it behaves with complex scenes. Given that the extension is a draft, this bug is about the synchronous call, getBufferSubData. As the impact is severe, it makes sense to spend some more time trying to optimize it.

Attached is a simple app used to measure performance - open tf_demo.html and read the elapsed time printed in the console (a sketch of the measurement is included below).

The issue is most severe on Windows - Windows 10, Intel® Core™ i7-7700HQ CPU @ 2.80GHz × 8, GeForce GTX 1060/PCIe/SSE2; similar numbers with the integrated card. On Windows, getBufferSubData with a 640x480 transform feedback (the attached test) takes:
- Chrome: 2-40 ms, occasionally >60 ms; mostly values >10 ms.
- Firefox: 1-3 ms.

On Ubuntu 16.04, on the same hardware:
- Chrome: ~6 ms.
- Firefox: <0.9 ms.

On Chrome, similar code with more complex scenes almost always takes around 60 ms; it doesn't make a difference whether transform feedback mode ended in the same frame or in a previous one. In another experiment, gl.flush() and gl.finish() are inserted just before getBufferSubData (tf_demo.html), and the time spent in getBufferSubData itself is measured:
- Chrome: the same, 2-40 ms, mostly >10 ms.
- Firefox: 0.3 ms.
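For reference, here is a minimal sketch of the kind of measurement tf_demo.html performs; it is a reconstruction, not the attached file, and the shaders, buffer names, and sizes are illustrative assumptions. It runs one transform feedback pass over 640x480 points, drains the pipeline with flush/finish, then times the synchronous readback:

// Minimal sketch of the measurement; reconstructed, not the attached
// tf_demo.html. Shaders, names, and sizes are illustrative assumptions.
const gl = document.createElement('canvas').getContext('webgl2');

const vsSrc = `#version 300 es
in float a;
out float b;
void main() { b = 2.0 * a; }`;
const fsSrc = `#version 300 es
precision mediump float;
out vec4 c;
void main() { c = vec4(0.0); }`;

function compile(type, src) {
  const s = gl.createShader(type);
  gl.shaderSource(s, src);
  gl.compileShader(s);
  return s;
}
const prog = gl.createProgram();
gl.attachShader(prog, compile(gl.VERTEX_SHADER, vsSrc));
gl.attachShader(prog, compile(gl.FRAGMENT_SHADER, fsSrc));
// Capture the vertex shader output 'b' via transform feedback.
gl.transformFeedbackVaryings(prog, ['b'], gl.SEPARATE_ATTRIBS);
gl.linkProgram(prog);
gl.useProgram(prog);

const N = 640 * 480; // matches the 640x480 readback size in the report
const inBuf = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, inBuf);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(N).fill(1), gl.STATIC_DRAW);
const loc = gl.getAttribLocation(prog, 'a');
gl.enableVertexAttribArray(loc);
gl.vertexAttribPointer(loc, 1, gl.FLOAT, false, 0, 0);

const tfBuf = gl.createBuffer();
gl.bindBufferBase(gl.TRANSFORM_FEEDBACK_BUFFER, 0, tfBuf);
gl.bufferData(gl.TRANSFORM_FEEDBACK_BUFFER, N * 4, gl.DYNAMIC_READ);

// Run the transform feedback pass; no rasterization needed.
gl.enable(gl.RASTERIZER_DISCARD);
gl.beginTransformFeedback(gl.POINTS);
gl.drawArrays(gl.POINTS, 0, N);
gl.endTransformFeedback();
gl.disable(gl.RASTERIZER_DISCARD);

// Drain pending GPU work so only the readback itself is timed.
gl.flush();
gl.finish();

const out = new Float32Array(N);
const t0 = performance.now();
gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, out);
console.log('getBufferSubData:', (performance.now() - t0).toFixed(2), 'ms');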
Comment 1 by aleksand...@intel.com, Nov 15 2017
As far as I know, there will be no practical way to make getBufferSubData particularly fast in Chrome, due to a fundamental architectural difference: Chrome executes GPU calls in another process. That said, the numbers you're seeing are worse than I would have expected. There may be a bad interaction going on with ANGLE; I think we incur a shadow copy in some cases. +ANGLE.

The WebGL working group has been making steady progress on the design of WEBGL_get_buffer_sub_data_async. I think we can expect it to provide better performance even in other browsers (with a latency tradeoff). The next draft will be announced on the public_webgl Khronos mailing list, hopefully quite soon.
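For context, a usage sketch of the draft extension follows; since the API is still a draft, the method name and its Promise-returning shape here are assumptions based on the draft, not a finalized interface:

// Hypothetical usage of the draft WEBGL_get_buffer_sub_data_async extension;
// the exact API may change while it remains a draft.
const ext = gl.getExtension('WEBGL_get_buffer_sub_data_async');
const out = new Float32Array(640 * 480);
if (ext) {
  // Assumed to mirror getBufferSubData, but returning a Promise instead of
  // blocking the main thread on the GPU process.
  ext.getBufferSubDataAsync(gl.TRANSFORM_FEEDBACK_BUFFER, 0, out).then(() => {
    // 'out' now holds the readback; hand it to the physics engine.
  });
}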
Nov 17 2017
There are multiple GpuChannelHost::InternalFlush(uint32_t flush_id) calls issued every frame, from:
- unique_notifier.cc:32
- task_queue_manager.cc:408
- display_scheduler.cc:460
- and especially from CommandBufferHelper::InsertToken().

Significant time is spent within the GPU service between receiving GpuCommandBufferMsg_WaitForGetOffsetInRange and executing it. The GPU process performs multiple tasks for different clients, and as you pointed out, it would be difficult to change that.

I would like to propose another approach. A web developer could either:
- use a WebGLSync, knowing that once it is signaled the subsequent call to getBufferSubData will be fast (see the sketch after this comment), or
- not use a WebGLSync and, as specified, retry after some time in the hope that getBufferSubData no longer blocks.

It could be implemented by bookkeeping, on the client side within GLES2Implementation and CommandBufferHelper, the command ring buffer offset of the latest call that changes the state of the buffer. I don't think it would be necessary to use CommandBufferHelper::InsertToken(), like the one used for asynchronous PBO readback in GLES2Implementation::MapBufferCHROMIUM.

> That said, the numbers you're seeing are worse than I would have expected. There may be a bad interaction going on with ANGLE; I think we incur a shadow copy in some cases. +ANGLE.

The numbers are volatile, changing within the same session and also depending on system load. I need to investigate further to know whether it is ANGLE related.

> The WebGL working group has been making steady progress on the design of WEBGL_get_buffer_sub_data_async. I think we can expect it to provide better performance even in other browsers (with a latency tradeoff). The next draft will be announced on the public_webgl Khronos mailing list, hopefully quite soon.

I have verified that WEBGL_get_buffer_sub_data_async fixes the performance issue here, and the latency doesn't look higher than for the synchronous call (I didn't investigate much, though). It is noticeable, however, that the elapsed time for the asynchronous roundtrip is uneven: in one frame it is 2 ms, in the very next one it could be 25 ms, then 2 ms, 15, 3, 30 ms. It would be nice to make this more even, and I believe the approach proposed above would fix it. I plan to start with the transform feedback case and later add the getBufferSubData PBO case (async readPixels followed by getBufferSubData).
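A minimal sketch of the WebGLSync option above, from the web developer's side; the polling cadence and names are illustrative assumptions:

// Insert a fence right after the transform feedback pass, poll it each
// frame, and only call getBufferSubData once it is signaled, so that the
// call itself (ideally) no longer blocks.
const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
gl.flush(); // ensure the fence command reaches the GPU process

function poll() {
  // Timeout 0: just query the status (WebGL2 caps client wait timeouts).
  const status = gl.clientWaitSync(sync, 0, 0);
  if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
    gl.deleteSync(sync);
    const out = new Float32Array(640 * 480);
    gl.getBufferSubData(gl.TRANSFORM_FEEDBACK_BUFFER, 0, out);
    // feed 'out' to ammo.js ...
  } else {
    requestAnimationFrame(poll); // not signaled yet; try again next frame
  }
}
requestAnimationFrame(poll);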
Apr 2 2018

Sep 17
Verified that this is fixed by the Issue 828135 fix, using the attached example. Great work kainino@.
Sep 17
Great to hear, thanks for verifying! I'm still curious why the numbers were so bad. I wonder if Chrome's IPC is particularly slow on that machine.
Oct 10
Checked the numbers; they are an order of magnitude higher than with Firefox, on both Windows and Linux, and with both integrated and discrete GPUs. On Firefox, one observes measurements in the 0-2 ms range for several minutes; on Chrome, it could be any value between, e.g., 3 and 40 ms. Often, for several seconds there isn't a single measurement under 10 ms. This latency isn't affected by the fix to Issue 828135 - it takes about the same time until the fence sync gets signaled (a sketch of that measurement is below) - but with the 828135 fix, getBufferSubData doesn't block after it. I'll need to check this further.
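The fence-signal timing mentioned above can be measured with a sketch like the following; names are illustrative, and the actual numbers were taken with the attached example:

// Time how long it takes from fence creation until clientWaitSync first
// reports the fence as signaled.
const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
gl.flush();
const t0 = performance.now();
(function poll() {
  const status = gl.clientWaitSync(sync, 0, 0);
  if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
    console.log('fence signaled after',
                (performance.now() - t0).toFixed(2), 'ms');
    gl.deleteSync(sync);
  } else {
    setTimeout(poll, 0); // poll again as soon as possible
  }
})();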