Revisit forcing virtualized GL contexts for Qualcomm chipsets
Issue description

TL;DR: Qualcomm GPUs have been forced to use virtualized GL contexts since 2013. This prevents a planned optimization for WebVR. The underlying issue still exists, but is it really a driver bug, or is it something fixable in Chrome?

Chrome Version: 57.0.2978.0 + patches
OS: Android

What steps will reproduce the problem?
(1) Run a custom patched build including https://codereview.chromium.org/2586803003, which disables virtualized contexts for WebGL, overriding the GPU blacklist. I applied this patch on top of base revision 02323a21ecbf43a0a55146eb2dd4dbc96b55f41a and tested on a Pixel XL phone.
(2) Open https://rawgit.com/nselikoff/basic-camera-movement-three-js/master/index.html
(3) Set movement mode to "Pan" with the top-right selector.

What is the expected result?
Reasonably smooth panning.

What happens instead?
Very obvious tearing, and jerky movement where frames seem to arrive out of order. See the video recording here: https://www.youtube.com/watch?v=dJgkJgSBdxw

Ignore the small tear lines that appear to be caused by the recording camera's frame timing. The relevant part is the large tears and jerky movement that are mainly visible in the bottom part.

Currently Qualcomm GPUs enable virtualized contexts as a workaround for out-of-order frames, and this has been in place since 2013: https://codereview.chromium.org/2586803003/#msg44

    "cr_bugs": [289461],
    "description": "Non-virtual contexts on Qualcomm sometimes cause out-of-order frames",
    "os": { "type": "android" },
    "gl_vendor": "Qualcomm.*",
    "features": [ "use_virtualized_gl_contexts" ]

For the underlying issue 289461, was it ever definitely confirmed that this is a genuine driver bug? For example, in 2015, issue 554268 found mailbox manager sync problems where the underlying bug seemed to be misuse of fence objects, and this was also worked around by forcing virtualized contexts on. This appeared to be specific to Nvidia chipsets, but Qualcomm GPUs wouldn't have shown the issue due to the pre-existing driver override that already forced virtualized contexts on for them. See issue 510243 for further information, especially https://bugs.chromium.org/p/chromium/issues/detail?id=510243#c23 which quotes Nvidia's Michael Gold.

Is it possible that there's a more general synchronization issue for mailboxes or other internal Chrome features, i.e. related to per-context fences?

For background, I've been working on a feature to accelerate WebVR rendering by directly rendering WebGL content to an Android Surface, which saves a pixel copy step. Since it's directly doing 3D rendering to the default framebuffer, it needs a compatible surface for the GL context, including support for a depth buffer and other attributes. This is incompatible with virtualized contexts since those all share a surface used for creating the context, and this results in BAD_MATCH errors when later trying to switch to surfaces with different attributes.

My patch https://codereview.chromium.org/2586803003 implemented context-owned custom surfaces, and this required disabling virtualized contexts. That's tracked in issue 690106.
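
For readers unfamiliar with the workaround list, the entry quoted in the description can be read as a predicate over the device's OS and GL strings. The following is a minimal Python sketch of that reading, not Chrome's actual matching code. The helper name `applies` and the choice to match the pattern against the whole GL_VENDOR string are assumptions made for illustration.

```python
import re

# The entry quoted in the issue description, written as a Python dict.
ENTRY = {
    "cr_bugs": [289461],
    "description": "Non-virtual contexts on Qualcomm sometimes cause out-of-order frames",
    "os": {"type": "android"},
    "gl_vendor": "Qualcomm.*",
    "features": ["use_virtualized_gl_contexts"],
}


def applies(entry, os_type, gl_vendor):
    """Return the workaround features this entry enables for a device.

    Hypothetical helper: it assumes the pattern must match the whole
    GL_VENDOR string, which may differ from Chrome's real matching rules.
    """
    if entry["os"]["type"] != os_type:
        return []
    if re.fullmatch(entry["gl_vendor"], gl_vendor) is None:
        return []
    return list(entry["features"])
```

Under these assumptions, `applies(ENTRY, "android", "Qualcomm")` yields the `use_virtualized_gl_contexts` feature, while any other vendor or OS yields an empty list. This is why every Qualcomm device on Android is affected regardless of driver age.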
,
Feb 11 2017
@boliu, thank you for following up on this. You're right, I've experimented more and it turns out that the tearing is only triggered when low priority mode is activated for the GL context (via EGL_CONTEXT_PRIORITY_LEVEL_IMG). If I disable that part of the patch, the non-virtualized offscreen context appears to work as expected with no visible tearing, both for plain WebGL and WebVR. Sorry about the red herring. I'd still like to see if it's possible to narrow the blacklist to allow use of non-virtualized contexts, at least on Daydream-ready devices which presumably have recent drivers, considering that it doesn't seem to cause obvious issues after all.
,
Feb 11 2017
Pixel devices represent the best possible Android device, but the Android industry is super fragmented and has a long tail. We need to test on more popular devices (e.g. Samsung S series) before narrowing the blacklist.
,
Feb 11 2017
Could a first step be to very specifically narrow the blacklist, i.e. just exclude Pixel and/or Android N devices from it? FWIW, I think it's potentially interesting that the issue shows up when using a low priority context for WebGL rendering. It's possible that this is a driver bug and red herring, but I think it may also be that the changed priority and resulting preemption is highlighting a pre-existing synchronization issue related to non-virtualized contexts that would otherwise be hard to reproduce. I'll look into making a more restricted version of the own-surface patch that's off by default and doesn't change priority, so that there's a baseline for further experiments.
,
Feb 11 2017
,
Feb 13 2017
> just exclude Pixel and/or Android N devices from it?

Generally, we blacklist by OS (or possibly driver version) and chipset, but not any more granular than that, e.g. "adreno 5xx on M or above". Blacklisting by device is way too granular and serves no purpose for an unpopular device. Nexus/Pixel devices are probably a rounding error in the entire Android market.
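
As an illustration of the granularity described here, a rule at the "chipset plus OS version" level might be evaluated as below. This is a hypothetical Python sketch: the renderer string pattern, the use of 6.0 as the version number for Android M, and the function names are assumptions, and no such rule was proposed as an actual list entry in this thread.

```python
import re

# Hypothetical rule: "Adreno 5xx on M or above".
# The GL_RENDERER string format used here is an assumption.
ADRENO_5XX = re.compile(r"Adreno \(TM\) 5\d\d")
ANDROID_M = (6, 0)


def parse_version(text):
    """Turn '7.1.2' into (7, 1, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def may_skip_virtualized_contexts(gl_renderer, android_version):
    """True if the device falls into the hypothetical exempted group."""
    return (
        ADRENO_5XX.fullmatch(gl_renderer) is not None
        and parse_version(android_version) >= ANDROID_M
    )
```

The rule keys only on chipset family and OS version, so it would cover a Pixel and any other phone with the same GPU and OS level alike, rather than singling out one device model.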
,
Feb 16 2017
Just FYI, S6 Exynos/Mali produces the same issue, so it is not a Qualcomm-specific issue.
,
Mar 3 2017
,
Jun 6 2017
klausw@: Is this still relevant?
,
Jun 14 2017
This bug is still relevant, and is blocking an originally intended WebVR optimization. The plan was to enable drawing from WebGL directly to surfaces, but this broke non-WebVR GL content and needed to be rolled back.

I'm fairly confident that this is a Chrome bug and not an issue with the drivers. As far as I understand, the problem is that Chrome's mailbox synchronization via SyncToken is fundamentally just based on sequencing the emitted commands to the underlying actual GL contexts, but is not using any synchronization primitives that would actually enforce this across contexts.

This works fine when using virtualized contexts since there's just one underlying shared GL context, so the emitted commands from the virtual GL contexts all end up in the single GL output command stream in the correct sequence.

However, when not using virtualized contexts, the commands get sent to different underlying GL contexts. This happens in correct temporal sequence, but since the contexts are independent from each other it appears that the driver is technically free to reorder operations among them. This became especially apparent when I experimented with changing GL context priority, which made tearing due to bad synchronization very obvious.

As supporting evidence, MailboxManagerSync seems to have attempted to solve a similar issue by adding EGLFence objects, but this appears to not have had the desired effect, and virtualized contexts needed to be enabled to work around issues.

See also comment http://crbug.com/510243#c22 by gold@nvidia.com, Oct 28 2015:

> I found that performing a glFinish on LoseCurrent() resolves the corruption. This should not be necessary if the app is performing proper synchronization (GL/EGL make no guarantees about serialization across context switches, even within the same thread). Tegra4 has a channel per context whereas many implementations (including T124 and newer) use a channel per thread which implicitly forces serialization. If the new code depends on per-thread serialization, that could explain why it fails only on Tegra4. Of course its also possible we have a bug, but conceptually a fence is pretty simple to implement and I can't find a problem in our implementation.
>
> We're still looking at this as I'd rather not submit this workaround - it has performance consequences and may hide application errors. I emailed boliu more detail offline as I'm hoping he can take a look from the webview side.

bajones@ thought it is plausible that this effect wasn't visible on other GPUs because the drivers chose to respect the command order even if they technically don't have to, for example because this helped avoid issues due to erroneous assumptions in other software.

Does this sound plausible? If I'm misunderstanding things, I'd appreciate feedback.

https://cs.chromium.org/chromium/src/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_sync_point.txt
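
The ordering argument in this comment can be illustrated with a toy Python model. It is only a sketch and does not reflect Chrome's or any driver's real code. The `run` function, the command tuples, and the fixed scheduling preference (standing in for a driver that is free to reorder work across independent contexts) are all assumptions made for illustration.

```python
from collections import deque


def run(queues, order):
    """Execute per-context command queues and return the execution order.

    `order` lists contexts from most to least preferred by the modeled
    driver. A ("wait", fence) command blocks its own queue until the
    matching ("signal", fence) command has executed in any queue.
    """
    queues = {name: deque(cmds) for name, cmds in queues.items()}
    signaled = set()
    executed = []
    while any(queues.values()):
        progressed = False
        for name in order:
            queue = queues[name]
            if not queue:
                continue
            op, arg = queue[0]
            if op == "wait" and arg not in signaled:
                continue  # this context is blocked, try the next one
            queue.popleft()
            if op == "signal":
                signaled.add(arg)
            executed.append((name, op, arg))
            progressed = True
            break
        if not progressed:
            raise RuntimeError("deadlock: waiting on a fence that is never signaled")
    return executed


# Assumed policy: the modeled driver favors the compositor context.
ORDER = ["compositor", "webgl"]

# Virtualized contexts: one real context, so one ordered command stream.
virtualized = run({"shared": [("draw", "frame"), ("present", "frame")]}, ["shared"])

# Separate real contexts with no fence: issue order alone is not enforced.
unsynced = run(
    {"webgl": [("draw", "frame")], "compositor": [("present", "frame")]},
    ORDER,
)

# Separate real contexts with an explicit fence between producer and consumer.
synced = run(
    {
        "webgl": [("draw", "frame"), ("signal", "fence")],
        "compositor": [("wait", "fence"), ("present", "frame")],
    },
    ORDER,
)
```

In this model the single shared stream and the fenced case both execute the draw before the present, while the unfenced two-context case presents first. That corresponds to the out-of-order frames described above, with the scheduling preference playing the role of the changed context priority.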
,
Jun 14 2017
More specifically, the issue may be that Chrome is emitting GL commands to multiple contexts from within a single thread, which I assume is somewhat unusual application behavior, and as gold@nvidia.com mentioned, many driver implementations use a per-thread channel which would keep commands synchronized implicitly. I don't know if Qualcomm/Adreno or the S6 Exynos/Mali chipsets Artem mentioned in comment 7 use per-context channels. I think it would also seem likely that activating a non-default context priority would force it out of sharing a per-thread channel.
,
Jun 14 2017
I am taking back my comment about S6 Mali. I just saw judder that periodically happens with or without virtualized contexts, so I assume something is wrong with timings / vsyncs. In fact, I've seen the issue Klaus is describing only on a Pixel phone and a very old Samsung Note 4 (SD805 / Adreno 420). We use this approach to optimize WebVR for Oculus Browser on S6, S7 Mali, S7 Adreno (SD820/A530), S8 Mali, S8 Adreno, and Note 5 (Mali), and I haven't seen this issue on any of them. (The issue on the Note 4 can be seen here: https://youtu.be/tiQhhj1QlL8 )

One more detail: Mali (all of S6, S7 and S8) works fine with this optimization even with virtualized contexts (while with non-virtualized contexts it produces artifacts, like lost textures).
,
Jun 14 2017
And just FYI: Oculus is currently on M57 and we are planning to switch to M60 in the near future. We are planning to use the 'own offscreen surface' path as the default one, but keep the mailbox path as a fallback for phones with known issues (a good candidate for that is the Note 4, for example). Is this an approach you may be interested in as well? I know that the S8 is going to support Daydream (or already does?), and there is no reason to use the mailbox path on it (IMO), since the 'own offscreen surface' path works just fine.
,
Jun 19 2017
,
Jun 22 2017
@piman or @kbr, do you think my interpretation from comment #10 makes sense? TL;DR: Chrome makes assumptions about cross-context GL synchronization that are out of spec, and this was worked around by forcing virtualized GL contexts on Qualcomm. If I'm misunderstanding things, and it's really the driver's fault, is there an open issue with Qualcomm to get this resolved?
,
Jun 22 2017
,
Jun 22 2017
sunnyps@ and I talked about this issue a couple of days ago. I don't know the MailboxManager code well (and am working on other bugs; can't study it in depth right now). Sunny thinks we can make the synchronization cheaper. Glancing at MailboxManagerSync it uses EGL fences because there's no guarantee that the contexts are in the same share group, and these apparently are costly. If it were changed to use GL fences and server-side waits (perhaps only when we know we're sending mailboxes between contexts in the same share group), and assuming the VR context and the other context are in the same EGL share group, that might make the correct synchronization cheap enough to use.
,
Jun 22 2017
I don't think we'll get away without using EGLFences. Android doesn't provide flush ordering between contexts like Mac does so we'll still need proper synchronization. However, we won't need EGLImage based sharing if the (native) contexts are in the same (native) share group. Also, we may not need the snapshotting functionality in MailboxManagerSync if we can define the right semantics. In webview's case we aren't in control of android UI context so we have to snapshot each mailbox at every InsertFenceSync.
,
Dec 4 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/409774e95d22cffc51877e176045a82ef6f65907

commit 409774e95d22cffc51877e176045a82ef6f65907
Author: Klaus Weidner <klausw@chromium.org>
Date: Mon Dec 04 23:57:15 2017

Disable SupportOwnOffscreenSurface for WebGL

The WebVRExperimentalRendering flag is being repurposed for other rendering optimizations, and activating the own offscreen surface was causing glitches.

BUG=761432, 691102, 791755
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: I15ab8ee3c3370968ce5ea2ea6a170f6282858ad1
Reviewed-on: https://chromium-review.googlesource.com/802631
Commit-Queue: Klaus Weidner <klausw@chromium.org>
Reviewed-by: Brandon Jones <bajones@chromium.org>
Cr-Commit-Position: refs/heads/master@{#521535}

[modify] https://crrev.com/409774e95d22cffc51877e176045a82ef6f65907/third_party/WebKit/Source/modules/webgl/WebGLRenderingContextBase.cpp
,
Jul 4
,
Aug 7
Removing Blink>WebVR component and assigning to Blink>WebXR
,
Aug 7
Comment 1 by boliu@chromium.org
, Feb 10 2017