
Issue 601608

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug

Blocked on:
issue 611805

Blocking:
issue 533617
issue 581777
issue 607130




Simplify timing in ImageTransportSurfaceOverlayMac

Project Member Reported by ccameron@chromium.org, Apr 7 2016

Issue description

ImageTransportSurfaceOverlayMac attempts to coordinate when swaps occur on Mac, but does a bad job at it.

The current design is as follows.

The lifetime of a SwapBuffers call is:
0. GPU work is issued (either compositor, WebGL, or potentially nothing if we're in CA mode)
1. SwapBuffers is called on ImageTransportSurfaceOverlayMac
2. Create and issue a GL fence object and glFlush
3. Package up the swap information into a PendingSwap
4. Post a task to draw the PendingSwap in 1 vsync
[ Wait until either (A) a subsequent SwapBuffers call comes in or (B) the posted task runs ]
5. Finish on the GL fence object
6. Update the CALayer tree and ack the swap to the browser

It's actually more complicated than that, but let's live in the simplified world.

These are pipelined, so if you have swaps A, B, C coming in at 60fps, you'll have the sequence of events:

A0,A1,A2,A3,A4,......A5,A6.
               B0,B1,......B2,B3,B4,......B5,B6,
                                    C0,C1,......C2,C3,C4.. (some time later) ..C5,C6
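
To make the ordering concrete, here is a minimal C++ sketch that models only the event order of the steps above. `OldPipelineModel`, `TraceThreeSwaps` and the string trace are hypothetical names made up for illustration; nothing here calls GL or CoreAnimation, and the real class is more complicated than this.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of the original pipeline. It records event names such
// as "A5" (frame A, step 5) and performs no real GPU or CALayer work.
struct OldPipelineModel {
  std::vector<std::string> trace;   // Events in the order they happen.
  std::deque<std::string> pending;  // Swaps still waiting for steps 5 and 6.

  // Step 0: GPU work for a frame is issued.
  void IssueGpuWork(const std::string& frame) { trace.push_back(frame + "0"); }

  void SwapBuffers(const std::string& frame) {
    trace.push_back(frame + "1");  // Step 1: SwapBuffers is called.
    DrawPendingSwaps();            // Wait case (A): a later swap came in.
    assert(pending.empty());       // Earlier swaps are drained at this point.
    trace.push_back(frame + "2");  // Step 2: create the fence and glFlush.
    trace.push_back(frame + "3");  // Step 3: package up a PendingSwap.
    pending.push_back(frame);
    trace.push_back(frame + "4");  // Step 4: post the delayed draw task.
  }

  // Wait case (B): the task posted in step 4 runs one vsync later.
  void OnDelayedTask() { DrawPendingSwaps(); }

  void DrawPendingSwaps() {
    while (!pending.empty()) {
      const std::string frame = pending.front();
      pending.pop_front();
      trace.push_back(frame + "5");  // Step 5: finish on the GL fence.
      trace.push_back(frame + "6");  // Step 6: update CALayer tree and ack.
    }
  }
};

// Replays swaps A, B, C arriving back to back, then lets the last posted
// task run so that C is drawn by the timer rather than by a later swap.
std::vector<std::string> TraceThreeSwaps() {
  OldPipelineModel model;
  const std::vector<std::string> frames = {"A", "B", "C"};
  for (const std::string& frame : frames) {
    model.IssueGpuWork(frame);
    model.SwapBuffers(frame);
  }
  model.OnDelayedTask();
  return model.trace;
}
```

In the resulting trace, B0 is recorded before A5 and A6, i.e. the GPU work for the next frame is issued before the previous frame's fence is finished and its CALayer tree is updated.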

This complexity isn't buying us much. Issues we see are as follows:

* WindowServer starvation: Notice that the GPU work issued to B0 is sent before the GL fence and CALayer update at A5. This means that it is likely that this work will be already issued to the GPU before WindowServer can render the updated CALayer tree in A6. This manifests as the actual framerate being far below the framerate that Chrome thinks it has.

* Latency: This adds at least an extra vsync of latency before swaps are acknowledged.

* CPU usage: The GL fence finish in step 5 sometimes busy-waits on crappy drivers.

It appears that the following simplified pipeline works well:
0. GPU work is issued (either compositor, WebGL, or potentially nothing if we're in CA mode)
1. SwapBuffers is called on ImageTransportSurfaceOverlayMac
2. Do glFinish on the GL context
3. Update the CALayer tree and ack the swap to the browser

Note that we do the glFinish in step 2 because we have no other way of getting GPU back-pressure -- we get no signal from the WindowServer that our CALayer tree update has actually been displayed.

Also note that even though we are doing the glFinish in step 2, we do allow a pipeline of pending frames in the browser process, so the browser will queue up new GL commands for us to execute.
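
As a rough sketch of the simplified pipeline (event order only), assuming the hypothetical names `SimplifiedPipelineModel` and `TraceTwoFrames`: the recorded strings stand in for real GL command decoding, glFinish, the CALayer update and the ack, none of which are actually performed here.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of the simplified pipeline. The browser may queue
// several frames, but the GPU process handles them strictly one at a time.
struct SimplifiedPipelineModel {
  std::vector<std::string> trace;         // Events in the order they happen.
  std::deque<std::string> queued_frames;  // Frames already sent by the browser.

  void QueueFrame(const std::string& frame) { queued_frames.push_back(frame); }

  void ProcessNextFrame() {
    assert(!queued_frames.empty());
    const std::string frame = queued_frames.front();
    queued_frames.pop_front();
    trace.push_back(frame + ": decode GL commands");           // Step 0
    trace.push_back(frame + ": SwapBuffers");                  // Step 1
    trace.push_back(frame + ": glFinish");                     // Step 2
    trace.push_back(frame + ": update CALayer tree and ack");  // Step 3
  }
};

// Two frames are queued up front, yet B's commands are only decoded after
// A's glFinish, CALayer update and ack have all been recorded.
std::vector<std::string> TraceTwoFrames() {
  SimplifiedPipelineModel model;
  model.QueueFrame("A");
  model.QueueFrame("B");
  model.ProcessNextFrame();
  model.ProcessNextFrame();
  return model.trace;
}
```

The point of the model is that no event of frame B is interleaved with frame A, which is what the glFinish buys in terms of back-pressure.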

The drawbacks to this are:

GPU process pipelining: We can't start decoding GL commands for the next frame until the glFinish has completed. So, if we're limited by GL command decoding, this will make it harder to hit 60fps. That said, if we do allow GL command decoding, then we hit the starvation issue from before -- I think that this is the better risk to run.

Multiple Windows: With multiple windows, we don't have an easy way to batch up the GL commands into a single glFinish. This has not been a substantial issue in practice.

Adding jank: Updating a CALayer tree with vsync is tricky business -- we want to avoid the part of the vsync interval where the WindowServer picks up its new contents (because if we are updating the CALayer tree at the same time, we may end up with missed and duplicated frames because of timing artifacts). We already run this risk to some degree with the existing code (we have a glFinish in there) -- avoiding it completely would require having a very Mac-specific scheduler (maybe not a bad idea).
 

Comment 1 by kbr@chromium.org, Apr 7 2016

Cc: postfil...@gmail.com
Note: @thespite (on Twitter) pointed out that @alteredq's GPU-bound demos claimed to be rendering (according to Chrome's FPS meter and other measurements) much faster than they visibly were. One sample, on an NVIDIA Retina MBP, claimed to be rendering at ~50 FPS but was visibly rendering at around 20 FPS:

http://alteredqualia.com/xg/examples/liquid_face.html

This is a point where we should probably reiterate that "there is no limit to the complexity of the wrong solution".

Comment 3 by kbr@chromium.org, Apr 8 2016

Cc: thesp...@gmail.com
Thanks to @thespite for pointing out this problem.

A few questions:

1. Why does step 6 (update CALayer tree) happen after glFinish or FinishFence? Can't we do that right after the flush?
2. Would it be better if we could periodically TestFence after the flush instead of FinishFence after an entire vsync interval?
3. Even with the glFinish how do we know that all of our frames are being displayed by the WindowServer?
To clarify my first question, it looks like all we need to do is ensure that we don't change the contents of the CALayer too often (which leads to dropped frames), and glFinish ensures that most of the time whereas a flush wouldn't. So it looks like all glFinish is doing for us is delaying the swap ack to the browser.

So we need:

A: Flush -> CALayer commit -> (periodically TestFence) -> SwapAck -> Next Vsync
B:                         Flush -> ... after the vsync following swapack ... -> CALayer commit

This still doesn't guarantee that A's CALayer commit was displayed (I believe this is impossible with CALayer + setContents), but it should achieve the same result as a glFinish, right?
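
A minimal sketch of the "periodically TestFence" idea, under stated assumptions: `PollFenceThenAck` and its callback parameters are hypothetical, and the fence query is passed in as a callback rather than calling a real GL fence API. In the GPU process the polling would presumably be spread across posted tasks rather than run as a loop.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: polls `test_fence` (a stand-in for a non-blocking GL
// fence query) up to `max_polls` times and calls `ack_swap` once the fence
// reports completion. Returns the number of polls made, or -1 if the fence
// never signaled within the budget (in which case no ack is sent).
int PollFenceThenAck(const std::function<bool()>& test_fence,
                     int max_polls,
                     const std::function<void()>& ack_swap) {
  assert(max_polls > 0);
  for (int poll = 1; poll <= max_polls; ++poll) {
    if (test_fence()) {
      ack_swap();  // SwapAck is delayed until the GPU work has completed.
      return poll;
    }
    // A real implementation would yield here (for example by posting a
    // delayed task) instead of spinning on the fence.
  }
  return -1;
}
```

This only delays the ack. It does not stop GL commands for the next frame from being decoded and issued in the meantime.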
To the question in #4:

First, to limit our attention, I'm only concerned here with GPU-bound or nearly-GPU-bound situations. For non-GPU-bound situations, an immediate glFlush+CACommit+Ack works fine.

The problem is that setting up any kind of pipeline in the GPU process causes starvation of the WindowServer and dropped frames.

Q1: Just doing a glFlush and delaying the ack will not stop the renderer from submitting the GL commands (e.g., heavy WebGL calls) for the next frame (recall that we can have ~2 in-flight frames). The GL work for the next frame will starve the WindowServer, and result in only the second frame being shown (sometimes).

Q2: This would add a lot of complexity. What would this gain us?

Q3: We don't, at least not with any APIs that I've found so far. I suspect there is a mechanism out there -- after all, CAOpenGLLayer seems to do this. We should spend more time investigating this issue. If we do find a reliable signal that content has appeared on the screen, then we should re-investigate how to ensure maximum smoothness.

Project Member

Comment 7 by bugdroid1@chromium.org, Apr 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5a3c87b37a46ce1b0813f0530d981165daaa7cfe

commit 5a3c87b37a46ce1b0813f0530d981165daaa7cfe
Author: ccameron <ccameron@chromium.org>
Date: Sun Apr 10 05:34:35 2016

Mac: Clean up ImageTransportSurfaceOverlayMac timing

Change the processing of a frame to be the following:
0. GPU commands are decoded (compositor, WebGL, nor nothing)
1. SwapBuffers is called on ImageTransportSurfaceOverlayMac
2. Do glFinish on the GL context
3. Update the CALayer tree and ack the swap to the browser

This is much simpler to reason about, and appears to result in improved
performance.

BUG= 601608 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review URL: https://codereview.chromium.org/1867163002

Cr-Commit-Position: refs/heads/master@{#386316}

[modify] https://crrev.com/5a3c87b37a46ce1b0813f0530d981165daaa7cfe/gpu/ipc/service/image_transport_surface_overlay_mac.h
[modify] https://crrev.com/5a3c87b37a46ce1b0813f0530d981165daaa7cfe/gpu/ipc/service/image_transport_surface_overlay_mac.mm

Cc: erikc...@chromium.org
erikchen just added support to draw WebGL using the CoreAnimation renderer. It appears that, as a side-effect of this, we no longer glFinish the WebGL work, because we're only glFinish-ing the compositor context.

Ideas on how to reach over to the WebGL CGLContextObj?

Comment 9 by kbr@chromium.org, Apr 14 2016

We could instrument DrawingBuffer.cpp in Blink to do this when it produces its mailboxes, though doing it in exactly the right place and the right number of times per frame may be tricky. What exactly is the desired situation? glFinish right before WebGL produces its mailbox, for example? What about if there are 2 WebGL-rendered canvases on the page?

Comment 10 by kbr@chromium.org, May 2 2016

Blocking: 607130

Comment 11 by kbr@chromium.org, May 12 2016

Blocking: 533617

Comment 12 by kbr@chromium.org, May 12 2016

Blocking: 581777

Comment 13 by kbr@chromium.org, May 18 2016

Blockedon: 611805
Cc: ericrk@chromium.org
Status: Fixed (was: Started)
ericrk has further improved this
Cc: ccameron@chromium.org
Issue 551483 has been merged into this issue.
