Frames rendered in fast/slow pattern, fence related?
Issue description

TL;DR: very mysterious frame timing, where one frame takes twice the expected time followed by an almost-instant frame. Not sure if this is a misbehaving fence, or if rendering is really slow due to pipelining?

Chrome Version: ToT M61 @ 86245656c46e084d9c80bcef68e6181053226d1e (Jul 20)
OS: Android

What steps will reproduce the problem?
(1) Open https://webvr.info/samples/test-slow-render.html?renderScale=1&heavyGpu=1&cubeScale=0.3&workTime=4
(2) Tap "Enter VR"
(3) Collect an event trace

What is the expected result?
A reasonable framerate, with all frames taking about the same time.

What happens instead?
The framerate is 30-35fps on a Pixel XL, and the suspicious thing is that the timing shows an alternating sequence of slow and fast frames. See the attached event trace; I've added an image with some annotations drawn in to show the frame lifecycle.

The blocking factor seems to be DrawFrameSubmitWhenReady polling for a GLFenceEGL to signal that drawing has completed. This takes about 30ms for one frame, then essentially zero for the next frame.

Another noteworthy detail is how the CPU usage drops to near zero while waiting. This may be normal since I don't know if GPU usage would show here; the VrShellGl rendering context is a direct ui/gl/ context that doesn't use the CrGpuMain command buffer.

The code active here is: https://cs.chromium.org/chromium/src/chrome/browser/android/vr_shell/vr_shell_gl.cc?l=876

  [draw image, this is a simple fullscreen quad copying a SurfaceTexture]
  // Continue with submit once a GL fence signals that current drawing
  // operations have completed.
  std::unique_ptr<gl::GLFenceEGL> fence = base::MakeUnique<gl::GLFenceEGL>();
  task_runner_->PostTask(
      FROM_HERE, base::Bind(&VrShellGl::DrawFrameSubmitWhenReady,
                            weak_ptr_factory_.GetWeakPtr(), frame_index,
                            frame.release(), render_info_primary_.head_pose,
                            base::Passed(&fence)));
  ...
  // Then, in DrawFrameSubmitWhenReady, effectively:
  while (/* not signaled yet */)
    fence->ClientWaitWithTimeoutNanos(
        kWebVRFenceCheckTimeout.InMicroseconds() * 1000);
  frame.Submit(*buffer_viewport_list_, mat);

Anyone have an idea what may be going on here? I vaguely remember that the Adreno GPU has different rendering modes with tradeoffs between latency and throughput; is it possible that it's toggling between modes in some way and trying to render frames in parallel? Or is there an issue where fences don't signal completion even though the drawing operations are actually done?

I once had a similar effect in a prototype when misusing server waits (eglWaitSyncKHR): I emitted two server waits in a row, and the second one blocked GPU operations, including checking earlier fences. That had almost identical symptoms. However, as far as I know, neither my code nor the GVR code for the following steps uses server waits. Does Chrome use them internally somewhere in the GPU process?

Anything else I could dig up to try to help debug this?

FWIW, the code I measured with has a 1-line change to disable vsync-aligned JS rAF timing. This doesn't change the framerate significantly but shows the pattern more clearly: https://cs.chromium.org/chromium/src/chrome/browser/android/vr_shell/vr_shell_gl.cc?q=vr_shell_gl.cc+file:%5Esrc/+package:%5Echromium$&dr&l=1094

  - if (surfaceless_rendering_ || !pending_vsync_) {
  + if (!pending_vsync_) {
Jul 20 2017
Out of paranoia I added some debugging output to ui/gl/gl_fence_egl.cc to make sure I'm not doing something silly like destroying/recreating the fences instead of passing them. It seems to confirm the behavior.

For reference:

  third_party/khronos/EGL/egl.h:#define EGL_TIMEOUT_EXPIRED 0x30F5
  third_party/khronos/EGL/egl.h:#define EGL_CONDITION_SATISFIED 0x30F6

The GL calls being executed are basically this:

  sync_ = eglCreateSyncKHR(eglGetCurrentDisplay(), EGL_SYNC_FENCE_KHR, NULL);
  glFlush();
  ...
  result = eglClientWaitSyncKHR(display_, sync_, flags, timeout /*= 2000 ns */);

Here's a few frames in sequence:

  07-20 16:40:08.307 20047 21275 I chromium: [INFO:gl_fence_egl.cc(25)] GLFenceEGL;;; this=0xc38adc40 CONSTRUCTOR
  07-20 16:40:08.308 20047 21275 I chromium: [INFO:gl_fence_egl.cc(29)] GLFenceEGL;;; this=0xc38adc40 flushed
  07-20 16:40:08.308 20047 21275 I chromium: [INFO:vr_shell_gl.cc(879)] DrawFrame;;; fence=0xc38adc40
  07-20 16:40:08.311 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc38adc40 result=0x30f5
  07-20 16:40:08.339 20047 21275 I chatty : uid=10133(u0_a133) Thread-21 identical 12 lines
  07-20 16:40:08.341 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc38adc40 result=0x30f5
  07-20 16:40:08.343 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc38adc40 result=0x30f6
  Elapsed: ~36ms

  07-20 16:40:08.349 20047 21275 I chromium: [INFO:gl_fence_egl.cc(25)] GLFenceEGL;;; this=0xc38ad9e0 CONSTRUCTOR
  07-20 16:40:08.350 20047 21275 I chromium: [INFO:gl_fence_egl.cc(29)] GLFenceEGL;;; this=0xc38ad9e0 flushed
  07-20 16:40:08.350 20047 21275 I chromium: [INFO:vr_shell_gl.cc(879)] DrawFrame;;; fence=0xc38ad9e0
  07-20 16:40:08.350 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc38ad9e0 result=0x30f6
  Elapsed: ~1ms

  07-20 16:40:08.371 20047 21275 I chromium: [INFO:gl_fence_egl.cc(25)] GLFenceEGL;;; this=0xc72dce80 CONSTRUCTOR
  07-20 16:40:08.371 20047 21275 I chromium: [INFO:gl_fence_egl.cc(29)] GLFenceEGL;;; this=0xc72dce80 flushed
  07-20 16:40:08.371 20047 21275 I chromium: [INFO:vr_shell_gl.cc(879)] DrawFrame;;; fence=0xc72dce80
  07-20 16:40:08.374 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc72dce80 result=0x30f5
  07-20 16:40:08.403 20047 21275 I chatty : uid=10133(u0_a133) Thread-21 identical 12 lines
  07-20 16:40:08.405 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc72dce80 result=0x30f5
  07-20 16:40:08.406 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc72dce80 result=0x30f6
  Elapsed: ~35ms

  07-20 16:40:08.417 20047 21275 I chromium: [INFO:gl_fence_egl.cc(25)] GLFenceEGL;;; this=0xc38ad9a0 CONSTRUCTOR
  07-20 16:40:08.417 20047 21275 I chromium: [INFO:gl_fence_egl.cc(29)] GLFenceEGL;;; this=0xc38ad9a0 flushed
  07-20 16:40:08.418 20047 21275 I chromium: [INFO:vr_shell_gl.cc(879)] DrawFrame;;; fence=0xc38ad9a0
  07-20 16:40:08.418 20047 21275 I chromium: [INFO:gl_fence_egl.cc(58)] ClientWaitWithTimeoutNanos;;; this=0xc38ad9a0 result=0x30f6
  Elapsed: <1ms
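For anyone reading along, the Chromium wrapper boils down to roughly this raw EGL polling loop (a minimal sketch, assuming the KHR_fence_sync entry points have already been loaded via eglGetProcAddress; error handling omitted, and in the real code the retry is a re-posted task rather than a tight loop):

  EGLDisplay display = eglGetCurrentDisplay();
  EGLSyncKHR sync = eglCreateSyncKHR(display, EGL_SYNC_FENCE_KHR, NULL);
  glFlush();  // The fence only makes progress once it's flushed to the GPU.

  // Poll with a tiny timeout so the thread stays responsive;
  // 0x30F5 = EGL_TIMEOUT_EXPIRED, 0x30F6 = EGL_CONDITION_SATISFIED.
  EGLint result;
  do {
    result = eglClientWaitSyncKHR(display, sync, 0 /* flags */,
                                  2000 /* timeout, nanoseconds */);
  } while (result == EGL_TIMEOUT_EXPIRED);

  eglDestroySyncKHR(display, sync);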
Jul 21 2017
+kbr, since you were involved in some of the WebVR discussions previously: do you have any ideas about what could explain the very odd timing? Or do you think it would make sense to discuss this with someone from Qualcomm in case the unusual rendering loop is confusing the driver?
Jul 21 2017
Correction: billorr@ pointed me in the right direction - there is actually significant overlap in rendering between frames; the code starts setting up the mailbox for transfer before the waitForPreviousRenderToComplete step. See the corrected interpretation of the original trace.
Jul 21 2017
It does seem that the effect is caused by parallel GPU work in different contexts, though it's odd that it's so drastic. I tried an experiment that moves the wait-for-previous-frame-to-render to before the step where it sets up the mailbox; this evens out the frame timing nicely but only improves the framerate very slightly, see the new traces. I guess it may work better to start the next frame's rendering slightly earlier to improve parallelism, but not so early as to bog down the GPU with parallel work.
Jul 21 2017
+missing attachment for the trace itself
Jul 21 2017
I'm not really familiar with this code but a couple of things jump out at me from the trace. In trace_totnoalign-earlywait.html, VRDisplay::GetImage is making two long calls to SyncChannel::Send, taking between 1.3 ms and 1.6 ms each; VRDisplay::EnsureMailbox then calls SyncChannel::Send taking 1.5 ms; and finally VRDisplay::SubmitFrame takes about 0.75 ms. I mentioned on one of the earlier CLs that it seemed like a bad idea to be calling ClientWaitSync for synchronization purposes; not sure whether that CL has landed, but making round-trips to the GPU process from the renderer process each frame should be avoided at all costs.

It's another question why waitForPreviousRenderToFinish is taking upwards of 10 ms per frame and sometimes 18 ms per frame. The GPU process doesn't appear to be very busy in that trace; it's doing about 10 ms of work periodically. So it seems to me it should be possible to optimize things and get 60 FPS in this case.

You should dig deeper to understand exactly the behavior you want for a simple case, and make sure you're getting it per tracing. In my cursory understanding of the code, I suggest producing an EGLSync object from the producer as well as the texture it renders, and then the VRShell should pick it up, do an eglWaitSync (not eglClientWaitSync) and draw that texture. Once the traces for the simple case look good, then go forward and try the more complex GPU-bound cases.

If absolutely necessary, you might do an operation like draining the graphics pipe once per frame via glFinish(), but I wouldn't introduce this unless you've found it is necessary due to overloading the GPU.
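For clarity, the producer/consumer handoff being suggested looks roughly like this (a sketch assuming EGL_KHR_wait_sync and a sync object shared between two contexts on the same display; the draw helpers are hypothetical, not the actual Chromium plumbing):

  // Producer context: render to the texture, then insert a fence.
  RenderSceneToTexture(shared_texture);  // hypothetical producer draw
  EGLSyncKHR sync = eglCreateSyncKHR(display, EGL_SYNC_FENCE_KHR, NULL);
  glFlush();

  // Consumer context (VRShell): queue a server-side wait, then draw.
  // eglWaitSyncKHR returns immediately; the GPU defers subsequent
  // commands until the fence signals, so the CPU never blocks.
  eglWaitSyncKHR(display, sync, 0);
  DrawTexturedQuad(shared_texture);      // hypothetical consumer draw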
Jul 24 2017
@kbr, thank you for the feedback. Would you be willing to do a VC to discuss this in more detail?

> The GPU process doesn't appear to be very busy in that trace; it's doing about 10 ms of work periodically. So it seems to me it should be possible to optimize things and get 60 FPS in this case.

Yes, the GPU process is not very busy, but I think the main bottleneck in this case is the actual GPU rendering, not the GPU process or the renderer. This is a synthetic test that simulates GPU-render-bound applications: it uses an extremely expensive fragment shader to stress the GPU but only has a handful of simple draw calls. If I understand it right, this means the GPU process would not be expected to be very busy, since it only has to issue a few commands to the underlying GL driver, and any waiting time for rendering to complete would in general not show directly in the CrGpuMain trace unless there's a blocking GL call, right?

> It's another question why waitForPreviousRenderToFinish is taking upwards of 10 ms per frame and sometimes 18 ms per frame.

waitForPreviousRenderToFinish is the sequencing that's used to prevent overloading the GPU. The intent is that one frame is wrapping up GPU rendering while the next frame is executing JS and issuing GL commands to the GPU process. In this case longish wait times are expected since there's minimal JS work. (There's even an intentional 4ms busy wait in the test application before submitFrame to simulate a more complex JS render setup.) So even if the submitFrame overhead were lower, it would just spend more time waiting in the waitForPreviousRenderToFinish step.

The counterintuitive behavior I saw was that I get more consistent and slightly better performance by intentionally *delaying* sending commands to the GPU process to reduce parallelism; that's the change from comment #6 where I forced waiting for the previous frame before creating the mailbox and the associated steps. The original intent was to improve parallelism by letting the GPU start work on the next frame while the current one is finishing rendering. That usually worked, but in this case it led to some very odd timing.

The specific problem with the original trace was the huge variation in rendering times. I'm not sure if the GPU is indeed taking such a long time to finish frames, or if the fences don't signal as complete even though rendering is done already, but it seems as if it's buffering one frame and finishing a pair of frames together at almost exactly the same time. Part of the issue is that the mailbox setup, which involves making a pixel copy, happens at around the same time as the previous frame is rendering, and this seems to lead to bad scheduling.

> In trace_totnoalign-earlywait.html, VRDisplay::GetImage is making two long calls to SyncChannel::Send, taking between 1.3 ms and 1.6 ms each; VRDisplay::EnsureMailbox then calls SyncChannel::Send taking 1.5 ms; and finally VRDisplay::SubmitFrame takes about 0.75 ms. [...] making round-trips to the GPU process from the renderer process each frame should be avoided at all costs.

Yes, I'm aware that the current rendering pipeline is sadly inefficient. The first two SyncChannel::Send calls happen inside rendering_context_->GetImage(), which isn't our code. In previous discussions I got the recommendation to use that interface for accessing the canvas content in preparation for turning it into a mailbox image. At a high level, what we want to happen at this point is just to create a mailbox and an associated sync token; that's what gets passed to VRDisplay::SubmitFrame. In most cases this SubmitFrame is fairly fast (0.2ms or so), since it's an async Mojo call.

If there's a way to get the sync token and mailbox without GPU process round trips I'd be in favor of using that. However, I thought that a requirement for sync tokens is that the underlying fence must be flushed to the GPU process; doesn't that inherently involve a round trip? I'm not sure if an unverified sync token would be suitable in this case and if it would help with parallelism, but it sounds as if this would require changes to existing Blink interfaces.

The ClientWaitSync change has landed; those are the DrawFrameSubmitWhenReady src_func blocks in VrShellGl where it's basically polling for a previous frame to finish rendering. Once that fence signals completion, it unblocks the waitForPreviousRenderToFinish step.

> In my cursory understanding of the code, I suggest producing an EGLSync object from the producer as well as the texture it renders, and then the VRShell should pick it up, do an eglWaitSync (not eglClientWaitSync) and draw that texture.

The EGLSync is a bit tricky since we'd need to do a cross-context and cross-process sync. I do have a prototype that uses this as part of a replacement for the mailbox transfer, but that's a separate issue. We draw the texture before waiting for the existing fence to complete; the eglClientWaitSync is just a post-draw check to see when it finishes. It's only a bottleneck if an application needs almost all of a frame's JS time budget in addition to its GPU budget, but that's not the case for this test.

> If absolutely necessary, you might do an operation like draining the graphics pipe once per frame via glFinish(), but I wouldn't introduce this unless you've found it is necessary due to overloading the GPU.

I was doing a glFinish in earlier versions; the current behavior was generally more efficient than the glFinish, except for issues like here where the increased parallelism seems to result in pathological behavior :-/

FYI, just for completeness, there are also additional GL contexts involved. VrShellGl in the browser has a command buffer connection to the GPU process to retrieve the mailbox, and uses a separate ui/gl context to draw this into GVR's renderbuffer. GVR has its own high-priority context which does async distortion shading at 120fps (two eye views per frame); that contributes some background load which isn't directly visible in these traces.
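To make the unverified-token idea concrete, the producer side would look something like the sketch below (roughly against the current gpu/command_buffer client interface; treat the exact signatures as assumptions, since I haven't prototyped this yet):

  // Producer (renderer) side: publish the texture and a sync token
  // without a synchronous round trip to the GPU process.
  gpu::Mailbox mailbox = gpu::Mailbox::Generate();
  gl->ProduceTextureDirectCHROMIUM(texture_id, GL_TEXTURE_2D, mailbox.name);
  const GLuint64 fence_sync = gl->InsertFenceSyncCHROMIUM();
  gl->ShallowFlushCHROMIUM();  // orders the fence, no blocking IPC
  gpu::SyncToken sync_token;
  gl->GenUnverifiedSyncTokenCHROMIUM(fence_sync, sync_token.GetData());
  // mailbox + sync_token then travel to VrShellGl via the async Mojo call.

  // Consumer (VrShellGl) side: verify once, then wait server-side.
  GLbyte* token_data = sync_token.GetData();
  gl->VerifySyncTokensCHROMIUM(&token_data, 1);
  gl->WaitSyncTokenCHROMIUM(sync_token.GetConstData());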
Jul 24 2017
klausw@, happy to chat via GVC, but it would be more productive for you to catch sunnyps@. He understands the synchronization primitives much better than I do, and can advise on how you can avoid round trips to the GPU process.

> This is a synthetic test that simulates GPU-render-bound applications [...] any waiting time for rendering to complete would in general not show directly in the CrGpuMain trace unless there's a blocking GL call, right?

I would still expect some call on the GPU process side to take a long time waiting for the GPU to complete its work; perhaps eglSwapBuffers, or something else.

Still, I encourage you to try to get rid of the client-side waits. Due to Chrome's deep and multi-process rendering pipeline, swamping the GPU is a recurring problem, but different approaches to throttling the client have been taken and have at this point been successful without doing client-side waits per frame.

Klaus, let me assign this bug to you as we'll need you to drive the investigation.
Jul 25 2017
I've experimented a bit and got an interesting extreme case. I've mostly disabled the waitForPreviousRenderToFinish throttling, and now it seems to be creating dozens of frames and mailboxes in the renderer while the consumer in VrShellGl is getting way behind. If I select the CrGpuMain commands from the timespan 11s to 13s, I see 68 DoProduceTextureCHROMIUM calls but only a single DoSwapBuffers call, which would correspond to one frame being consumed by drawing onto a transfer Surface. Latency is horrific; the completion fence took >800 ms. The in-app framerate claims to be ~43fps, but the actual screen updates are more like 0.1fps.

@sunnyps, is this maybe related to the GPU scheduling introduced recently as per https://docs.google.com/document/d/1hjVckIpb9WBE7A9HUxAmutRJEgSkZO_JAFfgwOE-8NE/edit / issue 514813? A user reported a significant slowdown going from M59 to M60 which I couldn't reproduce myself, but I'm unsure to what extent this may depend on being enrolled in experiments.

I'll try disabling GPU scheduling and/or tweaking priorities; not sure if for some reason the VR composition in VrShellGl is accidentally ending up with low priority.
Jul 25 2017
For context re: my previous comment, VrShellGl is supposed to run the following code once per received submitFrame call: https://cs.chromium.org/chromium/src/chrome/browser/android/vr_shell/mailbox_to_surface_bridge.cc?q=mailbox_to_sur&sq=package:chromium&l=251

  gl->WaitSyncTokenCHROMIUM(mailbox.sync_token.GetConstData());
  GLuint sourceTexture = gl->CreateAndConsumeTextureCHROMIUM(
      GL_TEXTURE_2D, mailbox.mailbox.name);
  DrawQuad(sourceTexture);
  ...
  gl_->SwapBuffers();
  return true;

Based on the trace, DoCreateAndConsumeTextureINTERNAL is being called 68 times, which would be the expected count, but DoSwapBuffers only once in that time period. Strange. I vaguely remember there's a throttling mechanism for SwapBuffers that may apply here?
Jul 25 2017
Correction to comment #13: looks like CreateAndConsumeTextureCHROMIUM is called twice per frame in normal operation, so this would match a single call to the mailbox_to_surface_bridge function. But the mystery remains why this is called so rarely. I'm adding extra logging to see if it's related to SwapBuffers throttling.
Jul 25 2017
The GPU command buffer scheduler was enabled in M61. Can you try running with --disable-features=GpuScheduler? (I'm still reading through all the comments on this bug.)
Jul 26 2017
@sunnyps, I tried this:

  $ ./build/android/adb_chrome_public_command_line --v=0 --disable-features=GpuScheduler

It's still showing the fast/slow pattern, see the attached trace_fastslow-disablegpuscheduler.html; as far as I can tell that made no difference.

The performance regression from M59 to M60 is likely a red herring. It turns out that there are at least two different "Cardboard (2015)" headset configurations out in the wild with different render resolutions; I've seen 1672x922 and 1818x1144. This difference seems sufficient to explain some of the performance variations I've seen. FYI, the actual resolution used is visible in traces as an arg to SwapBuffers calls in the GPU process.

I can't see SwapBuffers throttling (see comment #13) after adding a TRACE_EVENT0 for the WaitForToken call in GLES2Implementation::SwapBuffers, so that also doesn't seem related.

I was able to fix the slow/fast issue and increase FPS from ~44fps to ~54fps by adding

  attribs.context_priority = gl::ContextPriorityLow;

at the end of GenerateGLContextAttribs in gpu_command_buffer/service/service_utils.cc, see the attached trace_hack-lowprio.html. This changes the driver-level GL context priority as per the EGL_IMG_context_priority extension. The result is that the WebGL context runs at low priority, the VrShellGl copy at medium priority, and the GVR async reprojection (not visible in the event traces) at high priority.
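At the EGL level, the priority hack corresponds to roughly this (a sketch based on the EGL_IMG_context_priority extension spec; the enum names come from the extension, while the surrounding setup is illustrative):

  const EGLint context_attribs[] = {
      EGL_CONTEXT_CLIENT_VERSION, 2,
      EGL_CONTEXT_PRIORITY_LEVEL_IMG, EGL_CONTEXT_PRIORITY_LOW_IMG,
      EGL_NONE,
  };
  EGLContext context =
      eglCreateContext(display, config, share_context, context_attribs);

  // The driver treats the priority as a hint and may clamp it, so query
  // back what was actually granted:
  EGLint priority = EGL_CONTEXT_PRIORITY_MEDIUM_IMG;
  eglQueryContext(display, context, EGL_CONTEXT_PRIORITY_LEVEL_IMG, &priority);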
Jul 27 2017
Some notes from a VC with sunnyps@, bajones@, and billorr@ yesterday; please let me know if I misunderstood things.

Confirmed that the issue seems to be unrelated to the GPU scheduler; disabling it had no effect. Also, if I remember right, the issue also exists in M60 where the scheduler wasn't active yet.

We don't know what specifically is causing the odd timing, other than what appears to be unfortunate scheduling decisions involving a very busy virtualized context in the GPU process that's starving the VrShellGl native context. Changing the driver-level context priority seems like an invasive change and would need further discussion.

In the meantime, my plan is to tweak the timing to avoid this situation, for example by ensuring that there's a minimum elapsed time between submitting a previous frame and flushing GL commands for the following one. This doesn't necessarily need to be based on exact sync points or unblocking via IPC messages; it is likely sufficient to just measure average times and apply an approximate delay to prevent overstuffing.

Separately, sunnyps@ suggested switching from the current verified sync tokens, which involve a synchronous IPC, to unverified sync tokens that don't need one. This would involve some logic changes in VrShellGl to verify the tokens before using them, but should improve the available JavaScript execution time.
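The timing tweak could be as simple as something like this (a rough sketch; the helper names, the member variable, and the tuning constant are all hypothetical at this point):

  void VrShellGl::MaybeFlushNextFrame() {  // hypothetical helper
    const base::TimeDelta kMinSubmitToFlushDelay =
        base::TimeDelta::FromMilliseconds(8);  // hypothetical tuning value
    base::TimeDelta elapsed = base::TimeTicks::Now() - last_submit_time_;
    if (elapsed < kMinSubmitToFlushDelay) {
      // Not enough separation from the previous submit yet; retry after
      // the remaining time instead of blocking the thread.
      task_runner_->PostDelayedTask(
          FROM_HERE,
          base::Bind(&VrShellGl::MaybeFlushNextFrame,
                     weak_ptr_factory_.GetWeakPtr()),
          kMinSubmitToFlushDelay - elapsed);
      return;
    }
    FlushNextFrameGlCommands();  // hypothetical: flush frame N+1's GL work
  }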
Aug 8 2017
Update: this issue does *not* reproduce on a Pixel (non-XL) running Android N. I suspect that this may be related specifically to the Android O Qualcomm driver as used on the Pixel XL; in my case this was marlin-userdebug 8.0.0 OPR1.170623.011.
Aug 9 2017
Update to the update: I do still see a fast/slow pattern on a Pixel XL running Android N. Since the non-XL Pixel and the Pixel XL have very similar hardware, I suspect the difference may just be timing-related, where triggering the effect varies with details such as render resolution. For example, adding &renderScale=0.8 on the Pixel XL results in fairly smooth 60fps without the slow/fast pattern, and this roughly simulates the resolution difference between the devices (2560x1440 vs 1920x1080). Sorry about the confusion; it seems difficult to track down what exactly is going on without more insight into driver-level details.
Oct 2 2017
This seems to have gotten fixed unexpectedly as part of issue 769488, specifically change r505155 from https://chromium-review.googlesource.com/687832, "Enforce flush ordering between contexts in GPU process on macOS". I'll update the external bug and close this one for now; we can revisit this separately if needed. (I'm still seeing fast/slow rendering in a GpuMemoryBuffer/AHardwareBuffer-based prototype.)
Oct 5 2017
This has started up again, caused by https://chromium-review.googlesource.com/c/chromium/src/+/696666.
Dec 7 2017
It looks like the various blocking issues have been fixed. Is this still an issue?
Dec 7 2017
Yes, closing. We've found workarounds for the issue. If I understand it right, the problematic sequence was something like this:

- Frame A's rendering is still in progress. For example, a final texture copy to GVR's drawing buffer still hasn't been issued yet.
- Frame B (the following frame) has completed WebGL preparation and is ready to issue commands.
- Frame B's GL commands get flushed to the GPU.
- Since frame A and frame B share the same (virtualized) GL context, the GPU decides to complete the work for frame B now.
- Frame A's drawing commands get issued (i.e. the texture copy to GVR); this work is queued behind the frame B work on the GPU.

The end result is that frames A and B complete at basically the same time. The workarounds were to avoid premature flushing of the next frame's drawing commands, and a followup aims to use the new GpuFence mechanism from http://crrev.com/680135 (not yet submitted) to explicitly separate the frames with a GL fence and server wait.
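For reference, the planned fence-plus-server-wait separation would look roughly like this at the EGL level (a sketch under the assumption that the GpuFence work makes the fence visible to the other context; the flush helper is hypothetical):

  // After issuing frame A's final drawing commands (the GVR texture copy):
  EGLSyncKHR frame_a_fence =
      eglCreateSyncKHR(display, EGL_SYNC_FENCE_KHR, NULL);
  glFlush();

  // Before flushing frame B's commands: queue a server-side wait so the
  // GPU finishes frame A before starting frame B, without blocking the
  // CPU the way eglClientWaitSyncKHR polling does.
  eglWaitSyncKHR(display, frame_a_fence, 0);
  FlushFrameBCommands();  // hypothetical: frame B's deferred GL work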
Comment 1 by klausw@chromium.org, Jul 20 2017 - two attachments (884 KB, 233 KB)