Release TransferableResource properly when destroying a frame sink |
||||||||||
Issue descriptionWhen destroying a frame sink, we need find a way to release not returned TransfterableResource properly. * If we release not returned resources immediately with the frame sink, the resource may still being used by the parent frame. It will cause bad frame on screen. * If we keep the frame sink alive, and hope resources will be returned from the frame sink soon. But in my test, the resources for the last frame will not be returned until a new empty frame is submitted. But submitting a new empty frame may cause another problem. For example the surface could being used for animation, we don't want to animation an empty frame. Any idea about this problem?
,
Sep 15 2017
In my test, I set a surface to an aura::Window, and then I close and destroy the aura::Window, but the resources submitted to the surface are still not returned.
,
Sep 15 2017
It's okay if they aren't returned, the client should consider lost any resources that are sent to the parent when the connection is lost. But from the parent's side if it's embedding a child, it should remove that embedding before we destroy the child, or it creates races where it can try to use resources that didn't get reffed yet and are gone.
,
Sep 15 2017
Dana, thanks for the hint about considering exported resources as lost. That solves all our fast ink issues nicely. However, I'm not sure we can use that for exo and arc++. For arc++ we need to inform the client when a buffer can be safely reused. There's no mechanism in Android for wayland clients in general to say that a buffer is lost.
,
Sep 15 2017
I think we can solve the exo/arc++ problem by submitting an empty frame and then keeping the frame sink alive until the last submitted resource is reclaimed. It's just a bit of a mess from a frame sink life time management perspective as the ReclaimResources() call from the frame sink is going to be how we determine that the frame sink can be destroyed. The only sane way I can think of doing that would be to have some long lived instance (e.g. exo::Display) keep track of all exo frame sinks and then PostTask from ReclaimResources() to have frame sinks cleanup asynchronously. Any suggestions for how we can handle this in a better way?
,
Sep 18 2017
Turns out that considering the the texture lost at frame sink destruction time doesn't work at all. We're leaking textures if we do that. I've fix this leak for fast ink related code using https://chromium-review.googlesource.com/c/chromium/src/+/669586 for now but while that works OK I think it's pretty ugly and hope that we can find a better solution. The race condition I'm observing for fast ink and arc++ is: 1. Create a texture through the shared main-thread browser context. 2. Produce texture mailbox for that texture. 3. Submit a frame to an aura window's frame sink that reference that mailbox. 4. Destroy the frame sink. 5. Delete texture created above as otherwise we leak memory. 6. If the delete texture command is processed by the GPU process before the texture consume command by the display compositor, the latter will fail and we draw black. This is really easy to reproduce with fast ink as we produce a lot of frames. The exported texture deleter logic a landed that waits for compositor to produce and ack a frame avoids it but that's far from an ideal solution. Arc++ is still impacted by this issue and I'm failing to see how this is not also an issue for all renderers. Actually, I'm seeing errors in my log on Chrome OS devices that indicated that this is happening all the time: ERROR:gles2_cmd_decoder.cc(17983)] : [.DisplayCompositor-0x3260c5ddc400]GL ERROR :GL_INVALID_OPERATION : glCreateAndConsumeTextureCHROMIUM: invalid mailbox name ERROR:gles2_cmd_decoder.cc(9889)] : [.DisplayCompositor-0x3260c5ddc400]RENDER WARNING: texture bound to texture unit 0 is not renderable. It maybe non-power-of-2 and have incompatible texture filtering. That's what I would expect from this race condition and exactly what I was seeing for fast ink before landing that ugly workaround. In the renderer case it's likely a race between consuming the texture and cleaning up the renderers GPU process resources but that's really not that different. Maybe I'm missing something but I'm going to raise the priority of this until my understanding of it changes. My current understanding is that we're currently producing bad frames when closing tabs/windows/arc++ apps and it makes for a bad user experience in addition to filling the UI log with the errors mentioned above. danakj@, I'm assigning this to you for now as you seem to be the expert in the area.
,
Sep 18 2017
Pls apply appropriate OSs label.
,
Sep 18 2017
,
Sep 18 2017
FYI, this might not be RB unless it's a regression from what we already ship. Not clear to me when this problem was introduced.
,
Sep 18 2017
I'm seeing the same error messages in the log on my Samus device running M60 so it looks like we've already shipped with this problem. Removing M-61 and RB labels.
,
Sep 18 2017
Right, this is pretty much how things have been since... forever (when we switched to delegated rendering). Mailboxes are a weak reference to textures, and we rely on the producer (e.g. renderer) to keep a strong reference (actual texture id) to the resources. Those references go away if e.g. the renderer crashes/gets killed while frames for it are still in-flight. This is somewhat fundamental to how things are wired + security properties (we can't let the renderer leak textures, we have to clean up behind it if it dies). One possible way we thought of (in the past) to solve this in a generic way at a lower level (i.e. rather than tracking frames in the embedder) is to implement a reference-holding "pool", that would be owned by the embedder (consumer) and passed to the producer, where in-flight resources could be held (exact API left as an exercise to the reader, but basically you would create a reference from the pool to the texture on the client side before adding the sync token) - the embedder could then hold references to those resources longer than the lifetime of the producer (but would be responsible for managing the reference-holding pool). Right now part of the tricky bits is that the embedder and embeddee are somewhat decoupled (renderer doesn't know what embeds it, there could actually be more than one things), so we would have to be thoughtful in to how we would wire this up Interestingly, I think we can make all of these problems disappear with viz, because the same process receives frames and commands that create/destroy resources, so the trust issues essentially disappear. We can make sure the resources are kept alive as long as the CompositorFrame is. Given that, I wonder if there is value in trying to implement something generic today.
,
Sep 18 2017
Thanks for explaining. Makes sense. There seems to be an increase in how frequently this happens for renderers on ChromeOS devices lately. Could be a result of GPU process scheduling changes or changes to GPU process usage in general (GPU raster maybe?). The issue concerns me as we've had a few reports of black flashes lately and it's hard to diagnose when we have this known issue that can always cause it. I guess we can workaround this issue for arc++ using the same mechanism I added to fast ink but it would be nice if we could also avoid this for renderers during normal circumstances. Hard to handle the case when a renderer is crashing but during standard shutdown maybe we could make sure the display compositor have had a chance to consume all textures before we exit?
,
Sep 18 2017
+erosky who's looking at this too. I think one cause of increase is that the login flow is particularly susceptible to this (a bad frame shows longer), and the login renderer started doing animations which triggers this race more often.
,
Sep 19 2017
Glad we understand what's happening. This isn't something I'm working on so marking available. It's not clear what the next steps are to me, other than making sure the browser stops showing a renderer in its ui before destroying the process?
,
Sep 20 2017
For arc++ that is effectively a GPU process client inside the browser process I'm going to use the same exported texture delete logic that fast ink uses to avoid the issue for now. That logic is relatively simple but it feels like the mechanism that creates the frame sink for an aura::Window should also provide a mechanism to delete textures submitted through the sink without causing a race during frame sink destruction. For renderers that can die could we add a mechanism where GPU client state created by a renderer is put in a zombie state when the GPU process detects that the client connection is broken and require that the browser process ack the broken connection before the state is cleaned up?
,
Oct 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f273350c574976eee4b8458268d2786f88b3d16e commit f273350c574976eee4b8458268d2786f88b3d16e Author: David Reveman <reveman@chromium.org> Date: Thu Oct 12 03:15:52 2017 ui: Reclaim shared main thread context resources. Implement appropriate reclaim of resources created using the main thread context for exo and fast ink. This is done by submitting an empty frame after the top level window as been destroyed and wanting for all previously submitted resources to be reclaimed. Bug: 765763 Test: exo_unittests --gtest_filter=BufferTest.SurfaceTreeHostDestruction Change-Id: I31a123172d5a4a4b45fc3ccb2837e8455a4f61a8 Reviewed-on: https://chromium-review.googlesource.com/696648 Reviewed-by: Daniele Castagna <dcastagna@chromium.org> Commit-Queue: David Reveman <reveman@chromium.org> Cr-Commit-Position: refs/heads/master@{#508239} [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/ash/fast_ink/fast_ink_view.cc [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/ash/fast_ink/fast_ink_view.h [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/components/exo/buffer_unittest.cc [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/components/exo/layer_tree_frame_sink_holder.cc [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/components/exo/layer_tree_frame_sink_holder.h [modify] https://crrev.com/f273350c574976eee4b8458268d2786f88b3d16e/components/exo/surface_tree_host.cc
,
Oct 20 2017
SurfaceTreeHostDestruction test is flakey on builder Linux ChromiumOS Tests (1); see for example https://build.chromium.org/p/chromium.chromiumos/builders/Linux%20ChromiumOS%20Tests%20%281%29/builds/45964 The test seems to fail about half the time, and turns the builder red about 1 in 14 runs. Output from exo_unittests: [ RUN ] BufferTest.SurfaceTreeHostDestruction ../../components/exo/buffer_unittest.cc:190: Failure Expected: release_call_count Which is: 0 To be equal to: 1 [ FAILED ] BufferTest.SurfaceTreeHostDestruction (8 ms) reveman@ - Could you have a look?
,
Oct 20 2017
,
Oct 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f8fde1905902266d83cd7677ba801f77773d4b2c commit f8fde1905902266d83cd7677ba801f77773d4b2c Author: David Reveman <reveman@chromium.org> Date: Fri Oct 20 23:01:04 2017 exo: Fix flaky buffer release tests. Set wait for release delay to 0 for tests. Bug: 765763 Test: exo_unittest --gtest_filter=BufferTest.* Change-Id: Ia7c152ad2c3d70afcceee5a6f5dd0265e842a3e8 Reviewed-on: https://chromium-review.googlesource.com/730496 Commit-Queue: Daniele Castagna <dcastagna@chromium.org> Reviewed-by: Daniele Castagna <dcastagna@chromium.org> Cr-Commit-Position: refs/heads/master@{#510598} [modify] https://crrev.com/f8fde1905902266d83cd7677ba801f77773d4b2c/components/exo/buffer.cc [modify] https://crrev.com/f8fde1905902266d83cd7677ba801f77773d4b2c/components/exo/buffer.h [modify] https://crrev.com/f8fde1905902266d83cd7677ba801f77773d4b2c/components/exo/buffer_unittest.cc
,
Oct 20 2017
Should be fixed now. Please re-open if still flaky.
,
Oct 23 2017
reveman@- Haven't seen any flakes since your change landed, looks like that got it. One question, though... as per Comment #15, this bug was open and unassigned when I co-opted it to report the test flake. You hadn't marked it as "Fixed" after your original CL (Comment #17). Is there still work that needs doing on the larger issue beyond the test flake?
,
Oct 23 2017
This works for exo and fast ink now that I've landed workarounds but it's still broken for regular web contents (renderer processes). However, that's been broken for years so less urgent to fix. I've been thinking about how to fix this properly for renderer processes and I'm currently leaning towards using GMB backed quads for explicit ownership passing instead of texture mailboxes. I'll use a different bug for this. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by danakj@chromium.org
, Sep 15 2017