New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 699236 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature

Blocking:
issue 672303



Sign in to add a comment

Surface Aggregator shouldn't need to copy compostior frame

Project Member Reported by weiliangc@chromium.org, Mar 7 2017

Issue description

Right now Surface Aggregator's aggregate returns a compositor frame by copying data from other compositor frames submitted, this was necessary when DirectRenderer was shared with LayerTreeHostImpl.

Now DirectRenderer is not shared, we don't need to do all the copying. This should improve performance of aggregate().

Mostly what DirectRenderer care about is a structure to help navigate list of RenderPasses in the order of tree of surfaces.
 
Also proposed name for this structure is RenderPass Group. :D
Blocking: 601863
Cc: rjkroege@chromium.org sadrul@chromium.org
Components: Internals>MUS
Labels: displaycompositor
Mus+Ash will have more surfaces and so this becomes an increasingly concerning problem there. I'm adding a mus label too.
Cc: piman@chromium.org
+piman@ because he's back! Yay!
We'll also need to see what we want to do about overlay processing. It currently expects to be given the complete frame and modify it to remove things that can be split into overlays.

Comment 5 by xing...@intel.com, Mar 9 2017

Cc: xing...@intel.com
Cc: vmi...@chromium.org
How will draw occlusion work in this design? Thanks!

Comment 8 by danakj@chromium.org, Apr 18 2017

Draw occlusion probably should be done while drawing rather than while aggregating. Otherwise you can build a data structure that stores the visible rect for each quad separately.
Cc: varkha@chromium.org
Blocking: 672303
Blocking: -601863
Components: -Internals>MUS
Project Member

Comment 13 by sheriffbot@chromium.org, Jun 14 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Owner: kylec...@chromium.org
Status: Assigned (was: Untriaged)
This is an opportunity for a perf win. I think we should still look into it. Kyle to prioritize/assign appropriately.
Cc: ccameron@chromium.org
ccameron also talking about transactional stuff which fits here
Aggregate time currently: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=8b8822aff137896208c8c7fef84e1e36

android: ~237us
chromeos: ~520us
windows: ~224us
mac: ~55us
Sounds like more work is worth doing here.

Mac is light because it's delegating to CoreAnimation?
And ChromeOS is heavier because we have quads from the CrOS UI? Probably improving (Ash) UI drawing would help with that.

Anyway: it seems worth profiling SurfaceAggregator and seeing if there's room to make it more efficient.




Profiling and finding hot spots will help. But another approach would be to work toward making its output transactional and pointer based, with an output that is not a CompositorFrame
I agree this would be something good to do although those numbers aren't super accurate. There is a sync IPC that can happen at the end of aggregation, it's something to do with resource lifetime I think, which contributes pretty hugely to that. With OOP-D aggregation time is already reduced by 30%-60% on Windows.

https://uma.googleplex.com/p/chrome/variations/?sid=280bd336bcf4b643e48351f72b51bf80

I don't think mac is lower because of delegating to CA, that happens with the output from surface aggregation. I think it's because the UI isn't drawn by Chrome.
I think I know what you mean by "pointer based" but I'm not sure what you mean by transactional in this context?
For metrics:
The problem with copy is that time spend would be linear to size of final compositor frame.
The average time isn't that terrible, it's mostly the 95th percentile on Android that is worrying, which is ~2150us.

Technical part:
Essentially this idea is that if we stop copy (== pointer based) it would be a strictly perf win. The output we want is auxiliary data structure that knows how to walk between the different compositor frames in the correct order + the damage is generated correctly during aggregation.

In order to avoid copy, the compositor frame essentially need to be "read-only" post aggregation. The two cases we would need to consider are draw occlusion and overlay.

For draw occlusion: Fit info of when to skip a quad or draw only part of a quad into the auxiliary data structure.

For overlay:
Overlay replaces a quad w/ solid color quad to create a hole. Maybe just skipping that is enough?
Outside of scope of this task: Could overlay happen per-surface? Could it happen pre/during surface aggregation?

Cc: -varkha@chromium.org
re comment #19: indeed, the sync IPC, which happens only without OOP-D, is a large part of the time spent in Aggregate [1]. With OOP-D, that sync IPC goes away [2]. I believe that is the biggest reason for the improvement of Aggregate() with OOP-D (on Windows). It should be interesting to see how much of a saving we get on android from that.

[1] https://uma.googleplex.com/p/chrome/callstacks/?sid=5e0a2e879e9fec0e8439d2dd67490651
[2] https://uma.googleplex.com/p/chrome/callstacks/?sid=cbf9f53c77afb30a92dc59212bac7aa7
Looking at the flame-graphs comment #23, even with the sync-ipc gone with OOP-D, it looks like copying the quads (CopyPasses()) takes similar amount of time as returning the child resources (DeclareUsedResourcesFromChild()). So it may still be a good idea to delay this until draw happens? I am thinking of something like this: https://chromium-review.googlesource.com/c/chromium/src/+/1103527 Of course, this still means we have to do this work, but after swap, instead of during Aggregate(). I am thinking that would still be useful?
Some local numbers (Z840) with https://chromium-review.googlesource.com/c/chromium/src/+/1103527 for Aggregate() time:

ToT: 338us
With CL: 135us
With OOPD: 161us
With OOPD+CL: 132us

I am updating cluster-telemetry [1] so that I can get some numbers from there.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/1103852
I'm curious what impact this has on total DrawAndSwap time and presentation time? Are we just moving work around in the pipeline or are we making things faster?

Sign in to add a comment