Surface Aggregator shouldn't need to copy compostior frame |
|||||||||||||
Issue descriptionRight now Surface Aggregator's aggregate returns a compositor frame by copying data from other compositor frames submitted, this was necessary when DirectRenderer was shared with LayerTreeHostImpl. Now DirectRenderer is not shared, we don't need to do all the copying. This should improve performance of aggregate(). Mostly what DirectRenderer care about is a structure to help navigate list of RenderPasses in the order of tree of surfaces.
,
Mar 7 2017
Mus+Ash will have more surfaces and so this becomes an increasingly concerning problem there. I'm adding a mus label too.
,
Mar 7 2017
+piman@ because he's back! Yay!
,
Mar 7 2017
We'll also need to see what we want to do about overlay processing. It currently expects to be given the complete frame and modify it to remove things that can be split into overlays.
,
Mar 9 2017
,
Mar 21 2017
,
Apr 18 2017
How will draw occlusion work in this design? Thanks!
,
Apr 18 2017
Draw occlusion probably should be done while drawing rather than while aggregating. Otherwise you can build a data structure that stores the visible rect for each quad separately.
,
May 23 2017
,
Jun 12 2017
,
Jun 13 2017
,
Jun 13 2017
,
Jun 14 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jun 14 2018
This is an opportunity for a perf win. I think we should still look into it. Kyle to prioritize/assign appropriately.
,
Jun 14 2018
ccameron also talking about transactional stuff which fits here
,
Jun 14 2018
Aggregate time currently: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=8b8822aff137896208c8c7fef84e1e36 android: ~237us chromeos: ~520us windows: ~224us mac: ~55us
,
Jun 15 2018
Sounds like more work is worth doing here. Mac is light because it's delegating to CoreAnimation? And ChromeOS is heavier because we have quads from the CrOS UI? Probably improving (Ash) UI drawing would help with that. Anyway: it seems worth profiling SurfaceAggregator and seeing if there's room to make it more efficient.
,
Jun 15 2018
Profiling and finding hot spots will help. But another approach would be to work toward making its output transactional and pointer based, with an output that is not a CompositorFrame
,
Jun 15 2018
I agree this would be something good to do although those numbers aren't super accurate. There is a sync IPC that can happen at the end of aggregation, it's something to do with resource lifetime I think, which contributes pretty hugely to that. With OOP-D aggregation time is already reduced by 30%-60% on Windows. https://uma.googleplex.com/p/chrome/variations/?sid=280bd336bcf4b643e48351f72b51bf80 I don't think mac is lower because of delegating to CA, that happens with the output from surface aggregation. I think it's because the UI isn't drawn by Chrome.
,
Jun 15 2018
I think I know what you mean by "pointer based" but I'm not sure what you mean by transactional in this context?
,
Jun 15 2018
For metrics: The problem with copy is that time spend would be linear to size of final compositor frame. The average time isn't that terrible, it's mostly the 95th percentile on Android that is worrying, which is ~2150us. Technical part: Essentially this idea is that if we stop copy (== pointer based) it would be a strictly perf win. The output we want is auxiliary data structure that knows how to walk between the different compositor frames in the correct order + the damage is generated correctly during aggregation. In order to avoid copy, the compositor frame essentially need to be "read-only" post aggregation. The two cases we would need to consider are draw occlusion and overlay. For draw occlusion: Fit info of when to skip a quad or draw only part of a quad into the auxiliary data structure. For overlay: Overlay replaces a quad w/ solid color quad to create a hole. Maybe just skipping that is enough? Outside of scope of this task: Could overlay happen per-surface? Could it happen pre/during surface aggregation?
,
Jun 15 2018
,
Jun 16 2018
re comment #19: indeed, the sync IPC, which happens only without OOP-D, is a large part of the time spent in Aggregate [1]. With OOP-D, that sync IPC goes away [2]. I believe that is the biggest reason for the improvement of Aggregate() with OOP-D (on Windows). It should be interesting to see how much of a saving we get on android from that. [1] https://uma.googleplex.com/p/chrome/callstacks/?sid=5e0a2e879e9fec0e8439d2dd67490651 [2] https://uma.googleplex.com/p/chrome/callstacks/?sid=cbf9f53c77afb30a92dc59212bac7aa7
,
Jun 16 2018
Looking at the flame-graphs comment #23, even with the sync-ipc gone with OOP-D, it looks like copying the quads (CopyPasses()) takes similar amount of time as returning the child resources (DeclareUsedResourcesFromChild()). So it may still be a good idea to delay this until draw happens? I am thinking of something like this: https://chromium-review.googlesource.com/c/chromium/src/+/1103527 Of course, this still means we have to do this work, but after swap, instead of during Aggregate(). I am thinking that would still be useful?
,
Jun 18 2018
Some local numbers (Z840) with https://chromium-review.googlesource.com/c/chromium/src/+/1103527 for Aggregate() time: ToT: 338us With CL: 135us With OOPD: 161us With OOPD+CL: 132us I am updating cluster-telemetry [1] so that I can get some numbers from there. [1] https://chromium-review.googlesource.com/c/chromium/src/+/1103852
,
Jun 18 2018
I'm curious what impact this has on total DrawAndSwap time and presentation time? Are we just moving work around in the pipeline or are we making things faster? |
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by weiliangc@chromium.org
, Mar 7 2017