New issue
Advanced search Search tips

Issue 636884 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocking:
issue 581526



Sign in to add a comment

1%-303.9% regression in smoothness.tough_filters_cases at 411050:411073

Project Member Reported by pmeenan@chromium.org, Aug 11 2016

Issue description

See the link to graphs below.
 
Project Member

Comment 3 by 42576172...@developer.gserviceaccount.com, Aug 11 2016

Cc: erikc...@chromium.org
Owner: erikc...@chromium.org

=== Auto-CCing suspected CL author erikchen@chromium.org ===

Hi erikchen@chromium.org, the bisect results pointed to your CL below as possibly
causing a regression. Please have a look at this info and see whether
your CL be related.


===== BISECT JOB RESULTS =====
Status: completed


===== SUSPECTED CL(s) =====
Subject : Enable promotion of RenderPassDrawQuads to CALayers.
Author  : erikchen
Commit description:
  
BUG= 581526 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2234513002
Cr-Commit-Position: refs/heads/master@{#411055}
Commit  : c48bb2ddd6f847130de7956e9bf7e350d875b10c
Date    : Wed Aug 10 16:54:47 2016


===== TESTED REVISIONS =====
Revision         Mean     Std Dev    N  Good?
chromium@411052  19.1234  0.100575   5  good
chromium@411054  19.235   0.0787122  5  good
chromium@411055  77.4154  0.462796   5  bad    <--
chromium@411058  76.7695  0.42632    5  bad
chromium@411063  76.15    0.849974   5  bad

Bisect job ran on: mac_retina_perf_bisect
Bug ID: 636884

Test Command: src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --upload-results --also-run-disabled-tests smoothness.tough_filters_cases
Test Metric: frame_times/http___rawgit.com_WebKit_webkit_master_PerformanceTests_Animometer_developer.html?test-interval_20_display_minimal_controller_fixed_frame-rate_50_kalman-process-error_1_kalman-measurement-error_4_time-measurement_performance_suite-name_Animometer_test-name_Focus_complexity_100
Relative Change: 298.20%
Score: 99.9

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/mac_retina_perf_bisect/builds/1564
Job details: https://chromeperf.appspot.com/buildbucket_job_status/9004621240522277568


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=6376188033892352

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!
Blocking: 581526
The original CL has been reverted: https://codereview.chromium.org/2238113002/

I've been investigating locally, and discovered several problems:
1) ResourcePool resources were not correctly being reused.
2) Each frame needed to recreate every IOSurface, since each layer is changing slightly in size. The ResourcePool needs to allow fuzzy size matching, and the GLRenderer needs to handle IOSurfaces that are slightly too large.

Even with both of these fixes, there's still a non-trivial performance regression (~20%). Looking at the traces, everything seems GPU bound, and each frame is processed slightly faster in the non-CA compositor path. Will need to investigate further. 
Project Member

Comment 5 by bugdroid1@chromium.org, Aug 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2b5951abbd94b2beec72eaf126feade6950979ff

commit 2b5951abbd94b2beec72eaf126feade6950979ff
Author: erikchen <erikchen@chromium.org>
Date: Fri Aug 12 23:33:34 2016

GLRenderer should regularly call ResourcePool::CheckBusyResources.

The method moves resources from the busy pool to the unused pool.

BUG= 636884 ,  581526 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2245523003
Cr-Commit-Position: refs/heads/master@{#411817}

[modify] https://crrev.com/2b5951abbd94b2beec72eaf126feade6950979ff/cc/output/gl_renderer.cc
[modify] https://crrev.com/2b5951abbd94b2beec72eaf126feade6950979ff/cc/resources/resource_pool.h

Cc: ccameron@chromium.org
I spent a while looking into why the CA compositor still has lower FPS than GL compositor. The difference is smaller than I previous measured (maybe closer to 10%). The difference seems entirely attributable to the fact that [CATransaction commit] takes multiple milliseconds to process when there are ~100 layers. The slowdown is highly correlated with settings IOSurfaces as the contents of CALayer (rather than changing any of the other properties). That's weird, since this should theoretically be as cheap as grabbing a mach port from an IOSurface. ccameron, ideas?
Interesting about the correlation with setting contents. With CPU rasterization, that could incur an upload, but with GPU rasterization, it shouldn't. Sometimes paging on dual-GPU systems has issues.

Another thing that I noticed is that when setting an IOSurface as contents, CA will look up the color space (and other things) for the IOSurface. We don't set color spaces for most IOSurfaces (yet), but I'd be curious if they're looking up tons of stuff (and if there's a way around that).

Changing the tile size from 256x256 to 512x512 made a huge difference in performance (see  issue 598268 , though it's scant on details). We should consider also going to 1024x1024.
I did some profiling. Looks like settings an IOSurface as the contents of a CALayer forces a call to IOSurfaceClientCreateMachPort, which is /really/ slow. 

https://trac.webkit.org/changeset/180203/trunk/Source/WebKit2/Shared/mac/RemoteLayerBackingStore.mm

Looks like we can set a CAMachPort as the contents of a CALayer. Note that IOSurfaceCreateMachPort calls straight into IOSurfaceClientCreateMachPort, so it may be much less performant than we thought it was. Perhaps we should be caching it?
CATransactionCommit.trace.zip
1.8 MB Download
Screen Shot 2016-08-15 at 1.26.17 PM.png
227 KB View Download
Project Member

Comment 9 by bugdroid1@chromium.org, Aug 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5d68eb225969d35d6ad1748b3a390e093c02e5eb

commit 5d68eb225969d35d6ad1748b3a390e093c02e5eb
Author: erikchen <erikchen@chromium.org>
Date: Mon Aug 15 20:17:24 2016

Round IOSurface sizes to 64 pixels for filter effects.

This allows IOSurfaces to be more easily reused during resize animations. It
also helps prevent memory fragmentation.

BUG= 581526 ,  636884 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2245813002
Cr-Commit-Position: refs/heads/master@{#412031}

[modify] https://crrev.com/5d68eb225969d35d6ad1748b3a390e093c02e5eb/cc/output/gl_renderer.cc

 Issue 638250  has been merged into this issue.
The reland still has a performance regression that is surprisingly high, although it is already much lower than the first land attempt.

Looking at the graphs for rel-mac-11 and rel-mac-retina, the old regression was ~400%, the new regression is ~100%. I will continue to investigate, though I won't immediately revert, since this test triggers an exception edge case that is not reflective of real user pages. (And the benefits from the CL are significant).
The WebKit CL I referred to in c#8 was reverted shortly after landing:
https://trac.webkit.org/changeset/180799

The basic premise should still be sound though.
I was able to reproduce the 80% regression on a retina MBP 10.11.6. There are two causes: 

1. The current IOSurface reuse algorithm doesn't perform optimally with this particular web page. This is the bulk of the regression. I think it should be safe to change the numbers to encourage reuse, at the expense of a short-term increase in memory usage.

2. Passing IOSurfaces as the contents of a CALayer is slow, as indicated earlier in the bug. I wasn't able to get CAMachPortCreate to work. Not sure why. Given that WebKit isn't using that logic either, perhaps it's functionality was broken and never fixed.
Project Member

Comment 14 by bugdroid1@chromium.org, Aug 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b8818116176eebbb37637a3b30cec284aa01d937

commit b8818116176eebbb37637a3b30cec284aa01d937
Author: erikchen <erikchen@chromium.org>
Date: Mon Aug 22 18:51:27 2016

Encourage IOSurface reuse for CSS filter effects.

This comes at the expense of a temporary increase in memory usage. The resource
expiration delay was changed from 1s to 3s, so there's no long term increase in
memory usage, unless there is a corresponding performance gain.

BUG= 636884 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2254033002
Cr-Commit-Position: refs/heads/master@{#413489}

[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/output/gl_renderer.cc
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/resources/resource_pool.cc
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/resources/resource_pool.h
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/resources/resource_pool_unittest.cc
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/test/fake_picture_layer_tiling_client.cc
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/test/layer_tree_pixel_resource_test.cc
[modify] https://crrev.com/b8818116176eebbb37637a3b30cec284aa01d937/cc/trees/layer_tree_host_impl.cc

 Issue 640137  has been merged into this issue.
Labels: ReleaseBlock-Stable
This doesn't seem to have recovered all of the perf loss. I'm still seeing ~80% loss on Mac 10.10, ~70% on mac_hdd, etc. I think a regression of that magnitude should be a stable blocker.
Labels: -ReleaseBlock-Stable
Please see comment #13. The regression is because this test case happens to tickle a worst-case scenario for the CA-compositor. This is still a huge win for power performance.
Cc: vmi...@chromium.org
This basically erases all of the gains that were made over several CLs to improve performance on the Animometer "Focus" test. I think we should have a wider discussion about the tradeoffs here.

Is there a way we can preserve the power gains without losing performance?
This test happens to sit right at the edge of the caching parameters used for IOSurface reuse. You can test it out locally by adjusting the parameters I adjusted in c#14. Those adjustments happened to "improve" the test on my local machine, I'm sure you can come up with a different set of numbers that make the perf bots happy. That being said, I can also arbitrarily adjust the test page to make the test performance worse again.

Keep in mind that power and performance are frequently closely correlated. The CA compositor has been shown to *significantly* improve power. It allows for optimizations like partial-content updates, and in general saves a blit of the web contents on every frame.

If all you care about is the performance of this particular test on the perf bots, you're welcome to [within reason] adjust the parameters of https://codereview.chromium.org/2254033002/diff/20001/cc/output/gl_renderer.cc to make your test happy.

Cc: senorblanco@chromium.org
senorblanco@ and I talked about this. To sum up:

me: This test isn't representative of real web sites.
senorblanco: Regardless, most browsers run this suite of tests, so we should do well on it.

Fair enough. I started to investigate. The caching problem is reasonably easy to deal with. Changing the iosurface_multiple to 256 mostly solves it. To be safe, I'd recommend setting the cache size to 6x number of rpdqs in a frame, and the timeout to 10 seconds. [It's not clear whether these settings will benefit all pages].

Even after this change, there is still a performance regression. Without tracing, CACompositor hits 30 fps (33 ms per frame), whereas GL compositing hits 40 fps (25ms per frame). That's an 8ms difference.

With tracing, the time per frame (CA Compositor) is 50ms, GL compositor is 40ms. 

So there's an 8-10ms difference. 2ms are due to [CATransaction commit]. Clearing the IOSurfaces before reuse is negligible. Digging deeper...I tried the following: instead of allocating 100 IOSurfaces (to create 100 CALayers to back 100 RPDQs), I instead create a single IOSurface, and draw into it 100 times, and then present a single IOSurface. This gives 33fps, for 30ms per frame! [Basically it saved the 2ms from [CATransaction commit]. It appears the cost of reusing IOSurfaces, making new FBOs, etc. is negligible. Tracing hasn't been useful, since both paths appear identical under tracing! Requires further investigation.


IOSurfaces are not-relevant to this last performance regression. The easiest way to check this is to modify CopyRenderPassDrawQuaadToOverlayResource (CA compositor) and DrawRenderPassQuadInternal (GL compositor) to immediately return after UpdateRPDQWithSkiaFilters. The performance regression still exists! This suggests something odd is happening since the actual skia/filter logic is pretty much identical. Before I dive further into this, let me describe my setup:

Corp MBP, dGPU always on, release build, DCHECK always off, non component, no symbols. Tests are run with animometer for at least 20 seconds, sometimes 40 if more time seems to reduce noise. I run two types of tests:
  1) Keep FPS constant (set to 30 fps), measure # of elements needed to maintain the fps.
  2) Keep number of elements constant (e.g. 200), measure fps.
In general, the two tests give consistent results, in that if build #1 is worse with (1), it is comparably worse with (2). That being said, (2) appears to give lower noise results, since it just involves an average, whereas (1) involves a linear regression, which I"ve seen give weird results in the past.

Using a reduced renderering path (the Skia filter image is never used, final output just a solid color quad). Note that in GL compositor, solid color quad drawn with GL, in CA compositor, solid color quad drawn with CA.
  1) CA compositor is about 20% worse than GL compositor.
  2) Adding logging statements to the CA compositor (~1 log per processed RPDQ) improvements performance by 10-20%. Doing all of the logging at once still keeps improvement. Replacing logging with a sleep makes improvement disappear.
  3) Getting rid of the solid color quads drawn by GL compositor reduces performance by 10-20%!

So then I tried replicating these results with the full rendering pipeline. Numbers are for animometer, 40 second tests, 200 fixed elements, all units are in fps. IOSurface caching has been modified to not be an issue past the first couple of frames.

CA compositor: 10.57, 10.58, 10.52, 200 fixed elements
CA compositor with additional logging to stderr: 11.68, 11.82, 11.25
GL compositor: 14, 14.07, 14.14
GL compositor, skip drawing quads that are not RPDQ, when RPDQ are present: 14.75, 14.53, 13.90, 14.62

Reduced pipeline (Skia filter image never used), 200 fixed elements
CA compositor: 19.96, 20.56, 19.94
CA compositor with additional logging to stderr: 24.16, 23.55, 24.31
GL compositor: 17.91, 17.84, 17.25
GL compositor, skip drawing quads that are not RPDQ, when RPDQ are present: 24.85, 18.40, 18.42, 25.24

Reduced pipeline (Skia filter image never used), 150 fixed elements
CA compositor: 27.92, 28.01
CA compositor with additional logging to stderr: 33.23, 32.74
GL compositor: 31.98, 31.36
GL compositor, skip drawing quads that are not RPDQ, when RPDQ are present: 24.24 32.74

Working on getting these same tests run on a non-corp machine. My best guess is that there's some type of really weird optimization that is getting sometimes hit in the reduced rendering pipeline case. Understanding this will probably also yield an explanation for the full pipeline case.
Just for kicks, I tried GL compositor (150 fixed elements), reduced pipeline, with logging to stderr: 29.51, 29.07

Other observations: There are some logic differences in the reduced pipeline between CA compositor and GL compositor around scissoring and blending. They shouldn't affect production of the Skia Image filter. I tried changing the settings to be consistent, and didn't see any behavior changes.

The actual parameters passed to the Skia image filter appera to be mostly irrelevant. e.g. Always pass in two filters, blur (amount=11), and opacity (amount=0.05) instead of whatever animometer is trying to feed in, and the results are the same.
Is our performance worse because of how we synchronize CA rendering?

If I recall, Safari reports a higher score because it doesn't synchronize it's rendering, and CA ends up skipping half of the frames.
Here are the optimizations I can think of: the blur filter should be applied in Skia, while the opacity filter should be applied by cc when drawing the resulting quad (this was https://codereview.chromium.org/1961553002). The other optimization is to use SkImage::makeWithFilter() instead of setting up an SkCanvas, calling drawImage() etc (https://codereview.chromium.org/1959033002). This shaves off a couple of FBO passes inside Skia.

One consequence of the latter is that the texture returned by Skia contains padding, where it used to be "snug" (texture bounds == image bounds). This is compensated for by cc when it draws by adjusting the texcoords. The texture may also be "flipped" (top left origin), now handled by cc. I have no idea if this would affect the CA path, but I thought I'd mention it.
Ran tests on a non-corp MBP.

Reduced pipeline (Skia filter image never used), 130 fixed elements
CA compositor: 31, 31.38
CA compositor with additional logging to stderr: 29.56, 29.90
GL compositor: 26.14, 26.13

Full rendering, 130 fixed elements: [modifications to CA compositor are made to CopyRenderPassDrawQuadToOverlayResource]
GL compositor: 21.34, 21.37
CA compositor: 15.73, 15.64, 15.73, 15.68
CA compositor with LOG(ERROR): 20.59, 20.17, 20.09, 20.56
CA compositor with fprintf + implicit fflush: 20.24, 
CA compositor with fprintf w/o implicit fflush: 20.49, 
CA compositor with fflush: 21.00
CA compositor with usleep(2): 15.72, 21.62, 15.70, 15.70
CA compositor with usleep(10): 15.73, 15.63
CA compositor with glFlush(): 14.80, 14.53
CA compositor [draw, but do not schedule RPDQ]:  22.45, 22.16
CA compositor [draw into GL texture, do not schedule RPDQ]:  25.79, 25.81

CA compositor [draw & schedule rpdq, GPU process ignores rpdqs]:  22.56, 22.35
CA compositor [again]: 22.22, 15.69, 22.19, 22.50, 22.21, 21.34
GL compositor: 21.35, 21.09
CA compositor: 15.59, 22.49, 22.53
GL compositor: 21.32
CA compositor: 15.70

Observation 1) At first, logging appears to improve performance. So does not scheduling CALayers.
Observation 2) Then, performance seems to stay elevated even w/o logging. 
Observation 3) Performance jumps discretely from ~16 to ~22, and never ends up in between. Maybe there's some mode in the Window Server that we're accidentally toggling?

I think I have an explanation for the regression. 

caveat: In general, I've had issues reproducing my measurements. In my non-corp MBP, not scheduling/displaying rpdqs always improves performance. However, sometimes performance is just as good even when displaying rpdqs. On my corp MBP, not scheduling/displaying rpdqs does not always improve performance.

I've attached two traces. In general, tracing significantly modifies the fps measurements, so those are no longer reliable. Without traces, the FPS is generally ~29 for RPDQ, and ~36 for no RPDQ, indicating ~7ms difference in frame times.

No RPDQ: Each frame takes 40ms.
RPDQ: Each frame takes 50ms.

Looking at each frame, the difference is that in the RPDQ frames, there's 2ms extra CATransaction commit, and ~8ms extra in glFlush time, spread across ~8 glFlushes in close proximity to each other. The additional flush time does not occur at a regular point in the frame, but it does regularly occur exactly once per frame. 

I was running these tests with 100 elements in Animometer. Those get converted into 100 IOSurfaces, and we know that the WindowServer is going to have to blit each of those into the final output surface. I think that the traces pretty clearly show there's additional GPU contention that shows up ~once per frame.

I wish we had more details about the exact functionality of the WindowServer. The solution I'm currently considering is to disable CA Compositor if there would be too many IOSurfaces [or perhaps we want to disable the CA compositor if there are too many IOSurfaces not being reused every frame, but that's a lot harder to detect]. 

I think we should disable CA compositor if there are too many DrawQuads in the root render pass.
trace_no_display_rpdq.gz
3.8 MB Download
trace_display_rpdq.gz
2.3 MB Download
Project Member

Comment 28 by bugdroid1@chromium.org, Sep 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e4e1cdc9897a759cf154dad8e81c6dd88ac4b22d

commit e4e1cdc9897a759cf154dad8e81c6dd88ac4b22d
Author: erikchen <erikchen@chromium.org>
Date: Wed Sep 21 22:59:18 2016

Disable CA compositor if there are too many RenderPassDrawQuads

If there are too many RenderPassDrawQuads, we shouldn't use Core Animation to
present them as individual layers, since that potentially doubles the amount of
work needed to present them. cc has to blit them into an IOSurface, and then
Core Animation has to blit them to the final surface.

BUG= 636884 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2351343002
Cr-Commit-Position: refs/heads/master@{#420188}

[modify] https://crrev.com/e4e1cdc9897a759cf154dad8e81c6dd88ac4b22d/cc/output/ca_layer_overlay.cc
[modify] https://crrev.com/e4e1cdc9897a759cf154dad8e81c6dd88ac4b22d/cc/output/overlay_unittest.cc
[modify] https://crrev.com/e4e1cdc9897a759cf154dad8e81c6dd88ac4b22d/tools/metrics/histograms/histograms.xml

Status: Fixed (was: Assigned)
It looks like this regression is now fixed. Thanks for going the extra mile!

https://chromeperf.appspot.com/report?sid=e692fbaf665f6cc929dd85d0a379010460b5b78e437091475d5093c382931c31&start_rev=411902&end_rev=421174

Sign in to add a comment