
Issue 599567

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug-Regression

Blocked on:
issue 606850

Blocking:
issue 576552




4% Mac page load regression in M49

Project Member Reported by rsch...@chromium.org, Mar 31 2016

Issue description

4% overall page load time regression at the median on Mac that's currently rolled out to stable:

https://uma.googleplex.com/timeline_v2?q=%7B%22day_count%22%3A%2290%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%224%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22M%22%5D%7D%2C%7B%22fieldId%22%3A%22milestone%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%2249%22%2C%2248%22%5D%7D%5D%2C%22histograms%22%3A%5B%22PageLoad.Timing2.NavigationToFirstContentfulPaint%22%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%2C%22zeroBased%22%3Afalse%7D%5D%7D

Based on timing and characteristics, it seems likely to be related to fa7fc32c5940dfd3d734ed3231b1295da4c3303e. The full list of regressions attributed to that revision can be found at https://chromeperf.appspot.com/group_report?bug_id=564223. According to eae, it is "likely due to a large perf regression for AAT fonts."

Related memory regression in issue 564223.
 

Comment 1 by e...@chromium.org, Mar 31 2016

Thanks

Note that all but two of the graphs have recovered to the same or better levels than before. One is likely due to AAT (bug 547912); the other is an OOPIF test that has had a number of large changes and swings in the past few months.

Also, note that the UMA data shows that performance on Windows and Linux improved between 48 and 49 for the same test.

Comment 2 by drott@chromium.org, Mar 31 2016

See https://bugs.chromium.org/p/chromium/issues/detail?id=576989#c16 for more details on the AAT performance regression that has been addressed.

Comment 4 by e...@chromium.org, Mar 31 2016

Status: Available (was: Unconfirmed)
Yeah that's bad and requires some further investigation. Interesting that it doesn't match the benchmarks. Thanks Ryan!

Comment 5 by e...@chromium.org, Apr 25 2016

FYI, the numbers for M50 seem to be on par with M48. Yay. I'll keep monitoring to ensure this remains the case as M50 rolls out.
Cc: pinkerton@chromium.org shrike@chromium.org

Comment 7 by e...@chromium.org, Apr 27 2016

Status: Fixed (was: Available)
M50 keeps matching M48, so it looks like the fixes in M50 worked and returned performance to M48 levels. Yay.

I'm confused; the numbers here for M50 look worse than M49, if anything. Certainly not back down to M48 levels. Unless I'm looking at the wrong thing?
Status: Available (was: Fixed)
Restricting it to just M49 Beta and Stable:

https://uma.googleplex.com/timeline_v2?q=%7B%22day_count%22%3A%22180%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%224%22%2C%223%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22M%22%5D%7D%2C%7B%22fieldId%22%3A%22milestone%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%2249%22%5D%7D%5D%2C%22histograms%22%3A%5B%5B%22PageLoad.Timing2.NavigationToFirstContentfulPaint%22%5D%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%2C%22zeroBased%22%3Afalse%7D%5D%7D

The regression takes off after 49.0.2623.75 Beta. The change list

https://chromium.googlesource.com/chromium/src/+log/49.0.2623.63..49.0.2623.75?pretty=fuller&n=10000

includes the cherry-picked IOSurface change that I've suspected in the other long regression thread. Issue 606850 tracks an experiment to disable that change for some users, so let's see if that makes a difference with this metric as well.



Blocking: 576552
Labels: -Type-Bug M-51 ReleaseBlock-Stable Type-Bug-Regression
Can we get an owner for this perf regression? This probably should have blocked M50. 
Blockedon: 606850
Cc: ccameron@chromium.org erikc...@chromium.org
+ccameron, erikchen as owners of issue 606850.
Given the discussion elsewhere on this issue, I'm less confident that the IOSurface change, if it is indeed the cause of the DrawInterval regression, is also the cause of this problem, but with the experiment getting underway we should learn more.
M51 Stable is launching very soon! Your bug is labelled ReleaseBlock-Stable; please make sure to land the fix and get it merged ASAP. All changes MUST be merged into the release branch by 5pm on May 20 to make it into the desktop Stable final build cut. Thank you!
Any update on the fix for this bug? We're very close to the M51 stable candidate cut and this is an M51 stable blocker.
I don't think that this should be RBS for M51.
rschoen@, pinkerton@, could you please remove the RB-stable if c#18 sounds right to you? Also, I am a little confused after reading through the comments. Can one of you summarize which releases see this regression and how bad the current numbers are?
I think that a lot of confusion here is coming from the fact that this bug conflates two issues (perhaps related, but with different symptoms).

Issue 1: M49 in BETA regressed this metric by 30 msec. I could hazard a guess that this is because we started using the CoreAnimation renderer. In particular, we started using IOSurfaces for all textures, which are much more expensive to allocate. We could create a Finch experiment to verify this hypothesis. If this is the case, it's "good to know, worth the tradeoff".

Issue 2: M49 in STABLE sees this metric DEGENERATE over time. This is the thing where, on M49, we go from ~840ms on April 23 to ~920ms by May 18. This looks to me to be caused by Chrome running for a long time. Is there a way that we could break this data down by Chrome's uptime? A long time ago we found an OS X bug where they leak IOSurfaces (see crrev.com/168186). Perhaps we're triggering that again? Also concerning is issue 580616, where some people complained that performance degraded to an unacceptable level after several days of use. We haven't been able to reproduce that issue (and had the theory that it was specific to a particular GPU).

Two things to try:

1. Add a UMA stat that measures the system's IOSurface load. I can put something together to merge into M52. We can then track that as M52 goes out to stable. It may be that we will see a leak there.

2. Determine the cause of the metric's regression in Beta with a Finch experiment (see the sketch after this list). In particular, split the users into 3 groups:
 1. Disable the CoreAnimation renderer and disable using IOSurfaces for compositor resources
 2. Disable the CoreAnimation renderer but still use IOSurfaces for compositor resources
 3. Enable the CoreAnimation renderer (requires IOSurfaces)
That will answer the question of "where did issue 1 come from". Sending this experiment to stable would regress users' battery life, so we probably should just do it in beta.
[Attachment: graph.png]
WRT the RBS label:
- The "Issue 1" above isn't worth RBS -- not a big enough regression.
- The " Issue 2 " above isn't going to be solved for M52. At most I could maybe get the IOSurface tracking UMA merged in, to see if that's correlated.

rschoen@: Is there a way to break UMAs down by how long Chrome has been running? That might be revealing.
Cc: rkaplow@chromium.org
+rkaplow to answer "Is there a way to break UMAs down by how long Chrome has been running?"
eae@ - was your AlwaysUseComplexText change placed under an experiment before it went completely live? I would like to look at the experiment data.

---

Taking another look at this, I understand how I arrived at the conclusion that the IOSurface change may be responsible, and think that may still be a possibility.

I also think I understand how rschoen@ arrived at his conclusion. In my analysis, my theory for the jump was a cherry-picked change. However, if you don't assume a cherry-picked change and follow the regression trail backwards, you reach a different conclusion.

In this graph:

https://uma.googleplex.com/timeline_v2?q=%7B%22day_count%22%3A%22240%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%221%22%2C%222%22%2C%223%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22M%22%5D%7D%5D%2C%22histograms%22%3A%5B%5B%22PageLoad.Timing2.NavigationToFirstContentfulPaint%22%5D%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%2C%22zeroBased%22%3Afalse%7D%5D%7D

the data show a regression in Beta when going from 48.0.2564.82 -> 49.0.2623.28, and a regression in Dev when going from 48.0.2564.22 -> 49.0.2587.3. So basically a regression baked into the transition from M48 to M49 in Beta and Dev. If you look at Canary before the Dev regression, there's a regression that starts to take off at 48.0.2571.0. The AlwaysUseComplexText change appears between 48.0.2571.0 and 48.0.2574.0. However, I think the regression had already begun before this change landed.

Taking a stab at 48.0.2560.0 as a reasonable point before the regression, and noting that this is a Mac-only regression, the one change that stands out is:

    885da5130d948ba7f6d721888db7114a4f912789 
    mac: Some consumers of SharedMemory require a POSIX fd.
    https://chromium.googlesource.com/chromium/src/+/885da5130d948ba7f6d721888db7114a4f912789

Perhaps passing memory using POSIX is more expensive.

Labels: -ReleaseBlock-Stable -M-51 M-52
Project Member

Comment 25 by sheriffbot@chromium.org, Jun 1 2016

Labels: -M-52 M-53 MovedFrom-52
Moving this nonessential bug to the next milestone.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Owner: erikc...@chromium.org
erikchen, any idea if the change shrike identified in #23 could be to blame?
Status: Assigned (was: Available)
I can look into it, although it's pretty unlikely that the POSIX/Mach shared memory changes are responsible; those were gated behind an experiment, which showed improvements to PLT.
Owner: ccameron@chromium.org
ccameron's analysis appears accurate. The degradation of the metric over time after M49 was released is highly telling.

shrike's analysis relies on the assumption that the regression on Dev channel happened between D4 and D5 in the attached image. Given the amount of noise in the Dev channel (e.g. D2 has the same value as D5), it doesn't really seem like we can pinpoint a regression to that range.
[Attachment: Screen Shot 2016-07-06 at 10.09.14 AM.png]
Re: my suspicions in #23, I agree with erikchen@ that the regression is not related to the shared memory change. But in #23 I also thought it might be related to the IOSurface clearing change. Again restricting the IOSurface experiment to users with spinning disks and 2 GB of RAM, there is a significant regression. On the theory that the cost of clearing an IOSurface is not negligible for these users, it makes sense that this could increase the time from navigation to first contentful paint. It appears that the slowdown is more sensitive to RAM than to spinning disk (the regression pretty much disappears when you look just at disk).

https://uma.googleplex.com/p/chrome/variations/?sid=051c2d8734976b2e694437ebe2c0b239

Re: #20, ccameron@ - we should try to see if there is a problem with things getting slower the longer you run the browser.

[Attachment: Screen Shot 2016-07-06 at 3.55.04 PM.png]
Project Member

Comment 30 by sheriffbot@chromium.org, Jul 14 2016

Labels: -M-53 -Pri-1 M-54 MovedFrom-53 Pri-2
This issue is Pri-1 but has already been moved once. Lowering the priority and moving to the next milestone.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Looking at IOSurface experiment data, IOSurface clearing was not the cause of the regression.

https://uma.googleplex.com/p/chrome/variations/?sid=48d6b2d63f010ff0720b0778c3de7d99
Labels: SHC
Labels: -SHC SystemHealth-Council
Owner: e...@chromium.org
Any updates on this? It seems like we're back to thinking it's likely fa7fc32c5940dfd3d734ed3231b1295da4c3303e? Assigning to eae in that case.
eae@, could you please provide an update on this?
Cc: bmcquade@chromium.org
This has not recovered, and actually continues to regress. Anything we can do?

https://uma.googleplex.com/timeline_v2?sid=86bbb93e2c9a449539a4f8e5bd86c04c

Comment 37 by e...@chromium.org, Oct 18 2016

We knew that fa7fc32c5940dfd3d734ed3231b1295da4c3303e would regress performance on some sites in exchange for an improvement on others and an improvement in correctness. The regressions have since been addressed by subsequent performance improvements.

Our other performance tests are indicating that text rendering has gotten *faster* over the last few releases so if we're seeing a continuous regression then it likely has a different cause.


Cc: parisa@chromium.org
Labels: -Pri-2 Pri-1
+parisa as FYI

What are the next steps? Do we need to re-bisect and find a different culprit? Upgrading to P1 as this has been a regression for 5 milestones. 
A couple observations:

1. The regression in M49 is real, is specific to Mac, and affects the distribution up through about the 90th percentile; the tail seems unchanged.

2. PageLoad.Timing2.NavigationToCommit is unchanged: https://uma.googleplex.com/timeline_v2?sid=ddb9b13461cdf775e1a50fa9a10f0124
This indicates that the regression occurs in the period between commit and first contentful paint.

3. The increase in M49 stable starting on May 25 at the end of the graph in comment #20 is not a real regression. M50 became the dominant browser version as of May 25, so any users on M49 after this point are users who are failing to upgrade to M50. We often see higher latency for users who remain on old versions, as these users likely have different network/device/etc. characteristics from users who successfully upgrade. To control for this, you can include a "Version tag / contains / dominant" filter in your filtering criteria.

4. The Timing2 variants of these metrics have been replaced by PaintTiming equivalents. We stopped logging Timing2 metrics in M53. M54 became the dominant browser as of late October, so future analysis should look at PageLoad.PaintTiming.NavigationToFirstContentfulPaint.

5. The regression doesn't appear to have recovered: https://uma.googleplex.com/timeline_v2?sid=1e2e9c4170cdc926c9e093267275c8d1

6. The regression appears to have been introduced between the 2564 and 2587 branches, based on dev channel analysis:
https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1ebaed7e293611c97ff0eb938b3eb869. Given that the regression also emerges at the 48/49 boundary, this may also be caused by a field trial targeting 49.

7. Canary channel data is unfortunately a bit too noisy for us to see precisely when the regression started on canary: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=36615395e55dc1cd65dfb623e9fb1dec
Cc: kouhei@chromium.org
kouhei and annie, is it possible for us to do a bisect using page cycler v2 to find regressions in the TTFCP metric on Mac, in the 2564..2587 range?
kouhei, bmcquade: do you know which chart on chromeperf you'd like to bisect? If you paste a link I can help out.
Kouhei, is there a chromeperf chart that runs page cycler v2 on a representative set of URLs? Any such chart/benchmark should suffice here.

Comment 43 by e...@chromium.org, Apr 10 2017

Status: WontFix (was: Assigned)
Given the lack of recent activity and the age of the regression, I'm closing this bug as WontFix.
