Issue 594983

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jul 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug




Regression in Event.Latency.TouchToFirstScrollUpdateSwapBegin from M48 to M49

Project Member Reported by tdres...@chromium.org, Mar 15 2016

Issue description

In the median:
- GT-I9500, GT-I9505, Nexus 5X, Nexus 6P all regressed by ~1.3ms
- Nexus 5 and Nexus 6 improved by ~0.75ms

In the 99th percentile:
- GT-I9505 regressed by ~50ms
- GT-I9500, Nexus 5X, Nexus 6P, Nexus 5 and Nexus 6 improved by ~50ms

Telemetry shows a regression in first_gesture_scroll_update_latency that has since recovered; its timing overlaps with when M49 was branched. It's possible we should just wait until M50 hits beta to see whether this is resolved.

Any thoughts on what this could be?
 
I suspect that's because 48.0.2564.95 has this code to disable expensive task blocking.

  // Expensive task blocking is currently disabled ( crbug.com/574343 ).
  block_expensive_loading_tasks = false;
  block_expensive_timer_tasks = false;

It's weird there were opposite changes in the 99th percentile of various tasks.
Hit send too soon. I meant to say it's weird there were opposite changes in the 99th percentile for different devices. Are we sure those numbers are statistically meaningful at the 99th percentile? (I sort of hope they are, but I've been led astray before by randomness.)
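One generic way to check whether a 99th-percentile shift is larger than sampling noise is a bootstrap interval around the percentile. The sketch below is only an illustration of that idea: the function names `Percentile` and `BootstrapPercentileInterval` are hypothetical and do not come from Chromium or UMA, and the nearest-rank percentile definition and the 95% interval are assumptions chosen for simplicity. It expects a non-empty vector of raw latency samples.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Nearest-rank percentile of a non-empty sample, with p in (0, 100].
// Takes the vector by value so it can be sorted without touching the caller's data.
double Percentile(std::vector<double> samples, double p) {
  std::sort(samples.begin(), samples.end());
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  std::size_t rank =
      static_cast<std::size_t>(std::ceil(p * samples.size() / 100.0));
  if (rank == 0) rank = 1;
  return samples[rank - 1];
}

// Returns an approximate 95% (low, high) bootstrap interval for the p-th
// percentile. `resamples` is the number of bootstrap rounds; `seed` makes
// the result reproducible.
std::pair<double, double> BootstrapPercentileInterval(
    const std::vector<double>& samples, double p, int resamples,
    unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, samples.size() - 1);

  std::vector<double> estimates;
  estimates.reserve(resamples);
  std::vector<double> resample(samples.size());

  for (int i = 0; i < resamples; ++i) {
    // Draw a resample of the same size, with replacement.
    for (double& value : resample) value = samples[pick(rng)];
    estimates.push_back(Percentile(resample, p));
  }

  // The 2.5th and 97.5th percentiles of the estimates bound the interval.
  return {Percentile(estimates, 2.5), Percentile(estimates, 97.5)};
}
```

If the intervals computed for two milestones on the same device do not overlap, the difference is less likely to be randomness; overlapping intervals suggest the 99th-percentile movement may not be meaningful.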
Right, I suspect the first_gesture_scroll_update_latency regression was also caused by us turning off expensive task blocking.

Incidentally I've been trying to find out if some metrics show movement from the finch experiment that enables that blocking again. So far it's been very inconclusive, but maybe we can find the numbers corresponding to the devices listed here.

Here's GT-I9505 for example: go/piwfx -- Tim does that look similar to the regression you saw?
These numbers are statistically significant at the 99th percentile (the timeline view shows relatively stable numbers).

GT-I9505 in UMA in general regressed by 80ms, compared to a ~15ms regression from the Finch Trial.

It doesn't look like this accounts for the whole regression.


Something that might explain the difference is that the new policy is less aggressive than the one we disabled in M49. For example it doesn't block anything during main thread driven gestures anymore. We might be able to bring it back with more heuristics, but I'm not sure if it's worth the complexity.
Is my understanding of the timeline here correct?
M46: No task blocking
M47: Aggressive task blocking (merged back from M48)
M48: Aggressive task blocking
M49: Less aggressive task blocking behind a finch trial
M50: Less aggressive task blocking behind a finch trial
Almost, M48 had the task blocking turned off.
In that case, from M48 to M49, we went from no task blocking to non-aggressive task blocking behind a finch trial. How would that result in a regression in M49?

The relevant UMA graph is here (internal only):
https://uma.googleplex.com/p/chrome/timeline_v2/?q=%7B%22day_count%22%3A%22All%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%224%22%5D%7D%2C%7B%22fieldId%22%3A%22version_tags%22%2C%22operator%22%3A%22CONTAINS%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22D%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22A%22%5D%7D%2C%7B%22fieldId%22%3A%22short_hw_class%22%2C%22operator%22%3A%22COMPARE%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22Nexus%205%22%2C%22Nexus%205X%22%2C%22Nexus%206%22%2C%22Nexus%206P%22%2C%22GT-I9500%22%2C%22GT-I9505%22%2C%22SM_G900H%22%5D%7D%5D%2C%22histograms%22%3A%5B%22Event.Latency.TouchToFirstScrollUpdateSwapBegin%22%2C%22Event.Latency.TouchToScrollUpdateSwapBegin%22%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%2C%22percentiles%22%3A%5B%2299%22%5D%7D%2C%22zeroBased%22%3Afalse%7D%2C%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%7D%5D%7D
Sami, I'm not clear on how this regression could be related to the scheduler change. If M48 didn't have task blocking, how could the M48-M49 regression be related to task blocking?
Right, I think I need to double check the dates here. M48 had blocking, but it was reverted once we found the problems. If we're comparing M48 before the revert to M49 then there could be a difference.

Can we figure out revision ranges for the regressions?

(I'll soon be OoO the rest of the week but I'll return to this next week.)
On some devices there is some movement during M48, but performance appears to be improving over time, not regressing.

Performance in M49 is worse than the worst performance in M48 though.

We see M48 become dominant on Feb 4, and then its performance gradually improves until Feb 24, and then it's about flat until March 12, when M49 becomes dominant.
Any status updates here? Not sure if you guys are investigating off-thread.
Sorry, this has been on my backlog for a while. I'll try to get to it.
These are the ranges for the two regressions:

--- First (smaller) regression from 47.0.2526.83...48.0.2564.95
I'm guessing this is the difference of M47 having task blocking and it being disabled in M48.

--- Second regression from 48.0.2564.95...49.0.2623.91
No clear culprit for this one :( There are a number of changes in the display scheduler, cc scheduler and the Blink scheduler, but none of them seem obviously problematic.

I took some traces of scrolling at ToT and the only problem I saw had to do with jank caused by the long press touch highlight rect that sometimes activated when I was trying to scroll (trace attached). However, M48 has the same problem.

I also compared M48 and ToT with a high speed camera on a couple of sites but couldn't see any quantifiable difference.

Deep reports might be the next thing to look at here.
Attachment: chrome-profile-results-2016-04-14-112110.html (3.6 MB)
This recovers in M50.

https://uma.googleplex.com/p/chrome/timeline_v2/?q=%7B%22day_count%22%3A%22All%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%223%22%5D%7D%2C%7B%22fieldId%22%3A%22version_tags%22%2C%22operator%22%3A%22CONTAINS%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22D%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22A%22%5D%7D%5D%2C%22histograms%22%3A%5B%5B%22Event.Latency.TouchToFirstScrollUpdateSwapBegin%22%5D%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%2C%22zeroBased%22%3Afalse%7D%5D%7D

https://uma.googleplex.com/p/chrome/timeline_v2/?q=%7B%22day_count%22%3A%22All%22%2C%22end_date%22%3A%22latest%22%2C%22window_size%22%3A%221%22%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%224%22%5D%7D%2C%7B%22fieldId%22%3A%22version_tags%22%2C%22operator%22%3A%22CONTAINS%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22D%22%5D%7D%2C%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22EQ%22%2C%22study%22%3A%22%22%2C%22selected%22%3A%5B%22A%22%5D%7D%5D%2C%22histograms%22%3A%5B%5B%22Event.Latency.TouchToFirstScrollUpdateSwapBegin%22%5D%5D%2C%22default_entry_values%22%3A%7B%22measureModel%22%3A%7B%22measure%22%3A%22%22%2C%22buckets%22%3A%5B%5D%2C%22percentiles%22%3A%5B%2250%22%5D%2C%22selectedFormulas%22%3A%5B%5D%2C%22allFormulas%22%3A%5B%5D%7D%2C%22zeroBased%22%3Atrue%2C%22logScale%22%3Afalse%2C%22showLowVolumeData%22%3Afalse%2C%22showVersionAnnotations%22%3Atrue%7D%2C%22entries%22%3A%5B%7B%22measureModel%22%3A%7B%22measure%22%3A%22percentile%22%7D%2C%22zeroBased%22%3Afalse%7D%5D%7D

I did a bit more digging. This only occurs on Samsung devices, so we don't have much telemetry coverage, but we do have some:
https://chromeperf.appspot.com/report?sid=6779fd0c536d8e89c211a6a93213a351d3267d39655b0a46061da3a4bf189f87&start_rev=311770&end_rev=392588


Labels: TouchLatencyRegression
Event.Latency.Browser.TouchAcked didn't regress, but Event.Latency.ScrollUpdate.TouchToHandled_Main did regress. This implies that it's the actual main thread scrolling machinery (or scheduling) that regressed.
Status: WontFix (was: Available)
Marking WontFix, as this is old, and has recovered.
Labels: SHC
Labels: -SHC SystemHealth-Council
