New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 879207 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Nov 15
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug-Regression



Sign in to add a comment

1.6%-234.4% regression in rendering.mobile at 587172:587302

Project Member Reported by mustaq@chromium.org, Aug 30

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=879207

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=8c10e4042530e0eb366d758f048754dcc2f9e7fbfb8ab23e3c15266781c51c4c


Bot(s) for this bug's original alert(s):

Android Nexus5 Perf
android-nexus5x-perf

rendering.mobile - Benchmark documentation link:
  https://bit.ly/rendering-benchmarks
Cc: ericrk@chromium.org
Owner: ericrk@chromium.org
Status: Assigned (was: Untriaged)
📍 Found a significant difference after 1 commit.
https://pinpoint-dot-chromeperf.appspot.com/job/101acd99640000

Android OOP-D: Enable OOP-D for Android Waterfall by ericrk@chromium.org
https://chromium.googlesource.com/chromium/src/+/86da5de5d854b46ef4e85bbef645a1c0866ed13b
62.97 → 199.6 (+136.6)

Understanding performance regressions:
  http://g.co/ChromePerformanceRegressions
Cc: sadrul@chromium.org kylec...@chromium.org samans@chromium.org fsam...@chromium.org
Hi All,

I'm looking into the regressions seen here: https://bugs.chromium.org/p/chromium/issues/detail?id=879207

The largest regression first gesture input latency regression on nyc_gov_scroll_2018. Here's a before/after trace:
before: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/nyc_gov_scroll_2018_2018-08-29_06-34-33_37905.html
after: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/nyc_gov_scroll_2018_2018-08-30_05-21-12_29671.html

Looking at these traces, we see some interesting behavior:
Before OOP-D, the first input event is fast, then we get backed up on the second. See before_oop_d.png
After OOP-D we immediately get backed up. See after_oop_d.png

So in both cases, we get really backed up at the beginning of the scroll. However, in the non-OOP-D case we get one frame in, *then* get backed up 9 gestures deep, while in the OOP-D case we immediately get backed up 12 gestures deep. Not sure how meaningful of a real-world regression this is, as we get pretty backed up in both cases.

However, this isn't really the most interesting part. The weird part is that, the entire time the scroll gesture is going on, we seem to be pumping frames - the display is actually drawing / swapping. Additionally, no thread seems to be backed up. See after_frame_production.png.

I'm not really sure how we can pump so many frames without processing these input events' associated latency info... unless we somehow queue up a lot of frames right before the input event starts... but I don't see evidence of this.

Am taking a pinpoint with additional trace categories to try to narrow down the root cause: https://pinpoint-dot-chromeperf.appspot.com/job/14c1371d640000
before_oop_d.png
31.6 KB View Download
after_oop_d.png
39.9 KB View Download
after_frame_production.png
90.1 KB View Download
I'm not an input event expert but what happens if events get coalesced?
Cc: nzolghadr@chromium.org
Components: Blink>Compositing Blink>Input
Cc: tdres...@chromium.org
When we coalesce two events we report one less data point. The new latency we report is also the result of coalescing the latency infos. This will be the difference from the time that the new coalesced event is handled (depends on the checkpoint this has different meaning) and the earliest event creation time. 
And when do we decide to coalesce? One possible explanation here is with OOP-D, the timing shifts such that we coalesce the first event into subsequent events whereas we don't without OOP-D and that reports as a regression. If that is the case, is this really a regression?
We always coalesce if the pipeline is busy. For example I believe in OOPIF if the previous event hasn't gone through (like finding the target frames and whatnot) there is coalescing/queuing there. Obviously only the coalescable events get coalesced and the rest of them get queued.
https://cs.chromium.org/chromium/src/ui/events/blink/blink_event_util.cc?type=cs&sq=package:chromium&g=0&l=510

Another example is in the renderer when main thread is busy (or for rAF aligning) and compositor coalesces events in the queue until main thread picks them up.

In general coalescing does make the metrics worse (if it is caused by the rest of the pipeline busy on the previous task) and it is kind of regression. If you look at the worst case scenarios (like high percentiles) we are essentially handling the first event in that coalesced pack later compared to when things weren't busy and we could avoid coalescing.
The InputLatency::GestureScrollUpdate traces have the full set of LatencyInfo components and associated times. The events have different DISPLAY_COMPOSITOR_RECEIVED_FRAME_COMPONENT, so that means they weren't coalesced, since they arrived in different CompositorFrames.

All the INPUT_EVENT_GPU_SWAP_BUFFER_COMPONENT and INPUT_EVENT_LATENCY_FRAME_SWAP_COMPONENT times I looked at were the same. This seems like either a surface sync issue or bug in display compositor code. For surface sync if the the newer renderer CompositorFrames aren't being used during aggregation I'd expect to see this. Alternatively there could be a bug with what LatencyInfo objects are being recorded in traces after a particular swap ack?
Hmm, I originally dismissed this being a surface sync issue because I turned on surface sync separately a few days prior to turning on OOP-D but with that said, it's possible a slight timing perturbation could put us in a not-so-great state timing wise.

With surface sync we will keep pumping frames from the renderer even if they're not embedded by the browser. We do not throttle frame production child-side. Perhaps that's the issue?
Components: -Blink>Compositing
Components: -Blink>Input Internals>Services>Viz
Cc: maxlg@chromium.org
 Issue 880297  has been merged into this issue.
Cc: npm@chromium.org
 Issue 883746  has been merged into this issue.
 Issue 883756  has been merged into this issue.
Ok, figured out the issue here... could use some Viz expertise to understand the best solution (or to help with a fix):

Basically, this is the combination of two things:

First, when Viz receives a frame, it checks if that frame causes damage. If not, it immediately acks it. This has odd interaction when the renderer frame arrives before the browser frame that embeds it. In the case we're seeing here, the renderer frame arrives before the browser frame, detects no damage, and immediately acks. This means that the renderer can get an arbitrary number of frames ahead of the browser. Eventually, a browser frame comes in for each renderer frame and all are drawn (none skipped), but in traces I've looked at we've had the renderer 3 frames ahead of the browser.

Second, if we get a new renderer frame for a given FrameSinkId, we coalesce the latency infos from the old frame into the new one. I assume this is under the assumption that the old frame will never be drawn. (this code I believe: https://cs.chromium.org/chromium/src/components/viz/service/surfaces/surface.cc?rcl=65181316897fd37d04cfcb07245159597556f39a&l=418)

When these two things combine, we end up in a situation where we keep coalescing latency info from frame A into frame B, before A is drawn. A is then drawn, but has no latency info.


It seems like this could be fixed in two ways:
1) Assume that the renderer *can* get ahead of the browser in frame submission, and delay coalescing latency infos *until* a frame is actually drawn, rather than at activation time.
2) Prevent the renderer from getting ahead of the browser by avoiding ACKing the renderer frame when a browser frame for it may be pending (this seems tricky, not sure how to handle this best).

WDYT?
Labels: OS-Android
Owner: fsam...@chromium.org
Over to fsamuel@ for a fix - will split out any non input-latency regressions into a new bug.

My feeling is that the user impact is low - this is mainly a reporting issue, so a fix in 71 seems fine.
I believe I have a fix in progress. I'm running a pinpoint job to be sure:

https://chromium-review.googlesource.com/c/chromium/src/+/1236931
Cc: piman@chromium.org
+piman@:

FYI: My solution is 2. The renderer can get ahead of the browser in frame submission: child-initiated synchronization events such as top bar controls scrolling.

I am throttling the child on the parent for child-initiated synchronizations.
I've attached some local telemetry results with an upcoming patch to throttle child frames. It makes a substantial improvement.
results.html
1.9 MB View Download
Labels: -M-70 M-71
This will be addressed in M71
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/12ab5d2ce40000

The swarming task expired. The bots are likely overloaded, dead, or misconfigured.
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/16923b72e40000

The swarming task expired. The bots are likely overloaded, dead, or misconfigured.
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/165cb828e40000

The swarming task expired. The bots are likely overloaded, dead, or misconfigured.
Project Member

Comment 30 by bugdroid1@chromium.org, Oct 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f96f0ea3b55efe11726a463a08e34cee8de57686

commit f96f0ea3b55efe11726a463a08e34cee8de57686
Author: Fady Samuel <fsamuel@chromium.org>
Date: Wed Oct 03 18:49:09 2018

Surface Synchronization: Throttle child synchronization events

This CL enables throttling of child synchronization events. In a child-
initiated synchronization, the child's frame will likely activate
before the dependent parent's CompositorFrame arrives. Upon activation,
the child will see that the new surface does not cause any damage to the
display (because it's not embedded by any parent yet) and calls the
draw callback immediately, thereby sending back a DidReceiveCompositorFrameAck
to the client immediately.

This, in turn, results in the client sending another CompositorFrame to Viz,
and another and so on, letting the child get further and further ahead of the
parent. This results in a lot of wasted work because the child's surface will
not be shown on screen until the parent embeds it and so multiple child
initiated synchronization events are pointless.

This CL throttles DidReceiveCompositorFrameAck of the child on child-initiated
synchronization events until the parent references the child.

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel
Change-Id: I62416d87f8d871b32f4caad565cff515a9c55445
Bug:  672962 ,  879207 
Reviewed-on: https://chromium-review.googlesource.com/1236931
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Reviewed-by: Saman Sami <samans@chromium.org>
Cr-Commit-Position: refs/heads/master@{#596300}
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/frame_sinks/compositor_frame_sink_support.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/frame_sinks/compositor_frame_sink_support.h
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/frame_sinks/compositor_frame_sink_support_unittest.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/frame_sinks/surface_synchronization_unittest.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface.h
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface_dependency_tracker.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface_dependency_tracker.h
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface_manager.cc
[modify] https://crrev.com/f96f0ea3b55efe11726a463a08e34cee8de57686/components/viz/service/surfaces/surface_manager.h

😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/1654c530e40000

The swarming task expired. The bots are likely overloaded, dead, or misconfigured.
Project Member

Comment 32 by bugdroid1@chromium.org, Oct 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6f72fd2c91cfbe156601a2e74acdb873187d0d36

commit 6f72fd2c91cfbe156601a2e74acdb873187d0d36
Author: Fady Samuel <fsamuel@chromium.org>
Date: Thu Oct 04 18:00:14 2018

Surface Synchronization: Allow out-of-order activation

With child-initiated synchronization throttling, it's possible for
frames from a given client to activate out-of-order.

For example, a child-initiated synchronization that is blocked on the
parent may be superceded by a parent-initiated synchronization.

That parent-inititaed surface will activate first, before the child-
initiated one, causing surfaces to activate out of order.

That's okay! This CL fixes some CHECKs and marks older surfaces
at activation time as candidates for garbage collection.

Bug:  672962 ,  879207 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel
Change-Id: I19c033c6400f467b29658c706556a99c5eaf0791
Reviewed-on: https://chromium-review.googlesource.com/c/1260019
Reviewed-by: Saman Sami <samans@chromium.org>
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#596750}
[modify] https://crrev.com/6f72fd2c91cfbe156601a2e74acdb873187d0d36/components/viz/service/frame_sinks/compositor_frame_sink_support.cc
[modify] https://crrev.com/6f72fd2c91cfbe156601a2e74acdb873187d0d36/components/viz/service/frame_sinks/surface_synchronization_unittest.cc

Status: Fixed (was: Assigned)

Sign in to add a comment