Issue 794302

Starred by 2 users

Issue metadata

Status: Fixed
Owner: khushals...@chromium.org
Closed: Feb 2018
Cc: briander...@chromium.org nednguyen@chromium.org tdres...@chromium.org kouhei@chromium.org vmi...@chromium.org sunn...@chromium.org
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug-Regression




30.6%-60.1% regression in loading.desktop at 522477:522704

Project Member Reported by fmea...@chromium.org, Dec 12 2017

Issue description

See the link to graphs below.
 
Project Member

Comment 1 by 42576172...@developer.gserviceaccount.com, Dec 12 2017

All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=794302

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=d8ec850d1170265d6a6a7f9a88560932b77feedfa6d2793c5c9978510595600a


Bot(s) for this bug's original alert(s):

chromium-rel-mac-retina
chromium-rel-mac11-air
chromium-rel-mac11-pro
chromium-rel-win10
chromium-rel-win7-gpu-ati
chromium-rel-win7-gpu-intel
chromium-rel-win7-gpu-nvidia
chromium-rel-win8-dual
linux-release
Project Member

Comment 2 by 42576172...@developer.gserviceaccount.com, Dec 12 2017

๐Ÿ“ Pinpoint job started.
https://pinpoint-dot-chromeperf.appspot.com/job/14e2b6ca040000
Project Member

Comment 3 by 42576172...@developer.gserviceaccount.com, Dec 13 2017

Cc: briander...@chromium.org khushals...@chromium.org
Owner: khushals...@chromium.org
Status: Assigned (was: Untriaged)
๐Ÿ“ Found a significant difference after 1 commit.
https://pinpoint-dot-chromeperf.appspot.com/job/14e2b6ca040000

cc: Alter duration used in prioritizing impl latency in Scheduler.
By khushalsagar@chromium.org · Thu Dec 07 22:46:18 2017
chromium @ ee432e360c202a47eaf132b33d7c88155f52b6d6

Understanding performance regressions:
  http://g.co/ChromePerformanceRegressions
Status: WontFix (was: Assigned)
This regression is actually indicative of an improvement that allows us to schedule more work on the renderer main thread. The change in #3 is intended to improve the data used for prioritizing impl-thread latency. If the scheduler prioritizes impl latency over the main thread, we draw before the main thread can respond. If the main thread is actually fast, this can block the frame pipeline in the renderer, which means the main thread waits for the compositor to signal the commit event for a longer duration and cannot schedule any other work.

The change is working as intended, and as a result there are fewer periods where the main thread is waiting. Looking at the wall and CPU durations for MessageLoop::RunTask on the renderer main thread, it's 5965 ms of wall duration before the change and 3417 ms after. The number of tasks run increases from 792 to 1365, while the average CPU duration per task is now lower (3 ms to 1.7 ms).
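
To make the blocking described above concrete, here is a minimal sketch of the commit handshake between the renderer main thread and the compositor (impl) thread, in plain C++. This is an illustration only, not Chromium's actual code; CommitGate is an invented stand-in for cc's completion event, and the 16 ms draw is an assumed frame budget.

  #include <chrono>
  #include <condition_variable>
  #include <iostream>
  #include <mutex>
  #include <thread>

  // Invented one-shot event: the main thread blocks until the impl
  // thread signals that the commit has been serviced.
  class CommitGate {
   public:
    void Signal() {
      {
        std::lock_guard<std::mutex> lock(mutex_);
        done_ = true;
      }
      cv_.notify_one();
    }
    void Wait() {
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return done_; });
    }

   private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
  };

  int main() {
    CommitGate commit_done;
    bool prioritize_impl_latency = true;  // the behavior the CL in #3 tunes

    std::thread impl([&] {
      if (prioritize_impl_latency) {
        // Draw the previous frame before servicing the pending commit;
        // the main thread stays blocked for this whole duration.
        std::this_thread::sleep_for(std::chrono::milliseconds(16));
      }
      commit_done.Signal();  // let the main thread resume
    });

    auto start = std::chrono::steady_clock::now();
    commit_done.Wait();  // main thread: blocked, cannot run other tasks
    auto blocked = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "main thread blocked ~" << blocked.count() << " ms\n";
    impl.join();
  }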
Cc: nednguyen@chromium.org tdres...@chromium.org
Hey Ned - I don't understand the tradeoff being described in comment #4. Can you verify that the way this metric works is consistent with what's described in comment #4?
Cc: kouhei@chromium.org
Labels: -Pri-2 Pri-1
Status: Assigned (was: WontFix)
Wait, I don't yet understand how you justify the huge regression in timeToFirstMeaningfulPaint_avg. Regressing that metric means users will see the page load more slowly.

I am not convinced that this is a good thing for end users yet.

My suggestion: revert your CL, guard it behind a Finch flag, and analyze its effect on UMA heartbeat metrics to see whether we can justify the tradeoff here.
Cc: vmi...@chromium.org sunn...@chromium.org
I dug into this a bit more and, IIUC, here is the sequence of events leading to this regression.

1) The Scheduler has a latency recovery mode: if we have an update pending to be flushed (a pending tree or an undrawn active tree) and the timing history tells us that the main thread is fast, we skip the next main frame to allow the pipeline to stabilize. Whether to enter this mode is decided based on whether we can activate before the deadline, here: https://cs.chromium.org/chromium/src/cc/scheduler/scheduler.cc?l=467.

2) Before this change, since the duration for which the main thread is blocked was included in our prediction of whether the main thread is fast, we always put it in high latency and never attempted recovery. This in turn meant that for a load with a warm cache, the timing history was saturated with samples that consider the main thread to be slow, because a reload reuses the same RenderWidget/compositor.

3) After this change, before the warm load starts, the main thread stabilizes and maintains a history of fast main frames in cc's timing history. It's easy to see this in the traces on the bot. This means that when the second load starts and the very first main frame takes too long, we decide to skip the next main frame to recover latency. A side effect of this is that the main thread schedules some background parsing work, since no high-priority work is immediately pending, which further delays the next main frame.

Another interesting thing is that the recovery mode above also kicks in during cold start, irrespective of this change, because in that case we have no history of main frames and assume the main thread to be fast. Reverting this change is definitely not the correct answer; that would simply disable latency recovery altogether, which unnecessarily keeps the main thread blocked across all frames.

Here is an idea: we should first avoid latency recovery when cc has no timing history, perhaps with a heuristic requiring a minimum number of samples. That would also avoid skipping the main frame on cold start. Then we would want to clear cc's timing history on navigations, so a second load does not rely on the history from a previous navigation; a rough sketch follows below. Brian, wdyt? +sunnyps/vmiura for thoughts too.
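
A rough sketch of that heuristic in plain C++; TimingHistory, kMinSamplesForRecovery, and ShouldAttemptLatencyRecovery are invented names for illustration, not the actual cc interfaces:

  #include <cstddef>

  // Toy model of cc's timing history: records main-frame samples and
  // is cleared when a navigation commits.
  class TimingHistory {
   public:
    void AddMainFrameSample(double duration_ms) {
      last_sample_ms_ = duration_ms;
      ++sample_count_;
    }
    void Clear() { sample_count_ = 0; }  // call on navigation commit
    size_t sample_count() const { return sample_count_; }

   private:
    double last_sample_ms_ = 0;
    size_t sample_count_ = 0;
  };

  // Assumed minimum: one completed commit's worth of samples.
  constexpr size_t kMinSamplesForRecovery = 1;

  // With little or no history, the "main thread is fast" prediction is
  // untrustworthy, so never skip a main frame in that state. Clearing
  // the history on navigation makes reloads behave like cold starts.
  bool ShouldAttemptLatencyRecovery(const TimingHistory& history) {
    return history.sample_count() >= kMinSamplesForRecovery;
  }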
Project Member

Comment 9 by bugdroid1@chromium.org, Jan 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3

commit 2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3
Author: Khushal <khushalsagar@chromium.org>
Date: Tue Jan 09 19:25:03 2018

cc: Avoid main thread latency recovery during loading.

The scheduler maintains a timing history for past frames to predict
the duration for each stage in the frame pipeline. If the main
thread is assumed to be fast from previous history but misses the
deadline in a frame, we may skip a frame to reduce back pressure in the
pipeline and allow the main thread to recover.

The above strategy can create the following scenarios during page load:
1) During the first load when the scheduler has no history and main
frames most likely miss the deadline, we might delay sending the second
main frame.

2) During reload we still retain history from the previous navigation.
So it may take a few frames before the main thread is identified to be
slow and we don't attempt latency recovery.

Skipping main frames during loading can in particular be worse since a
lot of lower priority tasks are competing to run on the main thread and
the RendererScheduler on the main thread may run lower priority work if
no higher priority rendering tasks are pending.

This patch makes 2 changes to avoid the situation above:
1) Don't attempt latency recovery until at least one commit has occurred,
which is enough to have correct history estimates.

2) Clear the compositor's timing history on navigations, so we don't rely
on estimates from the previous navigation. This, combined with the change
above, also ensures consistent behaviour from the scheduler on reloads.

R=brianderson@chromium.org, sunnyps@chromium.org

Bug:  794302 
Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
Change-Id: I65bdc730baa916e988b94e42ca80f7ece8548cae
Reviewed-on: https://chromium-review.googlesource.com/842203
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Commit-Queue: Khushal <khushalsagar@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528065}
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/base/rolling_time_delta_history.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/compositor_timing_history.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/compositor_timing_history.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler_unittest.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/test/fake_proxy.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/layer_tree_host.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_impl.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_impl.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_main.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_main.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/single_thread_proxy.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/single_thread_proxy.h
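
For reference, a minimal sketch of the deadline prediction that gates latency recovery, as described in the commit message above. The names here are invented stand-ins; the real logic lives in cc/scheduler/scheduler.cc and cc/scheduler/compositor_timing_history.cc:

  #include <chrono>

  using Ms = std::chrono::duration<double, std::milli>;

  // Hypothetical estimates derived from the rolling timing history.
  struct FrameEstimates {
    Ms begin_main_frame_to_commit;  // predicted main-thread work
    Ms commit_to_activate;          // predicted activation cost
  };

  // Skip the next BeginMainFrame (latency recovery) only if history
  // predicts the pending update can still activate before the deadline.
  // With the patch above, this is not consulted until at least one
  // commit has populated the history.
  bool CanRecoverLatency(const FrameEstimates& est, Ms time_to_deadline) {
    return est.begin_main_frame_to_commit + est.commit_to_activate <
           time_to_deadline;
  }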

The Win and Linux graphs have recovered after the change above, but not the Mac ones; looking into it.
The reason for the regression is still the same, but the patch above didn't fix the delayed main frame. It's surprising that this is only on Mac. My guess is that maybe the notification that clears the timing history in cc isn't being dispatched correctly on Mac.

The code for that is here: https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?sq=package:chromium&l=4229, and it doesn't look platform-specific. I'll dig more to verify whether this is the issue.
Project Member

Comment 12 by 42576172...@developer.gserviceaccount.com, Jan 31 2018

๐Ÿ“ Pinpoint job started.
https://pinpoint-dot-chromeperf.appspot.com/job/16a9f456840000
Project Member

Comment 13 by 42576172...@developer.gserviceaccount.com, Jan 31 2018

๐Ÿ“ Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/16a9f456840000
Project Member

Comment 14 by 42576172...@developer.gserviceaccount.com, Jan 31 2018

๐Ÿ“ Pinpoint job started.
https://pinpoint-dot-chromeperf.appspot.com/job/15b2483a840000
๐Ÿ“ Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/15b2483a840000
๐Ÿ“ Pinpoint job started.
https://pinpoint-dot-chromeperf.appspot.com/job/16e0fafa840000
๐Ÿ“ Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/16e0fafa840000
๐Ÿ“ Pinpoint job started.
https://pinpoint-dot-chromeperf.appspot.com/job/12f61716840000
Project Member

Comment 20 by bugdroid1@chromium.org, Feb 2 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02

commit f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02
Author: Khushal <khushalsagar@chromium.org>
Date: Fri Feb 02 01:47:09 2018

cc: Ensure main frame not skipped on navigations.

Compositor's timing history is cleared on navigations to ensure that the
data from the previous navigation does not affect the next load, in
particular main thread latency recovery. However, if the commit which
resets this data occurs during the commit, the decision has already
been made at the beginning of the impl frame and is not reset. Fix this.

R=brianderson@chromium.org

Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
Change-Id: I4a045570e8622da4a0ff4a3fd98b75e8b745ff98
Bug:  794302 
Reviewed-on: https://chromium-review.googlesource.com/896226
Commit-Queue: Khushal <khushalsagar@chromium.org>
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533908}
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/compositor_timing_history.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_state_machine.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_state_machine.h
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_unittest.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/trees/layer_tree_host.cc
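
A minimal sketch of the fix in this commit, assuming the skip decision is latched at the start of the impl frame; the class and member names are invented, not the real cc/scheduler/scheduler_state_machine.cc interfaces:

  // Toy model of the scheduler state machine's skip decision.
  class SchedulerStateMachine {
   public:
    // Decided once at the start of each impl frame from timing history.
    void OnBeginImplFrame(bool history_predicts_fast_main_thread) {
      skip_next_begin_main_frame_ = history_predicts_fast_main_thread;
    }

    // Called when a navigation clears the timing history. Before this
    // fix, a skip decision made earlier in the same impl frame survived
    // the clear; now the latched decision is reset as well.
    void OnTimingHistoryCleared() { skip_next_begin_main_frame_ = false; }

    bool ShouldSendBeginMainFrame() const {
      return !skip_next_begin_main_frame_;
    }

   private:
    bool skip_next_begin_main_frame_ = false;
  };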

Status: Fixed (was: Assigned)
Looks like the remaining Mac graphs recovered before the change above landed; that's plausible, since the cause of the regression is also sensitive to the decision-making in main-thread scheduling. Nothing more to be done; the changes here address the root cause of the regression.
