30.6%-60.1% regression in loading.desktop at 522477:522704
Issue description: See the link to graphs below.
Dec 12 2017
Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/14e2b6ca040000
Dec 13 2017
Found a significant difference after 1 commit. https://pinpoint-dot-chromeperf.appspot.com/job/14e2b6ca040000
cc: Alter duration used in prioritizing impl latency in Scheduler.
By khushalsagar@chromium.org · Thu Dec 07 22:46:18 2017
chromium @ ee432e360c202a47eaf132b33d7c88155f52b6d6
Understanding performance regressions: http://g.co/ChromePerformanceRegressions
Dec 13 2017
This regression is actually indicative of an improvement that allows us to schedule more work on the renderer main thread. The change in #3 is intended to improve the data used for prioritizing impl thread latency. If the scheduler prioritizes impl latency over the main thread, we draw before the main thread can respond. If the main thread is actually fast, this can block the frame pipeline in the renderer, which means the main thread waits on the commit event being signaled from the compositor for a longer duration and cannot schedule any other work. The change is working as intended, and as a result there are fewer periods where the main thread is waiting. Looking at the wall and CPU duration for MessageLoop::RunTask on the renderer main thread, it's 5965 ms of wall duration before the change and 3417 ms after. The number of tasks running increases from 792 to 1365, while the average CPU duration is now lower (3 ms to 1.7 ms).
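(A minimal C++ sketch of the feedback loop described above; all names and the 16 ms budget are illustrative, not the actual cc::Scheduler code:)

    // Sketch only. Before the culprit CL, the main thread's duration
    // estimate included time it spent blocked waiting on the commit,
    // which made a fast main thread look slow and biased the scheduler
    // toward prioritizing impl latency -- blocking the main thread longer.
    #include <chrono>

    using Duration = std::chrono::duration<double, std::milli>;

    struct MainThreadEstimate {
      Duration work;               // time spent doing main-frame work
      Duration blocked_on_commit;  // time spent waiting on the compositor

      Duration OldEstimate() const { return work + blocked_on_commit; }
      Duration NewEstimate() const { return work; }
    };

    // If the main thread is predicted to miss the frame budget, the
    // scheduler prioritizes impl-thread latency and draws without
    // waiting for the main frame.
    bool ShouldPrioritizeImplLatency(Duration main_thread_estimate,
                                     Duration frame_budget) {
      return main_thread_estimate > frame_budget;
    }

With, say, 5 ms of work and 12 ms blocked on commit against a 16 ms budget, the old estimate (17 ms) flips the decision toward impl latency while the new one (5 ms) does not, which is consistent with the RunTask numbers above.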
Dec 14 2017
Hey Ned - I don't understand the tradeoff being described in comment #4. Can you verify that the way this metric works is consistent with what's described in comment #4?
Dec 15 2017
Wait, I don't understand yet how you justify the huge regression in timeToFirstMeaningfulPaint_avg. Regressing that metric means users will see pages load slower. I am not convinced yet that this is a good thing for end users. My suggestion: revert your CL, guard it behind a Finch flag, and analyze its effect on UMA heartbeat metrics to see if we can justify the tradeoff here.
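(For reference, guarding the change behind a Finch flag would look roughly like this; the feature name is hypothetical:)

    #include "base/feature_list.h"

    // Hypothetical flag (the name is made up) so the new duration
    // accounting could be ramped up via Finch and compared on UMA
    // heartbeat metrics.
    const base::Feature kAdjustedImplLatencyDuration{
        "AdjustedImplLatencyDuration", base::FEATURE_DISABLED_BY_DEFAULT};

    // Call sites would fall back to the old duration when disabled.
    bool UseAdjustedImplLatencyDuration() {
      return base::FeatureList::IsEnabled(kAdjustedImplLatencyDuration);
    }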
Dec 20 2017
I dug into this a bit more and IIUC, here is the sequence of events leading to this regression.

1) The scheduler has a latency recovery mode: if we have an update pending to be flushed (a pending tree or an undrawn active tree) and the timing history tells us that the main thread is fast, we skip the next main frame to allow the pipeline to stabilize. This mode is decided based on whether we can activate before the deadline, here: https://cs.chromium.org/chromium/src/cc/scheduler/scheduler.cc?l=467.

2) Before this change, since the duration for which the main thread was blocked was included in our prediction of whether the main thread is fast, we always put it in high latency and never attempted recovery. This in turn meant that for the load with a warm cache, the timing history was saturated with samples that consider the main thread to be slow, because a reload reuses the same RenderWidget/compositor.

3) After this change, before the warm load starts, the main thread stabilizes and maintains a history of fast main frames in cc's timing history. It's easy to see this in the traces on the bot. This means that when the second load starts and the very first main frame takes too long, we decide to skip the next main frame to recover latency. A side effect of this is that the main thread schedules some background parsing work, since no high-priority work is immediately pending, which further delays the next main frame.

Another interesting thing is that the recovery mode above also kicks in during cold start, irrespective of this change, because in that case we assume the main thread to be fast since we have no history of main frames.

Reverting this change is definitely not the correct answer; that would simply avoid latency recovery altogether, which unnecessarily keeps the main thread blocked across all frames. Here is an idea: we should first avoid latency recovery when cc has no timing history (a heuristic for a minimum number of samples?). That would avoid the case of skipping the main frame on cold start as well. And then we would want to clear the timing history in cc on navigations, so a second load does not rely on the history from a previous navigation. Brian, wdyt? +sunnyps/vmiura for thoughts too.
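(To make the proposal concrete, here is a minimal C++ sketch of that heuristic; the class, method names, and sample threshold are invented for illustration, not taken from cc:)

    // Sketch of the proposed heuristic.
    #include <chrono>
    #include <cstddef>
    #include <deque>

    class MainFrameTimingHistory {
     public:
      void AddSample(std::chrono::microseconds duration) {
        samples_.push_back(duration);
      }
      // Proposed: drop samples on navigation so a reload does not inherit
      // the previous page's notion of a "fast" main thread.
      void ClearOnNavigation() { samples_.clear(); }
      std::size_t sample_count() const { return samples_.size(); }

     private:
      std::deque<std::chrono::microseconds> samples_;
    };

    // Proposed guard: only skip the next main frame for latency recovery
    // once there is enough history to trust the "main thread is fast"
    // prediction. This covers cold start and just-after-navigation alike.
    bool ShouldAttemptLatencyRecovery(const MainFrameTimingHistory& history,
                                      bool can_activate_before_deadline) {
      constexpr std::size_t kMinSamplesForRecovery = 8;  // hypothetical
      if (history.sample_count() < kMinSamplesForRecovery)
        return false;
      return can_activate_before_deadline;
    }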
Dec 22 2017
https://chromium-review.googlesource.com/c/chromium/src/+/842203 should address this.
Jan 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3

commit 2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3
Author: Khushal <khushalsagar@chromium.org>
Date: Tue Jan 09 19:25:03 2018

cc: Avoid main thread latency recovery during loading.

The scheduler maintains a timing history for past frames to predict the duration of each stage in the frame pipeline. If the main thread is assumed to be fast from previous history but misses the deadline in a frame, we may skip a frame to reduce back pressure in the pipeline and allow the main thread to recover.

The above strategy can create the following scenarios during page load:
1) During the first load, when the scheduler has no history and main frames most likely miss the deadline, we might delay sending the second main frame.
2) During reload we still retain history from the previous navigation, so it may take a few frames before the main thread is identified as slow and latency recovery stops being attempted.

Skipping main frames during loading can be particularly bad, since a lot of lower-priority tasks are competing to run on the main thread and the RendererScheduler on the main thread may run lower-priority work if no higher-priority rendering tasks are pending.

This patch makes 2 changes to avoid the situation above:
1) Don't attempt latency recovery until at least one commit has occurred, which is enough to have correct history estimates.
2) Clear the compositor's timing history on navigations, so we don't rely on estimates from the previous navigation. Combined with the change above, this also ensures consistent behaviour from the scheduler on reloads.

R=brianderson@chromium.org, sunnyps@chromium.org
Bug: 794302
Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
Change-Id: I65bdc730baa916e988b94e42ca80f7ece8548cae
Reviewed-on: https://chromium-review.googlesource.com/842203
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Commit-Queue: Khushal <khushalsagar@chromium.org>
Cr-Commit-Position: refs/heads/master@{#528065}

[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/base/rolling_time_delta_history.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/compositor_timing_history.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/compositor_timing_history.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/scheduler/scheduler_unittest.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/test/fake_proxy.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/layer_tree_host.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_impl.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_impl.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_main.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/proxy_main.h
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/single_thread_proxy.cc
[modify] https://crrev.com/2f9cdf26e68f1fc58a978216b0e4d97eb660a2d3/cc/trees/single_thread_proxy.h
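(Paraphrasing the two changes in this commit as a minimal sketch; the class and member names are invented, and this is not the actual diff:)

    // Sketch of the two changes, not the actual diff.
    class CompositorTimingHistoryModel {
     public:
      // Change 1: one commit is enough to have usable estimates, so track
      // whether a commit has happened since the last navigation and gate
      // latency recovery on it.
      void DidCommit() { did_commit_since_navigation_ = true; }

      // Change 2: navigations clear the estimates and reset the flag, so
      // reloads behave like first loads.
      void ClearHistoryOnNavigation() {
        did_commit_since_navigation_ = false;
        // ...the rolling duration histories would be cleared here too...
      }

      bool CanAttemptLatencyRecovery() const {
        return did_commit_since_navigation_;
      }

     private:
      bool did_commit_since_navigation_ = false;
    };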
Jan 11 2018
The Win and Linux graphs have recovered after the change above, but not Mac; looking into it.
Jan 12 2018
The reason for the regression is still the same, but the patch above didn't fix the delayed main frame. It's surprising that this happens only on Mac. My guess is that maybe the notification that clears the timing history in cc isn't being dispatched correctly on Mac. The code for that is here: https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?sq=package:chromium&l=4229, and it doesn't look platform-specific. I'll dig more to verify whether this is the issue.
Jan 31 2018
Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/16a9f456840000
Jan 31 2018
Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/16a9f456840000
Jan 31 2018
Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/15b2483a840000
Feb 1 2018
Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/15b2483a840000
Feb 1 2018
Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/16e0fafa840000
Feb 1 2018
Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/16e0fafa840000
Feb 1 2018
Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/12f61716840000
Feb 1 2018
https://chromium-review.googlesource.com/c/chromium/src/+/896226 should get the remaining ones.
Feb 2 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02

commit f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02
Author: Khushal <khushalsagar@chromium.org>
Date: Fri Feb 02 01:47:09 2018

cc: Ensure main frame not skipped on navigations.

The compositor's timing history is cleared on navigations to ensure that data from the previous navigation does not affect the next load, in particular main thread latency recovery. However, if the commit which resets this data occurs in the middle of an impl frame, the decision to skip the main frame has already been made at the beginning of that impl frame and is not reset. Fix this.

R=brianderson@chromium.org
Cq-Include-Trybots: master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
Change-Id: I4a045570e8622da4a0ff4a3fd98b75e8b745ff98
Bug: 794302
Reviewed-on: https://chromium-review.googlesource.com/896226
Commit-Queue: Khushal <khushalsagar@chromium.org>
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533908}

[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/compositor_timing_history.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_state_machine.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_state_machine.h
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/scheduler/scheduler_unittest.cc
[modify] https://crrev.com/f60bbc668e29986e3f2dcc1f8fdfdbaba19deb02/cc/trees/layer_tree_host.cc
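(A minimal sketch of this fix; the names are invented and this is not the actual diff, which touches scheduler_state_machine.cc:)

    // Sketch of the fix, not the actual diff.
    class SchedulerStateMachineModel {
     public:
      // Latency recovery may decide at the start of the impl frame,
      // based on the timing estimates, to skip the next BeginMainFrame.
      void SetSkipNextBeginMainFrameToReduceLatency() {
        skip_next_begin_main_frame_to_reduce_latency_ = true;
      }

      // The fix: when a commit clears the timing history mid-frame, also
      // revoke the skip decision derived from the now-stale estimates.
      void DidClearHistoryMidFrame() {
        skip_next_begin_main_frame_to_reduce_latency_ = false;
      }

      bool ShouldSendBeginMainFrame() const {
        return !skip_next_begin_main_frame_to_reduce_latency_;
      }

     private:
      bool skip_next_begin_main_frame_to_reduce_latency_ = false;
    };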
Feb 3 2018
Looks like the remaining Mac graphs recovered before the change landed; that's possible, since the cause of the regression is sensitive to the decision making in main thread scheduling as well. Nothing more to be done; the changes here address the root cause of the regression.