New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Frame metadata latency info has 150k entries, causing Surface::TakeLatencyInfo to spin CPU at 100%

Project Member Reported by erikc...@chromium.org, Apr 18 2018

Issue description

Version 68.0.3397.0 (Official Build) canary (32-bit)

Siggi is noticing his Browser spinning at 100% CPU. This has been happening the last couple of days. The trace shows that browser main thread is spending all its time in Surface::ActivateFrame. We didn't have a good mechanism to sample the process, so Siggi attached a debugger, paused the process, checked the stack of the main thread, and then resumed the process. He repeated this several times and always saw the same code on the stack, so this suggests that we're spinning in the same function.


00 ntdll_770f0000!RtlFreeHeap
01 chrome!base::allocator::WinHeapFree
02 chrome!`anonymous namespace'::DefaultWinHeapFreeImpl
03 chrome!heap_profiling::`anonymous namespace'::HookFree
04 chrome!ShimFree
05 chrome!free
06 chrome!std::_Deallocate
07 chrome!std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> >::deallocate
08 chrome!std::vector<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> > >::_Tidy
09 chrome!std::vector<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> > >::~vector
0a chrome!base::internal::flat_tree<std::pair<ui::LatencyComponentType,long long>,std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,base::internal::GetKeyFromValuePairFirst<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::less<void> >::Impl::~Impl
0b chrome!base::internal::flat_tree<std::pair<ui::LatencyComponentType,long long>,std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,base::internal::GetKeyFromValuePairFirst<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::less<void> >::~flat_tree
0c chrome!ui::LatencyInfo::~LatencyInfo
0d chrome!std::_Default_allocator_traits<std::allocator<ui::LatencyInfo> >::destroy
0e chrome!std::_Destroy_range1
0f chrome!std::_Destroy_range
10 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::_Destroy
11 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::_Change_array
12 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::emplace_back<const ui::LatencyInfo &>
13 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::push_back
14 chrome!std::back_insert_iterator<std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> > >::operator=
15 chrome!std::_Copy_unchecked1
16 chrome!std::_Copy_unchecked
17 chrome!std::_Copy_no_deprecate
18 chrome!std::copy
19 chrome!viz::Surface::TakeLatencyInfoFromFrame
1a chrome!viz::Surface::TakeLatencyInfo
1b chrome!viz::Surface::ActivateFrame
1c chrome!viz::Surface::QueueFrame
1d chrome!viz::CompositorFrameSinkSupport::MaybeSubmitCompositorFrame
1e chrome!viz::CompositorFrameSinkSupport::SubmitCompositorFrame
1f chrome!content::RenderWidgetHostViewChildFrame::SubmitCompositorFrame
20 chrome!content::RenderWidgetHostImpl::SubmitCompositorFrame
21 chrome!viz::mojom::CompositorFrameSinkStubDispatch::Accept
22 chrome!viz::mojom::CompositorFrameSinkStub<mojo::RawPtrImplRefTraits<viz::mojom::CompositorFrameSink> >::Accept
23 chrome!mojo::internal::MultiplexRouter::ProcessIncomingMessage
24 chrome!mojo::internal::MultiplexRouter::Accept
25 chrome!mojo::Connector::ReadSingleMessage
26 chrome!mojo::Connector::ReadAllAvailableMessages
27 chrome!mojo::Connector::OnHandleReadyInternal
28 chrome!base::internal::FunctorTraits<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),void>::Invoke
29 chrome!base::internal::InvokeHelper<0,void>::MakeItSo
2a chrome!base::internal::Invoker<base::internal::BindState<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),std::unique_ptr<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>,std::default_delete<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &> > > >,void (const SkBitmap &)>::RunImpl
2b chrome!base::internal::Invoker<base::internal::BindState<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),std::unique_ptr<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>,std::default_delete<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &> > > >,void (const SkBitmap &)>::Run
2c chrome!base::RepeatingCallback<void (unsigned int)>::Run
2d chrome!mojo::SimpleWatcher::DiscardReadyState
2e chrome!base::internal::FunctorTraits<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),void>::Invoke
2f chrome!base::internal::InvokeHelper<0,void>::MakeItSo
30 chrome!base::internal::Invoker<base::internal::BindState<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),InlineLoginHandlerImpl::FinishCompleteLoginParams>,void (Profile *, Profile::CreateStatus)>::RunImpl
31 chrome!base::internal::Invoker<base::internal::BindState<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),InlineLoginHandlerImpl::FinishCompleteLoginParams>,void (Profile *, Profile::CreateStatus)>::Run
32 chrome!base::RepeatingCallback<void (unsigned int, const mojo::HandleSignalsState &)>::Run
33 chrome!mojo::SimpleWatcher::OnHandleReady
34 chrome!base::internal::FunctorTraits<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),void>::Invoke
35 chrome!base::internal::InvokeHelper<1,void>::MakeItSo
36 chrome!base::internal::Invoker<base::internal::BindState<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),base::WeakPtr<mojo::SimpleWatcher>,int,unsigned int,mojo::HandleSignalsState>,void ()>::RunImpl
37 chrome!base::internal::Invoker<base::internal::BindState<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),base::WeakPtr<mojo::SimpleWatcher>,int,unsigned int,mojo::HandleSignalsState>,void ()>::Run
38 chrome!base::OnceCallback<void ()>::Run
39 chrome!base::debug::TaskAnnotator::RunTask
3a chrome!base::internal::IncomingTaskQueue::RunTask
3b chrome!base::MessageLoop::RunTask
3c chrome!base::MessageLoop::DeferOrRunPendingTask
3d chrome!base::MessageLoop::DoWork
3e chrome!base::MessagePumpForUI::DoRunLoop
3f chrome!base::MessagePumpWin::Run
40 chrome!base::MessageLoop::Run
41 chrome!base::RunLoop::Run
42 chrome!ChromeBrowserMainParts::MainMessageLoopRun
43 chrome!content::BrowserMainLoop::RunMainMessageLoopParts
44 chrome!content::BrowserMainRunnerImpl::Run
45 chrome!content::BrowserMain
46 chrome!content::RunNamedProcessTypeMain
47 chrome!content::ContentMainRunnerImpl::Run
48 chrome!service_manager::Main
49 chrome!content::ContentMain
4a chrome!ChromeMain
4b chrome_exe!MainDllLoader::Launch
4c chrome_exe!wWinMain
4d chrome_exe!invoke_main
4e chrome_exe!__scrt_common_main_seh
4f KERNEL32!BaseThreadInitThunk
50 ntdll_770f0000!__RtlUserThreadStart
51 ntdll_770f0000!_RtlUserThreadStart


Aside: We are profiling heap allocations on siggi's machine [as evidenced by the top of the stack], but this uses sampling [so only profiles a very small number of allocations] and has low overhead. I've confirmed via Task Manager that the Profiling Service is happily chugging along without any issues.

My guess is that somehow, the latency info vector is getting a very large number of entires, causing:
"""
  std::copy(frame->metadata.latency_info.begin(),
            frame->metadata.latency_info.end(),
            std::back_inserter(*latency_info));
"""

to take ~150ms a pop. Luckily, we had heap profiling enabled. So I was able to confirm that the vector has ~150k entries.
 
trace_windows_cpu_100.json.gz
1.1 MB Download
Screen Shot 2018-04-18 at 3.17.11 PM.png
48.8 KB View Download
Screen Shot 2018-04-18 at 3.16.54 PM.png
358 KB View Download

Comment 1 by danakj@chromium.org, Apr 18 2018

Cc: ericrk@chromium.org tdres...@chromium.org
For reference, https://bugs.chromium.org/p/chromium/issues/detail?id=732979 was another case of over-collecting LatencyInfos

Comment 2 by danakj@chromium.org, Apr 18 2018

In this case, it's the renderer compositor submitting 150k LatencyInfos it sounds like, as content::RenderWidgetHostImpl::SubmitCompositorFrame is in that stack.

Comment 3 by danakj@chromium.org, Apr 18 2018

Actually, erikchen:

  std::copy(frame->metadata.latency_info.begin(),
            frame->metadata.latency_info.end(),
            std::back_inserter(*latency_info));

Which vector was 150k here. The |*latency_info|, or the |frame->metadata.latency_info|

Comment 4 by piman@chromium.org, Apr 18 2018

Cc: kylec...@chromium.org fsam...@chromium.org
#3 is on point... In particular I'm not quite sure I follow how accumulated LatencyInfos are ever cleared - Surface::ActivateFrame seems to accumulate from the previous frame to the new frame.
I've attached the symbolized heap dump - perhaps you can discern more from it than me. I can't tell whether it's the old or new vector. 
trace_with_heap_dump.json.gz
273 KB Download
Cc: sadrul@chromium.org
#4 Once the surface gets aggregated for display the LatencyInfos get copied into the display CompositorFrame where they're processed next.

https://cs.chromium.org/chromium/src/components/viz/service/display/surface_aggregator.cc?l=1238&rcl=4b87d1beadcd90546ea8e9590bc64561f06c1534

Comment 7 by danakj@chromium.org, Apr 18 2018

A surface could be offscreen and never aggregated with OOPIF, right, but keep submitting frames?
I added a log for the |latency_info| size in Surface::TakeLatencyInfo() and ran with --site-per-process. One of them is steadily climbing, so it appears that's the issue.

Comment 9 by piman@chromium.org, Apr 18 2018

Components: -Internals>GPU Internals>Services>Viz
Owner: kylec...@chromium.org
Status: Assigned (was: Untriaged)
Owner: sadrul@chromium.org
Beyond knowing where LatencyInfo copying happens I don't know what the right fix is. I would say just drop the anything more than N LatencyInfos, but that will probably break telemetry. sadrul you have a better understanding, can you take a look?
So I think the root cause of the problem is uncovering why a surface could be offscreen (never contributing to aggregation), and yet submitting CompositorFrames? Any ideas Kyle? I don't know if I'll get to this today but maybe tomorrow if you're stumped. 

I do think we should add a limit on LatencyInfo too though.

Comment 12 by siggi@chromium.org, Apr 19 2018

This appears to be something that grows over time - my browser runs normally after a restart. After ~6 hours it gets to noticeable slowdown, and after a couple of days it's a total dog.
Cc: dcheng@chromium.org nasko@chromium.org creis@chromium.org
Components: Internals>Sandbox>SiteIsolation
Adding some site isolation folks just so they're also aware of the problem.
How does it work if there is an OOPIF just outside the viewport. We still rasterize and submit a CompositorFrame for the OOPIF, along with the embedder including a SurfaceDrawQuad, to avoid checkerboarding on scroll right? At some point, SurfaceAggregator would drop the OOPIF SurfaceDrawQuad from the display CompositorFrame because it's not necessary. Could that cause this?

Alternatively an OOPIF or extension could be submitting CompositorFrames when they're not supposed to.
I kind of wonder about this block of code:

https://cs.chromium.org/chromium/src/components/viz/service/surfaces/surface_manager.cc?type=cs&sq=package:chromium&l=551

We run the draw callback even if the damage is not visible.

Should we be doing that? Maybe we can defer the DidReceiveCompositorFrameAck callback until the surface becomes reachable from the root?

We'll stick tick BeginFrames but we won't accept any more CompositorFrames UNTIL it becomes reachable.

It's not clear to me why what we're doing now is or isn't the right thing to do here.
*still tick
If we block a surface just outside the viewport it will be stale when it scrolls into view, as kyle was pointing out
Well, we won't checkerboard...we'll have a frame but yes, it'll be stale. The cool thing about DidReceiveCompositorFrameAck, is the new frame is ready in the renderer, right? We've already done most of the work, so we'll likely have just one stale frame.

I wonder if that's good enough? 

Maybe surface sync can help here by blocking the parent frame for a BeginFrame?

I guess in the short term we should just limit the number of entries of latency info or change how latency info propagation works.

Longer term, it would be nice to see if we can avoid pumping out frames if we don't need them.

I think adding a limit to how many LatencyInfo we accumulate should be OK, and not break telemetry tests. But I agree with fsamuel@ that it is still very important to understand why it keeps accumulating the way it does currently. Presumably, if the oopif is offscreen, then it wouldn't be receiving any input events (nor screenshot requests). So I don't understand what is adding the LatencyInfo instances to the oopif's compositor-frame.
Cc: briander...@chromium.org
+brianderson@ who might have ideas/thoughts.

Comment 21 by grt@chromium.org, Apr 20 2018

I'm seeing this, too. I happen to have a browser window in the state described in  issue 832538 . Is it possible that fixing that (by reverting r548851) will magically resolve this?

Comment 22 by mrpi@google.com, Apr 23 2018

I have a similar experience with comment 12, although I don't know if it's the same issue.

https://bugs.chromium.org/p/chromium/issues/detail?id=823439

Just to make sure, does anyone get 100% CPU usage on the "Browser" task when looking at the task manager? Does ending/killing the "GPU Process" task somehow magically reset the situation (instead of having to restart the entire browser)?


re comment #19: it looks like we do always include a LatencyInfo to a frame even if there was no copy-request/input-event involved in that frame [1]. 

[1] https://cs.chromium.org/chromium/src/cc/trees/proxy_main.cc?sq=package:chromium&dr=Ss&l=306

Comment 24 by woxxom@gmail.com, Apr 25 2018

 Issue 827788  looks similar judging by the recently added trace.
Status: Started (was: Assigned)
CL to make sure we don't accumulate more than a limited (100) number of LatencyInfos: https://chromium-review.googlesource.com/c/chromium/src/+/1027432

I would still like to understand the behaviour better though. I was unable to repro with local test cases. kylechar@ what test page were you using when you saw LatencyInfo numbers increasing?
$ ./chrome --site-per-process www.slashdot.org

I can reproduce pretty consistently if I just do nothing and let the page load.
Slashdot: the ultimate site isolation test. It’s never let me down. 

Comment 28 by creis@chromium.org, Apr 26 2018

Labels: Target-67
Comments 25-27: Glad you've got a lead on a repro and potential fix (at least for the short term)!  Looks like it might be simple enough to merge to M67 as well?  This would be great to resolve if possible.  

(Also, agreed on understanding the root problem if we can.)
Project Member

Comment 29 by bugdroid1@chromium.org, Apr 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/292bb1c376b1361a30f16cc0ebe63685551ec005

commit 292bb1c376b1361a30f16cc0ebe63685551ec005
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Fri Apr 27 05:44:21 2018

viz: Limit the amount of LatencyInfo we accumulate.

It looks like some clients (OOPIFs, mostly like) are submitting compositor
frames to viz, but they are not drawn on screen. As a result, the display
compositor keeps accumulating the LatencyInfo associated with the frames.
This can grow unbounded. Impose a limit of at most 100 such LatencyInfos
being accumulated for a single client.

BUG= 834421 

Change-Id: I144db30ef2264ea4e9e1b952fe7b1cbcc53ebc46
Reviewed-on: https://chromium-review.googlesource.com/1027432
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Commit-Queue: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Commit-Position: refs/heads/master@{#554321}
[modify] https://crrev.com/292bb1c376b1361a30f16cc0ebe63685551ec005/components/viz/service/surfaces/surface.cc

Comment 30 by creis@chromium.org, Apr 27 2018

Cc: gov...@chromium.org
NextAction: 2018-04-30
Thanks!  If you're able to confirm that it avoids the 100% CPU usage in Monday's Canary (and confirm that it's a safe fix to merge), would you be able to request merge to M67?  The cutoff for the next beta build is Monday at 4 pm Pacific.

Much appreciated-- this could be a big performance improvement for users, depending on how often the problem was happening in practice.
The NextAction date has arrived: 2018-04-30
Cc: jmukthavaram@chromium.org
Labels: Needs-Feedback
Tested this issue on Windows 10 using chrome-68.0.3415.0 & reported version-68.0.3397.0  as per C#0 & C#26. 
Steps followed:
--------------
1. Launched chrome with --site-per-process www.slashdot.org through command line
2. Navigated to dev tools
3. network-> start recording & observed the info
4. Right click on time ->Waterfal & select latencey

But unable to find the exact issue . Please find the attached screencast for reference & let us know if we miss any steps to verify this issue from our end.

Thanks..!


834421-Win.mp4
13.3 MB View Download

Comment 33 by creis@chromium.org, Apr 30 2018

sadrul: Can you verify if the fix is working as intended, and request merge to M67 if so?  The deadline for this week's build is today at 4 pm Pacific.

I tried to verify myself, but I wasn't seeing the 100% CPU usage on Windows Dev Channel (68.0.3409.2) without the fix when visiting https://slashdot.org/ with --site-per-process.  Maybe you can post more specific repro steps?

Thanks!
For repro steps I did the following:
1. Add logging in Surface::TakeLatencyInfo() to print the size of the vector after "taking" the LatencyInfos.
2. Run Chrome with --site-per-process.
3. Load a website that has lots of ads in OOPIFs.
4. See if that std::vector<LatencyInfo> continues to grow after the page is fully loaded. Don't do anything if in that case. Otherwise restart the browser and try again. I had to experiment with different browser window sizes, different websites and different initial positions on the website to reproduce initially.

I saw the std::vector<LatencyInfo> size hit 3000 after a couple of minutes in one case on slashdot and was able to reproduce similar results a few more times. It would be a while (hours?) before that started to have a big performance impact though.

Trying again the next day I got something similar after 50 tries but accidentally scrolled. Maybe it's dependent on what ads are being loaded?

Comment 35 by creis@chromium.org, Apr 30 2018

Thanks.  I've left it running for most of the morning and don't see a difference between Dev and Canary so far.  I don't suppose there's something I could check in chrome://histograms or a trace which would be roughly equivalent to your log statements?

If not, we may just want to request merge after ensuring there aren't any regressions.  It seems the fix is understood and simple, and could have a huge benefit to long-running sessions in M67 when Site Isolation is enabled.
Labels: Merge-Request-67
Project Member

Comment 37 by sheriffbot@chromium.org, Apr 30 2018

Labels: -Merge-Request-67 Merge-Review-67 Hotlist-Merge-Review
This bug requires manual review: M67 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: wittman@chromium.org
At least according to the sampling-profiler output [1], it does look like there is a drop in time-spent in copying ui::LatencyInfos during ActivateFrame().

[1] https://uma.googleplex.com/p/chrome/callstacks/?sid=eb7e27dc9617472082c51fa86adaabea
Labels: -Merge-Review-67 Merge-Approved-67
Approving merge to M67 branch 3396 based on comment #35 and per offline chat with creis@. Pls merge ASAP. Thank you.
Project Member

Comment 40 by bugdroid1@chromium.org, Apr 30 2018

Labels: -merge-approved-67 merge-merged-3396
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4ceb626ece56609d84dab12a6a65deaa2a325031

commit 4ceb626ece56609d84dab12a6a65deaa2a325031
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Mon Apr 30 22:31:33 2018

viz: Limit the amount of LatencyInfo we accumulate.

It looks like some clients (OOPIFs, mostly like) are submitting compositor
frames to viz, but they are not drawn on screen. As a result, the display
compositor keeps accumulating the LatencyInfo associated with the frames.
This can grow unbounded. Impose a limit of at most 100 such LatencyInfos
being accumulated for a single client.

BUG= 834421 

Change-Id: I144db30ef2264ea4e9e1b952fe7b1cbcc53ebc46
Reviewed-on: https://chromium-review.googlesource.com/1027432
Reviewed-by: Brian Anderson <brianderson@chromium.org>
Commit-Queue: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#554321}(cherry picked from commit 292bb1c376b1361a30f16cc0ebe63685551ec005)
Reviewed-on: https://chromium-review.googlesource.com/1036483
Reviewed-by: Charlie Reis <creis@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#398}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/4ceb626ece56609d84dab12a6a65deaa2a325031/components/viz/service/surfaces/surface.cc

Comment 41 by creis@chromium.org, Apr 30 2018

Cc: sindhu.chelamcherla@chromium.org
 Issue 827788  has been merged into this issue.
Is this supposed to be fixed in 68.0.3415.0? After a day's uptime, my browser is still chewing up a full core.
Is this still caused by LatencyInfo? I’m happy to dig into the root cause if you have repro steps. 
Yeah, looks like this should have landed in time for 68.0.3412.0. See https://storage.googleapis.com/chromium-find-releases-static/292.html#292bb1c376b1361a30f16cc0ebe63685551ec005, but whatever was spinning earlier this morning has let up now that I have a debugger attached.
Once you have reliable repro steps, please share and we'll continue investigation! :-)
chrome:tracing may also include useful info (especially if it's still related to ui::LatencyInfo)
Labels: OS-Linux
siggi@: Can you post a trace?  We just got a similar report on Linux in issue 823439 which shows that Surface::ActivateFrame is still a problem in 67.0.3396.30.

Sadrul: Would you be able to look again to see if there's another reason these are taking up all of the CPU?  It seems potentially important to fix in M67 if possible.  Thanks!
I don't have a repro, except to note that this seemed to be accumulating over time. Restarting the browser would clear it, then a few hours later it'd start to get annoying.
Owner: fsam...@chromium.org
Picking this up for further investigation. 
Cc: w...@chromium.org
This is happening again for me on:

Version 68.0.3425.1 (Official Build) canary-dcheck (32-bit). 

This is after running that build for about 3 hours, reading mail, watching YouTube, etc.

Note that this is an official release build we ship with dchecks_always_on=true, so it's conceivable that this is some form of debug-only behavior. As this is a shipping build, I'd still argue that this should be fixed.

Here's a canonical stack trace:

chrome.dll!`anonymous namespace'::DefaultWinHeapFreeImpl+0xe
chrome.dll!heap_profiling::`anonymous namespace'::HookFree+0x2e
chrome.dll!free+0x11
chrome.dll!std::vector<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> > >::_Tidy+0x49
chrome.dll!ui::LatencyInfo::~LatencyInfo+0xe
chrome.dll!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::emplace_back<const ui::LatencyInfo &>+0xf7
chrome.dll!viz::Surface::TakeLatencyInfoFromFrame+0x48
chrome.dll!viz::Surface::ActivateFrame+0x17a
chrome.dll!viz::Surface::QueueFrame+0x5e4
chrome.dll!viz::CompositorFrameSinkSupport::MaybeSubmitCompositorFrame+0xa6a
chrome.dll!viz::CompositorFrameSinkSupport::SubmitCompositorFrame+0x59
chrome.dll!content::RenderWidgetHostViewChildFrame::SubmitCompositorFrame+0xf2
chrome.dll!content::RenderWidgetHostImpl::SubmitCompositorFrame+0x309
chrome.dll!viz::mojom::CompositorFrameSinkStubDispatch::Accept+0x578
chrome.dll!viz::mojom::CompositorFrameSinkStub<mojo::RawPtrImplRefTraits<viz::mojom::CompositorFrameSink> >::Accept+0x13
chrome.dll!mojo::InterfaceEndpointClient::HandleValidatedMessage+0x221
chrome.dll!mojo::FilterChain::Accept+0x87
chrome.dll!mojo::InterfaceEndpointClient::HandleIncomingMessage+0x6e
chrome.dll!mojo::internal::MultiplexRouter::ProcessIncomingMessage+0x2b9
chrome.dll!mojo::internal::MultiplexRouter::Accept+0x151
chrome.dll!mojo::FilterChain::Accept+0x87
chrome.dll!mojo::Connector::ReadSingleMessage+0x154
chrome.dll!mojo::Connector::ReadAllAvailableMessages+0x57
chrome.dll!mojo::Connector::OnHandleReadyInternal+0x82
chrome.dll!base::internal::Invoker<base::internal::BindState<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),std::unique_ptr<base::internal::AdaptCallbackForRepeatingHelper<const SkBi
chrome.dll!mojo::SimpleWatcher::DiscardReadyState+0x18
chrome.dll!base::internal::Invoker<base::internal::BindState<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),InlineLoginHandlerImpl::FinishCompleteLoginParams>,void (Profile *, Profile::CreateStatus)>:
chrome.dll!mojo::SimpleWatcher::OnHandleReady+0xe4
chrome.dll!base::internal::Invoker<base::internal::BindState<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),base::WeakPtr<mojo::SimpleWatcher>,int,unsigned int,mojo::HandleSignalsState>,void ()
chrome.dll!base::debug::TaskAnnotator::RunTask+0x138
chrome.dll!base::internal::IncomingTaskQueue::RunTask+0x6d
chrome.dll!base::MessageLoop::RunTask+0x20f
chrome.dll!base::MessageLoop::DeferOrRunPendingTask+0xa1
chrome.dll!base::MessageLoop::DoWork+0x20a
chrome.dll!base::MessagePumpForUI::DoRunLoop+0x78
chrome.dll!base::MessagePumpWin::Run+0x6e
chrome.dll!base::MessageLoop::Run+0x78
chrome.dll!base::RunLoop::Run+0xd4
chrome.dll!ChromeBrowserMainParts::MainMessageLoopRun+0xcf
chrome.dll!content::BrowserMainLoop::RunMainMessageLoopParts+0x3b
chrome.dll!content::BrowserMainRunnerImpl::Run+0x96
chrome.dll!content::BrowserMain+0x9d
chrome.dll!content::RunNamedProcessTypeMain+0xee
chrome.dll!content::ContentMainRunnerImpl::Run+0x11f
chrome.dll!service_manager::Main+0x2a9
chrome.dll!content::ContentMain+0x33
chrome.dll!ChromeMain+0x11e
chrome.exe!MainDllLoader::Launch+0x230
chrome.exe!wWinMain+0x54b
chrome.exe!__scrt_common_main_seh+0xf6
KERNEL32.DLL!@BaseThreadInitThunk@12+0x24
ntdll.dll!__RtlUserThreadStart+0x2f
ntdll.dll!__RtlUserThreadStart@8+0x1b

Comment 51 by w...@chromium.org, May 9 2018

Re #50: That build should be after sadrul@'s patch to limit the # of elements in |latency_info|, so are we ending up with 1000s of LatencyComponent entries in the |latency_components_| flat_map<> in a single LatencyInfo?
Looks like I have a winner with 146k+ LatencyInfos in a vector:

0:000> dt /r latency_info
Local var @ 0x4f3d214 Type std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >*
0x04f3d4f8 
   +0x000 _Mypair          : std::_Compressed_pair<std::allocator<ui::LatencyInfo>,std::_Vector_val<std::_Simple_types<ui::LatencyInfo> >,1>
      +0x000 _Myval2          : std::_Vector_val<std::_Simple_types<ui::LatencyInfo> >
         +0x000 _Myfirst         : 0x3c234040 ui::LatencyInfo
         +0x004 _Mylast          : 0x3cb21d40 ui::LatencyInfo
         +0x008 _Myend           : 0x3cedc5c0 ui::LatencyInfo
0:000> ? (0x3cb21d40 -0x3c234040 )/40
Evaluate expression: 146292 = 00023b74


Here's the stack trace where I got this, and the info above is from frame 16, TakeLatencyInfoFromFrame:

00 chrome!base::PlatformThread::CurrentRef
01 chrome!base::Lock::CheckUnheldAndMark
02 chrome!base::Lock::Acquire
03 chrome!base::AutoLock::AutoLock
04 chrome!heap_profiling::`anonymous namespace'::SendBuffer::Send
05 chrome!heap_profiling::`anonymous namespace'::DoSend
06 chrome!heap_profiling::AllocatorShimLogFree
07 chrome!heap_profiling::`anonymous namespace'::HookFree
08 chrome!ShimFree
09 chrome!free
0a chrome!std::_Deallocate
0b chrome!std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> >::deallocate
0c chrome!std::vector<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> > >::_Tidy
0d chrome!std::vector<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::allocator<std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent> > >::~vector
0e chrome!base::internal::flat_tree<std::pair<ui::LatencyComponentType,long long>,std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,base::internal::GetKeyFromValuePairFirst<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::less<void> >::Impl::~Impl
0f chrome!base::internal::flat_tree<std::pair<ui::LatencyComponentType,long long>,std::pair<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,base::internal::GetKeyFromValuePairFirst<std::pair<ui::LatencyComponentType,long long>,ui::LatencyInfo::LatencyComponent>,std::less<void> >::~flat_tree
10 chrome!ui::LatencyInfo::~LatencyInfo
11 chrome!std::_Default_allocator_traits<std::allocator<ui::LatencyInfo> >::destroy
12 chrome!std::_Destroy_range1
13 chrome!std::_Destroy_range
14 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::_Destroy
15 chrome!std::vector<ui::LatencyInfo,std::allocator<ui::LatencyInfo> >::clear
16 chrome!viz::Surface::TakeLatencyInfoFromFrame
17 chrome!viz::Surface::TakeLatencyInfo
18 chrome!viz::Surface::ActivateFrame
19 chrome!viz::Surface::QueueFrame
1a chrome!viz::CompositorFrameSinkSupport::MaybeSubmitCompositorFrame
1b chrome!viz::CompositorFrameSinkSupport::SubmitCompositorFrame
1c chrome!content::RenderWidgetHostViewChildFrame::SubmitCompositorFrame
1d chrome!content::RenderWidgetHostImpl::SubmitCompositorFrame
1e chrome!viz::mojom::CompositorFrameSinkStubDispatch::Accept
1f chrome!viz::mojom::CompositorFrameSinkStub<mojo::RawPtrImplRefTraits<viz::mojom::CompositorFrameSink> >::Accept
20 chrome!mojo::InterfaceEndpointClient::HandleValidatedMessage
21 chrome!mojo::FilterChain::Accept
22 chrome!mojo::InterfaceEndpointClient::HandleIncomingMessage
23 chrome!mojo::internal::MultiplexRouter::ProcessIncomingMessage
24 chrome!mojo::internal::MultiplexRouter::Accept
25 chrome!mojo::FilterChain::Accept
26 chrome!mojo::Connector::ReadSingleMessage
27 chrome!mojo::Connector::ReadAllAvailableMessages
28 chrome!mojo::Connector::OnHandleReadyInternal
29 chrome!base::internal::FunctorTraits<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),void>::Invoke
2a chrome!base::internal::InvokeHelper<0,void>::MakeItSo
2b chrome!base::internal::Invoker<base::internal::BindState<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),std::unique_ptr<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>,std::default_delete<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &> > > >,void (const SkBitmap &)>::RunImpl
2c chrome!base::internal::Invoker<base::internal::BindState<void (base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>::*)(const SkBitmap &) __attribute__((thiscall)),std::unique_ptr<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &>,std::default_delete<base::internal::AdaptCallbackForRepeatingHelper<const SkBitmap &> > > >,void (const SkBitmap &)>::Run
2d chrome!base::RepeatingCallback<void (unsigned int)>::Run
2e chrome!mojo::SimpleWatcher::DiscardReadyState
2f chrome!base::internal::FunctorTraits<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),void>::Invoke
30 chrome!base::internal::InvokeHelper<0,void>::MakeItSo
31 chrome!base::internal::Invoker<base::internal::BindState<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),InlineLoginHandlerImpl::FinishCompleteLoginParams>,void (Profile *, Profile::CreateStatus)>::RunImpl
32 chrome!base::internal::Invoker<base::internal::BindState<void (*)(const InlineLoginHandlerImpl::FinishCompleteLoginParams &, Profile *, Profile::CreateStatus),InlineLoginHandlerImpl::FinishCompleteLoginParams>,void (Profile *, Profile::CreateStatus)>::Run
33 chrome!base::RepeatingCallback<void (unsigned int, const mojo::HandleSignalsState &)>::Run
34 chrome!mojo::SimpleWatcher::OnHandleReady
35 chrome!base::internal::FunctorTraits<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),void>::Invoke
36 chrome!base::internal::InvokeHelper<1,void>::MakeItSo
37 chrome!base::internal::Invoker<base::internal::BindState<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),base::WeakPtr<mojo::SimpleWatcher>,int,unsigned int,mojo::HandleSignalsState>,void ()>::RunImpl
38 chrome!base::internal::Invoker<base::internal::BindState<void (mojo::SimpleWatcher::*)(int, unsigned int, const mojo::HandleSignalsState &) __attribute__((thiscall)),base::WeakPtr<mojo::SimpleWatcher>,int,unsigned int,mojo::HandleSignalsState>,void ()>::Run
39 chrome!base::OnceCallback<void ()>::Run
3a chrome!base::debug::TaskAnnotator::RunTask
3b chrome!base::internal::IncomingTaskQueue::RunTask
3c chrome!base::MessageLoop::RunTask
3d chrome!base::MessageLoop::DeferOrRunPendingTask
3e chrome!base::MessageLoop::DoWork
3f chrome!base::MessagePumpForUI::DoRunLoop
40 chrome!base::MessagePumpWin::Run
41 chrome!base::MessageLoop::Run
42 chrome!base::RunLoop::Run
43 chrome!ChromeBrowserMainParts::MainMessageLoopRun
44 chrome!content::BrowserMainLoop::RunMainMessageLoopParts
45 chrome!content::BrowserMainRunnerImpl::Run
46 chrome!content::BrowserMain
47 chrome!content::RunNamedProcessTypeMain
48 chrome!content::ContentMainRunnerImpl::Run
49 chrome!service_manager::Main
4a chrome!content::ContentMain
4b chrome!ChromeMain
4c chrome_exe!MainDllLoader::Launch
4d chrome_exe!wWinMain
4e chrome_exe!invoke_main
4f chrome_exe!__scrt_common_main_seh
50 KERNEL32!BaseThreadInitThunk
51 ntdll!__RtlUserThreadStart
52 ntdll!_RtlUserThreadStart
Owner: sadrul@chromium.org
Passing back to sadrul@ as his CL doesn't fix the issue.
Labels: Proj-SiteIsolation-LaunchBlocking
Tentatively marking as Proj-SiteIsolation-LaunchBlocking, because while the bug seems rare, when it does happen the effects are rather severe.
And now this just cleared up, as I was trying to debug it. I have a full process dump of the bad state, in case that helps.
Well, I seem to have captured whatever this is about in the dump. Witness:
0:000> .frame
18 04f3d460 67b85f54 chrome!viz::Surface::ActivateFrame+0x17a [C:\b\c\b\win_asan\src\components\viz\service\surfaces\surface.cc @ 316] 
0:000> dt frame_data frame.metadata.latency_info...
Local var @ 0x4f3d468 Type viz::Surface::FrameData
   +0x000 frame                          : 
      +0x000 metadata                       : 
         +0x090 latency_info                   : 
            +0x000 _Mypair                        : 
               +0x000 _Myval2                        : 
                  +0x000 _Myfirst                       : 0x3c234040 ui::LatencyInfo
                  +0x004 _Mylast                        : 0x3cb21d40 ui::LatencyInfo
                  +0x008 _Myend                         : 0x3cedc5c0 ui::LatencyInfo
   +0x118 frame_index                    : 0x11dbc

LMK if you want it - this is a full memory dump of Chrome running in my personal profile, so I won't share it widely.
Labels: UMA-Sampling-Profiler
Taking a look inside the sampling profiler averages for viz::Surface::TakeLatencyInfoFromFrame in 68.0.3420.0 win canary: of the profiles we captured executing within the function, the top 15% are spending >85% of their time executing it. So we're definitely seeing this behavior in the wild.

The effect is there on Mac too, but smaller: the top 15% are spending >37% of their time executing the function.
OK. It looks like the accumulation of the LatencyInfo instances are happening on the client side. So the deserialization is also taking a fair amount of time. I will change the code to validate things on the client-side so it doesn't accumulate unbounded (and drop incoming messages in the service-side if it includes too many instances).

[1] https://uma.googleplex.com/p/chrome/callstacks/?sid=4b6a2738a05fd22d29b0f213c0b0646e
Put up https://chromium-review.googlesource.com/c/chromium/src/+/1053793 with a speculative fix, and some debug dumps to help understand this better.
Project Member

Comment 62 by bugdroid1@chromium.org, May 11 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/92e0446435896425805d05af8a91072263223d68

commit 92e0446435896425805d05af8a91072263223d68
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Fri May 11 20:05:26 2018

latency-info: Add a speculative fix.

Add a speculative fix in SurfaceAggregator::Aggregate() to make sure it
doesn't accumulate too many LatencyInfo objects.

BUG= 834421 

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Change-Id: Ifd41d901050a954279b642cab972ca6ec6d560e2
Reviewed-on: https://chromium-review.googlesource.com/1053793
Reviewed-by: Charlie Reis <creis@chromium.org>
Reviewed-by: Fady Samuel <fsamuel@chromium.org>
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#557992}
[modify] https://crrev.com/92e0446435896425805d05af8a91072263223d68/components/viz/service/display/surface_aggregator.cc

Project Member

Comment 63 by bugdroid1@chromium.org, May 11 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/603b405c64ee17cd1db7d001a23e6dc88e39a77b

commit 603b405c64ee17cd1db7d001a23e6dc88e39a77b
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Fri May 11 20:09:19 2018

latency-info: Add a couple of investigative dumps.

Add a couple of investigative dumps to understand better where and how
a large number of LatencyInfo objects are accumulating.

BUG= 834421 

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Change-Id: Ieea6e47c4a1acf5c89384048e19b742e3e21e5a8
Reviewed-on: https://chromium-review.googlesource.com/1054153
Reviewed-by: Fady Samuel <fsamuel@chromium.org>
Commit-Queue: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Commit-Position: refs/heads/master@{#557994}
[modify] https://crrev.com/603b405c64ee17cd1db7d001a23e6dc88e39a77b/cc/trees/layer_tree_host_impl.cc
[modify] https://crrev.com/603b405c64ee17cd1db7d001a23e6dc88e39a77b/components/viz/service/display/surface_aggregator.cc

Comment 64 by creis@chromium.org, May 11 2018

Thanks Sadrul!  If the next Canary with r557992 and r557994 looks good (probably 68.0.3428.0), we can request merge for r557992 to M67.  (No need to merge r557994, which is just DumpWithoutCrashing to better understand why this is happening.)
NextAction: 2018-05-14
The NextAction date has arrived: 2018-05-14

Comment 67 by creis@chromium.org, May 14 2018

siggi@: Any chance you can try out 68.0.3428.0 or later to see if it helps today?  Any signals you can offer will help as we decide about merging r557992 to M67.  Thanks!

Comment 68 by siggi@chromium.org, May 14 2018

I run the days Canary every day, today is 3430. I've never been able to
find repro steps. This seems to happen fairly randomly, and seems to clear
up after a while, so I don't know how I could verify a fix one way or the
other.

Comment 69 by creis@chromium.org, May 14 2018

Sure.  Just let us know if you happen to see the problem today on 3430.  Thanks!
According to sampling profiler data the issue is still present in 68.0.3428.0 canary, but appears to be somewhat decreased in severity on Windows.

Here's percentage of execution time by vingtile in viz::Surface::TakeLatencyInfoFromFrame, for profiles that saw at least one stack sample in this function.

68.0.3420.0:
Win	0,0,0,0,0,0,0,0,0,0,0,0,0,1,9,12,39,86,94,97,99
Mac	0,0,0,0,0,0,0,1,1,2,2,2,2,4,5,5,38,39,39,44,44

68.0.3428.0:
Win	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,25,59,87,100
Mac	0,0,0,0,0,0,0,0,0,1,2,4,4,7,11,11,23,33,36,40,69

On Windows we went from the top 15% of clients seeing >86% of execution time in this function to the top 5%.

On Mac things look very slightly better, although the worst case is worse.
Project Member

Comment 71 by bugdroid1@chromium.org, May 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f9a2abbeaf4d221540e5888dc4fbd8239675698a

commit f9a2abbeaf4d221540e5888dc4fbd8239675698a
Author: Sadrul Chowdhury <sadrul@chromium.org>
Date: Mon May 14 22:22:11 2018

Revert "latency-info: Add a couple of investigative dumps."

This reverts commit 603b405c64ee17cd1db7d001a23e6dc88e39a77b.

Reason for revert: We have collected the dumps. So not needed anymore.

Original change's description:
> latency-info: Add a couple of investigative dumps.
>
> Add a couple of investigative dumps to understand better where and how
> a large number of LatencyInfo objects are accumulating.
>
> BUG= 834421 
>
> Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
> Change-Id: Ieea6e47c4a1acf5c89384048e19b742e3e21e5a8
> Reviewed-on: https://chromium-review.googlesource.com/1054153
> Reviewed-by: Fady Samuel <fsamuel@chromium.org>
> Commit-Queue: Sadrul Chowdhury <sadrul@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#557994}

TBR=sadrul@chromium.org,fsamuel@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  834421 , 842505
Change-Id: I1acbb8d522cb12606071af82d40b96b2b4626d27
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Reviewed-on: https://chromium-review.googlesource.com/1057362
Commit-Queue: Sadrul Chowdhury <sadrul@chromium.org>
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Commit-Position: refs/heads/master@{#558491}
[modify] https://crrev.com/f9a2abbeaf4d221540e5888dc4fbd8239675698a/cc/trees/layer_tree_host_impl.cc
[modify] https://crrev.com/f9a2abbeaf4d221540e5888dc4fbd8239675698a/components/viz/service/display/surface_aggregator.cc

Labels: Merge-Request-67
Requesting merge of crrev.com/557992 to M67 (the dumps that were reverted show that the fix should help with some cases).
Labels: -Merge-Request-67 Merge-Approved-67
Approving merge to M67 branch 3396 for crrev.com/557992 based on comment #72. 
Project Member

Comment 74 by bugdroid1@chromium.org, May 15 2018

Labels: -merge-approved-67
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f04f2e48b7db8730f04d75e750437ef0028b8df0

commit f04f2e48b7db8730f04d75e750437ef0028b8df0
Author: Sadrul Habib Chowdhury <sadrul@chromium.org>
Date: Tue May 15 01:02:12 2018

latency-info: Add a speculative fix.

Add a speculative fix in SurfaceAggregator::Aggregate() to make sure it
doesn't accumulate too many LatencyInfo objects.

BUG= 834421 

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Change-Id: Ifd41d901050a954279b642cab972ca6ec6d560e2
Reviewed-on: https://chromium-review.googlesource.com/1053793
Reviewed-by: Charlie Reis <creis@chromium.org>
Reviewed-by: Fady Samuel <fsamuel@chromium.org>
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#557992}(cherry picked from commit 92e0446435896425805d05af8a91072263223d68)
Reviewed-on: https://chromium-review.googlesource.com/1058288
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#601}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/f04f2e48b7db8730f04d75e750437ef0028b8df0/components/viz/service/display/surface_aggregator.cc

Comment 75 by creis@chromium.org, May 15 2018

Blocking: 823439
Thanks for the merge, Sadrul!

Sorry to keep asking, but given comment 70, can you take another look?  :)  Sounds like there's still more ways for this to happen.  (Sounds like issue 823439 might not be solved either.)  Ideally we'd get another CL landed this week with the chance to merge to M67.

Comment 76 by siggi@chromium.org, May 16 2018

Looks like this is happening again on my Version 68.0.3432.0 (Official Build) canary (64-bit). Hard to say, though as it looks like the symbols are harfed for that version :(.

Comment 77 by creis@chromium.org, May 17 2018

Sadrul, can you post an update when you get a chance?  It sounds like you might have had a lead on the remaining issues, which is great.  If we can get it landed today or tomorrow, it can bake over the weekend and might get merged to M67.  Thanks!
Thanks to sadrul@'s patch I now have a reliable repro:

1. Visit slashdot.org
2. Keep resizing the browser.
3. LatencyInfo grows without bound.

I'll have a fix shortly.

Comment 79 by creis@chromium.org, May 17 2018

Fantastic!  Thanks Fady!
I have a fix. Works fine locally. Just writing a unit test. Will be ready for review in minutes.
Project Member

Comment 81 by bugdroid1@chromium.org, May 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d1564e4226387f17c349e4d8a20cbd3ff8a08dc5

commit d1564e4226387f17c349e4d8a20cbd3ff8a08dc5
Author: Fady Samuel <fsamuel@chromium.org>
Date: Thu May 17 20:57:05 2018

Surface synchronization: Prevent LatencyInfo from accumulating across surfaces

The LatencyInfo vector in CompositorFrameMetadata can accumulate without bound
when allocating new surfaces. In order to track latency, we carry LatencyInfo
objects across surfaces. The new surface grabs the LatencyInfo objects from
the previous surface.

We used to check if the new Surface has too many LatencyInfos and if not,
grab all LatencyInfos from the previous Surface and we don't check again.
The previous Surface might have a large number of LatencyInfo objects
accumulated, and will accumulate further with every new surface allocated
for a given FrameSink.

The fix is to verify the LatencyInfo vector AFTER taking the items from
the previous surface.

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel
Change-Id: I6814fd3657b7b0a9e9d275013e661b176095e740
Bug:  834421 ,  672962 
Reviewed-on: https://chromium-review.googlesource.com/1064810
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Commit-Position: refs/heads/master@{#559668}
[modify] https://crrev.com/d1564e4226387f17c349e4d8a20cbd3ff8a08dc5/components/viz/service/frame_sinks/surface_synchronization_unittest.cc
[modify] https://crrev.com/d1564e4226387f17c349e4d8a20cbd3ff8a08dc5/components/viz/service/surfaces/surface.cc
[modify] https://crrev.com/d1564e4226387f17c349e4d8a20cbd3ff8a08dc5/components/viz/test/compositor_frame_helpers.cc
[modify] https://crrev.com/d1564e4226387f17c349e4d8a20cbd3ff8a08dc5/components/viz/test/compositor_frame_helpers.h

Labels: Merge-Request-67
NextAction: 2018-05-18
Thank you fsamuel@. Pls update the bug with canary result tomorrow for CL listed at #81. 
Project Member

Comment 84 by sheriffbot@chromium.org, May 17 2018

Labels: -Merge-Request-67 Merge-Review-67
This bug requires manual review: We are only 11 days from stable.
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
The NextAction date has arrived: 2018-05-18
The latest canary build incorporates this change: 68.0.3434.0.


Everything seems find and #81 is very low risk (very small change), so I'm requesting merge approval to M67. This is a site isolation launch blocker! Thanks!
Labels: -Merge-Review-67 Merge-Approved-67
NextAction: 2018-05-21
Approving merge to M67 branch 3396 based on comment #86. Pls merge on Monday morning after weekend canary coverage if change continue to look good in canary. Thank you.

Comment 88 by creis@chromium.org, May 19 2018

wittman@: If you get a chance, can you check the sampling profiler data as you did in comment 70?  Would be great to help confirm if the issues is resolved after r559668 (68.0.3434.0).
The NextAction date has arrived: 2018-05-21
This has stuck around for a few days so I'm going to merge it back to M67.
Project Member

Comment 91 by bugdroid1@chromium.org, May 21 2018

Labels: -merge-approved-67
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/014cb0fc75c9c26af2c1645ae49d13c8e52b7c44

commit 014cb0fc75c9c26af2c1645ae49d13c8e52b7c44
Author: Fady Samuel <fsamuel@chromium.org>
Date: Mon May 21 15:51:13 2018

Surface synchronization: Prevent LatencyInfo from accumulating across surfaces

The LatencyInfo vector in CompositorFrameMetadata can accumulate without bound
when allocating new surfaces. In order to track latency, we carry LatencyInfo
objects across surfaces. The new surface grabs the LatencyInfo objects from
the previous surface.

We used to check if the new Surface has too many LatencyInfos and if not,
grab all LatencyInfos from the previous Surface and we don't check again.
The previous Surface might have a large number of LatencyInfo objects
accumulated, and will accumulate further with every new surface allocated
for a given FrameSink.

The fix is to verify the LatencyInfo vector AFTER taking the items from
the previous surface.

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel
Change-Id: I6814fd3657b7b0a9e9d275013e661b176095e740
Bug:  834421 ,  672962 
Reviewed-on: https://chromium-review.googlesource.com/1064810
Commit-Queue: Fady Samuel <fsamuel@chromium.org>
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#559668}(cherry picked from commit d1564e4226387f17c349e4d8a20cbd3ff8a08dc5)
Reviewed-on: https://chromium-review.googlesource.com/1067056
Reviewed-by: Fady Samuel <fsamuel@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#659}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/014cb0fc75c9c26af2c1645ae49d13c8e52b7c44/components/viz/service/frame_sinks/surface_synchronization_unittest.cc
[modify] https://crrev.com/014cb0fc75c9c26af2c1645ae49d13c8e52b7c44/components/viz/service/surfaces/surface.cc
[modify] https://crrev.com/014cb0fc75c9c26af2c1645ae49d13c8e52b7c44/components/viz/test/compositor_frame_helpers.cc
[modify] https://crrev.com/014cb0fc75c9c26af2c1645ae49d13c8e52b7c44/components/viz/test/compositor_frame_helpers.h

The sampling profiler data looks good in 68.0.3434.0 (per the same analysis as in #70):
Win	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2
Mac	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1

Per-client execution percentages under ~4% are not necessarily demonstrative of actual execution time, so we can reasonably conclude from this that we're now spending negligible execution time in this function for all users.

Comment 93 by creis@chromium.org, May 21 2018

Status: Fixed (was: Started)
Comment 92: Thanks!  That's awesome to hear.  And thanks for merging it, Fady!

Sign in to add a comment