New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 777563 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 3
Type: Bug



Sign in to add a comment

Startup.FirstWebContents.NonEmptyPaint2 has 2% samples in the overflow bucket on Mac

Project Member Reported by asvitk...@chromium.org, Oct 23 2017

Issue description

Startup.FirstWebContents.NonEmptyPaint2 has 2% samples in the overflow bucket on Mac.

This metric is a core heartbeat metric tracked for experiments and over time by Chrome Speed Releasing.

However, on Mac, it has about 2% of its samples in the overflow bucket:

https://uma.googleplex.com/p/chrome/histograms?endDate=20171022&dayCount=7&histograms=Startup.FirstWebContents.NonEmptyPaint2&fixupData=true&showMax=true&filters=platform%2Ceq%2CM%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

This makes its 99th percentile measurement useless and suggests there's a problem with metric definition or Chrome startup speed. The overflow bucket has min of 3,600,000 ms which is 60 minutes.

So either the metric is incorrectly recorded or 2% of Mac startups are taking over 60 minutes. Both of which would be bad.

shrike@ - could you find an owner for this from Mac team to take a look?

Also cc'ing speed releasing folks since we're using this as one of the heartbeat metrics.
 

Comment 1 by shrike@chromium.org, Oct 23 2017

Cc: lgrey@chromium.org
lgrey@ - is there any way we can use the sampling profiler to learn more about the long delays?

Comment 2 by lgrey@chromium.org, Oct 24 2017

The profiler takes the first 30 seconds of startup, so not exactly. We can probably look at samples that didn't complete startup/didn't get to a paint in that 30 seconds but based on the linked histogram, this would cover ~16% of samples.

When we have the ability to trigger on-demand, we can probably check in after x minutes if there hasn't been a paint, but that's not ready yet.

wittman@ can we access "milestones" like https://cs.chromium.org/chromium/src/components/startup_metric_utils/browser/startup_metric_utils.cc?q=NonEmptyPaint2&sq=package:chromium&dr=C&l=694 from the dashboard?

FWIW I would guess either this is incorrectly recorded or there's some sort of weird scenario like "user opens browser > immediately backgrounds while being part of the renderer backgrounding experiment > goes away from the computer for a long time" (I don't think that specific conjecture would account for 2%, just giving an idea of the kind of thing I mean.)
We should try to figure out if those scenarios are possible - either by testing locally the edge cases we're thinking of or by adding more instrumentation to try to diagnose the issue.

Comment 4 by lgrey@chromium.org, Oct 24 2017

Cc: wittman@chromium.org
I'll try to trigger the scenario above later today.

(also really cc wittman@ this time)
Re comment 2: we don't have support for selecting data based on milestones in the dashboard yet, but it's possible to do analysis against that data via custom queries.

From the continuous collection I think we should be able to determine if the profiler data corroborates 2% of startups taking > 1 hour.

Comment 6 by shrike@chromium.org, Oct 24 2017

Cc: -lgrey@chromium.org shrike@chromium.org
Owner: lgrey@chromium.org

Comment 7 by lgrey@chromium.org, Oct 24 2017

Status: Assigned (was: Untriaged)
I gave the custom analysis using the sampling profiler data a shot. While it's not trivial to do an apples-to-apples comparison on percentages, I did find a decent number of cases in the continuous data where we saw first nonempty paint after 1 hour, which would seem to corroborate the 2% number.

Comment 9 by lgrey@chromium.org, Oct 25 2017

Can we look at stacks from those sessions (before the paint)?
I think so. It'll be a bit of work but should be doable.
Any update on this - I'm very interested to see if the stacks give us a clue about the problem.
Keep getting interrupted; getting there slowly but surely.
Making some progress here. Still figuring out the best way to sift through, but here's one that popped out at me. This had a large # of samples and took 4 hours to paint (almost positive this is from a single session)

mach_msg_trap
mach_msg
__CFRunLoopServiceMachPort
__CFRunLoopRun
CFRunLoopRunSpecific
RunCurrentEventLoopInMode
ReceiveNextEventCommon
_BlockUntilNextEventMatchingListInModeWithFilter
_DPSNextEvent
-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:]
__71-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]_block_invoke
base::mac::CallWithEHFrame(void () block_pointer)
-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
_NSHLTBMenuEventProc
IsUserStillTracking(MenuSelectData*, unsigned char*)
TrackMenuCommon(MenuSelectData&, unsigned char*, SelectionData*, MenuResult*, MenuResult*)
PopUpMenuSelectCore(MenuData*, Point, double, Point, unsigned short, unsigned int, unsigned int, Rect const*, unsigned short, unsigned int, Rect const*, Rect const*, __CFDictionary const*, __CFString const*, OpaqueMenuRef**, unsigned short*)
_HandlePopUpMenuSelection8(OpaqueMenuRef*, OpaqueEventRef*, unsigned int, Point, unsigned short, unsigned int, unsigned int, Rect const*, unsigned short, Rect const*, Rect const*, __CFDictionary const*, __CFString const*, OpaqueMenuRef**, unsigned short*)
_HandlePopUpMenuSelectionWithDictionary
_NSSLMPopUpCarbonMenu3
-[NSCarbonMenuImpl _popUpContextMenu:withEvent:forView:withFont:]
-[NSMenu _popUpContextMenu:withEvent:forView:withFont:]
RenderViewContextMenuMac::Show()
ChromeWebContentsViewDelegateMac::ShowContextMenu(content::RenderFrameHost*, content::ContextMenuParams const&)
non-virtual thunk to content::WebContentsImpl::ShowContextMenu(content::RenderFrameHost*, content::ContextMenuParams const&)
content::RenderFrameHostImpl::OnContextMenu(content::ContextMenuParams const&)
bool IPC::MessageT<FrameHostMsg_ContextMenu_Meta, std::__1::tuple<content::ContextMenuParams>, void>::Dispatch<content::RenderFrameHostImpl, content::RenderFrameHostImpl, void, void (content::RenderFrameHostImpl::*)(content::ContextMenuParams const&)>(IPC::Message const*, content::RenderFrameHostImpl*, content::RenderFrameHostImpl*, void*, void (content::RenderFrameHostImpl::*)(content::ContextMenuParams const&))
content::RenderFrameHostImpl::OnMessageReceived(IPC::Message const&)
IPC::ChannelProxy::Context::OnDispatchMessage(IPC::Message const&)
base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
base::MessageLoop::RunTask(base::PendingTask*)
base::MessageLoop::DoWork()
base::MessagePumpCFRunLoopBase::RunWork()
base::mac::CallWithEHFrame(void () block_pointer)
base::MessagePumpCFRunLoopBase::RunWorkSource(void*)
__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
__CFRunLoopDoSource0
__CFRunLoopDoSources0
__CFRunLoopRun
CFRunLoopRunSpecific
RunCurrentEventLoopInMode
ReceiveNextEventCommon
_BlockUntilNextEventMatchingListInModeWithFilter
_DPSNextEvent
-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:]
__71-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]_block_invoke
base::mac::CallWithEHFrame(void () block_pointer)
-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
-[NSApplication run]
base::MessagePumpNSApplication::DoRun(base::MessagePump::Delegate*)
base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*)
<name omitted>
ChromeBrowserMainParts::MainMessageLoopRun(int*)
content::BrowserMainLoop::RunMainMessageLoopParts()
content::BrowserMainRunnerImpl::Run()
content::BrowserMain(content::MainFunctionParams const&)
content::ContentMainRunnerImpl::Run()
service_manager::Main(service_manager::MainParams const&)
content::ContentMain(content::ContentMainParams const&)
ChromeMain
main

So that looks like a user right clicked something.

Can we check locally if right-clicking during browser startup before the tab is painted causes the metric to report a very high value? You should be able to see the metric locally on chrome://histograms
Nope, though maybe I'm not clicking quickly enough. I'm pretty sure it's not blocking the renderer.

Could just be a coincidence and the real issue is in the renderer (so invisible to us through this method).


"Could just be a coincidence and the real issue is in the renderer (so invisible to us through this method)."

That's possible. One thing we can do to diagnose this is look at breakdown histograms:

https://uma.googleplex.com/p/chrome/histograms?endDate=20171208&dayCount=7&histograms=Startup.BrowserMessageLoopStart.To.NonEmptyPaint2%2CStartup.BrowserMessageLoopStartTime%2CStartup.FirstWebContents.NonEmptyPaint2&fixupData=true&showMax=true&filters=platform%2Ceq%2CM%2Cchannel%2Ceq%2C4%2Cmilestone%2Cone_of%2C63%7C62%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

So here, I added Startup.BrowserMessageLoopStart.To.NonEmptyPaint2.

This records the time from MessageLoopStart to NonEmptyPaint2. I restricted milestones to M62 and M63 stable to current most recent data.

Both histograms have the same number of samples logged, which is good.

However, there's much fewer samples in the overflow bucket of the other metric. Only 1.6k samples vs. 360k samples in the other one.

I also added Startup.BrowserMessageLoopStartTime and that actually does have a ton of samples in the overflow bucket. So whatever's causing this actually ends up delaying the message loop start time. So then to diagnose this, perhaps there are other metrics we can use to slice that timing?

The other thing I would suggest checking is whether the metric uses base::TimeTicks or base::Time. I think the latter could be affected by system clock changes and e.g. the machine sleeping between the two. So we should use base::TimeTicks if we aren't currently.
lgrey: ping! Any progress on this?

Comment 18 by lgrey@chromium.org, Mar 27 2018

I didn't really see anything particularly actionable, but I also didn't get beyond manually scanning stacks/flamegraphing, so there's probably a lot more room to improve re: sifting through stacks.

I have no cycles for this right now due to MacViews, but if someone wants to take this up, happy to share the specifics of my meager progress.
ping one more time. Any progress? 

+tdresser: Is there anyone else this would be a good bug for? 
Re #16 on base::Time vs base::TimeTicks.
We grab the process startup time as a base::Time, and then convert it to a base::TimeTicks.

https://cs.chromium.org/chromium/src/components/startup_metric_utils/browser/startup_metric_utils.cc?rcl=d4e9a72822a8313ebda910ad567cc938bf76da6d&l=541

A comment on the conversion:
// Converts a base::Time value to a base::TimeTicks value. The conversion isn't
// exact, but by capturing Time::Now() as early as possible, the likelihood of a
// clock change between it and process start is as low as possible. There is
// also the time taken to synchronously resolve base::Time::Now() and
// base::TimeTicks::Now() at play, but in practice it is pretty much instant
// compared to multi-seconds startup timings.

Maybe we're just hitting this known edge-case?
Ping! This bug has been flagged as stale by Chrome Speed Metrics bug triage, and it has an owner. Any update on this issue?
Cc: lgrey@chromium.org
Owner: ----
Status: Available (was: Assigned)
I just checked and this is still the case, despite the changeover to MacViews. I don't think I would be able to get to this for a while, so unassigning

Sign in to add a comment