Startup.FirstWebContents.NonEmptyPaint2 has 2% samples in the overflow bucket on Mac |
|||||
Issue descriptionStartup.FirstWebContents.NonEmptyPaint2 has 2% samples in the overflow bucket on Mac. This metric is a core heartbeat metric tracked for experiments and over time by Chrome Speed Releasing. However, on Mac, it has about 2% of its samples in the overflow bucket: https://uma.googleplex.com/p/chrome/histograms?endDate=20171022&dayCount=7&histograms=Startup.FirstWebContents.NonEmptyPaint2&fixupData=true&showMax=true&filters=platform%2Ceq%2CM%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial This makes its 99th percentile measurement useless and suggests there's a problem with metric definition or Chrome startup speed. The overflow bucket has min of 3,600,000 ms which is 60 minutes. So either the metric is incorrectly recorded or 2% of Mac startups are taking over 60 minutes. Both of which would be bad. shrike@ - could you find an owner for this from Mac team to take a look? Also cc'ing speed releasing folks since we're using this as one of the heartbeat metrics.
,
Oct 24 2017
The profiler takes the first 30 seconds of startup, so not exactly. We can probably look at samples that didn't complete startup/didn't get to a paint in that 30 seconds but based on the linked histogram, this would cover ~16% of samples. When we have the ability to trigger on-demand, we can probably check in after x minutes if there hasn't been a paint, but that's not ready yet. wittman@ can we access "milestones" like https://cs.chromium.org/chromium/src/components/startup_metric_utils/browser/startup_metric_utils.cc?q=NonEmptyPaint2&sq=package:chromium&dr=C&l=694 from the dashboard? FWIW I would guess either this is incorrectly recorded or there's some sort of weird scenario like "user opens browser > immediately backgrounds while being part of the renderer backgrounding experiment > goes away from the computer for a long time" (I don't think that specific conjecture would account for 2%, just giving an idea of the kind of thing I mean.)
,
Oct 24 2017
We should try to figure out if those scenarios are possible - either by testing locally the edge cases we're thinking of or by adding more instrumentation to try to diagnose the issue.
,
Oct 24 2017
I'll try to trigger the scenario above later today. (also really cc wittman@ this time)
,
Oct 24 2017
Re comment 2: we don't have support for selecting data based on milestones in the dashboard yet, but it's possible to do analysis against that data via custom queries. From the continuous collection I think we should be able to determine if the profiler data corroborates 2% of startups taking > 1 hour.
,
Oct 24 2017
,
Oct 24 2017
,
Oct 25 2017
I gave the custom analysis using the sampling profiler data a shot. While it's not trivial to do an apples-to-apples comparison on percentages, I did find a decent number of cases in the continuous data where we saw first nonempty paint after 1 hour, which would seem to corroborate the 2% number.
,
Oct 25 2017
Can we look at stacks from those sessions (before the paint)?
,
Oct 25 2017
I think so. It'll be a bit of work but should be doable.
,
Dec 4 2017
Any update on this - I'm very interested to see if the stacks give us a clue about the problem.
,
Dec 4 2017
Keep getting interrupted; getting there slowly but surely.
,
Dec 7 2017
Making some progress here. Still figuring out the best way to sift through, but here's one that popped out at me. This had a large # of samples and took 4 hours to paint (almost positive this is from a single session) mach_msg_trap mach_msg __CFRunLoopServiceMachPort __CFRunLoopRun CFRunLoopRunSpecific RunCurrentEventLoopInMode ReceiveNextEventCommon _BlockUntilNextEventMatchingListInModeWithFilter _DPSNextEvent -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] __71-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]_block_invoke base::mac::CallWithEHFrame(void () block_pointer) -[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:] _NSHLTBMenuEventProc IsUserStillTracking(MenuSelectData*, unsigned char*) TrackMenuCommon(MenuSelectData&, unsigned char*, SelectionData*, MenuResult*, MenuResult*) PopUpMenuSelectCore(MenuData*, Point, double, Point, unsigned short, unsigned int, unsigned int, Rect const*, unsigned short, unsigned int, Rect const*, Rect const*, __CFDictionary const*, __CFString const*, OpaqueMenuRef**, unsigned short*) _HandlePopUpMenuSelection8(OpaqueMenuRef*, OpaqueEventRef*, unsigned int, Point, unsigned short, unsigned int, unsigned int, Rect const*, unsigned short, Rect const*, Rect const*, __CFDictionary const*, __CFString const*, OpaqueMenuRef**, unsigned short*) _HandlePopUpMenuSelectionWithDictionary _NSSLMPopUpCarbonMenu3 -[NSCarbonMenuImpl _popUpContextMenu:withEvent:forView:withFont:] -[NSMenu _popUpContextMenu:withEvent:forView:withFont:] RenderViewContextMenuMac::Show() ChromeWebContentsViewDelegateMac::ShowContextMenu(content::RenderFrameHost*, content::ContextMenuParams const&) non-virtual thunk to content::WebContentsImpl::ShowContextMenu(content::RenderFrameHost*, content::ContextMenuParams const&) content::RenderFrameHostImpl::OnContextMenu(content::ContextMenuParams const&) bool IPC::MessageT<FrameHostMsg_ContextMenu_Meta, std::__1::tuple<content::ContextMenuParams>, void>::Dispatch<content::RenderFrameHostImpl, content::RenderFrameHostImpl, void, void (content::RenderFrameHostImpl::*)(content::ContextMenuParams const&)>(IPC::Message const*, content::RenderFrameHostImpl*, content::RenderFrameHostImpl*, void*, void (content::RenderFrameHostImpl::*)(content::ContextMenuParams const&)) content::RenderFrameHostImpl::OnMessageReceived(IPC::Message const&) IPC::ChannelProxy::Context::OnDispatchMessage(IPC::Message const&) base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) base::MessageLoop::RunTask(base::PendingTask*) base::MessageLoop::DoWork() base::MessagePumpCFRunLoopBase::RunWork() base::mac::CallWithEHFrame(void () block_pointer) base::MessagePumpCFRunLoopBase::RunWorkSource(void*) __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ __CFRunLoopDoSource0 __CFRunLoopDoSources0 __CFRunLoopRun CFRunLoopRunSpecific RunCurrentEventLoopInMode ReceiveNextEventCommon _BlockUntilNextEventMatchingListInModeWithFilter _DPSNextEvent -[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] __71-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]_block_invoke base::mac::CallWithEHFrame(void () block_pointer) -[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:] -[NSApplication run] base::MessagePumpNSApplication::DoRun(base::MessagePump::Delegate*) base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) <name omitted> ChromeBrowserMainParts::MainMessageLoopRun(int*) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const&) content::ContentMainRunnerImpl::Run() service_manager::Main(service_manager::MainParams const&) content::ContentMain(content::ContentMainParams const&) ChromeMain main
,
Dec 7 2017
So that looks like a user right clicked something. Can we check locally if right-clicking during browser startup before the tab is painted causes the metric to report a very high value? You should be able to see the metric locally on chrome://histograms
,
Dec 7 2017
Nope, though maybe I'm not clicking quickly enough. I'm pretty sure it's not blocking the renderer. Could just be a coincidence and the real issue is in the renderer (so invisible to us through this method).
,
Dec 11 2017
"Could just be a coincidence and the real issue is in the renderer (so invisible to us through this method)." That's possible. One thing we can do to diagnose this is look at breakdown histograms: https://uma.googleplex.com/p/chrome/histograms?endDate=20171208&dayCount=7&histograms=Startup.BrowserMessageLoopStart.To.NonEmptyPaint2%2CStartup.BrowserMessageLoopStartTime%2CStartup.FirstWebContents.NonEmptyPaint2&fixupData=true&showMax=true&filters=platform%2Ceq%2CM%2Cchannel%2Ceq%2C4%2Cmilestone%2Cone_of%2C63%7C62%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial So here, I added Startup.BrowserMessageLoopStart.To.NonEmptyPaint2. This records the time from MessageLoopStart to NonEmptyPaint2. I restricted milestones to M62 and M63 stable to current most recent data. Both histograms have the same number of samples logged, which is good. However, there's much fewer samples in the overflow bucket of the other metric. Only 1.6k samples vs. 360k samples in the other one. I also added Startup.BrowserMessageLoopStartTime and that actually does have a ton of samples in the overflow bucket. So whatever's causing this actually ends up delaying the message loop start time. So then to diagnose this, perhaps there are other metrics we can use to slice that timing? The other thing I would suggest checking is whether the metric uses base::TimeTicks or base::Time. I think the latter could be affected by system clock changes and e.g. the machine sleeping between the two. So we should use base::TimeTicks if we aren't currently.
,
Mar 27 2018
lgrey: ping! Any progress on this?
,
Mar 27 2018
I didn't really see anything particularly actionable, but I also didn't get beyond manually scanning stacks/flamegraphing, so there's probably a lot more room to improve re: sifting through stacks. I have no cycles for this right now due to MacViews, but if someone wants to take this up, happy to share the specifics of my meager progress.
,
Jul 16
ping one more time. Any progress? +tdresser: Is there anyone else this would be a good bug for?
,
Jul 16
Re #16 on base::Time vs base::TimeTicks. We grab the process startup time as a base::Time, and then convert it to a base::TimeTicks. https://cs.chromium.org/chromium/src/components/startup_metric_utils/browser/startup_metric_utils.cc?rcl=d4e9a72822a8313ebda910ad567cc938bf76da6d&l=541 A comment on the conversion: // Converts a base::Time value to a base::TimeTicks value. The conversion isn't // exact, but by capturing Time::Now() as early as possible, the likelihood of a // clock change between it and process start is as low as possible. There is // also the time taken to synchronously resolve base::Time::Now() and // base::TimeTicks::Now() at play, but in practice it is pretty much instant // compared to multi-seconds startup timings. Maybe we're just hitting this known edge-case?
,
Nov 26
Ping! This bug has been flagged as stale by Chrome Speed Metrics bug triage, and it has an owner. Any update on this issue?
,
Nov 27
I just checked and this is still the case, despite the changeover to MacViews. I don't think I would be able to get to this for a while, so unassigning |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by shrike@chromium.org
, Oct 23 2017