Potential jank caused by OpenNewLog |
||||||
Issue descriptionSlow-reports is shipped back to our server with a chrome trace when some jank is detected on the chrome main thread. See attachment. Report: 71d36440a98c5627 Title MessageLoop::RunTask Wall Duration 4,331.116 ms CPU Duration 33.938 ms src_file "../../components/metrics/metrics_service.cc" src_func "OpenNewLog" Title MessageLoop::RunTask Wall Duration 6,128.184 ms CPU Duration 37.718 ms Args src_file "../../components/metrics/metrics_service.cc" src_func "PrepareProviderMetricsTask" More examples: 71d36440a98c5627 f9d8fcdcf6be6c22 c1bf78b4aeab1556 1998b2704fed3cb7 fd566c8c1cd1caf4 c706cda97d3dd2c0 5dbf16666266442e This is related to these two posted tasks: https://cs.chromium.org/chromium/src/components/metrics/metrics_service.cc?type=cs&sq=package:chromium&g=0&l=608 1) Tasks should be posted by using |PostAfterStartupTask| https://cs.chromium.org/chromium/src/content/public/browser/browser_thread.h?l=162 2) Tasks should not be posted on the CrBrowserMain thread.
,
Aug 22
To clarify, the PostAfterStartup won't fix the case. It's just better than posting the task 30 sec after startup (I think). gab@/fdoray@ can tell you more about it.
,
Aug 22
Time Report 51.184 df46184a6aec2298 chrome 70.0.3514.0 37.715 2cf57d6b9b5f76cd chrome 70.0.3521.2 29.86 516188c4fdd6bece chrome 70.0.3521.2 23.461 7356c93b6118a4c1 chrome 70.0.3524.0 23.041 656a5be2c2f96cd8 chrome 70.0.3521.2 22.718 d4a69f1c75d23c1a chrome 70.0.3524.0 20.915 6cdda3403b6a4adc chrome 70.0.3521.2
,
Aug 23
Those tasks are delayed 30 and 60 seconds, respectively, on desktop and 5/10 seconds on mobile. Posting them after startup would likely make them run sooner but it would be fine to do so. We've talked in the past about making Metric collection occur on a background thread but it would be non-trivial to do so.
,
Sep 10
We've turn on StackSampling on browser main thread. This is helping us to figure out where the time is spent on these long tasks. Reports (symbolized version attached): 02a2fe715b118f4a bce247c9cd36c1bd 489b48e794e05be8 It seems that most of the time is spent on that following stackframe: base::PersistentMemoryAllocator::Iterator::GetNext(unsigned int *) base::PersistentMemoryAllocator::Iterator::GetNextOfType(unsigned int) base::PersistentSampleMap::GetNextPersistentRecord(base::PersistentMemoryAllocator::Iterator &,unsigned __int64 *) base::PersistentSparseHistogramDataManager::LoadRecords(base::PersistentSampleMapRecords *) base::PersistentSampleMapRecords::GetNext() base::PersistentSampleMap::ImportSamples(int,bool) base::PersistentSampleMap::Iterator() base::HistogramSamples::Add(base::HistogramSamples const &) base::SparseHistogram::SnapshotFinalDelta() base::HistogramSnapshotManager::PrepareFinalDelta(base::HistogramBase const *) metrics::FileMetricsProvider::RecordHistogramSnapshotsFromSource(base::HistogramSnapshotManager *,metrics::FileMetricsProvider::SourceInfo *) metrics::FileMetricsProvider::ProvideIndependentMetrics(metrics::SystemProfileProto *,base::HistogramSnapshotManager *) metrics::MetricsLog::LoadIndependentMetrics(metrics::MetricsProvider *) metrics::MetricsService::PrepareProviderMetricsLog() metrics::MetricsService::PrepareProviderMetricsTask() base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *) base::MessageLoop::RunTask(base::PendingTask *) base::MessageLoop::DoDelayedWork(base::TimeTicks *) base::MessagePumpForUI::DoRunLoop() base::MessagePumpWin::Run(base::MessagePump::Delegate *) base::RunLoop::Run() ChromeBrowserMainParts::MainMessageLoopRun(int *) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const &) content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *) content::ContentMainRunnerImpl::Run(bool) service_manager::Main(service_manager::MainParams const &) content::ContentMain(content::ContentMainParams const &) ChromeMain MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks) wWinMain __scrt_common_main_seh
,
Sep 10
,
Sep 10
I'm looking to the second task PrepareProviderMetricsTask Report: fefb7485675c4fd2 (symbolized attached) Time is spent on this stackframe: base::PersistentMemoryAllocator::Iterator::GetNext(unsigned int *) base::PersistentMemoryAllocator::Iterator::GetNextOfType(unsigned int) metrics::PersistentSystemProfile::RecordAllocator::NextSegment() metrics::PersistentSystemProfile::RecordAllocator::Read(metrics::PersistentSystemProfile::RecordType *,std::basic_string<char,std::char_traits<char>,std::allocator<char> > *) metrics::PersistentSystemProfile::MergeUpdateRecords(base::PersistentMemoryAllocator const &,metrics::SystemProfileProto *) metrics::PersistentSystemProfile::GetSystemProfile(base::PersistentMemoryAllocator const &,metrics::SystemProfileProto *) metrics::FileMetricsProvider::ProvideIndependentMetrics(metrics::SystemProfileProto *,base::HistogramSnapshotManager *) metrics::MetricsLog::LoadIndependentMetrics(metrics::MetricsProvider *) metrics::MetricsService::PrepareProviderMetricsLog() metrics::MetricsService::PrepareProviderMetricsTask() base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *) base::MessageLoop::RunTask(base::PendingTask *) base::MessageLoop::DoDelayedWork(base::TimeTicks *) base::MessagePumpForUI::DoRunLoop() base::MessagePumpWin::Run(base::MessagePump::Delegate *) base::RunLoop::Run() ChromeBrowserMainParts::MainMessageLoopRun(int *) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const &) content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *) content::ContentMainRunnerImpl::Run(bool) service_manager::Main(service_manager::MainParams const &) content::ContentMain(content::ContentMainParams const &) ChromeMain MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks) wWinMain
,
Sep 10
Those call stacks indicate that we're spending a lot of time in getting histograms or system profile from the memory-mapped persistent histograms file. Ideally, this should be in RAM in addition to disk and therefore should not be slow. Do we know if the systems have memory pressure such that maybe Windows may have purged the RAM parts of those files and has to swap them in when we're accessing them?
,
Sep 10
It does not look like the system itself was running slow during the trace. One other observation is that there were multiple other tasks on the worker threads that were trying to read files like sqlite_persistent_cookie_store, in_memory_url_index and ranker_model_loader_impl. Another thing to note is that this occurred about 80 seconds after startup. Is it possible to avoid reading from disk on main thread? like pin memory on RAM?
,
Sep 11
Actually, looking at things more, I don't see OpenNewLog on the stack here? The stacks above do seem to indicate a separate issue where it does look like we're doing IO on main thread, which we shouldn't be doing. The intention is for the work to happen on a background thread, which CheckAndMergeMetricSourcesOnTaskRunner() is meant to be doing. But looks like some IO work is ending up on the main thread. Since this seems different than just in OpenNewLog, I went ahead and filed a new bug for that: https://bugs.chromium.org/p/chromium/issues/detail?id=882892
,
Sep 19
https://chromium-review.googlesource.com/1220310 enabled caching of the file on a background thread so that it shouldn't have to do any I/O on the UI thread. There is a 50/50 experiment in Canary & Dev: https://cs.corp.google.com/piper///depot/google3/googledata/googleclient/chrome/finch/studies/CacheFileMetricsData.json?q=cachefilemetricsdata.json&g=0&l=1 Can you check versions 71.0.3556.0 or above to see if the Jank has improved?
,
Sep 20
From #c10:
> Actually, looking at things more, I don't see OpenNewLog on the stack here?
It make sense since OpenNewLog is the trace event title, which is the FROM_HERE. Thus, it's the source where the tasks causing trouble is posted.
There are two cases:
base::SequencedTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&MetricsService::StartInitTask,
self_ptr_factory_.GetWeakPtr()),
base::TimeDelta::FromSeconds(kInitializationDelaySeconds));
base::SequencedTaskRunnerHandle::Get()->PostDelayedTask(
FROM_HERE,
base::BindOnce(&MetricsService::PrepareProviderMetricsTask,
self_ptr_factory_.GetWeakPtr()),
base::TimeDelta::FromSeconds(2 * kInitializationDelaySeconds));
We can see the following stackframe part of the sampling profiler results:
* MetricsService::PrepareProviderMetricsTask
,
Sep 20
> Can you check versions 71.0.3556.0 or above to see if the Jank has improved? Unfortunately, with slow-reports that's not an easy tasks since traces are not aware of experiments.
,
Sep 20
If I go 100% can you see a difference by version?
,
Sep 20
How about we just add a histogram to measure the timing of the code in question? Then we can easily split by the experiment and also understand e.g. how widespread this is.
,
Sep 20
Will do.
,
Sep 20
I'm landing the code in the slow-reports pipeline to support experiments.
,
Sep 26
,
Nov 23
This should be fixed. thx to bcwhite@chromium.org |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by mpear...@chromium.org
, Aug 22Components: Internals>Metrics