New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 867603 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Nov 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 882892



Sign in to add a comment

Potential jank caused by OpenNewLog

Project Member Reported by etienneb@chromium.org, Jul 25

Issue description

Slow-reports is shipped back to our server with a chrome trace when some jank is detected on the chrome main thread.

See attachment.

Report: 71d36440a98c5627

Title	        MessageLoop::RunTask
Wall Duration	4,331.116 ms
CPU Duration	33.938 ms
  src_file      "../../components/metrics/metrics_service.cc"
  src_func      "OpenNewLog"

Title	        MessageLoop::RunTask
Wall Duration	6,128.184 ms
CPU Duration	37.718 ms
Args
  src_file      "../../components/metrics/metrics_service.cc"
  src_func      "PrepareProviderMetricsTask"


More examples:
  71d36440a98c5627
  f9d8fcdcf6be6c22
  c1bf78b4aeab1556
  1998b2704fed3cb7
  fd566c8c1cd1caf4
  c706cda97d3dd2c0
  5dbf16666266442e

This is related to these two posted tasks:
  https://cs.chromium.org/chromium/src/components/metrics/metrics_service.cc?type=cs&sq=package:chromium&g=0&l=608


1) Tasks should be posted by using |PostAfterStartupTask|
https://cs.chromium.org/chromium/src/content/public/browser/browser_thread.h?l=162

2) Tasks should not be posted on the CrBrowserMain thread.
 
jank_opennewlog.png
16.7 KB View Download
Cc: bcwh...@chromium.org
Components: Internals>Metrics
bcwhite@, can you take a look at whether switching to PostAfterStartupTask is appropriate here, and also whether we can run either or both of these tasks on background threads?  I feel like this in an area of code that you're somewhat familiar with, and that likely aligns well how you think through races / priorities / handoffs and whatnot.  thanks!
To clarify, the PostAfterStartup won't fix the case. It's just better than posting the task 30 sec after startup (I think). gab@/fdoray@ can tell you more about it.
 Time    Report
51.184	df46184a6aec2298	chrome	70.0.3514.0	
37.715	2cf57d6b9b5f76cd	chrome	70.0.3521.2
29.86	516188c4fdd6bece	chrome	70.0.3521.2
23.461	7356c93b6118a4c1	chrome	70.0.3524.0
23.041	656a5be2c2f96cd8	chrome	70.0.3521.2
22.718	d4a69f1c75d23c1a	chrome	70.0.3524.0
20.915	6cdda3403b6a4adc	chrome	70.0.3521.2
Those tasks are delayed 30 and 60 seconds, respectively, on desktop and 5/10 seconds on mobile.  Posting them after startup would likely make them run sooner but it would be fine to do so.

We've talked in the past about making Metric collection occur on a background thread but it would be non-trivial to do so.
Cc: ssid@chromium.org
We've turn on StackSampling on browser main thread. This is helping us to figure out where the time is spent on these long tasks.

Reports (symbolized version attached):
  02a2fe715b118f4a
  bce247c9cd36c1bd
  489b48e794e05be8

It seems that most of the time is spent on that following stackframe:

 base::PersistentMemoryAllocator::Iterator::GetNext(unsigned int *)
 base::PersistentMemoryAllocator::Iterator::GetNextOfType(unsigned int)
 base::PersistentSampleMap::GetNextPersistentRecord(base::PersistentMemoryAllocator::Iterator &,unsigned __int64 *)
 base::PersistentSparseHistogramDataManager::LoadRecords(base::PersistentSampleMapRecords *)
 base::PersistentSampleMapRecords::GetNext()
 base::PersistentSampleMap::ImportSamples(int,bool)
 base::PersistentSampleMap::Iterator()
 base::HistogramSamples::Add(base::HistogramSamples const &)
 base::SparseHistogram::SnapshotFinalDelta()
 base::HistogramSnapshotManager::PrepareFinalDelta(base::HistogramBase const *)
 metrics::FileMetricsProvider::RecordHistogramSnapshotsFromSource(base::HistogramSnapshotManager *,metrics::FileMetricsProvider::SourceInfo *)
 metrics::FileMetricsProvider::ProvideIndependentMetrics(metrics::SystemProfileProto *,base::HistogramSnapshotManager *)
 metrics::MetricsLog::LoadIndependentMetrics(metrics::MetricsProvider *)
 metrics::MetricsService::PrepareProviderMetricsLog()
 metrics::MetricsService::PrepareProviderMetricsTask()
 base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *)
 base::MessageLoop::RunTask(base::PendingTask *)
 base::MessageLoop::DoDelayedWork(base::TimeTicks *)
 base::MessagePumpForUI::DoRunLoop()
 base::MessagePumpWin::Run(base::MessagePump::Delegate *)
 base::RunLoop::Run()
 ChromeBrowserMainParts::MainMessageLoopRun(int *)
 content::BrowserMainLoop::RunMainMessageLoopParts()
 content::BrowserMainRunnerImpl::Run()
 content::BrowserMain(content::MainFunctionParams const &)
 content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *)
 content::ContentMainRunnerImpl::Run(bool)
 service_manager::Main(service_manager::MainParams const &)
 content::ContentMain(content::ContentMainParams const &)
 ChromeMain
 MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks)
 wWinMain
 __scrt_common_main_seh
trace-02a2fe715b118f4a.json.gz
528 KB Download
trace-489b48e794e05be8.json.gz
1000 KB Download
trace-bce247c9cd36c1bd.json.gz
1.2 MB Download
Labels: Hotlist-SamplingProfilerInField
I'm looking to the second task PrepareProviderMetricsTask

Report: fefb7485675c4fd2 (symbolized attached)

Time is spent on this stackframe:

 base::PersistentMemoryAllocator::Iterator::GetNext(unsigned int *)
 base::PersistentMemoryAllocator::Iterator::GetNextOfType(unsigned int)
 metrics::PersistentSystemProfile::RecordAllocator::NextSegment()
 metrics::PersistentSystemProfile::RecordAllocator::Read(metrics::PersistentSystemProfile::RecordType *,std::basic_string<char,std::char_traits<char>,std::allocator<char> > *)
 metrics::PersistentSystemProfile::MergeUpdateRecords(base::PersistentMemoryAllocator const &,metrics::SystemProfileProto *)
 metrics::PersistentSystemProfile::GetSystemProfile(base::PersistentMemoryAllocator const &,metrics::SystemProfileProto *)
 metrics::FileMetricsProvider::ProvideIndependentMetrics(metrics::SystemProfileProto *,base::HistogramSnapshotManager *)
 metrics::MetricsLog::LoadIndependentMetrics(metrics::MetricsProvider *)
 metrics::MetricsService::PrepareProviderMetricsLog()
 metrics::MetricsService::PrepareProviderMetricsTask()
 base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *)
 base::MessageLoop::RunTask(base::PendingTask *)
 base::MessageLoop::DoDelayedWork(base::TimeTicks *)
 base::MessagePumpForUI::DoRunLoop()
 base::MessagePumpWin::Run(base::MessagePump::Delegate *)
 base::RunLoop::Run()
 ChromeBrowserMainParts::MainMessageLoopRun(int *)
 content::BrowserMainLoop::RunMainMessageLoopParts()
 content::BrowserMainRunnerImpl::Run()
 content::BrowserMain(content::MainFunctionParams const &)
 content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *)
 content::ContentMainRunnerImpl::Run(bool)
 service_manager::Main(service_manager::MainParams const &)
 content::ContentMain(content::ContentMainParams const &)
 ChromeMain
 MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks)
 wWinMain
trace-fefb7485675c4fd2.json.gz
601 KB Download
Those call stacks indicate that we're spending a lot of time in getting histograms or system profile from the memory-mapped persistent histograms file.

Ideally, this should be in RAM in addition to disk and therefore should not be slow. Do we know if the systems have memory pressure such that maybe Windows may have purged the RAM parts of those files and has to swap them in when we're accessing them?
It does not look like the system itself was running slow during the trace. One other observation is that there were multiple other tasks on the worker threads that were trying to read files like sqlite_persistent_cookie_store, in_memory_url_index and ranker_model_loader_impl.
Another thing to note is that this occurred about 80 seconds after startup.

Is it possible to avoid reading from disk on main thread? like pin memory on RAM?
Actually, looking at things more, I don't see OpenNewLog on the stack here?

The stacks above do seem to indicate a separate issue where it does look like we're doing IO on main thread, which we shouldn't be doing. The intention is for the work to happen on a background thread, which CheckAndMergeMetricSourcesOnTaskRunner() is meant to be doing. But looks like some IO work is ending up on the main thread.

Since this seems different than just in OpenNewLog, I went ahead and filed a new bug for that:
https://bugs.chromium.org/p/chromium/issues/detail?id=882892
Blockedon: 882892
https://chromium-review.googlesource.com/1220310 enabled caching of the file on a background thread so that it shouldn't have to do any I/O on the UI thread.

There is a 50/50 experiment in Canary & Dev:
https://cs.corp.google.com/piper///depot/google3/googledata/googleclient/chrome/finch/studies/CacheFileMetricsData.json?q=cachefilemetricsdata.json&g=0&l=1

Can you check versions 71.0.3556.0 or above to see if the Jank has improved?
From #c10:
 > Actually, looking at things more, I don't see OpenNewLog on the stack here?

It make sense since OpenNewLog is the trace event title, which is the FROM_HERE. Thus, it's the source where the tasks causing trouble is posted.

There are two cases:

    base::SequencedTaskRunnerHandle::Get()->PostDelayedTask(
        FROM_HERE,
        base::BindOnce(&MetricsService::StartInitTask,
                       self_ptr_factory_.GetWeakPtr()),
        base::TimeDelta::FromSeconds(kInitializationDelaySeconds));

    base::SequencedTaskRunnerHandle::Get()->PostDelayedTask(
        FROM_HERE,
        base::BindOnce(&MetricsService::PrepareProviderMetricsTask,
                       self_ptr_factory_.GetWeakPtr()),
        base::TimeDelta::FromSeconds(2 * kInitializationDelaySeconds));


We can see the following stackframe part of the sampling profiler results:

  * MetricsService::PrepareProviderMetricsTask
> Can you check versions 71.0.3556.0 or above to see if the Jank has improved?

Unfortunately, with slow-reports that's not an easy tasks since traces are not aware of experiments.
If I go 100% can you see a difference by version?

How about we just add a histogram to measure the timing of the code in question? Then we can easily split by the experiment and also understand e.g. how widespread this is.
Will do.
I'm landing the code in the slow-reports pipeline to support experiments.
Labels: Performance-Browser
Owner: bcwh...@chromium.org
Status: Fixed (was: Untriaged)
This should be fixed.
thx to bcwhite@chromium.org

Sign in to add a comment