Potential jank caused by histogram synchronizer |
||||
Issue descriptionWe are observing long running tasks that are causing jank on the browser process main thread through slow-reports. The current scenario is happening when there is an ImportantFileWriter::WriteNow at the same time than an histograms synchronisation. (see attachment) This can be observed in these slow-reports (on crash): be7f01923eeccd7b ~56 seconds. I spent some time looking to this issue and I think the problem is related to the lock in FindHistogram (called by WriteNow) which is also used by the histogram. WriteNow: https://cs.chromium.org/chromium/src/base/files/important_file_writer.cc?l=115 https://cs.chromium.org/chromium/src/base/metrics/statistics_recorder.cc?l=194 GetHistograms: https://cs.chromium.org/chromium/src/base/metrics/statistics_recorder.cc?l=331 But this lock is not enough to cause the jank. I'm still missing something. Strangely, there is a watchdog of minutes, which is aligned with the trace observed: https://cs.chromium.org/chromium/src/content/browser/histogram_synchronizer.cc?sq=package:chromium&dr&g=0&l=206 https://cs.chromium.org/chromium/src/content/browser/histogram_synchronizer.cc?sq=package:chromium&dr&g=0&l=250
,
Sep 7
So looking at this more, the task is the one that's posted by InternalPostTask - so actually it's not the histogram synchronizer code itself but whatever callback is given to it. So let's trace back where it's coming from: InternalPostTask SetTaskRunnerAndCallback HistogramSynchronizer::FetchHistogramsAsynchronously content::FetchHistogramsAsynchronously ChromeMetricsServiceClient::OnMemoryDetailCollectionDone ChromeMetricsServiceClient::OnHistogramSynchronizationDone ChromeMetricsServiceClient::CollectFinalMetricsForLog -> MetricsService::OnFinalLogInfoCollectionDone So that's the culprit I think. So interestingly, this *can* call OpenNewLog(). So I think this is likely the same issue that Etienne was seeing in another trace. (Not 100% sure because it's not clear that OpenNewLog() inside MetricsService::OnFinalLogInfoCollectionDone is to blame or something else.) .... I think stack sampled metrics would be a good use here rather than trying to reason about all of this manually. Etienne, is that something you're already collaborating on Mike with? Or when you mentioned stacks was that something else?
,
Sep 7
FYI This is what OnFinalLogInfoCollectionDone looks like in sampling profiler data: https://uma.googleplex.com/callstacks?sid=891c3560dbe29e34891ad81bf240442c Nothing stands out very obviously. (Also I appreciate if someone can double check my callback following logic in comment 2 to make sure I didn't miss something.)
,
Sep 11
,
Sep 20
I personally use the UMA sampler to understand the calltree. It ease debugging a lot. We are rolling out a version of slow-reports that is running the sampler profiler. It will ease the correlation of both. Some results should come in a few days. Keep in mind that the UMA sampler profile is the "average" user, but the slow-reports is the 99th percentile user. We won't see the same profile and often some rare codepaths may be executed.
,
Sep 26
|
||||
►
Sign in to add a comment |
||||
Comment 1 by asvitk...@chromium.org
, Sep 7