New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 882892 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Nov 20
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac
Pri: 2
Type: Bug

Blocking:
issue 867603



Sign in to add a comment

FileMetricsProvider::ProvideIndependentMetrics should not be doing IO on UI thread

Project Member Reported by asvitk...@chromium.org, Sep 11

Issue description

FileMetricsProvider::ProvideIndependentMetrics should not be doing IO on UI thread.

See the following two stacks:
https://bugs.chromium.org/p/chromium/issues/detail?id=867603#c5
https://bugs.chromium.org/p/chromium/issues/detail?id=867603#c7

The stacks above do seem to indicate an issue where it does look like we're doing IO on main thread, which we shouldn't be doing. The intention is for the work to happen on a background thread, which CheckAndMergeMetricSourcesOnTaskRunner() is meant to be doing. But looks like some IO work is ending up on the main thread.

Given that this can potentially cause considerable jank, we should fix.
 
Labels: M-71
Brian, can you take a look?

It would be good to fix this for M71, which branches on Oct 11.
These are the "independent" metrics, meaning files from previous runs.  It shouldn't happen much, probably only once per run.

Metric collection has always been intended to run on the main thread.  It would take a little work to ensure that another thread could access internal FMP structures.

The MetricsService would also need some changes to its state machine to recognize its in a state where another thread is doing work on the current log and not do anything that is thread-unsafe.

But perhaps the simplest solution is to undo the change that deleted the metrics file upon a clean exit.  We decided to keep it around to collect any metrics from the very end of the shutdown sequence but it means there is always one file of independent metrics to send on the next run.  By deleting the file on a clean exit, this would only happen on the first run after a browser crash.

Thoughts?
Or...  The FMP could simply copy the file data into memory on a background thread and then do the merge on the UI thread.  Thus, no disk access when it most matters but increased RAM usage when those files are awaiting upload.
Can we have the file metrics provider do all the necessary work to read those things into RAM on a background thread and then hang on to them until next time ProvideIndependentMetrics() is called so that when that function runs the things it adds would be in RAM already?

This way, we don't need huge changes to change the flow of MetricsService itself, just the logic for grabbing the data on the bg in FileMetricsProvider().
Oops, didn't see your second reply - yeah that's exactly what I was thinking. :D
Do think it would suffice to simply access the data in the file (it's a memory-mapped file) in order to bring it in to RAM and thus make for no-IO access when it's read for real?
If that's the simplest thing, then we can try that and measure the impact w/ a histogram & simple experiment.
Project Member

Comment 8 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ce24fb3c06486f202186dbf96cf82f77315345eb

commit ce24fb3c06486f202186dbf96cf82f77315345eb
Author: Brian White <bcwhite@chromium.org>
Date: Tue Sep 18 18:45:51 2018

Add Cache ability to FilePersistentMemoryAllocator.

This functionality is used by the FileMetricsProvider, currently based
on a Feature to be enabled.

Bug:  882892 
Change-Id: I55262453b58c5ed9473b33e8839bad423ca7ec84
Reviewed-on: https://chromium-review.googlesource.com/1220310
Commit-Queue: Brian White <bcwhite@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592120}
[modify] https://crrev.com/ce24fb3c06486f202186dbf96cf82f77315345eb/base/metrics/persistent_memory_allocator.cc
[modify] https://crrev.com/ce24fb3c06486f202186dbf96cf82f77315345eb/base/metrics/persistent_memory_allocator.h
[modify] https://crrev.com/ce24fb3c06486f202186dbf96cf82f77315345eb/components/metrics/file_metrics_provider.cc

Blocking: 867603
> Do think it would suffice to simply access the data in the file (it's a memory-mapped file) in order to bring it in to RAM and thus make for no-IO access when it's read for real?

I expect this to reduce the frequency of main-thread jank, but not eliminate it. 

In general, I'd be quite wary of any access to file-mapped memory on the main thread.
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8d50b33a935653c70990e00d70a74c8ac52b9123

commit 8d50b33a935653c70990e00d70a74c8ac52b9123
Author: Brian White <bcwhite@chromium.org>
Date: Fri Sep 21 17:32:23 2018

Add histogram to time loading of independent metrics.

Bug:  882892 
Change-Id: I91d2cf5b46b8ef3b790b5e00692a053686ba17d1
Reviewed-on: https://chromium-review.googlesource.com/1236460
Commit-Queue: Brian White <bcwhite@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593245}
[modify] https://crrev.com/8d50b33a935653c70990e00d70a74c8ac52b9123/components/metrics/file_metrics_provider.cc
[modify] https://crrev.com/8d50b33a935653c70990e00d70a74c8ac52b9123/tools/metrics/histograms/histograms.xml

With a few days of data...
https://uma.googleplex.com/p/chrome/variations/?sid=5162ded46d2befbcd12f84e6c8152b8c

The caching is improving the top-end by roughly 27% which is good, but a reduction from 2.7s to 2.0s is still noticeable jank when it occurs.

If the paged-in memory is getting paged-out again, then these systems must be low in memory.  There is a significant time delay (several minutes, likely) between the two events.  The MetricProvider virtual class could be extended with some kind of "PrepareOnBackgroundThread()" method that is run immediately before the collection, thus allowing the caching to be done just milliseconds before the file data is accessed on the UI thread.

Moving all metric collection to its own thread would be the best solution but a pretty big change with a potentially large can of worms attached.

This action occurs once per Chrome run to clean up any metrics from the previous run.  If we undo the no-remove-of-file on clean exit, then it would only occur on a run immediately following a crash.  Should we consider that?

That's surprisingly high still.

I wonder what would it look like if we just read the data into a buffer in RAM from the background thread (instead of memory-mapping the file) and then used the persistent memory allocater over that buffer when providing the metrics to the log?
I don't think reading it into heap memory would be a good thing.  If it's already being paged out then there's a good possibility that the OS will page-out the heap memory, too.  That happening would actually be worse because the system would have to physically write those bytes before releasing the pages.
Cc: erikc...@chromium.org etienneb@chromium.org
Arguably a 2+ s delay in the UI is a worse experience than the extra memory use/IO.

I agree that it's possible the system pages those out too. But it's not obvious to me that would be the case. I'd like to understand the trade offs more - how hard would it be to do that change (under a feature so we can see the effect)?

The other thing we could try to do is get more info about the cases where this is happening. - e.g. are they really under memory pressure? Etienne/Erik - do we have this info from the original Jank reports we were looking at?
We added some signals for disk IO, and for memory swap on windows only.
They may not be in the original traces, but will be available in recent traces.

We can easily track any changes in slow-reports based on a experiment now.
Traces from a recent chrome version will contains the list of experiments and they are extracted in our tables.

To help you dig into this issue, we can add more trace events (with args) and collect more information on the system.

Also, as this scenario seems to be quite frequent. I'll recommend we add a UMA histograms for the timing, and hook a slow-reports on large value. This will give you an easy queryable scenario.
The histogram is present and a comparison based on the experiment is linked in #12.

Caching is an improvement but not a panacea.  One problem is that there is significant time between the caching and the access of that data.
Since the caching is a clear improvement, let's just enable it by default for now. But given the metric, there's still more to do.
Agreed.
Labels: -M-71 M-72
I think it would be good to get a fix for this in for M72.
Project Member

Comment 21 by bugdroid1@chromium.org, Oct 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2868e616e219500919fde8f95f0642267dd8c245

commit 2868e616e219500919fde8f95f0642267dd8c245
Author: Brian White <bcwhite@chromium.org>
Date: Fri Oct 26 01:30:24 2018

Fetch independent metrics on background thread.

Previously, these metrics were loaded on the UI thread which, because
they are in a memory-mapped file, can cause blocking I/O and cause
jank.

Bug:  882892 
Change-Id: I23b1808b5233b5bd05a1dbe5e96ace32c73ef46b
Reviewed-on: https://chromium-review.googlesource.com/c/1292271
Commit-Queue: Brian White <bcwhite@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602968}
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/base/metrics/histogram_flattener.h
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/delegating_provider.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/delegating_provider.h
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/file_metrics_provider.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/file_metrics_provider.h
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/file_metrics_provider_unittest.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_log.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_log.h
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_provider.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_provider.h
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_service.cc
[modify] https://crrev.com/2868e616e219500919fde8f95f0642267dd8c245/components/metrics/metrics_service.h

It'll be awhile before there is sufficient data to know what effect there is from moving those file with embedded profiles to a background task.

There are other files that could be accessed on the main thread...

- CrashPad (active): This same file is open and mapped by the CrashPad process so it should be already resident in memory and not cause I/O.

- CrashPad (previous): If present, this will get sent during startup before multi-threading begins so won't be causing any jank.

- Setup (Windows only): If the setup process has run recently, there will be a file to process and such will happen on the I/O thread.  However, these are generally very small (a few KiB) and will be cached in memory on a background thread so shouldn't have a significant impact.

- Notification (Windows only): I'm not sure exactly when these get written but, like Setup, they also appear to be small.
The experiment is coming to and that controls the background merging of metrics from files on disk.  There's no significant difference in the "heartbeat" metrics, which isn't surprising.

https://uma.googleplex.com/p/chrome/variations/?sid=988e00b2d241ced47df5419c127d8e37

Looking at the four "jank" metrics, there is a notable (though statistically insignificant) reduction in Browser.Responsiveness.JankyIntervalsPerThirtySeconds.NonStartup, which is where file access would occur.  Since there is typically only one such access per run of Chrome, the effect will always be small.

https://uma.googleplex.com/p/chrome/variations/?sid=2b02b4fb1fabbd8d391079dff513f820

If you have any traces indicating Metrics I/O with the BackgoundIndependentMetrics experiment (note the missing R in "gound" :-) enabled then we should investigate.  In chrome://versions, it will appear as "788ec52f-3f4a17df".

Background execution of metric collection will be enabled by default in beta & stable in M72 forward.
From this slow-reports: df3a0514cccee32a
which seems to be part of the experiement,
we can see a 4 seconds OpeNewLog.

The problem is now in the thread creation. This is a different bug.

 NtCreateThreadEx
 CreateRemoteThreadEx
 CreateThreadStub
 base::`anonymous namespace\'::CreateThreadInternal
 base::internal::SchedulerWorker::Start(base::SchedulerWorkerObserver *)
 base::internal::SchedulerSingleThreadTaskRunnerManager::CreateCOMSTATaskRunnerWithTraits(base::TaskTraits const &,base::SingleThreadTaskRunnerThreadMode)
 base::internal::TaskSchedulerImpl::CreateCOMSTATaskRunnerWithTraits(base::TaskTraits const &,base::SingleThreadTaskRunnerThreadMode)
 base::CreateCOMSTATaskRunnerWithTraits(base::TaskTraits const &,base::SingleThreadTaskRunnerThreadMode)
 AntiVirusMetricsProvider::AsyncInit(base::RepeatingCallback<void > const &)
 metrics::DelegatingProvider::AsyncInit(base::RepeatingCallback<void > const &)
 metrics::MetricsService::StartInitTask()
 base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *)
 base::MessageLoop::RunTask(base::PendingTask *)
 base::MessageLoop::DoDelayedWork(base::TimeTicks *)
 base::MessagePumpForUI::DoRunLoop()
 base::MessagePumpWin::Run(base::MessagePump::Delegate *)
 base::RunLoop::Run()
 ChromeBrowserMainParts::MainMessageLoopRun(int *)
 content::BrowserMainLoop::RunMainMessageLoopParts()
 content::BrowserMainRunnerImpl::Run()
 content::BrowserMain(content::MainFunctionParams const &)
 content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *)
 content::ContentMainRunnerImpl::Run(bool)
 service_manager::Main(service_manager::MainParams const &)
 content::ContentMain(content::ContentMainParams const &)
 ChromeMain
 MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks)
 wWinMain
 __scrt_common_main_seh
 BaseThreadInitThunk
 RtlUserThreadStart
Status: Fixed (was: Assigned)
Okay.  Closing this bug as "fixed" then.

Sign in to add a comment