New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 896394 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 12
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug



Sign in to add a comment

Chrome creates an 8MB file on start-up called BrowserMetrics-spare.pma

Project Member Reported by agrieve@chromium.org, Oct 17

Issue description

Chrome Version: Tested M69 and M71
OS: Android

What steps will reproduce the problem?
(1) Start Chrome
(2) run: adb shell su 0 sh -c 'ls\ -l\ /data/data/com.android.chrome/app_chrome/'

-rw-------  1 u0_a55 u0_a55 8388608 2018-10-17 15:11 BrowserMetrics-spare.pma


Looks like this is a memory-mapped file related to histograms. See  bug 546019 .

I'd love to know if there are any options for making this file smaller to save disk space on low-end devices. E.g.:
 * Does it need to be so big for 32-bit devices?
 * What happens if it's too small? 
 * Any ideas if we could reduce functionality for low-end devices? E.g. live with some metrics not being recorded
 
The 99th percentile for "used percent" on Android is about 20%.
https://uma.googleplex.com/p/chrome/histograms/?endDate=20181016&dayCount=1&histograms=UMA.PersistentAllocator.BrowserMetrics.UsedPct&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

We could thus cut the size of metrics files down to as little as 2GB and still cover most everything.

If it fills up completely, then new histograms won't be persistent; they'll be uploaded if the process lives long enough for an upload sequence (about every 5 minutes) but will be lost if the process restarts before then (a common occurrence on Android).

Note that there are actually 2 or more of these files.
- active one
- spare one
- previous one(s)

"Active" is the one in use.  "Spare" is created shortly after startup and used by the next run so startup isn't delayed creating the file.  In fact, persistent metrics is disabled on Android if there isn't a spare file because the startup time increase would be too large.

"Previous" is/are from previous runs that have not yet been uploaded.  This begins about 30 or 60 seconds after startup.  Each is deleted after upload.
Labels: -Pri-3 Target-71 Pri-1
We may need to pull this into m71
What size do you want it reduced to?  2GB would cover around 99.9% of Android runs.  3GB would cover around 99.999%.

Those not covered would still have the vast majority of their histograms be persistent with the few remaining ones lost if the process dies before upload (about every 5 minutes).

Cc: pasko@chromium.org
Assuming you mean MBs :P.

I think might make sense to use different values for different devices.
E.g. look at either SysInfo::AmountOfTotalDiskSpace() or SysInfo::IsLowEndDevice().

I'd guess looking at disk space makes more sense, but IsLowEndDevice() is likely to run faster since it caches its value.

On high-end phones, we do have a lot of disk, so would be fine to stick with 8MB.
Would 512kb be too little for low-end?


Ideally, I think it'd be something like:
<=8GB device: 512kb
<=16GB: 2MB
>16GB: 8MB
Great!

So if I'm reading it correctly, the second column is % used of 8mb, thus:
17% covers 99.6% (1.36mb)
12% covers 87% (0.96mb)

I don't know how this works, but are histograms already coalesced in this format? E.g If recording the same metric a second time, will it take up new space, or just adjust the currently pending record for that value? 

Any concept of how important the last .4mb of metrics are compared to the first .96mb? E.g. maybe they are redundant in some way? 

Could look at starting small, and then growing the file when it fills up. If growing fails, go to memory. I don't have a good idea of the difficulty of this idea though...
Yes, you're reading it correctly.

The histograms are "coalesced" but there is something called a "single value optimization" which keeps from allocating all the buckets when there is only a single value stored.  This is useful because some histograms, startup time for example, only ever have one value.

Once there are multiple (different) values, then all of the buckets are allocated and the histogram is fully formed.  It'll never need more memory.

Thus, longer run times of a process have a minimal increase in memory use, only adding those new ones that get hit more and more rarely.

The file is a memory-mapped file and so has to be completely formed at startup.  Initially, it was a "sparse" file and thus had no physical size until used.  Unfortunately, memory-mapping a file means that I/O errors (like "disk full" or a bad flash) become SIGBUS/crash.  Thus, the decision was made to fully realize the file at the beginning.  If we ever create a SIGBUS handler (I've started one but it's not a pressing need) then we could go back to sparse files.

Note that it already falls back to "non persistent memory" in the case of the persistent segment getting full.  If we assume that more memory use aligns with a longer process life, then non-persistent metrics will still have time to be uploaded.

I've never analyzed the "long tail" but it would have to be histograms that get hit only rarely or dynamically generated names used for some kind of experiment.

Note that we're moving histograms towards "auto expiry" which should actually reduce the number of them that are recorded going forward.

How important is 18MB of "disk" space on Android?  (That assumes 3 files all cut from 8MB to 2MB).
Cc: wnwen@chromium.org
@bcwhite - is it a lot of engineering work to disable persistent histograms for low-storage devices (i.e. most low-end devices).

Persistent histograms is definitely very useful, but not critical for us. Stability data (on Android as of M71) is already fully encompassed by the system profile proto, so we would not lose stability data if histograms are not persistent.

How much work would it be to simply disable it for storage-critical devices? i.e. the following:

SysInfo::IsLowEndDevice() == true => 0 MB allocated to persistent histograms, no active or spare or previous files.
SysInfo::IsLowEndDevice() == false => 8 MB allocated (same as right now).

Since storage is release-critical, i.e. if we use too much then we will not be able to update Chrome and deliver security fixes, the loss of non-persisted histograms on low-end devices is a trade-off for user benefit and security that I recommend we make.
As to why it is release-critical, see: http://g/chrome-binary-size/7FUHe6mL9B0
Disabling it is trivial and can be done via code or a Finch experiment.

Note that disabling persistent memory will increase heap memory as the histograms will now be stored there.  The total memory use is approximately the same, just in a different location.

Sounds good. I've chatted with agrieve@ offline and this is the best course of action as we know of right now.

Since low-end devices are likely the flakiest with Finch, code change would be better since this would be used in preinstalling apks or OS upgrading system apps.

We'd need this merged back to M71. Would you mind starting a CL, Brian, or would you like me to take a stab at it if you're bandwidth constrained?
Cc: benmason@chromium.org ushesh@chromium.org
+benmason@ for merge approval and +ushesh@ for release context.
Cc: benhenry@chromium.org
+benhenry@ for release context too.
Cc: lizeb@chromium.org
> How important is 18MB of "disk" space on Android?

Besides the size being occupied there are other concerns:

1. Disk thrashing.

On linux all dirty pages from mmaped files get flushed to disk periodically (every 5 minutes?). For histogram data we will most likely have all of the range dirty very quickly after Chrome comes to foreground. Some low end devices are particularly concerned with the amount of writes apps are making - it noticeably shortens device lifetimes, unfortunately. An increase of 2MiB of writes per startup was once noticed by Android folks (or a partner), and we had to go with a complex change to eliminate it. See for more details: http://b/62463424 (googlers only, sorry) and crbug.com/719977.

2. Increase in jank.

If a page is evicted from pagecache (which is likely for file-backed clean pages), then the next write to the page will wait synchronously for the faults to happen. This may introduce some more jank on critical path. Delays like that are hard to measure locally. We do not have a sampling profiler (yet) to confirm whether it is a real problem. It is also very device specific (devices with heavy usage of ZRAM are behind too many faults anyway). Would a Finch trial be appropriate?

Were these points considered a part of the tradeoff for Android? (I was not included in the discussions)

Relevant discussion we had in 2015 about this: https://groups.google.com/a/chromium.org/d/msg/chromium-dev/Gqx-bZaUICU/vL62U8Q-CgAJ
Persistence is more important for Android than any other platform because tasks get aborted frequently and without warning.  Ideally there would be an option to not flush dirty pages unless the system were shutting down.

> Ideally there would be an option to not flush dirty pages unless the system were shutting down.

This is a reasonable request for Android team for one of the next releases, likely not earlier than R. If everything goes well, we will see non-trivial traffic from those devices in 4 years.
> Persistence is more important for Android than any other platform because tasks get aborted frequently and without warning.

Browser process rarely gets killed on foreground (and if it does fail this way we have breakpad/crashpad for OOM, which could also dump histograms .. I think). We can flush histograms quickly after going to background (we do something like that in simplecache, and sometimes end up with lost disk writes, I can dig up some numbers if you want). We would not get 100% persistency this way, it is a tradeoff for your consideration.

Lockfree shared memory region for non-browser-processes would probably work as well? I believe it can be a single region which all processes share.
Cc: asvitk...@chromium.org
All sub-processes use a shared memory segment (no on-disk backing file) between it and the Browser.  When they die and the Browser is notified, histograms are collected from that segment and merged with the Browser's.

Unfortunately, sharing isn't allowed for security reasons as it could allow a compromised renderer to learn things about other processes.

There is code to flush a metrics log when going into background but we still saw a big jump in collected metrics when persistence was enabled.  It's especially useful to have metrics when there are crashes as the metrics near the crash site are often the most useful for debugging that crash.
Project Member

Comment 19 by bugdroid1@chromium.org, Oct 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad

commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad
Author: Brian White <bcwhite@chromium.org>
Date: Wed Oct 24 15:47:20 2018

Disable mapped-file persistent metrics on low-end devices.

Also, reduce file size from 8MB to 4MB because that's what is reported
by the most recent metrics indicate.  And remove a block of obsolete
code doing an unnecessary disk access.

Bug:  896394 
Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4
Reviewed-on: https://chromium-review.googlesource.com/c/1294293
Reviewed-by: Peter Wen <wnwen@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Peter Wen <wnwen@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602352}
[modify] https://crrev.com/cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad/chrome/browser/chrome_browser_field_trials.cc

Labels: Merge-Request-71
Thanks Brian! I can start merging it to M71 if you haven't already.
It needs to run in Canary for a bit.  Let's wait until Monday.

Sounds good, will leave you to merge it then.
Project Member

Comment 23 by sheriffbot@chromium.org, Oct 25

Labels: -Merge-Request-71 Hotlist-Merge-Review Merge-Review-71
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Don't expect it to but want to look at the histograms about it once there are a few days of reported data.
Ah, makes sense. Ideally we would want histograms isolated for low end devices, but I think that canary won't yield enough of those to be statistically significant.

Hopefully once this is merged to M71 and it's promoted to Beta, we'd get better coverage for low end devices.
Looking at data up through Saturday...

The reduced file size (from 8MiB to 4MiB) shows a maximum use of 60%:
https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=7&histograms=UMA.PersistentAllocator.BrowserMetrics.UsedPct&fixupData=true&showMax=true&filters=simple_version%2Ceq%2C72.0.3591.0%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

There is a noticeable drop (about 3-5%) in the number of records for Android devices:
https://uma.googleplex.com/p/chrome/timeline_v2/?sid=c2d507325bd0dbf2b92118c9fef3c017

While the drop in records for Android is in itself not significant, it isn't evenly distributed and all metric results will be now be biased towards non-low-end devices, more so for those metrics generally collected near the end of the process lifetime.

There are currently only two days worth of data with the change (starting with "A21" on Oct 21) and so the difference should increase asymptotically as all devices update to the latest Canary versions.

So there is some "cost" to this change.  From the discussions, it sounds like you feel the improvement to be well worth that cost.  Correct?

Do you want to start the merge today?  I wouldn't mind seeing a few more days worth of data but don't have any objection to doing merging sooner.
Thank you for your analysis, Brian. The cost of this change is non-trivial, but it is worthwhile because it contributes to unblocking the entire Chrome app blacklisted from being able to be updated on low-end phones.

For low-end phones, we are willing to accept a loss of metrics so we can have better security and user-visible improvements through future updates.

Please start the merge today as we want this to be in the next M71 beta to see fuller impact. I will bring this to benmason@ for merge review. 
Looking at this again, seems like most of the dip is caused entirely by changing 8MB -> 4MB rather than turning off on the low-end (at least on canary). So metrics are not being biased towards non-low-end any more than today. 

Low-memory: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=f6036fbaa46fbc16da602539e041d3d9

Not low-memory: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=7cbf532f7fb8e3aae1e96f7280b0c69b
The 8->4 change won't affect the number of records.  It could if it was full but there were no reports greater than 60% used.

"Low Memory Device" is exactly the same as "IsLowEndDevice"?

I was wrong, turns out they are exactly equal.

Here's the code that generates MemoryAndroid.LowRamDevice (it uses IsLowEndDevice explicitly): https://cs.chromium.org/chromium/src/chrome/browser/metrics/android_metrics_provider.cc?l=22&rcl=a0353e806c227e86965afe6678d3a9caa6d6d6ab

This is then used in stability histograms: http://google3/analysis/uma/dashboards/uuma/histograms/columnio/chrome/histograms_columnio.proto?l=197&rcl=217564626

Which is then used for our uma filter: http://google3/analysis/uma/dashboards/website/product/tables/chrome_tables.py?l=513&rcl=218725655

It's Android's definition that is slightly different than ours, but within chrome and UMA this is consistent.
I don't see what would make the difference, then.  I can create a Finch experiment to force on-disk, local-memory, and default for 33% to compare them.
Windows and MacOS also fell more than usual around the same days, though it's harder to see on their graphs since they have bigger bumps.
Are you sure you're not just looking at weekend effects?
No, which is why I originally suggested looking at a few more days.

But it's really odd that it seems to be affecting the opposite set of devices.

Labels: -Hotlist-Merge-Review -Merge-Review-71 Merge-Approved-71
Merge approved to 71, branch 3578.
Today's metrics don't show any significant difference from yesterday.  Going ahead with the merge...
Project Member

Comment 41 by bugdroid1@chromium.org, Oct 30

Labels: -merge-approved-71 merge-merged-3578
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/07497d997da6250028aba311d4f5cb11e585dc07

commit 07497d997da6250028aba311d4f5cb11e585dc07
Author: Brian White <bcwhite@chromium.org>
Date: Tue Oct 30 17:38:58 2018

Disable mapped-file persistent metrics on low-end devices.

Also, reduce file size from 8MB to 4MB because that's what is reported
by the most recent metrics indicate.  And remove a block of obsolete
code doing an unnecessary disk access.

Bug:  896394 
Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4
Reviewed-on: https://chromium-review.googlesource.com/c/1294293
Reviewed-by: Peter Wen <wnwen@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Peter Wen <wnwen@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#602352}(cherry picked from commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad)
Reviewed-on: https://chromium-review.googlesource.com/c/1308037
Reviewed-by: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/branch-heads/3578@{#403}
Cr-Branched-From: 4226ddf99103e493d7afb23a4c7902ee496108b6-refs/heads/master@{#599034}
[modify] https://crrev.com/07497d997da6250028aba311d4f5cb11e585dc07/chrome/browser/chrome_browser_field_trials.cc

Labels: Merge-Merged-71-3578
The following revision refers to this bug: 
https://chromium.googlesource.com/chromium/src.git/+/07497d997da6250028aba311d4f5cb11e585dc07

Commit: 07497d997da6250028aba311d4f5cb11e585dc07
Author: bcwhite@chromium.org
Commiter: bcwhite@chromium.org
Date: 2018-10-30 17:38:58 +0000 UTC

Disable mapped-file persistent metrics on low-end devices.

Also, reduce file size from 8MB to 4MB because that's what is reported
by the most recent metrics indicate.  And remove a block of obsolete
code doing an unnecessary disk access.

Bug:  896394 
Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4
Reviewed-on: https://chromium-review.googlesource.com/c/1294293
Reviewed-by: Peter Wen <wnwen@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Peter Wen <wnwen@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#602352}(cherry picked from commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad)
Reviewed-on: https://chromium-review.googlesource.com/c/1308037
Reviewed-by: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/branch-heads/3578@{#403}
Cr-Branched-From: 4226ddf99103e493d7afb23a4c7902ee496108b6-refs/heads/master@{#599034}
Thanks for monitoring and merging this, Brian! I'll also keep an eye on the graphs when the next beta rolls out.
Status: Fixed (was: Assigned)
Everything working out okay?
bcwhite: is there a graph demonstrating the increase in UMA data received because of the switch to persistent histograms on Android? I could not find one easily :/
Removing the low-end conditional also doesn't change the graph significantly: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=82d44ec89678a530e87991c9d141a808
bcwhite: friendly ping, can you plz show some data on how persistent histograms are helping us on Android?

Sign in to add a comment