Chrome creates an 8MB file on start-up called BrowserMetrics-spare.pma |
||||||||||||||
Issue descriptionChrome Version: Tested M69 and M71 OS: Android What steps will reproduce the problem? (1) Start Chrome (2) run: adb shell su 0 sh -c 'ls\ -l\ /data/data/com.android.chrome/app_chrome/' -rw------- 1 u0_a55 u0_a55 8388608 2018-10-17 15:11 BrowserMetrics-spare.pma Looks like this is a memory-mapped file related to histograms. See bug 546019 . I'd love to know if there are any options for making this file smaller to save disk space on low-end devices. E.g.: * Does it need to be so big for 32-bit devices? * What happens if it's too small? * Any ideas if we could reduce functionality for low-end devices? E.g. live with some metrics not being recorded
,
Oct 18
We may need to pull this into m71
,
Oct 22
What size do you want it reduced to? 2GB would cover around 99.9% of Android runs. 3GB would cover around 99.999%. Those not covered would still have the vast majority of their histograms be persistent with the few remaining ones lost if the process dies before upload (about every 5 minutes).
,
Oct 22
Assuming you mean MBs :P. I think might make sense to use different values for different devices. E.g. look at either SysInfo::AmountOfTotalDiskSpace() or SysInfo::IsLowEndDevice(). I'd guess looking at disk space makes more sense, but IsLowEndDevice() is likely to run faster since it caches its value. On high-end phones, we do have a lot of disk, so would be fine to stick with 8MB. Would 512kb be too little for low-end? Ideally, I think it'd be something like: <=8GB device: 512kb <=16GB: 2MB >16GB: 8MB
,
Oct 22
We can query the histogram for low end: https://uma.googleplex.com/p/chrome/histograms/?endDate=20181016&dayCount=1&histograms=UMA.PersistentAllocator.BrowserMetrics.UsedPct&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C4%2Cis_low_mem%2Ceq%2CTrue%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
,
Oct 22
Great! So if I'm reading it correctly, the second column is % used of 8mb, thus: 17% covers 99.6% (1.36mb) 12% covers 87% (0.96mb) I don't know how this works, but are histograms already coalesced in this format? E.g If recording the same metric a second time, will it take up new space, or just adjust the currently pending record for that value? Any concept of how important the last .4mb of metrics are compared to the first .96mb? E.g. maybe they are redundant in some way? Could look at starting small, and then growing the file when it fills up. If growing fails, go to memory. I don't have a good idea of the difficulty of this idea though...
,
Oct 22
Yes, you're reading it correctly. The histograms are "coalesced" but there is something called a "single value optimization" which keeps from allocating all the buckets when there is only a single value stored. This is useful because some histograms, startup time for example, only ever have one value. Once there are multiple (different) values, then all of the buckets are allocated and the histogram is fully formed. It'll never need more memory. Thus, longer run times of a process have a minimal increase in memory use, only adding those new ones that get hit more and more rarely. The file is a memory-mapped file and so has to be completely formed at startup. Initially, it was a "sparse" file and thus had no physical size until used. Unfortunately, memory-mapping a file means that I/O errors (like "disk full" or a bad flash) become SIGBUS/crash. Thus, the decision was made to fully realize the file at the beginning. If we ever create a SIGBUS handler (I've started one but it's not a pressing need) then we could go back to sparse files. Note that it already falls back to "non persistent memory" in the case of the persistent segment getting full. If we assume that more memory use aligns with a longer process life, then non-persistent metrics will still have time to be uploaded. I've never analyzed the "long tail" but it would have to be histograms that get hit only rarely or dynamically generated names used for some kind of experiment. Note that we're moving histograms towards "auto expiry" which should actually reduce the number of them that are recorded going forward. How important is 18MB of "disk" space on Android? (That assumes 3 files all cut from 8MB to 2MB).
,
Oct 22
@bcwhite - is it a lot of engineering work to disable persistent histograms for low-storage devices (i.e. most low-end devices). Persistent histograms is definitely very useful, but not critical for us. Stability data (on Android as of M71) is already fully encompassed by the system profile proto, so we would not lose stability data if histograms are not persistent. How much work would it be to simply disable it for storage-critical devices? i.e. the following: SysInfo::IsLowEndDevice() == true => 0 MB allocated to persistent histograms, no active or spare or previous files. SysInfo::IsLowEndDevice() == false => 8 MB allocated (same as right now). Since storage is release-critical, i.e. if we use too much then we will not be able to update Chrome and deliver security fixes, the loss of non-persisted histograms on low-end devices is a trade-off for user benefit and security that I recommend we make.
,
Oct 22
As to why it is release-critical, see: http://g/chrome-binary-size/7FUHe6mL9B0
,
Oct 22
Disabling it is trivial and can be done via code or a Finch experiment. Note that disabling persistent memory will increase heap memory as the histograms will now be stored there. The total memory use is approximately the same, just in a different location.
,
Oct 22
Sounds good. I've chatted with agrieve@ offline and this is the best course of action as we know of right now. Since low-end devices are likely the flakiest with Finch, code change would be better since this would be used in preinstalling apks or OS upgrading system apps. We'd need this merged back to M71. Would you mind starting a CL, Brian, or would you like me to take a stab at it if you're bandwidth constrained?
,
Oct 22
+benmason@ for merge approval and +ushesh@ for release context.
,
Oct 22
+benhenry@ for release context too.
,
Oct 22
> How important is 18MB of "disk" space on Android? Besides the size being occupied there are other concerns: 1. Disk thrashing. On linux all dirty pages from mmaped files get flushed to disk periodically (every 5 minutes?). For histogram data we will most likely have all of the range dirty very quickly after Chrome comes to foreground. Some low end devices are particularly concerned with the amount of writes apps are making - it noticeably shortens device lifetimes, unfortunately. An increase of 2MiB of writes per startup was once noticed by Android folks (or a partner), and we had to go with a complex change to eliminate it. See for more details: http://b/62463424 (googlers only, sorry) and crbug.com/719977. 2. Increase in jank. If a page is evicted from pagecache (which is likely for file-backed clean pages), then the next write to the page will wait synchronously for the faults to happen. This may introduce some more jank on critical path. Delays like that are hard to measure locally. We do not have a sampling profiler (yet) to confirm whether it is a real problem. It is also very device specific (devices with heavy usage of ZRAM are behind too many faults anyway). Would a Finch trial be appropriate? Were these points considered a part of the tradeoff for Android? (I was not included in the discussions) Relevant discussion we had in 2015 about this: https://groups.google.com/a/chromium.org/d/msg/chromium-dev/Gqx-bZaUICU/vL62U8Q-CgAJ
,
Oct 22
Persistence is more important for Android than any other platform because tasks get aborted frequently and without warning. Ideally there would be an option to not flush dirty pages unless the system were shutting down.
,
Oct 22
> Ideally there would be an option to not flush dirty pages unless the system were shutting down. This is a reasonable request for Android team for one of the next releases, likely not earlier than R. If everything goes well, we will see non-trivial traffic from those devices in 4 years.
,
Oct 23
> Persistence is more important for Android than any other platform because tasks get aborted frequently and without warning. Browser process rarely gets killed on foreground (and if it does fail this way we have breakpad/crashpad for OOM, which could also dump histograms .. I think). We can flush histograms quickly after going to background (we do something like that in simplecache, and sometimes end up with lost disk writes, I can dig up some numbers if you want). We would not get 100% persistency this way, it is a tradeoff for your consideration. Lockfree shared memory region for non-browser-processes would probably work as well? I believe it can be a single region which all processes share.
,
Oct 23
All sub-processes use a shared memory segment (no on-disk backing file) between it and the Browser. When they die and the Browser is notified, histograms are collected from that segment and merged with the Browser's. Unfortunately, sharing isn't allowed for security reasons as it could allow a compromised renderer to learn things about other processes. There is code to flush a metrics log when going into background but we still saw a big jump in collected metrics when persistence was enabled. It's especially useful to have metrics when there are crashes as the metrics near the crash site are often the most useful for debugging that crash.
,
Oct 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad Author: Brian White <bcwhite@chromium.org> Date: Wed Oct 24 15:47:20 2018 Disable mapped-file persistent metrics on low-end devices. Also, reduce file size from 8MB to 4MB because that's what is reported by the most recent metrics indicate. And remove a block of obsolete code doing an unnecessary disk access. Bug: 896394 Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4 Reviewed-on: https://chromium-review.googlesource.com/c/1294293 Reviewed-by: Peter Wen <wnwen@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Peter Wen <wnwen@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#602352} [modify] https://crrev.com/cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad/chrome/browser/chrome_browser_field_trials.cc
,
Oct 24
Thanks Brian! I can start merging it to M71 if you haven't already.
,
Oct 24
It needs to run in Canary for a bit. Let's wait until Monday.
,
Oct 24
Sounds good, will leave you to merge it then.
,
Oct 25
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review Please contact the milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 25
CL in #19 is introduced in 72.0.3591.0. Doesn't look like any new crashes from it: https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Android%27&compProp=product.Version&v1=72.0.3591.0&v2=72.0.3590.0#magicsignature:100,-magicsignature2:30,-stablesignature:30
,
Oct 25
Don't expect it to but want to look at the histograms about it once there are a few days of reported data.
,
Oct 25
Ah, makes sense. Ideally we would want histograms isolated for low end devices, but I think that canary won't yield enough of those to be statistically significant. Hopefully once this is merged to M71 and it's promoted to Beta, we'd get better coverage for low end devices.
,
Oct 29
Looking at data up through Saturday... The reduced file size (from 8MiB to 4MiB) shows a maximum use of 60%: https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=7&histograms=UMA.PersistentAllocator.BrowserMetrics.UsedPct&fixupData=true&showMax=true&filters=simple_version%2Ceq%2C72.0.3591.0%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial There is a noticeable drop (about 3-5%) in the number of records for Android devices: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=c2d507325bd0dbf2b92118c9fef3c017 While the drop in records for Android is in itself not significant, it isn't evenly distributed and all metric results will be now be biased towards non-low-end devices, more so for those metrics generally collected near the end of the process lifetime. There are currently only two days worth of data with the change (starting with "A21" on Oct 21) and so the difference should increase asymptotically as all devices update to the latest Canary versions. So there is some "cost" to this change. From the discussions, it sounds like you feel the improvement to be well worth that cost. Correct? Do you want to start the merge today? I wouldn't mind seeing a few more days worth of data but don't have any objection to doing merging sooner.
,
Oct 29
Thank you for your analysis, Brian. The cost of this change is non-trivial, but it is worthwhile because it contributes to unblocking the entire Chrome app blacklisted from being able to be updated on low-end phones. For low-end phones, we are willing to accept a loss of metrics so we can have better security and user-visible improvements through future updates. Please start the merge today as we want this to be in the next M71 beta to see fuller impact. I will bring this to benmason@ for merge review.
,
Oct 29
Looking at this again, seems like most of the dip is caused entirely by changing 8MB -> 4MB rather than turning off on the low-end (at least on canary). So metrics are not being biased towards non-low-end any more than today. Low-memory: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=f6036fbaa46fbc16da602539e041d3d9 Not low-memory: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=7cbf532f7fb8e3aae1e96f7280b0c69b
,
Oct 29
The 8->4 change won't affect the number of records. It could if it was full but there were no reports greater than 60% used. "Low Memory Device" is exactly the same as "IsLowEndDevice"?
,
Oct 29
They are supposed to be equal, but sometimes can be a little off. Used to be <512MB but recently changed to <1024MB for O+. Android: https://developer.android.com/reference/android/app/ActivityManager.html#isLowRamDevice() Ours: https://cs.chromium.org/chromium/src/base/android/java/src/org/chromium/base/SysUtils.java?l=158&rcl=8df245c788f2a14c38efc82ebb8e88b8fdd7fc3e Histogram to see if they are equal: https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=1&histograms=Android.SysUtilsLowEndMatches&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C1%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
,
Oct 29
Here's the same histogram restricted to Low Memory Device: https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=1&histograms=Android.SysUtilsLowEndMatches&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C1%2Cis_low_mem%2Ceq%2CTrue%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial Higher percentage of not equals (20%), but for the most part equal.
,
Oct 29
I was wrong, turns out they are exactly equal. Here's the code that generates MemoryAndroid.LowRamDevice (it uses IsLowEndDevice explicitly): https://cs.chromium.org/chromium/src/chrome/browser/metrics/android_metrics_provider.cc?l=22&rcl=a0353e806c227e86965afe6678d3a9caa6d6d6ab This is then used in stability histograms: http://google3/analysis/uma/dashboards/uuma/histograms/columnio/chrome/histograms_columnio.proto?l=197&rcl=217564626 Which is then used for our uma filter: http://google3/analysis/uma/dashboards/website/product/tables/chrome_tables.py?l=513&rcl=218725655 It's Android's definition that is slightly different than ours, but within chrome and UMA this is consistent.
,
Oct 29
See: https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=1&histograms=MemoryAndroid.LowRamDevice&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C1%2Cis_low_mem%2Ceq%2CTrue%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial 9265 true, 0 false https://uma.googleplex.com/p/chrome/histograms/?endDate=20181027&dayCount=1&histograms=MemoryAndroid.LowRamDevice&fixupData=true&showMax=true&filters=platform%2Ceq%2CA%2Cchannel%2Ceq%2C1%2Cis_low_mem%2Ceq%2CFalse%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial 0 true, 667285 false
,
Oct 29
I don't see what would make the difference, then. I can create a Finch experiment to force on-disk, local-memory, and default for 33% to compare them.
,
Oct 29
Windows and MacOS also fell more than usual around the same days, though it's harder to see on their graphs since they have bigger bumps.
,
Oct 29
Are you sure you're not just looking at weekend effects?
,
Oct 29
No, which is why I originally suggested looking at a few more days. But it's really odd that it seems to be affecting the opposite set of devices.
,
Oct 30
Merge approved to 71, branch 3578.
,
Oct 30
Today's metrics don't show any significant difference from yesterday. Going ahead with the merge...
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/07497d997da6250028aba311d4f5cb11e585dc07 commit 07497d997da6250028aba311d4f5cb11e585dc07 Author: Brian White <bcwhite@chromium.org> Date: Tue Oct 30 17:38:58 2018 Disable mapped-file persistent metrics on low-end devices. Also, reduce file size from 8MB to 4MB because that's what is reported by the most recent metrics indicate. And remove a block of obsolete code doing an unnecessary disk access. Bug: 896394 Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4 Reviewed-on: https://chromium-review.googlesource.com/c/1294293 Reviewed-by: Peter Wen <wnwen@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Peter Wen <wnwen@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#602352}(cherry picked from commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad) Reviewed-on: https://chromium-review.googlesource.com/c/1308037 Reviewed-by: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/branch-heads/3578@{#403} Cr-Branched-From: 4226ddf99103e493d7afb23a4c7902ee496108b6-refs/heads/master@{#599034} [modify] https://crrev.com/07497d997da6250028aba311d4f5cb11e585dc07/chrome/browser/chrome_browser_field_trials.cc
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/07497d997da6250028aba311d4f5cb11e585dc07 Commit: 07497d997da6250028aba311d4f5cb11e585dc07 Author: bcwhite@chromium.org Commiter: bcwhite@chromium.org Date: 2018-10-30 17:38:58 +0000 UTC Disable mapped-file persistent metrics on low-end devices. Also, reduce file size from 8MB to 4MB because that's what is reported by the most recent metrics indicate. And remove a block of obsolete code doing an unnecessary disk access. Bug: 896394 Change-Id: I8830fce2f467044550d5f01597ae5d1a998689f4 Reviewed-on: https://chromium-review.googlesource.com/c/1294293 Reviewed-by: Peter Wen <wnwen@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Peter Wen <wnwen@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#602352}(cherry picked from commit cc7f6c23fcc794a6bbc4acd35f7fee77e9de53ad) Reviewed-on: https://chromium-review.googlesource.com/c/1308037 Reviewed-by: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/branch-heads/3578@{#403} Cr-Branched-From: 4226ddf99103e493d7afb23a4c7902ee496108b6-refs/heads/master@{#599034}
,
Oct 30
Thanks for monitoring and merging this, Brian! I'll also keep an eye on the graphs when the next beta rolls out.
,
Nov 12
Everything working out okay?
,
Nov 12
Yep! Low end looks great: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=6e0ed2e9ec48a0d44da91fbb17199457
,
Nov 15
bcwhite: is there a graph demonstrating the increase in UMA data received because of the switch to persistent histograms on Android? I could not find one easily :/
,
Nov 15
Removing the low-end conditional also doesn't change the graph significantly: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=82d44ec89678a530e87991c9d141a808
,
Nov 20
bcwhite: friendly ping, can you plz show some data on how persistent histograms are helping us on Android? |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by bcwh...@chromium.org
, Oct 17