
Issue 695263

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug
Hotlist-MemoryInfra




malloc metadata/fragmentation major source of noise for reported_by_chrome:malloc:effective_size_avg

Project Member Reported by erikc...@chromium.org, Feb 23 2017

Issue description

https://chromeperf.appspot.com/report?sid=8413c6720da62c567c07ac5f06a09ac7e09b81b02e7ba61f15dbed873f4ef80c

There appears to be ~400KB of noise in the linked graph. Looking at specific traces, much of the noise appears to be in the metadata_fragmentation_cache.

https://00e9e64bac95dc901ad317127b0f2abc8c8bb6dd8ab8dad831-apidata.googleusercontent.com/download/storage/v1/b/chrome-telemetry-output/o/trace-file-id_0-2017-02-16_09-11-45-15287.html?qk=AD5uMEsV4TXjLepvhp9uEVbGCNhG_SivXcs-XcOh2oWGCuxmiA19x_6uUCX0l6YDSwRIWNRR4nfhUfQ9m1KQT4rsi1IsedymZhIkBw4HjiybRwA-OBWk1PBUBn5rrUgccy6DB9V1TB0fM52KD3g1i7_HB6pi52aULY_oiX-TGPzO0TcGS27fiF4nWzEcF_3CaPpDwe_kq9vzLa52nhVoqcXhIVUIWpsC7JtXQnyHPqxunUA0yac2Mm75GdVHyv9iwAY6DPhB7hlFCQZSsB-ooHn8IAqqEqr5hZSnboLkw_oIGRLNglNh9Kw-KgMQTD4UAUgEuhAe3tQY-xsIKQR9xbHF5Ansh-EBNbdqN_P38XoNhqNBbzO7-iGG4DI2Pl5oB1TPXF7Na_r_gQXt2AUn3zvQ6NNYtUMjGf9hYXOCiT_fPj3w8SSZZx9ZWsVYTk3KZa-DwoFrimfgB7VRGovfRxGwWuzjYIZnvbhWTeRGeFwKThVABGn8uAD1Md_M_O8rlJgpTaBsVUt-fO4dtpVDwOzhbba9a5rj2tm77rbsOMVgjkpJqHxQANCKr0Mi1ELexx8hvhpFqSau5Hd-jivB-zQqKlCeDQd8BdYHfOUkngtBDQa-97tn04d_nl2j8pDypu70l8oFGf-VJFDQoqRsYjmzup6VtVPpUf1Uu-iQCjzcdBP34gyDWbFjppgmrvqM6uBXZh5qrSuo-aQXmLq8cfpsE4O-r5b3JRUeJs5pX-Z_58_OC-DKUFW_tXboNrBE4YrYHjPTDA-7L0vjJQZGfGPItoSTWhK9mxrZ1VfqWHoGat7i0xka2lp38gh-rycRioYcuS7dUa0naEUqeDPuKFg0lSzDLkJQ8g

https://00e9e64bac9f6f5862131475e8baa00082457f3d27b99c4366-apidata.googleusercontent.com/download/storage/v1/b/chrome-telemetry-output/o/trace-file-id_0-2017-02-09_13-29-37-42492.html?qk=AD5uMEukVAI2l34imltvQ7T5ENTKG28acsO-2jD8HJah4LGB_WLqm_PwtIQx3G2AJEBMQCx0eRSpha6m6AZN0HLBHhn4PBxoi-7boxUhbpluFUa6_cMSVEjwMJV8Bvw6S6MPg5GwXN26rLfPb9PdI6V73or0JL7cE9wHQnisIkapnHZZEpqvoVd-AtvwVdpvVFWha6T3Fuhg7R6oHb91F18ICcqNTOEO_MLNlCW5c7W_I3VaN5rXCHelDOFHjF0mHnq1VPZe3-rjK_C2zzmrPn6Vply-Lkl0lpat_Xdu7VclaTYTUIyXJR5EQprNWlyKNNQMm3BnsCbImz4151Px3RIm7_Gyufges9URbY7dZX_CtJY_IXRzXmuYfrWDY-Xd9kD0kx1NyQ76B2gSJGzpDcocvUhG9f5JZuJFW6rqmVwyAJcuCujCo_9T9rWOeB-fnMG1Rv1GbGJowd4byNTUrXFS0-6m7gP-K6y-SryFDUsXEdFa-9sDfxHtoIbqGcTtx9dC15IDN7hl56PQEk6Hdo4HPQ879fOutcDLb63cew790A3eRDZr8y67bp38oCKB0d3DuI3E66mL0cHqi7YVRfAOIl_ENGv2enxwpEW14v2U4F6iofIqEQb6M-qxKT1xRHfZWCjay3ntCSRBHOCYQWUHM0YI1SGXX08xBsD6fMTPvX7TsPgMoLLKLUcUo46Ebw5IGbpQZavxqUW5Ov7HQtZZxl1nfUop9jQ1d58d4JC9_Hb0uFbMEjle6ShJ9XRmgtZe2k6VYYeqSRIwjIIBDXO2MIl6n1YXQmL8HZHaX0vB6oKw6J7RuwYzQQ4Ca41p1NQf6-7H7J1bYRHjzNgie0fU_X8GDUcxRw

In these two examples, the GPU process has a ~200KB difference in allocated objects and a ~3MB difference in metadata_fragmentation_caches.

The browser process has a ~200KB difference in allocated objects and a ~200KB difference in metadata_fragmentation_caches.

I suggest that we emit a new metric malloc:allocated_object_size_avg, and track that with relatively high sensitivity, whereas we can treat effective_size_avg with much lower sensitivity.
 
Looking at libmalloc116/src/magazine_tiny.c...

Ranges are freed with madvise(...MADV_FREE_REUSABLE). If an entire page is madvised this way, the kernel marks it as resident + reusable. This means that the page still counts against the resident page size of the process, but any process can come along and reuse the physical page. Technically, if an entire region is freed, and certain conditions are met, then the full region will be "truly" deallocated. 
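
For illustration only, here is a minimal sketch (not libmalloc code) of the madvise dance described above; MADV_FREE_REUSABLE/MADV_FREE_REUSE are the Darwin-specific advice values:
"""
// Sketch only; not libmalloc. Darwin-specific advice values.
#include <sys/mman.h>   // mmap, madvise, MADV_FREE_REUSABLE, MADV_FREE_REUSE
#include <cstring>

int main() {
  const size_t kSize = 1 << 20;  // 1MB, page-aligned via mmap.
  void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                 MAP_ANON | MAP_PRIVATE, -1, 0);
  if (p == MAP_FAILED) return 1;
  std::memset(p, 0xAB, kSize);  // Touch the pages so they become resident.

  // "Free" the range: the pages stay resident (still counted against the
  // process), but are marked reusable, so the kernel may hand the physical
  // pages to any process that needs them.
  madvise(p, kSize, MADV_FREE_REUSABLE);

  // Before the range is reused, madvise(MADV_FREE_REUSE) is called (see below).
  madvise(p, kSize, MADV_FREE_REUSE);
  std::memset(p, 0xCD, kSize);

  munmap(p, kSize);
  return 0;
}
"""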

I wrote a test that demonstrates this. I start with an 800MB array. I malloc 10^8 4-byte segments, then free almost all of them [leaving only every 10^5th allocation alone]. size in use jumps from 800MB to 2400MB, then back to 800MB. max size in use jumps to 2400MB, then only drops down to 1850MB. Activity monitor agrees that the actual "memory" usage of the app after playing this game is 800MB.

So max size in use is not useful, since it neither records peak nor real memory usage. [Resident is meaningless when a page is also marked with MADV_FREE_REUSABLE.]

Output: 
"""
size in use: 800016608
max size in use: 800066400
size in use: 2400016608
max size in use: 2425327968
size in use: 800032736
max size in use: 1851756896
"""
Attachment: tiny_malloc_test.cc (910 bytes)
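
(The attachment isn't inlined here; the following is just a plausible reconstruction of what the test does, assuming size_in_use / max_size_in_use come from malloc_zone_statistics() in <malloc/malloc.h>, which matches the labels above.)
"""
// Rough reconstruction of tiny_malloc_test.cc; not the attached file.
#include <malloc/malloc.h>  // malloc_zone_statistics, malloc_statistics_t
#include <cstdio>
#include <cstdlib>

static void PrintStats() {
  malloc_statistics_t stats = {};
  malloc_zone_statistics(nullptr, &stats);  // nullptr == sum over all zones.
  std::printf("size in use: %zu\n", stats.size_in_use);
  std::printf("max size in use: %zu\n", stats.max_size_in_use);
}

int main() {
  const size_t kCount = 100000000;  // 10^8 allocations.
  // The "800MB array": 10^8 pointers * 8 bytes.
  void** ptrs = static_cast<void**>(malloc(kCount * sizeof(void*)));
  PrintStats();

  // 10^8 4-byte allocations; the tiny zone rounds each up to its 16-byte
  // quantum, adding ~1.6GB of size_in_use.
  for (size_t i = 0; i < kCount; ++i)
    ptrs[i] = malloc(4);
  PrintStats();

  // Free almost everything, keeping only every 10^5th allocation, so most
  // pages become fully free and get madvised as reusable.
  for (size_t i = 0; i < kCount; ++i) {
    if (i % 100000 != 0)
      free(ptrs[i]);
  }
  PrintStats();

  free(ptrs);
  return 0;
}
"""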

Comment 2 by w...@chromium.org, Mar 9 2017

Are MADV_FREE_REUSABLE pages still accessible to the owning process,
until/unless they get poached by some other process, then? i.e. are they
effectively "discardable" from the owning process' point-of-view?
Before the process reuses the range, it calls madvise(...MADV_FREE_REUSE). 
> Looking at specific traces, much of the noise appears to be in the metadata_fragmentation_cache.
Agree with the analysis on the noise.
Just pointing out something (it might already be obvious): IIRC metadata_fragmentation_cache is the "outer" total of malloc minus the "allocated object sizes" of malloc. So the statement here (which I agree with) is that the internals of the allocator, which we can observe but which are outside our control, are too noisy w.r.t. the malloc usage we have in Chrome.


> I suggest that we emit a new metric malloc:allocated_object_size_avg
I think we already have it, no? See
https://chromeperf.appspot.com/report?sid=8ebdf52f465ffd86e4d997483ba31456d2595ce6b713f27bfea63db3bb555b61
which also confirms your analysis on the timeline.

> and track that with relatively high sensitivity, whereas we can treat effective_size_avg with much lower sensitivity.
definitely +1 to this

The dilemma here is that the pain we ultimately inflict on the user comes from the outer total malloc size (modulo resident and madvise... I have not dug into that yet). But I think what you are saying here is: "in theory yes, but in practice it is nearly impossible to measure reliably".
Anyway, for regression tracking we have to be pragmatic: what our developers can act on is allocated_objects_size. So +1 to tracking that.

> I wrote a test that demonstrates this.
you rock :)
Would be great to update our doc that describes the rationale of the "malloc" dump provider. I think what you are saying here is "some of the assumptions we made on malloc on OSX are wrong" and I am perfectly fine with that statement.
Doc: https://docs.google.com/document/d/1cHpqFoUD5ifYumokHcJe1StAx12EbcnH_ulZK99E0b4/edit


So let me see if I understand all this correctly:
Say that at a given time we have 100 MB of allocated_object size and 120 MB of total malloc size, hence we report 20 MB of metadata_fragmentation_cache.
- We all agree that all those 100 MB are "bad" -> and won't go away until we free() *
- We can't tell anything meaningful about those 20 MB. They *could* be resident and hence "bad" if they were really sub-page fragmentation of the allocator. But they could also be "tolerable" as they could be marked as reusable. And we cannot distinguish these cases.  

Also, I guess that this information is not surfaced by malloc_zone_statistics() or any other API?
So, what I am asking here is: Having established that what we report is incorrect, do we have any chance to fix the reporting (read: fix malloc_dump_provider?).
Or is all this data not surfaced?

> Activity monitor agrees that the actual "memory" usage of the app after playing this game is 800MB.
Out of curiosity, would our notion of "resident" memory (!= the malloc column) also agree on the 800 MB here?


* Well, there is another subtlety here: no idea on OSX, but on Linux large allocations can end up being directly mmap-ed, which means that they don't actually waste memory (i.e. don't increase the resident size) until they are actually written. So malloc(8MB) without touching the memory doesn't cause any memory pressure.
Having said this, if we think pragmatically, IMHO our developers shouldn't rely on the fact that allocating but not using memory doesn't cause memory pressure, and we should just classify that as "bad".
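
(A tiny Linux-only illustration of that footnote, reading RSS from /proc/self/statm; the 8MB size being above glibc's default mmap threshold, and the 4KB page size, are assumptions:)
"""
// Linux-only sketch: a large, untouched malloc doesn't increase RSS.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static size_t ResidentBytes() {
  long total = 0, resident = 0;
  FILE* f = std::fopen("/proc/self/statm", "r");
  if (f) {
    std::fscanf(f, "%ld %ld", &total, &resident);
    std::fclose(f);
  }
  return static_cast<size_t>(resident) * 4096;  // Assumes 4KB pages.
}

int main() {
  const size_t kSize = 8 * 1024 * 1024;  // Above glibc's default mmap threshold.
  std::printf("rss before malloc: %zu\n", ResidentBytes());
  char* p = static_cast<char*>(malloc(kSize));
  std::printf("rss after malloc:  %zu\n", ResidentBytes());  // Barely changes.
  std::memset(p, 1, kSize);  // Touch the pages.
  std::printf("rss after memset:  %zu\n", ResidentBytes());  // Grows by ~8MB.
  free(p);
  return 0;
}
"""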
Ah, also a similar curiosity related to Wez's comment #2.
Is MADV_FREE_REUSABLE only reusable within the same process? Or in case of system pressure those physical pages can be actually used for other processes?
(On Linux/Android, MADV_FREE/WONTNEED do the latter)
> Would be great to update our doc that describes the rationale of the "malloc" dump provider. I think what you are saying here is "some of the assumptions we made on malloc on OSX are wrong" and I am perfectly fine with that statement.
> Doc: https://docs.google.com/document/d/1cHpqFoUD5ifYumokHcJe1StAx12EbcnH_ulZK99E0b4/edit

The doc you linked doesn't mention macOS at all. It seems to be a comparison of bionic vs tcmalloc, with lots of details about mallinfo; e.g. the malloc() 101 part isn't accurate for macOS. It would probably be easier to start a new doc to just talk about macOS, but the utility of writing that seems low to me right now.

> So let me see if I understand all this correctly:
> Say that at a given time we have 100 MB of allocated_object size and 120 MB of total malloc size, hence we report 20 MB of metadata_fragmentation_cache.
> - We all agree that all those 100 MB are "bad" -> and won't go away until we free() *
> - We can't tell anything meaningful about those 20 MB. They *could* be resident and hence "bad" if they were really sub-page fragmentation of the allocator. But they could also be "tolerable" as they could be marked as reusable. And we cannot distinguish these cases.

Correct.

> Also, I guess that this information is not surfaced by malloc_zone_statistics() or any other API?
> So, what I am asking here is: Having established that what we report is incorrect, do we have any chance to fix the reporting (read: fix malloc_dump_provider?).
> Or is all this data not surfaced?

libmalloc doesn't keep track of whether pages are reusable; it just marks ranges as reusable after they're freed. This data isn't surfaced.

> Out of curiosity, would our notion of "resident" memory (!= the malloc column) also agree on the 800 MB here?
The "resident" memory column is broken on macOS [sometimes reports negative numbers]. It's on my list of things to fix, but there isn't a specific tracking bug for it.

> Is MADV_FREE_REUSABLE only reusable within the same process? Or in case of system pressure those physical pages can be actually used for other processes?
The page can be reused by any process.
Just some more evidence of the lack of usefulness of max_size_in_use. I modified my test program from c#1 to free every other allocation. This means that no pages get freed [and marked as reusable].
"""
size in use: 800016608
max size in use: 800066400
size in use: 2400016608
max size in use: 2425344224
size in use: 1600016736
max size in use: 2425344224
"""

Note that the difference between size_in_use and max_size_in_use is ~1GB both in this example and in the example in c#1. In this example, the difference is entirely fragmentation. In c#1, the difference is entirely in reusable pages [and thus isn't meaningful].
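
(Presumably the only change relative to the reconstruction sketched earlier is the free condition, something like:)
"""
  // Free every other allocation: every page keeps some live 16-byte blocks,
  // so no page becomes fully free and nothing is madvised as reusable.
  for (size_t i = 0; i < kCount; ++i) {
    if (i % 2 != 0)
      free(ptrs[i]);
  }
"""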

Comment 8 by bugdroid1@chromium.org, Mar 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/792525bd0feedcc111f8f9ff583cc6dd6d4e2d3d

commit 792525bd0feedcc111f8f9ff583cc6dd6d4e2d3d
Author: erikchen <erikchen@chromium.org>
Date: Fri Mar 10 18:06:15 2017

Stop reporting metadata_fragmentation_caches for macOS.

Resident size is approximated pretty well by stats.max_size_in_use. However, on
macOS, freed blocks are both resident and reusable, which is semantically
equivalent to deallocated. The implementation of libmalloc will also only hold
a fixed number of freed regions before actually starting to deallocate them, so
stats.max_size_in_use is also not representative of the peak size. As a result,
stats.max_size_in_use is typically somewhere between actually resident
[non-reusable] pages, and peak size. This is not very useful, so we just use
stats.size_in_use for resident_size, even though it's an underestimate and
fails to account for fragmentation.

BUG= 695263 

Review-Url: https://codereview.chromium.org/2743563004
Cr-Commit-Position: refs/heads/master@{#456106}

[modify] https://crrev.com/792525bd0feedcc111f8f9ff583cc6dd6d4e2d3d/base/trace_event/malloc_dump_provider.cc
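
(A minimal sketch of the idea in that commit message, not the actual diff, assuming the stats come from malloc_zone_statistics() as in the test above:)
"""
// Sketch of the change in base/trace_event/malloc_dump_provider.cc (macOS).
malloc_statistics_t stats = {};
malloc_zone_statistics(nullptr, &stats);

// Before: resident_size = stats.max_size_in_use, and the difference between
// resident_size and allocated_objects_size was reported as
// metadata_fragmentation_caches. Since max_size_in_use counts reusable
// (effectively deallocated) pages and is not a true peak either, report
// size_in_use as resident_size and stop emitting
// metadata_fragmentation_caches.
size_t resident_size = stats.size_in_use;
"""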

Comment 9 by w...@chromium.org, Mar 10 2017

This all sounds identical to the Win8.1+ Offer/Reclaim mechanism; the
REUSABLE pages continue to contribute to the process' set of committed
pages, and the process can relatively cheaply REUSE those pages, causing
the OS to map them to physical memory as soon as they're touched.

So I think this value *is* important; the "committed" memory footprint
doesn't impact physical memory usage, since the OS can discard without
preserving the contents, but it does reduce the overall memory available,
IIUC.
The difference is that any process can reuse a REUSABLE page, so it doesn't reduce overall available memory.
Cc: ssid@chromium.org
Spoke at length with ssid@, who wants us to keep emitting max_size_in_use - size_in_use, but with a different label. 

Target audience for these metrics:
The original goal wasn't quite clear. I think we should target developers, primarily of Chrome.

Goal:
There is currently a "resident" column, and then columns for other memory dump providers. These other columns should sum to the "resident" column. This allows Chrome developers to figure out what is causing Chrome to use so much memory.

Several observations:
  * It's clear that "resident" memory is not a particularly useful stat. On macOS it also includes reusable pages, which makes it especially meaningless.
  * The memory dump providers [like MallocDumpProvider] emit stats that don't reflect resident pages. There is an implicit assumption that memory allocated by these subsystems is in fact resident. ssid@ said that a study of Chrome showed that >98% of malloc memory is resident.
  * The MemoryDumpProviders currently do not sum to the total in the "resident" column. This is *presumably* because of mmapped files, but it's not quite clear. 
  * The "resident" column fails to account for shared memory in a meaningful way.
 
I've put together a draft for a consistent definition for memory footprint across all platforms.
https://docs.google.com/document/d/1zcfWSsgwTY8WYYlxd0G8QTR6_b2j38uMfjjXBOuMZnM/edit#heading=h.72p7m75zec96

In the meantime, I will change "resident size" to not include reusable memory. The naive thing to do here is to switch to using mach_vm_region(...VM_REGION_TOP_INFO), which is both fast and automatically discounts reusable memory. On a release build of Chrome, it takes ~900us to iterate through the entire address space of the browser process. This will be slower if there are more allocated regions, but this speed seems acceptable.

This happens to emit a number which is significantly larger than the currently emitted number. The current "resident size" comes from task_info(...TASK_BASIC_INFO_64), which measures the number of resident pages according to the pmap [physical memory map subsystem], which is unaware of the concept of shared memory, COW, or nested regions. Furthermore, the pmap number will be affected by paging/swap/compression, and could go down even though the system has a very high memory footprint.
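
Roughly, the mach_vm_region(...VM_REGION_TOP_INFO) iteration looks like this (a standalone sketch, not the Chromium implementation; the shared/COW accounting here is simplified):
"""
// Sketch: sum resident pages for the current task via VM_REGION_TOP_INFO.
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <unistd.h>
#include <cstdio>

int main() {
  uint64_t resident_pages = 0;
  mach_vm_address_t address = 0;
  while (true) {
    mach_vm_size_t size = 0;
    vm_region_top_info_data_t info;
    mach_msg_type_number_t count = VM_REGION_TOP_INFO_COUNT;
    mach_port_t object_name;
    kern_return_t kr =
        mach_vm_region(mach_task_self(), &address, &size, VM_REGION_TOP_INFO,
                       reinterpret_cast<vm_region_info_t>(&info), &count,
                       &object_name);
    if (kr != KERN_SUCCESS)
      break;  // KERN_INVALID_ADDRESS once we walk past the last region.
    // Per the discussion above, this flavor discounts reusable memory.
    // Shared/COW accounting is simplified here: count shared pages only when
    // this task is the sole owner of the mapped object.
    resident_pages += info.private_pages_resident;
    if (info.share_mode == SM_COW && info.ref_count == 1)
      resident_pages += info.shared_pages_resident;
    address += size;
  }
  std::printf("resident: %llu bytes\n",
              static_cast<unsigned long long>(resident_pages) *
                  static_cast<unsigned long long>(getpagesize()));
  return 0;
}
"""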

On a clean profile, mach_vm_region(...VM_REGION_TOP_INFO) stabilizes around 336MB for the browser process, whereas the pmap metric stabilizes around ~140MB. Using vmmap shows the former number is accurate. The latter appears to be mostly missing out on mapped images [e.g. __TEXT and __LINKEDIT], and perhaps some mapped files as well. So, two proposals:

1) Change ProcessMetrics::GetWorkingSetSize to use mach_vm_region(...VM_REGION_TOP_INFO). 
2) Create a new memory dump provider for macOS that emits stats for mapped images [shared libraries, chromium image, etc.]



Staring at base/process/process_metrics.h a little bit more... we don't want to change the implementation of GetWorkingSetSize(), since that's currently a cross-platform, well-defined concept. We will instead want to define a new function that emits the metric we care about, and have memory-infra call that method instead.
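
(Shape of the proposal only; the function name here is invented for illustration:)
"""
// base/process/process_metrics.h (hypothetical addition; name made up here).
class ProcessMetrics {
 public:
  // Returns the proposed "memory footprint" metric. On macOS this would be
  // computed via mach_vm_region(...VM_REGION_TOP_INFO) rather than the
  // pmap-based resident count that GetWorkingSetSize() currently returns.
  size_t GetMemoryFootprint() const;
  // ... existing members unchanged ...
};
"""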


https://cs.chromium.org/chromium/src/components/tracing/common/process_metrics_memory_dump_provider.cc?type=cs&q=GetWorkingSetSize+package:%5Echromium$&l=630

We currently try to emit "private bytes" on macOS, using an implementation based on VM_REGION_TOP_INFO. There's a bug that prevents the field from actually getting emitted, and it would report numbers that are not particularly consistent with the pmap number we currently emit for total resident. This will be fixed when I switch the total to also use VM_REGION_TOP_INFO.
https://bugs.chromium.org/p/chromium/issues/detail?id=700532
Owner: erikc...@chromium.org
Status: Assigned (was: Untriaged)
I think we eventually will want to replace the public interface of base/process/process_metrics.h, but I want to acquire buy-in first for my proposed definition and calculation for "memory footprint". 

I considered adding a new function to base/process/process_metrics.h that process_metrics_memory_dump_provider.cc can call directly, but looking at all the use cases for GetWorkingSetSize(), there is no scenario where the current implementation [number of pmap resident pages] makes sense, or is the desired behavior. 

The only place where changing this behavior might have external effects is BlinkPlatformImpl::actualMemoryUsageMB, which gets called in reportFatalErrorInMainThread and reportOOMErrorInMainThread. The comment for this function explicitly says that it's a potentially inaccurate estimate whose implementation differs across platforms, so I'm not concerned.
Status: Fixed (was: Assigned)
