New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 602992 link

Starred by 5 users

Issue metadata

Status: WontFix
Owner:
Xoogler
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug



Sign in to add a comment

MemoryInfra: suspiciously low malloc memory reported by the heap profiler on Android

Project Member Reported by primiano@chromium.org, Apr 13 2016

Issue description

Background context:go/memory-infra: memory profiling in chrome://tracing

From an offline conversation with Maria:
-----
I am playing with startup tracing for Android using the instructions here. I turned on heap profiling and tried to collect an early trace, but I am seeing a lot fewer objects in the heap profile than being reported total (e.g. 4MB of heap objects vs 13MB total malloc stuff). I was wondering why we have this discrepancy. Attaching a trace I was using for you to see.
-----


Hmm, this is weird. I have two suspects here:
1) the malloc_dump_provider is registered too late on android, so we miss most of the initial allocations.
2) something is going wrong with the shim.

kraynov can you take a look at this and, specifically test:
1. put some big malloc leaks in various pieces of the codebase for the browser process (e.g., net, content/browser) and see if they show up in the heap profiler.
2. Like above, but do some big C++ new()
3. Like above, but use a std::vector<char> and .reserve() a big amount.

And tell us what you find :)
 
Dmitry suggested today that this may be because we are using a prebuilt libstdc++ library on Android, where the shim doesn't exist, so all stl allocations end up calling uninstrumented malloc.

Since he's going to need to build libstdc++ for native stack profiling, we will see if that also fixes the issue here. 
Hmm I'd be interested in #1.
It *should* work because the -wrap,malloc trick is at the linker level. Even if libstc++ is prebuilt, the magic happens when we link it.
It should be easy to test. Just create a std::vector, add a TRACE_EVENT0, resize the vector to 42 MB and see if you see in the heap profiler.

Comment 3 by dskiba@chromium.org, Apr 27 2016

I'm doing some tests in this area, will report today.

Comment 4 by dskiba@chromium.org, Apr 27 2016

Yeah, so libc++ is not a problem, it's .cpp code is fairly small, and mostly uses operator new, which is weak.

But, it looks like that Android runtime also uses malloc (MILD SHOCK):

Breakpoint 4, 0xf749c25a in malloc () from /tmp/dskiba-adb-gdb-libs/system/lib/libc.so
(gdb) bt
#0  0xf749c25a in malloc () from /tmp/dskiba-adb-gdb-libs/system/lib/libc.so
#1  0xf76ea8b4 in android::SharedBuffer::alloc(unsigned int) () from /tmp/dskiba-adb-gdb-libs/system/lib/libutils.so
#2  0xf76ed068 in android::VectorImpl::_shrink(unsigned int, unsigned int) () from /tmp/dskiba-adb-gdb-libs/system/lib/libutils.so
#3  0xf76eec3c in android::Looper::pollInner(int) () from /tmp/dskiba-adb-gdb-libs/system/lib/libutils.so
#4  0xf76eeed6 in android::Looper::pollOnce(int, int*, int*, void**) () from /tmp/dskiba-adb-gdb-libs/system/lib/libutils.so
#5  0xf7636630 in android::NativeMessageQueue::pollOnce(_JNIEnv*, _jobject*, int) ()
   from /tmp/dskiba-adb-gdb-libs/system/lib/libandroid_runtime.so
#6  0x71a4e56e in ?? ()

So when we ask for "number of allocated bytes", we also get Android's stuff.

Comment 5 by dskiba@chromium.org, Apr 27 2016

So I added "total size of all allocations" to AllocationRegister, and reported it in malloc dump provider. Then I took ~30 seconds trace (Linux, CNN.com). The results:

Browser first:

allocated_objects 23.8 MiB
shim_registered_size 11.0 MiB
DIFF = 12.8

Browser last:

allocated_objects 70.2 MiB
shim_registered_size 57.0 MiB
DIFF = 13.2

Renderer CNN.com first:

allocated_objects 17.8 MiB
shim_registered_size 16.7 MiB
DIFF = 1.1

Renderer CNN.com last:

allocated_objects 77.1 MiB
shim_registered_size 75.7 MiB
DIFF = 1.4

So even on Linux, where we use TCMalloc, we have a difference.

Comment 6 by dskiba@chromium.org, Apr 27 2016

Trace: https://drive.google.com/file/d/0B_Hmi138MnbJcXl1Qkc0dTcyaUE/view?usp=sharing

Diff is attached, look for shim_registered_size value for malloc.
shim_registered_size.diff
2.9 KB Download

Comment 7 by dskiba@chromium.org, Apr 27 2016

Cc: primiano@chromium.org

Comment 8 by dskiba@chromium.org, Apr 28 2016

On Android situation is similar:

Browser first:

allocated_objects 19.3 MiB
shim_registered_size 3.3 MiB
DIFF: 16

Browser last:

allocated_objects 29.8 MiB
shim_registered_size 13.5 MiB
DIFF: 16.3

Renderer first:

allocated_objects 5.6 MiB
shim_registered_size 1.3 MiB
DIFF: 4.3

Renderer last:

allocated_objects 56.1 MiB
shim_registered_size 49.3 MiB
DIFF: 6.8

So we can see that Android (+ potentially something else) is responsible for ~16 MiB in the browser process. However, not all of those 16 MiB are committed. Here is a meminfo dump from "Hello World" Android app from Maria:

** MEMINFO in pid 30622 [com.example.mariakhomenko.testapp] **
                   Pss  Private  Private  Swapped     Heap     Heap     Heap
                 Total    Dirty    Clean    Dirty     Size    Alloc     Free
                ------   ------   ------   ------   ------   ------   ------
  Native Heap     3992     3600        0        0    19968    17886     2081
  Dalvik Heap     4286     4088        0        0    24192    20310     3882
 Dalvik Other      682      540        0        0                           
        Stack      180      180        0        0                           
       Ashmem        4        4        0        0                           
      Gfx dev      572      572        0        0                           
    Other dev        4        0        4        0                           
     .so mmap     1315      136       64        0                           
    .apk mmap      218        0       36        0                           
    .ttf mmap       85        0       72        0                           
    .dex mmap     2968       76     2892        0                           
    .oat mmap     1075        0      300        0                           
    .art mmap      893      656       24        0                           
   Other mmap        9        8        0        0                           
   EGL mtrack    41280    41280        0        0                           
      Unknown      188      188        0        0                           
        TOTAL    57751    51328     3392        0    44160    38196     5963

Native Heap / Heap Size, Heap Alloc and Heap Free are usmblks, uordblks and fordblks from mallinfo() (see android_os_Debug.cpp). So even simple Android app allocates ~16 MiB, but only commits 3600 of them - Native Heap / Private Dirty is calculated as PrivateDirty of regions from /proc/smaps.
trace_shim_android.json.gz
2.9 MB Download

Comment 9 by dskiba@chromium.org, Apr 28 2016

I did another experiment to see how much memory is allocated before MallocDumpProvider installs its shim. I calculated total allocated bytes at shim level (ShimXXX functions), and reported that as "shim_allocated_size". Difference between this value and "shim_registered_size", which comes from AllocationRegister, tells us how much we missed. See patch attached.

Numbers from CNN.com case (load, scroll bottom/top):

=== Linux ===
https://drive.google.com/file/d/0B_Hmi138MnbJc1NlR0NaNHVkUGM/view?usp=sharing

Browser / first dump:

shim_allocated_size: 24.5 MiB
shim_registered_size: 16.3 MiB
DIFF: 8.2

Browser / last dump:

shim_allocated_size: 54.5 MiB
shim_registered_size: 46.7 MiB
DIFF: 7.8

Renderer (CNN.com) / first dump:

shim_allocated_size: 5.8 MiB
shim_registered_size: 5.5 MiB
DIFF: 0.3

Renderer (CNN.com) / last dump:

shim_allocated_size: 45.1 MiB
shim_registered_size: 46.4 MiB
DIFF: -0.6

=== Android ===
https://drive.google.com/file/d/0B_Hmi138MnbJUXB5aE5IQkV6NTA/view?usp=sharing

Browser / first dump:

shim_allocated_size: 4.0 MiB
shim_registered_size: 3.8 MiB
DIFF: 0.2

Browser / last dump:

shim_allocated_size: 16.6 MiB
shim_registered_size: 16.6 MiB
DIFF: 0

Renderer / first dump:

shim_allocated_size: 1.5 MiB
shim_registered_size: 1.3 MiB
DIFF: 0.2

Renderer / last dump:

shim_allocated_size: 63.1 MiB
shim_registered_size: 63.9 MiB
DIFF: -0.8

So Android is fine, while on Linux we miss ~8 MiB.

Also, negative DIFFs are concerning. Either this is a bug in the way I calculate numbers, or something is not removed properly from AllocationRegister.
shim_a_r.diff
9.6 KB Download
Ok this bug is becoming hard to follow. There are different findings here
Let me try to make a point of the situation.

1. mallinfo() inherited by the Android Zygote pre-fork()
--------------------------------------------------------
Maria experiments suggest that on Android we start with a non-zero malloc() size. Thinking more, this is perfectly WAI. The Android Zygote does all sort of allocations before it fork()-s to specialize into the various processes.

Action item: there is nothing we can do here, also these allocations are not really using memory, as very likely they happened pre-fork().
I think the only think we should do is sample the value of mallinfo() at startup on Android, and expose/subtract that in memory-infra as "pre-fork() Zygote" or similar.

2. Android itself uses malloc
-----------------------------
Yes, agree, mild shock. I don't think there is anything we can do here. We'll have to live with that. Also that should be hopefully a minor conributions.

3. Difference on beween mallinfo() and heap profiler hooks
----------------------------------------------------------
Here I am very confused about the various experiments. Can you summarize them so we can analyze them case by case?
Some things to keep in mind:
 - Not sure if tcmalloc mallinfo() takes into account fragmentation. That might explain some of the difference.
 - Remember that on Linux Desktop the renderers are forked form the chrome zygote. we might have some similar effects like in the android zygote case.


So let me summarize the situation on Android:

=== Android framework allocations inflate 'malloc' values in traces. ===

Android uses malloc() to allocate, and that affects mallinfo() results. This might be expected, but nevertheless makes it hard to reason about 'malloc' values in traces. Simple 'Hello World' Android app allocates 16 MiB, and shim only registers ~3 MiB, which makes malloc values reported for the browser process almost meaningless, since most of that comes from Android, not Chrome.

This is what my first experiment measured - I added shim_registered_size value which shows amount of memory AllocationRegister knows about, i.e. that was allocated by Chrome. The difference between shim_registered_size and allocated_objects is memory that either allocated by Android, or by Chrome before AllocationRegister is plugged in. Results: 16 MiB for the browser process, 4-6 MiB for the renderer. Note the difference - it means that most of those 16 MiB were not inherited from Android's zygote.

OK, so we know that simple "Hello World" allocates 16 MiB, how to we estimate size of memory allocated by Chrome before AllocationRegister is turned on? By measuring total allocation size at the shim level. I.e. malloc calls ShimAlloc, which either goes directly to system malloc, or goes through AllocationRegister and then to system malloc. So allocated size measured at ShimXXX functions reflects total size allocated by Chrome from the start, while value reported by AllocationRegister reflects only amount allocated since AllocationRegister was turned on.

This is what my second experiment measured - I added shim_allocated_size that shows amount of memory allocated through ShimXXX interface. Difference between shim_allocated_size and shim_registered_size tells us how much Chrome allocates before AllocationRegister plugs in. Results: 0.2 MiB for the browser and 0 for the renderer. Good, this means that on Android AllocationRegister knows about almost all allocations (compare to Linux where diff is 8 MiB).

>>> ACTION ITEM: Think of ways to separate 'malloc' value into 'Android' and 'Chrome' parts. Or maybe report only 'Chrome' part of it in the trace.

(Note that there is an interesting opportunity to reorganize Java code so that it allocates less native memory from malloc.)


=== No correlation between 'malloc' value and private dirty. ===

In memory reduction efforts we mostly care about reducing 'Private Dirty'. This is what everyone is looking at when comparing memory usages.

However, we only have approximate idea where 'Private Dirty' comes from. We saw that "Hello World" Android app allocates 16 MiB, but reports only 3.6 MiB as dirty. So when in a Chrome trace we see that 'malloc' is 22.2 MiB and 'Private Dirty' is 36.5 MiB we can't expect that reducing 'malloc' will proportionally affect 'Private Dirty' (or affect at all). And of course it doesn't help that some of 'Private Dirty' comes from Android framework.

>>> ACTION ITEM: Think of ways to extract 'malloc' part from 'Private Dirty', so that we know how much malloc contributes.



Overall, the idea is to make things that are reported in a trace actionable. Or separate them into 'actionable' and 'FYI' parts.


(I think I'll post this to Project TRIM so that other guys can chime in.)


> 3. Difference on beween mallinfo() and heap profiler hooks
> ----------------------------------------------------------
> Here I am very confused about the various experiments. Can you summarize them so we can analyze them case by case?
> Some things to keep in mind:
>  - Not sure if tcmalloc mallinfo() takes into account fragmentation. That might explain some of the difference.
>  - Remember that on Linux Desktop the renderers are forked form the chrome zygote. we might have some similar effects like in the android zygote case.

So, experiment 1 (shim_registered_size) shows that on Linux the amount AllocationRegister knows about is ~12 MiB less than what TCMalloc reports.

Why is that? Two possibilities:
1. Something allocates memory directly (not using shim).
2. Memory is allocated using shim, but before AllocationRegister is plugged in.

I verified #1 manually by creating shim paths in TCMalloc and breaking on non-shim paths. I've found that we have UncheckedMalloc() thing, which avoids shime by calling system allocation functions directly (__libc_malloc, tc_malloc_skip_new_handler, etc). But it doesn't seem to be used much.

Experiment 2 confirms #2 - I added shim_allocated_size, which measures memory at ShimXXX functions, and there is a difference of 8 MiB between that and shim_registered_size from AllocationRegister. However, something is wrong because difference is changing over time and eventually becomes negative. I didn't investigate it much, but the code (see patch in #9) seems fine.

>>> ACTION ITEM: Continue investigation, verify differently - by snapshotting allocated_size from TCMalloc at the time we plug AllocationRegister in - that will tell us how much was allocated before AllocationRegister, but won't track deallocations of that memory.
Cc: klo...@chromium.org
Grace was curious about this.

Comment 14 by ccross@google.com, Apr 29 2016

Re #11:

Any memory malloc'd and written to after fork+specialize from zygote will show up as private dirty, and reducing mallocs will reduce private dirty.

The Native Heap private dirty number is already counting only memory from malloc - malloc in libc applies an "[anon:libc_malloc]" label to memory mapped for the native heap, and then the private dirty numbers are extracted from /proc/pid/smaps mappings with that label.

Comment 15 by ccross@google.com, Apr 29 2016

https://source.android.com/devices/tech/debug/native-memory.html has some slightly-out-of-date debugging information for native heap memory use.  The ddms native=true part still applies, but to enable malloc debug you now use:

adb root
adb shell setprop libc.debug.malloc.program app_process64
adb shell 'setprop libc.debug.malloc.options "backtrace=8"'
adb shell stop
adb shell start

ddms should then let you capture a native heap dump, and can subtract out zygote usage and show you stack traces of calls to malloc and how much was allocated, and break it down by library.

Comment 16 by ccross@google.com, Apr 29 2016

Cc: ccross@google.com
@primiano: Where is the allocator waste accounted for?

For instance, memory held into the various thread-local caches of the allocator, or the free space in the arenas.
This would show up as private dirty, unless it's de-allocated (using madvise(), which I *think* jemalloc() does on Android).
Cc: lizeb@chromium.org
Some of my experiments above relied on a idea that all allocations go through ShimXXX functions. Well, as I found base::UncheckedMalloc() doesn't go through shim, and at least on Android it can allocate quite a lot, see attached screenshot.

This is just FYI, I don't think we need to do anything about it.


unchecked_malloc.png
73.6 KB View Download
Uh what feels a bug, why is that?
From what I see in the code:


bool UncheckedMalloc(size_t size, void** result) {
#if BUILDFLAG(USE_EXPERIMENTAL_ALLOCATOR_SHIM)
  *result = allocator::UncheckedAlloc(size);

And, in allocator_shim.cc

void* UncheckedAlloc(size_t size) {
  const allocator::AllocatorDispatch* const chain_head = GetChainHead();
  return chain_head->alloc_function(chain_head, size);
}


UncheckedMalloc should go trough the shim :/
Cc: haraken@chromium.org
kraynov can you take a look at this and see whether UncheckedMalloc is intercepted or not (and if not, why)?
To understand which allocations Chrome's heap profiler misses, I used Android's heap profiler to capture all allocations. Chrome's heap profiler was enabled too. Here are all the allocations that originated from libchrome.so when starting to the NTP: https://docs.google.com/a/google.com/spreadsheets/d/1djmF53A5IKVmL0BRejbvu727838YG5NUNib6oMxgNvY/edit?usp=sharing

As you can see Chrome's heap profiler missed the following:

1.5 MiB	ShimCppNew
222 KiB	ShimMalloc
40 KiB	ShimRealloc

I.e. ~1.7 MiB of allocations. All other allocations originated from HookXXX functions, i.e. were intercepted and recorded in Chrome's heap profiler.

It looks like we're enabling heap profiler too late, and missing some startup stuff.
It turned out that startup tracing causes heap profiler to start early and catch most allocations:

3.7 MiB	HookAlloc
136 KiB	HookllocAligned
49 KiB	HookRealloc
26 KiB	ShimCppNew

I.e. previously uncaught 1.7 MiB were captured.

Comment 25 Deleted

Components: Internals>Instrumentation>Memory
Components: -Internals>Tracing
Status: Started (was: Assigned)
This bug is pretty old. What is the plan here?
Just to check that UncheckedMalloc is accounted correctly. Probably will be just closed.
Status: WontFix (was: Started)
Obsolete.
UncheckedMalloc is accounted correctly.

Sign in to add a comment