
Issue 698079

Starred by 5 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking: issue 698266




Heap profiler: revisit the decision of deliberately crashing when running out of space in the tracker data structure

Project Member Reported by primiano@chromium.org, Mar 2 2017

Issue description

The memory-infra heap profiler has a fixed size data structure to track allocations [1].
It is used in a high-watermark fashion, so memory is actually committed only when the slots of the hash tables are needed, but there is a hard cap.
The rationale was: we don't want our tracker structure to grow unbounded, otherwise the hosting process will OOM because of the heap profiler itself.
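
For illustration, the reserve-big-and-commit-lazily idea can be sketched in plain POSIX as below (a hypothetical, simplified sketch; the real AllocationRegister uses base's page allocator and its own cell layout):

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    // Reserve address space for the cell array plus one trailing guard page.
    // Anonymous pages are committed lazily on first touch, so the
    // high-watermark behavior falls out for free; walking past the end of
    // the array hits the PROT_NONE guard page and crashes deterministically.
    void* ReserveCellStorage(size_t payload_size) {
      const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
      const size_t total = ((payload_size + page - 1) / page + 1) * page;
      char* base = static_cast<char*>(
          mmap(nullptr, total, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0));
      if (base == MAP_FAILED)
        return nullptr;
      mprotect(base + total - page, page, PROT_NONE);  // Guard page.
      return base;
    }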

So far, so good. The question now is: what do we do when we hit the limit?
Back then the decision in code review was: just CHECK(). We realistically never hit this limit in our prior experiments, so it didn't make sense to deal with it until we proved it was a problem.

The comment in the code is pretty self-explanatory in this regard. From
https://cs.chromium.org/chromium/src/base/trace_event/heap_profiler_allocation_register.h?rcl=d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120&l=202

    // If the hash table has too little capacity (when too little address space
    // was reserved for |cells_|), |next_unused_cell_| can be an index outside
    // of the allocated storage. A guard page is allocated there to crash the
    // program in that case. There are alternative solutions:
    // - Deal with it, increase capacity by reallocating |cells_|.
    // - Refuse to insert and let the caller deal with it.
    // Because free cells are re-used before accessing fresh cells with a higher
    // index, and because reserving address space without touching it is cheap,
    // the simplest solution is to just allocate a humongous chunk of address
    // space.

    CHECK_LT(next_unused_cell_, num_cells_ + 1)
        << "Allocation Register hash table has too little capacity. Increase "
           "the capacity to run heap profiler in large sessions.";

Now, we have been hitting this problem a bunch of times. It's time to deal with it.
I think the right solution here is:
- Gracefully refuse to store the allocation, without crashing.
- Add a flag to the dump saying "this data is incomplete because we reached saturation".

Even a saturated trace would still have enough data to figure out the problem, which is better than crashing.
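
A minimal sketch of that shape, with hypothetical names (the real AllocationRegister API differs):

    #include <atomic>
    #include <cstddef>

    // Hypothetical sketch: a fixed-capacity tracker whose Insert() degrades
    // gracefully at saturation instead of CHECK-failing.
    class BoundedTracker {
     public:
      explicit BoundedTracker(size_t capacity) : capacity_(capacity) {}

      // Returns false, and remembers that data was lost, when the table is
      // full; the caller just skips tracking this allocation.
      bool Insert(const void* address, size_t size) {
        if (used_ >= capacity_) {
          saturated_.store(true, std::memory_order_relaxed);
          return false;
        }
        ++used_;
        // ... store |address| and |size| in the real cell storage ...
        return true;
      }

      // Exported with the dump so the UI can flag the data as incomplete.
      bool saturated() const {
        return saturated_.load(std::memory_order_relaxed);
      }

     private:
      const size_t capacity_;
      size_t used_ = 0;
      std::atomic<bool> saturated_{false};
    };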

[1] https://cs.chromium.org/chromium/src/base/trace_event/heap_profiler_allocation_register.h?rcl=d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120&l=308
 
If our goal is to use heap profiling with users who are hitting many GBs of malloc in a single process, the current system doesn't make much sense. In the short term, for the special debug builds we give to people with z620s or z840s, we should use a much higher cap.

An alternative solution would be to stream all allocations/destructions to disk, so that we have minimal in-memory footprint. 
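
For example (a purely hypothetical sketch, nothing like this exists in the tree), each hook could append a fixed-size record to an event log and leave aggregation to offline tooling:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical sketch: stream raw allocation events to disk instead of
    // keeping them in an in-memory table.
    struct AllocEvent {
      uint64_t address;
      uint64_t size;          // 0 for a free().
      uint64_t backtrace_id;  // Index into a separately dumped backtrace table.
    };

    void StreamEvent(FILE* out, const AllocEvent& event) {
      // An append-only log keeps the in-memory footprint constant; the cost
      // moves to I/O bandwidth and offline post-processing. A real version
      // would also have to avoid allocating inside the allocator hook.
      fwrite(&event, sizeof(event), 1, out);
    }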
Cc: dskiba@chromium.org
One simple solution that I wanted to implement at some point is to mmap more cells. We won't rehash, so performance will degrade, but at least we won't hit the cap.
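
Sketched hypothetically, "mmap more cells" could mean chaining extra banks of cells instead of growing and rehashing the table:

    #include <sys/mman.h>
    #include <cstddef>

    // Hypothetical sketch: when the current bank of cells is exhausted, map
    // another one and keep inserting. Lookups must now probe every bank (no
    // rehash), so they get slower, but inserts stop hitting a hard cap.
    struct CellBank {
      void* cells = nullptr;
      size_t used = 0;
      size_t capacity = 0;
    };

    CellBank MapBank(size_t num_cells, size_t cell_size) {
      void* mem = mmap(nullptr, num_cells * cell_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED)
        return CellBank{};
      return CellBank{mem, 0, num_cells};
    }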

Also, the funny thing is that CHECK_LT() allocates when it logs, so the failure path can re-enter the profiler and sometimes deadlocks.
> the current system doesn't make much sense
Why? The cap still allows recording several million objects. If we reach a state where the tracker is saturated with millions of objects, IMHO there are very good chances that your leak is already in there.
Essentially what I am saying is: in a scenario where a process is leaking 4 million objects, do we really need to record all 4 million of them to get actionable data? My claim is that even if we record just 1 million, they will be enough to find the culprits. Once saturation is reached, the accuracy of the heap profiler is less relevant. Unless you tell me that there are situations where having 1 million live objects is WAI, in which case we can discuss increasing the cap of the heap profiler tracker.

> One simple solution that I wanted to implement at some point is to mmap more cells.
Again, I'd like to understand why we think we need to track more than O(M) objects. IMHO once we reach that point we already have a big portion of the problem in our hands. Why shoot for 100% accuracy when you already have enough data to diagnose the problem?

IMHO the right action is removing that CHECK and adding the check required to skip the Insert() when we reach saturation. The rest of the code should keep working as-is. Or am I missing something else?
> the current system doesn't make much sense
> Why? The cap still allows recording several million objects. If we reach a state where the tracker is saturated with millions of objects, IMHO there are very good chances that your leak is already in there.

Understood. Let's see what happens when we actually start running the native heap profiler on Desktop machines with issues. Either way, I think you'd agree that taking down the process is not the right behavior.
Yes, gracefully handling exhaustion is another way to go. The catch is that we have two tables there: one for allocations and another for backtraces, so we'll need to somehow make visible (in the UI) which table reached capacity.
> Either way, I think you'd agree that taking down the process is not the right behavior
Right. The only reason that check is there is that back then I didn't want to deal with making the tracing UI cope with saturation. I was not expecting Chrome to get so borked as to reach this insane state. I was so young, innocent and full of hopes :) .

>so we'll need to somehow make visible (in the UI) which table reached capacity
I think that a red bar saying "this trace is incomplete, you cannot fully trust this data" is enough. Telling which table saturated doesn't really make it any easier to interpret the results.
At that point you just hope that 90% of the memory shows up in one place.
Blocking: 698266
> Why? The cap still allows recording several million objects. If we reach a state where the tracker is saturated with millions of objects, IMHO there are very good chances that your leak is already in there.

I've been running native heap profiling with my Canary profile. Just loading 3 profiles creates over a million objects, and I hit this limit within a few minutes. I once again think we should increase the limit.

Comment 9 by w...@chromium.org, Mar 3 2017

Bonus points if we increase the limit *and* get folks to start tackling the
fact that we seem to allocate way too many things. ;)
Owner: ajwong@chromium.org
Status: Assigned (was: Untriaged)
Snagging this.

Gonna just make it drop entries and (maybe) count the number of drops first. Let me know if anyone objects.
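
A hypothetical sketch of drop-and-count (names invented here, not the actual patch):

    #include <atomic>
    #include <cstdint>

    // Hypothetical sketch: a relaxed atomic counter is cheap enough for the
    // hot allocation path, and its value can be emitted with each dump so
    // the UI can quantify how much data was dropped, not just that some was.
    class DropCounter {
     public:
      void RecordDrop() { drops_.fetch_add(1, std::memory_order_relaxed); }
      uint64_t total_drops() const {
        return drops_.load(std::memory_order_relaxed);
      }

     private:
      std::atomic<uint64_t> drops_{0};
    };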
Could we also bump the limits on desktop? I hit the current limits on my own profile in a couple of minutes. Here are the ridiculously larger numbers I'm using:

https://paste.googleplex.com/6529243055915008
Sounds good, shouldn't be too hard. Please cc me on CL.

Here's a list of other changes I've made locally:

1) Change the time between detailed dumps [since a single dump takes >>2s]. I used 10s, but maybe longer would be better. Note that if it's too long, the first dump never starts. The relevant constant:

    const TraceConfig::MemoryDumpConfig::Trigger kDefaultHeavyMemoryDumpTrigger

2) Increase max frame count:

    -  enum { kMaxFrameCount = 48 };
    +  enum { kMaxFrameCount = 200 };

3) And max frame size, in base/trace_event/heap_profiler_allocation_context_tracker.cc:

    -        const void* frames[128];
    +        const void* frames[256];

Comment 14 by ssid@chromium.org, Mar 29 2017

How does increasing the dump interval help? The next dump is not triggered till the previous dump is over.
I found that if I took 2 or 3 dumps, the result was too large to load in tracing. A single dump would take ~10-30s, and then Chrome would immediately start taking another one, so I couldn't control the number of dumps Chrome emitted.
> How does increasing the dump interval help? The next dump is not triggered till the previous dump is over.

Yeah, I think this is more a matter of "too much stuff ends up in the trace".
+1 to implicitly raising the 2s interval when the heap profiler is enabled.

+dskiba for frame size. He did a lot of math on this in the past; there is a balance between the depth of frames, the usage of the backtrace hashtable, and the time it takes to symbolize the trace.
Project Member Comment 17 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c1d887e1b41c37348a412bcb956b641ef5b5caf1

commit c1d887e1b41c37348a412bcb956b641ef5b5caf1
Author: primiano <primiano@chromium.org>
Date: Mon Apr 03 18:40:14 2017

On heap tracking datastructure overflow, degrade instead of CHECK()

Based on ajwong's CL https://codereview.chromium.org/2784783003/
This allows the heap profiler to keep working on systems with heavy
memory usage (sometimes caused by strange profiles).

Next step is to surface the overflow on the tracing UI.

BUG=698079

Review-Url: https://codereview.chromium.org/2787383002
Cr-Commit-Position: refs/heads/master@{#461485}

[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register.cc
[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register.h
[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register_unittest.cc

Issue 673009 has been merged into this issue.
