Heap profiler: revisit the decision of deliberately crashing when running out of space in the tracker data structure
Issue description

The memory-infra heap profiler has a fixed-size data structure to track allocations [1]. It is used quite efficiently in a high-watermark fashion, so memory is actually used only when the slots of the hashtable are needed, but it has a hard cap. The rationale was: we don't want our tracker structure to grow unbounded, otherwise the hosting process will OOM because of the heap profiler itself. So far, so good.

The question now is: what do we do when we hit the limit? Back then the decision in code review was: just CHECK(), because we realistically never hit this limit in our prior experiments, and it doesn't make sense to deal with it until we prove it is a problem. The comment in the code is pretty self-explanatory in this regard. From https://cs.chromium.org/chromium/src/base/trace_event/heap_profiler_allocation_register.h?rcl=d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120&l=202

// If the hash table has too little capacity (when too little address space
// was reserved for |cells_|), |next_unused_cell_| can be an index outside
// of the allocated storage. A guard page is allocated there to crash the
// program in that case. There are alternative solutions:
// - Deal with it, increase capacity by reallocating |cells_|.
// - Refuse to insert and let the caller deal with it.
// Because free cells are re-used before accessing fresh cells with a higher
// index, and because reserving address space without touching it is cheap,
// the simplest solution is to just allocate a humongous chunk of address
// space.
CHECK_LT(next_unused_cell_, num_cells_ + 1)
    << "Allocation Register hash table has too little capacity. Increase "
       "the capacity to run heap profiler in large sessions.";

Now, we have been hitting this problem a bunch of times. It's time to deal with it.
I think the right solution here is just:
- gracefully refuse to store the allocation without crashing
- add a flag to the dump saying "this data is incomplete because we reached saturation"

Still, the trace would have good data to figure out the problem, which is better than crashing.

[1] https://cs.chromium.org/chromium/src/base/trace_event/heap_profiler_allocation_register.h?rcl=d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120&l=308
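The two bullet points above can be sketched as follows. This is a hypothetical, simplified stand-in for the real AllocationRegister (the class name, `Insert` signature, and `is_incomplete()` accessor are all illustrative, and a plain `std::unordered_map` replaces the real fixed-capacity hash table), just to show the "refuse and flag" behavior:

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical sketch: a tracker with a hard cap that refuses further
// inserts instead of crashing, and remembers that it dropped data so the
// dump can be flagged as incomplete.
class BoundedAllocationTracker {
 public:
  explicit BoundedAllocationTracker(size_t max_entries)
      : max_entries_(max_entries) {}

  // Returns false (instead of CHECK-failing) when the table is saturated.
  bool Insert(const void* address, size_t size) {
    if (allocations_.size() >= max_entries_) {
      saturated_ = true;  // Remember that the data is now incomplete.
      return false;
    }
    allocations_[address] = size;
    return true;
  }

  void Remove(const void* address) { allocations_.erase(address); }

  // The dump code would surface this as "this data is incomplete".
  bool is_incomplete() const { return saturated_; }
  size_t size() const { return allocations_.size(); }

 private:
  const size_t max_entries_;
  bool saturated_ = false;
  std::unordered_map<const void*, size_t> allocations_;
};
```

Frees of already-tracked addresses keep working after saturation, so the register stays internally consistent; only new allocations are dropped.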
Mar 3 2017
One simple solution that I wanted to implement at some point is to mmap more cells. We won't rehash, so the performance will degrade, but at least we won't hit that. Also, the funny thing is that CHECK_LT() allocates when it logs, so sometimes it deadlocks.
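A minimal sketch of that "mmap more cells, don't rehash" idea, with `std::vector` segments standing in for mmap'd regions (the class and names are made up for illustration). New fixed-size segments are appended on demand; because existing cells never move or rehash, pointers into them stay valid, and lookups that have to scan extra segments get slower rather than crashing:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of growing the cell store without rehashing.
struct Cell {
  const void* address = nullptr;
  size_t size = 0;
};

class SegmentedCellStore {
 public:
  explicit SegmentedCellStore(size_t cells_per_segment)
      : cells_per_segment_(cells_per_segment) {}

  // Hands out the next free cell, appending a new segment when the current
  // one is full. In the real register this would be a fresh mmap'd region.
  Cell* Allocate() {
    if (segments_.empty() || segments_.back()->size() == cells_per_segment_) {
      segments_.push_back(std::make_unique<std::vector<Cell>>());
      // reserve() up front so cells within a segment never relocate.
      segments_.back()->reserve(cells_per_segment_);
    }
    segments_.back()->push_back(Cell{});
    return &segments_.back()->back();
  }

  size_t segment_count() const { return segments_.size(); }

 private:
  const size_t cells_per_segment_;
  std::vector<std::unique_ptr<std::vector<Cell>>> segments_;
};
```

The performance degradation the comment mentions comes from the hash table's load factor growing unchecked: the bucket count stays fixed while chains spill into the overflow segments.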
Mar 3 2017
> the current system doesn't make much sense

Why? The cap still allows recording several million objects. If we reach a state where the tracker is saturated with a million objects, IMHO there are very good chances that your leak is in there. Essentially what I am saying is: in a scenario where a process is leaking 4 million objects, do we really need to record all 4 million of them to get actionable data? My claim is that even if we record just 1 million, that will be enough to find the culprits. Once saturation is reached, the accuracy of the heap profiler is less relevant. Unless you tell me that there are situations where having 1 million objects is WAI, in which case we can discuss increasing the cap of the heap profiler tracker.

> One simple solution that I wanted to implement at some point is to mmap more cells.

Again, I'd like to understand why we think we need to be able to track more than O(M) objects. IMHO once we reach that point we already have a big portion of the problem in our hands. Why shoot for 100% accuracy when you already have enough data about the problem in your hands? IMHO the right action is removing that CHECK and adding the required null check to bypass the Insert() when reaching saturation. The rest of the code should keep working as is. Or am I missing something else?
Mar 3 2017
> > the current system doesn't make much sense
> Why? The cap still allows to record several million objects. If we reach a state where the tracker is saturated with million objects IMHO there are very good chances that your leak is inside there.

Understood. Let's see what happens when we actually start running the native heap profiler on desktop machines with issues. Either way, I think you'd agree that taking down the process is not the right behavior.
Mar 3 2017
Yes, gracefully handling exhaustion is another way to go. The catch is that we have two tables there: one for allocations and another for backtraces, so we'll need to somehow make it visible (in the UI) which table reached capacity.
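The two-table situation above could be surfaced with something as small as a per-table flag carried in the dump; the struct and field names here are purely illustrative, not actual Chromium code:

```cpp
#include <string>

// Hypothetical sketch: the register has two fixed-size tables (allocations
// and backtraces); either can fill up independently, so the dump carries a
// per-table "out of storage" flag that the UI can turn into a warning bar.
struct HeapDumpStats {
  bool allocations_table_full = false;
  bool backtraces_table_full = false;

  bool is_incomplete() const {
    return allocations_table_full || backtraces_table_full;
  }

  // What a tracing-UI banner might say; the wording echoes the suggestion
  // later in this thread.
  std::string WarningText() const {
    if (!is_incomplete()) return "";
    return "This trace is incomplete, you cannot fully trust this data";
  }
};
```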
Mar 3 2017
> Either way, I think you'd agree that taking down the process is not the right behavior

Right. The only reason that check is there is because back then I didn't want to deal with making the tracing UI cope with that. I was not expecting Chrome to get so borked as to reach this insane state. I was so young, innocent and full of hopes :)

> so we'll need to somehow make visible (in the UI) which table reached capacity

I think that a red bar that says "this trace is incomplete, you cannot fully trust this data" is enough. Telling which table saturated doesn't really make it any easier to interpret the results. At that point you just hope that 90% of the memory shows up in one place.
Mar 3 2017
> Why? The cap still allows to record several million objects. If we reach a state where the tracker is saturated with million objects IMHO there are very good chances that your leak is inside there.

I've been running native heap profiling with my Canary profile. Just loading 3 profiles creates over a million objects, and I hit this limit within a few minutes. I once again think we should increase the limit.
Mar 3 2017
Bonus points if we increase the limit *and* get folks to start tackling the fact that we seem to allocate way too many things. ;)
Mar 28 2017
Snagging this. Gonna just make it drop entries and (maybe) count the number of drops first. Let me know if anyone objects.
Mar 28 2017
Could we also bump the limits on desktop? I hit the current limits on my own profile in a couple of minutes. Here are the ridiculously larger numbers I'm using: https://paste.googleplex.com/6529243055915008
Mar 28 2017
Sounds good, shouldn't be too hard. Please cc me on CL.
Mar 28 2017
Here's a list of other changes I've made locally:

1) Change the dump interval between detailed dumps (since a single dump takes well over 2s). I used 10s, but maybe longer would be better. Note that if it's too long, the first dump never starts.
   const TraceConfig::MemoryDumpConfig::Trigger kDefaultHeavyMemoryDumpTrigger

2) Increase max frame count:
   - enum { kMaxFrameCount = 48 };
   + enum { kMaxFrameCount = 200 };

3) And max frame size, in base/trace_event/heap_profiler_allocation_context_tracker.cc:
   - const void* frames[128];
   + const void* frames[256];
Mar 29 2017
How does increasing the dump interval help? The next dump is not triggered till the previous dump is over.
Mar 29 2017
I found that if I took 2 or 3 dumps, the result was too large to load in tracing. It would take ~10-30s to take a single dump, and then Chrome would immediately start taking another dump, so I couldn't control the number of dumps that Chrome emitted.
Mar 30 2017
> How does increasing the dump interval help? The next dump is not triggered till the previous dump is over.

Yeah, I think this is more a matter of "too much stuff ends up in the trace". +1 to implicitly raising the 2s interval when the heap profiler is enabled.

+dskiba for frame size. He did a lot of math about that in the past; there is a balance between the depth of frames, the usage of the backtrace hashtable, and the time it takes to symbolize the trace.
Apr 3 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/c1d887e1b41c37348a412bcb956b641ef5b5caf1

commit c1d887e1b41c37348a412bcb956b641ef5b5caf1
Author: primiano <primiano@chromium.org>
Date: Mon Apr 03 18:40:14 2017

On heap tracking datastructure overflow, degrade instead of CHECK()

Based on ajwong's CL https://codereview.chromium.org/2784783003/
This allows the heap profiler to keep working on systems with heavy memory
usage (sometimes caused by strange profiles). Next step is to surface the
overflow on the tracing UI.

BUG=698079
Review-Url: https://codereview.chromium.org/2787383002
Cr-Commit-Position: refs/heads/master@{#461485}

[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register.cc
[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register.h
[modify] https://crrev.com/c1d887e1b41c37348a412bcb956b641ef5b5caf1/base/trace_event/heap_profiler_allocation_register_unittest.cc
Apr 4 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7cf40cf83838145d2e6310f4f923f316b859e4f1

commit 7cf40cf83838145d2e6310f4f923f316b859e4f1
Author: ajwong <ajwong@chromium.org>
Date: Tue Apr 04 20:31:21 2017

heap-profiler: Clean up some comments and C++.

BUG=698079
Review-Url: https://codereview.chromium.org/2797603002
Cr-Commit-Position: refs/heads/master@{#461817}

[modify] https://crrev.com/7cf40cf83838145d2e6310f4f923f316b859e4f1/base/trace_event/heap_profiler_allocation_register.cc
[modify] https://crrev.com/7cf40cf83838145d2e6310f4f923f316b859e4f1/base/trace_event/heap_profiler_allocation_register.h
May 18 2018
Issue 673009 has been merged into this issue.
Comment 1 by erikc...@chromium.org, Mar 2 2017