The current algorithm for the native heap sampler has a minor edge case.
Issue description

For a detailed writeup with examples, see https://bugs.chromium.org/p/chromium/issues/detail?id=810748#c6.

Current algorithm:
"""
def MaybeRecordAlloc(size):
  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""

Correct algorithm:
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return
  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval = GetNextSampleInterval()
"""

Pathological behavior that is frequently observed in Chrome:
"""
expected_sample_interval = 100
MaybeRecordAlloc(10000)
MaybeRecordAlloc(1)
MaybeRecordAlloc(1)
...   <-- 100 total calls to MaybeRecordAlloc
MaybeRecordAlloc(1)
"""
This results from large container resizes, followed by moving/adding elements into the resulting container.
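A minimal sketch of this behavior (fixed interval instead of a Poisson draw for simplicity; names are illustrative, not the actual implementation):
"""
EXPECTED_SAMPLE_INTERVAL = 100

def get_next_sample_interval():
    # The real sampler draws this from an exponential distribution; a fixed
    # value is enough to show the effect.
    return EXPECTED_SAMPLE_INTERVAL

sample_interval = get_next_sample_interval()
recorded = []

def maybe_record_alloc(size):
    global sample_interval
    sample_interval -= size
    if sample_interval < 0:
        recorded.append(size)
        sample_interval += get_next_sample_interval()

maybe_record_alloc(10000)      # large container resize
for _ in range(100):
    maybe_record_alloc(1)      # elements moved/added into the resized container

# The large allocation leaves sample_interval deeply negative, so nearly every
# trailing 1-byte allocation gets recorded.
print(sum(1 for s in recorded if s == 1))
"""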
Feb 14 2018
Hmmm... but the SHP implementation does not use the algorithm you mention. Namely, it neither distinguishes between small and large allocations, nor does "+=" on sample_interval. Could that be a bug in your implementation?

Also, in the allocation pattern of 1x10000 followed by 100x1 with a sampling_interval of 100, the latter allocations (100x1) are on par with the interval, so the precision error is expected there. Looks like WAI to me.

An issue I'm aware of is that allocations made by SHP itself get attributed to the current sample, but that's a different story. I plan to address it later.
Feb 15 2018
> Hmmm... But SHP implementation does not use the algorithm you mention. Namely it neither distinguish between small and large allocations, nor does "+=" on sample_interval.

If SHP doesn't do +=, then it will undercount allocations of size ~ sample_interval. If it naively uses += for both large and small allocations, it will overcount small allocations following large allocations.

> Also in the allocation pattern of: 1x10000 100x1 with sampling_interval of 100, the latter allocations (100x1) are on par with the interval, so the precision error is expected there.

Repeat this set of allocations a million times. You will see that the 100x1 is pathologically undercounted using +=. I've created Python scripts to test all of this. You may wish to do the same ;). Let's chat over VC.
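For a concrete sketch of the kind of test I mean (illustrative only, not the attached scripts; exponential interval draws, each sample simply counted rather than weighted):
"""
import random

MEAN_INTERVAL = 100

def count_samples(sizes, additive):
    # additive=True carries the negative overshoot over ("+=");
    # additive=False resets the counter on every sample ("="), discarding the overshoot.
    sample_interval = random.expovariate(1.0 / MEAN_INTERVAL)
    samples = 0
    for size in sizes:
        sample_interval -= size
        if sample_interval < 0:
            samples += 1
            if additive:
                sample_interval += random.expovariate(1.0 / MEAN_INTERVAL)
            else:
                sample_interval = random.expovariate(1.0 / MEAN_INTERVAL)
    return samples

sizes = [99] * 1000000                        # allocations close to the interval
print(sum(sizes) / MEAN_INTERVAL)             # ~990000 samples expected if no bytes are lost
print(count_samples(sizes, additive=True))    # close to the expected count
print(count_samples(sizes, additive=False))   # well short: the "=" reset discards the overshoot
"""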
Feb 15 2018
I'm very far from the implementation, but from a common-sense standpoint the algorithm should be something like:
"""
def MaybeRecordAlloc(size):
  sample_interval -= size
  while sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""
Each sample should be attributed the default (non-floating) sampling interval. You can't unconditionally record the sample, and you can't add guard expressions - they skew the profile.
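A self-contained sketch of that loop (illustrative names; RecordAlloc replaced by a counter), applied to the 10000-byte allocation from the original example:
"""
import random

MEAN_INTERVAL = 100

def get_next_sample_interval():
    # Exponentially distributed intervals give a Poisson process.
    return random.expovariate(1.0 / MEAN_INTERVAL)

sample_interval = get_next_sample_interval()
sample_count = 0

def maybe_record_alloc(size):
    global sample_interval, sample_count
    sample_interval -= size
    while sample_interval < 0:
        sample_count += 1                  # RecordAlloc()
        sample_interval += get_next_sample_interval()

# A single 10000-byte allocation crosses ~100 intervals and yields ~100 samples,
# so attributed bytes (samples * MEAN_INTERVAL) track the actual bytes.
maybe_record_alloc(10000)
print(sample_count)
"""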
Feb 15 2018
Attaching my test program; results copy-pasted below. For the math behind what's happening, see: https://bugs.chromium.org/p/chromium/issues/detail?id=810748#c15

Summary: Poisson-process sampling relies on a uniform distribution being sampled. Chrome has pathologically non-uniform distributions. By adding a conditional, we move from Poisson-process sampling of everything to Poisson-process sampling of small allocations plus RecordEverything for large allocations, which is what we want to do anyway. We artificially restrict the population we're sampling to a more uniform one, thereby decreasing the effect of pathological edge cases. These still exist, but do not appear to occur frequently in Chrome.

If we wanted an even more accurate algorithm [with higher hot-path cost], I would suggest:
"""
underflow_small = GetNextSampleInterval()
underflow_large = GetNextSampleInterval()

def MaybeRecordAlloc(size):
  if size > expected_sample_interval:
    RecordAlloc()
    return
  if size > expected_sample_interval / 10:
    underflow_large -= size
    if underflow_large < 0:
      RecordAlloc()
      underflow_large += GetNextSampleInterval()
    return
  underflow_small -= size
  if underflow_small < 0:
    RecordAlloc()
    underflow_small += GetNextSampleInterval()
"""
Effectively, we're using two Poisson processes to sample medium-sized vs. small-sized allocations, thereby further reducing non-uniformity in the population.

Results of "python sampling_framework.py" (repeat count 100000 in all cases; "proportion sampled" is the proportion of allocations sampled, smaller is better; "correct ratio" is given per allocation size, closer to 1 is better):
"""
technique                     distribution  proportion sampled  correct ratio per size
PerfectRandomSampling         [50]          0.50298             50: 1.00596
PerfectRandomSampling         [1000000, 1]  0.504825            1000000: 1.0, 1: 0.965
PerfectRandomSampling         [99, 1]       0.50006             1: 0.998, 99: 0.99014
RecordEverything              [50]          1.0                 50: 1.0
RecordEverything              [1000000, 1]  1.0                 1000000: 1.0, 1: 1.0
RecordEverything              [99, 1]       1.0                 1: 1.0, 99: 1.0
PoissonAdditive               [50]          0.50163             50: 1.00326
PoissonAdditive               [1000000, 1]  1.0                 1000000: 1.0, 1: 100.0
PoissonAdditive               [99, 1]       1.0                 1: 100.0, 99: 1.0
PoissonNonAdditive            [50]          0.39067             50: 0.78134
PoissonNonAdditive            [1000000, 1]  0.505035            1000000: 1.0, 1: 1.007
PoissonNonAdditive            [99, 1]       0.31995             1: 1.043, 99: 0.62947
PoissonAdditiveWithThreshold  [50]          0.5023              50: 1.0046
PoissonAdditiveWithThreshold  [1000000, 1]  0.50493             1000000: 1.0, 1: 0.986
PoissonAdditiveWithThreshold  [99, 1]       0.49974             1: 32.492, 99: 0.67456
"""
Feb 15 2018
Thank you, Pavel. That's exactly how the basics of the algorithm are supposed to work.
However, I made an optimization that eliminates the loop and records just a single sample assigning it a weight proportional to the accumulated value.
Effectively, it uses the current interval value for the next N intervals. It's a math question whether we may do that. I don't have a proof, but I now feel it could be unsafe. However, I also feel that it's OK to replace any interval with the mean value.
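Roughly, as a sketch (my reconstruction, not the actual SHP code): accumulate bytes and, when they exceed the current interval, emit one sample carrying the accumulated value as its weight.
"""
import random

MEAN_INTERVAL = 10000

def get_next_sample_interval():
    return random.expovariate(1.0 / MEAN_INTERVAL)

accumulated = 0
sample_interval = get_next_sample_interval()
samples = []  # (size, weight) pairs; weight = bytes accumulated since the last sample

def maybe_record_alloc(size):
    global accumulated, sample_interval
    accumulated += size
    if accumulated >= sample_interval:
        # One weighted sample stands in for all the intervals the accumulated
        # bytes would have covered, instead of looping once per interval.
        samples.append((size, accumulated))
        accumulated = 0
        sample_interval = get_next_sample_interval()

for _ in range(1000):
    maybe_record_alloc(1024)
print(sum(w for _, w in samples), 1000 * 1024)  # attributed vs. actual bytes
"""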
Anyway, the current implementation seems to work fine for large allocations >> sampling_interval, as well as for small, frequent allocations whose total size is also >> sampling_interval.
Speaking of the pathological case you provided, I cannot reproduce it in my implementation. Note that I don't distinguish between small and large allocations in the algorithm; they are all treated the same way.
The code I sampled with 10KiB Poisson intervals is:
"""
NOINLINE void* AllocChunk1K() {
  return new char[1024];
}
NOINLINE void* AllocChunk100K() {
  return new char[100 * 1024];
}

for (int k = 0; k < 10000; ++k) {
  AllocChunk100K();
  for (int j = 0; j < 100; j++)
    AllocChunk1K();
}
"""
The resulting profile looks ok to me (see attachment). Btw, what order of relative error did you observe with your initial implementation?
I also checked a case of a million 1024-byte allocations with SI=128KiB.
It produced a TOTAL of 1,026,721,429 bytes, which is 0.266% more than the expected value of 1,024,000,000.
As I already mentioned, the memory the sampling profiler uses itself (for call stacks, etc.) is currently uniformly attributed to the profiled program. It's about 300 bytes per sample. In this case we had TOTAL/SI = 7833 samples, which account for 7833*300 = 2,349,900 bytes. If we subtract that from TOTAL, we get 1,024,371,529, or 0.036% error.
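Spelling that correction out:
"""
total = 1026721429                 # reported TOTAL, bytes
expected = 1024000000              # 1,000,000 x 1024 bytes
si = 128 * 1024                    # sampling interval
samples = total // si              # ~7833 samples
overhead = samples * 300           # ~300 bytes of profiler-internal allocations per sample
corrected = total - overhead       # 1,024,371,529
print((total - expected) / expected * 100)      # ~0.266%
print((corrected - expected) / expected * 100)  # ~0.036%
"""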
Feb 15 2018
Hmmm, I think I found a pattern where my optimization breaks: [100, 1, 100, 1, ...] with an SI of 10. When the allocation of 100 comes in, I extract all of it into a sample and start accumulating towards the next sample with a zero counter, which makes the allocations of 1 unlikely to be captured with an SI of 10. I'll fix that.
Feb 15 2018
Not sure if the DT implementation is affected. Need to double-check that.
Feb 15 2018
Mostly for my own future reference, I've attached an updated Python script that has the correct implementation of Pavel's suggested algorithm in c#4, my simple modification to improve performance w/o affecting correctness, as well as some comparison targets [record all allocations, perfect random sampling].
Feb 15 2018
Okay, posting here again, with the caveat that I'm not the expert, so take this with a grain of salt.
The current algorithm in v8 is:
"""
void AllocationObserver::AllocationStep(int bytes_allocated,
                                        Address soon_object, size_t size) {
  DCHECK_GE(bytes_allocated, 0);
  bytes_to_next_step_ -= bytes_allocated;
  if (bytes_to_next_step_ <= 0) {
    Step(static_cast<int>(step_size_ - bytes_to_next_step_), soon_object,
         size);
    step_size_ = GetNextStepSize();
    bytes_to_next_step_ = step_size_;
  }
  DCHECK_GE(bytes_to_next_step_, 0);
}
"""
Note "step_size_ = GetNextStepSize();"
My claim is that this will incorrectly sample the sequence of allocations [100, 1, 100, 1, ...] with SI = 10.
This is tested in my first sampling_framework.py in C#5, as PoissonNonAdditive.
Just switching to "step_size_ += GetNextStepSize();" does not fix the problem. See sampling_framework.py in C#5, as PoissonAdditive.
The current implementation I'm switching to for OOP HP is simple - it goes back to proper Poisson sampling, but pre-filters allocations by size. This is tested in sampling_framework.py in c#10, and has the correct behavior on all tested sequences [and matches the behavior of a pure Poisson sampling implementation].
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return
  sample_interval -= size
  while sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""
Feb 15 2018
Hmm... checked it with a quick test and it seems to work ok.
"""
function alloc100() {
  buf.push(new Array(100 * 1024 / 8).fill(42));
}
function alloc1() {
  buf.push(new Array(1024 / 8).fill(42));
}
function foo() {
  buf = [];
  for (var i = 0; i < 1000; ++i) {
    alloc100();
    alloc1();
  }
}
"""
Feb 16 2018
SI=16K
Feb 16 2018
With SI=16K, try with allocations 16K, 160b, 16K, 160b, ... Attached demo w/ screenshots.
Feb 16 2018
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()                              <==== Not in loop, bad
    return
  sample_interval -= size
  while sample_interval < 0:                   <==== In loop, good
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""
You can't do that. My original algorithm stands!
Feb 16 2018
To expand on this: you either estimate the number of intervals of expected_sample_interval length and use that as the weight, or you use "while sample_interval < 0". But you should never do both.
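A sketch of the weight-based alternative (illustrative only, not the actual implementation; the per-interval loop version is the one quoted above): estimate how many expected-length intervals were crossed and emit a single sample with that weight. Combining it with the loop would count the crossed intervals twice.
"""
import random

MEAN_INTERVAL = 100
samples = []  # recorded weights; attributed bytes ~= sum(samples) * MEAN_INTERVAL

def get_next_sample_interval():
    return random.expovariate(1.0 / MEAN_INTERVAL)

sample_interval = get_next_sample_interval()

def maybe_record_alloc(size):
    global sample_interval
    sample_interval -= size
    if sample_interval < 0:
        # One weighted sample instead of one sample per crossed interval.
        crossed = 1 + int(-sample_interval // MEAN_INTERVAL)
        samples.append(crossed)
        # Skip ahead over the crossed intervals; simplification: the skipped
        # intervals use the mean length rather than fresh random draws.
        sample_interval += crossed * MEAN_INTERVAL

maybe_record_alloc(10000)
print(samples)  # one sample with weight ~100, same attribution as ~100 looped samples
"""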
Feb 16 2018
To follow up, since this thread is a bit confusing: c#14 contains a minimized repro that demonstrates an edge case where the current implementation of the DevTools sampling heap profiler does not behave correctly.
Feb 16 2018
Regarding comment 14: unfortunately, it's hard to see if the results are off with this testcase. The optimizing compiler kicked in at a certain point and inlined both alloc16kb and alloc160b into foo. Therefore foo is attributed with most allocations, and it's impossible to break it down by callees.
I have a better testcase:
"""
function alloc32k() {
  try {
    return new Array(32 * 1024 / 8).fill(42);
  } catch (e) {}
}
function alloc32() {
  try {
    return [];  // 32-byte objects
  } catch (e) {}
}
function foo() {
  const count = 10000;
  buf2 = new Array(count).fill(0);
  buf1 = new Array(count).fill(0);
  for (var i = 0; i < count; ++i) {
    buf2[i] = alloc32k();
    buf1[i] = alloc32();
  }
}
"""
The values I observe are:
10,000 x 32kb allocations report 327,575,440
10,000 x 32b allocations report 443,040
The 32kb allocations seem to be OK, while the 32b ones look overattributed (the expected value is ~320,000).
Feb 16 2018
Erik, this is a nice finding, but it has nothing to do with the discussion we're having here; it relates to a different product and different code. We should track it as a bug against v8 and focus on the native sampling heap profiler here.
Feb 16 2018
Understood. Let me file a new bug.
Mar 12 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/b956afc8367e2c2aa038b2afd37b204eca2c6dc8

commit b956afc8367e2c2aa038b2afd37b204eca2c6dc8
Author: Alexei Filippov <alph@chromium.org>
Date: Mon Mar 12 22:41:08 2018

Sampling Heap Profiler: Use TLS for accumulated bytes.

The TLS version has the same performance as the lock-free version, and also has the following benefits:
- simpler code
- handles multithreaded allocations with higher accuracy

It does, however, have a potential corner-case issue when there are lots of short-living threads, each allocating a small amount of memory.

BUG=803276,812262
Change-Id: Ie868f07b99559d8cc95d134eed6592bffe1f63aa
Reviewed-on: https://chromium-review.googlesource.com/944052
Commit-Queue: Alexei Filippov <alph@chromium.org>
Reviewed-by: Pavel Feldman <pfeldman@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Reviewed-by: Erik Chen <erikchen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#542635}

[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler.h
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler_unittest.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/content/browser/browser_main_runner.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/content/renderer/renderer_main.cc
Comment 1 by erikc...@chromium.org
Feb 14 2018
Err, blegh, my mistake - the new algorithm should be:
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return
  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""
We should still use += on sample_interval, as otherwise we will incorrectly record allocations close to |size|.