
Issue 812262

Starred by 1 user

Issue metadata

Status: Assigned
Pri: 3
Type: Bug




Current algorithm for native heap sampler has a minor edge case.

Project Member Reported by erikc...@chromium.org, Feb 14 2018

Issue description

For detailed writeup with examples, see https://bugs.chromium.org/p/chromium/issues/detail?id=810748#c6.

Current algorithm:
"""
def MaybeRecordAlloc(size):
  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""

Correct algorithm:
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return

  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval = GetNextSampleInterval()
"""

Pathological behavior that is frequently observed in Chrome:

expected_sample_interval = 100

MaybeRecordAlloc(10000)
MaybeRecordAlloc(1)
MaybeRecordAlloc(1)
... <-- 100 total calls to MaybeRecordAlloc
MaybeRecordAlloc(1)

This pattern results from large container resizes, followed by moving/adding elements into the resulting container.
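
To make the edge case concrete, here is a minimal, self-contained Python sketch (an illustration, not the profiler's code) that replays the sequence above through the current algorithm, assuming exponentially distributed sample intervals with a mean of 100. Because the large allocation drives the countdown far below zero, nearly every one of the trailing 1-byte allocations is also recorded:

"""
import random

EXPECTED_INTERVAL = 100  # illustrative mean sampling interval, in bytes
sample_interval = EXPECTED_INTERVAL
recorded = []            # sizes of the allocations that got sampled

def get_next_sample_interval():
    # Exponentially distributed, as in Poisson-process sampling.
    return random.expovariate(1.0 / EXPECTED_INTERVAL)

def maybe_record_alloc(size):
    # Current algorithm: no size threshold, single `if`, additive reset.
    global sample_interval
    sample_interval -= size
    if sample_interval < 0:
        recorded.append(size)
        sample_interval += get_next_sample_interval()

maybe_record_alloc(10000)
for _ in range(100):
    maybe_record_alloc(1)

# Roughly all 100 of the one-byte allocations end up recorded.
print(sum(1 for s in recorded if s == 1))
"""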
 
Err blegh, my mistake, the new algorithm should be:

"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return

  sample_interval -= size
  if sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""

We should still use += on sample_interval; otherwise we will undercount allocations whose size is close to the sampling interval.
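
To see why `+=` matters for the small-allocation path, here is a rough simulation (illustrative only; it assumes exponentially distributed intervals) of a long stream of allocations just under the interval size. The `=` reset throws away the overshoot every time it records, so it systematically undercounts, while the `+=` reset stays close to the ideal sampling rate:

"""
import random

EXPECTED_INTERVAL = 100  # illustrative mean sampling interval, in bytes
ALLOC_SIZE = 90          # allocations close in size to the interval
NUM_ALLOCS = 200000

def sampled_fraction(additive_reset):
    interval = random.expovariate(1.0 / EXPECTED_INTERVAL)
    samples = 0
    for _ in range(NUM_ALLOCS):
        interval -= ALLOC_SIZE
        if interval < 0:
            samples += 1
            next_interval = random.expovariate(1.0 / EXPECTED_INTERVAL)
            # '+=' carries the overshoot forward; '=' discards it.
            interval = interval + next_interval if additive_reset else next_interval
    # Ideal Poisson sampling records about total_bytes / EXPECTED_INTERVAL samples.
    return samples / (NUM_ALLOCS * ALLOC_SIZE / EXPECTED_INTERVAL)

print("+= reset:", sampled_fraction(True))   # close to 1.0
print("=  reset:", sampled_fraction(False))  # well below 1.0, i.e. undercounting
"""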

Comment 2 by alph@chromium.org, Feb 14 2018

Hmmm... But the SHP implementation does not use the algorithm you mention. Namely, it neither distinguishes between small and large allocations, nor does it do "+=" on sample_interval.

Could that be a bug in your implementation?

Also in the allocation pattern of:
1x10000
100x1

with sampling_interval of 100, the latter allocations (100x1) are on par with the interval, so the precision error is expected there.

Looks like WAI to me.

An issue I'm aware of is that allocations made by SHP itself get attributed to the current sample, but that's a different story. I plan to address it later.
> Hmmm... But SHP implementation does not use the algorithm you mention. Namely it neither distinguish between small and large allocations, nor does "+=" on sample_interval.

If SHP doesn't do +=, then it will undercount allocations of size ~ sample_interval. If it naively uses += for both large and small allocations, it will overcount small allocations following large allocations.

"""

Also in the allocation pattern of:
1x10000
100x1

with sampling_interval of 100, the latter allocations (100x1) are on par with the interval, so the precision error is expected there.
"""
Repeat this set of allocations 1 million times. You will see that the 100x1 allocations are pathologically overcounted when naively using +=. 

I've created python scripts to test all of this. You may wish to do the same ;). Let's chat over VC.
I'm very far from the implementation, but from a common-sense standpoint, the algorithm should be something like:

"""
def MaybeRecordAlloc(size):
  sample_interval -= size
  while sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""

Each sample should be attributed the default (non-floating) sampling interval. You can't unconditionally record the sample, and you can't add guard expressions - they skew the profile.
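
A minimal sketch of what that attribution could look like (an illustration of the idea, not the DevTools code): every recorded sample is credited with the fixed mean interval, regardless of the randomized interval that happened to trigger it:

"""
import random

MEAN_INTERVAL = 128 * 1024  # illustrative fixed mean sampling interval, in bytes
samples = []                # (size, weight) pairs; weight is in estimated bytes
sample_interval = random.expovariate(1.0 / MEAN_INTERVAL)

def maybe_record_alloc(size):
    global sample_interval
    sample_interval -= size
    while sample_interval < 0:
        # Attribute the default, non-floating sampling interval to each sample.
        samples.append((size, MEAN_INTERVAL))
        sample_interval += random.expovariate(1.0 / MEAN_INTERVAL)

def estimated_total_bytes():
    return sum(weight for _, weight in samples)
"""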
Attaching my test program, copy-pasting results:

For the math behind what's happening, see: https://bugs.chromium.org/p/chromium/issues/detail?id=810748#c15

Summary: Poisson-process sampling relies on the sampled distribution being uniform. Chrome has pathologically non-uniform distributions. By adding a conditional, we move from Poisson-process sampling of everything to Poisson-process sampling of small allocations and RecordEverything of large allocations, which is what we want to do anyway. We artificially restrict the population we're sampling to a more uniform one, thereby decreasing the effect of pathological edge cases. Such edge cases still exist, but do not appear to occur frequently in Chrome. If we wanted an even more accurate algorithm [with higher hot-path cost], I would suggest:

"""
underflow_small = GetNextSampleInterval()
underflow_large = GetNextSampleInterval()
def MaybeRecordAlloc(size):
  if size > expected_sample_interval:
    RecordAlloc()
    return
  if size > expected_sample_interval / 10:
    underflow_large -= size
    if underflow_large < 0:
      RecordAlloc()
      underflow_large += GetNextSampleInterval()
    return

  underflow_small -= size
  if underflow_small < 0:
    RecordAlloc()
    underflow_small += GetNextSampleInterval()
"""

Effectively, we're using 2 poisson processes to sample medium sized allocations vs small sized allocations, thereby further reducing non-uniformity in the population.



"""
erikchen@erikchen-macpro3 ~/projects/test_python_files $ python sampling_framework.py
==============================================
--------------------
technique: PerfectRandomSampling
distribution: [50] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.50298
for size 50 correct ratio [closer to 1 is better]: 1.00596
--------------------
technique: PerfectRandomSampling
distribution: [1000000, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.504825
for size 1000000 correct ratio [closer to 1 is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 0.965
--------------------
technique: PerfectRandomSampling
distribution: [99, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.50006
for size 1 correct ratio [closer to 1 is better]: 0.998
for size 99 correct ratio [closer to 1 is better]: 0.99014
==============================================
--------------------
technique: RecordEverything
distribution: [50] repeat count: 100000
proportion of allocations sampled [smaller is better]: 1.0
for size 50 correct ratio [closer to 1 is better]: 1.0
--------------------
technique: RecordEverything
distribution: [1000000, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 1.0
for size 1000000 correct ratio [closer to 1 is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 1.0
--------------------
technique: RecordEverything
distribution: [99, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 1.0
for size 99 correct ratio [closer to 1 is better]: 1.0
==============================================
--------------------
technique: PoissonAdditive
distribution: [50] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.50163
for size 50 correct ratio [closer to 1 is better]: 1.00326
--------------------
technique: PoissonAdditive
distribution: [1000000, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 1.0
for size 1000000 correct ratio [closer to 1 is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 100.0
--------------------
technique: PoissonAdditive
distribution: [99, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 100.0
for size 99 correct ratio [closer to 1 is better]: 1.0
==============================================
--------------------
technique: PoissonNonAdditive
distribution: [50] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.39067
for size 50 correct ratio [closer to 1 is better]: 0.78134
--------------------
technique: PoissonNonAdditive
distribution: [1000000, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.505035
for size 1000000 correct ratio [closer to 1 is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 1.007
--------------------
technique: PoissonNonAdditive
distribution: [99, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.31995
for size 1 correct ratio [closer to 1 is better]: 1.043
for size 99 correct ratio [closer to 1 is better]: 0.62947
==============================================
--------------------
technique: PoissonAdditiveWithThreshold
distribution: [50] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.5023
for size 50 correct ratio [closer to 1 is better]: 1.0046
--------------------
technique: PoissonAdditiveWithThreshold
distribution: [1000000, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.50493
for size 1000000 correct ratio [closer to 1 is better]: 1.0
for size 1 correct ratio [closer to 1 is better]: 0.986
--------------------
technique: PoissonAdditiveWithThreshold
distribution: [99, 1] repeat count: 100000
proportion of allocations sampled [smaller is better]: 0.49974
for size 1 correct ratio [closer to 1 is better]: 32.492
for size 99 correct ratio [closer to 1 is better]: 0.67456
erikchen@erikchen-macpro3 ~/projects/test_python_files $ 
"""
Attachment: sampling_framework.py (5.3 KB)

Comment 6 by alph@chromium.org, Feb 15 2018

Cc: primiano@chromium.org pfeldman@chromium.org
Thank you Pavel. That's exactly how the basic algorithm is supposed to work.

However, I made an optimization that eliminates the loop and records just a single sample, assigning it a weight proportional to the accumulated value.

Effectively, it uses the current interval value for the next N intervals. It's a math question whether we may do that. I don't have a proof, but I now feel it could be unsafe. However, I also feel that it's OK to replace any interval with the mean value.
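
My reading of that optimization, as a hedged Python sketch (a paraphrase of the description above, not the actual SHP code): accumulate bytes towards a randomized threshold and, when it is crossed, emit one sample weighted by everything accumulated so far, instead of looping once per crossed interval:

"""
import random

MEAN_INTERVAL = 128 * 1024   # illustrative mean sampling interval, in bytes
samples = []                 # (size, weight_in_bytes)
accumulated = 0              # bytes accumulated since the last sample
next_sample_at = random.expovariate(1.0 / MEAN_INTERVAL)

def maybe_record_alloc(size):
    global accumulated, next_sample_at
    accumulated += size
    if accumulated >= next_sample_at:
        # Single sample, weighted by the accumulated value.
        samples.append((size, accumulated))
        accumulated = 0
        next_sample_at = random.expovariate(1.0 / MEAN_INTERVAL)
"""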

Anyway, the current implementation seems to work fine for large allocations >> sampling_interval, as well as
for small frequent allocations whose totals are also >> sampling_interval.

Speaking of the pathological case you provided, I cannot reproduce it in my implementation. Note I don't distinguish between small and large allocations in the algorithm; they are all treated the same way.
The code I sampled with 10KiB Poisson intervals is:

"""
NOINLINE void* AllocChunk1K() {
  return new char[1024];
}
NOINLINE void* AllocChunk100K() {
  return new char[100 * 1024];
}

  for (int k = 0; k < 10000; ++k) {
    AllocChunk100K();
    for (int j = 0; j < 100; j++)
      AllocChunk1K();
  }

"""

The resulting profile looks ok to me (see attachment). Btw, what order of relative error did you observe with your initial implementation?

I also checked a case of a million 1024-byte allocations with SI=128KiB.
It produced a TOTAL=1,026,721,429 bytes. That is 0.266% more than the expected value of 1,024,000,000.
As I already mentioned, the memory the sampling profiler itself uses (for call stacks, etc.) is currently uniformly attributed to the profiled program. It's about 300 bytes per sample. In this case we had TOTAL/SI = 7833 samples, which account for 7833*300 = 2,349,900 bytes. If we subtract that from TOTAL, we get 1,024,371,529, or 0.036% error.
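
For reference, a quick sanity check of the arithmetic above, using only the numbers quoted in this comment:

"""
SI = 128 * 1024                  # sampling interval used in the test, in bytes
EXPECTED = 1000000 * 1024        # a million 1024-byte allocations
TOTAL = 1026721429               # reported total

raw_error = (TOTAL - EXPECTED) / EXPECTED                   # ~0.266%
num_samples = TOTAL // SI                                   # 7833 samples
overhead = num_samples * 300                                # 2,349,900 bytes of profiler overhead
corrected_error = (TOTAL - overhead - EXPECTED) / EXPECTED  # ~0.036%

print("raw: %.3f%%, corrected: %.3f%%" % (raw_error * 100, corrected_error * 100))
"""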

Attachment: Screenshot from 2018-02-15 10:06:53.png (24.4 KB)

Comment 7 Deleted

Comment 8 by alph@chromium.org, Feb 15 2018

Hmmm, I think I found a pattern where my optimization breaks.

Namely it's
[100, 1, 100, 1, 100, 1, ...] with a SI of 10.
When an allocation of 100 comes in, I extract all of it into a sample and start accumulating towards the next sample from a zero counter, which means the allocations of 1 are unlikely to be captured with an SI of 10.

I'll fix that.

Comment 9 by alph@chromium.org, Feb 15 2018

Not sure if DT implementation is affected. Need to double check that.
Mostly for my own future reference, I've attached an updated Python script that has the correct implementation of Pavel's suggested algorithm in c#4, my simple modification to improve performance w/o affecting correctness, as well as some comparison targets [record all allocations, perfect random sampling].
Attachment: sampling_framework.py (5.0 KB)
Okay, posting here again, with the caveat that I'm not the expert, so take this with a grain of salt.

The current algorithm in v8 is:
"""
void AllocationObserver::AllocationStep(int bytes_allocated,
                                        Address soon_object, size_t size) {
  DCHECK_GE(bytes_allocated, 0);
  bytes_to_next_step_ -= bytes_allocated;
  if (bytes_to_next_step_ <= 0) {
    Step(static_cast<int>(step_size_ - bytes_to_next_step_), soon_object, size);
    step_size_ = GetNextStepSize();
    bytes_to_next_step_ = step_size_;
  }
  DCHECK_GE(bytes_to_next_step_, 0);
}
"""

Note "step_size_ = GetNextStepSize();"

My claim is that this will incorrectly sample the sequence of allocations [100, 1, 100, 1, ...] with SI = 10.

This is tested in my first sampling_framework.py in C#5, as PoissonNonAdditive. 

Just switching to "step_size_ += GetNextStepSize();" Does not fix the problem. See sampling_framework.py in C#5, as PoissonAdditive. 

The current implementation I'm switching to for OOP HP is simple - it goes back to proper poisson sampling, but pre-filters allocations by size. This is tested in sampling_framework.py in c#10, and has the correct behavior on all tested sequenced [and matches behavior of a pure poisson sampling implementation].

"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()
    return

  sample_interval -= size
  while sample_interval < 0:
    RecordAlloc()
    sample_interval += GetNextSampleInterval()
"""

Comment 12 by alph@chromium.org, Feb 15 2018

Hmm... checked it with a quick test and it seems to work ok.

"""
function alloc100() {
    buf.push(new Array(100 * 1024 / 8).fill(42));
}

function alloc1() {
    buf.push(new Array(1024 / 8).fill(42));
}

function foo() {
    buf = [];
    for (var i = 0; i < 1000; ++i) {
        alloc100()
        alloc1();
    }
}
"""
Attachment: Screenshot from 2018-02-15 15:57:25.png (30.0 KB)

Comment 13 by alph@chromium.org, Feb 16 2018

SI=16K
With SI=16K, try with allocations 16K, 160b, 16K, 160b, ...

Attached demo w/ screenshots.
Attachment: poisson.html (438 bytes)
Attachment: Screen Shot 2018-02-15 at 7.15.18 PM.png (25.5 KB)
"""
def MaybeRecordAlloc(size):
  if size >= expected_sample_interval:
    RecordAlloc()                                <==== Not in loop, bad
    return

  sample_interval -= size
  while sample_interval < 0:                     <==== In loop, good
    RecordAlloc()                  
    sample_interval += GetNextSampleInterval()
"""

You can't do that. My original algorithm stands!
To expand on this, you either estimate the number of intervals of expected_sample_interval length and use the weight, or you use while sample_interval < 0. But you should never do both.
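
In other words (a hedged sketch of the two alternatives being contrasted; the function names and the ceil-based estimate are mine, purely for illustration): either loop and emit one sample per crossed interval, or emit a single sample weighted by the estimated number of crossed intervals, but never both:

"""
import math
import random

MEAN_INTERVAL = 128 * 1024  # illustrative mean sampling interval, in bytes

# Variant A: loop; one sample per crossed interval, each worth MEAN_INTERVAL bytes.
def record_with_loop(sample_interval, record_sample):
    while sample_interval < 0:
        record_sample(MEAN_INTERVAL)
        sample_interval += random.expovariate(1.0 / MEAN_INTERVAL)
    return sample_interval

# Variant B: no loop; one sample whose weight covers all crossed intervals at once.
def record_with_weight(sample_interval, record_sample):
    if sample_interval < 0:
        crossed = math.ceil(-sample_interval / MEAN_INTERVAL)
        record_sample(crossed * MEAN_INTERVAL)
        sample_interval += crossed * MEAN_INTERVAL
    return sample_interval
"""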
To follow up, since this thread is a bit confusing: c#14 contains a minimized repro that demonstrates an edge case where the current implementation of the devtools sampling heap profiler does not behave correctly. 

Comment 18 by alph@chromium.org, Feb 16 2018

Regarding comment 14. Unfortunately, it's hard to see if the results are off with this testcase. The optimizing compiler kicked in at a certain point and inlined both alloc16kb and alloc160b into foo. Therefore foo is attributed with most allocations, and it's impossible to break them down by callee.

I have a better testcase:
"""
function alloc32k() {
    try {
      return new Array(32 * 1024 / 8).fill(42);
    } catch (e) {}
}

function alloc32() {
    try {
      return []; // 32 bytes objects
    } catch (e) {}
}

function foo() {
    const count = 10000;
    buf2 = new Array(count).fill(0);
    buf1 = new Array(count).fill(0);
    for (var i = 0; i < count; ++i) {
        buf2[i] = alloc32k(); 
        buf1[i] = alloc32(); 
    }
}
"""

The values I observe are:
10,000 x 32kb allocations report 327,575,440
10,000 x 32b  allocations report 443,040

The 32kb allocations seem to be OK, while the 32b ones look overattributed (the expected value is ~320,000).


Attachment: Screenshot from 2018-02-16 10:41:38.png (29.3 KB)
Erik, this is a nice finding, but it has nothing to do with the discussion we have here; it relates to a different product and different code. We should track it as a bug against v8 and focus on the native sampling heap profiler here.
Summary: Current algorithm for native heap sampler has a minor edge case. (was: Current algorithm to devtools/NHP heap sampler has a bug.)
Understood. Let me file a new bug.

Comment 21 by bugdroid1@chromium.org (Project Member), Mar 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b956afc8367e2c2aa038b2afd37b204eca2c6dc8

commit b956afc8367e2c2aa038b2afd37b204eca2c6dc8
Author: Alexei Filippov <alph@chromium.org>
Date: Mon Mar 12 22:41:08 2018

Sampling Heap Profiler: Use TLS for accumulated bytes.

The TLS version has the same performance compared to lock-free version,
and also has the following benefits:
  - simpler code
  - handles multithreaded allocations with higher accuracy

It does, though, have a potential corner-case issue when there are lots
of short-living threads, each allocating a small amount of memory.

BUG= 803276 ,812262

Change-Id: Ie868f07b99559d8cc95d134eed6592bffe1f63aa
Reviewed-on: https://chromium-review.googlesource.com/944052
Commit-Queue: Alexei Filippov <alph@chromium.org>
Reviewed-by: Pavel Feldman <pfeldman@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Reviewed-by: Erik Chen <erikchen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#542635}
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler.h
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/base/sampling_heap_profiler/sampling_heap_profiler_unittest.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/content/browser/browser_main_runner.cc
[modify] https://crrev.com/b956afc8367e2c2aa038b2afd37b204eca2c6dc8/content/renderer/renderer_main.cc
