New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 836238 link

Starred by 4 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 836250



Sign in to add a comment

Fix histograms with mismatched construction arguments.

Project Member Reported by bcwh...@chromium.org, Apr 24 2018

Issue description

Sometimes histograms get instantiated with different parameters.  These used to crash but are now reported as a histogram:

https://uma.googleplex.com/p/chrome/histograms/?endDate=latest&dayCount=28&histograms=Histogram.MismatchedConstructionArguments&fixupData=true&showMax=true&filters=channel%2Ceq%2C1%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial

Instantiations different from the first cause a "dummy" histogram to be returned so that the caller will still execute but the reported data will be quietly dropped.

 
Blockedon: 836250
Project Member

Comment 2 by bugdroid1@chromium.org, Apr 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8e2f297bc419fb270427eec9b8a178446528a987

commit 8e2f297bc419fb270427eec9b8a178446528a987
Author: Brian White <bcwhite@chromium.org>
Date: Tue Apr 24 18:26:47 2018

Add name of histogram with mismatched params.

Bug: 836238
Change-Id: Ibec5500a6d9348a30009d3d6d97a51c38a8e64ef
Reviewed-on: https://chromium-review.googlesource.com/1026051
Reviewed-by: Brian White <bcwhite@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#553207}
[modify] https://crrev.com/8e2f297bc419fb270427eec9b8a178446528a987/tools/metrics/histograms/enums.xml

I've been looking through this a bit today and something doesn't seem right.  For example, Compositing.Browser.LayersUpdateTime (-1796008467) has some 5000 reports just for M68 (canary/dev) but the code is constant:

    std::string histogram_name =
        base::StringPrintf("Compositing.%s.LayersUpdateTime", client_name);
    base::UmaHistogramCounts10M(histogram_name, elapsed);

I'm wondering if these could be due to memory corruption that is causing the "existing" values to become corrupted and thus mismatch with the constants when it is again accessed.

Issue 884655 has been merged into this issue.
Project Member

Comment 6 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/673081e3d2e172e3875ffae881deab0d3f93e497

commit 673081e3d2e172e3875ffae881deab0d3f93e497
Author: Brian White <bcwhite@chromium.org>
Date: Tue Sep 18 19:02:34 2018

Field crash if histogram bucket creation isn't as expected.

There are reports from the field that the "maximum" of a histogram
does not match what was configured.  The working theory is that it is
due to rounding errors in the creation of exponential buckets.
Verify this assumption.  If it's behaving wrong, it should crash
during the creation.

Bug: 836238
Change-Id: I4cf5e4bc3986e288c728ae37e2ff8cf841b7598d
Reviewed-on: https://chromium-review.googlesource.com/1228840
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592128}
[modify] https://crrev.com/673081e3d2e172e3875ffae881deab0d3f93e497/base/metrics/histogram.cc

Cc: asvitk...@chromium.org
Labels: -Pri-3 Pri-2
Status: Started (was: Assigned)
Reviewing the crashes from
https://chromium-review.googlesource.com/1026051

There's no real consistency in how they differ.
1000000 => 999989
1000000 => 999424
 500000 => 499994
  10000 =>   9961

Most odd is that 1M gets converted to two different numbers.  Rounding errors there may be but they should at least be consistent.  Number of buckets was 50 in all the above cases.  The same histogram (by name) always has the same difference, though.

Of note, the 5000000 => 499994 is the Chrome Installer:
https://cs.chromium.org/chromium/src/chrome/installer/setup/setup_main.cc?rcl=8b1391794f25f6e7363771a43c1988b3c5b8e5a5&l=1455

Of extra note, there is an identically defined histogram on the line above that does *not* mismatch!

Bucket ranges are reused so whatever was created on the previous line should be reused for the one that is failing.  It could be some kind of memory corruption that is changing the last bucket (or multiple buckets -- only the last one is being reported) between one call and the next.  It seems important that it is happening with some frequency between two identical calls that have nothing in between.
Project Member

Comment 8 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa

commit 8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa
Author: Brian White <bcwhite@chromium.org>
Date: Wed Sep 19 13:52:00 2018

Revert "Field crash if histogram bucket creation isn't as expected."

This reverts commit 673081e3d2e172e3875ffae881deab0d3f93e497.

Reason for revert: Test complete. One day in the field is sufficient.

Original change's description:
> Field crash if histogram bucket creation isn't as expected.
> 
> There are reports from the field that the "maximum" of a histogram
> does not match what was configured.  The working theory is that it is
> due to rounding errors in the creation of exponential buckets.
> Verify this assumption.  If it's behaving wrong, it should crash
> during the creation.
> 
> Bug: 836238
> Change-Id: I4cf5e4bc3986e288c728ae37e2ff8cf841b7598d
> Reviewed-on: https://chromium-review.googlesource.com/1228840
> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
> Commit-Queue: Brian White <bcwhite@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#592128}

TBR=asvitkine@chromium.org,bcwhite@chromium.org

Change-Id: I0c7ecfc65d31aa2d161f1b689a178432a9364e27
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 836238
Reviewed-on: https://chromium-review.googlesource.com/1233893
Reviewed-by: Brian White <bcwhite@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592385}
[modify] https://crrev.com/8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa/base/metrics/histogram.cc

Thanks for the update.

Those lower numbers are coming from bucket ranges, right? So in theory something could be corrupting those post initial computation? Can we add a checksum (temporary to debug) to the ranges?

Or something simpler like if we know histogram X often gets affected, add some special case code that checks for that histogram name and checks what the max ranges value is when we're both storing it and accessing it later to figure out if it gets messed up initially, or later by some other event.
Correct.  The lower numbers are in the bucket ranges object.

I'm running a local build right now to see if I can find a difference between |created_ranges| and |registered_ranges| that might indicate a change in those values.  So far, no hits... but that's not really surprising since I didn't hit any of the other mismatches on a local build.

However, BucketRanges::Equals does a bucket-by-bucket comparison for equality so RegisterOrDeleteDuplicateRanges should not return a "matching" ranges object that has been corrupted.  So why one call would succeed and an immediate identical call fail, I'm not sure.

The byte differences in the range-max vs create-max aren't revealing.  Sometimes it's a one-byte difference, sometimes two bytes.  They're always a reduction and not usually a zeroing.  It doesn't look like any kind of _random_ corruption.
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/37a4f70b13dda239f802da32c91afff892a7c09d

commit 37a4f70b13dda239f802da32c91afff892a7c09d
Author: Brian White <bcwhite@chromium.org>
Date: Wed Sep 19 18:23:02 2018

Field-check ranges creation.

Check if the original ranges creation is causing the bad bucket
values or if its some kind of corruption since the last time.

Bug: 836238
Change-Id: I5b2b72504b6ac6437a9cec3ee42ae7e688be451b
Reviewed-on: https://chromium-review.googlesource.com/1234215
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592484}
[modify] https://crrev.com/37a4f70b13dda239f802da32c91afff892a7c09d/base/metrics/histogram.cc

Only one relevant crash over the weekend:
https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3557.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=03b5ebbbad42d57d&index=2#0

Looking at this in WinDbg, it's appears to be a legitimate rounding error from a maximum of 100,000,000 to 99,999,999.  That's a very small amount compared to some of the other mismatches that somehow have not occurred since I started checking the bucket creation directly.  Perhaps it's just the smaller user base since the latest check has not yet gone out in a Dev build.
Project Member

Comment 13 by bugdroid1@chromium.org, Sep 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f889b9f0b407daef2475bb7e0a60ad9ca16693f4

commit f889b9f0b407daef2475bb7e0a60ad9ca16693f4
Author: Brian White <bcwhite@chromium.org>
Date: Mon Sep 24 22:42:30 2018

Don't check-crash for linear histograms.

Most of the observed crashes are enumeration (aka "linear") ones for
"cast" that are apparently differences between extensions. Behave in
the expected fashion (return dummy histogram) for those so as to not
impact users unnecessarily.

Bug: 836238
Change-Id: I5f68146be7a5988c4b529d54861bb215869fac2b
Reviewed-on: https://chromium-review.googlesource.com/1241495
Commit-Queue: Brian White <bcwhite@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593722}
[modify] https://crrev.com/f889b9f0b407daef2475bb7e0a60ad9ca16693f4/base/metrics/histogram.cc

Two more crashes for 100,000,000 => 99,999,999 so that seems a a consistent rounding error.

Two crashes for bucket mismatch between what was created for the histogram and what was previously created for some other histogram leaving two possibilities that I can see:

1) Successive calls to CreateRanges() can produce different bucket values even though the values used in the algorithm are the same.  This seems very unlikely.

2) RegisterOrDeleteDuplicateranges() is returning a mismatched ranges set.  Since part of the comparison is a bucket-by-bucket check that all values are the same, this also seems unlikely.


Project Member

Comment 15 by bugdroid1@chromium.org, Sep 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e716f6e83718423d7732fc1a36af5300ee8e61b8

commit e716f6e83718423d7732fc1a36af5300ee8e61b8
Author: Brian White <bcwhite@chromium.org>
Date: Thu Sep 27 21:01:46 2018

Record more details about bucket value mismatches.

Bug: 836238
Change-Id: I4714f80e7a0796327bce606f89ce1f889752f39f
Reviewed-on: https://chromium-review.googlesource.com/1249396
Reviewed-by: Jesse Doherty <jwd@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#594854}
[modify] https://crrev.com/e716f6e83718423d7732fc1a36af5300ee8e61b8/base/metrics/histogram.cc

Two crashes since new debug info was added and BOTH are where a second creation of the ranges differs from the first.  There is nothing but check-code between the two calls.

https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3564.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=192f4c2c4c0fe58d&index=1#0

Bucket boundary #14 of the original |created_ranges| changed from 107 (0x6B) to 96 (0x60).  This isn't a calculation error as the previous boundary is 95 (0x5F) in both cases.  This is a 50-bucket exponential histogram from 1 to 1000.

https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3564.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=9edf01ae0922bcc8&index=0#0

Bucket boundary #8 of the original |created_ranges| changed from 8 to 4.  This is a linear histogram where each bucket boundary matches its index.

There are checksum calculation on the boundaries.  Perhaps we can use that to further narrow down where the change is happening.
Project Member

Comment 17 by bugdroid1@chromium.org, Oct 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b8ca53abff774931800f64fa522b7b0e2a8045d4

commit b8ca53abff774931800f64fa522b7b0e2a8045d4
Author: Brian White <bcwhite@chromium.org>
Date: Mon Oct 01 22:27:55 2018

Check ranges checksums for corruptions.

Try to narrow down where the ranges are getting corrupted by verifying
the internal checksum in places.

Bug: 836238
Change-Id: I05493e86d74c820c6dfe2dbe30759c04cdfb3f28
Reviewed-on: https://chromium-review.googlesource.com/1255322
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#595595}
[modify] https://crrev.com/b8ca53abff774931800f64fa522b7b0e2a8045d4/base/metrics/histogram.cc

 “Curiouser and curiouser!” Cried Alice.

Two crashes as of this morning, both impossible in the same way.

https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3569.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=e8f1fb4c45291832&index=0

                  found = false
                      i = 0x58e
          created_range = 0x58d
        recreated_range = 0x58e
                 h_name = char [100] "PartitionAlloc.CommittedSize"
                 h_type = LINEAR_HISTOGRAM (0n1)
                b_count = 0x1001
                c_count = 0x1001
                r_count = 0x1001
                c_valid = true
                r_valid = true

For a enum/linear histogram (which this is, despite its odd name), the bucket boundary value should match its index.  In this case, the "created range" is off by one.

Amazingly enough, though, the checksum matches, both in the CHECK that was done immediately after it was created and again during the debug output, say the internal checksums are correct.  For this to be true, the boundary value would have had to be wrong when the checksum is initially calculated which is immediately after the boundary values are generated... though not _while_ they are generated.

I guess the next question is...  Are they being incorrectly generated or incorrectly stored.


Here's other crash with a similarly impossible result:

https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3569.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=69c0ac9ba33e55ff&index=1

                      i = 0x869
          created_range = 0x869
        recreated_range = 0x68
                 h_name = char [100] "BlinkGC.CommittedSize"
                 h_type = LINEAR_HISTOGRAM (0n1)
                b_count = 0x1001
                c_count = 0x1001
                r_count = 0x1001
                c_valid = true
                r_valid = true

In this case, it's the "recreated" ranges that has the incorrect value though the checksum validation still thinks everything is okay.
Project Member

Comment 19 by bugdroid1@chromium.org, Oct 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/61b84b2866fe7e1917de0818e39ea7ae0a2096ed

commit 61b84b2866fe7e1917de0818e39ea7ae0a2096ed
Author: Brian White <bcwhite@chromium.org>
Date: Fri Oct 05 18:03:18 2018

Validate checksum calculations.

Crash reports indicate that checksums are valid even though the bucket
boundary values are tested to be incorrect. Try to determine if it is
memory corruption or bad calculations.

Bug: 836238
Change-Id: Ie89df5fef5093e9dd8c45fe9ff26a0430838769e
Reviewed-on: https://chromium-review.googlesource.com/c/1264036
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#597194}
[modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/bucket_ranges.cc
[modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/bucket_ranges.h
[modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/histogram.cc

Project Member

Comment 21 by bugdroid1@chromium.org, Oct 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fc3a02269e5d241d2b17503f0abe8c12d3375a33

commit fc3a02269e5d241d2b17503f0abe8c12d3375a33
Author: Brian White <bcwhite@chromium.org>
Date: Wed Oct 10 14:49:42 2018

Revert bucket-boundary checks for branch. (part 2)

Part 1 was here:
https://chromium-review.googlesource.com/c/chromium/src/+/1270345

I expect to revert this revert once the branch is done as there is
still more testing to be done here.

Bug: 836238
Change-Id: I1f85b6fa21405ceb3e569335aa995faf2050fd0f
Reviewed-on: https://chromium-review.googlesource.com/c/1271716
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#598319}
[modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/bucket_ranges.cc
[modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/bucket_ranges.h
[modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/histogram.cc

All the checks have been reverted for the upcoming branch point but a Dev build was released with them making a bigger user base from which to receive crashes.  After 3 days of Canary with no crashes, we have one:

https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27+AND+product.version+%3E%3D+%2771.0.3572.0%27&stbtiq=&reportid=4ae84211b482941a&index=0#0

                  found = false
   using_created_ranges = true
           equal_ranges = true
          histogram_ref = 0x646e6552
                      i = 0x1c
          created_range = 0x364a
        recreated_range = 0x4cf4
                 h_name = char [100] "RendererScheduler.ExpectedQueueingTimeByTaskQueue2.FrameLoading"
                 h_type = HISTOGRAM (0n0)
                b_count = 0x32
                c_count = 0x32
                r_count = 0x32
                c_valid = true
                r_valid = true

Again, all the checksums are correct.  There was also no crash inside InitializeBucketRanges() where the checksum is done twice, once over the memeory contents and once over the individual values as they're calculated.  Thus, both think they're doing the same correct thing.

The difference, however, is not minor.  One calculation placed the boundary at 13902 and the other at 19700.  These are wildly different and so not a simple rounding error.  The previous boundary must have matched, though we don't know what that value was.

Because the checksums match, that leaves the calculation of the bucket boundary itself, specifically

    double log_current;
    log_current = log(static_cast<double>(current));
    // Calculate the count'th root of the range.
    log_ratio = (log_max - log_current) / (bucket_count - bucket_index);
    // See where the next bucket would start.
    log_next = log_current + log_ratio;
    Sample next;
    next = static_cast<int>(std::round(exp(log_next)));
    if (next > current)
      current = next;
    else
      ++current;  // Just do a narrow bucket, and keep trying.

The "Fields" tab doesn't show any 3rd-party modules loaded.  Could it be a floating-point math (hardware) error?
I think cpu info is in crash/ - if it's some float error, we can check for consistency of cpu models.

Otherwise, it does sound like it's sufficiently rare that we may not need to do much about it?
It is, but the crashes are spread fairly evenly between "x86" and "amd64", and different versions of each.  Any way to tell if the systems are overclocked?

It's too rare to be the cpu in general but frequent enough to show up in Canary and Dev.

I wonder if other parts of Chrome see floating-point errors.
Project Member

Comment 25 by bugdroid1@chromium.org, Oct 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e7cacdbff0046391329f939eb361a0bba6ffdc8

commit 2e7cacdbff0046391329f939eb361a0bba6ffdc8
Author: Brian White <bcwhite@chromium.org>
Date: Mon Oct 15 14:09:59 2018

Revert "Revert bucket-boundary checks for branch. (part 2)"

This reverts commit fc3a02269e5d241d2b17503f0abe8c12d3375a33.

Reason for revert: M71 branch has occurred. Re-enable this for continued debugging on M72.

Original change's description:
> Revert bucket-boundary checks for branch. (part 2)
> 
> Part 1 was here:
> https://chromium-review.googlesource.com/c/chromium/src/+/1270345
> 
> I expect to revert this revert once the branch is done as there is
> still more testing to be done here.
> 
> Bug: 836238
> Change-Id: I1f85b6fa21405ceb3e569335aa995faf2050fd0f
> Reviewed-on: https://chromium-review.googlesource.com/c/1271716
> Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#598319}

TBR=asvitkine@chromium.org,bcwhite@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 836238
Change-Id: If2f1911f538c11170fff5e8de4bc3115262aa6c6
Reviewed-on: https://chromium-review.googlesource.com/c/1280782
Reviewed-by: Brian White <bcwhite@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#599629}
[modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/bucket_ranges.cc
[modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/bucket_ranges.h
[modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/histogram.cc

Project Member

Comment 26 by bugdroid1@chromium.org, Oct 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f1fbf31984729c4122f2ac40069069060ac3548

commit 7f1fbf31984729c4122f2ac40069069060ac3548
Author: Brian White <bcwhite@chromium.org>
Date: Fri Oct 26 23:12:58 2018

Record all boundary values for histogram mismatch debug.

Talking with Bruce Dawson, it could be helpful to know all the
generated boundary values in order to discern if there is a
pattern.

Bug: 836238
Change-Id: I40ddda5e93d298508fb89fe828245fea39c98b6c
Reviewed-on: https://chromium-review.googlesource.com/c/1302315
Commit-Queue: Brian White <bcwhite@chromium.org>
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603242}
[modify] https://crrev.com/7f1fbf31984729c4122f2ac40069069060ac3548/base/metrics/histogram.cc

Project Member

Comment 27 by bugdroid1@chromium.org, Oct 31

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2dbe9d24b3b9ee05a2137847364bd47abc49be0a

commit 2dbe9d24b3b9ee05a2137847364bd47abc49be0a
Author: Brian White <bcwhite@chromium.org>
Date: Wed Oct 31 19:21:23 2018

Combine mismatch checks for more consistent crash reporting.

Bug: 836238
Change-Id: Iafd29fb293af9b284ae9ab98b2f6c12bb27fa9a0
Reviewed-on: https://chromium-review.googlesource.com/c/1308634
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#604362}
[modify] https://crrev.com/2dbe9d24b3b9ee05a2137847364bd47abc49be0a/base/metrics/histogram.cc

Surprisingly few crashes but here's an interesting one:
https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27+AND+product.version+%3E%3D+%2772.0.3593.0%27&stbtiq=&reportid=cb3af43d469ecece&index=5#0

This has the full dump of 52 buckets and we can see that it goes correct most of the way and then starts to diverge at bucket 43.  The initial difference is 1 and then continues to get worse.

  double log_max = log(static_cast<double>(maximum));
  while (bucket_count > ++bucket_index) {
    double log_current;
    log_current = log(static_cast<double>(current));
    log_ratio = (log_max - log_current) / (bucket_count - bucket_index);
    log_next = log_current + log_ratio;
    next = static_cast<int>(std::round(exp(log_next)));
    if (next > current)
      current = next;
    else
      ++current;  // Just do a narrow bucket, and keep trying.
    ranges->set_range(bucket_index, current);
  }

The algorithm should always converge to log_max even if "current" were to experience error yet it does not.

When I do the calculation with my 10-digit calculator, I get 191548 for bucket 43 when calculating from the result of bucket 142.  A "double" has 15 decimal digits of precision, but even with 10 digits a difference in the last digit makes the exp() operation only off by 0.002.  So why did the C++ code get 191544 and 191543, different by roughly 4 and 5?

Yet if I do the calculation manually from values in bucket 48, in both cases it'll go to whatever I choose for "maximum".  I feel like I'm missing something.

Bucket  Created     Recreated
  0        0		0
  1        1		1
  2        2		2
  3        3		3
  4        4		4
  5        5		5
  6        7		7
  7        9		9
  8        12		12
  9        16		16
  10       21		21
  11       28		28
  12       37		37
  13       49		49
  14       65		65
  15       86		86
  16       113		113
  17       149		149
  18       196		196
  19       258		258
  20       340		340
  21       448		448
  22       590		590
  23       777		777
  24       1023		1023
  25       1347		1347
  26       1774		1774
  27       2336		2336
  28       3077		3077
  29       4053		4053
  30       5338		5338
  31       7030		7030
  32       9259		9259
  33       12194	12194
  34       16060	16060
  35       21152	21152
  36       27859	27859
  37       36693	36693
  38       48328	48328
  39       63651	63651
  40       83833	83833
  41       110417	110417
  42       145431	145431
  43       191544	191543
  44       252284	252283
  45       332282	332280
  46       437644	437641
  47       576431	576426
  48       759207	759199
  49       1000000	999989

The only thing that I can think of is that somebody has changed the rounding mode. The default rounding mode is round-to-nearest-even. If that is changed to round up or round down then errors can accumulate quite quickly, especially if there are lots of intermediate calculations.

The function used to modify the rounding mode (on Windows) is _controlfp (or _controlfp_s I guess). It's a bit subtle but you can use it to read the rounding modes and set them.

This will set the rounding mode to the default while not changing anything else:

_controlfp(_RC_NEAR, _MCW_RC);   // Round to nearest

Maybe somebody is calling deSetRoundingMode to change the rounding mode, and then not resetting it?

Most likely the problem is caused by some injected code (printer driver, perhaps) changing the FPU settings. I've seen this before and it is evil.

Note that the FPU settings are per-thread.

I hit an instance of FPU settings being modified two years ago. At the time I fixed the compiler bug which was triggering crashes but maybe I should also have added monitoring of FPU settings, and frequent resetting of them:

https://randomascii.wordpress.com/2016/09/16/everything-old-is-new-again-and-a-compiler-bug/

That is, maybe we could save ourselves a lot of grief by either trying to figure out who is changing our FPU settings (I would *love* to know that) or by forcibly resetting them in our message loop.

If you are recreating the calculations on your computer then you can try recreating them with the four different rounding modest (nearest, +infinity, -infinity, zero) and see if you can reproduce the results that way.
Huh, I didn't realize that was a thing. Seems dangerous to rely on std::round if rounding mode can be changed globally by some other code.

Very curious if that's the cause - if so, maybe we should move away from std::round() and maybe worth discussing on chromium-dev if that should be something to do more broadly in the project.
This rounding...  Are you talking float-to-int rounding or how the last fp digit is rounded after an operation?

If the former...  The rounding is off by more than 1 by the time we get to the end.  The algorithm recalculates from a constant |log_max| (even though it's not declared "const") and the previous integer boundary.  Any error in a previous calculation should receive correction so that the final boundary is the requested maximum.

If the latter...   A "double" has equivalent of 15 decimal digits of accuracy but my manual tests on my 10-digit calculator showed that errors in the last digit caused only a 0.002 change in the result of the exp() operation which was insignificant in a (proper) round-to-int operation.

I'll finish up that CL to validate that |log_max| isn't changing somehow.
You said that FPU settings are per-thread, but I'm seeing different results from two adjacent calls (no code in between) on the same thread.

Project Member

Comment 34 by bugdroid1@chromium.org, Nov 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/98151dc32fd4888fb1ef880d1e7be8e35525f5d3

commit 98151dc32fd4888fb1ef880d1e7be8e35525f5d3
Author: Brian White <bcwhite@chromium.org>
Date: Mon Nov 05 21:41:16 2018

Validate math when creating histogram buckets.

This is a further attempt to determine why two subsequent calls to this
method with the same parameters sometimes produce different outputs.

Bug: 836238
Change-Id: I3960fe0bd27976ae4678a4f96bad61870c22342c
Reviewed-on: https://chromium-review.googlesource.com/c/1314648
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#605473}
[modify] https://crrev.com/98151dc32fd4888fb1ef880d1e7be8e35525f5d3/base/metrics/histogram.cc

Cc: brucedaw...@chromium.org
+brucedawson
(sorry, didn't realize you weren't on this bug when posting questions and information)

https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=ab81bffd5c2312d9&index=0#4

This crashed due to bucket differences and not because of checks added to the calculation of those buckets.  However, the boundaries in this case develop a growing difference and then reconverge to the same maximum value.  The reconvergence is indicative of the log_max value being the same, at least at the end.

The mismatched values after the first mismatch could be the error simply continuing, not having enough entries to converge before the final bucket.

...
40  1297558   1297558
41  1839423   1839423
42  2607574   2607574
43  3696508   3696507
44  5240185   5240183
45  7428507   7428505
46  10530681  10530679
47  14928336  14928334
48  21162469  21162467
49  30000000  30000000

Bag log() result today:

https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=1979b264e44017eb&index=4#0

        maximum = 0n180000
   log_max_copy = 12.100715944569612859
        log_max = 12.100715944569612859

The actual value of ln(180000) is 12.100712129872347.  Though this amounts to an error of only 3.8e-6 it is in the 13th-last decimal place and is enough that the rounded maximum value of e^log_max is now 180001.

This could be a memory glitch, hardware error, or library bug.  Looking at the binary mantissa representation of these numbers...

12.100712129872347 => 0x402833908A503977  (good)
12.100715944569613 => 0x402833910A503977  (bad)

... there is a 2-bit difference with a "08" becoming a "10".  Looking at the full 64-bit "double" representation, it is

0100000000101000 0011001110010001 0000101001010000 0011100101110111 (good)
0100000000101000 0011001110010000 1000101001010000 0011100101110111 (bad)

Not only are there two different bit flips, one is the lsb of the first 32-bit word and the other is the msb of the second 32-bit word.  This is a 64-bit build, though, so that may not be relevant.

I don't know to which of those three things (memory, hardware, library) this revelation points.

IEEE 754 Calculator:
http://weitz.de/ieee/

Scientific Calulator:
https://www.mathsisfun.com/scientific-calculator.html

Got another error today:

https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=70c5b3e67bc6dba0&index=0#0

This one generates different bucket lists with no crashes during the calculations of such meaning that log_max is fine both before & after.

I was thinking about my statement that any errors should reduce as it converges towards the maximum and yet we see the error growing larger.  I don't think that's actually the case.  Yes it will converge but that convergence is a growth that distributed over all the remaining buckets; the error in the "current" bucket, when multiplied by the growth, can actually cause an increased absolute error as it progresses.  Only at the last bucket does the correction reach 100% and converge to the intended maximum value.

So all bucket value differences after the first can be explained.

In #36, I documented an instance of log(maximum) not returning the correct value.  If log(current) incurred a similar error here, it would explain everything.  I'll add a check for that but right now it's looking like the log() method has a bug.
Project Member

Comment 38 by bugdroid1@chromium.org, Nov 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb

commit 2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb
Author: Brian White <bcwhite@chromium.org>
Date: Tue Nov 13 23:24:56 2018

Check log() operation on current value.

Bug: 836238
Change-Id: I10a60aea6e70c0a0f4a0ed40a7c9172ae383d170
Reviewed-on: https://chromium-review.googlesource.com/c/1334294
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#607802}
[modify] https://crrev.com/2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb/base/metrics/histogram.cc

If we see a consistent pattern of bit flips then a hardware bug is possible (although still unlikely), but until then I'm more inclined to suspect changes in the FPU rounding mode, since that is something that is known to change floating-point results. Has that been checked?
Crashes are very infrequent so to verify the results of changing the rounding mode would require an experiment and significant time to be statistically useful.

You said the rounding mode was per-thread, though.  We're seeing differences between two successive(*) calls to the same method on the same thread.

* Successive meaning that there is no code between one call and the next, no calls outside the directory (other than std library) or returns to outside code.  What could change the rounding mode in that interval?
Can't we do a test locally - i.e. use the API to change the rounding mode between two calls and see if the results match what we get from the wild?
I did that locally and didn't get any errors in Histograms with either 32 or 64 bit builds.  I didn't go and write specialized tests, though, because at there wasn't a clear indication of what operations might be affected.

There were errors triggered in other tests, though.  SafeConversion was one.

Would the rounding mode for float->int conversion affect log() and cause the kind of bitwise error I describe in #36?
Going over the crashes from the last week that checks the log() results...

11 - bad log() result
 6 - mismatched buckets

This would suggest that there are additional FP errors outside of the log() routine.  There is an exp() as well as addition, subtraction, and division.

Setting the rounding mode didn't change anything in my own tests.  I could try just always setting the rounding mode to "near" and see if the crashes go away but that doesn't seem consistent with Bruce's comment that the rounding mode is per-thread and we're seeing the discrepancy between adjacent calls on the same thread.

I'm going to remove this code shortly for the upcoming M72 branch and will resume investigation in M73.
Project Member

Comment 44 by bugdroid1@chromium.org, Nov 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c

commit 0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c
Author: Brian White <bcwhite@chromium.org>
Date: Mon Nov 26 23:35:03 2018

Remove mismatch test code for branch point.

The code checking for problemating floating point math should not be
part of the M72 beta/stable releases so removing it for now.  It will
likely be restored in some form after the branch occurs.

Bug: 836238
Change-Id: Id6f95382589de7e3d5a05cefa09087e2e598e4ad
Reviewed-on: https://chromium-review.googlesource.com/c/1351091
Reviewed-by: Alexei Svitkine <asvitkine@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#610977}
[modify] https://crrev.com/0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c/base/metrics/histogram.cc

Sign in to add a comment