Fix histograms with mismatched construction arguments. |
|||
Issue descriptionSometimes histograms get instantiated with different parameters. These used to crash but are now reported as a histogram: https://uma.googleplex.com/p/chrome/histograms/?endDate=latest&dayCount=28&histograms=Histogram.MismatchedConstructionArguments&fixupData=true&showMax=true&filters=channel%2Ceq%2C1%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial Instantiations different from the first cause a "dummy" histogram to be returned so that the caller will still execute but the reported data will be quietly dropped.
,
Apr 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8e2f297bc419fb270427eec9b8a178446528a987 commit 8e2f297bc419fb270427eec9b8a178446528a987 Author: Brian White <bcwhite@chromium.org> Date: Tue Apr 24 18:26:47 2018 Add name of histogram with mismatched params. Bug: 836238 Change-Id: Ibec5500a6d9348a30009d3d6d97a51c38a8e64ef Reviewed-on: https://chromium-review.googlesource.com/1026051 Reviewed-by: Brian White <bcwhite@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#553207} [modify] https://crrev.com/8e2f297bc419fb270427eec9b8a178446528a987/tools/metrics/histograms/enums.xml
,
Jun 29 2018
I've been looking through this a bit today and something doesn't seem right. For example, Compositing.Browser.LayersUpdateTime (-1796008467) has some 5000 reports just for M68 (canary/dev) but the code is constant:
std::string histogram_name =
base::StringPrintf("Compositing.%s.LayersUpdateTime", client_name);
base::UmaHistogramCounts10M(histogram_name, elapsed);
I'm wondering if these could be due to memory corruption that is causing the "existing" values to become corrupted and thus mismatch with the constants when it is again accessed.
,
Sep 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6579b53fcb32d4a3b26affc0004c1a61b3796c26 commit 6579b53fcb32d4a3b26affc0004c1a61b3796c26 Author: Brian White <bcwhite@chromium.org> Date: Wed Sep 12 20:19:19 2018 Temporarily crash for histogram mismatch. There are reports of mismatches for metrics that don't seem to have any way of having a mismatch. This will at least indicate what the mismatch was. Bug: 836238 Change-Id: I17eb9249b326d6f62cf0b92382d4c0e34741d331 Reviewed-on: https://chromium-review.googlesource.com/1213544 Commit-Queue: Brian White <bcwhite@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#590803} [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/base/metrics/histogram.cc [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/base/metrics/histogram_unittest.cc [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/base/metrics/single_sample_metrics_unittest.cc [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/base/test/metrics/histogram_tester_unittest.cc [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/chrome/test/data/extensions/api_test/metrics/test.js [modify] https://crrev.com/6579b53fcb32d4a3b26affc0004c1a61b3796c26/components/metrics/single_sample_metrics_factory_impl_unittest.cc
,
Sep 17
Issue 884655 has been merged into this issue.
,
Sep 18
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/673081e3d2e172e3875ffae881deab0d3f93e497 commit 673081e3d2e172e3875ffae881deab0d3f93e497 Author: Brian White <bcwhite@chromium.org> Date: Tue Sep 18 19:02:34 2018 Field crash if histogram bucket creation isn't as expected. There are reports from the field that the "maximum" of a histogram does not match what was configured. The working theory is that it is due to rounding errors in the creation of exponential buckets. Verify this assumption. If it's behaving wrong, it should crash during the creation. Bug: 836238 Change-Id: I4cf5e4bc3986e288c728ae37e2ff8cf841b7598d Reviewed-on: https://chromium-review.googlesource.com/1228840 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#592128} [modify] https://crrev.com/673081e3d2e172e3875ffae881deab0d3f93e497/base/metrics/histogram.cc
,
Sep 19
Reviewing the crashes from https://chromium-review.googlesource.com/1026051 There's no real consistency in how they differ. 1000000 => 999989 1000000 => 999424 500000 => 499994 10000 => 9961 Most odd is that 1M gets converted to two different numbers. Rounding errors there may be but they should at least be consistent. Number of buckets was 50 in all the above cases. The same histogram (by name) always has the same difference, though. Of note, the 5000000 => 499994 is the Chrome Installer: https://cs.chromium.org/chromium/src/chrome/installer/setup/setup_main.cc?rcl=8b1391794f25f6e7363771a43c1988b3c5b8e5a5&l=1455 Of extra note, there is an identically defined histogram on the line above that does *not* mismatch! Bucket ranges are reused so whatever was created on the previous line should be reused for the one that is failing. It could be some kind of memory corruption that is changing the last bucket (or multiple buckets -- only the last one is being reported) between one call and the next. It seems important that it is happening with some frequency between two identical calls that have nothing in between.
,
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa commit 8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa Author: Brian White <bcwhite@chromium.org> Date: Wed Sep 19 13:52:00 2018 Revert "Field crash if histogram bucket creation isn't as expected." This reverts commit 673081e3d2e172e3875ffae881deab0d3f93e497. Reason for revert: Test complete. One day in the field is sufficient. Original change's description: > Field crash if histogram bucket creation isn't as expected. > > There are reports from the field that the "maximum" of a histogram > does not match what was configured. The working theory is that it is > due to rounding errors in the creation of exponential buckets. > Verify this assumption. If it's behaving wrong, it should crash > during the creation. > > Bug: 836238 > Change-Id: I4cf5e4bc3986e288c728ae37e2ff8cf841b7598d > Reviewed-on: https://chromium-review.googlesource.com/1228840 > Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> > Commit-Queue: Brian White <bcwhite@chromium.org> > Cr-Commit-Position: refs/heads/master@{#592128} TBR=asvitkine@chromium.org,bcwhite@chromium.org Change-Id: I0c7ecfc65d31aa2d161f1b689a178432a9364e27 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 836238 Reviewed-on: https://chromium-review.googlesource.com/1233893 Reviewed-by: Brian White <bcwhite@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#592385} [modify] https://crrev.com/8f6326484a8c1ad74f61585eb0bfc7b1126ce2fa/base/metrics/histogram.cc
,
Sep 19
Thanks for the update. Those lower numbers are coming from bucket ranges, right? So in theory something could be corrupting those post initial computation? Can we add a checksum (temporary to debug) to the ranges? Or something simpler like if we know histogram X often gets affected, add some special case code that checks for that histogram name and checks what the max ranges value is when we're both storing it and accessing it later to figure out if it gets messed up initially, or later by some other event.
,
Sep 19
Correct. The lower numbers are in the bucket ranges object. I'm running a local build right now to see if I can find a difference between |created_ranges| and |registered_ranges| that might indicate a change in those values. So far, no hits... but that's not really surprising since I didn't hit any of the other mismatches on a local build. However, BucketRanges::Equals does a bucket-by-bucket comparison for equality so RegisterOrDeleteDuplicateRanges should not return a "matching" ranges object that has been corrupted. So why one call would succeed and an immediate identical call fail, I'm not sure. The byte differences in the range-max vs create-max aren't revealing. Sometimes it's a one-byte difference, sometimes two bytes. They're always a reduction and not usually a zeroing. It doesn't look like any kind of _random_ corruption.
,
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/37a4f70b13dda239f802da32c91afff892a7c09d commit 37a4f70b13dda239f802da32c91afff892a7c09d Author: Brian White <bcwhite@chromium.org> Date: Wed Sep 19 18:23:02 2018 Field-check ranges creation. Check if the original ranges creation is causing the bad bucket values or if its some kind of corruption since the last time. Bug: 836238 Change-Id: I5b2b72504b6ac6437a9cec3ee42ae7e688be451b Reviewed-on: https://chromium-review.googlesource.com/1234215 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#592484} [modify] https://crrev.com/37a4f70b13dda239f802da32c91afff892a7c09d/base/metrics/histogram.cc
,
Sep 24
Only one relevant crash over the weekend: https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3557.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=03b5ebbbad42d57d&index=2#0 Looking at this in WinDbg, it's appears to be a legitimate rounding error from a maximum of 100,000,000 to 99,999,999. That's a very small amount compared to some of the other mismatches that somehow have not occurred since I started checking the bucket creation directly. Perhaps it's just the smaller user base since the latest check has not yet gone out in a Dev build.
,
Sep 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f889b9f0b407daef2475bb7e0a60ad9ca16693f4 commit f889b9f0b407daef2475bb7e0a60ad9ca16693f4 Author: Brian White <bcwhite@chromium.org> Date: Mon Sep 24 22:42:30 2018 Don't check-crash for linear histograms. Most of the observed crashes are enumeration (aka "linear") ones for "cast" that are apparently differences between extensions. Behave in the expected fashion (return dummy histogram) for those so as to not impact users unnecessarily. Bug: 836238 Change-Id: I5f68146be7a5988c4b529d54861bb215869fac2b Reviewed-on: https://chromium-review.googlesource.com/1241495 Commit-Queue: Brian White <bcwhite@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#593722} [modify] https://crrev.com/f889b9f0b407daef2475bb7e0a60ad9ca16693f4/base/metrics/histogram.cc
,
Sep 27
Two more crashes for 100,000,000 => 99,999,999 so that seems a a consistent rounding error. Two crashes for bucket mismatch between what was created for the histogram and what was previously created for some other histogram leaving two possibilities that I can see: 1) Successive calls to CreateRanges() can produce different bucket values even though the values used in the algorithm are the same. This seems very unlikely. 2) RegisterOrDeleteDuplicateranges() is returning a mismatched ranges set. Since part of the comparison is a bucket-by-bucket check that all values are the same, this also seems unlikely.
,
Sep 27
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e716f6e83718423d7732fc1a36af5300ee8e61b8 commit e716f6e83718423d7732fc1a36af5300ee8e61b8 Author: Brian White <bcwhite@chromium.org> Date: Thu Sep 27 21:01:46 2018 Record more details about bucket value mismatches. Bug: 836238 Change-Id: I4714f80e7a0796327bce606f89ce1f889752f39f Reviewed-on: https://chromium-review.googlesource.com/1249396 Reviewed-by: Jesse Doherty <jwd@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#594854} [modify] https://crrev.com/e716f6e83718423d7732fc1a36af5300ee8e61b8/base/metrics/histogram.cc
,
Oct 1
Two crashes since new debug info was added and BOTH are where a second creation of the ranges differs from the first. There is nothing but check-code between the two calls. https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3564.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=192f4c2c4c0fe58d&index=1#0 Bucket boundary #14 of the original |created_ranges| changed from 107 (0x6B) to 96 (0x60). This isn't a calculation error as the previous boundary is 95 (0x5F) in both cases. This is a 50-bucket exponential histogram from 1 to 1000. https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3564.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=9edf01ae0922bcc8&index=0#0 Bucket boundary #8 of the original |created_ranges| changed from 8 to 4. This is a linear histogram where each bucket boundary matches its index. There are checksum calculation on the boundaries. Perhaps we can use that to further narrow down where the change is happening.
,
Oct 1
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b8ca53abff774931800f64fa522b7b0e2a8045d4 commit b8ca53abff774931800f64fa522b7b0e2a8045d4 Author: Brian White <bcwhite@chromium.org> Date: Mon Oct 01 22:27:55 2018 Check ranges checksums for corruptions. Try to narrow down where the ranges are getting corrupted by verifying the internal checksum in places. Bug: 836238 Change-Id: I05493e86d74c820c6dfe2dbe30759c04cdfb3f28 Reviewed-on: https://chromium-review.googlesource.com/1255322 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#595595} [modify] https://crrev.com/b8ca53abff774931800f64fa522b7b0e2a8045d4/base/metrics/histogram.cc
,
Oct 5
“Curiouser and curiouser!” Cried Alice. Two crashes as of this morning, both impossible in the same way. https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3569.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=e8f1fb4c45291832&index=0 found = false i = 0x58e created_range = 0x58d recreated_range = 0x58e h_name = char [100] "PartitionAlloc.CommittedSize" h_type = LINEAR_HISTOGRAM (0n1) b_count = 0x1001 c_count = 0x1001 r_count = 0x1001 c_valid = true r_valid = true For a enum/linear histogram (which this is, despite its odd name), the bucket boundary value should match its index. In this case, the "created range" is off by one. Amazingly enough, though, the checksum matches, both in the CHECK that was done immediately after it was created and again during the debug output, say the internal checksums are correct. For this to be true, the boundary value would have had to be wrong when the checksum is initially calculated which is immediately after the boundary values are generated... though not _while_ they are generated. I guess the next question is... Are they being incorrectly generated or incorrectly stored. Here's other crash with a similarly impossible result: https://crash.corp.google.com/browse?q=product.Version%3E%3D%2771.0.3569.0%27+AND+product_name+LIKE+%27%25Chrome%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27&stbtiq=&reportid=69c0ac9ba33e55ff&index=1 i = 0x869 created_range = 0x869 recreated_range = 0x68 h_name = char [100] "BlinkGC.CommittedSize" h_type = LINEAR_HISTOGRAM (0n1) b_count = 0x1001 c_count = 0x1001 r_count = 0x1001 c_valid = true r_valid = true In this case, it's the "recreated" ranges that has the incorrect value though the checksum validation still thinks everything is okay.
,
Oct 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/61b84b2866fe7e1917de0818e39ea7ae0a2096ed commit 61b84b2866fe7e1917de0818e39ea7ae0a2096ed Author: Brian White <bcwhite@chromium.org> Date: Fri Oct 05 18:03:18 2018 Validate checksum calculations. Crash reports indicate that checksums are valid even though the bucket boundary values are tested to be incorrect. Try to determine if it is memory corruption or bad calculations. Bug: 836238 Change-Id: Ie89df5fef5093e9dd8c45fe9ff26a0430838769e Reviewed-on: https://chromium-review.googlesource.com/c/1264036 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#597194} [modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/bucket_ranges.cc [modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/bucket_ranges.h [modify] https://crrev.com/61b84b2866fe7e1917de0818e39ea7ae0a2096ed/base/metrics/histogram.cc
,
Oct 9
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e30f3739ff345100b38c3cb4fc7a57d8b15bf423 commit e30f3739ff345100b38c3cb4fc7a57d8b15bf423 Author: Brian White <bcwhite@chromium.org> Date: Tue Oct 09 17:17:08 2018 Revert histogram mismatch test/crash for branch (part 1). Bug: 836238 Change-Id: I3ee199bc87bc6094f6f68d4f44e1f2e5c52cc5e6 Reviewed-on: https://chromium-review.googlesource.com/c/1270345 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#597963} [modify] https://crrev.com/e30f3739ff345100b38c3cb4fc7a57d8b15bf423/base/metrics/histogram.cc [modify] https://crrev.com/e30f3739ff345100b38c3cb4fc7a57d8b15bf423/base/metrics/histogram_unittest.cc [modify] https://crrev.com/e30f3739ff345100b38c3cb4fc7a57d8b15bf423/base/metrics/single_sample_metrics_unittest.cc [modify] https://crrev.com/e30f3739ff345100b38c3cb4fc7a57d8b15bf423/chrome/test/data/extensions/api_test/metrics/test.js [modify] https://crrev.com/e30f3739ff345100b38c3cb4fc7a57d8b15bf423/components/metrics/single_sample_metrics_factory_impl_unittest.cc
,
Oct 10
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fc3a02269e5d241d2b17503f0abe8c12d3375a33 commit fc3a02269e5d241d2b17503f0abe8c12d3375a33 Author: Brian White <bcwhite@chromium.org> Date: Wed Oct 10 14:49:42 2018 Revert bucket-boundary checks for branch. (part 2) Part 1 was here: https://chromium-review.googlesource.com/c/chromium/src/+/1270345 I expect to revert this revert once the branch is done as there is still more testing to be done here. Bug: 836238 Change-Id: I1f85b6fa21405ceb3e569335aa995faf2050fd0f Reviewed-on: https://chromium-review.googlesource.com/c/1271716 Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#598319} [modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/bucket_ranges.cc [modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/bucket_ranges.h [modify] https://crrev.com/fc3a02269e5d241d2b17503f0abe8c12d3375a33/base/metrics/histogram.cc
,
Oct 10
All the checks have been reverted for the upcoming branch point but a Dev build was released with them making a bigger user base from which to receive crashes. After 3 days of Canary with no crashes, we have one: https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27+AND+product.version+%3E%3D+%2771.0.3572.0%27&stbtiq=&reportid=4ae84211b482941a&index=0#0 found = false using_created_ranges = true equal_ranges = true histogram_ref = 0x646e6552 i = 0x1c created_range = 0x364a recreated_range = 0x4cf4 h_name = char [100] "RendererScheduler.ExpectedQueueingTimeByTaskQueue2.FrameLoading" h_type = HISTOGRAM (0n0) b_count = 0x32 c_count = 0x32 r_count = 0x32 c_valid = true r_valid = true Again, all the checksums are correct. There was also no crash inside InitializeBucketRanges() where the checksum is done twice, once over the memeory contents and once over the individual values as they're calculated. Thus, both think they're doing the same correct thing. The difference, however, is not minor. One calculation placed the boundary at 13902 and the other at 19700. These are wildly different and so not a simple rounding error. The previous boundary must have matched, though we don't know what that value was. Because the checksums match, that leaves the calculation of the bucket boundary itself, specifically double log_current; log_current = log(static_cast<double>(current)); // Calculate the count'th root of the range. log_ratio = (log_max - log_current) / (bucket_count - bucket_index); // See where the next bucket would start. log_next = log_current + log_ratio; Sample next; next = static_cast<int>(std::round(exp(log_next))); if (next > current) current = next; else ++current; // Just do a narrow bucket, and keep trying. The "Fields" tab doesn't show any 3rd-party modules loaded. Could it be a floating-point math (hardware) error?
,
Oct 10
I think cpu info is in crash/ - if it's some float error, we can check for consistency of cpu models. Otherwise, it does sound like it's sufficiently rare that we may not need to do much about it?
,
Oct 10
It is, but the crashes are spread fairly evenly between "x86" and "amd64", and different versions of each. Any way to tell if the systems are overclocked? It's too rare to be the cpu in general but frequent enough to show up in Canary and Dev. I wonder if other parts of Chrome see floating-point errors.
,
Oct 15
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2e7cacdbff0046391329f939eb361a0bba6ffdc8 commit 2e7cacdbff0046391329f939eb361a0bba6ffdc8 Author: Brian White <bcwhite@chromium.org> Date: Mon Oct 15 14:09:59 2018 Revert "Revert bucket-boundary checks for branch. (part 2)" This reverts commit fc3a02269e5d241d2b17503f0abe8c12d3375a33. Reason for revert: M71 branch has occurred. Re-enable this for continued debugging on M72. Original change's description: > Revert bucket-boundary checks for branch. (part 2) > > Part 1 was here: > https://chromium-review.googlesource.com/c/chromium/src/+/1270345 > > I expect to revert this revert once the branch is done as there is > still more testing to be done here. > > Bug: 836238 > Change-Id: I1f85b6fa21405ceb3e569335aa995faf2050fd0f > Reviewed-on: https://chromium-review.googlesource.com/c/1271716 > Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> > Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> > Cr-Commit-Position: refs/heads/master@{#598319} TBR=asvitkine@chromium.org,bcwhite@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 836238 Change-Id: If2f1911f538c11170fff5e8de4bc3115262aa6c6 Reviewed-on: https://chromium-review.googlesource.com/c/1280782 Reviewed-by: Brian White <bcwhite@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#599629} [modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/bucket_ranges.cc [modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/bucket_ranges.h [modify] https://crrev.com/2e7cacdbff0046391329f939eb361a0bba6ffdc8/base/metrics/histogram.cc
,
Oct 26
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7f1fbf31984729c4122f2ac40069069060ac3548 commit 7f1fbf31984729c4122f2ac40069069060ac3548 Author: Brian White <bcwhite@chromium.org> Date: Fri Oct 26 23:12:58 2018 Record all boundary values for histogram mismatch debug. Talking with Bruce Dawson, it could be helpful to know all the generated boundary values in order to discern if there is a pattern. Bug: 836238 Change-Id: I40ddda5e93d298508fb89fe828245fea39c98b6c Reviewed-on: https://chromium-review.googlesource.com/c/1302315 Commit-Queue: Brian White <bcwhite@chromium.org> Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Bruce Dawson <brucedawson@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#603242} [modify] https://crrev.com/7f1fbf31984729c4122f2ac40069069060ac3548/base/metrics/histogram.cc
,
Oct 31
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2dbe9d24b3b9ee05a2137847364bd47abc49be0a commit 2dbe9d24b3b9ee05a2137847364bd47abc49be0a Author: Brian White <bcwhite@chromium.org> Date: Wed Oct 31 19:21:23 2018 Combine mismatch checks for more consistent crash reporting. Bug: 836238 Change-Id: Iafd29fb293af9b284ae9ab98b2f6c12bb27fa9a0 Reviewed-on: https://chromium-review.googlesource.com/c/1308634 Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#604362} [modify] https://crrev.com/2dbe9d24b3b9ee05a2137847364bd47abc49be0a/base/metrics/histogram.cc
,
Nov 2
Surprisingly few crashes but here's an interesting one: https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AHistogram%3A%3AFactory%3A%3ABuild%27+AND+product.version+%3E%3D+%2772.0.3593.0%27&stbtiq=&reportid=cb3af43d469ecece&index=5#0 This has the full dump of 52 buckets and we can see that it goes correct most of the way and then starts to diverge at bucket 43. The initial difference is 1 and then continues to get worse. double log_max = log(static_cast<double>(maximum)); while (bucket_count > ++bucket_index) { double log_current; log_current = log(static_cast<double>(current)); log_ratio = (log_max - log_current) / (bucket_count - bucket_index); log_next = log_current + log_ratio; next = static_cast<int>(std::round(exp(log_next))); if (next > current) current = next; else ++current; // Just do a narrow bucket, and keep trying. ranges->set_range(bucket_index, current); } The algorithm should always converge to log_max even if "current" were to experience error yet it does not. When I do the calculation with my 10-digit calculator, I get 191548 for bucket 43 when calculating from the result of bucket 142. A "double" has 15 decimal digits of precision, but even with 10 digits a difference in the last digit makes the exp() operation only off by 0.002. So why did the C++ code get 191544 and 191543, different by roughly 4 and 5? Yet if I do the calculation manually from values in bucket 48, in both cases it'll go to whatever I choose for "maximum". I feel like I'm missing something. Bucket Created Recreated 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 7 7 7 9 9 8 12 12 9 16 16 10 21 21 11 28 28 12 37 37 13 49 49 14 65 65 15 86 86 16 113 113 17 149 149 18 196 196 19 258 258 20 340 340 21 448 448 22 590 590 23 777 777 24 1023 1023 25 1347 1347 26 1774 1774 27 2336 2336 28 3077 3077 29 4053 4053 30 5338 5338 31 7030 7030 32 9259 9259 33 12194 12194 34 16060 16060 35 21152 21152 36 27859 27859 37 36693 36693 38 48328 48328 39 63651 63651 40 83833 83833 41 110417 110417 42 145431 145431 43 191544 191543 44 252284 252283 45 332282 332280 46 437644 437641 47 576431 576426 48 759207 759199 49 1000000 999989
,
Nov 2
The only thing that I can think of is that somebody has changed the rounding mode. The default rounding mode is round-to-nearest-even. If that is changed to round up or round down then errors can accumulate quite quickly, especially if there are lots of intermediate calculations. The function used to modify the rounding mode (on Windows) is _controlfp (or _controlfp_s I guess). It's a bit subtle but you can use it to read the rounding modes and set them. This will set the rounding mode to the default while not changing anything else: _controlfp(_RC_NEAR, _MCW_RC); // Round to nearest Maybe somebody is calling deSetRoundingMode to change the rounding mode, and then not resetting it? Most likely the problem is caused by some injected code (printer driver, perhaps) changing the FPU settings. I've seen this before and it is evil. Note that the FPU settings are per-thread. I hit an instance of FPU settings being modified two years ago. At the time I fixed the compiler bug which was triggering crashes but maybe I should also have added monitoring of FPU settings, and frequent resetting of them: https://randomascii.wordpress.com/2016/09/16/everything-old-is-new-again-and-a-compiler-bug/ That is, maybe we could save ourselves a lot of grief by either trying to figure out who is changing our FPU settings (I would *love* to know that) or by forcibly resetting them in our message loop.
,
Nov 2
If you are recreating the calculations on your computer then you can try recreating them with the four different rounding modest (nearest, +infinity, -infinity, zero) and see if you can reproduce the results that way.
,
Nov 2
Huh, I didn't realize that was a thing. Seems dangerous to rely on std::round if rounding mode can be changed globally by some other code. Very curious if that's the cause - if so, maybe we should move away from std::round() and maybe worth discussing on chromium-dev if that should be something to do more broadly in the project.
,
Nov 5
This rounding... Are you talking float-to-int rounding or how the last fp digit is rounded after an operation? If the former... The rounding is off by more than 1 by the time we get to the end. The algorithm recalculates from a constant |log_max| (even though it's not declared "const") and the previous integer boundary. Any error in a previous calculation should receive correction so that the final boundary is the requested maximum. If the latter... A "double" has equivalent of 15 decimal digits of accuracy but my manual tests on my 10-digit calculator showed that errors in the last digit caused only a 0.002 change in the result of the exp() operation which was insignificant in a (proper) round-to-int operation. I'll finish up that CL to validate that |log_max| isn't changing somehow.
,
Nov 5
You said that FPU settings are per-thread, but I'm seeing different results from two adjacent calls (no code in between) on the same thread.
,
Nov 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/98151dc32fd4888fb1ef880d1e7be8e35525f5d3 commit 98151dc32fd4888fb1ef880d1e7be8e35525f5d3 Author: Brian White <bcwhite@chromium.org> Date: Mon Nov 05 21:41:16 2018 Validate math when creating histogram buckets. This is a further attempt to determine why two subsequent calls to this method with the same parameters sometimes produce different outputs. Bug: 836238 Change-Id: I3960fe0bd27976ae4678a4f96bad61870c22342c Reviewed-on: https://chromium-review.googlesource.com/c/1314648 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Bruce Dawson <brucedawson@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#605473} [modify] https://crrev.com/98151dc32fd4888fb1ef880d1e7be8e35525f5d3/base/metrics/histogram.cc
,
Nov 8
+brucedawson (sorry, didn't realize you weren't on this bug when posting questions and information) https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=ab81bffd5c2312d9&index=0#4 This crashed due to bucket differences and not because of checks added to the calculation of those buckets. However, the boundaries in this case develop a growing difference and then reconverge to the same maximum value. The reconvergence is indicative of the log_max value being the same, at least at the end. The mismatched values after the first mismatch could be the error simply continuing, not having enough entries to converge before the final bucket. ... 40 1297558 1297558 41 1839423 1839423 42 2607574 2607574 43 3696508 3696507 44 5240185 5240183 45 7428507 7428505 46 10530681 10530679 47 14928336 14928334 48 21162469 21162467 49 30000000 30000000
,
Nov 12
Bag log() result today: https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=1979b264e44017eb&index=4#0 maximum = 0n180000 log_max_copy = 12.100715944569612859 log_max = 12.100715944569612859 The actual value of ln(180000) is 12.100712129872347. Though this amounts to an error of only 3.8e-6 it is in the 13th-last decimal place and is enough that the rounded maximum value of e^log_max is now 180001. This could be a memory glitch, hardware error, or library bug. Looking at the binary mantissa representation of these numbers... 12.100712129872347 => 0x402833908A503977 (good) 12.100715944569613 => 0x402833910A503977 (bad) ... there is a 2-bit difference with a "08" becoming a "10". Looking at the full 64-bit "double" representation, it is 0100000000101000 0011001110010001 0000101001010000 0011100101110111 (good) 0100000000101000 0011001110010000 1000101001010000 0011100101110111 (bad) Not only are there two different bit flips, one is the lsb of the first 32-bit word and the other is the msb of the second 32-bit word. This is a 64-bit build, though, so that may not be relevant. I don't know to which of those three things (memory, hardware, library) this revelation points. IEEE 754 Calculator: http://weitz.de/ieee/ Scientific Calulator: https://www.mathsisfun.com/scientific-calculator.html
,
Nov 13
Got another error today: https://crash.corp.google.com/browse?q=product.version+like+%277_._.%25%27+AND+product.version+%3E%3D+%2772.0.3603.0%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.file_path%3D%27src%2Fbase%2Fmetrics%2Fhistogram.cc%27+AND+product_name%3D%27Chrome%27&stbtiq=&reportid=70c5b3e67bc6dba0&index=0#0 This one generates different bucket lists with no crashes during the calculations of such meaning that log_max is fine both before & after. I was thinking about my statement that any errors should reduce as it converges towards the maximum and yet we see the error growing larger. I don't think that's actually the case. Yes it will converge but that convergence is a growth that distributed over all the remaining buckets; the error in the "current" bucket, when multiplied by the growth, can actually cause an increased absolute error as it progresses. Only at the last bucket does the correction reach 100% and converge to the intended maximum value. So all bucket value differences after the first can be explained. In #36, I documented an instance of log(maximum) not returning the correct value. If log(current) incurred a similar error here, it would explain everything. I'll add a check for that but right now it's looking like the log() method has a bug.
,
Nov 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb commit 2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb Author: Brian White <bcwhite@chromium.org> Date: Tue Nov 13 23:24:56 2018 Check log() operation on current value. Bug: 836238 Change-Id: I10a60aea6e70c0a0f4a0ed40a7c9172ae383d170 Reviewed-on: https://chromium-review.googlesource.com/c/1334294 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#607802} [modify] https://crrev.com/2aaa8b01affcd08c5eb0ef0d1edc60d6f15ee7eb/base/metrics/histogram.cc
,
Nov 15
If we see a consistent pattern of bit flips then a hardware bug is possible (although still unlikely), but until then I'm more inclined to suspect changes in the FPU rounding mode, since that is something that is known to change floating-point results. Has that been checked?
,
Nov 15
Crashes are very infrequent so to verify the results of changing the rounding mode would require an experiment and significant time to be statistically useful. You said the rounding mode was per-thread, though. We're seeing differences between two successive(*) calls to the same method on the same thread. * Successive meaning that there is no code between one call and the next, no calls outside the directory (other than std library) or returns to outside code. What could change the rounding mode in that interval?
,
Nov 15
Can't we do a test locally - i.e. use the API to change the rounding mode between two calls and see if the results match what we get from the wild?
,
Nov 15
I did that locally and didn't get any errors in Histograms with either 32 or 64 bit builds. I didn't go and write specialized tests, though, because at there wasn't a clear indication of what operations might be affected. There were errors triggered in other tests, though. SafeConversion was one. Would the rounding mode for float->int conversion affect log() and cause the kind of bitwise error I describe in #36?
,
Nov 22
Going over the crashes from the last week that checks the log() results... 11 - bad log() result 6 - mismatched buckets This would suggest that there are additional FP errors outside of the log() routine. There is an exp() as well as addition, subtraction, and division. Setting the rounding mode didn't change anything in my own tests. I could try just always setting the rounding mode to "near" and see if the crashes go away but that doesn't seem consistent with Bruce's comment that the rounding mode is per-thread and we're seeing the discrepancy between adjacent calls on the same thread. I'm going to remove this code shortly for the upcoming M72 branch and will resume investigation in M73.
,
Nov 26
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c commit 0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c Author: Brian White <bcwhite@chromium.org> Date: Mon Nov 26 23:35:03 2018 Remove mismatch test code for branch point. The code checking for problemating floating point math should not be part of the M72 beta/stable releases so removing it for now. It will likely be restored in some form after the branch occurs. Bug: 836238 Change-Id: Id6f95382589de7e3d5a05cefa09087e2e598e4ad Reviewed-on: https://chromium-review.googlesource.com/c/1351091 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Brian White <bcwhite@chromium.org> Cr-Commit-Position: refs/heads/master@{#610977} [modify] https://crrev.com/0e1f2bcab45deb26ca5a1116c9f5ce9b3a668e7c/base/metrics/histogram.cc |
|||
►
Sign in to add a comment |
|||
Comment 1 by bcwh...@chromium.org
, Apr 24 2018