SubprocessMetricsProviderTest.SnapshotMetrics is flaky on win10_chromium_x64_rel_ng |
|||||||
Issue descriptionSee flakiness results here: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=unit_tests&tests=SubprocessMetricsProviderTest.SnapshotMetrics It suddenly started failing a lot around 3:30p today but I don't see any obvious culprit CLs. Here's an example failing build: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/48944 [ RUN ] SubprocessMetricsProviderTest.SnapshotMetrics ../../chrome/browser/metrics/subprocess_metrics_provider_unittest.cc(132): error: Expected equality of these values: 2U Which is: 2 GetSnapshotHistogramCount() Which is: 3 Stack trace: Backtrace: testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop [0x00007FF7B4A7E8B7+87] testing::internal::AssertHelper::operator= [0x00007FF7B4A7E43E+78] SubprocessMetricsProviderTest_SnapshotMetrics_Test::TestBody [0x00007FF7B3ADCA85+421] ../../chrome/browser/metrics/subprocess_metrics_provider_unittest.cc(135): error: Expected equality of these values: 0U Which is: 0 GetSnapshotHistogramCount() Which is: 1 Stack trace: Backtrace: testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop [0x00007FF7B4A7E8B7+87] testing::internal::AssertHelper::operator= [0x00007FF7B4A7E43E+78] SubprocessMetricsProviderTest_SnapshotMetrics_Test::TestBody [0x00007FF7B3ADCB3C+604] [ FAILED ] SubprocessMetricsProviderTest.SnapshotMetrics (280 ms)
,
Jul 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5af8add5481c4feb36cd42d61161ce3dd3fd99b6 commit 5af8add5481c4feb36cd42d61161ce3dd3fd99b6 Author: James Darpinian <jdarpinian@chromium.org> Date: Fri Jul 13 16:36:47 2018 Disable flaky test SnapshotMetrics SubprocessMetricsProviderTest.SnapshotMetrics has been flaky for a while but it is suddenly much worse now and I don't see an obvious culprit. Disabling until someone can investigate. Tbr: asvitkine@chromium.org Bug: 863262 Change-Id: I33d62b866c00f7fd9386cb274773547a3815bb20 Reviewed-on: https://chromium-review.googlesource.com/1136051 Commit-Queue: Kim Paulhamus <kpaulhamus@chromium.org> Reviewed-by: Kim Paulhamus <kpaulhamus@chromium.org> Cr-Commit-Position: refs/heads/master@{#574944} [modify] https://crrev.com/5af8add5481c4feb36cd42d61161ce3dd3fd99b6/chrome/browser/metrics/subprocess_metrics_provider_unittest.cc
,
Jul 13
Do we know why it started flaking more? What CL range does it correlate with?
,
Jul 13
This appears to have failed 3/3 times on my CL https://chromium-review.googlesource.com/c/chromium/src/+/1135898 - suggests something has regressed relating to those metrics. The revisions listed for when the flake was first reported seem to not exist, which is bizarre.
,
Jul 13
+jbudorick: The flakiness dashboard seems very confused; if you click on any of the failure cases under Linux CFI, for example, then you can't open the failure log (404) and if you open the build log then it didn't fail. If you actually open the unit_tests shard and search for SnapshotMetrics, for example, then you'll find that it is logged as being run 20 times. O_o
,
Jul 13
Issue 863553 has been merged into this issue.
,
Jul 16
Brian is away, so I will look at it this week.
,
Jul 18
The test appears to pass locally for me on Mac. Let me try on the bots.
,
Jul 18
So the re-enable CL passed the cq: https://chromium-review.googlesource.com/c/chromium/src/+/1142438 I can try another run to make sure. Otherwise, does anyone know if there a tool to try to run it N times to try to weed out flakyness?
,
Jul 18
Thinking about this more - the failure was that in this code path: // Recording should find the two histograms created in persistent memory. EXPECT_EQ(2U, GetSnapshotHistogramCount()); // A second run should have nothing to produce. EXPECT_EQ(0U, GetSnapshotHistogramCount()); The first check reported 3 and the second check 1. Given the latter result, it means a histogram was logged in the meantime between those two calls. So either that happened on another thread - e.g. maybe by something spawned by a previous thread that's still running - or by GetSnapshotHistogramCount() itself.
,
Jul 18
From the swarming log, on that failed instance, 1206 other tests ran before this one. So if we're thinking about the "some other thread is still running and logging histograms" world, then we have 1206 potential tests that could be doing this. https://chromium-swarm.appspot.com/task?id=3ea9c6328da6f410&refresh=10&show_raw=1 That's a big number. I wonder if we could somehow narrow this down? Don't we have infrastructure somewhere that tries to run some tests pairwise and see if they break?
,
Jul 19
Re #10: Could you temporarily add an atomic flag that causes an attempt to log a histogram to actually crash if it's set, and set it in-between those two expectations, so that if we can repro it then we'll get a stack showing us where the culprit is? Re #11: Yes, we have some tooling for that, though I'm not familiar with it myself. Perhaps that is a thing we should have set-up to run continuously as an FYI, if we don't already? (Adding a milestone since this is tagged P1)
,
Jul 23
I'll make the test log any unexpected histogram names and we can go from there.
,
Jul 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2d44877987c4d48ab0f00234e262ce7df67f9b88 commit 2d44877987c4d48ab0f00234e262ce7df67f9b88 Author: Alexei Svitkine <asvitkine@chromium.org> Date: Tue Jul 24 14:24:50 2018 Re-enable SubprocessMetricsProviderTest test and switch to gmock. With gmock, if any of the expectations fail, we will see a diff of the contents - in particular, we'll see the extra histogram name if there is one. Bug: 863262 Change-Id: Iff375adf1ebbc0a14d64005f48dd574efb45b8dc Reviewed-on: https://chromium-review.googlesource.com/1146907 Reviewed-by: Brian White <bcwhite@chromium.org> Commit-Queue: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#577526} [modify] https://crrev.com/2d44877987c4d48ab0f00234e262ce7df67f9b88/chrome/browser/metrics/subprocess_metrics_provider_unittest.cc
,
Jul 24
Marking as Fixed, as my CL re-enabled the test. If the problem re-appears, the updated test code should have enough output to point at the culprit histogram that's unexpected. |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by jdarpinian@chromium.org
, Jul 13Labels: -Pri-3 OS-Linux OS-Windows Pri-1
Owner: bcwh...@chromium.org