New issue
Advanced search Search tips

Issue 863262 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 24
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows
Pri: 1
Type: Bug



Sign in to add a comment

SubprocessMetricsProviderTest.SnapshotMetrics is flaky on win10_chromium_x64_rel_ng

Project Member Reported by jdarpinian@chromium.org, Jul 13

Issue description

See flakiness results here: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=unit_tests&tests=SubprocessMetricsProviderTest.SnapshotMetrics

It suddenly started failing a lot around 3:30p today but I don't see any obvious culprit CLs.

Here's an example failing build:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win10_chromium_x64_rel_ng/48944

[ RUN      ] SubprocessMetricsProviderTest.SnapshotMetrics
../../chrome/browser/metrics/subprocess_metrics_provider_unittest.cc(132): error: Expected equality of these values:
  2U
    Which is: 2
  GetSnapshotHistogramCount()
    Which is: 3
Stack trace:
Backtrace:
	testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop [0x00007FF7B4A7E8B7+87]
	testing::internal::AssertHelper::operator= [0x00007FF7B4A7E43E+78]
	SubprocessMetricsProviderTest_SnapshotMetrics_Test::TestBody [0x00007FF7B3ADCA85+421]
../../chrome/browser/metrics/subprocess_metrics_provider_unittest.cc(135): error: Expected equality of these values:
  0U
    Which is: 0
  GetSnapshotHistogramCount()
    Which is: 1

Stack trace:
Backtrace:
	testing::internal::UnitTestImpl::CurrentOsStackTraceExceptTop [0x00007FF7B4A7E8B7+87]
	testing::internal::AssertHelper::operator= [0x00007FF7B4A7E43E+78]
	SubprocessMetricsProviderTest_SnapshotMetrics_Test::TestBody [0x00007FF7B3ADCB3C+604]
[  FAILED  ] SubprocessMetricsProviderTest.SnapshotMetrics (280 ms)


 
Cc: asvitk...@chromium.org
Labels: -Pri-3 OS-Linux OS-Windows Pri-1
Owner: bcwh...@chromium.org
Looks like it's been flaky for a while but it suddenly got much worse today. I'm going to try to disable the test because it's failing too many tryjobs.
Project Member

Comment 2 by bugdroid1@chromium.org, Jul 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5af8add5481c4feb36cd42d61161ce3dd3fd99b6

commit 5af8add5481c4feb36cd42d61161ce3dd3fd99b6
Author: James Darpinian <jdarpinian@chromium.org>
Date: Fri Jul 13 16:36:47 2018

Disable flaky test SnapshotMetrics

SubprocessMetricsProviderTest.SnapshotMetrics has been flaky for a while
but it is suddenly much worse now and I don't see an obvious culprit.
Disabling until someone can investigate.

Tbr: asvitkine@chromium.org
Bug:  863262 
Change-Id: I33d62b866c00f7fd9386cb274773547a3815bb20
Reviewed-on: https://chromium-review.googlesource.com/1136051
Commit-Queue: Kim Paulhamus <kpaulhamus@chromium.org>
Reviewed-by: Kim Paulhamus <kpaulhamus@chromium.org>
Cr-Commit-Position: refs/heads/master@{#574944}
[modify] https://crrev.com/5af8add5481c4feb36cd42d61161ce3dd3fd99b6/chrome/browser/metrics/subprocess_metrics_provider_unittest.cc

Components: Tests>Flaky
Do we know why it started flaking more? What CL range does it correlate with?
This appears to have failed 3/3 times on my CL https://chromium-review.googlesource.com/c/chromium/src/+/1135898 - suggests something has regressed relating to those metrics.

The revisions listed for when the flake was first reported seem to not exist, which is bizarre.
Cc: jbudorick@chromium.org
+jbudorick: The flakiness dashboard seems very confused; if you click on any of the failure cases under Linux CFI, for example, then you can't open the failure log (404) and if you open the build log then it didn't fail. If you actually open the unit_tests shard and search for SnapshotMetrics, for example, then you'll find that it is logged as being run 20 times. O_o
 Issue 863553  has been merged into this issue.
Cc: -asvitk...@chromium.org bcwh...@chromium.org
Owner: asvitk...@chromium.org
Status: Assigned (was: Untriaged)
Brian is away, so I will look at it this week.
The test appears to pass locally for me on Mac. Let me try on the bots.
Cc: w...@chromium.org
So the re-enable CL passed the cq:

https://chromium-review.googlesource.com/c/chromium/src/+/1142438

I can try another run to make sure. Otherwise, does anyone know if there a tool to try to run it N times to try to weed out flakyness?
Thinking about this more - the failure was that in this code path:

  // Recording should find the two histograms created in persistent memory.
  EXPECT_EQ(2U, GetSnapshotHistogramCount());

  // A second run should have nothing to produce.
  EXPECT_EQ(0U, GetSnapshotHistogramCount());

The first check reported 3 and the second check 1. Given the latter result, it means a histogram was logged in the meantime between those two calls.

So either that happened on another thread - e.g. maybe by something spawned by a previous thread that's still running - or by GetSnapshotHistogramCount() itself.
From the swarming log, on that failed instance, 1206 other tests ran before this one. So if we're thinking about the "some other thread is still running and logging histograms" world, then we have 1206 potential tests that could be doing this.

https://chromium-swarm.appspot.com/task?id=3ea9c6328da6f410&refresh=10&show_raw=1

That's a big number. I wonder if we could somehow narrow this down? Don't we have infrastructure somewhere that tries to run some tests pairwise and see if they break?
Labels: M-69
Re #10: Could you temporarily add an atomic flag that causes an attempt to log a histogram to actually crash if it's set, and set it in-between those two expectations, so that if we can repro it then we'll get a stack showing us where the culprit is?

Re #11: Yes, we have some tooling for that, though I'm not familiar with it myself.  Perhaps that is a thing we should have set-up to run continuously as an FYI, if we don't already?

(Adding a milestone since this is tagged P1)
I'll make the test log any unexpected histogram names and we can go from there.
Project Member

Comment 14 by bugdroid1@chromium.org, Jul 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2d44877987c4d48ab0f00234e262ce7df67f9b88

commit 2d44877987c4d48ab0f00234e262ce7df67f9b88
Author: Alexei Svitkine <asvitkine@chromium.org>
Date: Tue Jul 24 14:24:50 2018

Re-enable SubprocessMetricsProviderTest test and switch to gmock.

With gmock, if any of the expectations fail, we will see a diff of
the contents - in particular, we'll see the extra histogram name
if there is one.

Bug:  863262 
Change-Id: Iff375adf1ebbc0a14d64005f48dd574efb45b8dc
Reviewed-on: https://chromium-review.googlesource.com/1146907
Reviewed-by: Brian White <bcwhite@chromium.org>
Commit-Queue: Alexei Svitkine <asvitkine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577526}
[modify] https://crrev.com/2d44877987c4d48ab0f00234e262ce7df67f9b88/chrome/browser/metrics/subprocess_metrics_provider_unittest.cc

Status: Fixed (was: Assigned)
Marking as Fixed, as my CL re-enabled the test. If the problem re-appears, the updated test code should have enough output to point at the culprit histogram that's unexpected.

Sign in to add a comment