New issue
Advanced search Search tips

Issue 648494 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

[0919/191218:FATAL:persistent_histogram_allocator.cc(188)] Check failed: !user_.

Project Member Reported by scottmg@chromium.org, Sep 20 2016

Issue description

In trying to roll UMA-using-Crashpad into Chrome (https://codereview.chromium.org/2350943003/) I got this on the Windows bot:

https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/295922/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20%28with%20patch%29%20on%20Windows-2008ServerR2-SP1/logs/stdio

A crash was triggered in the GPU process, and it looks like crashpad_handler DCHECK'd due to PersistentHistogramAllocator.

[0919/191218:FATAL:persistent_histogram_allocator.cc(188)] Check failed: !user_. 
Backtrace:
	base::debug::StackTrace::StackTrace [0x00ECEAF7+23]
	logging::LogMessage::~LogMessage [0x00EBFDE1+49]
	base::PersistentSparseHistogramDataManager::UseSampleMapRecords [0x00EDCAD6+118]
	base::PersistentHistogramAllocator::UseSampleMapRecords [0x00EDCA54+20]
	base::PersistentSampleMap::ImportSamples [0x00EDDB5C+44]
	base::PersistentSampleMap::GetOrCreateSampleCountStorage [0x00EDDA38+88]
	base::PersistentSampleMap::Accumulate [0x00EDD5BF+15]
	base::SparseHistogram::AddCount [0x00ED1F28+120]
	base::SparseHistogram::Add [0x00ED4EBD+13]
	crashpad::Metrics::ExceptionCode [0x01177EA3+83]
	crashpad::CrashReportExceptionHandler::ExceptionHandlerServerException [0x00EEAD3D+218]
	crashpad::ExceptionHandlerServer::OnCrashDumpEvent [0x011745FB+52]
	RtlRegisterWait [0x77A01C6A+788]

From looking at that DCHECK, I'm not quite clear on what it's expressing. Brian, do you have any insight?

From looking at the log there, I would guess that it'd be maybe the first or second time that UMA_ macro would have been used for that particular key, but I didn't dig into the details of that GPU test.

That UMA is using UMA_HISTOGRAM_SPARSE_SLOWLY whereas there was a UMA that rolled in last week in https://codereview.chromium.org/2349083002/ that's a plain (?) UMA_HISTOGRAM_CUSTOM_COUNTS, in case that might be relevant.
 
Status: Assigned (was: Untriaged)
Oh, I also used a const char[] for the key name, not a string literal. https://codereview.chromium.org/2350943003/diff/80001/third_party/crashpad/crashpad/util/misc/metrics.cc on the off-chance that's a bad idea.
I'm looking at the trace.  The DCHECK indicates that there are two instances of this histogram using the same PersistentHistogramAllocator object.

That can happen if you iterate over histograms in that allocator within the same process that created them in there.

I'm trying to understand the stack.  Where is crashpad::Metrics::ExceptionCode?
crashpad::Metrics::ExceptionCode is implemented in third_party/crashpad/crashpad/util/misc/metrics.cc in the CL linked above https://codereview.chromium.org/2350943003/diff/120001/third_party/crashpad/crashpad/util/misc/metrics.cc .

I'm not sure what 'two instances of the histogram' means here (i.e. I'm not sure how a histogram instance gets created I guess)...

The UMA macro It could be getting called from different threads; the exception is handled via a thread pool https://cs.chromium.org/chromium/src/third_party/crashpad/crashpad/util/win/exception_handler_server.cc?q=ExceptionHandlerServer::OnCrashDumpEvent&sq=package:chromium&l=565&dr=CSs . Is that a problem?
Cc: bcwh...@chromium.org
Owner: scottmg@chromium.org
After some offline discussion, one possible explanation for this could be that the ExceptionHandlerServer gets created before the GlobalHistogramAllocator in main().

https://cs.chromium.org/chromium/src/third_party/crashpad/crashpad/handler/handler_main.cc?q=ExceptionHandlerServer+exception_handler_server&sq=package:chromium&l=350&dr=C

If there's a crash right away, the UMA macros get used which cause a default GlobalHistogramAllocator to get created, and then main comes along and creates another one. (maybe? sound plausible?)

We didn't really want to move the GlobalHistogramAllocator creation earlier because on Windows that would mean delaying the handshake response (which is on browser startup path) until after the file operations the create the persistent histogram file(s).
Or not! I'm going to see if I can get context_lost to run locally now and try reproduce, maybe that'll make it more obvious what's wrong.
One other random thought (haven't looked at this very deeply), is it possible that StatisticsRecorder::Initialize() doesn't get called in crashpad before the first UMA metric is logged? I think that could cause the histogram object to not be cached and so two of them could exist...
There's no usage of StatisticsRecorder in the crashpad_handler process. Is it required to work correctly, even if there's only one place where GlobalHistogramAllocator gets created (in main())? But that definitely seems like the sort of thing it could be.
If both the histogram AND the GlobalHistogramAllocator were created before Initialize() then that could cause the problem.

Perhaps having the GlobalHistogramAllocator call StatisticsRecorder::Initialize() as part of the former's ctor would be a safe thing to do.  You could try it and see if it makes the test stop failing.
Fortunately, I can repro locally. StatisticsRecorder::Initialize() is never used in the crashpad_handler process.

It seems to be on the second call to UMA_HISTOGRAM_SPARSE_SLOWLY, but at least while in the debugger (where it still reproduces) they're both called much after GlobalHistogramAllocator has been initialized as expected in main. Debugging more...
Great!

Try adding a call to StatisticsRecorder::Initialize() in the GlobalHistogramAllocator constructor.

That'll make it safe and multiple calls to Initialize() are okay.
OK, that works, I'll send a patch to do that separately from the roll. Thanks!

I debugged a bit more, because given that "simple" explanation, I didn't understand why the other UMAs that are called earlier and more frequently didn't crash.

However, the expansion of the other macros e.g. UMA_HISTOGRAM_COUNTS have a static atomic word that early outs before relooking up the histogram, so never get to their factory again.

UMA_HISTOGRAM_SPARSE_SLOWLY however doesn't do this and so always loads from FactoryGet(), so causing the DCHECK. (Should/could it do the same? The sparseness of UMA_HISTOGRAM_SPARSE_SLOWLY values seems unrelated to the frequency with which values are tracked.)
Only sparse histograms have this DCHECK.  Regular histograms will just end up creating duplicates with only the last one (the one in the StatisticsRecorder against which any further creations are de-dup'd) getting uploaded.

Oh, yes...  Sparse histograms could keep a local pointer for speed but since sparse histograms are already "slow", this wasn't deemed necessary.  Typically, they're rarely called.

Note that even with such, the problem would continue to exist if the same histogram (by name) were updated in another place since that other place would have it's own cached reference to the histogram that starts null.
Project Member

Comment 14 by bugdroid1@chromium.org, Sep 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0ad9ed389930c020268e251c454e004f5a9538be

commit 0ad9ed389930c020268e251c454e004f5a9538be
Author: bcwhite <bcwhite@chromium.org>
Date: Thu Sep 22 18:26:17 2016

The GHA won't function properly if the SR hasn't been initialized so do that.

Also: Fix problem with initialization of the global SR when a local
SR is already active for testing purposes.

BUG= 648494 , crashpad:100

Review-Url: https://codereview.chromium.org/2358023003
Cr-Commit-Position: refs/heads/master@{#420402}

[modify] https://crrev.com/0ad9ed389930c020268e251c454e004f5a9538be/base/metrics/persistent_histogram_allocator.cc
[modify] https://crrev.com/0ad9ed389930c020268e251c454e004f5a9538be/base/metrics/statistics_recorder.cc

Project Member

Comment 15 by bugdroid1@chromium.org, Sep 22 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0ad9ed389930c020268e251c454e004f5a9538be

commit 0ad9ed389930c020268e251c454e004f5a9538be
Author: bcwhite <bcwhite@chromium.org>
Date: Thu Sep 22 18:26:17 2016

The GHA won't function properly if the SR hasn't been initialized so do that.

Also: Fix problem with initialization of the global SR when a local
SR is already active for testing purposes.

BUG= 648494 , crashpad:100

Review-Url: https://codereview.chromium.org/2358023003
Cr-Commit-Position: refs/heads/master@{#420402}

[modify] https://crrev.com/0ad9ed389930c020268e251c454e004f5a9538be/base/metrics/persistent_histogram_allocator.cc
[modify] https://crrev.com/0ad9ed389930c020268e251c454e004f5a9538be/base/metrics/statistics_recorder.cc

Owner: bcwh...@chromium.org
Status: Fixed (was: Assigned)

Sign in to add a comment