
Issue 766047

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Android, Windows, Chrome, Mac, Fuchsia
Pri: 2
Type: Bug




memory-infra: proposals to improve race detection of memory dump providers

Project Member Reported by primiano@chromium.org, Sep 18 2017

Issue description

Forking this off Issue 643438, so we can keep the discussion there about the crash.

Copy pasting from comments 58-59 there:

erikchen@
-----------------------
I have a proposal that will simplify the existing interface, and avoid this race condition at minimal cost.

The race condition we're trying to avoid is: an MDP is destroyed on one thread while it's in the midst of an OnMemoryDump() on a separate thread.

Proposal:
1) Registering a MDP creates a MDP::Identifier, which consists of a Lock and a pointer to the MDP.
2) When the MDM is told to generate a memory dump, it generates a vector of MDP::Identifiers.
3) The MDM begins to dequeue MDP::Identifiers. For each Identifier, the MDM grabs the lock, then checks whether the pointer is non-null. If so, the MDM invokes MDP::OnMemoryDump [still within the confines of the lock].
4) When a MDP deregisters, the MDM finds the MDP::Identifier, grabs the lock, clears the MDP pointer, and then continues with deregistration.

This turns MDP deregistration and MDP dumps into mutually exclusive critical sections. The performance cost only exists when a MDP attempts to deregister while it's being dumped - this can cause blocking of UI/IO threads. That being said, the alternatives [which the current code tries to implement] are:
  1) Pass ownership of the MDP to the MDM. This makes the unenforceable assumption that MDP does not have references to other, now-destroyed objects.
  2) Require that MDP destruction and MDP OnMemoryDump occur on the same thread [only enforced as a DCHECK...and violated in the field]

Note that alternative (2) and the current proposal have the same effect: mutual exclusion of the critical sections, and there's no performance difference between them.
-----------------------

Some thoughts:

>  The race condition we're trying to avoid is: an MDP is destroyed on one thread while it's in the midst of an OnMemoryDump() on a separate thread.

Correct.

First of all, a note: there are two types of MDPs, thread-bound (those that pass a non-null |task_runner|) and thread-whatever (i.e. they can be invoked on any thread). AFAIK, the majority of MDPs are thread-bound.
Thread-bound MDPs are invoked and must be destroyed on the same thread. If this condition is guaranteed, the rest is conceptually very easy.
The problem with thread-whatever MDPs is that, when they want to be destroyed, they have no idea on which thread we are invoking them. Hence the "pass the ownership to MDM" model. Having said this, note that most thread-whatever MDPs are leaky instances. The only two thread-whatever MDPs that need destruction are sql/connection.cc and mojo/edk/system/core.cc .

> 1) Registering a MDP creates a MDP::Identifier, which consists of a Lock, and a pointer to the MDP.
> 2) When the MDM is told to generate a memory dump, it generates a vector of MDP::Identifiers. 
This is the case already (modulo the Lock), see MemoryDumpProviderInfo and ProcessMemoryDumpAsyncState::dump_provider.


> 3) The MDM begins to dequeue MDP::Identifiers. For each Identifier, the MDM grabs the lock, then checks if the pointer is not null. If so, MDM invokes MDP::OnMemoryDump [still within confines of the lock].
> This turns MDP deregistration and MDP dumps into mutually exclusive critical sections.

This still doesn't solve the problem. The issue is that at the point when MDM::Unregister is called, the MDP is very likely already in its dtor (this is the case for both of the thread-whatever MDPs that invoke UnregisterAndDeleteDumpProviderSoon). This means the lock will prevent them from being destroyed completely, but they are already half dead (even worse if they have derived classes, as all the vtable pointers will be gone by then).

The other risk is that this lock approach will turn some of these crashes into deadlocks. I am thinking of the case where invoking OnMemoryDump on a MDP that is in its dtor, blocked on the mutex, causes code paths in the MDP to re-register. Deadlocks are much harder to debug and triage, as they show up as generic renderer hangs.

My claim here is that with this lock proposal we wouldn't solve these problems, just make them more subtle to debug. At least today we get these crashes with a base::trace_event::MemoryDumpManager signature, which means they are easy to query and get triaged to us. If they start getting assigned to other teams, there is a higher risk that they get lost in triage and we never learn that we have problems.

> This makes the unenforceable assumption that MDP does not have references to other, now-destroyed objects.
Right, your comment is spot on. Unfortunately a lock won't solve this.


IMHO a better way to frame this problem is: violations will happen, because they are out of our control. How do we make violations explicit and as easy to triage and fix as possible?

One thing to remember is that, in case of a crash in MDM, we keep a copy of the MDP name on the stack (see |provider_name_for_debugging|), so it will show up in the minidump and we can blacklist the provider until it gets fixed.

> [only enforced as a DCHECK...and violated in the field]
This comment is also spot-on. That was a bad decision (of mine) back then; primiano@-from-the-past had too much trust in DCHECK usefulness. Those should just be CHECKs.
Now, pragmatically, I think (but am not 100% sure, I have to think more about this) we can't just make them CHECKs: if there is any violation, that will very likely destabilize Chrome too much, or cause crashes at startup that are too annoying even for canary.
We should probably use the same tedious-but-safer model we used in https://codereview.chromium.org/2592233002 and its follow-ups (although that was only about making DCHECKs more pedantic).




 
> The issue is that at the point when MDM::Unregister is called, the MDP is very likely in its dtor

Ah, good point. I think we should only allow one type of interface, thread-bound, and force all MDPs to conform [and get rid of this ownership-transfer stuff]. If there are only two MDPs that need to be destroyed and are thread-bound, this should be straightforward.

Comment 2 by ssid@chromium.org, Sep 18 2017

I agree we could make this registration scheme better. But I do not understand why it is priority 2. The crashes have been fixed and we do not see any crashes in the latest version (or <10 crashes on stable). Is this just an effort to clean up the code?

I'm seeing these crashes on Canary. [macOS, renderer process]

https://bugs.chromium.org/p/chromium/issues/detail?id=643438#c62

Comment 4 by ssid@chromium.org, Sep 18 2017

Ah, sorry, I forgot to post about this crash. I found that it crashed on the FILE thread and the dump provider was ThreadLocalEventBuffer. I couldn't understand how a thread-local dump provider could fail to be thread safe. I did not investigate further since it is not enabled in background mode or for UMA. I can dig into this further soon.

I agree with primiano, we shouldn't support unregistering a provider on any thread. We should let the provider handle the thread safety.
Having a blacklist seems like a good idea. I am worried that at some point we will have piled up the list with unfixed bugs, but it's better than crashing while collecting metrics.
Also, removing the ownership-transfer model sounds good! Now that we have the background thread and UMA enabled, we can register these providers directly on the background thread and have the ownership handled by the provider itself.

> Also, removing the ownership-transfer model sounds good! Now that we have the background thread and UMA enabled, we can register these providers directly on the background thread and have the ownership handled by the provider itself.

We could, but this requires some small work on sql/connection.cc.
The problem there is the following:
the SQL::Connection MDP is NOT thread-bound, because a SQL connection is used by different task runners.
Today the SQL::Connection MDP internally has a lock. The lock is held in:
 - OnMemoryDump() 
 - ResetDatabase()

This prevents the database from going away while we are dumping it.
We could definitely move the SQL::Connection MDP to live on our background thread, but in that case it still needs its own lock and needs to expose the ResetDatabase() method, which should be called before unregistering and destroying the MDP.
But overall, doing that should be better than exposing this awkward UnregisterAndDeleteDumpProviderSoon.

> But, I do not understand why it is priority 2.
So IMHO killing UnregisterAndDeleteDumpProviderSoon is a P3, as it impacts only 2 MDPs anyway.
Figuring out a strategy to spot thread violations on unregistration is a P2 (because the crash numbers are not that high; see my recent mail on the internal list).
Making the CHECK stricter for new MDPs is also a P2, but it is very easy to do, so we should just go ahead and fix this easy part now-ish.
