New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 695731 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Chrome
Pri: 1
Type: Bug

Blocking:
issue 698266
issue 643438



Sign in to add a comment

Tracing crashed at InvokeOnMemoryDump

Project Member Reported by ssid@chromium.org, Feb 24 2017

Issue description

Background context: go/memory-infra

MemoryDumpManager::InvokeOnMemoryDump crashes while taking a trace because a memory dump provider is deleted before un-registering at MemoryDumpManager.

Crash https://crash.corp.google.com/browse?stbtiq=61560c1ec0000000#1 shows that 
ProcessmetricsMemoryDumpProvider, sql::connection and HostSharedBitmapManager are the last called dump providers. The ProcessMetrics is the first name in minidump and the HostSharedBitmapManager is the last name. I am not sure how is the stack dumped in minidump.
Primiano, can you please check the minidump and assign the bug?

I am guessing HostSharedBitmapManager because the LazyInstance is not Leaky and it is never un-registerd. But, when taking a trace I was not trying to shutdown my browser. So, the instance cannot be deleted?
 

Comment 1 by ssid@chromium.org, Feb 24 2017

Cc: ericrk@chromium.org

Comment 2 by w...@chromium.org, Feb 24 2017

Given how trivially this issue repros, presumably the relevant
MemoryDumpProvider is being torn-down mid-capture, which will rarely be the
case with any global object.

It looks like ProcessMetricsMemoryDumpProvider is per-process, so for that
to disappear mid-capture seems much more plausible, but that is deleted by
the MemoryDumpManager itself, via UnregisterDumpProviderInternal
<https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_manager.cc?l=349&gs=cpp%253Abase%253A%253Atrace_event%253A%253Aclass-MemoryDumpManager%253A%253AUnregisterDumpProviderInternal(base%253A%253Atrace_event%253A%253AMemoryDumpProvider%2B*%252C%2Bbool)%2540chromium%252F..%252F..%252Fbase%252Ftrace_event%252Fmemory_dump_manager.cc%257Cdef&gsn=UnregisterDumpProviderInternal&ct=xref_usages>
).

That just leaves sql::Connection, which does have some suspicious-looking
logic: It only unregisters the MemoryDumpProvider if |db_| is non-null;
that should probably be moved outside the |db_| check.  But that appeared
in the middle of the stack, so that seems unlikely to be the cause?

Comment 3 by ssid@chromium.org, Feb 24 2017

The bottom of the stack should be the cause of the crash since the crash happened when trying to invoke the dump. Both sql and ProcessMetrics use UnregisterAndDeleteDumpProvider which means MDM handles the deletion. The only unsafe case is HostSharedBitmapManager. But, I do not understand why it is deleted when not shutting down. Or somehow tracing causes the shutdown?

Comment 4 by ssid@chromium.org, Feb 24 2017

Just noticed that the report has in shutdown: false :/

Comment 5 by ssid@chromium.org, Feb 24 2017

Cc: dskiba@chromium.org

Comment 6 by dskiba@chromium.org, Feb 24 2017

Interesting that crash is a SIGILL - probably caused by __builtin_trap?
Ok checked the minidump. The PMD causing the crash is the ProcessMemoryMetrics one. So sounds like there is some race because some process goes away and the ProcessMemoryMetrics (the one that we register on behalf of other processes) is unregistered uncleanly.
Hector and I stared at the disassembly for quite a bit. the SIGILL is because of hitting a ud2 (explicit illegal  instruction). In turn we hit the ud2 because the code does some check on the validity of the vtable before making the OnMemoryDump virtual call and that check fails. 
So essentially the memory_dump_provider that we have stored points to some memory which is not a valid base address anymore (I think this check is part of the  recently introduced CFI - Control Flow Integrity).
Now somebody plz figure out where is the race in the ProcessMemoryMetrics dump provider  ;-)

I think I know what's going on. There is a bug in 
void ProcessMetricsMemoryDumpProvider::RegisterForProcess if we get somehow a duplicated PID: we Register the MDP to the MemoryDumpManager but then destroy immediately the object.
This is the code simplified:
ProcessMetricsMemoryDumpProvider::RegisterForProcess(base::ProcessId process) {
  std::unique_ptr<ProcessMetricsMemoryDumpProvider> metrics_provider(new ...);
  MemoryDumpManager::GetInstance()->RegisterDumpProvider(metrics_provider.get());
  bool did_insert = g_dump_providers_map.Get().insert(std::make_pair(process, std::move(metrics_provider))).second;
  if (!did_insert) {
    DLOG(ERROR) << "ProcessMetricsMemoryDumpProvider already registered"
  }
}

At this point we did NOT transfer the ownership of |metrics_provider| to the |g_dump_providers_map| because we had already another for the same pid.
so the std::move(metrics_provider) into a failing .insert() attempt will resolve in |metrics_provider| being destroyed.
However, the MDM at this point is convinced that the MDP is alive, and will crash trying to access it.
So there are two problems here:
1) The memory ownership of |metrics_provider| in the failure path of ProcessMetricsMemoryDumpProvider::RegisterForProcess is wrong. We should register the MDP only after the insert is sucessful, not before. This is easy to fix.
2) For some reason we see child processes with the same PID on CrOS. I have no idea why that happens.
Can somebody help here?
hjd@ can you plz fix the 1st problem and add a test?
I'd start adding a test and checking that the current code crashes. 
I would structure the test as follows:

- Rework the ProcessMetricsMemoryDumpProvider so that instead of directly creating a PMMDP instance in the RegisterFOrProcess method it invokes a factory method that you can override in the unittest *
- Add a method to MDM of the form MemoryDumpManager::IsDumpProviderRegisteredForTesting(MDP*)
- In the unittest keep track of the mocked CTOR/DTOR calls
- Do a double RegisterForProcess() with the same pid.
- CHeck that the one that is registered in the MDM is the one that didn't get destroyed

Now, fix the logic as I suggested above in #8
The test should now pass


* Something like this
---

class ProcessMetricsMemoryDumpProvider {
  using FactoryFunction = std::unique_ptr<ProcessMetricsMemoryDumpProvider> (*)(Process*);

..
...
 static FactoryFunction factory_for_testing;

};

std::unique_ptr<ProcessMetricsMemoryDumpProvider> CreateMetricsProvider(process) {
  if (PMMDP::factory_for_testing)
    return MakeUnique<ProcessMetricsMemoryDumpProvider>(factory_for_testing(process))
  return MakeUnique<ProcessMetricsMemoryDumpProvider>(new PMMDP(process))
}

void ProcessMetricsMemoryDumpProvider::RegisterForProcess(
    base::ProcessId process) {
  unique_ptr<PMMDP> metrics_provider = CreateMetricsProvider();

---

Comment 10 by w...@chromium.org, Mar 1 2017

Re #7: This diagnosis suggests some more fundamental bug in Chrome process management and/or the memory-infra impls, given that:
- This crasher is 100% reproducible if I run memory-infra tracing under ChromeOS.
- We register-but-teardown the provider only if the PID was already known.

That suggests that we either have some processes reported with bogus PIDs (e.g. the process fails to be created properly, or is torn-down before we reach here, so we get one or more processes with a zero process Id?), or that the map comparison isn't behaving as we expect.

FWIW I used to run with Task Manager set to show process priorities, but found that that frequently crashed Chrome because of invalid assertions about the validity of process handles - I wonder if we really are creating and tearing down e.g. utility processes, for some reason, at a crazy rate?

Comment 11 by w...@chromium.org, Mar 1 2017

Labels: M-59
> Re #7: This diagnosis suggests some more fundamental bug in Chrome process management and/or the memory-infra impls,
As I said in #8 there is a clear bug in the ProcessMetricsMemoryDumpProvider::RegisterForProcess(), which is silly and super easy to fix (hjd@ is on it). However my guts say that once we fix the bug we'll stop crashing but we will lose the OS statistics of the child processes.
So, on top of the bug I suspect there is a more general bug in chrome as regards getting the right peer_pid() from the IPC endpoints.

> That suggests that we either have some processes reported with bogus PIDs (e.g. the process fails to be created properly, or is torn-down before we reach here, so we get one or more processes with a zero process Id?)
Yes I think this is the case. The process id logic in chrome is very complex because of all the sandboxing mechanisms, that vary form os to os. So my biggest suspect right now is that we get the wrong PID in the browser process (maybe the one that the renderer sees itself, which is always 1 due to the pid namespace isolation)

> or that the map comparison isn't behaving as we expect.
Nah that map is absolutely trivial. Is really just a map<pid, pointer> and is simply populated when we get the IPC callbacks.

Possibly some refactoring of the IPC broke the PID seen by the IPC mechanism (message_filter->peer_pid()). 
This would be super easy to check... if somebody had a development CrOs device where we can add a printf(). We should recruit some CrOs folk here.

> FWIW I used to run with Task Manager set to show process priorities, but found that that frequently crashed Chrome because of invalid assertions about the validity of process handles 
Ok this smells to me like the same root problem.

>  I wonder if we really are creating and tearing down e.g. utility processes, for some reason, at a crazy rate?
Even if we did, that shouldn't cause the issue. AFAIK the linux kernel tries extremely hard to not recycle process ids, mainly for security reason. You should create an insane amount before seeing the same pid again (unless some CrOs kernel change regressed this).
I think it's more likely that those peer_id() we see are all 1s or 0s.

Comment 13 by w...@chromium.org, Mar 1 2017

FWIW I have a Chrome for ChromeOS build reasonably up-to-date on my Linux
box, so I can easily check to see whether we can trivially repro any PID
issue.

Comment 14 by ssid@chromium.org, Mar 1 2017

I tried to trace with the chromeOS that I built using tot and ran with out/chromeos/chrome. It didn't crash (after opening quite a bit og processes and closing). For a long time I was not able to repro the issue.

The crash link given in the bug was from my Linux workstation. So, the problem might not be just in CrOS IPC mechanism, but also in Linux.

It could also be possible that we did not get closed signal for some process and we did not unregister the provider for the process at all. After a much longer time kernel recycled the PID and we just cleared the provider that time without unregistering. Maybe a better fix here is to remove the existing element and unregister and insert the new one.

Comment 15 by ssid@chromium.org, Mar 1 2017

Cc: bcwh...@chromium.org primiano@chromium.org
 Issue 682081  has been merged into this issue.

Comment 16 by ssid@chromium.org, Mar 1 2017

Labels: -Pri-2 OS-Android OS-Chrome Pri-1
Increasing priority since we have slow reports and I noticed 200 Android crashes because of InvokeOnMemoryDump. Maybe it is same issue. But, these crashes are from older versions and we do not have the provider names.

Comment 17 by ssid@chromium.org, Mar 1 2017

My bad, it can happen only on Linux and CrOS. Anyways 12 crashes on Linux as well because of Slow reports mostly.

Comment 18 by ssid@chromium.org, Mar 1 2017

Labels: -OS-Android

Comment 19 by ssid@chromium.org, Mar 1 2017

Owner: hjd@chromium.org
Assigning to hjd@ as per comment #9.
Please re-assign to me if needed.

Comment 20 by hjd@chromium.org, Mar 2 2017

Status: Started (was: Untriaged)
Blocking: 698266

Comment 22 by ssid@chromium.org, Mar 13 2017

ping hjd. Any progress here? We might be crashing on users in the wild currently.

Comment 23 by hjd@chromium.org, Mar 13 2017

I've been OOO since Mar 2nd till today when I was sheriffing.
If we are happy to leave the unit test till later I can submit the fix mentioned in #9 tomorrow. If we want to add the unit test at the same time it may take a little longer. Either way that should stop us crashing. No timeframe yet for the problem mentioned #10 and #12 since it needs more investigation.

Comment 24 by ssid@chromium.org, Mar 15 2017

I think it's good to have a short term fix before the tests are written. I do not want to disable slow reports for this reason.
Fix +test landing in  https://codereview.chromium.org/2753723003/
Project Member

Comment 26 by bugdroid1@chromium.org, Mar 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/96f0a2b238852a56c0a30aa67f4a6943c5271688

commit 96f0a2b238852a56c0a30aa67f4a6943c5271688
Author: hjd <hjd@chromium.org>
Date: Thu Mar 16 10:12:47 2017

Fix crash in InvokeOnMemoryDump when tracing

Fixes a bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess
where if we get a duplicate pid we register an MDP but then immediately
destroy it. This later causes a crash. Instead we should not register
the MDP if we get a duplicate pid. This won't help the root problem
of why we are getting duplicate pids which will now just cause
information loss instead of crashing.

BUG= 695731 

Review-Url: https://codereview.chromium.org/2753723003
Cr-Commit-Position: refs/heads/master@{#457392}

[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/base/trace_event/memory_dump_manager.h
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider.cc
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider.h
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider_unittest.cc

Labels: Merge-Request-58 Merge-Request-57
Adding merge request as this should be a very small and safe fix, and would be nice to have it back down to beta
Project Member

Comment 28 by sheriffbot@chromium.org, Mar 16 2017

Labels: -Merge-Request-57 Hotlist-Merge-Review Merge-Review-57
This bug requires manual review: Request affecting a post-stable build
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), ketakid@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-57
-Merge-Request-57 , I did the milestone math wrong. Don't intend to cherry-pick this to stable. 
Labels: -Merge-Request-58
Labels: Merge-Request-58
Project Member

Comment 32 by sheriffbot@chromium.org, Mar 17 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 33 by bugdroid1@chromium.org, Mar 17 2017

Labels: -merge-approved-58 merge-merged-3029
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7bbe40435f169a7d163e703f42c0086f4133e005

commit 7bbe40435f169a7d163e703f42c0086f4133e005
Author: Hector Dearman <hjd@google.com>
Date: Fri Mar 17 12:15:53 2017

Fix crash in InvokeOnMemoryDump when tracing

Fixes a bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess
where if we get a duplicate pid we register an MDP but then immediately
destroy it. This later causes a crash. Instead we should not register
the MDP if we get a duplicate pid. This won't help the root problem
of why we are getting duplicate pids which will now just cause
information loss instead of crashing.

BUG= 695731 

Review-Url: https://codereview.chromium.org/2753723003
Cr-Commit-Position: refs/heads/master@{#457392}
(cherry picked from commit 96f0a2b238852a56c0a30aa67f4a6943c5271688)

Review-Url: https://codereview.chromium.org/2761443003 .
Cr-Commit-Position: refs/branch-heads/3029@{#259}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/base/trace_event/memory_dump_manager.h
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider.cc
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider.h
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider_unittest.cc

Status: Fixed (was: Started)
Marking as fixed, as no other crashes showed up [1] after the merge points (58.0.3029.26 and 59.0.3044.0)

[1] http://go/mppbg

Please reopen if this keeps happening
Labels: -Hotlist-Merge-Review

Sign in to add a comment