Tracing crashed at InvokeOnMemoryDump
Issue description

Background context: go/memory-infra

MemoryDumpManager::InvokeOnMemoryDump crashes while taking a trace because a memory dump provider is deleted before unregistering from MemoryDumpManager. Crash https://crash.corp.google.com/browse?stbtiq=61560c1ec0000000#1 shows that ProcessMetricsMemoryDumpProvider, sql::Connection and HostSharedBitmapManager are the last dump providers called. ProcessMetrics is the first name in the minidump and HostSharedBitmapManager is the last; I am not sure in which order the stack is dumped in the minidump. Primiano, can you please check the minidump and assign the bug? I am guessing HostSharedBitmapManager, because its LazyInstance is not Leaky and it is never unregistered. But I was not trying to shut down my browser when taking the trace, so how could the instance have been deleted?
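To make the suspected lifetime bug concrete, here is a minimal standalone sketch (simplified stand-ins, not the real Chromium classes) of how a provider that dies without unregistering leaves a dangling pointer that the next dump dereferences:

    #include <cstdio>
    #include <vector>

    // Simplified stand-ins for the real classes; illustrative only.
    struct MemoryDumpProvider {
      virtual ~MemoryDumpProvider() = default;
      virtual void OnMemoryDump() = 0;
    };

    struct MemoryDumpManager {
      std::vector<MemoryDumpProvider*> providers_;  // raw, non-owning pointers
      void RegisterDumpProvider(MemoryDumpProvider* mdp) { providers_.push_back(mdp); }
      void InvokeOnMemoryDump() {
        for (MemoryDumpProvider* mdp : providers_)
          mdp->OnMemoryDump();  // crashes if |mdp| was deleted without unregistering
      }
    };

    struct SomeProvider : MemoryDumpProvider {
      void OnMemoryDump() override { std::puts("dumping"); }
    };

    int main() {
      MemoryDumpManager mdm;
      {
        SomeProvider provider;
        mdm.RegisterDumpProvider(&provider);
      }  // |provider| destroyed here WITHOUT unregistering first.
      mdm.InvokeOnMemoryDump();  // Virtual call through a dangling pointer.
    }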
Feb 24 2017
Given how trivially this issue repros, presumably the relevant MemoryDumpProvider is being torn down mid-capture, which will rarely be the case with any global object. It looks like ProcessMetricsMemoryDumpProvider is per-process, so for that to disappear mid-capture seems much more plausible, but that is deleted by the MemoryDumpManager itself, via UnregisterDumpProviderInternal <https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_manager.cc?l=349&gs=cpp%253Abase%253A%253Atrace_event%253A%253Aclass-MemoryDumpManager%253A%253AUnregisterDumpProviderInternal(base%253A%253Atrace_event%253A%253AMemoryDumpProvider%2B*%252C%2Bbool)%2540chromium%252F..%252F..%252Fbase%252Ftrace_event%252Fmemory_dump_manager.cc%257Cdef&gsn=UnregisterDumpProviderInternal&ct=xref_usages>. That just leaves sql::Connection, which does have some suspicious-looking logic: it only unregisters the MemoryDumpProvider if |db_| is non-null; that should probably be moved outside the |db_| check. But that appeared in the middle of the stack, so it seems unlikely to be the cause?
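For reference, the suspicious shape is roughly this (paraphrased from memory, not the exact sql::Connection source; method name approximate):

    // Paraphrased sketch; the real code lives under sql/.
    void Connection::Close() {
      if (db_) {
        // The unregistration is only reachable on this path...
        base::trace_event::MemoryDumpManager::GetInstance()
            ->UnregisterDumpProvider(this);
        // ... actually close |db_| ...
      }
      // ... so a Connection whose |db_| is null is destroyed while still
      // registered. Hoisting the unregister call above the |db_| check
      // would close that hole.
    }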
Feb 24 2017
The bottom of the stack should be the cause of the crash, since the crash happened when trying to invoke the dump. Both sql and ProcessMetrics use UnregisterAndDeleteDumpProvider, which means the MDM handles the deletion. The only unsafe case is HostSharedBitmapManager. But I do not understand why it would be deleted when we are not shutting down. Or does tracing somehow cause the shutdown?
Feb 24 2017
Just noticed that the report has "in shutdown: false" :/
Feb 24 2017
Interesting that the crash is a SIGILL; probably caused by __builtin_trap?
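(For anyone unfamiliar: on x86, gcc/clang lower __builtin_trap() to a ud2 instruction, so the process dies with SIGILL. Trivial to verify standalone:)

    // Build with gcc or clang on x86/x86-64 and run: the process is killed
    // by SIGILL, because __builtin_trap() is emitted as the illegal
    // instruction ud2.
    int main() {
      __builtin_trap();
    }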
Mar 1 2017
Ok, checked the minidump. The MDP causing the crash is the ProcessMetrics one. So it sounds like there is some race: some process goes away and the ProcessMetricsMemoryDumpProvider (the one that we register on behalf of other processes) is unregistered uncleanly. Hector and I stared at the disassembly for quite a bit. The SIGILL is because of hitting a ud2 (explicit illegal instruction). In turn we hit the ud2 because the code does some check on the validity of the vtable before making the OnMemoryDump virtual call, and that check fails. So essentially the memory dump provider pointer that we have stored points to some memory which is not a valid base address anymore (I think this check is part of the recently introduced CFI, Control Flow Integrity). Now somebody plz figure out where the race in the ProcessMetrics dump provider is ;-)
Mar 1 2017
I think I know what's going on. There is a bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess: if we somehow get a duplicated PID, we register the MDP with the MemoryDumpManager but then immediately destroy the object.
This is the code, simplified:
    void ProcessMetricsMemoryDumpProvider::RegisterForProcess(
        base::ProcessId process) {
      std::unique_ptr<ProcessMetricsMemoryDumpProvider> metrics_provider(new ...);
      MemoryDumpManager::GetInstance()->RegisterDumpProvider(metrics_provider.get());
      bool did_insert =
          g_dump_providers_map.Get()
              .insert(std::make_pair(process, std::move(metrics_provider)))
              .second;
      if (!did_insert) {
        DLOG(ERROR) << "ProcessMetricsMemoryDumpProvider already registered";
      }
    }
At this point we did NOT transfer the ownership of |metrics_provider| to |g_dump_providers_map|, because the map already had an entry for the same pid. So the std::move(metrics_provider) into a failing .insert() attempt results in |metrics_provider| being destroyed.
However, the MDM at this point is convinced that the MDP is still alive, and will crash trying to access it.
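That destruction order is easy to demonstrate outside Chromium; a self-contained snippet (std::make_unique instead of base::MakeUnique, otherwise the same shape):

    #include <cstdio>
    #include <map>
    #include <memory>
    #include <utility>

    struct Provider {
      ~Provider() { std::puts("Provider destroyed"); }
    };

    int main() {
      std::map<int, std::unique_ptr<Provider>> providers;
      providers[42] = std::make_unique<Provider>();  // pid 42 already present

      auto metrics_provider = std::make_unique<Provider>();
      // make_pair moves the unique_ptr into a temporary pair immediately; when
      // insert() then fails (key 42 exists), the temporary is destroyed and
      // takes the Provider with it -- the premature deletion described above.
      bool did_insert =
          providers.insert(std::make_pair(42, std::move(metrics_provider))).second;
      std::printf("did_insert = %d, metrics_provider is %s\n", did_insert,
                  metrics_provider ? "non-null" : "null");  // prints 0, null
    }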
So there are two problems here:
1) The memory ownership of |metrics_provider| in the failure path of ProcessMetricsMemoryDumpProvider::RegisterForProcess is wrong. We should register the MDP only after the insert is successful, not before. This is easy to fix (see the sketch below).
2) For some reason we see child processes with the same PID on CrOS. I have no idea why that happens.
Can somebody help here?
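For 1), a sketch of the re-ordering (same simplified shape as the snippet above; the real RegisterDumpProvider takes more arguments):

    void ProcessMetricsMemoryDumpProvider::RegisterForProcess(
        base::ProcessId process) {
      std::unique_ptr<ProcessMetricsMemoryDumpProvider> metrics_provider(new ...);
      ProcessMetricsMemoryDumpProvider* raw_provider = metrics_provider.get();
      bool did_insert =
          g_dump_providers_map.Get()
              .insert(std::make_pair(process, std::move(metrics_provider)))
              .second;
      if (!did_insert) {
        DLOG(ERROR) << "ProcessMetricsMemoryDumpProvider already registered";
        return;  // The duplicate dies here, but was never registered with the MDM.
      }
      // Only register once ownership has safely landed in the map.
      MemoryDumpManager::GetInstance()->RegisterDumpProvider(raw_provider);
    }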
Mar 1 2017
hjd@ can you plz fix the 1st problem and add a test?
I'd start by adding a test and checking that the current code crashes.
I would structure the test as follows:
- Rework ProcessMetricsMemoryDumpProvider so that, instead of directly creating a PMMDP instance in the RegisterForProcess method, it invokes a factory method that you can override in the unittest *
- Add a method to the MDM of the form MemoryDumpManager::IsDumpProviderRegisteredForTesting(MDP*)
- In the unittest, keep track of the mocked CTOR/DTOR calls
- Do a double RegisterForProcess() with the same pid
- Check that the one that is registered in the MDM is the one that didn't get destroyed
Now fix the logic as I suggested above in #8.
The test should now pass.
* Something like this
---
class ProcessMetricsMemoryDumpProvider {
  using FactoryFunction =
      std::unique_ptr<ProcessMetricsMemoryDumpProvider> (*)(base::ProcessId);
  ..
  ...
  static FactoryFunction factory_for_testing;
};

std::unique_ptr<ProcessMetricsMemoryDumpProvider> CreateMetricsProvider(
    base::ProcessId process) {
  if (ProcessMetricsMemoryDumpProvider::factory_for_testing)
    return ProcessMetricsMemoryDumpProvider::factory_for_testing(process);
  return base::MakeUnique<ProcessMetricsMemoryDumpProvider>(process);
}

void ProcessMetricsMemoryDumpProvider::RegisterForProcess(
    base::ProcessId process) {
  std::unique_ptr<ProcessMetricsMemoryDumpProvider> metrics_provider =
      CreateMetricsProvider(process);
  ...
---
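And a rough sketch of what the unittest itself could look like (hypothetical names except those mentioned above; assumes gtest and that the PMMDP constructor is accessible to the mock):

    namespace {

    int g_live_mocks = 0;
    ProcessMetricsMemoryDumpProvider* g_last_created = nullptr;

    class MockPMMDP : public ProcessMetricsMemoryDumpProvider {
     public:
      explicit MockPMMDP(base::ProcessId process)
          : ProcessMetricsMemoryDumpProvider(process) {
        ++g_live_mocks;  // track CTOR calls
        g_last_created = this;
      }
      ~MockPMMDP() override { --g_live_mocks; }  // track DTOR calls
    };

    std::unique_ptr<ProcessMetricsMemoryDumpProvider> CreateMock(
        base::ProcessId process) {
      return base::MakeUnique<MockPMMDP>(process);
    }

    }  // namespace

    TEST(ProcessMetricsMemoryDumpProviderTest, DoubleRegisterForSamePid) {
      ProcessMetricsMemoryDumpProvider::factory_for_testing = &CreateMock;
      ProcessMetricsMemoryDumpProvider::RegisterForProcess(42);
      ProcessMetricsMemoryDumpProvider* first = g_last_created;
      ProcessMetricsMemoryDumpProvider::RegisterForProcess(42);  // duplicate pid

      // The duplicate must have been destroyed...
      EXPECT_EQ(1, g_live_mocks);
      // ... and the instance known to the MDM must be the surviving one.
      EXPECT_TRUE(MemoryDumpManager::GetInstance()
                      ->IsDumpProviderRegisteredForTesting(first));
    }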
Mar 1 2017
Re #7: This diagnosis suggests some more fundamental bug in Chrome process management and/or the memory-infra impls, given that:
- This crasher is 100% reproducible if I run memory-infra tracing under ChromeOS.
- We register-but-teardown the provider only if the PID was already known.
That suggests that we either have some processes reported with bogus PIDs (e.g. the process fails to be created properly, or is torn down before we reach here, so we get one or more processes with a zero process id?), or that the map comparison isn't behaving as we expect.
FWIW I used to run with Task Manager set to show process priorities, but found that that frequently crashed Chrome because of invalid assertions about the validity of process handles. I wonder if we really are creating and tearing down e.g. utility processes, for some reason, at a crazy rate?
Mar 1 2017
> Re #7: This diagnosis suggests some more fundamental bug in Chrome process management and/or the memory-infra impls

As I said in #8 there is a clear bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess(), which is silly and super easy to fix (hjd@ is on it). However my gut says that once we fix that bug we'll stop crashing, but we will lose the OS statistics of the child processes. So, on top of that bug, I suspect there is a more general bug in Chrome as regards getting the right peer_pid() from the IPC endpoints.

> That suggests that we either have some processes reported with bogus PIDs (e.g. the process fails to be created properly, or is torn down before we reach here, so we get one or more processes with a zero process id?)

Yes, I think this is the case. The process id logic in Chrome is very complex because of all the sandboxing mechanisms, which vary from OS to OS. So my biggest suspect right now is that we get the wrong PID in the browser process (maybe the one that the renderer sees for itself, which is always 1 due to the pid namespace isolation).

> or that the map comparison isn't behaving as we expect.

Nah, that map is absolutely trivial. It is really just a map<pid, pointer> and is simply populated when we get the IPC callbacks. Possibly some refactoring of the IPC broke the PID seen by the IPC mechanism (message_filter->peer_pid()). This would be super easy to check... if somebody had a development CrOS device where we can add a printf(). We should recruit some CrOS folk here.

> FWIW I used to run with Task Manager set to show process priorities, but found that that frequently crashed Chrome because of invalid assertions about the validity of process handles

Ok, this smells to me like the same root problem.

> I wonder if we really are creating and tearing down e.g. utility processes, for some reason, at a crazy rate?

Even if we did, that shouldn't cause the issue. AFAIK the Linux kernel tries extremely hard not to recycle process ids, mainly for security reasons. You would have to create an insane number of processes before seeing the same pid again (unless some CrOS kernel change regressed this). I think it's more likely that those peer_pid()s we see are all 1s or 0s.
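FWIW the pid-namespace theory is easy to demo outside Chrome (standalone sketch, nothing Chrome-specific; needs root because CLONE_NEWPID requires CAP_SYS_ADMIN):

    // Build with g++ (which defines _GNU_SOURCE) and run as root: the child
    // in the fresh pid namespace believes it is pid 1, while the parent sees
    // the real pid -- the kind of mismatch speculated about above.
    #include <sched.h>
    #include <csignal>
    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    static char stack_mem[1024 * 1024];

    static int ChildMain(void*) {
      std::printf("child sees itself as pid %d\n", getpid());  // prints 1
      return 0;
    }

    int main() {
      pid_t pid = clone(&ChildMain, stack_mem + sizeof(stack_mem),
                        CLONE_NEWPID | SIGCHLD, nullptr);
      std::printf("parent sees the child as pid %d\n", pid);  // real pid
      waitpid(pid, nullptr, 0);
    }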
Mar 1 2017
FWIW I have a Chrome for ChromeOS build reasonably up-to-date on my Linux box, so I can easily check to see whether we can trivially repro any PID issue.
Mar 1 2017
I tried to trace with the ChromeOS build that I built at ToT and ran with out/chromeos/chrome. It didn't crash (even after opening and closing quite a few processes); for a long time I was not able to repro the issue. The crash link given in the bug was from my Linux workstation, so the problem might not be just in the CrOS IPC mechanism, but also in Linux. It could also be that we did not get a close notification for some process, and so never unregistered the provider for that process at all. Much later, the kernel recycled the PID, and at that point we just cleared the provider without unregistering it. Maybe a better fix here is to remove the existing element, unregister its provider, and then insert the new one.
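Something like this, sticking to the simplified shape from #8 (exact MDM method names may differ; UnregisterAndDeleteDumpProvider is as named earlier in the thread):

    void ProcessMetricsMemoryDumpProvider::RegisterForProcess(
        base::ProcessId process) {
      std::unique_ptr<ProcessMetricsMemoryDumpProvider> metrics_provider(new ...);
      auto& map = g_dump_providers_map.Get();
      auto it = map.find(process);
      if (it != map.end()) {
        // Stale entry from a recycled (or mis-reported) pid: unregister and
        // delete the old provider instead of silently discarding the new one.
        MemoryDumpManager::GetInstance()->UnregisterAndDeleteDumpProviderSoon(
            std::move(it->second));
        map.erase(it);
      }
      ProcessMetricsMemoryDumpProvider* raw_provider = metrics_provider.get();
      map.emplace(process, std::move(metrics_provider));
      // Register only after ownership has landed in the map.
      MemoryDumpManager::GetInstance()->RegisterDumpProvider(raw_provider);
    }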
Mar 1 2017
Increasing priority, since we have slow reports and I noticed 200 Android crashes because of InvokeOnMemoryDump. Maybe it is the same issue. But these crashes are from older versions, and we do not have the provider names.
Mar 1 2017
My bad, it can happen only on Linux and CrOS. Anyway, there are 12 crashes on Linux as well, mostly because of slow reports.
Mar 1 2017
Assigning to hjd@ as per comment #9. Please re-assign to me if needed.
Mar 13 2017
ping hjd. Any progress here? We might currently be crashing on users in the wild.
Mar 13 2017
I've been OOO since Mar 2nd, until today when I was sheriffing. If we are happy to leave the unit test until later, I can submit the fix mentioned in #9 tomorrow. If we want to add the unit test at the same time, it may take a little longer. Either way, that should stop us crashing. No timeframe yet for the problem mentioned in #10 and #12, since it needs more investigation.
Mar 15 2017
I think it's good to have a short-term fix before the tests are written. I do not want to disable slow reports for this reason.
Mar 15 2017
Fix +test landing in https://codereview.chromium.org/2753723003/
Mar 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/96f0a2b238852a56c0a30aa67f4a6943c5271688

commit 96f0a2b238852a56c0a30aa67f4a6943c5271688
Author: hjd <hjd@chromium.org>
Date: Thu Mar 16 10:12:47 2017

Fix crash in InvokeOnMemoryDump when tracing

Fixes a bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess where if we get a duplicate pid we register an MDP but then immediately destroy it. This later causes a crash. Instead we should not register the MDP if we get a duplicate pid. This won't help the root problem of why we are getting duplicate pids which will now just cause information loss instead of crashing.

BUG=695731
Review-Url: https://codereview.chromium.org/2753723003
Cr-Commit-Position: refs/heads/master@{#457392}

[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/base/trace_event/memory_dump_manager.h
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider.cc
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider.h
[modify] https://crrev.com/96f0a2b238852a56c0a30aa67f4a6943c5271688/components/tracing/common/process_metrics_memory_dump_provider_unittest.cc
Mar 16 2017
Adding a merge request, as this should be a very small and safe fix, and it would be nice to have it merged back to beta.
Mar 16 2017
This bug requires manual review: request affecting a post-stable build. Please contact the milestone owner if you have questions.
Owners: amineer@ (Android), cmasso@ (iOS), ketakid@ (ChromeOS), govind@ (Desktop)
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Mar 16 2017
-Merge-Request-57: I did the milestone math wrong. I don't intend to cherry-pick this to stable.
Mar 17 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact the milestone owner if you have questions.
Owners: amineer@ (Android), cmasso@ (iOS), bhthompson@ (ChromeOS), govind@ (Desktop)
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Mar 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7bbe40435f169a7d163e703f42c0086f4133e005

commit 7bbe40435f169a7d163e703f42c0086f4133e005
Author: Hector Dearman <hjd@google.com>
Date: Fri Mar 17 12:15:53 2017

Fix crash in InvokeOnMemoryDump when tracing

Fixes a bug in ProcessMetricsMemoryDumpProvider::RegisterForProcess where if we get a duplicate pid we register an MDP but then immediately destroy it. This later causes a crash. Instead we should not register the MDP if we get a duplicate pid. This won't help the root problem of why we are getting duplicate pids which will now just cause information loss instead of crashing.

BUG=695731
Review-Url: https://codereview.chromium.org/2753723003
Cr-Commit-Position: refs/heads/master@{#457392}
(cherry picked from commit 96f0a2b238852a56c0a30aa67f4a6943c5271688)
Review-Url: https://codereview.chromium.org/2761443003
Cr-Commit-Position: refs/branch-heads/3029@{#259}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/base/trace_event/memory_dump_manager.cc
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/base/trace_event/memory_dump_manager.h
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider.cc
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider.h
[modify] https://crrev.com/7bbe40435f169a7d163e703f42c0086f4133e005/components/tracing/common/process_metrics_memory_dump_provider_unittest.cc
Mar 27 2017
Marking as fixed, as no other crashes showed up [1] after the merge points (58.0.3029.26 and 59.0.3044.0). Please reopen if this keeps happening.
[1] http://go/mppbg