New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 682081 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 695731
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug-Regression
Hotlist-MemoryInfra



Sign in to add a comment

MemoryDumpManager::OnTraceLogDisabled crashe on ChromeOS

Project Member Reported by w...@chromium.org, Jan 18 2017

Issue description

Chrome Version: 57.0.2970.0 (Official Build) dev (64-bit)
OS: ChromeOS

What steps will reproduce the problem?
I was using ChromeOS and it started kswapd0'ing frantically. The system hung shortly thereafter and then crashed out.

The Chrome User Log contained:

[1:2:0117/161643.863573:ERROR:memory_dump_manager.cc(561)] Disabling MemoryDumpProvider "V8Isolate". Failed to post task on the task runner provided.

and

[1:1:0117/161643.913547:ERROR:persistent_histogram_allocator.cc(411)] NOTREACHED() hit.

at the end of it.
 

Comment 1 by w...@chromium.org, Jan 18 2017

Looks like the crash Id for this was 11550cf080000000.

Comment 2 by ssid@chromium.org, Jan 18 2017

The stack trace is:
0x00007f2a317f196b	(chrome -memory_dump_manager.cc:807 )	base::trace_event::MemoryDumpManager::OnTraceLogDisabled()
0x00007f2a31800d7c	(chrome -trace_log.cc:751 )	base::trace_event::TraceLog::SetDisabledWhileLocked(unsigned char)
0x00007f2a318013c2	(chrome -trace_log.cc:697 )	base::trace_event::TraceLog::SetDisabled()
0x00007f2a30d9acda	(chrome -tracing_controller_impl.cc:248 )	content::TracingControllerImpl::SetDisabledOnFileThread(base::Callback<void (), (base::internal::CopyMode)1, (base::internal::RepeatMode)1> const&)
0x00007f2a2fe614ec	(chrome -callback.h:68 )	base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
0x00007f2a2fe4b50d	(chrome -message_loop.cc:421 )	base::MessageLoop::DoWork()
0x00007f2a2fe4bad2	(chrome -message_pump_libevent.cc:218 )	base::MessagePumpLibevent::Run(base::MessagePump::Delegate*)
0x00007f2a317d3a17	(chrome -run_loop.cc:37 )	base::RunLoop::Run()
0x00007f2a30b07a44	(chrome -browser_thread_impl.cc:250 )	content::BrowserThreadImpl::FileThreadRun(base::RunLoop*)
0x00007f2a30b07b82	(chrome -browser_thread_impl.cc:303 )	content::BrowserThreadImpl::Run(base::RunLoop*)
0x00007f2a317ea728	(chrome -thread.cc:328 )	base::Thread::ThreadMain()
0x00007f2a317e7bd5	(chrome -platform_thread_posix.cc:71 )	ThreadFunc
0x00007f2a2ea80306	(libpthread-2.19.so -pthread_create.c:309 )	start_thread
0x00007f2a2d7d15cc	(libc-2.19.so + 0x000f95cc )	clone
I don't suppose you have the "error=X" value that should have been printed after the NOTREACHED?

Most likely, it was 6 indicating that the persistent histogram memory was full.  While that's not good, the system will continue to work by placing histograms in non-persistent memory... which is why a very recent change (i.e. yesterday) has disabled the NOTREACHED for the "full" condition.

There was a recent CL to both stable and beta (Jan 5 and 10, respectively) that increases the size of the persistent histogram memory but I think 2970 is too old for that.

See: https://bugs.chromium.org/p/chromium/issues/detail?id=675600

Comment 4 by w...@chromium.org, Jan 18 2017

There is no "error=X" value after the NOTREACHED - just "NOTREACHED() hit."
is loggged.  Note that this is a retail build, so presumably we don't have
the verbose logging built-in?
Why does a retail build have DCHECK enabled at all?

Anyway, it's probably the "full" condition so will not occur with a build from HEAD or beta-HEAD.

Comment 6 by w...@chromium.org, Jan 18 2017

Re #5: ChromeOS has a special-case to treat NOTREACHED() as LOG(ERROR) in non-DCheck builds, for some reason.

Comment 7 by w...@chromium.org, Jan 18 2017

Re #5: ChromeOS has a special-case to treat NOTREACHED() as LOG(ERROR) in non-DCheck builds, for some reason.

Comment 8 by ssid@chromium.org, Jan 18 2017

Cc: primiano@chromium.org
Summary: MemoryDumpManager::OnTraceLogDisabled crashe on ChromeOS (was: "Disabling MemoryDumpProvider", immediately before crash)
+primiano:
I am surprised by this stack trace. The crash happened at this line:
mdpinfo->dump_provider->SuspendFastMemoryPolling();

The crash was seg fault at address 0x0000002d. Which means either mdpinfo or dump_provider is NULL or corrupt?
|dump_providers_for_polling_| is always accessed with lock and mdpinfo is scoped_refptr. I can't see why that would be null.

The only way I can think of, why this crash would happen is:
The RegisterDumpProvider was called with NULL dump provider.

The only dump provider that sets |is_fast_polling_supported| to true is ProcessMetricsMemoryDumpProvider. This dump provider should never be unregistered on ChromeOS.
The only code that unregisters process metrics provider should be happening only on OS_LINUX:
https://chromium.googlesource.com/chromium/src/+/master/content/browser/tracing/tracing_controller_impl.cc#507

There is no way dump_provider can be null since MemoryDumpProviderInfo::dump_provider is constant, unless the Register was called with nullptr.
I can't say why ProcessMetricsMemoryDumpProvider::RegisterForProcess would register a null mdp.
Owner: ssid@chromium.org
Status: Assigned (was: Untriaged)
wez: thanks for reporting this.

RE #8:

> The crash was seg fault at address 0x0000002d. Which means either mdpinfo or dump_provider is NULL or corrupt?
Right. By look at the assembly in the minidump, it seems that crash happens at the virtual call site. 


  7f2a317f1960 00000035: 49 8b 77 20            mov    rsi,QWORD PTR [r15+0x20]  // this should be the beginning of for (const auto& mdpinfo) loop
  7f2a317f1964 00000039: 48 8b 7e 08            mov    rdi,QWORD PTR [rsi+0x8]
  7f2a317f1968 0000003d: 4c 8b 07               mov    r8,QWORD PTR [rdi]
=>7f2a317f196b 00000040: 41 ff 50 28            call   QWORD PTR [r8+0x28]  // this looks like the vtable call 
  7f2a317f196f 00000044: 4c 89 ff               mov    rdi,r15
  7f2a317f1972 00000047: e8 f9 36 40 fe         call   0xfe403745
  7f2a317f1977 0000004c: 49 89 c7               mov    r15,rax
  7f2a317f197a 0000004f: 49 39 c4               cmp    r12,rax
  7f2a317f197d 00000052: 75 e1                  jne    0x35      // This should be the block-end of the for (const auto& mdpinfo) loop

SuspendFastMemoryPolling is the 5th virtual method of MemoryDumpProvider, 5 * sizeof(PTR) = 40 -> 0x28, yes that crashed because the vptr on which we called SuspendFastMemoryPolling was invalid.
So mdpinfo itself was either valid (likely), or was corrupt but still had a valid pointer (to garbage) at the position expected by |dump_provider| (which is very hard, though still possible)

if mdpinfo was null or evidently invalid we would have crashed on the previous mov r8, [rdi].

Now the interesting thing is that it looks like (but here I might be overlooking some compiler trick) that mdpinfo->dump_provider wasn't really nullptr, but 5 (that also explains why the crash address is 0x2d -> 0x28 + 0x5). So, take this with a grain of salt, but there is a chance here that dump_provider = 0x05, which is even weirder.

> The only code that unregisters process metrics provider should be happening only on OS_LINUX:
> https://chromium.googlesource.com/chromium/src/+/master/content/browser/tracing/tracing_controller_impl.cc#507
OS_LINUX applies also to CrOS :)
so, yup, here the fact that we unregister the MDP is not surprising. Very likely we are unregistering the MDP for a child process that is gone.

Having said all this I don't have still a valid theory that explains why either mdpinfo or dump_provider would have been null or invalid. :-/

assigning this to ssid as the NOTREACHED thing seems settled, so this crash on memory-infra is the only thing left here.

wez, just to clarify: where you using tracing (or trying to)  when this crash happened?

Comment 11 by w...@chromium.org, Jan 19 2017

IIRC then yes, when the jank started I deliberately kicked off a
Tracing-on-Tap run.

Comment 12 by ssid@chromium.org, Mar 1 2017

Mergedinto: 695731
Status: Duplicate (was: Assigned)
Okay this is same as  issue 695731 . The dump provider is invalid.

Sign in to add a comment