
Issue 636489

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

Blocked on:
issue 616031




GPU tests flaky (crash) on Linux Nvidia Debug

Project Member Reported by sunn...@chromium.org, Aug 10 2016

Issue description

Comment 1 by kbr@chromium.org, Aug 10 2016

Components: Blink>JavaScript
Here's the extracted stack trace.

	Crash reason:  SIGSEGV
	Crash address: 0x7fff94400708
	Process uptime: not available
	
	Thread 0 (crashed)
	 0  libbase.so!GetStackTrace(void**, int, int) + 0x52
	 1  libbase.so!MallocBlockQueueEntry::MallocBlockQueueEntry(MallocBlock*, unsigned long) + 0x5d
	 2  libbase.so!MallocBlock::ProcessFreeQueue(MallocBlock*, unsigned long, int) + 0x8d
	 3  libbase.so!MallocBlock::Deallocate(int) + 0x124
	 4  libbase.so!DebugDeallocate(void*, int) + 0xe5
	 5  libbase.so!tc_free + 0x23
	 6  libbase.so!(anonymous namespace)::TCFree(base::allocator::AllocatorDispatch const*, void*) + 0x19
	 7  libbase.so!ShimCppDelete + 0x27
	 8  libv8.so!Cmp [macro-assembler-x64.cc : 3035 + 0x5]
	 9  libv8.so!CallApiFunctionAndReturn [code-stubs-x64.cc : 4944 + 0xb]
	10  libv8.so!Generate [code-stubs-x64.cc : 5122 + 0x38]
	11  libv8.so!GenerateCode [code-stubs.cc : 129 + 0x14]
	12  libv8.so!GetCode [code-stubs.cc : 155 + 0x9]
	13  libv8.so!TailCallStub [macro-assembler-x64.cc : 666 + 0x8]
	14  libv8.so!GenerateApiAccessorCall [handler-compiler-x64.cc : 215 + 0x8]
	15  libv8.so!CompileLoadCallback [handler-compiler.cc : 243 + 0x1a]
	16  libv8.so!CompileHandler [ic.cc : 1235 + 0xd]
	17  libv8.so!ComputeHandler [ic.cc : 1033 + 0xc]
	18  libv8.so!UpdateCaches [ic.cc : 937 + 0xd]
	19  libv8.so!Load [ic.cc : 624 + 0x8]
	20  libv8.so!__RT_impl_Runtime_LoadIC_Miss [ic.cc : 2272 + 0xe]
	21  libv8.so!Runtime_LoadIC_Miss [ic.cc : 2253 + 0xb]
	22  0x36a4a6f063a7
	23  0x36a4a7011af0
	24  0x36a4a701093a
	25  0x36a4a700ecd7
	26  0x36a4a6f48423
	27  0x36a4a6f26a81
	28  libv8.so!Invoke [execution.cc : 137 + 0x15]
	29  libv8.so!<name omitted> [execution.cc : 174 + 0x1c]
	30  libv8.so!Run [api.cc : 1838 + 0x11]
	31  libblink_core.so!runCompiledScript [V8ScriptRunner.cpp : 415 + 0xc]
	32  libblink_core.so!executeScriptAndReturnValue [ScriptController.cpp : 150 + 0x16]
	33  libblink_core.so!evaluateScriptInMainWorld [ScriptController.cpp : 396 + 0xc]
	34  libblink_core.so!blink::ScriptController::executeScriptInMainWorld(blink::ScriptSourceCode const&, blink::AccessControlStatus) + 0x44
	35  libblink_core.so!blink::ScriptLoader::executeScript(blink::ScriptSourceCode const&) + 0xa1d
	36  libblink_core.so!blink::(anonymous namespace)::doExecuteScript(blink::Element*, blink::ScriptSourceCode const&, WTF::TextPosition const&) + 0x1d9
	37  libblink_core.so!blink::HTMLScriptRunner::executePendingScriptAndDispatchEvent(blink::PendingScript*, blink::ScriptStreamer::Type) + 0x428
	38  libblink_core.so!blink::HTMLScriptRunner::executeParsingBlockingScript() + 0x1f9
	39  libblink_core.so!blink::HTMLScriptRunner::executeParsingBlockingScripts() + 0x69
	40  libblink_core.so!blink::HTMLScriptRunner::execute(blink::Element*, WTF::TextPosition const&) + 0x29c
	41  libblink_core.so!blink::HTMLDocumentParser::runScriptsForPausedTreeBuilder() + 0xdc
	42  libblink_core.so!blink::HTMLDocumentParser::processTokenizedChunkFromBackgroundParser(std::unique_ptr<blink::HTMLDocumentParser::TokenizedChunk, std::default_delete<blink::HTMLDocumentParser::TokenizedChunk> >) + 0x92f
	43  libblink_core.so!blink::HTMLDocumentParser::pumpPendingSpeculations() + 0x4d0
	44  libblink_core.so!blink::HTMLDocumentParser::resumeParsingAfterYield() + 0xbf
	45  libblink_core.so!blink::HTMLParserScheduler::continueParsing() + 0x1d
	46  libblink_core.so!void base::internal::FunctorTraits<void (blink::ScriptStreamer::*)(), void>::Invoke<blink::CrossThreadPersistent<blink::ScriptStreamer> const&>(void (blink::ScriptStreamer::*)(), blink::CrossThreadPersistent<blink::ScriptStreamer> const&) + 0x82
	47  libblink_core.so!void base::internal::InvokeHelper<true, void>::MakeItSo<void (blink::HTMLParserScheduler::* const&)(), blink::WeakPersistent<blink::HTMLParserScheduler> const&>(void (blink::HTMLParserScheduler::* const&)(), blink::WeakPersistent<blink::HTMLParserScheduler> const&) + 0x6d
	48  libblink_core.so!void base::internal::Invoker<base::internal::BindState<void (blink::HTMLParserScheduler::*)(), blink::WeakPersistent<blink::HTMLParserScheduler> >, void ()>::RunImpl<void (blink::HTMLParserScheduler::* const&)(), std::tuple<blink::WeakPersistent<blink::HTMLParserScheduler> > const&, 0ul>(void (blink::HTMLParserScheduler::* const&)(), std::tuple<blink::WeakPersistent<blink::HTMLParserScheduler> > const&, base::IndexSequence<0ul>) + 0x42
	49  libblink_core.so!base::internal::Invoker<base::internal::BindState<void (blink::HTMLParserScheduler::*)(), blink::WeakPersistent<blink::HTMLParserScheduler> >, void ()>::Run(base::internal::BindStateBase*) + 0x2c
	50  libblink_platform.so!base::Callback<void (), (base::internal::CopyMode)1>::Run() const + 0x2e
	51  libblink_platform.so!WTF::Function<void (), (WTF::FunctionThreadAffinity)1>::operator()() + 0x101
	52  libblink_platform.so!blink::CancellableTaskFactory::CancellableTask::run() + 0x4e
	53  libscheduler.so!scheduler::WebTaskRunnerImpl::runTask(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >) + 0x1e
	54  libscheduler.so!void base::internal::FunctorTraits<void (*)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), void>::Invoke<std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > >(void (*)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >&&) + 0x37
	55  libscheduler.so!void base::internal::InvokeHelper<false, void>::MakeItSo<void (* const&)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > >(void (* const&)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >&&) + 0x38
	56  libscheduler.so!void base::internal::Invoker<base::internal::BindState<void (*)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), base::internal::PassedWrapper<std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > > >, void ()>::RunImpl<void (* const&)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), std::tuple<base::internal::PassedWrapper<std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > > > const&, 0ul>(void (* const&)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), std::tuple<base::internal::PassedWrapper<std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > > > const&, base::IndexSequence<0ul>) + 0x47
	57  libscheduler.so!base::internal::Invoker<base::internal::BindState<void (*)(std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> >), base::internal::PassedWrapper<std::unique_ptr<blink::WebTaskRunner::Task, std::default_delete<blink::WebTaskRunner::Task> > > >, void ()>::Run(base::internal::BindStateBase*) + 0x2c
	58  libbase.so!base::Callback<void (), (base::internal::CopyMode)1>::Run() const + 0x2e
	59  libbase.so!base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&) + 0x25b
	60  libscheduler.so!scheduler::TaskQueueManager::ProcessTaskFromWorkQueue(scheduler::internal::WorkQueue*, scheduler::internal::TaskQueueImpl::Task*) + 0x634


I'm not sure whether this should be a P1. An understanding of what's going wrong is needed. Is this basically an out-of-memory error?

Is this suddenly happening?
Cc: ahaas@chromium.org
Status: Available (was: Untriaged)
Is this maybe related to the latest GC crashers?

Comment 4 by kbr@chromium.org, Aug 11 2016

Components: -Internals>GPU Internals>GPU>Testing
Labels: -Pri-1 Pri-2
I don't think this is a P1. These flaky crashes have been on the waterfall at least since August 2. Attached is the full build log from https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32703 in which GpuProcess.identify_active_gpu1 failed. The crash happened during compilation. It's a little disconcerting since it looks like there might be C heap corruption. It seems to be happening only in Debug mode though -- there's no evidence of this on the Release bot: https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28NVIDIA%29?numbuilds=200 .

stdout.txt
703 KB View Download

Comment 5 by vmi...@chromium.org, Aug 15 2016

I'm going to try bisecting this.

Comment 6 by kbr@chromium.org, Aug 18 2016

Cc: vmi...@chromium.org
 Issue 638384  has been merged into this issue.

Comment 7 by vmi...@chromium.org, Aug 18 2016

Looking at swarming task logs, the earliest this crash reproduced on Linux Debug is May 25th, with a single instance in that month.  The frequency increased starting June 9th and has stayed elevated through today.

Earliest report: 2016-05-25 19:53:21 https://chromium-swarm.appspot.com/user/task/2d941f446e473210

Prior to that there was a series of failures in CheckMicrotasksScopesConsistency on March 15th.  There is a chance that the recent memory errors are related to microtasks.

2016-03-15 17:28:22 https://chromium-swarm.appspot.com/user/task/2d9423ce6b5a5110

Task logs showing this trend:

https://chromium-swarm.appspot.com/user/tasks?limit=105&state=completed_failure&task_tag=stepname%3Agpu_process_launch_tests%20on%20NVIDIA%20GPU%20on%20Linux%20on%20Linux&cursor=CkoSRGoQc35jaHJvbWl1bS1zd2FybXIwCxILVGFza1JlcXVlc3QYrsjsr9qvmYN9DAsSEVRhc2tSZXN1bHRTdW1tYXJ5GAEMGAAgAA%3D%3D


gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/8d2184e5f1/Linux Debug (NVIDIA)/30609	Completed (failed)	2016-06-09 17:57:35	0:05	4:31	0.035 $	‑‑	build125-m1	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/2f8988f30f/Linux Debug (NVIDIA)/63027	Completed (failed)	2016-06-09 11:30:22	0:05	4:50	0.0352 $	‑‑	build156-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c7dd049a68/Linux Debug (NVIDIA)/63007	Completed (failed)	2016-06-09 04:31:59	0:06	4:55	0.0356 $	‑‑	build146-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/38b8ec90d5/Linux Debug (NVIDIA)/63006	Completed (failed)	2016-06-09 04:11:29	0:09	5:00	0.0369 $	‑‑	build146-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/a798cddc2f/Linux Debug (NVIDIA)/30570	Completed (failed)	2016-06-09 03:45:07	0:07	4:54	0.0354 $	‑‑	build80-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/7730fb44d9/Linux Debug (NVIDIA)/29857	Completed (failed)	2016-05-25 19:53:21	3:27	4:50	0.0384 $	‑‑	build146-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/ab11c95a7f/Linux Debug (NVIDIA)/28171	Completed (failed)	2016-04-19 14:07:16	0:02	0:01	0.0026 $	‑‑	build80-m4	25  <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/bb52028ca9/Linux Release (NVIDIA)/76337	Completed (failed)	2016-04-19 13:55:53	0:10	0:01	0.0014 $	‑‑	build107-m4	25  <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/d4935fa7c3/Linux Release (NVIDIA)/39041	Completed (failed)	2016-04-19 13:50:27	0:01	0:01	0.0011 $	‑‑	build125-m1	25  <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/fda7b949b9/Linux Debug (NVIDIA)/60418	Completed (failed)	2016-04-19 13:45:19	0:04	0:02	0.0025 $	‑‑	build157-m1	25  <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c4bd102de0/Linux Release (NVIDIA)/76335	Completed (failed)	2016-04-19 13:32:53	0:06	0:01	0.0011 $	‑‑	build78-m4	25  <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/04d8ecc3a5/Linux Release (NVIDIA)/39040	Completed (failed)	2016-04-19 13:23:31	0:03	0:01	0.0014 $	‑‑	build150-m4	25  <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/8147e3e99e/Linux Debug (NVIDIA)/27320	Completed (failed)	2016-03-15 17:28:22	0:05	1:31	0.0136 $	‑‑	build157-m1	25  <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/5a37d6283c/Linux Debug (NVIDIA)/58954	Completed (failed)	2016-03-15 17:23:25	0:11	1:26	0.013 $	‑‑	build145-m4	25  <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/22dd24688a/Linux Debug (NVIDIA)/27319	Completed (failed)	2016-03-15 17:20:13	0:08	1:41	0.0154 $	‑‑	build155-m4	25  <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/3a5c7b6660/Linux Debug (NVIDIA)/58953	Completed (failed)	2016-03-15 17:15:49	0:04	1:28	0.0133 $	‑‑	build78-m4	25  <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c5eea1430c/Linux Debug (NVIDIA)/27318	Completed (failed)	2016-03-15 17:13:38	0:04	1:34	0.0138 $	‑‑	build105-m4	25  <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/b758be9995/Linux Debug (NVIDIA)/58952	Completed (failed)	2016-03-15 17:08:28	0:03	1:24	0.0125 $	‑‑	build151-m4	25  <-- CheckMicrotasksScopesConsistency DCHECK

Comment 8 by kbr@chromium.org, Aug 18 2016

Cc: mstarzinger@chromium.org jkummerow@chromium.org bmeu...@chromium.org mvstan...@chromium.org
I suspect the CheckMicrotasksScopesConsistency issue is unrelated; I vaguely recall triaging that and think the associated bug was actually fixed.

However, the increased incidence of this crash in V8's stub generator since then is marked. Thanks very much Victor for finding this evidence.

V8 team: who can help investigate this bug? Local reproduction is difficult. Victor's found that it mainly reproduces after a rebuild, i.e., with a cold disk cache -- so it seems like a race condition. Victor's been working on a script to bisect the failure on the Swarming bots, so probably he can help whoever picks this up.

It may be a race condition where there's a double-free, and V8 just happens to be the component affected.

Tracking this down will greatly improve stability of the product on the bots. Thanks for your help.

Comment 9 by vmi...@chromium.org, Aug 20 2016

Quick update on this.  I've been trying to bisect this on swarming bots, running up to 300 iterations at each step.  The bisect is nearly done; I'll post an update once it's finished.

# BAD  git 5314658a6e5aad26636d9d2bd20890b241ead672 r398740 FAILS 3/28
# BAD  git 8014547d2f0c780bf395e45d125e23df1463514f r398727 FAILS 6/53 <<< V8 roll
# BAD  git 2bf28420eb2dcc22d1b118dafdf1989fcf986a55 r398726 FAILS 3/116
# BAD  git 211658866f3462be8edc3dd80865d6b68266ca81 r398709 FAILS 3/79
# BAD  git 06bd78559d65ac5b38b7bcfd370baf35532e79ea r398707 FAILS 3/74
# GOOD git 3a5132a428812c397f17b99cfe9101863f8a851c r398705 FAILS 0/300
# GOOD git 00dd0a05d86c5b2d3df2f8707ab93517300ed8d3 r398701 FAILS 0/300
# GOOD git 7a5d04b5299133bf31df423e0eb125fbb2998d65 r398693 FAILS 0/300
# GOOD git 57055bee86b5bbfbe138ac592eed0e9d7d9a423c r398647 FAILS 0/300
# GOOD git a6aafa8a89f49715c19e45232194bce94435e20c r397624 FAILS 0/300
# GOOD git 125abdb5f929af6f5100681e3d8d0d35449a7ad0 r398554 FAILS 0/300
# GOOD git c98d04bc781d7b82808c927554a6e020b0a31573 r398368 FAILS 0/300
# GOOD git 0063cf80ae18f025807016c41f17462d51bee872 r396136 FAILS 0/300

Remaining to bisect: http://test-results.appspot.com/revision_range?start=398707&end=398709
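
Because the failure is flaky, a revision can only be called GOOD statistically. Here is a rough sketch of the arithmetic behind running hundreds of iterations per step. It assumes each run fails independently with a fixed probability, which is a simplification and not something measured in this bug; the function names are illustrative only.

```python
import math


def prob_all_pass(fail_rate: float, runs: int) -> float:
    """Chance that a truly BAD revision passes `runs` times in a row."""
    return (1.0 - fail_rate) ** runs


def runs_needed(fail_rate: float, confidence: float) -> int:
    """Smallest number of clean runs needed to call a revision GOOD with
    the given confidence, if BAD revisions fail at `fail_rate` per run."""
    if not 0.0 < fail_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("fail_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_rate))


# Lowest failure rate seen on a BAD revision in the list above: 3/116.
observed_rate = 3 / 116
false_good = prob_all_pass(observed_rate, 300)

print(f"P(BAD revision passes 300 runs) = {false_good:.6f}")
print("clean runs needed for 99% confidence:", runs_needed(observed_rate, 0.99))
```

Under that independence assumption, 300 clean runs leave well under a 0.1% chance of mislabeling a revision that fails as rarely as 3/116, which is why 0/300 is treated as GOOD here.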

Nothing stands out to me in that range.  I rather suspect the following change just before things started going bad, so I may need to reconfirm due to the nature of the flaky failures.

commit 96a6dfa2c30ab9b22abd20c87ed0e0d6ae41c40e
Author: dpranke <dpranke@chromium.org>
Date:   Wed Jun 8 15:28:05 2016 -0700

    Change //build/config/compiler:optimize_max to use -O3.
    
    Certain components (e.g., v8) really want to be compiled with -O3,
    but the current ":optimize_max" setting just used -O2. Since "max"
    should theoretically mean "max", let's try making it be -O3 across
    the board and see what happens.
    
    R=brettw@chromium.org
    BUG= 616031 
    
    Review-Url: https://codereview.chromium.org/2048163002
    Cr-Commit-Position: refs/heads/master@{#398704}

I suspect this because it's a stack unwinding problem.  In Debug mode, on free() tcmalloc uses GetStackTrace() to save the caller's stack.  This is what is failing, and it looks like either GetStackTrace() has unwound past the top of the stack, or __builtin_frame_address(0) is returning an incorrect value due to optimizations.

Crashing line: https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=325
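
To make the failure mode concrete, here is a toy model of a frame-pointer walk over simulated memory. It is only a sketch and not the tcmalloc code: the `memory` dict, the `Segfault` exception, and both function names are made up for illustration, and the frame layout (saved frame pointer at fp, return address at fp + 8) is the usual x86-64 convention, assumed here.

```python
class Segfault(Exception):
    """Stands in for SIGSEGV when the walker reads an unmapped address."""


def read_word(memory: dict, addr: int) -> int:
    """Read one word; any address missing from `memory` counts as unmapped."""
    if addr not in memory:
        raise Segfault(hex(addr))
    return memory[addr]


def walk_frame_pointers(memory: dict, fp: int, max_depth: int) -> list:
    """Follow the saved-frame-pointer chain, collecting return addresses.

    Assumed layout at each frame pointer:
      memory[fp]     -> caller's saved frame pointer (0 marks the outermost frame)
      memory[fp + 8] -> return address into the caller
    """
    trace = []
    while fp != 0 and len(trace) < max_depth:
        trace.append(read_word(memory, fp + 8))
        fp = read_word(memory, fp)
    return trace


# Three well-formed frames: 0x1000 -> 0x1040 -> 0x1080 (outermost).
healthy = {
    0x1000: 0x1040, 0x1008: 0xAAA,
    0x1040: 0x1080, 0x1048: 0xBBB,
    0x1080: 0x0, 0x1088: 0xCCC,
}
print([hex(a) for a in walk_frame_pointers(healthy, 0x1000, 16)])

# Same stack, but the slot the walker treats as a saved frame pointer holds an
# unrelated value, as can happen in code built with -fomit-frame-pointer.
broken = dict(healthy)
broken[0x1040] = 0x9000
try:
    walk_frame_pointers(broken, 0x1000, 16)
except Segfault as fault:
    print("would crash reading", fault)
```

In the broken case the walker faults while reading the return address 8 bytes past the bogus pointer, which is consistent with a crash address sitting just above a stack-like value.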

Note that in the following stack dump, the "Crash address" is near the stack pointer, but higher than the top of the stack on entry to _start (rsp = 0x00007fffe822ed40).

Operating system: Linux
	                  0.0.0 Linux 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64
	CPU: amd64
	     family 6 model 60 stepping 3
	     1 CPU
	
	GPU: UNKNOWN
	
	Crash reason:  SIGSEGV
	Crash address: 0x7fffe8240408
	Process uptime: not available
	
	Thread 0 (crashed)
	 0  libbase.so!GetStackTrace(void**, int, int) + 0x52
	    rax = 0x00007fffe8240400   rdx = 0x00007fffe822a068
	    rcx = 0x0000000000000001   rbx = 0x0000150cc5bc6230
	    rsi = 0x0000000000000005   rdi = 0x00007fffe822a620
	    rbp = 0x00007fffe8229f20   rsp = 0x00007fffe8229ee0
	     r8 = 0x00007fffe822a324    r9 = 0x0000000000000001
	    r10 = 0x000000000000011c   r11 = 0x000000000000001f
	    r12 = 0x00007fffe822a968   r13 = 0x0000150cc6029380
	    r14 = 0x00007fffe822aa58   r15 = 0x0000000000000004
	    rip = 0x00007f50acf95cd2
	    Found by: given as instruction pointer in context
	 1  libbase.so!MallocBlockQueueEntry::MallocBlockQueueEntry(MallocBlock*, unsigned long) + 0x5d
	    rbx = 0x0000150cc5bc6230   rbp = 0x00007fffe8229f50
	    rsp = 0x00007fffe8229f30   r12 = 0x00007fffe822a968
	    r13 = 0x0000150cc6029380   r14 = 0x00007fffe822aa58
	    r15 = 0x0000000000000004   rip = 0x00007f50acfa222d
	    Found by: call frame info
	 2  libbase.so!MallocBlock::ProcessFreeQueue(MallocBlock*, unsigned long, int) + 0x8d
	    rbx = 0x0000150cc5bc6230   rbp = 0x00007fffe822a3a0
	    rsp = 0x00007fffe8229f60   r12 = 0x00007fffe822a968
	    r13 = 0x0000150cc6029380   r14 = 0x00007fffe822aa58
	    r15 = 0x0000000000000004   rip = 0x00007f50acfa006d
	    Found by: call frame info
	 3  libbase.so!MallocBlock::Deallocate(int) + 0x124
	    rbx = 0x0000150cc5bc6230   rbp = 0x00007fffe822a3f0
	    rsp = 0x00007fffe822a3b0   r12 = 0x00007fffe822a968
	    r13 = 0x0000150cc6029380   r14 = 0x00007fffe822aa58
	    r15 = 0x0000000000000004   rip = 0x00007f50acfa42d4
	    Found by: call frame info
	 4  libbase.so!DebugDeallocate(void*, int) + 0xe5
	    rbx = 0x0000150cc5bc6230   rbp = 0x00007fffe822a440
	    rsp = 0x00007fffe822a400   r12 = 0x00007fffe822a968
	    r13 = 0x0000150cc6029380   r14 = 0x00007fffe822aa58
	    r15 = 0x0000000000000004   rip = 0x00007f50acf9cb95
	    Found by: call frame info
         ...
	86  chrome!_GLOBAL__sub_I_BC_PDF417Detector.cpp + 0x18
	    rsp = 0x00007fffe822ed28   rip = 0x00007f50ad8ee808
	    Found by: stack scanning
	87  chrome!_start + 0x29
	    rsp = 0x00007fffe822ed40   rip = 0x00007f50ad8ee831
	    Found by: stack scanning

Comment 10 by kbr@chromium.org, Aug 20 2016

Cc: dpranke@chromium.org
Thanks Victor for digging into this so deeply. It's been a longstanding problem.

+dpranke as FYI for optimization level change above.

I think I've confirmed that -O2 -> -O3 increases the crash rate; however, reverting to -O2 doesn't seem to fix it 100%.

So far "-O2 -fno-omit-frame-pointer" seems to work 100%; I'll do some extended tests to confirm.  I suspect that -O3 is just doing more inlining, whereas -O2 still inlines but less often.  AFAIK any -O level could imply -fomit-frame-pointer, which makes it unsafe to call GetStackTrace().  Perhaps the solution is to always have -fno-omit-frame-pointer on Debug builds.
Owner: vmi...@chromium.org
Status: Started (was: Available)
Confirmed #11, adding -fno-omit-frame-pointer results in no more failures in 900 runs on ToT.  I'm going to add this flag to debug builds.
Components: -Blink>JavaScript
Removing JavaScript label.

Comment 14 by kbr@chromium.org, Aug 22 2016

Blockedon: 616031
Thanks very much Victor for tracking this down. Blocking it on the root cause bug.


Comment 15 by bugdroid1@chromium.org, Aug 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4e69ee6824fc94c59762b5f05f9f340fb4466d7f

commit 4e69ee6824fc94c59762b5f05f9f340fb4466d7f
Author: vmiura <vmiura@chromium.org>
Date: Tue Aug 23 01:58:27 2016

Explicitly ask for stack frame pointers on Debug posix builds.

GCC / LLVM can omit stack frames at any optimization level.  We use -Os
for Android Debug, and -O3 for targets like v8.  This can cause the runtime
stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.

R=brettw@chromium.org
R=dpranke@chromium.org
BUG= 636489 

Review-Url: https://codereview.chromium.org/2266073002
Cr-Commit-Position: refs/heads/master@{#413628}

[modify] https://crrev.com/4e69ee6824fc94c59762b5f05f9f340fb4466d7f/build/config/compiler/BUILD.gn


Comment 16 by bugdroid1@chromium.org, Aug 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8021d966e853806efec25188e320e45e0bc0bc8b

commit 8021d966e853806efec25188e320e45e0bc0bc8b
Author: johnme <johnme@chromium.org>
Date: Tue Aug 23 13:39:20 2016

Revert of Explicitly ask for stack frame pointers on Debug posix builds. (patchset #1 id:1 of https://codereview.chromium.org/2266073002/ )

Reason for revert:
This broke all Android x86/x64 debug bots, for example:

https://build.chromium.org/p/chromium.android/builders/Android%20x86%20Builder%20%28dbg%29/builds/7886
https://build.chromium.org/p/chromium.android/builders/Android%20x64%20Builder%20%28dbg%29/builds/7915

It broke because ffmpeg expects to be compiled with -fomit-frame-pointer so that files like third_party/ffmpeg/libavcodec/x86/mpegaudiodsp.c:86 can use an extra register; globally applying -fno-omit-frame-pointer appears to have caused it to run out of registers.

If you look at third_party/ffmpeg/ffmpeg.gyp:260, you'll see it removes this flag if it has been set globally:

'cflags!': [
  '-fno-omit-frame-pointer',
],

But the GN equivalent third_party/ffmpeg/BUILD.gn can't do this, because it's an error in GN to remove a flag that hasn't been set. A clean solution is probably to create a new config in build/config/compiler/BUILD.gn providing the default no-omit-frame-pointer cflag, that is always included, then third_party/ffmpeg/BUILD.gn can unconditionally remove that config and set its own omit-frame-pointer cflag.
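
The constraint described above can be modeled in a few lines. This is a Python stand-in for the list-removal behavior described in this message, not GN itself: `GnError` and `remove_items` are hypothetical names, and the `//build/config/compiler:default_stack_frames` label is an assumed spelling of the proposed always-present config.

```python
class GnError(Exception):
    """Stands in for the build-file error raised when removing a missing item."""


def remove_items(values: list, to_remove: list) -> list:
    """Model of list removal where every removed item must already be present."""
    for item in to_remove:
        if item not in values:
            raise GnError(f"{item!r} is not in the list")
    return [v for v in values if v not in to_remove]


NO_OMIT = "-fno-omit-frame-pointer"
DEFAULT_STACK_FRAMES = "//build/config/compiler:default_stack_frames"  # assumed label

# Removing the raw flag only works in builds where something set it globally.
release_cflags = ["-O2"]
try:
    remove_items(release_cflags, [NO_OMIT])
except GnError as err:
    print("flag removal fails:", err)

# If every target always carries the default config, removing it is always
# legal, and the target can then add its own frame-pointer flag.
target_configs = [DEFAULT_STACK_FRAMES, "//some:other_config"]
ffmpeg_configs = remove_items(target_configs, [DEFAULT_STACK_FRAMES])
ffmpeg_cflags = ["-fomit-frame-pointer"]
print(ffmpeg_configs, ffmpeg_cflags)
```

The point of the always-included config is that the removal no longer depends on which build type happens to set the flag.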

Original issue's description:
> Explicitly ask for stack frame pointers on Debug posix builds.
>
> GCC / LLVM can omit stack frames at any optimization level.  We use -Os
> for Android Debug, and -O3 for targets like v8.  This can cause the runtime
> stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.
>
> R=brettw@chromium.org
> R=dpranke@chromium.org
> BUG= 636489 
>
> Committed: https://crrev.com/4e69ee6824fc94c59762b5f05f9f340fb4466d7f
> Cr-Commit-Position: refs/heads/master@{#413628}

TBR=brettw@chromium.org,dpranke@chromium.org,kbr@chromium.org,vmiura@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 636489 

Review-Url: https://codereview.chromium.org/2269063002
Cr-Commit-Position: refs/heads/master@{#413722}

[modify] https://crrev.com/8021d966e853806efec25188e320e45e0bc0bc8b/build/config/compiler/BUILD.gn


Comment 18 by bugdroid1@chromium.org, Aug 26 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6

commit d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6
Author: vmiura <vmiura@chromium.org>
Date: Fri Aug 26 01:33:22 2016

Add a GN build configuration for controlling stack frame generation.

This is part 1 of staging crrev.com/2266073002.

Part 1: Add empty config("default_stack_frames").
Part 2: Disable this config in third_party/ffmpeg.
Part 3: Move Chromium -fomit/-fno-omit-frame-pointer logic into this config.

TBR=brettw@chromium.org
TBR=dpranke@chromium.org
BUG= 636489 

Review-Url: https://codereview.chromium.org/2280533004
Cr-Commit-Position: refs/heads/master@{#414618}

[modify] https://crrev.com/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6/build/config/BUILDCONFIG.gn
[modify] https://crrev.com/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6/build/config/compiler/BUILD.gn

> Are there more steps to take to resolve this bug?

Yep, working on a few more patches to stage the fix without breaking third_party/ffmpeg.

Comment 20 by bugdroid1@chromium.org, Aug 26 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/third_party/ffmpeg/+/35740fc7b72ac1d9adff69e67f3f61b639484dc3

commit 35740fc7b72ac1d9adff69e67f3f61b639484dc3
Author: Victor Miura <vmiura@chromium.org>
Date: Fri Aug 26 01:51:28 2016

Override the default_stack_frames GN config.

Chromium default flags in Debug builds will be changed to include
-fno-omit-frame-pointer as part of the 'default_stack_frames' config
which breaks ffmpeg compile on Android x86.

This CL disables the 'default_stack_frames' config, so we can keep
'-fomit-frame-pointer' for ffmpeg.

BUG= 636489 

Change-Id: I5ddf565cb0d720099d0d1a06a337045e722192b7
Reviewed-on: https://chromium-review.googlesource.com/376130
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>

[modify] https://crrev.com/35740fc7b72ac1d9adff69e67f3f61b639484dc3/BUILD.gn


Comment 21 by bugdroid1@chromium.org, Aug 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/58ceb62e15a832d540fd461b1985ffb4d96d85f1

commit 58ceb62e15a832d540fd461b1985ffb4d96d85f1
Author: vmiura <vmiura@chromium.org>
Date: Sat Aug 27 10:48:36 2016

Roll third_party\ffmpeg 75976ae02..edafabaee (1 commit)

https://chromium.googlesource.com/chromium/third_party/ffmpeg.git/+log/75976ae02..edafabaee

$ git log 75976ae02..edafabaee --date=short --no-merges --format='%ad %ae %s'
2016-08-25 vmiura@chromium.org Override the default_stack_frames GN config.

TBR=dalecurtis@chromium.org
BUG= 636489 

Review-Url: https://codereview.chromium.org/2286833003
Cr-Commit-Position: refs/heads/master@{#414902}

[modify] https://crrev.com/58ceb62e15a832d540fd461b1985ffb4d96d85f1/DEPS

@vmiura - can you confirm if my understanding is correct from this bug thread and the related CLs ... ?

1) tcmalloc in a debug config needs frame pointers in order to get a valid stack trace.

2) it appears that the compiler is free to omit frame pointers arbitrarily, regardless of debug / optimize setting, unless -f[no-]omit-frame-pointer is explicitly set.

3) the :ffmpeg_internal needs -fomit-frame-pointer to be set, regardless of the debug and optimization settings.

4) Hence, it would follow if tcmalloc was called from inside an ffmpeg_internal callsite, it would be unhappy.

5) 4) isn't a problem because ... ?

6) -fomit-frame-pointer produces smaller binaries (unsurprisingly) so we want this to be on for official builds and possibly also release builds. Do we really care about binary size on release (not official) builds, though?

Cc: primiano@chromium.org
@dpranke yes pretty much summed it up.

> 1) tcmalloc in a debug config needs frame pointers in order to get a valid stack trace.
> 
> 2) it appears that the compiler is free to omit frame pointers arbitrarily, regardless
>  of debug / optimize setting, unless -f[no-]omit-frame-pointer is explicitly set.
>
> 3) the :ffmpeg_internal needs -fomit-frame-pointer to be set, regardless of the debug
>  and optimization settings.

Correct.

> 4) Hence, it would follow if tcmalloc was called from inside an ffmpeg_internal callsite,
>  it would be unhappy.
> 
> 
> 5) 4) isn't a problem because ... ?

Yes, it may be unhappy.  I'm not sure we've seen a failure from it in practice, because
of... blind luck.

ffmpeg has used -fomit-frame-pointer for a long time, and I'm making the assumption it's
OK to keep the status quo for now.  It is worth following up.

Either ffmpeg doesn't use tcmalloc, or the stars didn't align to cause it to break.  We
run a lot of v8 code across our bots, and somehow only gpu_process_launch_tests
triggered this in a somewhat repeatable manner (~10% of full suite runs).

> 6) -fomit-frame-pointer produces smaller binaries (unsurprisingly) so we want this to
> be on for official builds and possibly also release builds. Do we really care about
> binary size on release (not official) builds, though?

+primiano@ do we care about increasing Clank binary size on Release, non-official builds?

I took the "if it ain't broken" option for now.  tcmalloc would be OK if it's not a Debug
build, and profiling or sanitizing builds will force the stacks on.



Comment 24 by bugdroid1@chromium.org, Aug 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91

commit ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91
Author: vmiura <vmiura@chromium.org>
Date: Tue Aug 30 04:36:34 2016

Explicitly ask for stack frame pointers on Debug posix builds.

GCC / LLVM can omit stack frames at any optimization level.  We use -Os
for Android Debug, and -O3 for targets like v8.  This can cause the runtime
stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.

This CL adds "-fno-omit-frame-pointer" by default to Debug builds, as well
as fixing consistency for the setting in profiling and sanitizer builds.

The -f*omit-frame-pointer flag settings are moved to
config("default_stack_frames"), to enable build targets to disable and
override the default settings.

R=brettw@chromium.org
R=dpranke@chromium.org
BUG= 636489 

Review-Url: https://codereview.chromium.org/2266073002
Cr-Commit-Position: refs/heads/master@{#415102}

[modify] https://crrev.com/ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91/build/config/compiler/BUILD.gn

@dpranke, FWIW I was thinking about why the tcmalloc crashes are so infrequent.

The stack unwinder uses a few heuristics to know when to give up, for example if it gets a stack pointer that is decreasing, or is > 100000 bytes away from the current stack pointer, it gives up.  When the stack frame is bad it's very likely to fail these checks.  For a crash to occur, the unwinder has to hit a pointer which is very close to a real looking stack pointer, but is actually out of bounds and not mmapped.

https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=232

I notice that there is also a case that uses msync() to verify the pointer before dereferencing it, however this seems slow to use frequently, and looks disabled in the tcmalloc case.

https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=266
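
As a sanity check on that explanation, here is a simplified model of the two checks mentioned above (a candidate frame pointer that does not increase, or that is more than 100000 bytes away), applied to the register values from the dump in comment #9. It is a sketch, not the tcmalloc source: `next_frame_plausible` is a hypothetical name, and treating rax as the candidate next frame pointer is an inference from the crash address being rax + 8.

```python
MAX_FRAME_DISTANCE = 100000  # bytes; the limit quoted in the comment above


def next_frame_plausible(old_fp: int, new_fp: int) -> bool:
    """Return True if new_fp passes the two sanity checks described above."""
    if new_fp <= old_fp:
        # Unwinding should move toward higher addresses on x86-64.
        return False
    if new_fp - old_fp > MAX_FRAME_DISTANCE:
        # Too far from the current frame to be a believable caller frame.
        return False
    return True


# Values copied from the register dump in comment #9.
rbp = 0x7FFFE8229F20        # frame pointer inside GetStackTrace
suspect = 0x7FFFE8240400    # rax; the crash address is suspect + 8
start_rsp = 0x7FFFE822ED40  # rsp on entry to _start

print("distance from rbp:", suspect - rbp, "bytes")
print("passes heuristics:", next_frame_plausible(rbp, suspect))
print("above _start's rsp:", suspect > start_rsp)
```

If that reading of the registers is right, the bogus pointer is increasing and within the 100000-byte limit, so both checks pass even though it lies above the stack pointer recorded at _start. That fits the "close to a real-looking stack pointer but not mapped" scenario.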
> +primiano@ do we care about increasing Clank binary size on Release, non-official builds?

Uhm no. I think it's already the case that base and some other targets have different -O values on official vs non-official, right?
I'm just not sure why the fix should be only for non-official.
> Just not sure why the fix should be only for non-official.

The fix is only needed in DEBUG builds, because they enable stack tracing in tcmalloc.

The reason (I assume) to change Release builds to -fno-omit-frame-pointer is to potentially get better stack dumps from crashes.  It may be good but isn't the P1 issue.
> The reason (I assume) to change Release builds to -fno-omit-frame-pointer is
> to potentially get better stack dumps from crashes.

Right.
>> The reason (I assume) to change Release builds to -fno-omit-frame-pointer is
>> to potentially get better stack dumps from crashes.
>
> Right.

The stack walker we use for minidumps uses extra debug information, and is actually able to unwind these stacks that tcmalloc fails on.  Comment #9 has an example, where GetStackTrace crashed, but the minidump got the full stack up to '_start'.

For each frame it lists what it used -
 Found by: call frame info
 Found by: stack scanning
Breakpad definitely uses a different and far more reliable way to get stack traces (which happens after the fact, not on the device).
In general, the Chrome unwinder itself (in base/debug) should be more reliable, as it uses libunwind, which in turn uses unwind tables from CFI, which don't require frame pointers. AFAIK unwind tables are always there on Linux; we strip them only on Android (where we don't have tcmalloc) to save binary size.

So, from what I've learned in this bug, it looks like the tcmalloc unwinder doesn't use unwind tables and always uses frame pointers (very likely for performance reasons), but this mechanism seems fragile (fun fact: I am reviewing some unrelated stack unwinding code that uses FP and we are seeing precisely these kinds of problems).

Beyond whatever band-aid we are going to apply in the short term, at this point I'm just curious about debugallocation: does tcmalloc unwind the stack (with its own stack unwinder, which seems fragile) on every free? Or only when it detects that a free is not valid?
In the first case, does anybody have an idea why? Maybe we should look into that and (if it doesn't require forking anything other than some config.h) get rid of that specific part of debugallocation as a longer-term fix?

Cc: -bmeu...@chromium.org

Comment 32 by enne@chromium.org, Nov 8 2016

vmiura: Is there more work to do here?

Comment 33 Deleted

Labels: -Pri-2 Pri-1
Status: Fixed (was: Started)
Sorry I crossed tracks, ignore comment #33.

This issue is fixed.
