GPU tests flaky (crash) on Linux Nvidia Debug
Issue description
The failing test differs from run to run, but the stack trace (a crash in v8) is the same.
Builds:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32833
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32830
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32827
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32826
Stack trace in attached log. I was able to repro this on my workstation the other day.
,
Aug 11 2016
Is this suddenly happening?
,
Aug 11 2016
Is this maybe related to the latest GC crashers?
,
Aug 11 2016
I don't think this is a P1. These flaky crashes have been on the waterfall at least since August 2. Attached is the full build log from https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28NVIDIA%29/builds/32703 in which GpuProcess.identify_active_gpu1 failed. The crash happened during compilation. It's a little disconcerting since it looks like there might be C heap corruption. It seems to be happening only in Debug mode though -- there's no evidence of this on the Release bot: https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28NVIDIA%29?numbuilds=200 .
,
Aug 15 2016
I'm going to try bisecting this.
,
Aug 18 2016
,
Aug 18 2016
Looking at swarming task logs, the earliest this crash reproduced on Linux Debug is May 25th, with a single instance in that month. Frequency increased on June 9th through today.
Earliest report: 2016-05-25 19:53:21 https://chromium-swarm.appspot.com/user/task/2d941f446e473210
Prior to that there were a series of failures in CheckMicrotasksScopesConsistency on March 15th. There is a chance that the recent memory errors are related to micro tasks.
2016-03-15 17:28:22 https://chromium-swarm.appspot.com/user/task/2d9423ce6b5a5110
Task logs showing this trend:
https://chromium-swarm.appspot.com/user/tasks?limit=105&state=completed_failure&task_tag=stepname%3Agpu_process_launch_tests%20on%20NVIDIA%20GPU%20on%20Linux%20on%20Linux&cursor=CkoSRGoQc35jaHJvbWl1bS1zd2FybXIwCxILVGFza1JlcXVlc3QYrsjsr9qvmYN9DAsSEVRhc2tSZXN1bHRTdW1tYXJ5GAEMGAAgAA%3D%3D
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/8d2184e5f1/Linux Debug (NVIDIA)/30609 Completed (failed) 2016-06-09 17:57:35 0:05 4:31 0.035 $ ‑‑ build125-m1 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/2f8988f30f/Linux Debug (NVIDIA)/63027 Completed (failed) 2016-06-09 11:30:22 0:05 4:50 0.0352 $ ‑‑ build156-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c7dd049a68/Linux Debug (NVIDIA)/63007 Completed (failed) 2016-06-09 04:31:59 0:06 4:55 0.0356 $ ‑‑ build146-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/38b8ec90d5/Linux Debug (NVIDIA)/63006 Completed (failed) 2016-06-09 04:11:29 0:09 5:00 0.0369 $ ‑‑ build146-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/a798cddc2f/Linux Debug (NVIDIA)/30570 Completed (failed) 2016-06-09 03:45:07 0:07 4:54 0.0354 $ ‑‑ build80-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/7730fb44d9/Linux Debug (NVIDIA)/29857 Completed (failed) 2016-05-25 19:53:21 3:27 4:50 0.0384 $ ‑‑ build146-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/ab11c95a7f/Linux Debug (NVIDIA)/28171 Completed (failed) 2016-04-19 14:07:16 0:02 0:01 0.0026 $ ‑‑ build80-m4 25 <-- Dealloc crash
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/bb52028ca9/Linux Release (NVIDIA)/76337 Completed (failed) 2016-04-19 13:55:53 0:10 0:01 0.0014 $ ‑‑ build107-m4 25 <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/d4935fa7c3/Linux Release (NVIDIA)/39041 Completed (failed) 2016-04-19 13:50:27 0:01 0:01 0.0011 $ ‑‑ build125-m1 25 <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/fda7b949b9/Linux Debug (NVIDIA)/60418 Completed (failed) 2016-04-19 13:45:19 0:04 0:02 0.0025 $ ‑‑ build157-m1 25 <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c4bd102de0/Linux Release (NVIDIA)/76335 Completed (failed) 2016-04-19 13:32:53 0:06 0:01 0.0011 $ ‑‑ build78-m4 25 <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/04d8ecc3a5/Linux Release (NVIDIA)/39040 Completed (failed) 2016-04-19 13:23:31 0:03 0:01 0.0014 $ ‑‑ build150-m4 25 <-- unrelated failure
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/8147e3e99e/Linux Debug (NVIDIA)/27320 Completed (failed) 2016-03-15 17:28:22 0:05 1:31 0.0136 $ ‑‑ build157-m1 25 <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/5a37d6283c/Linux Debug (NVIDIA)/58954 Completed (failed) 2016-03-15 17:23:25 0:11 1:26 0.013 $ ‑‑ build145-m4 25 <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/22dd24688a/Linux Debug (NVIDIA)/27319 Completed (failed) 2016-03-15 17:20:13 0:08 1:41 0.0154 $ ‑‑ build155-m4 25 <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/3a5c7b6660/Linux Debug (NVIDIA)/58953 Completed (failed) 2016-03-15 17:15:49 0:04 1:28 0.0133 $ ‑‑ build78-m4 25 <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/c5eea1430c/Linux Debug (NVIDIA)/27318 Completed (failed) 2016-03-15 17:13:38 0:04 1:34 0.0138 $ ‑‑ build105-m4 25 <-- CheckMicrotasksScopesConsistency DCHECK
gpu_process_launch_tests on NVIDIA GPU on Linux/Linux/b758be9995/Linux Debug (NVIDIA)/58952 Completed (failed) 2016-03-15 17:08:28 0:03 1:24 0.0125 $ ‑‑ build151-m4 25 <-- CheckMicrotasksScopesConsistency DCHECK
,
Aug 18 2016
I suspect the CheckMicrotasksScopesConsistency issue is unrelated; I vaguely recall triaging that and think the associated bug was actually fixed. However, the increased incidence of this crash in V8's stub generator since then is marked. Thanks very much Victor for finding this evidence. V8 team: who can help investigate this bug? Local reproduction is difficult. Victor's found that it mainly reproduces after a rebuild, i.e., with a cold disk cache -- so it seems like a race condition. Victor's been working on a script to bisect the failure on the Swarming bots, so probably he can help whoever picks this up. It may be a race condition where there's a double-free, and V8 just happens to be the component affected. Tracking this down will greatly improve stability of the product on the bots. Thanks for your help.
,
Aug 20 2016
Quick update on this. I've been trying to bisect this on swarming bots, running up to 300 iterations at each step. The bisect is nearly done; I'll post an update once it's finished.

# BAD git 5314658a6e5aad26636d9d2bd20890b241ead672 r398740 FAILS 3/28
# BAD git 8014547d2f0c780bf395e45d125e23df1463514f r398727 FAILS 6/53 <<< V8 roll
# BAD git 2bf28420eb2dcc22d1b118dafdf1989fcf986a55 r398726 FAILS 3/116
# BAD git 211658866f3462be8edc3dd80865d6b68266ca81 r398709 FAILS 3/79
# BAD git 06bd78559d65ac5b38b7bcfd370baf35532e79ea r398707 FAILS 3/74
# GOOD git 3a5132a428812c397f17b99cfe9101863f8a851c r398705 FAILS 0/300
# GOOD git 00dd0a05d86c5b2d3df2f8707ab93517300ed8d3 r398701 FAILS 0/300
# GOOD git 7a5d04b5299133bf31df423e0eb125fbb2998d65 r398693 FAILS 0/300
# GOOD git 57055bee86b5bbfbe138ac592eed0e9d7d9a423c r398647 FAILS 0/300
# GOOD git a6aafa8a89f49715c19e45232194bce94435e20c r397624 FAILS 0/300
# GOOD git 125abdb5f929af6f5100681e3d8d0d35449a7ad0 r398554 FAILS 0/300
# GOOD git c98d04bc781d7b82808c927554a6e020b0a31573 r398368 FAILS 0/300
# GOOD git 0063cf80ae18f025807016c41f17462d51bee872 r396136 FAILS 0/300

Remaining to bisect: http://test-results.appspot.com/revision_range?start=398707&end=398709

Nothing stands out to me in that range. I rather suspect the following change just before things started going bad, so I may need to reconfirm due to the nature of the flaky failures.

commit 96a6dfa2c30ab9b22abd20c87ed0e0d6ae41c40e
Author: dpranke <dpranke@chromium.org>
Date: Wed Jun 8 15:28:05 2016 -0700

Change //build/config/compiler:optimize_max to use -O3.

Certain components (e.g., v8) really want to be compiled with -O3, but the current ":optimize_max" setting just used -O2. Since "max" should theoretically mean "max", let's try making it be -O3 across the board and see what happens.

R=brettw@chromium.org
BUG= 616031
Review-Url: https://codereview.chromium.org/2048163002
Cr-Commit-Position: refs/heads/master@{#398704}

I suspect this because it's a stack unwinding problem. In Debug mode, on free() tcmalloc uses GetStackTrace() to save the caller's stack. This is what is failing, and looks like GetStackTrace() has unwound past the top of the stack, or __builtin_frame_address(0) is returning an incorrect value due to optimizations.

Crashing line: https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=325

Note the following stack dump, the "Crash address" is near the stack pointer, but higher than the top of the stack on entry to _start (rsp = 0x00007fffe822ed40).

Operating system: Linux
0.0.0 Linux 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64
CPU: amd64
family 6 model 60 stepping 3
1 CPU
GPU: UNKNOWN
Crash reason: SIGSEGV
Crash address: 0x7fffe8240408
Process uptime: not available

Thread 0 (crashed)
0 libbase.so!GetStackTrace(void**, int, int) + 0x52
rax = 0x00007fffe8240400 rdx = 0x00007fffe822a068
rcx = 0x0000000000000001 rbx = 0x0000150cc5bc6230
rsi = 0x0000000000000005 rdi = 0x00007fffe822a620
rbp = 0x00007fffe8229f20 rsp = 0x00007fffe8229ee0
r8 = 0x00007fffe822a324 r9 = 0x0000000000000001
r10 = 0x000000000000011c r11 = 0x000000000000001f
r12 = 0x00007fffe822a968 r13 = 0x0000150cc6029380
r14 = 0x00007fffe822aa58 r15 = 0x0000000000000004
rip = 0x00007f50acf95cd2
Found by: given as instruction pointer in context
1 libbase.so!MallocBlockQueueEntry::MallocBlockQueueEntry(MallocBlock*, unsigned long) + 0x5d
rbx = 0x0000150cc5bc6230 rbp = 0x00007fffe8229f50
rsp = 0x00007fffe8229f30 r12 = 0x00007fffe822a968
r13 = 0x0000150cc6029380 r14 = 0x00007fffe822aa58
r15 = 0x0000000000000004 rip = 0x00007f50acfa222d
Found by: call frame info
2 libbase.so!MallocBlock::ProcessFreeQueue(MallocBlock*, unsigned long, int) + 0x8d
rbx = 0x0000150cc5bc6230 rbp = 0x00007fffe822a3a0
rsp = 0x00007fffe8229f60 r12 = 0x00007fffe822a968
r13 = 0x0000150cc6029380 r14 = 0x00007fffe822aa58
r15 = 0x0000000000000004 rip = 0x00007f50acfa006d
Found by: call frame info
3 libbase.so!MallocBlock::Deallocate(int) + 0x124
rbx = 0x0000150cc5bc6230 rbp = 0x00007fffe822a3f0
rsp = 0x00007fffe822a3b0 r12 = 0x00007fffe822a968
r13 = 0x0000150cc6029380 r14 = 0x00007fffe822aa58
r15 = 0x0000000000000004 rip = 0x00007f50acfa42d4
Found by: call frame info
4 libbase.so!DebugDeallocate(void*, int) + 0xe5
rbx = 0x0000150cc5bc6230 rbp = 0x00007fffe822a440
rsp = 0x00007fffe822a400 r12 = 0x00007fffe822a968
r13 = 0x0000150cc6029380 r14 = 0x00007fffe822aa58
r15 = 0x0000000000000004 rip = 0x00007f50acf9cb95
Found by: call frame info
...
86 chrome!_GLOBAL__sub_I_BC_PDF417Detector.cpp + 0x18
rsp = 0x00007fffe822ed28 rip = 0x00007f50ad8ee808
Found by: stack scanning
87 chrome!_start + 0x29
rsp = 0x00007fffe822ed40 rip = 0x00007f50ad8ee831
Found by: stack scanning
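The addresses in the minidump above can be compared directly. The following is only arithmetic on the register values quoted in that dump; the interpretation in the comments (that the faulting read is a slot just past the value held in rax) is an assumption, not something the log states.

```python
# Register values copied from the minidump in this comment.
CRASH_ADDRESS = 0x7fffe8240408
RAX = 0x00007fffe8240400           # frame 0
RSP_AT_CRASH = 0x00007fffe8229ee0  # frame 0 (GetStackTrace)
RSP_AT_START = 0x00007fffe822ed40  # frame 87 (_start)


def describe(crash, rax, rsp_now, rsp_start):
    """Return the distances (in bytes) between the fault and known stack points."""
    return {
        # Assumption: a small positive offset suggests a read of a slot
        # adjacent to the address held in rax.
        "offset_from_rax": crash - rax,
        # Positive means the fault lies above the stack as seen at _start.
        "bytes_above_start_rsp": crash - rsp_start,
        # Distance from the stack pointer at the time of the crash.
        "bytes_above_current_rsp": crash - rsp_now,
    }


report = describe(CRASH_ADDRESS, RAX, RSP_AT_CRASH, RSP_AT_START)
for name, value in report.items():
    print(f"{name}: {value} ({value:#x})")
```

The fault address sits 8 bytes past rax, about 70 KB above the stack pointer recorded for _start, and about 90 KB above the stack pointer at the time of the crash.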
,
Aug 20 2016
Thanks Victor for digging into this so deeply. It's been a longstanding problem. +dpranke as FYI for optimization level change above.
,
Aug 20 2016
I think I've confirmed that going from -O2 to -O3 increases the crash rate; however, reverting to -O2 doesn't seem to fix it 100%. So far "-O2 -fno-omit-frame-pointer" seems to work 100%; I'll do some extended tests to confirm. I suspect that -O3 is just doing more inlining, whereas -O2 still inlines but less often. AFAIK any -O level could imply -fomit-frame-pointer, which makes it unsafe to call GetStackTrace(). Perhaps the solution is to always have -fno-omit-frame-pointer on Debug builds.
,
Aug 22 2016
Confirmed #11, adding -fno-omit-frame-pointer results in no more failures in 900 runs on ToT. I'm going to add this flag to debug builds.
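A rough way to judge how convincing a run of clean iterations is: if the crash were still present at the lowest failure rate seen in the bisect log earlier in this thread, how likely would it be to see zero failures by chance? The sketch below assumes each run is independent with a constant failure probability, which is a simplification for a flaky crash; the helper name `prob_all_pass` is made up for illustration.

```python
def prob_all_pass(fail_rate, runs):
    """Probability of zero failures in `runs` independent runs."""
    return (1.0 - fail_rate) ** runs


# (failures, iterations) for some BAD revisions from the bisect log.
observed = {
    "r398726": (3, 116),
    "r398709": (3, 79),
    "r398707": (3, 74),
}

# Use the lowest observed rate as the most pessimistic case for detection.
lowest_rate = min(fails / runs for fails, runs in observed.values())

p300 = prob_all_pass(lowest_rate, 300)
p900 = prob_all_pass(lowest_rate, 900)

print(f"lowest observed failure rate: {lowest_rate:.4f}")
print(f"chance of 0/300 if still broken: {p300:.2e}")
print(f"chance of 0/900 if still broken: {p900:.2e}")
```

Under these assumptions a clean 300-run step is already unlikely to be a false "good" (well under 0.1%), and 900 clean runs makes it far less likely still.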
,
Aug 22 2016
Removing JavaScript label.
,
Aug 22 2016
Thanks very much Victor for tracking this down. Blocking it on the root cause bug.
,
Aug 23 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4e69ee6824fc94c59762b5f05f9f340fb4466d7f

commit 4e69ee6824fc94c59762b5f05f9f340fb4466d7f
Author: vmiura <vmiura@chromium.org>
Date: Tue Aug 23 01:58:27 2016

Explicitly ask for stack frame pointers on Debug posix builds.

GCC / LLVM can omit stack frames at any optimization level. We use -Os for Android Debug, and -O3 for targets like v8. This can cause the runtime stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.

R=brettw@chromium.org
R=dpranke@chromium.org
BUG= 636489
Review-Url: https://codereview.chromium.org/2266073002
Cr-Commit-Position: refs/heads/master@{#413628}

[modify] https://crrev.com/4e69ee6824fc94c59762b5f05f9f340fb4466d7f/build/config/compiler/BUILD.gn
,
Aug 23 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/8021d966e853806efec25188e320e45e0bc0bc8b

commit 8021d966e853806efec25188e320e45e0bc0bc8b
Author: johnme <johnme@chromium.org>
Date: Tue Aug 23 13:39:20 2016

Revert of Explicitly ask for stack frame pointers on Debug posix builds. (patchset #1 id:1 of https://codereview.chromium.org/2266073002/ )

Reason for revert:
This broke all Android x86/x64 debug bots, for example:
https://build.chromium.org/p/chromium.android/builders/Android%20x86%20Builder%20%28dbg%29/builds/7886
https://build.chromium.org/p/chromium.android/builders/Android%20x64%20Builder%20%28dbg%29/builds/7915

It broke because ffmpeg expects to be compiled with -fomit-frame-pointer so that files like third_party/ffmpeg/libavcodec/x86/mpegaudiodsp.c:86 can use an extra register; globally applying -fno-omit-frame-pointer appears to have caused it to run out of registers.

If you look at third_party/ffmpeg/ffmpeg.gyp:260, you'll see it removes this flag if it has been set globally:

'cflags!': [
  '-fno-omit-frame-pointer',
],

But the GN equivalent third_party/ffmpeg/BUILD.gn can't do this, because it's an error in GN to remove a flag that hasn't been set.

A clean solution is probably to create a new config in build/config/compiler/BUILD.gn providing the default no-omit-frame-pointer cflag, that is always included, then third_party/ffmpeg/BUILD.gn can unconditionally remove that config and set its own omit-frame-pointer cflag.

Original issue's description:
> Explicitly ask for stack frame pointers on Debug posix builds.
>
> GCC / LLVM can omit stack frames at any optimization level. We use -Os
> for Android Debug, and -O3 for targets like v8. This can cause the runtime
> stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.
>
> R=brettw@chromium.org
> R=dpranke@chromium.org
> BUG= 636489
>
> Committed: https://crrev.com/4e69ee6824fc94c59762b5f05f9f340fb4466d7f
> Cr-Commit-Position: refs/heads/master@{#413628}

TBR=brettw@chromium.org,dpranke@chromium.org,kbr@chromium.org,vmiura@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 636489
Review-Url: https://codereview.chromium.org/2269063002
Cr-Commit-Position: refs/heads/master@{#413722}

[modify] https://crrev.com/8021d966e853806efec25188e320e45e0bc0bc8b/build/config/compiler/BUILD.gn
,
Aug 25 2016
This still seems to be flaking on latest builds: https://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29/builds/66632 https://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29/builds/66630 https://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29/builds/66629 Are there more steps to take to resolve this bug?
,
Aug 26 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6

commit d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6
Author: vmiura <vmiura@chromium.org>
Date: Fri Aug 26 01:33:22 2016

Add a GN build configuration for controlling stack frame generation.

This is part 1 of staging crrev.com/2266073002.

Part 1: Add empty config("default_stack_frames").
Part 2: Disable this config in third_party/ffmpeg.
Part 3: Move Chromium -fomit/-fno-omit-frame-pointer logic into this config.

TBR=brettw@chromium.org
TBR=dpranke@chromium.org
BUG= 636489
Review-Url: https://codereview.chromium.org/2280533004
Cr-Commit-Position: refs/heads/master@{#414618}

[modify] https://crrev.com/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6/build/config/BUILDCONFIG.gn
[modify] https://crrev.com/d3b4af41bc9b058a3548b2beda4cf1d1e84c68e6/build/config/compiler/BUILD.gn
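Why the fix is staged in three parts can be shown with a toy model. This is not GN and does not use any GN API; it is a small Python sketch that only mirrors the rule quoted in the revert (removing something that is not set is an error). The class `Target`, the helper `effective_cflags`, and the config name `ffmpeg_omit_frame_pointer` are hypothetical names for illustration.

```python
class Target:
    """Toy build target holding an ordered list of config names."""

    def __init__(self, name, default_configs):
        self.name = name
        self.configs = list(default_configs)

    def remove_config(self, config):
        # Mirrors the rule from the revert message: removing something
        # that has not been set is an error.
        if config not in self.configs:
            raise ValueError(f"{self.name}: {config} is not set")
        self.configs.remove(config)


def effective_cflags(target, config_flags):
    """Concatenate the flags contributed by each config on the target."""
    flags = []
    for config in target.configs:
        flags.extend(config_flags.get(config, []))
    return flags


# Part 1: the config exists in the defaults but contributes no flags yet.
config_flags = {"default_stack_frames": []}
defaults = ["default_stack_frames"]

# Part 2: ffmpeg can now remove it safely and supply its own flag.
ffmpeg = Target("ffmpeg", defaults)
ffmpeg.remove_config("default_stack_frames")
config_flags["ffmpeg_omit_frame_pointer"] = ["-fomit-frame-pointer"]
ffmpeg.configs.append("ffmpeg_omit_frame_pointer")

# Part 3: the default config starts carrying the Debug flag.
config_flags["default_stack_frames"] = ["-fno-omit-frame-pointer"]
v8 = Target("v8", defaults)

print("v8:", effective_cflags(v8, config_flags))
print("ffmpeg:", effective_cflags(ffmpeg, config_flags))
```

Because the config is always present in the defaults from part 1 onward, the removal in part 2 never hits the "not set" error, and part 3 changes flags for every other target without touching ffmpeg.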
,
Aug 26 2016
> Are there more steps to take to resolve this bug?
Yep, working on a few more patches to stage the fix without breaking third_party/ffmpeg.
,
Aug 26 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/third_party/ffmpeg/+/35740fc7b72ac1d9adff69e67f3f61b639484dc3

commit 35740fc7b72ac1d9adff69e67f3f61b639484dc3
Author: Victor Miura <vmiura@chromium.org>
Date: Fri Aug 26 01:51:28 2016

Override the default_stack_frames GN config.

Chromium default flags in Debug builds will be changed to include -fno-omit-frame-pointer as part of the 'default_stack_frames' config which breaks ffmpeg compile on Android x86.

This CL disables the 'default_stack_frames' config, so we can keep '-fomit-frame-pointer' for ffmpeg.

BUG= 636489
Change-Id: I5ddf565cb0d720099d0d1a06a337045e722192b7
Reviewed-on: https://chromium-review.googlesource.com/376130
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>

[modify] https://crrev.com/35740fc7b72ac1d9adff69e67f3f61b639484dc3/BUILD.gn
,
Aug 27 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/58ceb62e15a832d540fd461b1985ffb4d96d85f1

commit 58ceb62e15a832d540fd461b1985ffb4d96d85f1
Author: vmiura <vmiura@chromium.org>
Date: Sat Aug 27 10:48:36 2016

Roll third_party\ffmpeg 75976ae02..edafabaee (1 commit)

https://chromium.googlesource.com/chromium/third_party/ffmpeg.git/+log/75976ae02..edafabaee

$ git log 75976ae02..edafabaee --date=short --no-merges --format='%ad %ae %s'
2016-08-25 vmiura@chromium.org Override the default_stack_frames GN config.

TBR=dalecurtis@chromium.org
BUG= 636489
Review-Url: https://codereview.chromium.org/2286833003
Cr-Commit-Position: refs/heads/master@{#414902}

[modify] https://crrev.com/58ceb62e15a832d540fd461b1985ffb4d96d85f1/DEPS
,
Aug 29 2016
@vmiura - can you confirm if my understanding is correct from this bug thread and the related CLs ... ?
1) tcmalloc in a debug config needs frame pointers in order to get a valid stack trace.
2) it appears that the compiler is free to omit frame pointers arbitrarily, regardless of debug / optimize setting, unless -f[no-]omit-frame-pointer is explicitly set.
3) the :ffmpeg_internal needs -fomit-frame-pointer to be set, regardless of the debug and optimization settings.
4) Hence, it would follow if tcmalloc was called from inside an ffmpeg_internal callsite, it would be unhappy.
5) 4) isn't a problem because ... ?
6) -fomit-frame-pointer produces smaller binaries (unsurprisingly) so we want this to be on on official builds and possibly also release builds. Do we really care about binary size on release (not official) builds, though?
,
Aug 30 2016
@dpranke yes pretty much summed it up.

> 1) tcmalloc in a debug config needs frame pointers in order to get a valid stack trace.
>
> 2) it appears that the compiler is free to omit frame pointers arbitrarily, regardless
> of debug / optimize setting, unless -f[no-]omit-frame-pointer is explicitly set.
>
> 3) the :ffmpeg_internal needs -fomit-frame-pointer to be set, regardless of the debug
> and optimization settings.

Correct.

> 4) Hence, it would follow if tcmalloc was called from inside an ffmpeg_internal callsite,
> it would be unhappy.
>
> 5) 4) isn't a problem because ... ?

Yes, it may be unhappy. I'm not sure we've seen a failure from it in practice, because of... blind luck. ffmpeg has used -fomit-frame-pointer for a long time, and I'm making the assumption it's OK to keep the status quo for now. It is worth following up.

Either ffmpeg doesn't use tcmalloc, or the stars didn't align to cause it to break. We run a lot of v8 code across our bots, and somehow only gpu_process_launch_tests triggered this in a somewhat repeatable manner (~10% of full suite runs).

> 6) -fomit-frame-pointer produces smaller binaries (unsurprisingly) so we want this to
> be on on official builds and possibly also release builds. Do we really care about
> binary size on release (not official) builds, though?

+primiano@ do we care about increasing Clank binary size on Release, non-official builds?

I took the "if it ain't broken" option for now. tcmalloc would be OK if it's not a Debug build, and profiling or sanitizing builds will force the stacks on.
,
Aug 30 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91

commit ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91
Author: vmiura <vmiura@chromium.org>
Date: Tue Aug 30 04:36:34 2016

Explicitly ask for stack frame pointers on Debug posix builds.

GCC / LLVM can omit stack frames at any optimization level. We use -Os for Android Debug, and -O3 for targets like v8. This can cause the runtime stack unwinding debug feature of 'tcmalloc' to mis-unwind and crash.

This CL adds "-fno-omit-frame-pointer" by default to Debug builds, as well as fixing consistency for the setting in profiling and sanitizer builds.

The -f*omit-frame-pointer flag settings are moved to config("default_stack_frames"), to enable build targets to disable and override the default settings.

R=brettw@chromium.org
R=dpranke@chromium.org
BUG= 636489
Review-Url: https://codereview.chromium.org/2266073002
Cr-Commit-Position: refs/heads/master@{#415102}

[modify] https://crrev.com/ecdecf5e9fec3c61496f3cedd78c0ba146eb5b91/build/config/compiler/BUILD.gn
,
Aug 30 2016
@dpranke, FWIW I was thinking about why the tcmalloc crashes are so infrequent. The stack unwinder uses a few heuristics to know when to give up, for example if it gets a stack pointer that is decreasing, or is > 100000 bytes away from the current stack pointer, it gives up. When the stack frame is bad it's very likely to fail these checks. For a crash to occur, the unwinder has to hit a pointer which is very close to a real looking stack pointer, but is actually out of bounds and not mmapped. https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=232 I notice that there is also a case that uses msync() to verify the pointer before dereferencing it, however this seems slow to use frequently, and looks disabled in the tcmalloc case. https://cs.chromium.org/chromium/src/third_party/tcmalloc/vendor/src/stacktrace_x86-inl.h?q=stacktrace_x86-inl.h&sq=package:chromium&dr=C&l=266
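The heuristics described above can be illustrated with a small simulation. This is not the tcmalloc code; it is a simplified Python model over a fake memory map, assuming 8-byte slots with the saved frame pointer at the frame address and the return address in the next slot. The names `walk_frames`, `Segfault`, and the three sample memory maps are hypothetical, and the distance check here compares against the previous frame pointer rather than the live stack pointer.

```python
MAX_FRAME_DISTANCE = 100000  # threshold quoted in the comment above


class Segfault(Exception):
    """Raised when the simulated unwinder reads an unmapped address."""


def read_word(memory, addr):
    if addr not in memory:
        raise Segfault(hex(addr))
    return memory[addr]


def walk_frames(memory, frame_ptr, max_depth=32):
    """Follow saved frame pointers and collect return addresses."""
    trace = []
    while len(trace) < max_depth:
        next_fp = read_word(memory, frame_ptr)          # saved caller frame pointer
        trace.append(read_word(memory, frame_ptr + 8))  # return address slot
        # Caller frames live at higher addresses, so a value that does not
        # increase is treated as bogus (this also stops on a 0 terminator).
        if next_fp <= frame_ptr:
            break
        # A jump that is too large is also treated as bogus.
        if next_fp - frame_ptr > MAX_FRAME_DISTANCE:
            break
        frame_ptr = next_fp
    return trace


# A well-formed chain of three frames ending in a 0 frame pointer.
HEALTHY = {
    0x1000: 0x1040, 0x1008: 0xAAA1,
    0x1040: 0x1080, 0x1048: 0xAAA2,
    0x1080: 0x0,    0x1088: 0xAAA3,
}
# The "frame pointer" slot holds unrelated data far from the stack.
FAR_GARBAGE = {0x1000: 0x9000000, 0x1008: 0xAAA1}
# The slot holds a plausible-looking value that is not actually mapped.
NEAR_MISS = {0x1000: 0x1000 + 90000, 0x1008: 0xAAA1}

print([hex(a) for a in walk_frames(HEALTHY, 0x1000)])
print([hex(a) for a in walk_frames(FAR_GARBAGE, 0x1000)])
try:
    walk_frames(NEAR_MISS, 0x1000)
except Segfault as fault:
    print("simulated SIGSEGV at", fault)
```

In this model, garbage that lands far away is rejected by the distance check, while garbage that lands within the threshold but outside mapped memory is dereferenced and faults, which matches the explanation of why the crash is rare but not impossible.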
,
Aug 30 2016
> +primiano@ do we care about increasing Clank binary size on Release, non-official builds?
Uhm no. I think it's already the case that base and some other targets have different -O values on official vs non-official, right? Just not sure why the fix should be only for non-official.
,
Aug 30 2016
> Just not sure why the fix should be only for non-official.
The fix is only needed in DEBUG builds, because they enable stack tracing in tcmalloc. The reason (I assume) to change Release builds to -fno-omit-frame-pointer is to potentially get better stack dumps from crashes. It may be good but isn't the P1 issue.
,
Aug 30 2016
> The reason (I assume) to change Release builds to -fno-omit-frame-pointer is
> to potentially get better stack dumps from crashes.
Right.
,
Aug 30 2016
>> The reason (I assume) to change Release builds to -fno-omit-frame-pointer is
>> to potentially get better stack dumps from crashes.
>
> Right.
The stack walker we use for minidumps uses extra debug information, and is actually able to unwind these stacks that tcmalloc fails on. Comment #9 has an example, where GetStackTrace crashed, but the minidump got the full stack up to '_start'. For each frame it lists what it used:
Found by: call frame info
Found by: stack scanning
,
Aug 30 2016
Breakpad definitely uses a different and far more reliable way to get stack traces (which happens after the fact, not on the device). In general the Chrome unwinder itself (in base/debug) should be more reliable, as it uses libunwind, which in turn uses unwind tables from CFI, which don't require frame pointers. AFAIK unwind tables are always there on Linux; we strip them only on Android (where we don't have tcmalloc) to save binary size.
So, from what I've learned in this bug, it looks like the tcmalloc unwinder doesn't use unwind tables and always uses frame pointers (very likely for performance reasons), but this mechanism seems fragile (fun fact: I am reviewing some unrelated stack unwinding code that uses FP and we are seeing precisely these kinds of problems).
Beyond whatever band-aid we are going to apply in the short term, at this point I am just curious about debugallocation: does tcmalloc unwind the stack (with its own stack unwinder, which seems fragile) on every free? Or only when it detects that a free is invalid? In the first case, does anybody have an idea why? Maybe we should look into that and (if it doesn't require forking anything other than some config.h) get rid of that specific part of debugallocation as a longer-term fix?
,
Aug 31 2016
,
Nov 8 2016
vmiura: Is there more work to do here?
,
Nov 9 2016
Sorry I crossed tracks, ignore comment #33. This issue is fixed.
Comment 1 by kbr@chromium.org
, Aug 10 2016