Some traces from external libraries are missing with --enable-heap-profiling=native |
|||||||||
Issue description
Looks like something is missing from TraceStackFramePointers. Not sure if this is libx11 missing frame pointers or what. Would be nice to figure out.
Copying from internal thread with tasak@
------
I've finished investigating why dropped.
This is a bug of base::debug::TraceStackFramePointers. in base/debug/stack_trace.cc.
I replaced the function with backtrace() defined in execinfo.h and ran chrome with --enable-heap-profiling=native (... it's too slow...)
So I saw:
[Thead: CrBrowserMain]
_start
__libc_start_main
ChromeMain
...
StartupBrowserCreatorImpl::Launch
...
StartupBrowserCreatorImpl::ProcessSpecifiedURLs
StartupBrowserCreatorImpl::OpenTabsInBrowser
content::Navigate
content::WebContents::Create
content::WebContentsImpl::CreateWithOpener
views::Widget::Init
views::DesktopNativeWidgetAura::InitNativeWidget
....
ui:SkBitmapToXcusorImageCreate
XCursorImageCreate
The XCursorImageCreate was reported as "unknown". The information about who invoked was dropped. However, now we have entire stacktraces from _start to XCursorImageCreate.
-----
,
Jul 21 2016
I'd say that it's rather important to fix this bug to explain the 30% of "unknown" objects in the native-stack profiler. Do we have any plan to fix this?
,
Jul 21 2016
Discussed offline. Our plan is to do the following two things. a) Integrate the task categorization with the native-stack profiler. This will give us mostly precise results. b) Experiment with fully unwinding frame pointers of third-party libraries. a) is blocked by tracing v2. b) is not blocked. Dimitry is planning to try b).
,
Aug 22 2016
Got it working on Linux, will submit CL soon. But on Android it's harder, because on the main thread we're always called from JNI, and we always try to scan the stack, and we never succeed, only find some garbage like two last frames in the following stack: ... 08-22 12:40:13.418 10670 10670 E chromium: #16 0xd573ee6b /data/app/com.google.android.apps.chrome-3/lib/arm/libchrome.so+0x0051ce6b 08-22 12:40:13.418 10670 10670 E chromium: #17 0xd55063e3 /data/app/com.google.android.apps.chrome-3/lib/arm/libchrome.so+0x002e43e3 08-22 12:40:13.418 10670 10670 E chromium: #18 0xe101303e /data/app/com.google.android.apps.chrome-3/oat/arm/base.odex+0x00aa503e 08-22 12:40:13.418 10670 10670 E chromium: #19 0xebad806f /system/usr/icu/icudt55l.dat+0x0001906f 08-22 12:40:13.418 10670 10670 E chromium: #20 0xebad8073 /system/usr/icu/icudt55l.dat+0x00019073 So, should I try to make it work on Android? I have some ideas, but there will be runtime cost...
,
Aug 22 2016
tasak@: Would you re-try the native stack profiler once dskiba@ lands the CL for Linux?
,
Aug 24 2016
Actually, I would appreciate it if you could check it before it lands. The CL is here: https://codereview.chromium.org/2276813002, but you need to apply parent CL first, so: git cl patch 2203053003 git cl patch 2276813002
,
Aug 24 2016
tasak@: Please check it today so that dskiba@ can land it.
,
Aug 25 2016
Sorry. I'm still working in progress.
,
Aug 29 2016
So would you give me an advice about how to see this CL? I did: - I prepared 2 chrome (enable_profiling=true), i.e. one was trunk and the other was patched trunk. - For each chrome, I captured trace.json by using chrome://tracing, and ran symbolize_trace to solve pc:address. - I reopened 2 trace.json (by using chrome://tracing), and compared. So ... both trace.json have [unknown] and have stacktraces containing </lib/x86_64-linux-gpu/libglib-2.0.so.0.4002.0>. I think, probably I made some mistakes...
,
Sep 1 2016
So the CL that is being reviewed scans 512 bytes of stack to find frames once we stumble on a bad frame. Here are allocations from CrBrowserMain from three different unwinding modes: No Scanning (i.e. without the CL): 88.5 KiB <[stack]> 36.3 KiB <unnamed> 1.4 KiB cairo_debug_reset_static_data 1.5 KiB _XUpdateAtomCache 74.6 KiB <unresolved> 2.2 KiB char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) 2.2 KiB ShimCppNew 3.4 KiB ValidateChunkAMD64 4.0 KiB gpu::GpuControlList::LoadList(std::string const&, gpu::GpuControlList::OsFilter) 742.6 KiB g_main_context_dispatch 4.4 KiB PrefRegistrySimple::RegisterBooleanPref(std::string const&, bool) 5.6 KiB dlopen_doit 6.0 KiB ft_mem_qrealloc 6.5 KiB __GI__dl_allocate_tls 7.2 KiB pa_xmalloc0 7.9 KiB _XEnq 19.4 KiB ft_mem_qalloc 19.7 KiB _dl_new_object 28.5 KiB </lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0> 32.2 KiB XcursorImageCreate 40.5 KiB g_malloc 43.7 KiB g_realloc 60.3 KiB g_malloc0 354.9 KiB std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) 3.4 MiB __libc_start_main 132.0 KiB <other> Scanning 512 bytes: 4.5 KiB <unresolved> 2.0 KiB g_object_new 7.9 KiB <unnamed> 4.1 KiB ft_mem_qrealloc 11.4 KiB ft_mem_qalloc 14.0 KiB g_malloc 19.7 KiB _dl_new_object 30.3 KiB g_realloc 51.2 KiB g_malloc0 121.0 KiB </lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0> 4.7 MiB __libc_start_main 4.4 KiB <other> Scanning 8192 bytes: 3.1 KiB ft_mem_qrealloc 5.1 MiB __libc_start_main So while scanning 512 bytes clearly improves situation, it looks like scanning 8 KiB gives better results. However, there is probably performance penalty to that, I'll measure it and come back here.
,
Sep 1 2016
Wait I'm just confused about one thing: why scanning more (8k instead of 512 bytes) gives you *less* frames? SO far the 512bytes output seems the most realistic, but at this point I'm curious why increasing to 8k makes such a big difference.
,
Sep 1 2016
Ah, I forgot to mention that those are top-level entries for CrBrowserMain. So ideally we should see a single entry there, __libc_start_main, but if we fail to unwind to that frame and stop somewhere else, that frame ends in the top-level. 512 vs 8192 means that 3rd party libraries that we call (and which then call us back) are sometimes using more than 512 bytes of stack.
,
Sep 1 2016
Ahh I see, all that stuff happened in 3rd party libraries then. I though it was chrome frames.
,
Sep 7 2016
OK, I timed TraceStackFramePointers() by adding a histogram to AllocationContextTracker::GetContextSnapshot(). Increasing scan area to 8192 didn't affect timing at all, so I'm bumping it. Overall, scanning costs us ~100 nanoseconds (+25%) per call.
No scanning:
Histogram: A-TraceStackFramePointers(ns) recorded 2158421 samples, average = 394.5
0 ...
114 O (1 = 0.0%) {0.0%}
135 -------------------------------------------------------------O (252458 = 11.7%) {0.0%}
160 ------------------------------------------------------O (223723 = 10.4%) {11.7%}
190 -----------------------------------------------------------------O (270543 = 12.5%) {22.1%}
226 -----------------------------------------O (170961 = 7.9%) {34.6%}
268 --------------------------------O (134364 = 6.2%) {42.5%}
318 ------------------------------O (125542 = 5.8%) {48.7%}
378 ---------------------------------------------------O (213111 = 9.9%) {54.6%}
449 ------------------------------------------------O (198925 = 9.2%) {64.4%}
533 ------------------------------------------------------------------------O (299522 = 13.9%) {73.6%}
633 ----------------------------------------O (168217 = 7.8%) {87.5%}
752 -----------O (45935 = 2.1%) {95.3%}
894 -------O (29470 = 1.4%) {97.4%}
1062 ---O (10969 = 0.5%) {98.8%}
1262 -O (5809 = 0.3%) {99.3%}
1500 -O (4622 = 0.2%) {99.6%}
1782 -O (2382 = 0.1%) {99.8%}
2117 O (1014 = 0.0%) {99.9%}
2516 O (180 = 0.0%) {100.0%}
2990 O (57 = 0.0%) {100.0%}
3553 O (23 = 0.0%) {100.0%}
4222 O (36 = 0.0%) {100.0%}
5017 O (40 = 0.0%) {100.0%}
5961 O (47 = 0.0%) {100.0%}
7083 O (50 = 0.0%) {100.0%}
8416 O (34 = 0.0%) {100.0%}
10000 O
Scan area = 512:
Histogram: A-TraceStackFramePointers(ns) recorded 1515524 samples, average = 517.8
0 ...
135 ----------------------------------------------------O (117984 = 7.8%) {0.0%}
160 --------------------------------------O (85123 = 5.6%) {7.8%}
190 ---------------------------------------------------O (114894 = 7.6%) {13.4%}
226 ---------------------------------------------------------O (128872 = 8.5%) {21.0%}
268 ----------------------------------------------O (104038 = 6.9%) {29.5%}
318 ------------------------------------------------O (108763 = 7.2%) {36.4%}
378 ---------------------------------------------------------------------O (156053 = 10.3%) {43.5%}
449 --------------------------------------------------------O (125668 = 8.3%) {53.8%}
533 ------------------------------------------------------------------------O (163028 = 10.8%) {62.1%}
633 ---------------------------------------O (87784 = 5.8%) {72.9%}
752 ------------------------------------------------O (107630 = 7.1%) {78.7%}
894 --------------------------------------------------------O (127631 = 8.4%) {85.8%}
1062 --------------------O (44649 = 2.9%) {94.2%}
1262 --------O (19123 = 1.3%) {97.1%}
1500 ------O (14293 = 0.9%) {98.4%}
1782 --O (5573 = 0.4%) {99.3%}
2117 -O (2067 = 0.1%) {99.7%}
2516 -O (1364 = 0.1%) {99.8%}
2990 O (391 = 0.0%) {99.9%}
3553 O (116 = 0.0%) {100.0%}
4222 O (46 = 0.0%) {100.0%}
5017 O (21 = 0.0%) {100.0%}
5961 O (33 = 0.0%) {100.0%}
7083 O (40 = 0.0%) {100.0%}
8416 O (23 = 0.0%) {100.0%}
10000 O
Scan area = 8192:
Histogram: A-TraceStackFramePointers(ns) recorded 1960225 samples, average = 501.2
0 ...
135 -----------------------------------------------O (129803 = 6.6%) {0.0%}
160 ------------------------------------------------------------------------O (200897 = 10.2%) {6.6%}
190 ----------------------------------------------------------------------O (194457 = 9.9%) {16.9%}
226 -----------------------------------------------O (131631 = 6.7%) {26.8%}
268 ---------------------------------------------O (125332 = 6.4%) {33.5%}
318 -------------------------------------------------O (135780 = 6.9%) {39.9%}
378 -------------------------------------------------------------------O (187126 = 9.5%) {46.8%}
449 -----------------------------------------------------------O (165970 = 8.5%) {56.4%}
533 ---------------------------------------------------------------------O (191590 = 9.8%) {64.8%}
633 ----------------------------------------O (110963 = 5.7%) {74.6%}
752 -----------------------------------------O (114912 = 5.9%) {80.3%}
894 --------------------------------------------------------O (156789 = 8.0%) {86.1%}
1062 ----------------------O (60539 = 3.1%) {94.1%}
1262 --------O (23371 = 1.2%) {97.2%}
1500 ------O (16938 = 0.9%) {98.4%}
1782 ---O (7795 = 0.4%) {99.3%}
2117 -O (2843 = 0.1%) {99.7%}
2516 -O (1744 = 0.1%) {99.8%}
2990 O (695 = 0.0%) {99.9%}
3553 O (285 = 0.0%) {99.9%}
4222 O (75 = 0.0%) {100.0%}
5017 O (65 = 0.0%) {100.0%}
5961 O (61 = 0.0%) {100.0%}
7083 O (64 = 0.0%) {100.0%}
8416 O (44 = 0.0%) {100.0%}
10000 O
,
Sep 8 2016
I built the patched chrome kMaxTraces=162, and ran telemetry with enable-heap-profiling=native. It ran much faster than chrome using backtrace() for stacktrace.
,
Sep 8 2016
awesome, thanks for the thorough data Dmitry. +1 for the bump
,
Oct 25 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a8c951effbf899ae4951c67d606a634905f88ab4 commit a8c951effbf899ae4951c67d606a634905f88ab4 Author: dskiba <dskiba@google.com> Date: Tue Oct 25 23:39:11 2016 Link stack frames of JNI stubs to JNI callbacks. Chrome's native heap profiler relies on stack frame pointers to unwind. That is fast, but requires that all code in a call chain is built with frame pointer support. Which is not the case for Java / JNI code on Android, so unwinding stops prematurely once it crosses JNI boundary. This CL exploits the fact that both JNI stubs (which call into Java) and JNI callbacks (which are called by Java) are generated. It changes JNI generator to save stack frame pointer in a JNI stub, and later link saved frame pointer to a frame pointer of a JNI callback. That repairs broken stack frame chain and allows unwinder to yield complete trace. Changes in this CL are active only for profiling builds (ones that build with enable_profiling=true). Otherwise JNI generator produces exactly the same code as before (modulo removed empty lines). Example of a complete trace made possible by this CL: <.../base.odex> Java_org_chromium_base_SystemMessageHandler_nativeDoRunLoopOnce base::MessageLoop::DoWork base::MessageLoop::DeferOrRunPendingTask base::MessageLoop::RunTask base::debug::TaskAnnotator::RunTask cc::SingleThreadProxy::BeginMainFrame cc::SingleThreadProxy::DoBeginMainFrame content::CompositorImpl::UpdateLayerTreeHost..............................#4 <.../base.odex> Java_org_chromium_chrome_..._ToolbarSceneLayer_nativeUpdateToolbarLayer...#3 android::ToolbarSceneLayer::UpdateToolbarLayer android::ToolbarLayer::PushResource ui::ResourceManagerImpl::GetResource......................................#2 <.../base.odex> Java_org_chromium_ui_resources_ResourceManager_nativeOnResourceReady......#1 ui::ResourceManagerImpl::OnResourceReady gfx::CreateSkBitmapFromJavaBitmap ...more frames from Chrome... Previously unwinding would've stopped at #1. Code added by JNI generator links frames #1->#2 and #3->#4. BUG= 624362 Review-Url: https://codereview.chromium.org/2361353002 Cr-Commit-Position: refs/heads/master@{#427534} [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_android.cc [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_android.h [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/golden_sample_for_tests_jni.h [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/jni_generator.py [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/jni_generator_tests.py [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testCalledByNatives.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testConstantsFromJavaP.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testFromJavaP.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testFromJavaPGenerics.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testMultipleJNIAdditionalImport.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testNativeExportsOnlyOption.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/android/jni_generator/testSingleJNIAdditionalImport.golden [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/debug/stack_trace.cc [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/base/debug/stack_trace.h [modify] https://crrev.com/a8c951effbf899ae4951c67d606a634905f88ab4/build/config/android/rules.gni
,
Nov 25 2016
triaging pass: I think this has been fixed after all the changes to the stack scanner right?
,
Nov 28 2016
There is one case still not fixed: stack scanning on Linux on non-main threads. This is because we still don't have a good way to get stack size / address for those threads, see issue 617730 .
,
Jan 4 2017
,
Mar 3 2017
,
May 30 2018
Chrome switched to memlog for native heap profiling, and memlog has sampling mode which unwinds way less, making perf acceptable even with CFI unwinding (not to mention that on Android we have optimized CFI). |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by dskiba@chromium.org
, Jun 29 2016Yeah, this is expected. Libraries on Linux don't have frame pointers, so TraceStackFramePointers() can't unwind into them. This is a tradeoff - as tasak@ discovered backtrace() is too slow. I guess we can try scanning the stack to find first valid frame (which will be in Chrome's code) and unwind from there. That would get us something like this: [Thead: CrBrowserMain] _start __libc_start_main ChromeMain ... StartupBrowserCreatorImpl::Launch ... StartupBrowserCreatorImpl::ProcessSpecifiedURLs StartupBrowserCreatorImpl::OpenTabsInBrowser content::Navigate content::WebContents::Create content::WebContentsImpl::CreateWithOpener views::Widget::Init views::DesktopNativeWidgetAura::InitNativeWidget .... ui:SkBitmapToXcusorImageCreate <libx11.so> XLastFunctionFromTheLibrary malloc ShimXXX InsertAllocation I.e. we would skip all frames inside libx11, except for the last one (which called malloc).