43% regression in tracing_perftests at 478515:478520
Issue description: See the link to graphs below.
Jun 12 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8976981331434898784
Jun 12 2017
Suspecting clang switch
Jun 12 2017
+thakis, TL;DR: some tracing perftests have jumped on all Android bots after the switch to clang.

fmeawad@, yes, the other CLs in range [1] seem unrelated. It seems that all the Android bots jumped and all of them have the switch to clang in the range (some of them haven't fully caught up; at the time of writing, the clang roll is the very last point for the Nexus 5/5X): https://chromeperf.appspot.com/report?sid=027dd6133223affafb424a412c687cd51e69815ab9a43f5f44938308f3aba92f&start_rev=476588&end_rev=478562

The important data point is that this affected only the benchmarks that use TracedValue, not the ones that use the TRACE_EVENT0 macros. So this is not about the TRACE_EVENT* macros themselves (good, they are hell to reason about) but only TracedValue. There must be something about TracedValue where clang does worse than GCC (specifically, 40-60% worse).

tracing_perftests are quite small and should be easy to repro. They are pretty much a unittest with printf and don't depend on any telemetry harness. The code lives in components/tracing/test/trace_event_perftest.cc and the build target is called tracing_perftests.

[1] https://chromium.googlesource.com/chromium/src/+log/4b325fbec44c64bbe13371ecf29462c4fdd210cc..e4a28698a612c3a99d802e7658960c3a09533cec
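For readers without a Chromium checkout, here is a hedged, standalone sketch of the shape of the regressing benchmark. The names (MiniTracedValue, CreateManyTracedValues) are made up for illustration; the real TracedValue serializes through base::Pickle and is considerably more involved.

```cpp
// Hypothetical, self-contained sketch of what a test like
// TraceEventPerfTest.Create_10000_TracedValue exercises: repeatedly
// building a small dictionary-like value out of string pieces.
// This is NOT the real Chromium code, just the same general shape.
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for base::trace_event::TracedValue: appends key/value pairs
// into a flat byte buffer, roughly where Pickle::WriteString/WriteBytes
// show up in the real profiles.
class MiniTracedValue {
 public:
  void SetString(const char* key, const std::string& value) {
    buffer_.append(key);
    buffer_.push_back('\0');
    buffer_.append(value);
    buffer_.push_back('\0');
  }
  std::size_t size() const { return buffer_.size(); }

 private:
  std::string buffer_;
};

// Mimics the benchmark loop: build N small values and return the total
// number of bytes written, so the work cannot be optimized away.
std::size_t CreateManyTracedValues(int iterations) {
  std::size_t total = 0;
  for (int i = 0; i < iterations; ++i) {
    MiniTracedValue value;
    value.SetString("name", "TraceEventPerfTest");
    value.SetString("arg", "some payload string");
    total += value.size();
  }
  return total;
}
```

Timing a loop like this with two different toolchains is essentially what the perftest's printf output reports.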
Jun 12 2017
issue 732289 tracks other clang-related regressions; don't know if you want to dupe or keep a dedicated bug for this.
Jun 13 2017
So, I gave an initial stab and reproduced locally on a N5. In my repro clang is >3x slower than gcc (~1600 ms vs 440 ms) for TraceEventPerfTest.Create_10000_TracedValue. Unfortunately I have to give up for today as I ran out of my time slot on this. Leaving some notes around for whoever wants to carry on with the investigation.

<rant>I spent 80% of my time really just getting to a state where all our perf tools sort of align and I can use a sampling profiler, sigh. I hate developing on these uncivilized platforms.</rant>

I grabbed a N5 @ google/hammerhead/hammerhead:6.0/MRA59G/2457013:userdebug/dev-keys (I don't think the version of Android is relevant, but it can be useful for symbols).

Minified repro case: https://chromium-review.googlesource.com/c/532955/
GN args: see CL title.

$ ninja -C out/offdroid_prof tracing_profiling
$ adb push out/offdroid_prof/exe.unstripped/tracing_profiling /data/local/tmp/tp_clang
  # Ditto for gcc (building with is_clang=false)
$ adb shell time /data/local/tmp/tp_clang
  0m1.63s real  0m1.59s user  0m0.01s system
$ adb shell time /data/local/tmp/tp_gcc
  0m0.47s real  0m0.44s user  0m0.02s system

The issue reproduces with a minimal repro (with and without setting enable_profiling=true). Good. Now let's try to use simpleperf (https://android.googlesource.com/platform/prebuilts/simpleperf/):

$ adb shell "cd /data/local/tmp; ./simpleperf record -o clang.data --dump-symbols ./tp_clang"
$ adb shell "cd /data/local/tmp; ./simpleperf record -o gcc.data --dump-symbols ./tp_gcc"

And then ./simpleperf report -i clang.data, ditto for gcc.
By glancing at the two results (see attachment), the code produced by clang seems to spend way more time in libc.so.

Clang: head prof-clang.txt (all rows: command t_offdroid_prof, pid/tid 23906)
-----
Overhead  Shared Object                    Symbol
13.80%    /system/lib/libc.so              libc.so[+62130]
12.63%    /system/lib/libc.so              libc.so[+62110]
10.93%    /system/lib/libc.so              libc.so[+62114]
10.33%    /system/lib/libc.so              libc.so[+62134]
 8.68%    /system/lib/libc.so              @plt
 4.38%    /system/lib/libc.so              __memcpy_base
 3.96%    /data/local/tmp/t_offdroid_prof  base::Pickle::WriteBytes(void const*, int)
 3.65%    /system/lib/libc.so              libc.so[+62164]
 3.16%    [kernel.kallsyms]                _raw_spin_unlock_irqrestore
 2.69%    /system/lib/libc.so              memset
 2.02%    /data/local/tmp/t_offdroid_prof  @plt
 1.55%    /system/lib/libc.so              libc.so[+62160]
 1.53%    /data/local/tmp/t_offdroid_prof  base::Pickle::WriteString(base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > const&)
 1.45%    /system/bin/linker               [linker]strcmp
 1.28%    /data/local/tmp/t_offdroid_prof  base::trace_event::TracedValue::SetString(char const*, base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >)
 1.24%    [kernel.kallsyms]                get_page_from_freelist
 1.15%    [kernel.kallsyms]                do_page_fault
-----

GCC: head prof-gcc.txt (all rows: command t_offdroid_gcc_, pid/tid 23911)
-----
Overhead  Shared Object                        Symbol
11.81%    /system/lib/libc.so                  __memcpy_base
 9.48%    [kernel.kallsyms]                    _raw_spin_unlock_irqrestore
 8.93%    /data/local/tmp/t_offdroid_gcc_prof  base::Pickle::WriteBytes(void const*, int)
 6.32%    /system/lib/libc.so                  memset
 3.98%    /data/local/tmp/t_offdroid_gcc_prof  @plt
 3.35%    /data/local/tmp/t_offdroid_gcc_prof  base::Pickle::WriteString(base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > const&)
 3.04%    [kernel.kallsyms]                    get_page_from_freelist
 2.90%    /system/bin/linker                   [linker]strcmp
 2.25%    [kernel.kallsyms]                    do_page_fault
 2.10%    [kernel.kallsyms]                    free_hot_cold_page
 2.09%    [kernel.kallsyms]                    handle_pte_fault
 2.06%    /data/local/tmp/t_offdroid_gcc_prof  main
 1.82%    /data/local/tmp/t_offdroid_gcc_prof  void base::Pickle::WriteBytesStatic<8u>(void const*)
 1.74%    /system/bin/linker                   [linker]SymbolName::elf_hash()
 1.73%    [kernel.kallsyms]                    handle_mm_fault
 1.66%    [kernel.kallsyms]                    __memzero
-----

Now, I'd be really interested to know what those libc.so[+62114] entries are. Unfortunately, for whatever reason, I failed to symbolize them (and I thought that anecdotally I knew how to symbolize stuff on Android). I suspect they are just calls to the allocator.
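Since Pickle::WriteBytes dominates both profiles, here is a hedged, standalone sketch of what that kind of function does (this is NOT the real base::Pickle; MiniPickle and its 4-byte padding are illustrative assumptions based on Pickle's general design):

```cpp
// Illustrative stand-in for base::Pickle::WriteBytes: grow a flat
// buffer and memcpy the payload in, padding the payload size up to a
// 4-byte boundary. Small, hot functions like this are exactly where
// inlining and codegen differences between compilers can dominate a
// microbenchmark.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

class MiniPickle {
 public:
  void WriteBytes(const void* data, std::size_t length) {
    // Round the stored length up to a multiple of 4.
    std::size_t padded = (length + 3u) & ~static_cast<std::size_t>(3u);
    std::size_t old_size = payload_.size();
    payload_.resize(old_size + padded);  // zero-fills the padding bytes
    std::memcpy(payload_.data() + old_size, data, length);
  }
  std::size_t payload_size() const { return payload_.size(); }

 private:
  std::vector<char> payload_;
};
```

If the clang build fails to inline a helper like this (or its fixed-size variant), every serialized field pays a call plus a generic memcpy, which would be consistent with the extra libc.so samples above.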
I downloaded the system image symbols from go/doxkg and then tried both:
- arm-linux-androideabi-addr2line -fCe out/target/product/hammerhead/symbols/system/lib/libc.so
- directly pushing the symbols (libc.so, libc++.so) onto /data/local/tmp/system/lib and passing --symfs /data/local/tmp to simpleperf record && simpleperf report (I double-checked with strace that it is correctly picking up the files)

In neither case is simpleperf picking them up (I suspect this is a technical fallout of Android starting to use ELF packing at some point).

So, why do we end up with all those extra calls? Can't really tell. I glanced at the official (without enable_profiling=true) binaries and they are quite strongly different.

EDIT: I modified the initial repro case and minified it further since I started writing this. I moved the MakeUnique off the loop (what I wrote above was based on the first patchset of the repro case linked above). Results still hold (3x diff) and the perf records are very similar. Unless I did something stupid last minute, this seems to suggest that MakeUnique is not really the problem. I updated the attachments; they refer to patchset 3.

At this point it should be relatively simple to carry on from here. Maybe something with std::string optimizations? Try to rule it out by further simplifying the repro and not using any string.

I am attaching the two unstripped, official, NOT profiling (i.e. enable_profiling=false) binaries of patchset 3 in case somebody wants to pass them to /s/chrome/src/third_party/android_tools/ndk/prebuilt/linux-x86_64/bin/gdb (start from `disass main` and follow from there).

Symbols: https://drive.google.com/drive/folders/0B1oJWfrTOXwxM21abEZxU1Y3OG8?usp=sharing
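On the "calls to the allocator" suspicion: a quick way to reason about whether std::string is hitting the heap is the small-string optimization (SSO). The helper below is a heuristic probe I'm adding for illustration (LikelyHeapAllocated is not a Chromium function); SSO thresholds vary by standard library, commonly around 15 bytes for libc++ and 22 for libstdc++.

```cpp
// Heuristic: if a string's character data lives inside the string
// object itself, SSO is in play and no allocator call was made;
// if the data pointer is outside the object, the bytes were
// heap-allocated. Short trace keys/values should normally fit in the
// SSO buffer; if they don't, every SetString pays a malloc/free pair,
// which would show up as unsymbolized libc samples.
#include <cassert>
#include <string>

bool LikelyHeapAllocated(const std::string& s) {
  const char* obj_begin = reinterpret_cast<const char*>(&s);
  const char* obj_end = obj_begin + sizeof(s);
  // Data pointer outside the object's own storage => heap allocation.
  return s.data() < obj_begin || s.data() >= obj_end;
}
```

A string-free variant of the repro (e.g. only SetInteger-style fields) would rule this hypothesis in or out, as suggested above.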
Jun 13 2017
Let me see if I can manage to nerd snipe a +dskiba from the wild :P
Jun 13 2017
I am wondering if it reproduces on desktop (assuming we can still compile using GCC on Linux). I will pursue that path, as it might be easier to debug.
Jun 13 2017
I didn't try, but:
- we use different -O levels on desktop compared to Android
- the GCC toolchain is not the same (we use the NDK for Android and a Debian sysroot for Linux)
- I believe that even the C++ library is not the same
- I can't tell yet if this is codegen specific (i.e. repros only on ARM) or due to higher-level optimizations (I somewhat suspect the latter, though very weakly).
Jun 13 2017
primiano, if you still have access to your reduced repro and it's not a lot of effort, could you, for both compilers:
* build your mini target with `ninja -v`
* cd to the build dir, paste the compile command for the main file and append `-E -o repro.ii` (should produce roughly the same with each compiler)
* paste the compile command again but append `-S -o foo.s` instead
and attach those files? (If the inefficient code is in the tracing lib and not in the main file, then that won't tell us anything, but it's hopefully in main().)
Jun 13 2017
Re #15: I can take a look tomorrow. The only problem is that the current repro is not a single TU yet (it still involves some base/ pieces). So when you say "the main file" I can't tell whether you want the actual main (I think that's not interesting), the TU that defines TracedValue, the TU that defines the Pickle, or some base::allocator stuff. If you have a strong preference for any of them I'll be happy to do that. But from eyeballing the asm of the main file (https://www.diffchecker.com/KZSqJNr9) there doesn't seem to be anything too big there. Maybe somebody in MTV can take my repro and minify it into a single translation unit (as long as it preserves the 3x slowdown) and we can follow up from there?
Jun 13 2017
I mean the file containing main(). But if you think that that's probably not interesting, then I believe you :-) I'll poke it some.
Jun 13 2017
http://www/~primiano/bugs/crbug_732413 has the invocations, the two .ii files and the two .s files for the main file. I can try more tomorrow if you want; just leave some notes here. Gotta run now.
Jun 14 2017
Maybe GCC inlines more? I see that we use the GCC-specific -finline-limit=64 option, so I wonder how perf changes if you set it to 0, for example. Another interesting thing is to build with enable_profiling=true. I also wonder if building the Clang version with -Os (instead of -Oz) will show any perf difference.
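To illustrate why inlining matters here: the GCC profile shows a specialized base::Pickle::WriteBytesStatic<8u>, i.e. a copy whose size is a compile-time constant. The sketch below is a hypothetical stand-in for that pattern (not the real Chromium signature): when such a call is inlined, a memcpy of a known small size typically lowers to a couple of direct loads/stores; when it isn't, every field write pays a real function call plus a generic memcpy, which would match the extra libc/@plt time in the clang profile.

```cpp
// Hypothetical fixed-size write helper, modeled on the shape of
// Pickle::WriteBytesStatic<N>. Because N is a template parameter, the
// compiler knows the copy size at compile time and, once the call is
// inlined, can replace memcpy with direct loads/stores.
#include <cassert>
#include <cstdint>
#include <cstring>

template <std::size_t N>
void WriteBytesStatic(char* dest, const void* data) {
  std::memcpy(dest, data, N);  // fixed-size copy, inlinable to movs
}
```

Comparing the two compilers' decisions on helpers like this (e.g. via the .s files from #15) is one concrete way to test the inlining hypothesis.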
Nov 7
This regression has been open for half a year. It's not very actionable, and it has been in all Chrome users' hands for months.
Comment 1 by fmea...@chromium.org, Jun 12 2017