New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 732413 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Nov 7
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocked on:
issue 481675



Sign in to add a comment

43% regression in tracing_perftests at 478515:478520

Project Member Reported by fmea...@chromium.org, Jun 12 2017

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=732413

Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1qm8sQoM


Bot(s) for this bug's original alert(s):

android-one
Cc: primiano@chromium.org
Suspecting clang switch

Comment 4 Deleted

Cc: thakis@chromium.org
+thakis, TL;DR some tracing perftests have jumped on all android bots after the switch to clang.

fmeawad@, yes, the other CLs in range [1] seem unrelated. It seems that all the Android bots jumped and all of them have the switch to clang in the range (some of them didn't fully catch up, at the time of writing, the clang roll is the very last point for the Nexus5/5X)

https://chromeperf.appspot.com/report?sid=027dd6133223affafb424a412c687cd51e69815ab9a43f5f44938308f3aba92f&start_rev=476588&end_rev=478562

The important data point is that this affected only the benchmarks that use TracedValue but not the TRACE_EVENT0 macros. 
So this is not about the TRACE_EVENT* macros themselves (good, they are a hell to reason about) but only TracedValue.
So there must be something about TracedValue where clang does worse than GCC (specifically 40-60% worse).

tracing_perftests are quite small and should be easy to repro. They are pretty much a unittest with printf and don't depend on any telemetry harness.
The code lives in components/tracing/test/trace_event_perftest.cc
The build target is called tracing_perftests

[1] https://chromium.googlesource.com/chromium/src/+log/4b325fbec44c64bbe13371ecf29462c4fdd210cc..e4a28698a612c3a99d802e7658960c3a09533cec

Comment 6 by thakis@chromium.org, Jun 12 2017

issue 732289 tracks other clang-related regressions; don't know if you want to dupe or keep a dedicated bug for this.

Comment 7 by thakis@chromium.org, Jun 12 2017

Blockedon: 481675

Comment 8 Deleted

Comment 9 Deleted

Comment 10 Deleted

So, I gave an initial stab and reproduced locally on a N5. In my repro clang is >3x slower than gcc (~1600 ms vs 440 ms) for TraceEventPerfTest.Create_10000_TracedValue
Unfortunately I have to give up for today as I run out of my time slot on this.  Leaving some notes around for whoever wants to carry on with the investigation.
<rant>I spent 80% of my time really just to get to a state where all our perf tools sort of align and I can use a sampling profiler, sigh. I hate developing on these uncivilized platforms.</rant>

I grabbed a N5 @ google/hammerhead/hammerhead:6.0/MRA59G/2457013:userdebug/dev-keys] (I don't think the version of android is relevant, but can be useful for symbols)

Minified repro case:
https://chromium-review.googlesource.com/c/532955/
Gn args: see CL title

$ ninja -C out/offdroid_prof tracing_profiling
$ adb push adb push out/offdroid_prof/exe.unstripped/tracing_profiling /data/local/tmp/tp_clang

# Ditto for gcc (% building with is_clang=false)

$ adb shell time /data/local/tmp/tp_clang
    0m1.63s real     0m1.59s user     0m0.01s system

$ adb shell time /data/local/tmp/tp_gcc
    0m0.47s real     0m0.44s user     0m0.02s system

The issue reproduces with a minimal repro (with and without setting enable_profiling=true). Good.

Now let's try to use simpleperf (https://android.googlesource.com/platform/prebuilts/simpleperf/)

$ adb shell "cd /data/local/tmp; ./simpleperf record -o clang.data  --dump-symbols ./tp_clang"
$ adb shell "cd /data/local/tmp; ./simpleperf record -o gcc.data  --dump-symbols ./tp_gcc"

And then ./simpleperf report -i clang.data, ditto for gcc.

By glancing at the two results (see attachment), the code produced by clang seems to spend way more time in libc.so

Clang: head prof-clang.txt
-----
Overhead  Command          Pid    Tid    Shared Object                    Symbol
13.80%    t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62130]
12.63%    t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62110]
10.93%    t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62114]
10.33%    t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62134]
8.68%     t_offdroid_prof  23906  23906  /system/lib/libc.so              @plt
4.38%     t_offdroid_prof  23906  23906  /system/lib/libc.so              __memcpy_base
3.96%     t_offdroid_prof  23906  23906  /data/local/tmp/t_offdroid_prof  base::Pickle::WriteBytes(void const*, int)
3.65%     t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62164]
3.16%     t_offdroid_prof  23906  23906  [kernel.kallsyms]                _raw_spin_unlock_irqrestore
2.69%     t_offdroid_prof  23906  23906  /system/lib/libc.so              memset
2.02%     t_offdroid_prof  23906  23906  /data/local/tmp/t_offdroid_prof  @plt
1.55%     t_offdroid_prof  23906  23906  /system/lib/libc.so              libc.so[+62160]
1.53%     t_offdroid_prof  23906  23906  /data/local/tmp/t_offdroid_prof  base::Pickle::WriteString(base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > const&)
1.45%     t_offdroid_prof  23906  23906  /system/bin/linker               [linker]strcmp
1.28%     t_offdroid_prof  23906  23906  /data/local/tmp/t_offdroid_prof  base::trace_event::TracedValue::SetString(char const*, base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >)
1.24%     t_offdroid_prof  23906  23906  [kernel.kallsyms]                get_page_from_freelist
1.15%     t_offdroid_prof  23906  23906  [kernel.kallsyms]                do_page_fault
-----

GCC: head prof-gcc.txt
-----
Overhead  Command          Pid    Tid    Shared Object                        Symbol
11.81%    t_offdroid_gcc_  23911  23911  /system/lib/libc.so                  __memcpy_base
9.48%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    _raw_spin_unlock_irqrestore
8.93%     t_offdroid_gcc_  23911  23911  /data/local/tmp/t_offdroid_gcc_prof  base::Pickle::WriteBytes(void const*, int)
6.32%     t_offdroid_gcc_  23911  23911  /system/lib/libc.so                  memset
3.98%     t_offdroid_gcc_  23911  23911  /data/local/tmp/t_offdroid_gcc_prof  @plt
3.35%     t_offdroid_gcc_  23911  23911  /data/local/tmp/t_offdroid_gcc_prof  base::Pickle::WriteString(base::BasicStringPiece<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > const&)
3.04%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    get_page_from_freelist
2.90%     t_offdroid_gcc_  23911  23911  /system/bin/linker                   [linker]strcmp
2.25%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    do_page_fault
2.10%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    free_hot_cold_page
2.09%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    handle_pte_fault
2.06%     t_offdroid_gcc_  23911  23911  /data/local/tmp/t_offdroid_gcc_prof  main
1.82%     t_offdroid_gcc_  23911  23911  /data/local/tmp/t_offdroid_gcc_prof  void base::Pickle::WriteBytesStatic<8u>(void const*)
1.74%     t_offdroid_gcc_  23911  23911  /system/bin/linker                   [linker]SymbolName::elf_hash()
1.73%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    handle_mm_fault
1.66%     t_offdroid_gcc_  23911  23911  [kernel.kallsyms]                    __memzero
-----


Now, I'd be really interested to know what are those libc.so[+62114]. Unfortunately for whatever reason I failed to symbolize them (And I though that anecdotally I knew how to symbolize stuff on Android). 
I suspect they are just calls to the allocator.

I downloaded the system image symbols from go/doxkg and then tried both:
- arm-linux-androideabi-addr2line -fCe out/target/product/hammerhead/symbols/system/lib/libc.so
- directly pushing the symbols  (libc.so, libc++.so) onto /data/local/tmp/system/lib and passing --symfs /data/local/tmp to simpleperf record && simpleperf report (I double-checked with strace that is correctly picking up the files)
In neither case simpleperf is picking them up (I suspect this is a technical fallout of Android starting using elf packing on Android at some point)
               
so, why do we end up with all those extra call?
Can't really tell. I glanced at the official (without enable_profiling=true) binaries and they are quite strongly different

EDIT: I modified the initial repro case and minified it further since I started writing this. I moved the MakeUnqiue off the loop (what I wrote above was based on the first patchset of the repro case linked above). Results still hold (3x diff) and the perf records are very similar. Unless I did something stupid last minute, this seem to suggest that MakeUnique is not really the problem. I updated the attachments, they refer to patchset 3.

At this point it should be relatively simple to carry on from here. Maybe something with std::string optimizations? Try to rule it out by further simplifying the repro and not using any string to find out.

I am attaching the two unstripped, official, NOT profiling (i.e. enable_profiling=false), binaries of patchset3 in case somebody wants to pass them to /s/chrome/src/third_party/android_tools/ndk/prebuilt/linux-x86_64/bin/gdb
(start from `disass main` and follow from there)
symbols: https://drive.google.com/drive/folders/0B1oJWfrTOXwxM21abEZxU1Y3OG8?usp=sharing
prof-clang.txt
15.0 KB View Download
prof-gcc.txt
14.2 KB View Download
Cc: dskiba@chromium.org
Let me see if I can manage to nerd snipe a +dskiba from the wild :P
I am wondering if it reproduces on Desktop (assuming we can still compile using GCC on Linux), I will pursue that path as it might be easier to debug.
I didn't try but:
- we use different -O levels on desktop compared to android
- the GCC toolchain is not the same (we use NDK for android and a debian sysroot for linux)
- I believe that even the c++lib is not the same
- Can't tell yet if this is codegen specific (i.e. repros only on arm) or more like due to high level optimizations (somehow suspect the latter, with very weakly).
primiano, if you still have access to your reduced repro and it's not a lot of effort, could you, for both compilers:

* build your mini target with `ninja -v`
* cd to build dir, paste the compile command for the main file and append '-E -o repro.ii' (should produce roughly the same with each compiler)
* paste compile command again but append `-S -o foo.s` instead

and append those files?

(If the inefficient code is in the tracing lib and not in the main file, then that won't tell us anything, but it's hopefully in main())
Re #15: I can take a look tomorrow. The only problem is that the current repro is not a single TU yet (still involves some base/ pieces). So when you say "the main file" I can't tell whether you want the actual main (I think that's not interesting), the TU that defines TracedValue, the TU that defines the Pickle, or some base::allocator stuff.
If you have a strong preference for any of them I'll be happy to do that. But from eyeballing the asm of the main file (https://www.diffchecker.com/KZSqJNr9) there doesn't seem anything too big there. 

Maybe somebody in MTV can take my repro and minify into a single translation unit (As long as it preserves the 3x slowdown) and we can follow up from there?
I mean the file containing main(). But if you think that that's probably not interesting, then I believe you :-)

I'll poke it some.
http://www/~primiano/bugs/crbug_732413 has the invocations, the two .ii and the two .s for the main. Can try more tomorrow if you want, just leave some note here, just gotta run now.
Maybe GCC inlines more? I see that we use GCC-specific -finline-limit=64 option. So I wonder how perf changes if you change it to 0 for example?

Another interesting thing is to build with enable_profiling=true. I also wonder if building Clang version with -Os (instead of -Oz) will show any perf difference.
Status: WontFix (was: Untriaged)
This regression has been open for half a year. It's not very actionable and the regression has been in all Chrome user's hands for months.

Sign in to add a comment