
Issue 773142

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Invoking clang is too slow

Project Member Reported by diand...@chromium.org, Oct 10 2017

Issue description

As per bug #773138, invoking clang for _anything_ takes 70 ms.  This is just too long.  Invoking gcc takes 10 ms or less.

Someone should look at clang and see if there's anything we can do to reduce this since it burns up time for no reason and we call clang _a lot_.


NOTE: it seems as if (for some reason) a large chunk of time is spent on mprotect.  Am I reading that right?  Why would that take 20 ms?

$ strace -r /usr/bin/clang --version 2>/tmp/foo.txt > /dev/null; sort /tmp/foo.txt | tail
     0.000350 open("/usr/lib32/gcc-cross/i386-redhat-linux", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
     0.000406 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
     0.000431 open("/usr/lib64/x86_64-redhat-linux6E/gcc/x86_64-redhat-linux6E", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
     0.001312 brk(NULL)                 = 0x2758000
     0.001397 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90390a3000
     0.001404 sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
     0.001546 open("/proc/cpuinfo", O_RDONLY|O_CLOEXEC) = 3
     0.001557 +++ exited with 0 +++
     0.002042 mprotect(0x1f34000, 892928, PROT_READ) = 0
     0.022944 mprotect(0x7f9038ac9000, 3620864, PROT_READ) = 0


 
Note that in bug #767073, laszio@chromium.org said:

> I believe that most of the difference between gcc and llvm comes from 
> dynamic linker; clang is linked against libLLVM-5.0svn.so (which has 
> ~50MB text) and spends a considerable amount of time in ld.so in the 
> beginning.

Is that different than the mprotect?

...from a long time ago I seem to remember some concept of prelinking.  Does that not help with the big dynamic library?
Labels: Build-Toolchain
Owner: laszio@chromium.org
Status: Assigned (was: Untriaged)

Comment 3 by vapier@chromium.org, Oct 10 2017

prelinking doesn't work with PIEs or ASLR and has largely been abandoned by the respective upstreams

Comment 4 by laszio@chromium.org, Oct 10 2017

Another advantage of prelinking is that we can go back to "many individual libraries", which can save 200-300MB-ish on the SDK.

zygote has all the performance advantages of prelinking but that seems to be non-trivial...
We also need to support standalone SDK usage so zygote probably is not an option. Is there any reason we can't link llvm statically?

Comment 6 by vapier@chromium.org, Oct 10 2017

i don't think we've ever used prelinking before, and i doubt trying to use it is a viable long term solution

i'm not sure how a zygote approach would even work in the SDK ... there's not really any place we could keep a process resident to easily communicate with and fork into the right env

Comment 7 by laszio@chromium.org, Oct 10 2017

The reason is size. Linking it statically increases the SDK by ~400MB, and this was measured on llvm 3.9. It can only be larger now.

Zygote doesn't have to be persistent. Making it time out after 10s should be good enough. Assuming the saving per-invocation is 60ms, it is 60ms / (10s + 60ms) ~= 0.6% slower than the persistent one in the worst case.
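A quick sanity check of that worst-case figure (this is just the arithmetic using the numbers assumed above, not a measurement):

#include <stdio.h>

int main(void) {
  double timeout_s = 10.0;   /* assumed zygote idle timeout */
  double startup_s = 0.060;  /* assumed per-invocation dynamic-link cost */
  /* Worst case: each invocation arrives just after the zygote timed out,
     so it pays the full startup cost once per (timeout + startup). */
  printf("worst-case overhead: %.2f%%\n",
         100.0 * startup_s / (timeout_s + startup_s));
  return 0;
}

This prints 0.60%, matching the estimate above.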

It looks like we don't even have to touch clang:

/* zygote.c (sketch) */
#include <dlfcn.h>   /* dlopen, dlsym */
#include <unistd.h>  /* fork */

int main(int argc, char **argv) {
  ...
  if (is_server) {
    void *clang_dl = dlopen("clang.real", RTLD_NOW); // assuming clang is built as PIE
    int (*clang_main)(int, char **) =
        (int (*)(int, char **))dlsym(clang_dl, "main");
    while (...) { // wait for connection; timeout = 10s
      if (fork() == 0) {
        int rv = clang_main(argc, argv); // argc / argv as received from the client
        // return rv to client and exit.
      }
    }
  } else { // is client
    // send argc / argv to server
    // wait for return value
    // return
  }
}
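
To make the "wait for connection; timeout = 10s" step concrete, here is a minimal sketch of what that server loop could look like (not from this bug; serve() and the already-listening Unix-domain socket listen_fd are made up for illustration), using poll() for the idle timeout:

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical server loop: listen_fd is an already-bound, listening
   Unix-domain socket.  Returns once the zygote has been idle for 10s. */
int serve(int listen_fd) {
  for (;;) {
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };
    if (poll(&pfd, 1, 10 * 1000) <= 0)  // 10s idle timeout (or poll error)
      return 0;                         // let the zygote exit
    int conn = accept(listen_fd, NULL, NULL);
    if (conn < 0)
      continue;
    if (fork() == 0) {
      // child: read argc / argv from conn, call the dlsym'd clang main,
      // write the return value back, then exit.
      _exit(0);
    }
    close(conn);
  }
}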

Comment 8 by laszio@chromium.org, Oct 10 2017

Another approach is to just statically link clang. Other tools and libraries remain shared. If they are only used rarely, this might be the simplest.

Comment 9 by vapier@chromium.org, Oct 10 2017

what is the zygote gaining us there?  just keeping the files mapped?  or is the startup overhead significant too?

if we can statically link just clang, that'd be worth investigating.
zygote forks after dlopen-ing clang, so the time-consuming dynamic linking / relocation is done before forking. Statically linked clang is fast for the same reason: it avoids that work at startup.

Previous analysis is here: https://bugs.chromium.org/p/chromium/issues/detail?id=661019
i grok how the zygote in your example works.  what i'm asking is where the actual savings come from.  if it's mostly from the file maps, having a process live with those libs already loaded is trivial.  but if it's from ctors and the writable sections (and processing of the relocs), that wouldn't be sufficient.  i think it's the latter, but not 100% sure.

Comment 12 by lloz...@google.com, Oct 10 2017

about #9, I would prefer if we stay with one of the configurations that are supported by LLVM (cmake).
As far as I understand, llvm only supports 1) all shared, 2) one big shared library (what we are using now), and 3) all static.
Can't we just go with 3)?
BTW (if it matters) this is what android does. 
#11: The saving comes from ld.so, which spends most of its time on relocation. In fact, ld.so accounts for 52.51% of the cycles when compiling an empty C function. The second largest part is the kernel (33.54%). Considering the 23ms mprotect(..., 3.6MB) in #0, which matches the second LOAD segment until .got, I would bet that most of the kernel time also comes from ld.so doing relocation / writing the GOT.

So zygote should be able to save up to 86.06%, or roughly 60ms.

(cros2) ~/tmp$ sudo perf record -e cycles /usr/bin/clang -c trivial.c -o /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB perf.data (289 samples) ]

(cros2) ~/tmp$ sudo perf report -s dso
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 289  of event 'cycles'
# Event count (approx.): 91024686
#
# Overhead  Shared Object
# ........  ...................
#
    52.51%  ld-2.23.so
    33.54%  [kernel.kallsyms]
     7.17%  libLLVM-5.0svn.so
     4.50%  libc-2.23.so
     1.89%  clang-5.0
     0.38%  libstdc++.so.6.0.20
Correction: until .got -> until .data (i.e., including .got but not including .data)
Not sure if I answered #11 correctly, so here's more clarification.

The time is mostly spent processing the relocs (~86%). I think zygote (and static linking) can fix most of that.
In my past lives, I had success in significantly reducing llvm library size by compiling with -fvisibility=hidden and marking specific functions visible.
However, this is a painstaking exercise.
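For reference, a minimal sketch of that technique (the file and symbol names here are made up, not from LLVM): build everything with -fvisibility=hidden so symbols are hidden by default, and annotate only what should stay exported:

/* example.c -- build with: cc -shared -fPIC -fvisibility=hidden example.c -o libexample.so */

/* Explicitly exported: stays in the dynamic symbol table of the .so. */
__attribute__((visibility("default")))
int example_public_entry(void) {
  return 42;
}

/* Hidden by the -fvisibility=hidden default: not exported, which shrinks
   the dynamic symbol table and lets intra-library calls bind at link time. */
int example_internal_helper(void) {
  return 1;
}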
Components: Infra
Components: -Infra Infra>Client>ChromeOS
[It appears that the "Infra" component was recently bulk-added to a bunch of old cros issues, but they should probably be "Infra>Client>ChromeOS".]
Components: -Infra>Client>ChromeOS
Components: Tools>ChromeOS-Toolchain
Owner: g...@chromium.org
laszio is on a different team now, so it's quite unlikely that he'll get to this. I'll take it in his place, though no promises about seeing substantial progress in the near future. :)

FWIW, we have a project to use ThinLTO and AFDO for clang/lld/etc. that should hopefully be started some time in Q4. If we're still spending a substantial piece of time in ld, that might not be *hugely* helpful here, but we'll see...
Note that this is still a daily pain point for me.  I spend far too much time waiting for the kernel build to run and I find myself context switching away to something else and then forgetting what I was in the middle of testing.  I could certainly get in the habit of just building the kernel with "-clang" but so far I have resisted the temptation.
