Invoking clang is too slow
Issue description

As per bug #773138, invoking clang for _anything_ takes 70 ms. This is just too long. Invoking gcc takes 10 ms or less. Someone should look at clang and see if there's anything we can do to reduce this, since it burns up time for no reason and we call clang _a lot_.

NOTE: it seems as if (for some reason) a large chunk of time is spent in mprotect. Am I reading that right? Why would that take 20 ms?

$ strace -r /usr/bin/clang --version 2>/tmp/foo.txt > /dev/null; sort /tmp/foo.txt | tail
     0.000350 open("/usr/lib32/gcc-cross/i386-redhat-linux", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
     0.000406 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
     0.000431 open("/usr/lib64/x86_64-redhat-linux6E/gcc/x86_64-redhat-linux6E", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
     0.001312 brk(NULL) = 0x2758000
     0.001397 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90390a3000
     0.001404 sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
     0.001546 open("/proc/cpuinfo", O_RDONLY|O_CLOEXEC) = 3
     0.001557 +++ exited with 0 +++
     0.002042 mprotect(0x1f34000, 892928, PROT_READ) = 0
     0.022944 mprotect(0x7f9038ac9000, 3620864, PROT_READ) = 0
,
Oct 10 2017
Prelinking doesn't work with PIEs or ASLR, and it has largely been abandoned by the respective upstreams.
,
Oct 10 2017
Another advantage of prelinking is that we could go back to "many individual libraries", which can save 200-300 MB-ish on the SDK. A zygote has all the performance advantages of prelinking, but that seems to be non-trivial...
,
Oct 10 2017
We also need to support standalone SDK usage, so a zygote probably is not an option. Is there any reason we can't link LLVM statically?
,
Oct 10 2017
I don't think we've ever used prelinking before, and I doubt trying to use it is a viable long-term solution. I'm not sure how a zygote approach would even work in the SDK ... there's not really any place we could keep a process resident to easily communicate with and fork into the right env.
,
Oct 10 2017
The reason is size: linking it statically increases the SDK by ~400 MB, and that was measured on LLVM 3.9; it can only be larger now.
The zygote doesn't have to be persistent. Making it time out after 10 s should be good enough. Assuming the saving per invocation is 60 ms, the worst case is 60 ms / (10 s + 60 ms) ≈ 0.6% slower than a persistent one.
It looks like we don't even have to touch clang:
/* zygote.c (pseudocode) */
#include <dlfcn.h>
#include <unistd.h>

typedef int (*main_fn)(int, char **);

int main(int argc, char **argv) {
  ...
  if (is_server) {
    /* Pay the dynamic-linking / relocation cost once, before forking. */
    void *clang_dl = dlopen("clang.real", RTLD_NOW);   // assuming clang.real is built as a PIE
    main_fn clang_main = (main_fn)dlsym(clang_dl, "main");
    while (...) {                  // wait for a connection; timeout = 10s
      if (fork() == 0) {
        int rv = clang_main(argc, argv);   // argc/argv as received from the client
        // return rv to the client and exit.
      }
    }
  } else {                         // is client
    // send argc / argv to the server
    // wait for the return value
    // return it
  }
}
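For what it's worth, here is a minimal sketch (my illustration, not part of the proposal above) of what the client half could look like, assuming the zygote listens on a Unix-domain socket at a hypothetical path and arguments are passed as NUL-terminated strings:

/* clang wrapper acting as the zygote client -- illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define ZYGOTE_SOCKET "/tmp/clang-zygote.sock"   /* hypothetical path */

int main(int argc, char **argv) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  struct sockaddr_un addr = { .sun_family = AF_UNIX };
  strncpy(addr.sun_path, ZYGOTE_SOCKET, sizeof(addr.sun_path) - 1);
  if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
    perror("connect");
    return 1;   /* a real wrapper would fall back to exec'ing clang.real here */
  }
  /* Send each argument as a NUL-terminated string; an empty string ends the list. */
  for (int i = 0; i < argc; i++)
    write(fd, argv[i], strlen(argv[i]) + 1);
  write(fd, "", 1);
  /* Block until the forked clang finishes and pass its exit status through. */
  unsigned char status = 1;
  read(fd, &status, 1);
  close(fd);
  return status;
}

A real wrapper would also need to forward the working directory and environment, and fall back to running clang.real directly if the zygote isn't up.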
,
Oct 10 2017
Another approach is to statically link just clang, while other tools and libraries remain shared. If those are only used rarely, this might be the simplest option.
,
Oct 10 2017
What is the zygote gaining us there? Just keeping the files mapped, or is the startup overhead significant too? If we can statically link just clang, that'd be worth investigating.
,
Oct 10 2017
The zygote forks after clang has been dlopen()-ed, so the time-consuming dynamic linking / relocation is done before forking. A statically linked clang is fast for the same reason. Previous analysis is here: https://bugs.chromium.org/p/chromium/issues/detail?id=661019
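As a rough way to double-check where the time goes, here is a small timing sketch (mine, not from the bug linked above): dlopen() the big LLVM shared library with RTLD_NOW and measure it; the library name below is an assumption based on this system's install.

/* dlopen_cost.c -- illustrative micro-benchmark; build with: cc dlopen_cost.c -ldl */
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

int main(void) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  /* RTLD_NOW forces all relocations to be resolved up front, which is the
   * cost a zygote (or a static link) would pay only once. */
  void *h = dlopen("libLLVM-5.0svn.so", RTLD_NOW);   /* library name is an assumption */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  if (!h) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
  printf("dlopen + relocation took %.1f ms\n", ms);
  dlclose(h);
  return 0;
}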
,
Oct 10 2017
I grok how the zygote in your example works. What I'm asking is where the actual savings come from. If it's mostly from the file maps, having a process live with those libs already loaded is trivial. But if it's from ctors and the writable sections (and processing of the relocs), that wouldn't be sufficient. I think it's the latter, but I'm not 100% sure.
,
Oct 10 2017
About #9: I would prefer that we stay with one of the configurations supported by LLVM (CMake). As far as I understand, LLVM only supports 1) all shared, 2) one big shared library (what we are using now), and 3) all static. Can't we just go with 3)? BTW (if it matters), this is what Android does.
,
Oct 11 2017
#11: The saving comes from ld.so, which spends most of its time on relocation. In fact, ld.so accounts for 52.51% of the cycles when compiling an empty C function. The second-largest part is the kernel (33.54%). Considering the 23 ms mprotect(..., 3.6 MB) in #0, which matches the second LOAD segment until .got, I would bet that most of the kernel time also comes from ld.so doing relocation / writing the GOT.
So a zygote should be able to save up to 86.06% (ld.so + kernel), or roughly 60 ms.
(cros2) ~/tmp$ sudo perf record -e cycles /usr/bin/clang -c trivial.c -o /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB perf.data (289 samples) ]
(cros2) ~/tmp$ sudo perf report -s dso
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 289 of event 'cycles'
# Event count (approx.): 91024686
#
# Overhead Shared Object
# ........ ...................
#
52.51% ld-2.23.so
33.54% [kernel.kallsyms]
7.17% libLLVM-5.0svn.so
4.50% libc-2.23.so
1.89% clang-5.0
0.38% libstdc++.so.6.0.20
,
Oct 11 2017
Correction: until .got -> until .data (i.e., including .got but not including .data)
,
Oct 11 2017
Not sure if I answered #11 correctly, so here's more clarification: the time mostly comes from processing the relocs (~86%). I think a zygote (or static linking) can fix most of that.
,
Oct 11 2017
In my past lives, I had success in significantly reducing the LLVM library size by compiling with -fvisibility=hidden and marking specific functions visible. However, this is a painstaking exercise.
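For concreteness, a minimal sketch of that approach (the macro and function names are illustrative, not actual LLVM symbols): build with -fvisibility=hidden and explicitly re-export only what must stay visible.

/* Illustrative only: names below are made up. */
#define EXPORTED __attribute__((visibility("default")))

/* Explicitly re-exported: still visible to the dynamic linker. */
EXPORTED int clang_tool_entry(int argc, char **argv);

/* With -fvisibility=hidden, everything not marked stays internal, so the
 * dynamic symbol table (and the relocations ld.so must process) shrinks. */
int internal_helper(void);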
,
Jan 2 2018
[It appears that a bunch of old cros issues had the "Infra" component bulk-added recently, but they should probably be "Infra>Client>ChromeOS".]
,
Sep 25
laszio is on a different team now, so it's quite unlikely that he'll get to this. I'll take it in his place, though no promises about seeing substantial progress in the near future. :) FWIW, we have a project to use ThinLTO and AFDO for clang/lld/etc. that should hopefully be started some time in Q4. If we're still spending a substantial piece of time in ld, that might not be *hugely* helpful here, but we'll see...
,
Sep 25
Note that this is still a daily pain point for me. I spend far too much time waiting for the kernel build to run, and I find myself context-switching away to something else and then forgetting what I was in the middle of testing. I could certainly get in the habit of just building the kernel with "-clang", but so far I have resisted the temptation.