Python wrapper for C compiler costs lots of CPU cycles
Summary: We're burning 50 ms of useless computation time loading up Python every time the compiler is invoked.

=====

More details:

Every time the C compiler, assembler, or linker is invoked we traverse through many wrappers, some of them Python based. If we converted the Python wrapper to C (or something with similarly low overhead) I calculate that we'd save 50 ms of computation time per compiler invocation. This should translate to faster full/incremental builds for developers and should also save power / CPU usage on our build servers.

NOTE: this is forked from bug #767073 where I'm sorta working around the issue in the kernel build (though even if we can fix this the kernel change still has merit).

---

Here's the order of things when running "aarch64-cros-linux-gnu-clang" in the chroot, mostly found by running "strace -s 128 ${CMD} -e trace=process":

1. /usr/bin/aarch64-cros-linux-gnu-clang - a portage wrapper - https://gitweb.gentoo.org/proj/gcc-config.git/tree/wrapper.c
2. /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang - a python wrapper - https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/sys-devel/llvm/files/
3. /usr/bin/ccache - caches calls to the compiler
4. /usr/bin/clang - the real compiler

=====

Summary of computation time:

1. portage wrapper: negligible
2. python wrapper: ~60 ms
3. ccache: negligible
4. clang (if not cached): ~70 ms

=====

Timings of the various stages (I copied Ubuntu's "time" program to /b/tip/time to make this easier), compiling from /dev/null, which ccache won't cache:

#1: 0.1229 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/aarch64-cros-linux-gnu-clang -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#2: 0.1253 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#3: 0.0713 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' ccache "/usr/bin/clang" "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#4: 0.0692 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/clang "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

=====

Timings with a zero-sized C file that's got a cache hit:

#1: 0.0491 seconds

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#2: 0.0456 seconds

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#3: 0.00 seconds

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' ccache "/usr/bin/clang" "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#4: 0.0684 seconds

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/clang "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'
,
Oct 10 2017
"just rewrite in C" isn't exactly a straight forward solution. there's a decent amount of work being done in the wrappers (including debug support like bisection) that would not translate well to C at all. C++ would be better, but still not great.
,
Oct 10 2017
Go? How much overhead does a hello-world Go program have?
,
Oct 10 2017
There seem to be no .pyc files, because the wrapper isn't invoked as .py. Let me see if that can make things better tomorrow.
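(A tiny demo of why that is, a sketch of my own; '/tmp/demo.py' is just a scratch file:)
---
# Python 2.7 sketch showing why the wrapper has no cached bytecode: CPython
# writes a .pyc only for *imported* modules, never for the script it runs
# directly.
import os
import subprocess

open('/tmp/demo.py', 'w').write('X = 1\n')

subprocess.call(['python', '/tmp/demo.py'])   # run as a script
print os.path.exists('/tmp/demo.pyc')         # False: no bytecode cached

subprocess.call(['python', '-c', 'import sys; sys.path.insert(0, "/tmp"); import demo'])
print os.path.exists('/tmp/demo.pyc')         # True: the import wrote the .pyc
---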
,
Oct 10 2017
I'll make a wager (based on past history) that generating a .pyo file will not improve things in any major way. Python is just slow to start.
,
Oct 10 2017
I wonder if we have made the python wrapper slower recently. We did some re-structuring and made it cleaner by modularizing it.
,
Oct 10 2017
> I wonder if we have made the python wrapper slower recently.
> We did some re-structuring and made it cleaner by modularizing it.
That's possible. It's really easy to accidentally add slowdowns in python. The problem is that "import" often runs a non-trivial amount of python code. There's even a warning in the wrapper...
A quick test shows that the smallest possible overhead of doing a wrapper in python is 17-18 ms:
---
rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
/b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' python /b/tip/empty.py > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'
---
20 ms would be much better than 50 ms, but I'd still assert that for something called this often we really want the overhead to be near zero. We've talked about all sorts of things to speed up builders and so there's clearly a need. IMHO this will give us a big bang for our buck even if it means re-writing a bit of python. ...and this will help individual developers in their day-to-day activities a lot, too...
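(To make the earlier point about "import" concrete, here is the usual mitigation as an illustrative sketch, not the wrapper's actual code: keep expensive imports out of the hot path by deferring them into the branch that needs them. The wrapper already does this kind of thing for its logging path.)
---
# Illustrative Python 2 sketch: defer costly imports into the rarely-taken
# path so a normal compiler invocation never pays for them at startup.
import os   # cheap, always needed
import sys  # cheap, always needed

def log_rusage(log_file, argv0, execargs):
    # 'logging' (which drags in socket and more) and 'datetime' are only
    # imported when this diagnostic path actually runs.
    import datetime
    import logging
    logging.basicConfig(filename=log_file, level=logging.INFO)
    logging.info('%s %s %s', datetime.datetime.now(), argv0, ' '.join(execargs))
---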
,
Oct 10 2017
Tried your code. Looks like the output of time has only two digits (at least for me), and most of the time the output values are 0. Set the seq count to 1000 and the time will seem to be much less.
,
Oct 10 2017
For me they're all 0.01 and 0.02. I chose the GNU time command because it was easier to script. ...but here it is with the bash one, which gives better accuracy:
rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
{ time python /b/tip/empty.py > /dev/null ; } 2> /b/tip/onetime.txt
grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'
...in that case I got ~24ms
,
Oct 10 2017
Actually, this gets us down to 8ms, which ain't too shabby. Notice the magic '-S':
---
rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
{ time python -S /b/tip/empty.py > /dev/null ; } 2> /b/tip/onetime.txt
grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'
---
I haven't thought through all the consequences of not importing 'site' packages.
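(A quick way to see what '-S' skips, as a throwaway sketch; the exact numbers will differ per machine:)
---
# Sketch (Python 2.7): compare interpreter state with and without -S.
# With -S the 'site' module is never imported and sys.path stays minimal,
# which is where the startup savings come from.
import subprocess

snippet = 'import sys; print len(sys.path), "site" in sys.modules'
for flags in ([], ['-S']):
    out = subprocess.check_output(['python'] + flags + ['-c', snippet])
    print flags, out.strip()
---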
,
Oct 10 2017
For #4, here's overhead for compiled Go binaries for comparison:
$ cat empty.go
package main
func main() {
}
$ go build empty.go
$ time ./empty; time ./empty; time ./empty
real 0m0.003s
user 0m0.000s
sys 0m0.000s
real 0m0.002s
user 0m0.004s
sys 0m0.000s
real 0m0.002s
user 0m0.000s
sys 0m0.000s
On the same system, python (without -S) takes:
$ time python empty.py; time python empty.py; time python empty.py
real 0m0.021s
user 0m0.012s
sys 0m0.008s
real 0m0.019s
user 0m0.008s
sys 0m0.008s
real 0m0.020s
user 0m0.016s
sys 0m0.000s
I would expect the speedup over python to hold for non-empty programs like the wrapper.
The speedup would probably be about the same for other compiled languages like C and C++.
However, it is probably much easier to rewrite the python wrapper in Go than in C or C++.
,
Oct 10 2017
i'd investigate the python routes first (minimizing imports, eliding site if possible, getting .pyc files) as those should be low risk/friction to get in, especially with other tooling (like the bisection code) being written in it. if we need something more drastic, and the toolchain team is fine with Go, then using that as the wrapper language sounds fine to me. since it'll be a static program, it shouldn't impact the standalone toolchain scenario.
,
Oct 10 2017
Sounds like "Go" is a good candidate then. A 3 ms overhead seems acceptable to me (hopefully it can still be <= 5 ms once we add real code?). laszio@: are you happy owning this?
,
Oct 10 2017
I tried with .pyc / .pyo and it only saves 4ms per invocation.

sysroot_wrapper.hardened in pyo, 1000 iterations:
real 0m29.747s
user 0m22.044s
sys 0m5.600s

unmodified, 1000 iterations:
real 0m33.762s
user 0m24.272s
sys 0m6.360s
,
Oct 10 2017
I was just looking at the import statements for the target compiler wrapper. On the main execution path, we do 3 imports: os, sys and re. I don't think we can eliminate os or sys, but I think with a little work we can eliminate re....I'm not sure how much performance that would buy.
,
Oct 10 2017
converting to go is not going to be cheap and done overnight. the python avenue should be exhausted first imo (although in the end it's up to the toolchain team since they "own" these wrappers).
,
Oct 10 2017
I also prefer we look at improving the python startup. I had also read that using -S could be of help and that eliminating globals can help. I am not a big fan of rewriting the wrapper at the moment. This is not a trivial effort and everyone in the team has other priorities. What this bug is mainly complaining about is the serialized execution of the wrapper calling trivial uses of the compiler (e.g. --version), but that is not the general case. So, I agree we should look at improving the startup time of clang and the wrapper, but I don't think we should be working on a full rewrite for now.
,
Oct 10 2017
> What this bug is mainly complaining about is the serialized execution
> of the wrapper calling trivial uses of the compiler (e.g. --version),
> but that is not the general case.

That's not what I meant this bug to be about. The original bug (bug #767073) was dealing with the kernel problem where everything was crazy serial. We're solving that in the kernel by caching. _This_ bug is about a constant 50 ms overhead with every single call to the compiler, even if ccache hits (because the python runs _before_ ccache as per the design of this wrapper). This 50 ms is costing developers time every time they compile. This 50 ms is costing our builders time.

> I am not a big fan of rewriting the wrapper at the moment. This is not
> a trivial effort and everyone in the team has other priorities.

I agree that it's not trivial, but IMHO it's worth it. We've spent tons of time trying to optimize the build in other ways (to speed up our builders) and IMHO this will have at least as big of a payoff as other things that we've spent lots of time on.

> the python avenue should be exhausted first imo

If there's an easy way to make the python faster then let's do it, but I guess the question is what's "fast enough". At the very least it seems like 5 ms of total overhead is probably "fast enough". Is 20 ms of total overhead "fast enough"? IMHO no. Is 10 ms of total overhead "fast enough"? Hard to say.

laszio@ nicely pointed to bug #661019 for a previous analysis. I don't think that one isolated exactly how much overhead was the compiler startup time, but it looks like lots of it was:

squawks-R58-9247: 2h59m37s (chrome by gcc, everything else by clang/shared)
squawks-R58-9255: 2h19m48s (chrome by gcc, everything else by clang/dylib)
squawks-R58-9281: 2h5m48s (everything by clang/dylib)

...you can see how much these tiny little numbers add up to make our builders slow.
,
Oct 10 2017
From some data that laszio collected when we were migrating to LLVM, the average time to generate a .o file for the Chrome browser is 3.5 secs (this is with -O2, AFDO and other optimizations). So, 50/3500 = 0.0142, i.e. the maximum we could save is 1.42 percent. So, yes, it is inefficient, but the only case where we have seen it really matter is in "configure" steps (serialized + trivial calls to the compiler). I am sure some people on my team would love to re-implement this, especially now that the door is open for Go. But I cannot say we are going to re-implement it next week.
,
Oct 10 2017
Looking over ~200 recent CQ runs (master-paladin:16275 through master-paladin:16478), the average amount of time spent in the BuildPackages step of the slowest slave of each run was 1218 seconds (and the stage has other overhead and isn't just compiling for its duration). The average for the BuildImage step on the same slowest slave is 455 seconds (and it has even more non-compile overhead). Assuming the 1.42% and the numbers I just listed gives ~24s per CQ run as a maximum improvement.
,
Oct 10 2017
Importing logging* seems to be quite expensive. It even tries to import socket! Luckily, it's needed only when SYSROOT is not defined:
(sorted by cumulative time)
1762 function calls (1742 primitive calls) in 0.018 seconds
Ordered by: cumulative time
List reduced from 258 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.018 0.018 <string>:1(<module>)
1 0.002 0.002 0.018 0.018 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:327(main)
1 0.002 0.002 0.008 0.008 /usr/lib64/python2.7/logging/handlers.py:24(<module>)
1 0.005 0.005 0.006 0.006 /usr/lib64/python2.7/socket.py:45(<module>)
1 0.002 0.002 0.005 0.005 /usr/lib64/python2.7/logging/__init__.py:24(<module>)
1 0.001 0.001 0.002 0.002 /usr/lib64/python2.7/collections.py:1(<module>)
1 0.000 0.000 0.001 0.001 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:154(log_parent_process_tree)
With SYSROOT defined / not importing logging*:
507 function calls (501 primitive calls) in 0.001 seconds
Ordered by: cumulative time
List reduced from 76 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.001 0.001 <string>:1(<module>)
1 0.000 0.000 0.001 0.001 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:327(main)
2 0.000 0.000 0.001 0.000 /usr/lib64/python2.7/re.py:138(match)
2 0.000 0.000 0.001 0.000 /usr/lib64/python2.7/re.py:230(_compile)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.7/sre_compile.py:567(compile)
With .pyc and SYSROOT set to /build/chell, the overhead becomes 16.87ms (on my machine).
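(For anyone wanting to poke at this themselves: the exact profiling command isn't shown here, but the "logging pulls in socket" part is easy to reproduce with cProfile alone. Rough sketch, Python 2.7:)
---
# Sketch: reproduce the expensive logging import in isolation.  Run in a
# fresh interpreter so the modules aren't already cached in sys.modules.
import cProfile
import pstats

cProfile.run('import logging.handlers', '/tmp/import.prof')

stats = pstats.Stats('/tmp/import.prof')
stats.sort_stats('cumulative').print_stats(40)
# Expect logging/__init__.py, logging/handlers.py and socket.py near the
# top, much like the first profile above.
---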
,
Oct 10 2017
@20: Good information. OK, let me get some more data myself too. I'm not at all confident that the 3.5 seconds to generate a .o is typical for most packages, but perhaps I will be proven to be wrong. I'll run some tests on the kernel soon.

Do you happen to know if most builders use a cold ccache or a filled ccache? If they are using a filled ccache then likely the speedup will be quite spectacular since (as per above calculations) the cost to generate a ".o" file from cache is almost all python overhead.

--

@21: I was thinking more of release builders, not paladins. Paladins only build one package. Picking a random release builder, like kevin-release, I see that "Build Packages" took 57 minutes. If we only save 1.42% that will still be less than a minute, but as per above I question the 1.42% number.

--

I'll also note that there are really two motivations here:
1. Speed up interactive builds (often incremental) for individual developers.
2. Speed up builders (and incidentally reduce CPU time cost on builders).

#2 is certainly important to Google and we've spent lots of time on it in the past. #1 ought to also be important to Google. Shaving seconds off incremental builds can be a huge boost in productivity.

I'll try to get more real numbers here for kernel builds.
,
Oct 10 2017
#21, #20: I think this is highly misleading. 3.5s average compile time per source file? Those must be _huge_ source files. I don't think that number is an accurate average for the entire build.

Running "make allmodconfig; make" on the chromeos-4.4 Linux kernel compiles ~19,000 files. The linker is invoked ~8,000 times, plus there are various other jobs. Total CPU time on my server (using the Ubuntu compiler) is 165m user + 13m sys, or 178m. Even assuming that linking doesn't take any CPU time, that results in an average CPU time of (178*60)/19000 = 0.56s per file. If I randomly assume that a link job takes as much time as a compilation, the time would be (178*60)/27000 = 0.39s per job/file, which seems much more realistic.

Calculating it differently: if each C compiler invocation takes an additional 50ms, that would add 19000 * 50 = 950000ms or 950s of CPU time to the build, resulting in an overhead of ~10%.
,
Oct 11 2017
Clarification for #22:
1. The 16.87ms is measured with -S.
2. With and without -S or pyc, not importing logging* always saves 17ms.
3. -S and .pyc together reduce startup time by 10ms.
,
Oct 11 2017
#23: Re: ccache in builders

It looks like ccache isn't useful in *-release and *-pfq builders; their chroots are replaced every time. It may be useful in *-paladin and *-incremental, however.

(from chell-chrome-release)
03:23:21: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --buildbot-log-version --nouse-image --create --replace '--chrome_root=/b/c/cbuild/repository/.cache/distfiles/target-master/chrome-src-internal' in /b/c/cbuild/repository
03:23:22: NOTICE: Deleting chroot.
...
,
Oct 11 2017
ccache isn't stored inside the chroot, it's stored in the common cache dir precisely so it'll survive chroot wipes. ccache in release builders has been historically useful when we built more than one board at a time, but we no longer do that. at the time, i checked cold cache builds and made sure that the overhead wasn't significant. i don't recall where i posted those numbers now though. we could add a ccache summary to our cbuildbot reporting stage so we can get a sense of running stats.
,
Oct 11 2017
re c#23: I mention paladins because those are the ones that really directly affect developers. Release builds run on a three-times-a-day schedule (with discussions to move it to twice a day) and take many hours, so a few minutes either way doesn't seem as important to me. I think they also might be dedicated machines where we're paying for them regardless of whether they're idle or active the entire time -- someone on the Infra team can probably indicate one way or the other. Paladins end up building more than one package -- they'll build packages for all the things that change, and I believe, depending on the dependencies, that can trigger other packages to be rebuilt. Also, paladins have some failure cases where they need to rebuild more things after a previous bad build. I totally agree that reducing developer build time is important. I agree that the 1.42% is potentially suspect, but that was the number that was quoted.
,
Oct 11 2017
- I don't have a reason to doubt or suspect the data collected by Ting-Yuan (laszio) while building Chrome with LLVM. Chrome is C++, with heavy use of templates, and is more compiler-optimized than other packages, so for regular C the compiler will be faster. Keep in mind Chrome is the package that takes the longest to build during build packages.
- The compiler is an expensive "operation". That is why tools like ccache exist. There is nothing surprising in it being much more expensive than a wrapper written in python.
- Ting-Yuan showed that the 50ms time is incorrect. It only happens when SYSROOT is not set, which should not be the case within the chroot (we need to do something to avoid this confusing behavior).
- On #24 groeck mentioned around ~10% overhead for kernel sources, but that is based on the 50ms time, so it should be reduced by half or a little bit more (between 4 and 5%).
- Ting-Yuan showed that he can further reduce the overhead to ~16 ms.

We do care about build time. That is something we have improved in the past and during the migration to LLVM. We are working to move to a faster linker (30% or 40% faster), which is what usually dominates the time in the modify/build developer cycle, and we have an idea to reduce the compile time by 10-15%. Those have a bigger impact than rewriting the wrapper.
,
Oct 11 2017
> I don't have a reason to doubt or suspect the data collected
> by Ting-Yuan (laszio) while building Chrome with LLVM.
I wasn't saying this (or if I did, I didn't intend to). I was more saying that I don't care how long Chrome files take to compile since I don't compile them. I care how long Linux kernel files take to compile and I believe that those two things may be very different.
> Ting-Yuan showed that the 50ms time is incorrect. It only happens when SYSROOT is not set
> which should not be the case within the chroot (we need to do something to avoid this confusing behavior).
Ah, this I didn't understand from the above! I saw that logging is only imported when SYSROOT is not set, but to me that didn't mean anything. Thank you for clarifying, as it makes my experiments make much more sense.
...so basically the old code really had an overhead of about 28 ms per call:
rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
{ time SYSROOT=/build/kevin /usr/bin/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null ; } 2> /b/tip/onetime.txt
grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'
> The compiler is an expensive "operation". That is why tools like ccache exist.
> There is nothing surprising in it being much more expensive than a wrapper
> written in python.
Fair enough, but in this case the wrapper is in front of ccache. ...and the python wrapper is much slower than ccache.
> We do care about buildtime. That is something we have improved in the past and during
> the migration to LLVM. We are working to move to a faster linker (30% or 40% faster)
> (which is what usually dominates the time in the modify/build developer cycle) and we
> have an idea to reduce the compile time by 10-15%. Those have bigger impact than rewriting
> the wrapper.
OK, sounds great.
=====
FWIW: I did some experiments to see how long a clean kernel build takes on my system with a warm ccache. Unfortunately the numbers were super noisy. :( I basically compared how much most of the speedups discussed here (plus removing "re") helped. Specifically my changes were:
@@ -1,4 +1,4 @@
-#!/usr/bin/env python2
+#!/usr/bin/python2 -S
# Copyright (c) 2012 The Chromium OS Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
@@ -45,7 +45,6 @@
# compiler invocation.
import os
-import re
import sys
WRAPPER_ONLY_OPTIONS = set(('-print-cmdline', '-nopie', '-noccache'))
@@ -142,9 +141,8 @@
def get_proc_status(pid, item):
with open('/proc/%i/status' % pid) as fp:
for line in fp:
- m = re.match(r'%s:\s*(.*)' % re.escape(item), line)
- if m:
- return m.group(1)
+ if line.startswith(item + ":"):
+ return line.partition(":")[-1].lstrip()
return None
@@ -228,6 +226,7 @@
# Log 'time' like resource usage, along with corresponding command line, to
# GETRUSAGE file name if it is defined.
def log_rusage(log_file, argv0, execargs):
+ return
import datetime
import logging
@@ -372,7 +371,9 @@
cmdline = [x for x in myargs if x not in WRAPPER_ONLY_OPTIONS]
- if re.match(r'i.86|x86_64', os.path.basename(sys.argv[0])):
+ bn = os.path.basename(sys.argv[0])
+ # Complicated way to look for x86_64 or i.86...
+ if 'x86_64' in bn or 'i386' in bn or 'i686' in bn:
cmdline.extend(X86_DISABLE_FLAGS)
if not invoke_clang:
@@ -412,7 +413,7 @@
gcc_cmdline.append(flag)
sysroot = os.environ.get('SYSROOT', '')
- if not sysroot:
+ if False and not sysroot:
# Keep these import statements here, rather than making them global, to
# prevent performance penalties when this execution path is taken (see
# commend at top of this file).
@@ -464,7 +465,7 @@
linker = arch + '-ld'
linker_path = get_linker_path(linker)
clang_cmdline += ['-B' + linker_path]
- if re.match(r'i.86', arch):
+ if 'i386' in arch or 'i686' in arch:
# TODO: -target i686-pc-linux-gnu causes clang to search for
# libclang_rt.asan-i686.a which doesn't exist because it's packaged
# as libclang_rt.asan-i386.a. We can't use -target i386-pc-linux-gnu
--
Running just the compiler test, I went from about 28 ms per invocation to about 13 ms per invocation, so saving 15 ms.
--
rm /tmp/testlog.txt
CCACHE_DIR=/var/cache/chromeos-cache/distfiles/ccache/ ccache -s >> /tmp/testlog.txt
for i in $(seq 5); do
rm -rf /build/kevin/var/cache/portage/sys-kernel/chromeos-kernel-4_4
{ time FEATURES="-buildpkg" USE="-kmemleak lockdebug kgdb vtconsole -firmware_install" emerge-${BOARD} --nodeps chromeos-kernel-4_4 ; } 2>> /tmp/testlog.txt
done
CCACHE_DIR=/var/cache/chromeos-cache/distfiles/ccache/ ccache -s >> /tmp/testlog.txt
--
I got this before:
real 1m19.339s
user 2m47.252s
sys 3m2.716s
real 1m16.349s
user 2m48.116s
sys 3m5.464s
real 1m21.457s
user 2m47.712s
sys 3m5.328s
real 1m18.416s
user 2m49.364s
sys 3m5.924s
real 1m13.730s
user 2m46.384s
sys 3m2.928s
And after:
real 1m21.196s
user 2m34.656s
sys 3m5.372s
real 1m16.896s
user 2m35.400s
sys 2m59.708s
real 1m18.910s
user 2m35.580s
sys 3m3.736s
real 1m16.814s
user 2m35.076s
sys 3m5.160s
real 1m18.987s
user 2m34.392s
sys 3m1.724s
---
user time went down by about 13 seconds per build, but since the kernel build is so parallel and I have a beefy workstation it didn't really save any wall-clock (real) time.
So in terms of just an incremental kernel build the numbers agree: we're chewing up some time here and it would be nice to get it sped up, but it doesn't seem to make a major impact on incremental build time, at least in the specific test above. It does make a ~6% decrease in CPU usage in the particular test above but that's not crazy.
=====
Given all the above it seems sane to say:
* Let's take the low hanging fruit.
* Let's leave it in python for now.
* Eventually it would be nice to get this in Go and save a few more CPU cycles, but that could easily be a P3.
Sound good?
,
Oct 11 2017
Thanks for the detailed analysis, Doug. I agree with your conclusions. I will leave the bug open so that we implement the low hanging fruit, and will file a separate bug to implement the wrapper in Go.
,
Oct 11 2017
Filed issue 773875 for the Go compiler wrapper implementation.
,
Oct 13 2017
Everything except the "pyc" / "pyo" conversion is easy. Should I post a patch for that? For the "pyc" / "pyo" conversion I'm not sure how we're going to do it. In theory you could try to do something like this in the chroot, I guess: https://stackoverflow.com/questions/21825910/shebang-for-compiled-python-code
,
Oct 13 2017
I've uploaded these two, but I'm not 100% sure how to test them since it's not immediately clear to me how these get rolled out. I can dig, but I also figure someone here probably just knows. I tested the bits of python here in isolation and also manually modified my own "sysroot_wrapper.hardened" in the chroot.

* https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/719258
  compiler wrappers: Update wrapper to not use "env" and invoke "python -S"
* https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/719259
  compiler wrappers: Avoid use of 're.'
,
Oct 14 2017
wrt pyc, it should just be renaming the file to have a py suffix and calling an optimize helper in the ebuild
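(For reference, roughly what that byte-compilation amounts to at the Python level; the actual ebuild helper isn't shown in this thread and the file name below is only an example:)
---
# Python 2.7 sketch: what the ebuild's optimize step boils down to.
import py_compile

SRC = 'sysroot_wrapper.hardened.py'       # i.e. after renaming to a .py suffix
py_compile.compile(SRC, cfile=SRC + 'c')  # writes sysroot_wrapper.hardened.pyc
# An optimized .pyo additionally needs the interpreter run with -O,
# e.g. "python -O -m py_compile <file>".
---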
,
Oct 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/cc828a5bad5e82b46e2b48e26daf3f363251ad28

commit cc828a5bad5e82b46e2b48e26daf3f363251ad28
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:04:28 2017

cros_setup_toolchain: Munge the shebang back to use "env"

In CL:719258 we're optimizing the shebang of the toolchain wrappers. ...but we can't necessarily use this optimized version outside the chroot. Munge it back.

BUG=chromium:773138
TEST=See that things get munged back

Change-Id: I08e280bca46cd2f36049502cd87466d4f581895f
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/719719
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>

[modify] https://crrev.com/cc828a5bad5e82b46e2b48e26daf3f363251ad28/scripts/cros_setup_toolchains.py
,
Oct 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0

commit 64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:05:32 2017

compiler wrappers: Update wrappers to not use "env" and invoke "python -S"

Let's speedup the python wrapper a little bit. We know where python is and we also know that it's OK to execute this script raw (AKA skip including site packages). This speeds up the execution of the script by about 7 ms (on my system from ~29 ms to ~22 ms).

NOTE: this change is partially undone when outside the chroot by CL:719719

CQ-DEPEND=CL:719719
BUG=chromium:773138
TEST=See bug

Change-Id: I7bf64306655a05dc5661a93136b694e78b549544
Reviewed-on: https://chromium-review.googlesource.com/719258
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>

[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/llvm/files/clang_host_wrapper.header
[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/gcc/gcc-9999.ebuild
[rename] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/llvm/llvm-5.0_pre305632_p20170806-r6.ebuild
[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/gcc/files/sysroot_wrapper.hardened.header
,
Oct 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/54f79daa269d025d114e029c1fbf1de2f4e39395

commit 54f79daa269d025d114e029c1fbf1de2f4e39395
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:05:32 2017

compiler wrappers: Avoid use of 're.'

Our use of regular expressions is quite basic. Let's avoid using them and stick to string manipulation. This avoids 'import re' which will slightly speed up execution time. My measurements were a bit noisy, but this seemed to shave ~2 ms on the execution time of this script.

BUG=chromium:773138
TEST=See bug

Change-Id: Ic69cb8356acd19486513f20355f1bede6906c70d
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/719259
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[rename] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/llvm-5.0_pre305632_p20170806-r7.ebuild
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/gcc/files/sysroot_wrapper.hardened.body
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/gcc/gcc-9999.ebuild
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/files/wrapper_script_common
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/files/clang_host_wrapper.body
,
Oct 16 2017
> wrt pyc, it should just be renaming the file to have a py suffix and
> calling an optimize helper in the ebuild

Ah, it looks like the chroot already has all the magic in it to directly execute .pyc files! I didn't expect that. ...but I don't actually see a speedup here. The results are a bit noisy for me, but it seems like it might actually be slower! Can anyone else confirm that?
,
Oct 17 2017
#39: How many iterations did you try? With 100 I also got somewhat unstable results. 1000 looks pretty stable to me.

I implemented the .pyc approach in the following CL and got results consistent with #25. Based on Doug's patches, the overhead of the wrapper is now less than 15.5ms.

$ time.sh x86_64-cros-linux-gnu-gcc -noccache
real 0m15.509s
user 0m9.764s
sys 0m2.932s

https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/722671

time.sh
---
#!/bin/bash
export SYSROOT=/build/chell
run() {
  for i in $(seq 1000); do
    $* &> /dev/null
  done
}
time run $*
---
,
Oct 17 2017
tl;dr: My numbers continue to show that .pyc / .pyo is slower than .py

---

Oh, doh! I think I went back to 100 iterations on accident. OK, at 1000 it's more consistent. ...but I'm still seeing weirdness in that the .pyc / .pyo is often slower.

As far as I can tell, though, there's some sort of bi-modality going on. When running the same tests sometimes I see good numbers and other times bad. When it's in a certain mode it tends to stay there. Here's some examples.

I first did this to prep:
---
file /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/sysroot_wrapper.hardened
cp /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/sysroot_wrapper.hardened /tmp/wrapper.py
python -m compileall /tmp/wrapper.py
chmod +x /tmp/wrapper.pyc
python -OO -m compileall /tmp/wrapper.py
chmod +x /tmp/wrapper.pyo
---

I then ran the attached script twice. As you can see, it's not consistent...

. /b/tip/tmp/timetests ; . /b/tip/tmp/timetests

===== py =====
0.021915
0.0219
0.022227
===== pyc =====
0.025591
0.025673
0.0229113333333
===== pyo =====
0.017517
0.0173945
0.017056
===== py =====
0.016318
0.0152835
0.0151203333333
===== pyc =====
0.017062
0.0165925
0.0165496666667
===== pyo =====
0.017185
0.018336
0.0202573333333

---

On a hunch, I decided that perhaps this is a cpufreq problem where the system isn't properly going up to a fast enough cpu frequency. I ran this outside the chroot:

cd /sys/devices/system/cpu
for c in cpu*; do echo performance > $c/cpufreq/scaling_governor; done
grep "" */*/scaling_governor

...then I re-ran:

. /b/tip/tmp/timetests ; . /b/tip/tmp/timetests

Now the tests are consistent. ...and the compiled tests still show as slower.

===== py =====
0.010385
0.0104075
0.0103703333333
===== pyc =====
0.011453
0.0115235
0.011499
===== pyo =====
0.011407
0.0113465
0.0113786666667
===== py =====
0.010302
0.0101925
0.0101536666667
===== pyc =====
0.011175
0.0112485
0.0113123333333
===== pyo =====
0.011491
0.0114745
0.0114233333333

===

I also tried changing the above to use "-noccache" like you did. That's in "timetests2". Again compiled tests are slower.

. /b/tip/tmp/timetests2 ; . /b/tip/tmp/timetests2

===== py =====
clang-5.0: error: no input files
0.026692
0.0264095
0.0265666666667
===== pyc =====
clang-5.0: error: no input files
0.027913
0.028025
0.0280016666667
===== pyo =====
clang-5.0: error: no input files
0.027973
0.027863
0.0278026666667
===== py =====
clang-5.0: error: no input files
0.026324
0.026315
0.026249
===== pyc =====
clang-5.0: error: no input files
0.027406
0.0274195
0.0274873333333
===== pyo =====
clang-5.0: error: no input files
0.027845
0.028
0.0280426666667

=====

Don't forget to go back to the "powersave" governor when you're done:

cd /sys/devices/system/cpu
for c in cpu*; do echo powersave > $c/cpufreq/scaling_governor; done
grep "" */*/scaling_governor
,
Oct 17 2017
Closing as Fixed since this is as good as we'll get w/ python, I think.

Future work:
* Make incremental kernel builds faster: bug #767073
* Move wrapper to Go: bug #773875
* Make launching clang faster: bug #773142
,
Oct 17 2017
I can reproduce what Doug saw: when invoking the pyc/pyo directly, it is slower. It is faster when imported and invoked in another python script. So my CL is still valid :)
,
Oct 17 2017
@44: nice. OK, leaving open then.
,
Oct 18 2017
I cannot find a specification of when the pyc is used. According to strace, it looks like the pyc is used only when:
1) it is called directly, or
2) the module is imported by some other python code.

However, with 1) we can't specify a shebang (at least I don't know how), so the '-S' trick is missing. I think this is the reason why Doug saw pyc being slower: pyc without '-S' is slower than py + '-S'. When implemented using 2), pyc + '-S' is faster than py + '-S'.
,
Oct 18 2017
So it looks like having a dummy wrapper that imports the main module is the way to go. Can someone give me +2? https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/722671
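(For the record, the shape of that dummy wrapper is something like the following; a sketch only, and the module/function names may not match what's in the CL:)
---
#!/usr/bin/python2 -S
# Sketch of the "dummy wrapper" approach: a tiny stub that keeps the
# "python2 -S" shebang while importing the real logic as a module, so the
# interpreter loads the byte-compiled sysroot_wrapper.pyc when present.
# Names below are illustrative, not necessarily those used in the CL.
import sys

import sysroot_wrapper  # picks up sysroot_wrapper.pyc if it exists

sys.exit(sysroot_wrapper.main())
---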
,
Oct 18 2017
> pyc without '-S' is slower than py + '-S'. When implemented using 2), pyc + '-S' is faster than py + '-S'.

Oh! This makes total sense!
,
Dec 4 2017
,
Feb 5 2018
,
Sep 25
laszio is on a different team now, so it's quite unlikely that he'll get to this. I'll take it in his place, though no promises about seeing substantial progress in the near future. :)