
Issue 773138

Starred by 4 users

Issue metadata

Status: Assigned
OS: Chrome
Pri: 2
Type: Bug




python wrapper for C compiler costs lots of CPU cycles

Project Member Reported by diand...@chromium.org, Oct 10 2017

Issue description

Summary: 

We're burning 50 ms of useless computation time loading up python every time the compiler is invoked

=====

More details:

Every time the C compiler, assembler, or linker is invoked we pass through several wrappers, some of them python-based.  If we converted the python wrapper to C (or something with similarly low overhead), I calculate that we'd save 50 ms of computation time per compiler invocation.

This should translate to faster full/incremental builds for developers and should also save power / CPU usage of our build servers.

NOTE: this is forked from  bug #767073  where I'm sorta working around the issue in the kernel build (though even if we can fix this the kernel change still has merit).

---

Here's the order of things when running "aarch64-cros-linux-gnu-clang" in the chroot, mostly found by running: "strace -s 128 ${CMD} -e trace=process"

1. /usr/bin/aarch64-cros-linux-gnu-clang - a portage wrapper - https://gitweb.gentoo.org/proj/gcc-config.git/tree/wrapper.c

2. /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang - a python wrapper -
 https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/sys-devel/llvm/files/

3. /usr/bin/ccache; caches calls to the compiler

4. /usr/bin/clang; the real compiler

=====

Summary of computation time:

1. portage wrapper: negligible

2. python wrapper: ~60 ms.

3. ccache: negligible

4. clang (if not cached): ~70 ms.

=====

Timings of the various stages (I copied Ubuntu's "time" program to /b/tip/time to make this easier) using "--version" which ccache won't cache:

#1: 0.1229 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/aarch64-cros-linux-gnu-clang -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#2: 0.1253 seconds

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#3: 0.0713

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' ccache "/usr/bin/clang" "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#4: 0.0692

rm /b/tip/tmp/time.txt
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/clang "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -x c /dev/null -c -o /dev/null > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

=====

Timings with a zero-sized C file that's got a cache hit:

#1: 0.0491

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#2: 0.0456

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#3: 0.00

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' ccache "/usr/bin/clang" "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

--

#4: 0.0684

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' /usr/bin/clang "--sysroot=/usr/aarch64-cros-linux-gnu" "-Qunused-arguments" "-grecord-gcc-switches" "-fPIE" "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-pie" "-fno-omit-frame-pointer" "-e" "trace=process" "-B/usr/libexec/gcc/aarch64-cros-linux-gnu" "-target" "aarch64-cros-linux-gnu" -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'



 
Forked the overall slowness of invoking clang to bug #773142, though that seems like it might (?) be harder to fix?
Labels: Build-Toolchain
Owner: laszio@chromium.org
Status: Assigned (was: Untriaged)

Comment 3 by vapier@chromium.org, Oct 10 2017

"just rewrite in C" isn't exactly a straight forward solution.  there's a decent amount of work being done in the wrappers (including debug support like bisection) that would not translate well to C at all.

C++ would be better, but still not great.
go? How much overhead is a hello world go program?

Comment 5 by laszio@chromium.org, Oct 10 2017

There seem to be no .pyc files, because the wrapper isn't invoked as .py. Let me see if that can make things better tomorrow.
I'll make a wager (based on past history) that generating a .pyo file will not improve things in any major way. Python is just slow to start.

Comment 7 by lloz...@google.com, Oct 10 2017

I wonder if we have made the python wrapper slower recently. 
We did some re-structuring and made it cleaner by modularizing it. 

Cc: davidri...@chromium.org
> I wonder if we have made the python wrapper slower recently. 
> We did some re-structuring and made it cleaner by modularizing it. 

That's possible.  It's really easy to accidentally add slowdowns in python.  The problem is that "import" often runs a non-trivial amount of python code.  There's even a warning in the wrapper...

A quick test shows that the smallest possible overhead of doing a wrapper in python is 17-18 ms:

---

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
  /b/tip/tmp/time -a -o /b/tip/tmp/time.txt -f'%e' python /b/tip/empty.py > /dev/null
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

---

20 ms would be much better than 50 ms, but I'd still assert that, for something called this often, we really want the overhead to be near zero.  We've talked about all sorts of things to speed up builders and so there's clearly a need.  IMHO this will give us a big bang for our buck even if it means re-writing a bit of python.  ...and this will help individual developers in their day-to-day activities a lot, too...

Comment 9 by groeck@chromium.org, Oct 10 2017

Tried your code. It looks like the output of time has only two digits (at least for me), and most of the time the output values are 0. Set the seq count to 1000 and the time will appear to be much less.

For me they're all 0.01 and 0.02.  I chose the gnu time command because it was easier to script.  ...but here it is with the bash one, which gives better accuracy:

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
  { time python /b/tip/empty.py > /dev/null ; } 2> /b/tip/onetime.txt
  grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

...in that case I got ~24ms
Actually, this gets us down to 8ms, which ain't too shabby.  Notice the magic '-S':

---

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.py
for i in $(seq 100); do
  { time python -S /b/tip/empty.py > /dev/null ; } 2> /b/tip/onetime.txt
  grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'

---

I haven't thought through all the consequences of not importing 'site' packages.
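
For reference, what -S skips is the automatic 'import site' (and the sys.path / site-packages setup it performs); a quick way to see the difference with the chroot's python2, as a rough check rather than a full audit of the consequences:

python -c 'import sys; print len(sys.path)'
python -S -c 'import sys; print len(sys.path)'

A script started with -S can still do 'import site' explicitly later if it turns out to need the site-packages setup.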
For #4, here's overhead for compiled Go binaries for comparison:

$ cat empty.go 
package main

func main() {
}
$ go build empty.go
$ time ./empty; time ./empty; time ./empty

real    0m0.003s
user    0m0.000s
sys     0m0.000s

real    0m0.002s
user    0m0.004s
sys     0m0.000s

real    0m0.002s
user    0m0.000s
sys     0m0.000s


On the same system, python (without -S) takes:
$ time python empty.py; time python empty.py; time python empty.py

real    0m0.021s
user    0m0.012s
sys     0m0.008s

real    0m0.019s
user    0m0.008s
sys     0m0.008s

real    0m0.020s
user    0m0.016s
sys     0m0.000s


I would expect the speedup from python to hold for non-empty programs like the wrapper.
The speedup is probably going to be the same for other compiled languages like C and C++.
However, it is probably much easier to re-write the python wrapper in Go than in C or C++.

i'd investigate the python routes first (minimizing imports, eliding site if possible, getting .pyc files) as those should be low risk/friction to get in, especially with other tooling (like the bisection code) being written in it.  if we need something more drastic, and the toolchain team is fine with Go, then using that as the wrapper language sounds fine to me.  since it'll be a static program, it shouldn't impact the standalone toolchain scenario.
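
For illustration, "minimizing imports" here mostly means deferring heavyweight imports into the code path that actually needs them. A minimal sketch of the pattern (hypothetical code, not the actual wrapper):

import os
import sys

def main(argv):
  if os.environ.get('SYSROOT'):
    # Fast path: nothing heavyweight is needed for a normal compile.
    return 0
  # Slow path: only pay for the logging import when we actually have to
  # report the missing SYSROOT.
  import logging
  logging.basicConfig()
  logging.warning('SYSROOT is not set')
  return 1

if __name__ == '__main__':
  sys.exit(main(sys.argv))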
Sounds like "go" is a good candidate then.  a 3 ms overhead seems acceptable to me (hopefully it can still be <= 5 ms once we add real code?)  laszio@: are you happy owning this?
I tried with .pyc / .pyo and it only saves 4ms per invocation.

sysroot_wrapper.hardened in pyo, 1000 iterations:
real    0m29.747s
user    0m22.044s
sys     0m5.600s

unmodified, 1000 iterations:
real    0m33.762s
user    0m24.272s
sys     0m6.360s

I was just looking at the import statements for the target compiler wrapper.  On the main execution path, we do 3 imports: os, sys and re.  I don't think we can eliminate os or sys, but I think with a little work we can eliminate re; I'm not sure how much performance that would buy.
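
For what it's worth, the patterns involved are simple enough that plain string operations could stand in for re; a rough sketch of the idea (hypothetical code, the actual change may look different):

# Parse a "Key:\tvalue" line from /proc/<pid>/status without importing re;
# roughly what re.match(r'%s:\s*(.*)' % re.escape(item), line) does here.
def match_status_line(line, item):
  if line.startswith(item + ':'):
    return line.partition(':')[2].strip()
  return None

print match_status_line('PPid:\t1234\n', 'PPid')  # prints: 1234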
converting to go is not going to be cheap and done overnight.  the python avenue should be exhausted first imo (although in the end it's up to the toolchain team since they "own" these wrappers).

Comment 18 by lloz...@google.com, Oct 10 2017

I also prefer that we look at improving the python startup. I had also read that using -S could be of help and that eliminating globals can help.

I am not a big fan of rewriting the wrapper at the moment. This is not a trivial effort and everyone in the team has other priorities.

What this bug is mainly complaining about is the serialized execution of the wrapper calling trivial uses of the compiler (e.g. --version), but that is not the general case. So, I agree we should look at improving the startup time of clang and the wrapper, but I don't think we should be working on a full rewrite for now.
> What this bug is mainly complaining about is the serialized execution 
> of the wrapper calling trivial uses of the compiler (eg: --version). 
> but that is not the general case.

That's not what I meant this bug to be about.  The original bug ( bug #767073 ) was dealing with the kernel problem where everything was crazy serial.  We're solving that in the kernel by caching.  

_This_ bug is about a constant 50 ms overhead with every single call to the compiler, even if ccache hits (because the python runs _before_ ccache as per the design of this wrapper).  This 50 ms is costing developers time every time they compile.  This 50 ms is costing our builders time.


> I am not a big fan of rewriting the wrapper at the moment. This is not 
> a trivial effort and everyone in the team has other priorities.

I agree that it's not trivial, but IMHO it's worth it.  We've spent tons of time trying to optimize the build in other ways (to speedup our builders) and IMHO this will have at least as big of a payoff as other things that we've spent lots of time on.


> the python avenue should be exhausted first imo

If there's an easy way to make the python faster then let's do it, but I guess the question is what's "fast enough".

...at the very least it seems like 5 ms of total overhead is probably "fast enough".  Is 20 ms of total overhead "fast enough"?  IMHO no.  Is 10 ms of total overhead "fast enough"?  Hard to say.  laszio@ nicely pointed to  bug #661019  for a previous analysis.  I don't think that one isolated exactly how much overhead was the compiler startup time, but it looks like lots of it was.

squawks-R58-9247: 2h59m37s (chrome by gcc, everything else by clang/shared)
squawks-R58-9255: 2h19m48s (chrome by gcc, everything else by clang/dylib)
squawks-R58-9281: 2h5m48s  (everything by clang/dylib)

...you can see how much these tiny little numbers add up to make our builders slow.

Comment 20 by lloz...@google.com, Oct 10 2017

From some data that laszio collected when we were migrating to LLVM, the average time to generate a .o file for the Chrome browser is 3.5 secs (this is -O2, AFDO and other optimizations).
So, 50/3500 = 0.0142, so the maximum we could save is 1.42 percent.

So, yes, it is inefficient, but the only case where we have seen it really matter is in "configure" steps (serialized + trivial calls to the compiler).

I am sure some people in my team would love to re-implement this, especially now that the door is open for Go.  But I cannot say we are going to re-implement it next week.




Looking over ~200 recent CQ runs (master-paladin:16275 through master-paladin:16478), the average amount of time spent in the BuildPackages step of the slowest slave of each run was 1218 seconds (and the stage has other overhead and isn't just compiling for its duration).  The BuildImage step on the same slowest slave averaged 455 seconds (and has even more non-compile overhead).

Assuming the 1.42% and the numbers I just listed gives ~24s per CQ run as a maximum improvement.  
Importing logging* seems to be quite expensive. It even tries to import socket! Luckily, it's needed only when SYSROOT is not defined:

(sorted by cumulative time)

         1762 function calls (1742 primitive calls) in 0.018 seconds

   Ordered by: cumulative time
   List reduced from 258 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.018    0.018 <string>:1(<module>)
        1    0.002    0.002    0.018    0.018 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:327(main)
        1    0.002    0.002    0.008    0.008 /usr/lib64/python2.7/logging/handlers.py:24(<module>)
        1    0.005    0.005    0.006    0.006 /usr/lib64/python2.7/socket.py:45(<module>)
        1    0.002    0.002    0.005    0.005 /usr/lib64/python2.7/logging/__init__.py:24(<module>)
        1    0.001    0.001    0.002    0.002 /usr/lib64/python2.7/collections.py:1(<module>)
        1    0.000    0.000    0.001    0.001 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:154(log_parent_process_tree)


With SYSROOT defined / not importing logging*:
         507 function calls (501 primitive calls) in 0.001 seconds

   Ordered by: cumulative time
   List reduced from 76 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/gcc-bin/4.9.x/x86_64-cros-linux-gnu-clang:327(main)
        2    0.000    0.000    0.001    0.000 /usr/lib64/python2.7/re.py:138(match)
        2    0.000    0.000    0.001    0.000 /usr/lib64/python2.7/re.py:230(_compile)
        2    0.000    0.000    0.000    0.000 /usr/lib64/python2.7/sre_compile.py:567(compile)

With .pyc and SYSROOT set to /build/chell, the overhead becomes 16.87ms (on my machine).
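
For reference, the cost of the logging import chain itself (logging.handlers is what pulls in socket) can be seen directly with cProfile, e.g. with the chroot's python2 (a quick sketch):

python -c "import cProfile; cProfile.run('import logging.handlers', sort='cumulative')" | head -20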

@20: Good information.  OK, let me get some more data myself too.  I'm not at all confident that the 3.5 seconds to generate a .o is typical for most packages, but perhaps I will be proven to be wrong.  I'll run some tests on the kernel soon.

Do you happen to know if most builders use a cold ccache or a filled ccache?  If they are using a filled ccache then likely the speedup will be quite spectacular since (as per above calculations) the cost to generate a ".o" file from cache is almost all python overhead.

--

@21: I was thinking more of release builders, not paladins.  Paladins only build one package.  Picking a random release builder, like kevin-release, I see that "Build Packages" took 57 minutes.  If we only save 1.42% that will still be less than a minute, but as per above I question the 1.42% number.

--

I'll also note that there are really two motivations here:

1. Speed up interactive builds (often incremental) for individual developers.
2. Speed up builders (and incidentally reduce CPU time cost on builders)

#2 is certainly important to Google and we've spent lots of time on it in the past.

#1 ought to also be important to Google.  Shaving seconds off incremental builds can be a huge boost in productivity.  I'll try to get more real numbers here for kernel builds.
#21, #20: I think this is highly misleading. 3.5s average compile time per source file? Those must be _huge_ source files. I don't think that number is an accurate average for the entire build.

Running "make allmodconfig; make" on the chromeos-4.4 Linux kernel compiles 
~19,000 files. The linker is invoked ~8,000 times, plus there are various other jobs. Total CPU time on my server (using the Ubuntu compiler) is 165m user + 13m sys, or 178m. Even assuming that linking doesn't take any CPU time, that results in an average CPU time of (178*60)/19000 = 0.56s per file. If I randomly assume that a link job takes as much time as a compilation, the time would be (178*60)/27000 = 0.39s per job/file, which seems much more realistic.

Calculating it differently: If each C compiler invocation takes an additional 50ms, that would add 19000 * 50 = 950000ms or 950s CPU time to the build, resulting in an overhead of ~10%.

Clarification for #22:
1. The 16.87ms is measured with -S.
2. With and without -S or pyc, not importing logging* always saves 17ms.
3. -S and .pyc together reduce startup time by 10ms.
#23: Re: ccache in builders

It looks like ccache isn't useful in *-release and *-pfq builders; their chroots are replaced every time. It may be useful in *-paladin and *-incremental, however.

(from chell-chrome-release)
03:23:21: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --buildbot-log-version --nouse-image --create --replace '--chrome_root=/b/c/cbuild/repository/.cache/distfiles/target-master/chrome-src-internal' in /b/c/cbuild/repository
03:23:22: NOTICE: Deleting chroot.
...

ccache isn't stored inside the chroot, it's stored in the common cache dir precisely so it'll survive chroot wipes.

ccache in release builders has been historically useful when we built more than one board at a time, but we no longer do that.  at the time, i checked cold cache builds and made sure that the overhead wasn't significant.  i don't recall where i posted those numbers now though.

we could add a ccache summary to our cbuildbot reporting stage so we can get a sense of running stats.
re c#23:

I mention paladins because those are the ones that really directly affect developers.  Release builds run on a three-times-a-day schedule (with discussions to move it to twice a day) and take many hours, so a few minutes either way doesn't seem as important to me.  I think they also might be dedicated machines where we're paying for them regardless of whether they're idle or active the entire time -- someone on the Infra team can probably indicate one way or the other.

Paladins end up building more than one package -- they'll build packages for all the things that change, and I believe depending on the dependencies, that can trigger other packages to be rebuilt.  Also, paladins have some failure cases where they need to rebuild more things after a previous bad build.

I totally agree that reducing developer build time is important.

I agree that the 1.42% is potentially suspect, but that was the number that was quoted.
- I don't have a reason to doubt or suspect the data collected by Ting-Yuan (laszio) while building Chrome with LLVM. Chrome is C++, makes heavy use of templates, and is more heavily optimized by the compiler than other packages, so for regular C the compiler will be faster. Keep in mind Chrome is the package that takes the longest to build during build_packages.
- The compiler is an expensive "operation". That is why tools like ccache exist. There is nothing surprising in it being much more expensive than a wrapper written in python.
- Ting-Yuan showed that the 50ms time is incorrect. It only happens when SYSROOT is not set, which should not be the case within the chroot (we need to do something to avoid this confusing behavior).
- On #24 groeck mentioned ~10% overhead for kernel sources, but that is based on the 50ms figure, so that should be reduced by half or a little bit more (between 4 and 5%).
- Ting-Yuan showed that he can further reduce the overhead to ~16 ms.

We do care about buildtime. That is something we have improved in the past and during the migration to LLVM. We are working to move to a faster linker (30% or 40% faster) (which is what usually dominates the time in the modify/build developer cycle) and we have an idea to reduce the compile time by 10-15%. Those have bigger impact than rewriting the wrapper.  
> I don't have a reason to doubt or suspect the data collected 
> by Ting-Yuan (laszio) while building Chrome with LLVM.

I wasn't saying this (or if I did, I didn't intend to).  I was more saying that I don't care how long Chrome files take to compile since I don't compile them.  I care how long Linux kernel files take to compile and I believe that those two things may be very different.


> Ting-Yuan showed that the 50ms time is incorrect. It only happens when SYSROOT is not set
> which should not be the case within the chroot (we need to do something to avoid this confusing behavior).

Ah, this I didn't understand from the above!  I see that logging is only imported when SYSROOT is not set, but to me that didn't mean anything.  Thank you for clarifying, as it makes my experiments make much more sense.

...so basically the old code really had an overhead of about 28 ms per call:

rm /b/tip/tmp/time.txt /b/tip/empty.*
touch /b/tip/empty.c
for i in $(seq 100); do
  { time SYSROOT=/build/kevin /usr/bin/aarch64-cros-linux-gnu-clang -c /b/tip/empty.c -o /b/tip/empty.o > /dev/null ; } 2> /b/tip/onetime.txt
  grep real /b/tip/onetime.txt | sed 's/.*0m\([^s]*\)s/\1/' >> /b/tip/tmp/time.txt
done
python -c 'lines=open("/b/tip/tmp/time.txt").readlines(); print sum(float(x) for x in lines)/len(lines)'


> The compiler is an expensive "operation". That is why tools like ccache exist.
> There is nothing surprising in it being much more expensive than a wrapper
> written in python.

Fair enough, but in this case the wrapper is in front of ccache.  ...and the python wrapper is much slower than ccache.


> We do care about buildtime. That is something we have improved in the past and during
> the migration to LLVM. We are working to move to a faster linker (30% or 40% faster)
> (which is what usually dominates the time in the modify/build developer cycle) and we
> have an idea to reduce the compile time by 10-15%. Those have bigger impact than rewriting
> the wrapper.  

OK, sounds great.

=====

FWIW: I did some experiments to see how long a clean kernel build takes on my system with a warm ccache.  Unfortunately the numbers were super noisy.  :(  I basically compared how much most of the speedups discussed here (plus removing "re") helped.  Specifically my changes were:

@@ -1,4 +1,4 @@
-#!/usr/bin/env python2
+#!/usr/bin/python2 -S
 # Copyright (c) 2012 The Chromium OS Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.
@@ -45,7 +45,6 @@
 # compiler invocation.
 
 import os
-import re
 import sys
 
 WRAPPER_ONLY_OPTIONS = set(('-print-cmdline', '-nopie', '-noccache'))
@@ -142,9 +141,8 @@
 def get_proc_status(pid, item):
   with open('/proc/%i/status' % pid) as fp:
     for line in fp:
-      m = re.match(r'%s:\s*(.*)' % re.escape(item), line)
-      if m:
-        return m.group(1)
+      if line.startswith(item + ":"):
+        return line.partition(":")[-1].lstrip()
   return None
 
 
@@ -228,6 +226,7 @@
 # Log 'time' like resource usage, along with corresponding command line, to
 # GETRUSAGE file name if it is defined.
 def log_rusage(log_file, argv0, execargs):
+  return
   import datetime
   import logging
 
@@ -372,7 +371,9 @@
 
   cmdline = [x for x in myargs if x not in WRAPPER_ONLY_OPTIONS]
 
-  if re.match(r'i.86|x86_64', os.path.basename(sys.argv[0])):
+  bn = os.path.basename(sys.argv[0])
+  # Complicated way to look for x86_64 or i.86...
+  if 'x86_64' in bn or 'i386' in bn or 'i686' in bn:
     cmdline.extend(X86_DISABLE_FLAGS)
 
   if not invoke_clang:
@@ -412,7 +413,7 @@
       gcc_cmdline.append(flag)
 
   sysroot = os.environ.get('SYSROOT', '')
-  if not sysroot:
+  if False and not sysroot:
     # Keep these import statements here, rather than making them global, to
     # prevent performance penalties when this execution path is taken (see
     # commend at top of this file).
@@ -464,7 +465,7 @@
     linker = arch + '-ld'
     linker_path = get_linker_path(linker)
     clang_cmdline += ['-B' + linker_path]
-    if re.match(r'i.86', arch):
+    if 'i386' in arch or 'i686' in arch:
       # TODO: -target i686-pc-linux-gnu causes clang to search for
       # libclang_rt.asan-i686.a which doesn't exist because it's packaged
       # as libclang_rt.asan-i386.a. We can't use -target i386-pc-linux-gnu

--

Running just the compiler test, I went from about 28 ms per invocation to about 13 ms per invocation, so saving 15 ms.

--

rm /tmp/testlog.txt
CCACHE_DIR=/var/cache/chromeos-cache/distfiles/ccache/ ccache -s >> /tmp/testlog.txt
for i in $(seq 5); do
  rm -rf /build/kevin/var/cache/portage/sys-kernel/chromeos-kernel-4_4
  { time FEATURES="-buildpkg" USE="-kmemleak lockdebug kgdb vtconsole -firmware_install" emerge-${BOARD} --nodeps chromeos-kernel-4_4 ; } 2>> /tmp/testlog.txt
done
CCACHE_DIR=/var/cache/chromeos-cache/distfiles/ccache/ ccache -s >> /tmp/testlog.txt

--

I got this before:

real    1m19.339s
user    2m47.252s
sys     3m2.716s

real    1m16.349s
user    2m48.116s
sys     3m5.464s

real    1m21.457s
user    2m47.712s
sys     3m5.328s

real    1m18.416s
user    2m49.364s
sys     3m5.924s

real    1m13.730s
user    2m46.384s
sys     3m2.928s

And after:

real    1m21.196s
user    2m34.656s
sys     3m5.372s

real    1m16.896s
user    2m35.400s
sys     2m59.708s

real    1m18.910s
user    2m35.580s
sys     3m3.736s

real    1m16.814s
user    2m35.076s
sys     3m5.160s

real    1m18.987s
user    2m34.392s
sys     3m1.724s

---

user time went down by about 13 seconds per build, but since the kernel build is so parallel and I have a beefy workstation it didn't really save any wall-clock time.

So in terms of just an incremental kernel build the numbers agree: we're chewing up some time here and it would be nice to speed it up, but it doesn't seem to make a major impact on incremental build time, at least in the specific test above.  It does give a ~6% decrease in CPU usage in that particular test, but that's not huge.

=====

Given all the above it seems sane to say:

* Let's take the low hanging fruit.
* Let's leave it in python for now.
* Eventually it would be nice to get this in Go and save a few more CPU cycles, but that could easily be a P3.


Sound good?
Thanks for the detailed analysis, Doug.

I agree with your conclusions. 

I will leave the bug open so that we implement the low hanging fruit.
And will file a separate bug to implement the wrapper in Go.
Filed issue 773875 for the Go compiler wrapper implementation. 
Everything except the "pyc" / "pyo" conversion is easy.  Should I post a patch for that?

For the "pyc" / "pyo" conversion I'm not sure how we're going to do it.  In theory you could try to do something like this in the chroot, I guess:

  https://stackoverflow.com/questions/21825910/shebang-for-compiled-python-code
I've uploaded these two, but I'm not 100% sure how to test them since it's not immediately clear to me how these get rolled out.  I can dig, but I also figure someone here probably just knows.

I tested the bits of python here in isolation and also manually modified my own "sysroot_wrapper.hardened" in the chroot.

* https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/719258 
  compiler wrappers: Update wrapper to not use "env" and invoke "python -S"        
* https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/719259 
  compiler wrappers: Avoid use of 're.'        


wrt pyc, it should just be renaming the file to have a py suffix and calling an optimize helper in the ebuild
Comment 36 by bugdroid1@chromium.org, Oct 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/cc828a5bad5e82b46e2b48e26daf3f363251ad28

commit cc828a5bad5e82b46e2b48e26daf3f363251ad28
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:04:28 2017

cros_setup_toolchain: Munge the shebang back to use "env"

In CL:719258 we're optimizing the shebang of the toolchain wrappers.
...but we can't necessarily use this optimized version outside the
chroot.  Munge it back.

BUG=chromium:773138
TEST=See that things get munged back

Change-Id: I08e280bca46cd2f36049502cd87466d4f581895f
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/719719
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>

[modify] https://crrev.com/cc828a5bad5e82b46e2b48e26daf3f363251ad28/scripts/cros_setup_toolchains.py

Comment 37 by bugdroid1@chromium.org, Oct 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0

commit 64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:05:32 2017

compiler wrappers: Update wrappers to not use "env" and invoke "python -S"

Let's speedup the python wrapper a little bit.  We know where python
is and we also know that it's OK to execute this script raw (AKA skip
including site packages).

This speeds up the execution of the script by about 7 ms (on my system
from ~29 ms to ~22 ms).

NOTE: this change is partially undone when outside the chroot by
CL:719719

CQ-DEPEND=CL:719719
BUG=chromium:773138
TEST=See bug

Change-Id: I7bf64306655a05dc5661a93136b694e78b549544
Reviewed-on: https://chromium-review.googlesource.com/719258
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>

[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/llvm/files/clang_host_wrapper.header
[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/gcc/gcc-9999.ebuild
[rename] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/llvm/llvm-5.0_pre305632_p20170806-r6.ebuild
[modify] https://crrev.com/64d4d5aa4ca7330b2bce96adbbbc1aa2e6e097f0/sys-devel/gcc/files/sysroot_wrapper.hardened.header

Comment 38 by bugdroid1@chromium.org, Oct 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/54f79daa269d025d114e029c1fbf1de2f4e39395

commit 54f79daa269d025d114e029c1fbf1de2f4e39395
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Oct 16 10:05:32 2017

compiler wrappers: Avoid use of 're.'

Our use of regular expressions is quite basic.  Let's avoid using them
and stick to string manipulation.  This avoids 'import re' which will
slightly speed up execution time.

My measurements were a bit noisy, but this seemed to shave ~2 ms on
the execution time of this script.

BUG=chromium:773138
TEST=See bug

Change-Id: Ic69cb8356acd19486513f20355f1bede6906c70d
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/719259
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[rename] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/llvm-5.0_pre305632_p20170806-r7.ebuild
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/gcc/files/sysroot_wrapper.hardened.body
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/gcc/gcc-9999.ebuild
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/files/wrapper_script_common
[modify] https://crrev.com/54f79daa269d025d114e029c1fbf1de2f4e39395/sys-devel/llvm/files/clang_host_wrapper.body

> wrt pyc, it should just be renaming the file to have a py suffix and 
> calling an optimize helper in the ebuild

Ah, it looks like the chroot already has all the magic in it to directly execute .pyc files!  I didn't expect that.

...but I don't actually see a speedup here.  The results are a bit noisy for me, but it seems like it might actually be slower!  Can anyone else confirm that?
#39: How many iterations did you try? With 100 I also got somewhat unstable results. 1000 looks pretty stable to me.

I implemented the .pyc approach in the following CL and got results consistent with #25. Based on Doug's patches, the overhead of the wrapper is now less than 15.5ms.

$ time.sh x86_64-cros-linux-gnu-gcc -noccache
real 0m15.509s
user 0m9.764s
sys  0m2.932s

https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/722671


time.sh
---
#!/bin/bash

export SYSROOT=/build/chell
run() {
  for i in $(seq 1000); do
    $* &> /dev/null
  done
}

time run $*
tl;dr: My numbers continue to show that .pyc / .pyo is slower than .py

---

Oh, doh!  I think I went back to 100 iterations by accident.  OK, at 1000 it's more consistent.  ...but I'm still seeing weirdness in that the .pyc / .pyo is often slower.  As far as I can tell, though, there's some sort of bi-modality going on.  When running the same tests sometimes I see good numbers and other times bad.  When it's in a certain mode it tends to stay there.


Here are some examples.  I first did this to prep:

--- 

file /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/sysroot_wrapper.hardened
cp /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/gcc-bin/4.9.x/sysroot_wrapper.hardened /tmp/wrapper.py
python -m compileall /tmp/wrapper.py
chmod +x /tmp/wrapper.pyc 
python -OO -m compileall /tmp/wrapper.py
chmod +x /tmp/wrapper.pyo

---

I then ran the attached script twice.  As you can see, it's not consistent...

. /b/tip/tmp/timetests ; . /b/tip/tmp/timetests

===== py =====
0.021915
0.0219
0.022227
===== pyc =====
0.025591
0.025673
0.0229113333333
===== pyo =====
0.017517
0.0173945
0.017056
===== py =====
0.016318
0.0152835
0.0151203333333
===== pyc =====
0.017062
0.0165925
0.0165496666667
===== pyo =====
0.017185
0.018336
0.0202573333333

---

On a hunch, I decided that perhaps this is a cpufreq problem where the system isn't properly going up to a fast enough cpu frequency.  I ran this outside the chroot:

cd /sys/devices/system/cpu
for c in cpu*; do echo performance > $c/cpufreq/scaling_governor; done
grep "" */*/scaling_governor

...then I re-ran:

. /b/tip/tmp/timetests ; . /b/tip/tmp/timetests

Now the tests are consistent.  ...and the compiled tests still show as slower.

===== py =====
0.010385
0.0104075
0.0103703333333
===== pyc =====
0.011453
0.0115235
0.011499
===== pyo =====
0.011407
0.0113465
0.0113786666667
===== py =====
0.010302
0.0101925
0.0101536666667
===== pyc =====
0.011175
0.0112485
0.0113123333333
===== pyo =====
0.011491
0.0114745
0.0114233333333



===

I also tried changing the above to use "-noccache" like you did.  That's in "timetests2".  Again compiled tests are slower.

. /b/tip/tmp/timetests2 ; . /b/tip/tmp/timetests2 
===== py =====
clang-5.0: error: no input files
0.026692
0.0264095
0.0265666666667
===== pyc =====
clang-5.0: error: no input files
0.027913
0.028025
0.0280016666667
===== pyo =====
clang-5.0: error: no input files
0.027973
0.027863
0.0278026666667
===== py =====
clang-5.0: error: no input files
0.026324
0.026315
0.026249
===== pyc =====
clang-5.0: error: no input files
0.027406
0.0274195
0.0274873333333
===== pyo =====
clang-5.0: error: no input files
0.027845
0.028
0.0280426666667

=====

Don't forget to go back to "powersave" governor when you're done.

cd /sys/devices/system/cpu
for c in cpu*; do echo powersave > $c/cpufreq/scaling_governor; done
grep "" */*/scaling_governor




Attachments: timetests (2.0 KB), timetests2 (1.8 KB)
Closing as Fixed since this is as good as we'll get w/ python, I think.  Future work:

* Make incremental kernel builds faster:  bug #767073 
* Move wrapper to Go: bug #773875
* Make launching clang faster: bug #773142

Status: Fixed (was: Assigned)
I can reproduce what Doug saw: when invoking the pyc/pyo directly, it is slower. It is faster when imported and invoked in another python script. So my CL is still valid :)
Status: Assigned (was: Fixed)
@44: nice.  OK, leaving open then.
I cannot find the specification of when pyc is used. According to strace, it looks like pyc is used only when

1) it is called directly.
2) the module is imported in some other python codes.

However, with 1) you can't specify a shebang (at least I don't know how), so the '-S' trick is missing. I think this is the reason why Doug saw pyc as slower: pyc without '-S' is slower than py + '-S'. When implemented using 2), pyc + '-S' is faster than py + '-S'.
So it looks like having a dummy wrapper that imports the main module is the way to go. Can someone give me +2?
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/722671
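
Roughly, the idea is a tiny launcher that keeps the '#!/usr/bin/python2 -S' shebang while the bulk of the wrapper lives in a module whose .pyc gets used on import. A minimal sketch of the shape (the module and function names here are made up; see the CL for the real thing):

#!/usr/bin/python2 -S
# Thin launcher: the real work lives in a separate module, which python
# loads from its byte-compiled .pyc, while this stub keeps the -S shebang.
import sys
import sysroot_wrapper_body
sys.exit(sysroot_wrapper_body.main(sys.argv))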
> pyc without '-S' is slower than py + '-S'. When implementing using 2), pyc + '-S' is faster than py + '-S'.

Oh!  This makes total sense!


Components: Build
Components: Tools>ChromeOS-Toolchain
Owner: g...@chromium.org
laszio is on a different team now, so it's quite unlikely that he'll get to this. I'll take it in his place, though no promises about seeing substantial progress in the near future. :)
