Measure goma overhead and try to reduce it
Issue description

With the switch to clang-cl on Windows, the calculus for how long it takes to build Chrome, and how powerful a machine that requires, has changed (currently just for Googlers). A very powerful machine is still needed, but it is worth assessing how much of that is a genuine requirement and how much is avoidable overhead. That is, where is all of the CPU time on the local machine going? Are there opportunities to reduce it so that single-socket machines or laptops can build Chrome much more quickly?

One potential end state would be doing entirely cloud builds - even doing the linking remotely - but this bug is intended to focus on incremental improvements: measuring the CPU overhead of linking, python, gomacc, and other process types, and seeing whether there are improvements that would help on a less powerful machine, such as a four-core laptop.

I will do some initial research with ETW profiling to find the distribution of CPU time by process type, and then do some rough profiling of the main consumers of CPU time.
Comment 1 by brucedaw...@chromium.org, Apr 2 2018

Measurements of CPU usage on a Z620 using goma with local fallback disabled (set GOMA_MAX_SUBPROCS=0). Key results:

- compiler_proxy.exe uses ~70-71% of CPU time (110-118 CPU minutes) in non-jumbo builds.
- compiler_proxy.exe uses ~57-58% of CPU time (56-58 CPU minutes) in jumbo builds.
- python.exe is the next biggest consumer.
- lld's CPU time is irrelevantly small.

Jumbo build settings:

is_clang = true
use_lld = true
is_debug = true
is_component_build = true
enable_nacl = false
target_cpu = "x86"
remove_webcore_debug_symbols = true
use_jumbo_build = true
use_goma = true
symbol_level = 2

Jumbo goma component debug build, run 1:

Process Name     CPU Usage (ms)   CPU %   CPU minutes
compiler_proxy      3,507,389.3   56.5%            58
python.exe          1,454,740.6   23.4%            24
clang-cl.exe          580,180.1    9.3%            10
gomacc.exe            282,257.5    4.5%             5
lld-link.exe          212,914.3    3.4%             4
ninja.exe             120,113.8    1.9%             2
yasm.exe               51,267.0    0.8%             1
Total               6,208,862.7                   103

Jumbo goma component debug build, run 2:

compiler_proxy      3,382,033.8   57.9%            56
python.exe          1,433,059.7   24.6%            24
clang-cl.exe          352,528.8    6.0%             6
gomacc.exe            288,577.5    4.9%             5
lld-link.exe          215,744.3    3.7%             4
ninja.exe             120,321.4    2.1%             2
yasm.exe               44,031.5    0.8%             1
Total               5,836,297.1                    97

Non-jumbo goma component debug build, run 1 (settings as above but use_jumbo_build = false):

compiler_proxy      7,072,012.5   70.8%           118
python.exe          1,387,924.6   13.9%            23
clang-cl.exe          570,344.7    5.7%            10
gomacc.exe            483,462.9    4.8%             8
lld-link.exe          256,532.4    2.6%             4
ninja.exe             175,663.6    1.8%             3
yasm.exe               49,204.3    0.5%             1
Total               9,995,145.2                   167

Non-jumbo goma component debug build, run 2:

compiler_proxy      6,605,814.5   69.6%           110
python.exe          1,341,109.5   14.1%            22
clang-cl.exe          578,689.0    6.1%            10
gomacc.exe            488,166.1    5.1%             8
lld-link.exe          253,176.5    2.7%             4
ninja.exe             174,206.3    1.8%             3
yasm.exe               47,263.5    0.5%             1
Total               9,488,425.5                   158

If compiler_proxy.exe's CPU usage could be reduced significantly then this would probably reduce build times significantly on laptops, and slightly on Z840-class machines.

Comment 2, Apr 2 2018
I've attached the batch file and .wpaProfile file that I used to generate the results shown above. The batch file needs to be run from an administrator command prompt, in a chromium\src directory, with out\debug_component set to the desired gn args. It assumes that xperf.exe is installed - downloading and running UIforETW is the simplest way to get it. The .wpaProfile file needs to be in the same directory as the batch file.

The generated traces contain exact CPU usage information, but no call stacks or CPU sampling data. In order to determine *where* CPU time is being spent it is necessary to use CPU sampling with call stacks. This is best done by building a smaller target (such as base) and using UIforETW with context-switch call stacks disabled to record all CPU usage during the build. Trying to do this for a full build of Chrome will work poorly because the trace file will be impossibly large.
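[Editor's note: for readers without the attachment, here is a minimal sketch of the aggregation step, assuming the trace's CPU usage table has been exported from WPA to CSV. The file name ("cpu_usage.csv") and column headers are assumptions, not the actual etw_build_cpuusage.bat output format - adjust to match your export.]

```python
# Sketch: aggregate CPU time per process from a WPA CSV export.
# Assumes columns named "Process" and "CPU Usage (in view) (ms)";
# these names are assumptions and may differ in your WPA version.
import csv
from collections import defaultdict

totals = defaultdict(float)
with open("cpu_usage.csv", newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        # Strip the PID suffix, e.g. "gomacc.exe (1234)" -> "gomacc.exe".
        name = row["Process"].split(" (")[0]
        totals[name] += float(row["CPU Usage (in view) (ms)"].replace(",", ""))

grand_total = sum(ms for proc, ms in totals.items() if proc != "Idle")
for proc, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    if proc == "Idle":
        continue
    print(f"{proc:24} {ms:>15,.2f} ms  {ms / 60000:>4.0f} m  {ms / grand_total:6.2%}")
```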
Apr 10 2018
Hi Yoshisato, do you know if anyone has some cycles to look at CPU usage of Goma client with Bruce's profile and script? Not high priority but would be nice to improve.
Apr 10 2018
I'm working on improving include processor performance. The include processor consumes more than 60% of compiler_proxy's CPU time. So far I've improved its performance by about 2.5x on Linux and 1.5x on Windows. (I'm targeting 3x on Linux and 2x on Windows.)
Apr 10 2018
Might be related: http://b/67019832, http://b/74519601
Apr 24 2018
I believe the upstream bug is b/69675454.
Apr 25 2018
I also measured CPU usage with etw_build_cpuusage.bat (as in Comment 2), using the head version of goma, on my Z840 running Windows 10. Since I've improved the include processor a lot, compiler_proxy.exe's CPU usage is now much smaller than in Comment 1.

Raw data:

Idle                 25,204,825.11
compiler_proxy.exe    2,372,393.95
python.exe            1,957,796.76
System                  741,542.59
lld-link.exe            519,036.00
svchost.exe             491,740.64
gomacc.exe              452,197.88
clang-cl.exe            391,321.34
Parity.exe              362,797.00
ninja.exe               245,493.15
MsMpEng.exe             218,245.74
winhelm.exe             155,423.61
conhost.exe              42,454.60
yasm.exe                 34,474.46

With several items removed (as etw_build_cpuusage.bat does):

Process Name        CPU Usage [ms]   CPU [m]   CPU [%]
compiler_proxy.exe    2,372,393.95        39    37.45%
python.exe            1,957,796.76        32    30.90%
lld-link.exe            519,036.00         8     8.19%
gomacc.exe              452,197.88         7     7.14%
clang-cl.exe            391,321.34         6     6.18%
Parity.exe              362,797.00         6     5.73%
ninja.exe               245,493.15         4     3.87%
yasm.exe                 34,474.46         0     0.54%
------------------------------------------------------
Total                 6,335,510.54       102

My args.gn is like the following (I should have followed Comment 2's args.gn...):

is_debug = true
use_goma = true
goma_dir = "C:\src\goma\client\out\ReleaseClang"
is_component_build = true
use_lld = true

Actually, while building chrome, only about 20% of the CPU is used on my Z840.
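[Editor's note: the ~20% figure follows directly from the raw data above - a quick back-of-the-envelope check, with the numbers copied from the table:]

```python
# Rough machine utilization during the build, from the raw data above:
# busy CPU time across all processes vs. idle time over the same interval.
idle_ms = 25_204_825.11
busy_ms = sum([
    2_372_393.95,  # compiler_proxy.exe
    1_957_796.76,  # python.exe
    741_542.59,    # System
    519_036.00,    # lld-link.exe
    491_740.64,    # svchost.exe
    452_197.88,    # gomacc.exe
    391_321.34,    # clang-cl.exe
    362_797.00,    # Parity.exe
    245_493.15,    # ninja.exe
    218_245.74,    # MsMpEng.exe
    155_423.61,    # winhelm.exe
    42_454.60,     # conhost.exe
    34_474.46,     # yasm.exe
])
print(f"Utilization: {busy_ms / (busy_ms + idle_ms):.1%}")  # ~24%
```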
Apr 25 2018
With Comment 1's args.gn (without jumbo build) on my Z840:

Process Name        CPU Usage (ms)   CPU [m]   CPU [%]
python.exe            1,981,571.07        33    33.04%
compiler_proxy.exe    1,798,492.31        29    29.98%
clang-cl.exe            498,163.85         8     8.31%
Parity.exe              497,795.64         8     8.30%
lld-link.exe            472,120.15         7     7.87%
gomacc.exe              463,617.72         7     7.73%
ninja.exe               239,867.85         3     4.00%
yasm.exe                 46,364.04         0     0.77%
Total                 5,997,992.63        95

Now python.exe consumes more CPU than compiler_proxy.exe. Most of this work will be included in goma version 154.
Apr 25 2018
Very nice! From ~6,500 s down to ~1,800 s is incredible. That seems like much more savings than expected. Any guesses why? I take it that these were the settings you were using?

is_clang = true
use_lld = true
is_debug = true
is_component_build = true
enable_nacl = false
target_cpu = "x86"
remove_webcore_debug_symbols = true
use_goma = true
symbol_level = 2

For reference, here are the best timings I got with those settings previously, on a slightly different machine, but it should be the right order of magnitude. Note in particular that my compiler_proxy.exe numbers were much higher while my numbers for other process types were generally lower, suggesting that if anything my processors are faster:

Non-jumbo goma component debug build:

Process Name     CPU Usage (ms)   CPU %   CPU minutes
compiler_proxy      6,605,814.5   69.6%           110
python.exe          1,341,109.5   14.1%            22
clang-cl.exe          578,689.0    6.1%            10
gomacc.exe            488,166.1    5.1%             8
lld-link.exe          253,176.5    2.7%             4
ninja.exe             174,206.3    1.8%             3
yasm.exe               47,263.5    0.5%             1
Total               9,488,425.5                   158

I look forward to goma 154 - any guesses on when it will ship?
Apr 25 2018
> Any guesses why?

The reason is just the improvement of the include processor, which enumerates which files should be uploaded to the goma server. The whole list is in b/77663820, but the main points are these two:

- Cache the parsed syntax of header files: build a kind of AST for each header file and cache it, so we don't need to parse #define, #if, etc. twice. This improvement is critical (2x or more speedup). A toy sketch of this idea appears below.
- Improve macro expansion: implemented a faster macro expansion (for #if conditions). It covers a narrower range than the existing fast-path algorithm, but in chrome the covered range doesn't change, and it's 60% faster than the existing one (maybe a 1.3x-1.5x speedup).

goma version 154 was actually shipped today, but unfortunately it was reverted since the mac builder didn't have the AppKit python module X( (crbug.com/836696). I hope it will be available again once the AppKit python module is installed on the mac builders. If you want to try it anyway, you can download the 2018-04-25 version of goma from https://pantheon.corp.google.com/storage/browser/goma-buildbot-archive/b67a5f6bd529ee41b2c6e1bbd29e8e3116bafa0d/gn/
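[Editor's note: to make the header-caching point above concrete, here is a toy sketch. This is not goma's implementation (the actual include processor is C++, detailed in b/77663820); all names, the INCLUDE_DIRS search path, and the trivial resolver are invented for illustration. The key property is that each header's directives are parsed once and replayed from cache on every later include, instead of being re-tokenized per translation unit.]

```python
# Toy sketch of the cached header "AST" idea; invented names throughout.
import os
import re
from functools import lru_cache

INCLUDE_DIRS = ["."]  # assumption: search path for includes

DIRECTIVE_RE = re.compile(
    r'^\s*#\s*(include|define|undef|ifdef|ifndef|if|elif|else|endif)\b(.*)')

@lru_cache(maxsize=None)
def parse_directives(path):
    """Parse a header's preprocessor directives once; later includes of the
    same file replay this cached flat list instead of re-tokenizing."""
    directives = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = DIRECTIVE_RE.match(line)
            if m:
                directives.append((m.group(1), m.group(2).strip()))
    return tuple(directives)

def resolve_include(arg):
    """Trivial resolver: map '"foo.h"' or '<foo.h>' to a path on disk."""
    name = arg.strip().strip('"<>')
    for d in INCLUDE_DIRS:
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None

def walk_includes(path, seen):
    """Discover transitive includes by replaying cached directives. Macro
    evaluation of #if conditions (goma's second speedup) is elided here."""
    if path in seen:
        return
    seen.add(path)
    for kind, rest in parse_directives(path):
        if kind == "include":
            target = resolve_include(rest)
            if target:
                walk_includes(target, seen)
```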
May 8 2018
Goma profiling has shown that memory allocations (and frees) consume a significant amount of CPU time in compiler_proxy.exe. The effect is worse on Windows than on other platforms. One explanation would be that the Windows heap is less efficient. While this is possible (and who doesn't love some good Microsoft bashing) it's not totally convincing - the Windows heap is pretty good in most cases. Another explanation is that maybe we are doing more allocations on Windows. Since compiler_proxy.exe allocates for each token, including #defines, the inclusion of Windows.h with its thousands of #defines could be the root cause of these OS differences.

Conveniently enough I did some work a few months ago to reduce the usage of Windows.h (crbug.com/796644). By undoing some of this work I can test how it affects CPU usage in compiler_proxy.exe. I modified my test batch file to just build base (I also made it more stand-alone - it generates an out\goma_test directory with my test args) and ran some tests with and without a #include of Windows.h added to base\time\time.h. Over multiple runs on a 16-core/32-thread workstation I saw fairly consistent improvements:

Default: ~16 CPU seconds for compiler_proxy.exe, 124 translation units include Windows.h
With extra Windows.h: ~19 CPU seconds for compiler_proxy.exe, 247 translation units include Windows.h

That's about an extra 24 ms per translation unit. If we extrapolate that to the ~10,000 avoidable includes of Windows.h when building the Chrome target (currently ~14,000, probably reducible to ~4,000) then that's a savings of about 240 s of CPU time, or four minutes, which is not a trivial amount, especially on the laptop I am using these days.

These tests were done with goma version 155, updated today. New batch file and same-old .wpaProfile file attached.
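[Editor's note: one way to reproduce the "how many translation units include Windows.h" count - a sketch, not the method the comment used - is to walk ninja's deps log after a completed build. The out\goma_test directory name is taken from the comment above; everything else here is an assumption.]

```python
# Sketch: count translation units whose recorded dependencies include
# windows.h, using "ninja -t deps" (requires a completed build so the
# deps log is populated). Output format: a non-indented "target: #deps N,
# ..." header line followed by indented dependency paths.
import subprocess

out = subprocess.run(
    ["ninja", "-C", r"out\goma_test", "-t", "deps"],
    capture_output=True, text=True, check=True).stdout

count = 0
current_has_windows_h = False
for line in out.splitlines():
    if line and not line.startswith(" "):   # a new "foo.obj: #deps N" record
        count += current_has_windows_h
        current_has_windows_h = False
    elif "windows.h" in line.lower():        # an indented dependency path
        current_has_windows_h = True
count += current_has_windows_h
print(f"{count} translation units depend on windows.h")

# The comment's extrapolation: ~24 ms of extra compiler_proxy CPU per TU
# that includes Windows.h, times ~10,000 avoidable includes:
print(f"Estimated savings: {0.024 * 10_000:.0f} CPU seconds")  # ~240 s
```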