
Issue 819319


Issue metadata

Status: Assigned
Pri: 3
Type: Bug

Blocking: issue 787983




Measure goma overhead and try to reduce it

Project Member Reported by brucedaw...@chromium.org, Mar 6 2018

Issue description

With the switch to clang-cl on Windows, the calculus for how long it takes to build Chrome and how powerful a machine that takes has changed (currently just for Googlers). A very powerful machine is still needed, but it is worth assessing how much of that is actually necessary and how much is avoidable overhead.

That is, where is all of the CPU time on the local machine going? Are there any opportunities to reduce this so that single-socket machines or laptops can build Chrome much more quickly?

One potential end-state would be doing entirely cloud builds - even doing the linking remotely - but this bug is intended to focus on incremental improvements, such as measuring the CPU overhead of linking, python, gomacc, and other process types, and seeing whether there are improvements that would help on a less powerful machine, such as a four-core laptop.

I will do some initial research with ETW profiling to find out the distribution of CPU time by process type and then do some rough profiling of the main consumers of CPU time.

 
Measurements of CPU usage on a Z620 using goma with local fallback disabled (set GOMA_MAX_SUBPROCS=0).

Key results:
compiler_proxy.exe uses ~70-71% of CPU time (110-118 CPU minutes) in non-jumbo builds.
compiler_proxy.exe uses ~57-58% of CPU time (56-58 CPU minutes) in jumbo builds.
python.exe is the next biggest consumer.
lld's CPU time is negligible.

args.gn for the jumbo builds:

is_clang = true
use_lld = true
is_debug = true
is_component_build = true
enable_nacl = false
target_cpu = "x86"
remove_webcore_debug_symbols = true
use_jumbo_build = true
use_goma = true
symbol_level = 2

The non-jumbo build settings are as above but with use_jumbo_build = false.

Jumbo goma component debug build (run 1):

Process Name	CPU Usage (ms)	CPU %	CPU minutes
compiler_proxy	3,507,389.3	56.5%	58
python.exe	1,454,740.6	23.4%	24
clang-cl.exe	580,180.1	9.3%	10
gomacc.exe	282,257.5	4.5%	5
lld-link.exe	212,914.3	3.4%	4
ninja.exe	120,113.8	1.9%	2
yasm.exe	51,267.0	0.8%	1
Total	6,208,862.7		103

Jumbo goma component debug build (run 2):

Process Name	CPU Usage (ms)	CPU %	CPU minutes
compiler_proxy	3,382,033.8	57.9%	56
python.exe	1,433,059.7	24.6%	24
clang-cl.exe	352,528.8	6.0%	6
gomacc.exe	288,577.5	4.9%	5
lld-link.exe	215,744.3	3.7%	4
ninja.exe	120,321.4	2.1%	2
yasm.exe	44,031.5	0.8%	1
Total	5,836,297.1		97

Non-jumbo goma component debug build (run 1):

Process Name	CPU Usage (ms)	CPU %	CPU minutes
compiler_proxy	7,072,012.5	70.8%	118
python.exe	1,387,924.6	13.9%	23
clang-cl.exe	570,344.7	5.7%	10
gomacc.exe	483,462.9	4.8%	8
lld-link.exe	256,532.4	2.6%	4
ninja.exe	175,663.6	1.8%	3
yasm.exe	49,204.3	0.5%	1
Total	9,995,145.2		167

Non-jumbo goma component debug build (run 2):

Process Name	CPU Usage (ms)	CPU %	CPU minutes
compiler_proxy	6,605,814.5	69.6%	110
python.exe	1,341,109.5	14.1%	22
clang-cl.exe	578,689.0	6.1%	10
gomacc.exe	488,166.1	5.1%	8
lld-link.exe	253,176.5	2.7%	4
ninja.exe	174,206.3	1.8%	3
yasm.exe	47,263.5	0.5%	1
Total	9,488,425.5		158


If compiler_proxy.exe's CPU usage could be reduced then build times would probably drop significantly on laptops, and slightly on Z840-class machines.

I've attached the batch file and .wpaProfile file that I used to generate the results shown above. The batch file needs to be run from an administrator command prompt, in a chromium\src directory, with out\debug_component configured with the desired gn args. It assumes that xperf.exe is installed - downloading and running UIforETW is the simplest way to get it. The .wpaProfile file needs to be in the same directory as the batch file.

The generated traces contain exact CPU usage information, but no call stacks or CPU sampling data. In order to determine *where* CPU time is being spent it is necessary to use CPU sampling with call stacks. This is best done by building a smaller target (such as base) and using UIforETW with context switch call stacks disabled to record all CPU usage during the build. Trying to do this for a full build of Chrome works poorly because the trace file becomes unmanageably large.

CPUSummaryByProcessName.wpaProfile (75.7 KB)
etw_build_cpuusage.bat (3.8 KB)

Comment 3 by ukai@chromium.org, Apr 3 2018

Components: Infra>Goma

Comment 4 by estaab@chromium.org, Apr 10 2018

Cc: brucedaw...@chromium.org
Owner: yyanagisawa@chromium.org
Hi Yoshisato, do you know if anyone has some cycles to look at CPU usage of Goma client with Bruce's profile and script? Not high priority but would be nice to improve.
Owner: shinyak@chromium.org
I'm working on improving include processor performance. The include processor consumes more than 60% of compiler_proxy's CPU time.
So far I've improved its performance by around x2.5 on Linux and x1.5 on Windows.
(I'm targeting x3 on Linux, x2 on Win.)
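
For context, the include processor enumerates every file a compile will read, so that the goma client knows what to upload to the compile server. A minimal C++ sketch of the idea, with entirely hypothetical names (this is not goma's actual code):

#include <set>
#include <string>
#include <vector>

// Hypothetical helpers (definitions omitted): extract the #include
// directives from one file, and resolve a header name against the
// include search path.
std::vector<std::string> ParseIncludeDirectives(const std::string& path);
std::string Resolve(const std::string& header, const std::string& includer);

// Starting from a source file, transitively collect every header it
// includes; the resulting set is what gets uploaded. Parsing each
// header's directives and evaluating #if/#ifdef conditions along the
// way is where the CPU time goes.
std::set<std::string> CollectRequiredFiles(const std::string& source) {
  std::set<std::string> needed;
  std::vector<std::string> pending = {source};
  while (!pending.empty()) {
    std::string file = pending.back();
    pending.pop_back();
    if (!needed.insert(file).second) continue;  // already collected
    for (const std::string& header : ParseIncludeDirectives(file))
      pending.push_back(Resolve(header, file));
  }
  return needed;
}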

Comment 6 by ukai@chromium.org, Apr 10 2018

might be related
 http://b/67019832
 http://b/74519601

I believe this is upstream b/69675454
I also measured CPU usage with etw_build_cpuusage.bat (as in Comment 2), using the head version of goma, on my Z840 running Windows 10.

Since I've improved the include processor a lot, compiler_proxy.exe's CPU usage is now much smaller than in Comment 1.

Raw data
Idle			25,204,825.11
compiler_proxy.exe	2,372,393.95
python.exe		1,957,796.76
System			741,542.59
lld-link.exe		519,036.00
svchost.exe		491,740.64
gomacc.exe		452,197.88
clang-cl.exe		391,321.34
Parity.exe		362,797.00
ninja.exe		245,493.15
MsMpEng.exe		218,245.74
winhelm.exe		155,423.61
conhost.exe		42,454.60
yasm.exe		34,474.46

Several items removed (as etw_build_cpuusage.bat does):
Process Name		CPU Usage [ms] 	CPU [m]	CPU [%]
compiler_proxy.exe	2,372,393.95	39	37.45%
python.exe		1,957,796.76	32	30.90%
lld-link.exe		519,036.00	8	8.19%
gomacc.exe		452,197.88	7	7.14%
clang-cl.exe		391,321.34	6	6.18%
Parity.exe		362,797.00	6	5.73%
ninja.exe		245,493.15	4	3.87%
yasm.exe		34,474.46	0	0.54%
----------------------------------------------------------
Total			6,335,510.54	102

My args.gn is as follows (I should have followed Comment 2's args.gn...):
is_debug = true
use_goma = true
goma_dir = "C:\src\goma\client\out\ReleaseClang"
is_component_build = true
use_lld = true

Actually, while building chrome, only about 20% of the CPU is used on my Z840.


With Comment 1's args.gn (without jumbo build) on my Z840.

Process Name		CPU Usage (ms) 	CPU [m]	CPU [%]
python.exe		1,981,571.07	33	33.04%
compiler_proxy.exe	1,798,492.31	29	29.98%
clang-cl.exe		498,163.85	8	8.31%
Parity.exe		497,795.64	8	8.30%
lld-link.exe		472,120.15	7	7.87%
gomacc.exe		463,617.72	7	7.73%
ninja.exe		239,867.85	3	4.00%
yasm.exe		46,364.04	0	0.77%
			
Total			5,997,992.63	95


Now python.exe consumes more CPU than compiler_proxy.exe.

Most of this effort will be included in goma version 154.

Very nice! Going from ~6,600 s down to ~1,800 s is incredible. That's much more savings than I expected. Any guesses why? I take it that these were the settings you were using?

is_clang = true
use_lld = true
is_debug = true
is_component_build = true
enable_nacl = false
target_cpu = "x86"
remove_webcore_debug_symbols=true
use_goma = true
symbol_level = 2

For reference, here are the best timings I got with those settings previously. They were measured on a slightly different machine, but should be the right order of magnitude. Note in particular that my compiler_proxy.exe numbers were much higher while my numbers for other process types were generally lower, suggesting that if anything my processors are faster:

Non-jumbo goma component debug build
compiler_proxy	6,605,814.5	69.6%	110
python.exe	1,341,109.5	14.1%	22
clang-cl.exe	578,689.0	6.1%	10
gomacc.exe	488,166.1	5.1%	8
lld-link.exe	253,176.5	2.7%	4
ninja.exe	174,206.3	1.8%	3
yasm.exe	47,263.5	0.5%	1
Total	9,488,425.5		158

I look forward to goma 154 - any guesses on when it will ship?

> Any guesses why?

The reason is just the improvement of the include processor, which enumerates which files need to be uploaded to the goma server.
The whole list is in b/77663820, but the main points are just these two:

  - Cache the parsed syntax of header files: build a kind of AST for each header file and cache it, so we don't need to parse #define, #if, etc. twice. This improvement is critical (faster by x2 or more). (See the sketch after this list.)

  - Improve macro expansion: I implemented a faster macro expansion (for #if conditions). It covers a narrower range of cases than the existing fastpath algorithm, but in chrome the coverage doesn't change, and it's 60% faster than the existing one (faster by x1.3~1.5?).
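
To illustrate the first point, a minimal sketch of what caching the parsed form of a header might look like; all names here are made up, and goma's actual implementation lives in the upstream bug:

#include <map>
#include <memory>
#include <string>
#include <vector>

// One preprocessor directive found in a header, in parsed form.
struct Directive {
  enum class Kind { kInclude, kDefine, kIf, kElse, kEndif, kOther };
  Kind kind;
  std::string text;  // e.g. "#include <windows.h>"
};

using DirectiveList = std::vector<Directive>;

// Hypothetical directive cache: each header is parsed once per
// compiler_proxy session instead of once per translation unit.
class DirectiveCache {
 public:
  const DirectiveList& GetOrParse(const std::string& path) {
    auto it = cache_.find(path);
    if (it != cache_.end()) return *it->second;  // cache hit: no re-parse
    auto parsed = std::make_unique<DirectiveList>(ParseDirectives(path));
    return *cache_.emplace(path, std::move(parsed)).first->second;
  }

 private:
  DirectiveList ParseDirectives(const std::string& path);  // reads and parses the file (omitted)
  std::map<std::string, std::unique_ptr<DirectiveList>> cache_;
};

Only the parse is cached; evaluating the cached #if directives still has to happen per translation unit, because the macro state differs from TU to TU (which is what the faster macro expansion in the second point speeds up).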


goma version 154 actually shipped today, but was unfortunately reverted since the mac builder didn't have the AppKit python module X( (crbug.com/836696). I hope it will be available again once the AppKit python module is installed on the mac builders.

If you want to try it anyway, you can download the 2018-04-25 version of goma from https://pantheon.corp.google.com/storage/browser/goma-buildbot-archive/b67a5f6bd529ee41b2c6e1bbd29e8e3116bafa0d/gn/


Goma profiling has shown that memory allocations (and freeing) consume a significant amount of CPU time in compiler_proxy.exe. This effect is worse on Windows than on other platforms. One explanation would be that the Windows heap is less efficient. While this is possible (and who doesn't love some good Microsoft bashing) it's not totally convincing - the Windows heap is pretty good in most cases.

Another explanation is that maybe we are doing more allocations on Windows. Since compiler_proxy.exe performs allocations for each token, including #defines, the inclusion of Windows.h with its thousands of #defines could be the root cause of the OS differences.
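
To make that hypothesis concrete, here is the kind of per-token allocation pattern suspected; this is an illustrative sketch, not goma's actual tokenizer:

#include <string>
#include <vector>

// If every token is copied into its own std::string, a header like
// Windows.h - which contributes thousands of #define lines - forces a
// large number of small string copies (and heap allocations, for tokens
// that don't fit the small-string buffer) in every translation unit
// that includes it.
std::vector<std::string> TokenizeAllocating(const std::string& line) {
  std::vector<std::string> tokens;
  size_t pos = 0;
  while (pos < line.size()) {
    if (line[pos] == ' ') { ++pos; continue; }
    size_t end = line.find(' ', pos);
    if (end == std::string::npos) end = line.size();
    tokens.push_back(line.substr(pos, end - pos));  // copy per token
    pos = end;
  }
  return tokens;
}

The usual fix is to hand out views into the original buffer (std::string_view in C++17) so tokenization itself allocates nothing, as long as the file contents outlive the tokens.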

Conveniently enough, I did some work a few months ago to reduce the usage of Windows.h (crbug.com/796644). By undoing some of that work I can test how it affects CPU usage in compiler_proxy.exe.

I modified my test batch file to just build base (I also made it more stand-alone - it now generates an out\goma_test directory with my test args) and ran some tests with and without a #include of Windows.h added to base\time\time.h. Over multiple runs on a 16-core/32-thread workstation I saw fairly consistent results:
Default: ~16 CPU seconds for compiler_proxy.exe, 124 translation units include Windows.h
With extra Windows.h: ~19 CPU seconds for compiler_proxy.exe, 247 translation units include Windows.h

That's about an extra 24 ms per translation unit. If we extrapolate that to the ~10,000 avoidable includes of Windows.h when building the Chrome target (currently ~14,000, probably reducible to ~4,000) then that's a savings of about 240 s of CPU time, or four minutes, which is not a trivial amount, especially on the laptop I am using these days.
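
Spelling out that arithmetic (same numbers as above, just restated):

// Back-of-the-envelope version of the extrapolation in this comment.
constexpr double extra_ms  = 19000.0 - 16000.0;          // ~3,000 ms more compiler_proxy CPU
constexpr int    extra_tus = 247 - 124;                  // 123 more TUs including Windows.h
constexpr double ms_per_tu = extra_ms / extra_tus;       // ~24.4 ms per inclusion
constexpr double saved_sec = 10000 * ms_per_tu / 1000;   // ~244 s, about four CPU minutes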

These tests were done with goma version 155, updated today. New batch file and same-old .wpaProfile file attached.

etw_build_cpuusage.bat (4.1 KB)
CPUSummaryByProcessName.wpaProfile (75.7 KB)
