Python CPU usage on chrome builds is significant |
||||||||||||
Issue descriptionOn a typical debug component jumbo build of Chrome I see about 1,860 seconds of CPU time consumed by python.exe. When trying to build Chrome on a 48-thread Z840 that probably represents less than a minute of elapsed time, but on an 8-thread machine it represents about four minutes. If there is any low hanging fruit that could, say, halve the CPU time used by the miscellaneous scripts then that would be worthwhile. I'll add additional information to this bug listing CPU usage by script type. ⛆ |
|
|
,
Nov 22 2017
,
Nov 22 2017
,
Nov 22 2017
,
Nov 23 2017
Stamp file generation uses build\toolchain\win\tool_wrapper.py. The entire stamp operation consists of this:
def ExecStamp(self, path):
"""Simple stamp command."""
open(path, 'w').close()
However, doing this in Python requires firing up a complete python interpreter, importing seven modules, etc. Examination of an ETW trace suggests that these Python invocations consume 80 ms of CPU time on average (CPU time, not elapsed run time which is much greater), and on my test build with jumbo enabled the stamp action is run 4,629 times, accounting for more than half of the 8,923 python scripts run.
A naive extrapolation is that these stamp steps are consuming 370 seconds of CPU time. On a CPU bound build on an eight thread machine this could save ~45 s of run time. Even on a Z840 it might save 10 s of run time, maybe.
Another data point is that .ninja_log analysis consistently shows stamp creation with a weighted time (run time divided by tasks running) of 13 to 20 s.
The savings may be less (measurements were done on a excessively busy machine) or greater (hidden process creation overhead).
Running half as many scripts would make ETW analyzing the remaining scripts easier to do by reducing PID reuse.
,
Nov 23 2017
Switching from invoking python to invoking touch.exe (currently in depot_tools\win_tools-2_7_6_bin\git\usr\bin\touch.exe) did not help. An integrated stamp command presumably would help more.
,
Nov 23 2017
I think there is one major contributor here: The blink bindings generator turning idl files into C++ source. There are 1351 idl files in Blink and not all are trivial to turn into C++. I wonder how quick it would be to let one python process handle them all in one go. Adding some people to CC that might know if that has been investigated.
,
Nov 23 2017
I'm not away of any such efforts, but it seems pretty likely that starting fewer python processes would be an improvement!
,
Nov 23 2017
,
Nov 24 2017
This type of changes will improves python interpreter startup overhead for idl_compiler on win. https://codereview.chromium.org/2726103005/
,
Nov 24 2017
,
Nov 27 2017
CPU consumption by python sub-processes can be divided into roughly two types: 1) Trivial steps (such as python build\toolchain\win\tool_wrapper.py stamp $file) where the overhead comes from process creation. This overhead seems to vary between 20 and 80 ms per process, depending on computer speed and anti-virus overhead. This cost can only be avoided by making commands like 'stamp' be ninja built-ins, which would reduce the cost by at least two orders of magnitude. These numbers are based on testing done over the weekend with various 'touch' implementations and it showed that replacing the python stamp command with a precompiled touch.exe would, at most, halve the overhead - barely worth it. 2) Expensive steps like converting IDL files into C++ source. The process-creation overhead could be reduced by processing multiple IDL files in one step. The computational overhead could be reduced by rewriting the converter in C++. Profiling would be needed to determine which, if any, of these steps would be worthwhile. This profiling is best done outside of ninja to allow more accurate and repeatable measurements. It is difficult to estimate how much speedup might be realized from these changes on machines with many cores, because these often have enough CPU power available that the speedup depends entirely on whether the steps are on the critical path. On machines with fewer cores the build is usually completely CPU bound so every bit of CPU time savings turns into a build-time reduction.
,
Nov 28 2017
For the specific of the binding generator I think it goes beyond python overhead. IDL files depend on other IDL files so I think we end up processing the same IDL files many times in different generation calls. Whether it's a noticeable part of the time I don't know.
,
Feb 10 2018
I filed bug 810978 for removing a few thousand stamp steps. "do less work" > "do work faster", when possible :-)
,
Apr 2 2018
https://bugs.chromium.org/p/angleproject/issues/detail?id=2449 tracks excessive python CPU usage in Angle's lvl_genvk.py.
,
Apr 2 2018
Regarding comment#10, this change now applies to all platforms. This reduces total CPU usage by a large factor. It may reduce parallelism in some cases so it adds a modest cost on massively parallel builds, but on lesser machines it can give a huge speedup (10:1?): The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6de84855ecbe01843752e8b0e019c82e5e3a4d0a commit 6de84855ecbe01843752e8b0e019c82e5e3a4d0a Author: Takuto Ikuta <tikuta@google.com> Date: Tue Mar 13 12:32:24 2018 Use single action for idl compiler This is for reducing total cpu consumption by idl_compiler, and let other parallel build tasks can utilize cpu more efficiently. If we use larger -j and modifying idl files frequently, this introduces around 10 seconds build speed regression, but in other cases, this patch has advantage because this lets other generator tasks run simultaneously at the same time. I expect this improves build speed around 20~30 seconds on linux bot and 1 minutes on win bots if the build runs idl_compiler. In buildbot, generating bindings/core/* files took more than 20 seconds (94s ~ 116s in below trace). https://chromium-build-stats.appspot.com/ninja_log/2018/03/12/swarm979-c4/ninja_log.swarm979-c4.chrome-bot.20180311-201738.14539.gz/trace.html from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/43925 In this 22 seconds, mojom generator tasks are blocked to run. Build trace with this patch shows other tasks run somewhat simultaneously (13s ~ 41s). https://chromium-build-stats.appspot.com/ninja_log/2018/03/12/swarm979-c4/ninja_log.swarm979-c4.chrome-bot.20180311-215549.20798.gz/trace.html from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/43949 In this 28 seconds, mojom generator tasks can run at the same time. idl_compiler tasks block other tasks around 10 seconds. * for the record With this patch, generate_bindings_core_v8_interfaces took 14.574s on Z840 Linux. Without this patch, on Linux, it took 5.115s with -j48 7.317s with -j16 13.296s with -j8 95.401s with -j1 With this patch, generate_bindings_core_v8_interfaces took 21.230s on Z840 Win10. Without this patch, on Win10, it took 9.460s with -j48 12.086s with -j16 23.497s with -j8 169.762s with -j1 Bug: 821256 Change-Id: Ic3163edfd3cfb4be375e6f24ffc7c11cc9025783 Reviewed-on: https://chromium-review.googlesource.com/958804 Commit-Queue: Takuto Ikuta <tikuta@google.com> Reviewed-by: Kenichi Ishibashi <bashi@chromium.org> Reviewed-by: Hitoshi Yoshida <peria@chromium.org> Cr-Commit-Position: refs/heads/master@{#542775} [modify] https://crrev.com/6de84855ecbe01843752e8b0e019c82e5e3a4d0a/third_party/WebKit/Source/bindings/scripts/scripts.gni
,
Apr 4 2018
As discussed in crbug.com/819319, the CPU usage for python.exe on a Z620 when building Chrome on Windows is typically about 1,400 CPU s. This can be as high as 24.6% of the total CPU time used. This comes from 7,822 different invocations of Python. The original numbers reported in this bug were 1,860 CPU s from 8,923 invocations of Python. The reduction in script count is probably largely from gn fixes that reduce the number of .stamp files created and from the IDL change listed above. That may also have accounted for the reduction in CPU time (although that is inherently a very noisy measure). It is still possible to do the remaining .stamp steps without using Python.
,
Apr 4 2018
(There's another ~3k stamp files we can remove by changing gn. But that gn change is a bit harder to implement. I should write down what needs to be done on the gn bug before I forget it all.)
,
Apr 4 2018
Would that change to gn be to avoid more .stamp files completely, or make the creation of .stamp files more efficient? The creation of .stamp files would be orders of magnitude more efficient if gn did it natively instead of invoking python, so either fix would be equally effective. I did some tests and it's not that Python is particularly slow - it's just that creating a process in order to touch a file is slow, whereas creating a file in-proc is very quick.
,
Apr 4 2018
Avoid. "in proc" would mean in ninja, right? I don't think we want to start busyboxing random useful commands into ninja.
,
Apr 26 2018
,
Apr 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e99409faacc8b42cc3130cc3f1c3006afac4484b commit e99409faacc8b42cc3130cc3f1c3006afac4484b Author: Nico Weber <thakis@chromium.org> Date: Fri Apr 27 01:00:14 2018 win: Use stamp tool that's 6x as fast and that uses 25% as much ram per stamp. Bug: 787903 Change-Id: Ia29658bf8a04ffe6ae8e32530726e217781bb3d1 Reviewed-on: https://chromium-review.googlesource.com/1031317 Reviewed-by: Scott Graham <scottmg@chromium.org> Commit-Queue: Nico Weber <thakis@chromium.org> Cr-Commit-Position: refs/heads/master@{#554244} [modify] https://crrev.com/e99409faacc8b42cc3130cc3f1c3006afac4484b/build/toolchain/toolchain.gni [modify] https://crrev.com/e99409faacc8b42cc3130cc3f1c3006afac4484b/build/toolchain/win/tool_wrapper.py
,
Apr 27 2018
Those numbers are from scottmt's tim, I didn't measure in a full build. Maybe it helps, maybe it doesn't -- it'd be cool if someone could check.
,
Apr 27 2018
Looks a bit faster, when tested on a full goma build. The differences might be greater on non-goma builds when high -j values aren't starving local processes and distorting the results. Avoiding another 3k stamps (comment 18) would still be nice to have.
,
Jun 14 2018
Bruce, the roll landed "Migrating ANGLE to split LVL repos" update that should fix this compilation issue: https://chromium-review.googlesource.com/c/chromium/src/+/1100115 Can you verify?
,
Jun 20 2018
I count which python scripts are called many times when we build chrome on windows.
And I think most of overhead of python script is come from invoking of python interpreter.
So reducing the number of python script invocation should help to reduce total cpu usage in most cases.
top20 is like below.
976 ../../build/toolchain/win/tool_wrapper.py
822 ../../mojo/public/tools/bindings/mojom_bindings_generator.py
195 ../../mojo/public/tools/bindings/generate_type_mappings.py
131 ../../tools/json_schema_compiler/compiler.py
113 ../../third_party/yasm/run_yasm.py
74 ../../tools/protoc_wrapper/protoc_wrapper.py
60 ../../tools/grit/pak_util.py
50 ../../tools/grit/grit.py
50 ../../services/service_manager/public/tools/manifest/manifest_collator.py
50 ../../build/write_buildflag_header.py
22 ../../build/gn_run_binary.py
13 ../../third_party/blink/renderer/build/scripts/make_names.py
13 ../../build/util/version.py
12 gen/third_party/angle/third_party/vulkan-validation-layers/angle/vulkan/lvl_genvk.py
10 ../../third_party/angle/third_party/spirv-tools/src/utils/generate_grammar_tables.py
10 gen/third_party/angle/third_party/vulkan-loader/angle/vulkan/loader_genvk.py
9 ../../net/tools/dafsa/make_dafsa.py
7 ../../components/vector_icons/aggregate_vector_icons.py
6 ../../third_party/blink/renderer/bindings/scripts/idl_compiler.py
5 ../../third_party/inspector_protocol/code_generator.py
Many of tool_wrapper is called for linking. But I guess it can be removed in lld.
794 ../../build/toolchain/win/tool_wrapper.py link-wrapper
165 ../../build/toolchain/win/tool_wrapper.py recursive-mirror
15 ../../build/toolchain/win/tool_wrapper.py rc-wrapper
2 ../../build/toolchain/win/tool_wrapper.py asm-wrapper
Most of mojom_bindings_generator.py is called for generate subcommand. I think generate subcommand can be consolidated for some mojom files, similar to jumbofication.
528 ../../mojo/public/tools/bindings/mojom_bindings_generator.py --use_bundled_pylibs generate
148 ../../mojo/public/tools/bindings/mojom_bindings_generator.py parse
146 ../../mojo/public/tools/bindings/mojom_bindings_generator.py verify
I also think it is possible to apply consolidation for ../../mojo/public/tools/bindings/generate_type_mappings.py and json_schema_compiler
,
Jun 20 2018
Is this still a measurable fraction of the build time? Interpreter startup doesn't take super long compared to e.g. the linking subprocess it invokes -- intuitively I would've thought that most of these wrappers are fine. Do you have measurements showing the opposite? (For touching files, interpreter startup was slow compared to the actual work the script did and we touched many files, so I can see why no longer running python for that might help, but I still don't know how much the change in comment 22 really helped with build time.)
,
Jun 20 2018
Oh, forgot to say: Maybe we want to invoke python with -S -- for people who have many site packages installed locally it might need much longer to start than a pristine python installation.
,
Jun 20 2018
Recent measurement is in here. https://b.corp.google.com/issues/69675454#comment12 Specifying -S seems nice, why not that is default for GN in action?
,
Jun 20 2018
It's a somewhat obscure flag, I'd guess nobody thought of it when implementing actions in gn. Thanks for the b/ link, will look at it tomorrow when I'm back on corp.
,
Jun 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f7d47b7e8c03400343d4b6753ff79ade9d09ca68 commit f7d47b7e8c03400343d4b6753ff79ade9d09ca68 Author: Takuto Ikuta <tikuta@chromium.org> Date: Wed Jun 20 14:41:09 2018 Do not use tool-wrapper for lld link When building chrome on windows, tool-wrapper for linking is called 794 times. And I found that calling python wrapper is relatively slow than calling link via ninja. This CL replaces python wrapper to ninja wrapper to remove such overhead when we use lld. I compared time of wrapper invocation using following powershell script. for ($i = 0; $i -lt 100; $i++) { # ninja -t msvc -e environment.x64 -- C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe -c ' ' # python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe -c ' ' # code for FlushFileBuffers is commented out. } * ninja wrapper, 7.451s for 100 invocation * python wrapper, 17.803s for 100 invocation Using ninja as env wrapper is more than 2 times faster. Bug: 787903 Change-Id: I6531b59a6c43085782b0138a264ecfc9a6f65833 Reviewed-on: https://chromium-review.googlesource.com/1107678 Commit-Queue: Takuto Ikuta <tikuta@chromium.org> Reviewed-by: Nico Weber <thakis@chromium.org> Cr-Commit-Position: refs/heads/master@{#568836} [modify] https://crrev.com/f7d47b7e8c03400343d4b6753ff79ade9d09ca68/build/toolchain/win/BUILD.gn
,
Jun 20 2018
It looks like avoiding the python wrapper for linking saves us 0.01 s per invocation, so this change will save about 8 s (794 * .01 s). The wall-clock saving will be less because of parallelism. So, this is a nice change to have (and it simplifies things) but it will have modest impact. The vast majority of the Python overhead is not from invoking the interpreter but from running the code inside the scripts. This python.exe CPU time is a non-trivial portion of build times when doing goma builds, especially on low core-count machines. Modifying python scripts to make them do more per invocation in order to reduce Python interpreter startup overhead is likely to be counterproductive because it will reduce parallelism and won't save much CPU time. Modifying python scripts to make them faster would be great - I don't know if there are any easy opportunities out there.
,
Jun 20 2018
Isn't it 0.1s according to the data in comment 31? So 80s, which is at least somewhat visible if you don't have too many cores. (It's still a small percentage of the linking work happening, but it's 100% free performance.)
,
Jun 20 2018
Sorry, you are correct. Off-by-one (magnitude) error on my part. 80 s of CPU time is a definite improvement on laptop builds, and 0.1 s savings per invocation makes merging scripts somewhat more tempting, although I don't think there's another 80 s on the table. And yeah, free performance in this case - I like it.
,
Jun 21 2018
I took execution time of python scripts on linux when building chrome except some wrapper invoking binary (protoc, yasm, node, gcc). Roughly, 1523 python scripts take 346.378 seconds in total. Breakdown is like below. +-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+ | count | total_time_ms | avg_ms | script | +-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+ | 802 | 112270 | 139.98753117206982 | ../../mojo/public/tools/bindings/mojom_bindings_generator.py | | 12 | 48252 | 4021 | gen/third_party/angle/third_party/vulkan-validation-layers/angle/vulkan/lvl_genvk.py | | 130 | 40930 | 314.84615384615387 | ../../tools/json_schema_compiler/compiler.py | | 53 | 40543 | 764.9622641509434 | ../../tools/grit/grit.py | | 6 | 11970 | 1995 | ../../third_party/blink/renderer/bindings/scripts/idl_compiler.py | | 10 | 9772 | 977.2 | gen/third_party/angle/third_party/vulkan-loader/angle/vulkan/loader_genvk.py | | 191 | 9561 | 50.05759162303665 | ../../mojo/public/tools/bindings/generate_type_mappings.py | | 1 | 8003 | 8003 | ../../tools/metrics/histograms/generate_expired_histograms_array.py | | 9 | 6821 | 757.8888888888889 | ../../net/tools/dafsa/make_dafsa.py | | 60 | 6470 | 107.83333333333333 | ../../tools/grit/pak_util.py | | 1 | 5833 | 5833 | ../../third_party/catapult/tracing/bin/generate_about_tracing_contents | | 1 | 5206 | 5206 | ../../third_party/blink/renderer/build/scripts/core/css/make_css_value_keywords.py | | 2 | 2763 | 1381.5 | ../../third_party/blink/renderer/bindings/scripts/compute_interfaces_info_individual.py | | 1 | 2539 | 2539 | ../../third_party/blink/renderer/bindings/scripts/generate_v8_context_snapshot_external_references.py | | 2 | 2415 | 1207.5 | ../../third_party/blink/renderer/bindings/scripts/generate_origin_trial_features.py | | 1 | 2233 | 2233 | ../../third_party/blink/renderer/build/scripts/make_css_property_names.py | | 43 | 2177 | 50.627906976744185 | ../../services/service_manager/public/tools/manifest/manifest_collator.py | | 49 | 2004 | 40.89795918367347 | ../../build/write_buildflag_header.py | | 1 | 1943 | 1943 | ../../third_party/blink/renderer/devtools/scripts/build/build_release_applications.py | | 13 | 1933 | 148.69230769230768 | ../../third_party/blink/renderer/build/scripts/make_names.py | +-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+ lvl_genvk.py has large execution time regardless its execution count. But mojom is largest time consumer and its breakdown become like below. +-------+---------------+--------------------+------------+ | count | total_time_ms | avg_ms | subcommand | +-------+---------------+--------------------+------------+ | 515 | 68727 | 133.45048543689322 | generate | | 144 | 27962 | 194.18055555555554 | parse | | 142 | 14635 | 103.06338028169014 | verify | | 1 | 946 | 946 | precompile | +-------+---------------+--------------------+------------+ So, we want to focus on these time consuming python scripts for better lower python cpu usage.
,
Jun 21 2018
Attached csv contains script execution time and bit modified command line args. This may help to analyze further.
,
Jun 21 2018
,
Jun 22 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/054eb9147e0a877d52410a18c948035fa6062381 commit 054eb9147e0a877d52410a18c948035fa6062381 Author: Takuto Ikuta <tikuta@chromium.org> Date: Fri Jun 22 01:15:57 2018 Do not use wrapper for link in win cross build This is follow up of https://chromium.googlesource.com/chromium/src/+/f7d47b7e8c03400343d4b6753ff79ade9d09ca68 This CL specifies libpath explicitly for LIB directories. Bug: 854849 , 787903 Change-Id: I9b65b0c0980af1f7eb4ece2ffbb31d8bde44e7ce Reviewed-on: https://chromium-review.googlesource.com/1109452 Commit-Queue: Takuto Ikuta <tikuta@chromium.org> Reviewed-by: Nico Weber <thakis@chromium.org> Cr-Commit-Position: refs/heads/master@{#569486} [modify] https://crrev.com/054eb9147e0a877d52410a18c948035fa6062381/build/toolchain/win/BUILD.gn [modify] https://crrev.com/054eb9147e0a877d52410a18c948035fa6062381/build/toolchain/win/setup_toolchain.py |
|||||||||
►
Sign in to add a comment |
||||||||||||
Comment 1 by brucedaw...@chromium.org
, Nov 22 2017