New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 787903 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocked on: View detail
issue 810978
issue angleproject:2449

Blocking:
issue 787983



Sign in to add a comment

Python CPU usage on chrome builds is significant

Project Member Reported by brucedaw...@chromium.org, Nov 22 2017

Issue description

On a typical debug component jumbo build of Chrome I see about 1,860 seconds of CPU time consumed by python.exe. When trying to build Chrome on a 48-thread Z840 that probably represents less than a minute of elapsed time, but on an 8-thread machine it represents about four minutes.

If there is any low hanging fruit that could, say, halve the CPU time used by the miscellaneous scripts then that would be worthwhile. I'll add additional information to this bug listing CPU usage by script type.

 
Blockedon: 773138
Blocking: 495670
Blocking: 787983
Blocking: -495670
Stamp file generation uses build\toolchain\win\tool_wrapper.py. The entire stamp operation consists of this:

  def ExecStamp(self, path):
    """Simple stamp command."""
    open(path, 'w').close()

However, doing this in Python requires firing up a complete python interpreter, importing seven modules, etc. Examination of an ETW trace suggests that these Python invocations consume 80 ms of CPU time on average (CPU time, not elapsed run time which is much greater), and on my test build with jumbo enabled the stamp action is run 4,629 times, accounting for more than half of the 8,923 python scripts run.

A naive extrapolation is that these stamp steps are consuming 370 seconds of CPU time. On a CPU bound build on an eight thread machine this could save ~45 s of run time. Even on a Z840 it might save 10 s of run time, maybe.

Another data point is that .ninja_log analysis consistently shows stamp creation with a weighted time (run time divided by tasks running) of 13 to 20 s.

The savings may be less (measurements were done on a excessively busy machine) or greater (hidden process creation overhead). 

Running half as many scripts would make ETW analyzing the remaining scripts easier to do by reducing PID reuse.

Switching from invoking python to invoking touch.exe (currently in depot_tools\win_tools-2_7_6_bin\git\usr\bin\touch.exe) did not help. An integrated stamp command presumably would help more.

Comment 7 by brat...@opera.com, Nov 23 2017

Cc: j...@opera.com brat...@opera.com foolip@chromium.org yukishiino@chromium.org
I think there is one major contributor here: 

The blink bindings generator turning idl files into C++ source. There are 1351 idl files in Blink and not all are trivial to turn into C++.

I wonder how quick it would be to let one python process handle them all in one go. Adding some people to CC that might know if that has been investigated.

Comment 8 by foolip@chromium.org, Nov 23 2017

I'm not away of any such efforts, but it seems pretty likely that starting fewer python processes would be an improvement!

Comment 9 by vapier@chromium.org, Nov 23 2017

Blockedon: -773138
Cc: bashi@chromium.org
This type of changes will improves python interpreter startup overhead for idl_compiler on win.
https://codereview.chromium.org/2726103005/
Cc: tikuta@chromium.org
CPU consumption by python sub-processes can be divided into roughly two types:

1) Trivial steps (such as python build\toolchain\win\tool_wrapper.py stamp $file) where the overhead comes from process creation. This overhead seems to vary between 20 and 80 ms per process, depending on computer speed and anti-virus overhead. This cost can only be avoided by making commands like 'stamp' be ninja built-ins, which would reduce the cost by at least two orders of magnitude. These numbers are based on testing done over the weekend with various 'touch' implementations and it showed that replacing the python stamp command with a precompiled touch.exe would, at most, halve the overhead - barely worth it.

2) Expensive steps like converting IDL files into C++ source. The process-creation overhead could be reduced by processing multiple IDL files in one step. The computational overhead could be reduced by rewriting the converter in C++. Profiling would be needed to determine which, if any, of these steps would be worthwhile. This profiling is best done outside of ninja to allow more accurate and repeatable measurements.

It is difficult to estimate how much speedup might be realized from these changes on machines with many cores, because these often have enough CPU power available that the speedup depends entirely on whether the steps are on the critical path. On machines with fewer cores the build is usually completely CPU bound so every bit of CPU time savings turns into a build-time reduction.

Comment 13 by brat...@opera.com, Nov 28 2017

For the specific of the binding generator I think it goes beyond python overhead. IDL files depend on other IDL files so I think we end up processing the same IDL files many times in different generation calls. Whether it's a noticeable part of the time I don't know.
I filed bug 810978 for removing a few thousand stamp steps. "do less work" > "do work faster", when possible :-)
https://bugs.chromium.org/p/angleproject/issues/detail?id=2449 tracks excessive python CPU usage in Angle's lvl_genvk.py.
Regarding comment#10, this change now applies to all platforms. This reduces total CPU usage by a large factor. It may reduce parallelism in some cases so it adds a modest cost on massively parallel builds, but on lesser machines it can give a huge speedup (10:1?):

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6de84855ecbe01843752e8b0e019c82e5e3a4d0a

commit 6de84855ecbe01843752e8b0e019c82e5e3a4d0a
Author: Takuto Ikuta <tikuta@google.com>
Date: Tue Mar 13 12:32:24 2018

Use single action for idl compiler

This is for reducing total cpu consumption by idl_compiler, and let other parallel build tasks can utilize cpu more efficiently.

If we use larger -j and modifying idl files frequently, this introduces around 10 seconds build speed regression,
but in other cases, this patch has advantage because this lets other generator tasks run simultaneously at the same time.
I expect this improves build speed around 20~30 seconds on linux bot and 1 minutes on win bots if the build runs idl_compiler.

In buildbot, generating bindings/core/* files took more than 20 seconds (94s ~ 116s in below trace).
https://chromium-build-stats.appspot.com/ninja_log/2018/03/12/swarm979-c4/ninja_log.swarm979-c4.chrome-bot.20180311-201738.14539.gz/trace.html from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/43925
In this 22 seconds, mojom generator tasks are blocked to run.

Build trace with this patch shows other tasks run somewhat simultaneously (13s ~ 41s).
https://chromium-build-stats.appspot.com/ninja_log/2018/03/12/swarm979-c4/ninja_log.swarm979-c4.chrome-bot.20180311-215549.20798.gz/trace.html from https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/43949
In this 28 seconds, mojom generator tasks can run at the same time. idl_compiler tasks block other tasks around 10 seconds.


* for the record

With this patch, generate_bindings_core_v8_interfaces took 14.574s on Z840 Linux.

Without this patch, on Linux, it took
5.115s with -j48
7.317s with -j16
13.296s with -j8
95.401s with -j1

With this patch, generate_bindings_core_v8_interfaces took  21.230s on Z840 Win10.

Without this patch, on Win10, it took
9.460s with -j48
12.086s with -j16
23.497s with -j8
169.762s with -j1

Bug:   821256  
Change-Id: Ic3163edfd3cfb4be375e6f24ffc7c11cc9025783
Reviewed-on: https://chromium-review.googlesource.com/958804
Commit-Queue: Takuto Ikuta <tikuta@google.com>
Reviewed-by: Kenichi Ishibashi <bashi@chromium.org>
Reviewed-by: Hitoshi Yoshida <peria@chromium.org>
Cr-Commit-Position: refs/heads/master@{#542775}
[modify] https://crrev.com/6de84855ecbe01843752e8b0e019c82e5e3a4d0a/third_party/WebKit/Source/bindings/scripts/scripts.gni

As discussed in crbug.com/819319, the CPU usage for python.exe on a Z620 when building Chrome on Windows is typically about 1,400 CPU s. This can be as high as 24.6% of the total CPU time used. This comes from 7,822 different invocations of Python.

The original numbers reported in this bug were 1,860 CPU s from 8,923 invocations of Python. The reduction in script count is probably largely from gn fixes that reduce the number of .stamp files created and from the IDL change listed above. That may also have accounted for the reduction in CPU time (although that is inherently a very noisy measure).

It is still possible to do the remaining .stamp steps without using Python.

(There's another ~3k stamp files we can remove by changing gn. But that gn change is a bit harder to implement. I should write down what needs to be done on the gn bug before I forget it all.)
Would that change to gn be to avoid more .stamp files completely, or make the creation of .stamp files more efficient? The creation of .stamp files would be orders of magnitude more efficient if gn did it natively instead of invoking python, so either fix would be equally effective.

I did some tests and it's not that Python is particularly slow - it's just that creating a process in order to touch a file is slow, whereas creating a file in-proc is very quick.

Avoid.

"in proc" would mean in ninja, right? I don't think we want to start busyboxing random useful commands into ninja.
Blockedon: 810978
Project Member

Comment 22 by bugdroid1@chromium.org, Apr 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e99409faacc8b42cc3130cc3f1c3006afac4484b

commit e99409faacc8b42cc3130cc3f1c3006afac4484b
Author: Nico Weber <thakis@chromium.org>
Date: Fri Apr 27 01:00:14 2018

win: Use stamp tool that's 6x as fast and that uses 25% as much ram per stamp.

Bug: 787903
Change-Id: Ia29658bf8a04ffe6ae8e32530726e217781bb3d1
Reviewed-on: https://chromium-review.googlesource.com/1031317
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#554244}
[modify] https://crrev.com/e99409faacc8b42cc3130cc3f1c3006afac4484b/build/toolchain/toolchain.gni
[modify] https://crrev.com/e99409faacc8b42cc3130cc3f1c3006afac4484b/build/toolchain/win/tool_wrapper.py

Those numbers are from scottmt's tim, I didn't measure in a full build. Maybe it helps, maybe it doesn't -- it'd be cool if someone could check.
Looks a bit faster, when tested on a full goma build. The differences might be greater on non-goma builds when high -j values aren't starving local processes and distorting the results.

Avoiding another 3k stamps (comment 18) would still be nice to have.

Comment 25 by tobine@google.com, Jun 14 2018

Bruce, the roll landed "Migrating ANGLE to split LVL repos" update that should fix this compilation issue: https://chromium-review.googlesource.com/c/chromium/src/+/1100115

Can you verify?
I count which python scripts are called many times when we build chrome on windows.
And I think most of overhead of python script is come from invoking of python interpreter.
So reducing the number of python script invocation should help to reduce total cpu usage in most cases.

top20 is like below.
    976 ../../build/toolchain/win/tool_wrapper.py
    822 ../../mojo/public/tools/bindings/mojom_bindings_generator.py
    195 ../../mojo/public/tools/bindings/generate_type_mappings.py
    131 ../../tools/json_schema_compiler/compiler.py
    113 ../../third_party/yasm/run_yasm.py
     74 ../../tools/protoc_wrapper/protoc_wrapper.py
     60 ../../tools/grit/pak_util.py
     50 ../../tools/grit/grit.py
     50 ../../services/service_manager/public/tools/manifest/manifest_collator.py
     50 ../../build/write_buildflag_header.py
     22 ../../build/gn_run_binary.py
     13 ../../third_party/blink/renderer/build/scripts/make_names.py
     13 ../../build/util/version.py
     12 gen/third_party/angle/third_party/vulkan-validation-layers/angle/vulkan/lvl_genvk.py
     10 ../../third_party/angle/third_party/spirv-tools/src/utils/generate_grammar_tables.py
     10 gen/third_party/angle/third_party/vulkan-loader/angle/vulkan/loader_genvk.py
      9 ../../net/tools/dafsa/make_dafsa.py
      7 ../../components/vector_icons/aggregate_vector_icons.py
      6 ../../third_party/blink/renderer/bindings/scripts/idl_compiler.py
      5 ../../third_party/inspector_protocol/code_generator.py


Many of tool_wrapper is called for linking. But I guess it can be removed in lld.

    794 ../../build/toolchain/win/tool_wrapper.py link-wrapper
    165 ../../build/toolchain/win/tool_wrapper.py recursive-mirror
     15 ../../build/toolchain/win/tool_wrapper.py rc-wrapper
      2 ../../build/toolchain/win/tool_wrapper.py asm-wrapper

Most of mojom_bindings_generator.py is called for generate subcommand. I think generate subcommand can be consolidated for some mojom files, similar to jumbofication.
    528 ../../mojo/public/tools/bindings/mojom_bindings_generator.py --use_bundled_pylibs generate
    148 ../../mojo/public/tools/bindings/mojom_bindings_generator.py parse
    146 ../../mojo/public/tools/bindings/mojom_bindings_generator.py verify


I also think it is possible to apply consolidation for ../../mojo/public/tools/bindings/generate_type_mappings.py and json_schema_compiler
Is this still a measurable fraction of the build time? Interpreter startup doesn't take super long compared to e.g. the linking subprocess it invokes -- intuitively I would've thought that most of these wrappers are fine. Do you have measurements showing the opposite?

(For touching files, interpreter startup was slow compared to the actual work the script did and we touched many files, so I can see why no longer running python for that might help, but I still don't know how much the change in comment 22 really helped with build time.)
Oh, forgot to say: Maybe we want to invoke python with -S -- for people who have many site packages installed locally it might need much longer to start than a pristine python installation.
Recent measurement is in here. 
https://b.corp.google.com/issues/69675454#comment12

Specifying -S seems nice, why not that is default for GN in action?

It's a somewhat obscure flag, I'd guess nobody thought of it when implementing actions in gn.

Thanks for the b/ link, will look at it tomorrow when I'm back on corp.
Project Member

Comment 31 by bugdroid1@chromium.org, Jun 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f7d47b7e8c03400343d4b6753ff79ade9d09ca68

commit f7d47b7e8c03400343d4b6753ff79ade9d09ca68
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Wed Jun 20 14:41:09 2018

Do not use tool-wrapper for lld link

When building chrome on windows, tool-wrapper for linking is called 794 times.
And I found that calling python wrapper is relatively slow than calling link via ninja.

This CL replaces python wrapper to ninja wrapper to remove such overhead when we use lld.

I compared time of wrapper invocation using following powershell script.

for ($i = 0; $i -lt 100; $i++) {
    # ninja -t msvc -e environment.x64 -- C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe -c ' '
    # python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe -c ' ' # code for FlushFileBuffers is commented out.
}

* ninja wrapper, 7.451s for 100 invocation
* python wrapper, 17.803s for 100 invocation
Using ninja as env wrapper is more than 2 times faster.

Bug: 787903
Change-Id: I6531b59a6c43085782b0138a264ecfc9a6f65833
Reviewed-on: https://chromium-review.googlesource.com/1107678
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#568836}
[modify] https://crrev.com/f7d47b7e8c03400343d4b6753ff79ade9d09ca68/build/toolchain/win/BUILD.gn

It looks like avoiding the python wrapper for linking saves us 0.01 s per invocation, so this change will save about 8 s (794 * .01 s). The wall-clock saving will be less because of parallelism. So, this is a nice change to have (and it simplifies things) but it will have modest impact.

The vast majority of the Python overhead is not from invoking the interpreter but from running the code inside the scripts. This python.exe CPU time is a non-trivial portion of build times when doing goma builds, especially on low core-count machines. Modifying python scripts to make them do more per invocation in order to reduce Python interpreter startup overhead is likely to be counterproductive because it will reduce parallelism and won't save much CPU time. Modifying python scripts to make them faster would be great - I don't know if there are any easy opportunities out there.

Isn't it 0.1s according to the data in comment 31? So 80s, which is at least somewhat visible if you don't have too many cores. (It's still a small percentage of the linking work happening, but it's 100% free performance.)
Sorry, you are correct. Off-by-one (magnitude) error on my part. 80 s of CPU time is a definite improvement on laptop builds, and 0.1 s savings per invocation makes merging scripts somewhat more tempting, although I don't think there's another 80 s on the table.

And yeah, free performance in this case - I like it.

I took execution time of python scripts on linux when building chrome except some wrapper invoking binary (protoc, yasm, node, gcc).

Roughly, 1523 python scripts take 346.378 seconds in total.

Breakdown is like below.
+-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+
| count | total_time_ms |       avg_ms       |                                                script                                                 |
+-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+
|   802 |        112270 | 139.98753117206982 | ../../mojo/public/tools/bindings/mojom_bindings_generator.py                                          |
|    12 |         48252 |               4021 | gen/third_party/angle/third_party/vulkan-validation-layers/angle/vulkan/lvl_genvk.py                  |
|   130 |         40930 | 314.84615384615387 | ../../tools/json_schema_compiler/compiler.py                                                          |
|    53 |         40543 |  764.9622641509434 | ../../tools/grit/grit.py                                                                              |
|     6 |         11970 |               1995 | ../../third_party/blink/renderer/bindings/scripts/idl_compiler.py                                     |
|    10 |          9772 |              977.2 | gen/third_party/angle/third_party/vulkan-loader/angle/vulkan/loader_genvk.py                          |
|   191 |          9561 |  50.05759162303665 | ../../mojo/public/tools/bindings/generate_type_mappings.py                                            |
|     1 |          8003 |               8003 | ../../tools/metrics/histograms/generate_expired_histograms_array.py                                   |
|     9 |          6821 |  757.8888888888889 | ../../net/tools/dafsa/make_dafsa.py                                                                   |
|    60 |          6470 | 107.83333333333333 | ../../tools/grit/pak_util.py                                                                          |
|     1 |          5833 |               5833 | ../../third_party/catapult/tracing/bin/generate_about_tracing_contents                                |
|     1 |          5206 |               5206 | ../../third_party/blink/renderer/build/scripts/core/css/make_css_value_keywords.py                    |
|     2 |          2763 |             1381.5 | ../../third_party/blink/renderer/bindings/scripts/compute_interfaces_info_individual.py               |
|     1 |          2539 |               2539 | ../../third_party/blink/renderer/bindings/scripts/generate_v8_context_snapshot_external_references.py |
|     2 |          2415 |             1207.5 | ../../third_party/blink/renderer/bindings/scripts/generate_origin_trial_features.py                   |
|     1 |          2233 |               2233 | ../../third_party/blink/renderer/build/scripts/make_css_property_names.py                             |
|    43 |          2177 | 50.627906976744185 | ../../services/service_manager/public/tools/manifest/manifest_collator.py                             |
|    49 |          2004 |  40.89795918367347 | ../../build/write_buildflag_header.py                                                                 |
|     1 |          1943 |               1943 | ../../third_party/blink/renderer/devtools/scripts/build/build_release_applications.py                 |
|    13 |          1933 | 148.69230769230768 | ../../third_party/blink/renderer/build/scripts/make_names.py                                          |
+-------+---------------+--------------------+-------------------------------------------------------------------------------------------------------+

lvl_genvk.py has large execution time regardless its execution count.
But mojom is largest time consumer and its breakdown become like below.
+-------+---------------+--------------------+------------+
| count | total_time_ms |       avg_ms       | subcommand |
+-------+---------------+--------------------+------------+
|   515 |         68727 | 133.45048543689322 | generate   |
|   144 |         27962 | 194.18055555555554 | parse      |
|   142 |         14635 | 103.06338028169014 | verify     |
|     1 |           946 |                946 | precompile |
+-------+---------------+--------------------+------------+

So, we want to focus on these time consuming python scripts for better lower python cpu usage.
Attached csv contains script execution time and bit modified command line args.
This may help to analyze further.
python.csv
593 KB View Download
Blockedon: angleproject:2449
There already is a bug for lvl_genvk.py
Project Member

Comment 38 by bugdroid1@chromium.org, Jun 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/054eb9147e0a877d52410a18c948035fa6062381

commit 054eb9147e0a877d52410a18c948035fa6062381
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Fri Jun 22 01:15:57 2018

Do not use wrapper for link in win cross build

This is follow up of
https://chromium.googlesource.com/chromium/src/+/f7d47b7e8c03400343d4b6753ff79ade9d09ca68

This CL specifies libpath explicitly for LIB directories.

Bug:  854849 , 787903
Change-Id: I9b65b0c0980af1f7eb4ece2ffbb31d8bde44e7ce
Reviewed-on: https://chromium-review.googlesource.com/1109452
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#569486}
[modify] https://crrev.com/054eb9147e0a877d52410a18c948035fa6062381/build/toolchain/win/BUILD.gn
[modify] https://crrev.com/054eb9147e0a877d52410a18c948035fa6062381/build/toolchain/win/setup_toolchain.py

Sign in to add a comment