
Issue 877722

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Integrate ThinLTO backend codegen actions as part of the ninja build graph for better parallelism, distribution, and incrementality

Project Member Reported by r...@chromium.org, Aug 25

Issue description

LLVM ThinLTO is the new (as of 2016) way to do scalable link time optimization in LLVM.

There are two main ways to do ThinLTO today:
1. Inside the link step, the linker internally manages object file generation actions with its own cache and thread pool.
2. With explicit build system support, as implemented in Bazel/Blaze (I'm unsure which)

The first solution is a good first step, but it leaves a lot of build performance on the table, and does a lot of things that gn, ninja, and goma could do better.

This bug covers adding that explicit build system support to gn, and maybe some additional work to integrate it with goma.

At a high level, C++ compilation looks like a massively parallel set of compilation jobs, followed by a single link step. A ThinLTO build looks like the same set of compilation actions, followed by a thin link step, followed by a second stage of compilation for every ThinLTO object file used by the link, followed by a final native re-link. The thin link step does as much analysis as it can in a reasonable amount of time, deferring optimization transformations and object file generation to the second compilation phase.
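For concreteness, a minimal sketch of those four stages for a toy two-file program, using the distributed-ThinLTO flags clang and ld.lld expose; the file names and exact option spellings here are illustrative, not Chrome's actual build lines:

# Sketch only: stages of a distributed ThinLTO build for two source files.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

sources = ["a.cc", "b.cc"]
bitcode = [s.replace(".cc", ".o") for s in sources]

# Stage 1: massively parallel compilation to bitcode (one action per TU).
for src, obj in zip(sources, bitcode):
    run(["clang++", "-O2", "-flto=thin", "-c", src, "-o", obj])

# Stage 2: the thin link; instead of doing codegen itself it writes one
# small per-object index file (<obj>.thinlto.bc).
run(["clang++", "-fuse-ld=lld", "-flto=thin",
     "-Wl,--thinlto-index-only"] + bitcode)

# Stage 3: second compilation phase, one independent action per bitcode file.
native = []
for obj in bitcode:
    out = obj.replace(".o", ".native.o")
    run(["clang++", "-O2", "-c", "-x", "ir", obj,
         "-fthinlto-index=" + obj + ".thinlto.bc", "-o", out])
    native.append(out)

# Stage 4: final native re-link of the real object files.
run(["clang++", "-fuse-ld=lld"] + native + ["-o", "prog"])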

This is more or less all that gn and ninja need to know. Goma has a second concern: the second-stage compilation needs to be able to access some number of other bitcode files from the build. In order to distribute these actions (if profitable), we will need to read the index written by the thin link and upload the set of files used by the second compilation to goma. This is a lot like goma's existing header scanning, but it must be done for a new file format.

Another major concern is that when there are many binaries, as there are in Chrome, the code generation work ends up being duplicated. You can end up in a situation where the 'base' library is linked into M binaries, but the code generation steps for object files from base are largely redundant if they don't meaningfully participate in cross-module optimization. We don't currently have any plans to address this, but you can imagine a solution that tries to coordinate and cache work across links to see if they are compiling effectively the same input bitcode. Solving that is interesting, but out of scope for this bug.

I'm filing this now so we have a place to track work for it, but we're still in the planning stages. We need to gather more data about build performance to motivate these changes. If ThinLTO with the linker-internal object file cache is fast enough, there may not be enough reason to go to all this trouble to modify gn and goma. However, my belief is that doing this work will make ThinLTO builds a lot faster and more accessible for developers who need to replicate Chrome's official build configurations. It might also allow us to enable CFI in regular release builds, so developers test something that is closer to what actually ships.
 
Cc: dxf@google.com
Cc: p...@chromium.org dpranke@google.com
As discussed in person, this sounds like a great plan to me. My main concern is that the thin link step needs to write a ton of data, but hopefully all the codegen parallelism easily makes up for that.


(+dpranke fyi)
Components: Build
> My main concern is that the thin link step needs to write a ton of data, but hopefully all the codegen parallelism easily makes up for that.

Actually, when we were talking in the meeting, I didn't have that much background on how the existing thinlto blaze integration worked. The thin link step doesn't have to write a ton of data, it basically writes one (small-ish) index per object file in the link, telling that object where to do its cross-module imports from. That may be a ton of files, but hopefully it's not a lot of data. The parallelized thinlto backend jobs read their per-TU bitcode file, and then optionally read a bunch of other bitcode files to do cross-module imports.

---

So, the ThinLTO bot finished compiling! \o/
https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/1754

The compile step took 6h25m, and a regular official release bot takes 3h54m. The bots have different hardware, but that makes me feel like we do want this.
On Linux, an actual official build takes ~3 hr, without goma. Any thoughts on why Windows would be so much slower?

Apart from that, I don't think I quite follow what the proposed solution would be.

I guess you want to parallelize the "link" across machines so that you can do faster codegen in the second pass? That almost sounds like you'd be turning LTO off; I guess instead of just reading the header files for other modules, you'd need access to the bitcode for the modules as well? Is there some way we could make that happen during the first pass?

Apart from that as well, you could prototype something without requiring changes to GN or Ninja, by essentially just shelling out to a script instead of invoking the linker, and then having the script re-invoke goma + clang as needed. That would hide part of the build graph from Ninja, which is sub-optimal, but it might actually be a better solution if you could figure out how to do that plus add job server support to ninja. 

At first thought, I have no idea what you'd need to do to GN to try and get it to think of compiling c++ in two passes. If blaze/bazel does something here that we could imitate, I'd like to know what it is.


The way thinlto works is like so:

1. compile all cc files to llvm bitcode (parallelized)
2. slightly entangle all the bitcode files (serial)
3. for all bitcode files, run codegen (currently done by a single process with threads, and locally)

The proposal is to change 3 to not be a single process, but one process per bitcode file. Then parallelism can work on a process level, ninja can schedule better, and we can do the codegen on goma.
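As a sketch of what that expansion means for the build graph, assuming hypothetical ninja rule names (thin_link, thinlto_codegen, native_link) and a made-up index-stamp convention, the per-target edges a generator would emit might look like this:

# Sketch only: per-object ninja build edges for one link target.
def thinlto_edges(target, bitcode_objs):
    lines = []
    index_stamp = f"{target}.thin-link.stamp"
    # Step 2: one serial thin-link action that reads every bitcode file and
    # writes one small index per object.
    lines.append(f"build {index_stamp}: thin_link {' '.join(bitcode_objs)}")
    # Step 3: one backend codegen action per bitcode file, each depending
    # only on its own bitcode plus the thin-link output it consumes.
    native_objs = []
    for obj in bitcode_objs:
        native = f"{target}.obj/{obj}"
        lines.append(f"build {native}: thinlto_codegen {obj} | {index_stamp}")
        native_objs.append(native)
    # Final native re-link.
    lines.append(f"build {target}: native_link {' '.join(native_objs)}")
    return "\n".join(lines)

print(thinlto_edges("chrome", ["base/foo.o", "base/bar.o", "net/baz.o"]))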

bazel supports adding arbitrarily much work at build time from scripts, so from what I understand that's how this works there. ninja doesn't believe in a turing-complete build description language being a good thing, so it doesn't support that.
WDYT of the jobserver-based idea?
I don't think Nico is keen on adding job server support to ninja, but I can't speak for him.

I think, if we can describe the shape of the action graph up front, then we should do it. It requires teaching gn a lot about ThinLTO, but that's probably less complex than a job server implementation. At least, there won't be any platform specific concurrency bugs in it. =P

I'm not sure how job server actions should be integrated into the ninja status line, but if it's explicit, we'll naturally expose the build progress.
To be honest I'm not a big fan of how distributed ThinLTO backend support is implemented in Bazel, since doing the same in GN would tightly couple it to ThinLTO's implementation details. The jobserver-based idea is how I've always imagined that distributed build support would work in GN (and other ninja generators).
Last I talked to thakis@ (i.e., last week), he thought that we'd want to add jobserver support sooner or later.
I might add it for outside folks, but I very strongly feel we shouldn't use that in chrome.
To expand on that a bit: needing a job server means your build config is broken. It can still be useful as help to incrementally get to a better place, or if you don't control the build of all your dependencies, but chromium's build is thankfully sane. I'm not keen on regressing that.
I'm not keen to completely rework the compilation flow for c++ targets in GN when a compiler flag changes, so we might both need to be kinda flexible here :)
> To expand on that a bit: needing a job server means your build config is broken. 

I wouldn't call it "broken". It could mean that the knowledge is legitimately in two different disconnected graphs (e.g., when you have to cross two build systems), which is close to what you're describing here.
Cc: -dpranke@google.com dpranke@chromium.org
http://aegis.sourceforge.net/auug97.pdf ("Recursive Make Considered Harmful", the "say no to job servers" paper) is one of the few things most build systems agree on nowadays. Granted, it's better than not being able to build, but it's not a world we should intentionally move to.

But let's get a prototype working first and then discuss more on how to implement this.
AFAIK that paper says nothing about job servers or that concept specifically, i.e., I've never heard it called that ;)

I am well aware of the other arguments against recursive make and generally agree with them. At the same time, in some cases partitioning makes more sense than others. Apart from that, I doubt we disagree too much on what to do in practice.

So, going back to the three-stage description in #c6, can you expand on the second stage a bit? What is that actually doing, and is it possible to parallelize that as well? It seems like if you could, you'd almost get to a point where you had a more traditional compilation model.


> AFAIK that paper says nothing about job servers 

In fact, job servers didn't actually exist when that paper was written. See http://make.mad-scientist.net/papers/jobserver-implementation/ . 

(I actually was working on build systems at the time; more proof that history moves in circles :).
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html describes thinlto. In summary, the thin link step figures out which bitcode file contains which function definitions and which bitcode files call which functions, and if a bitcode file has a call to a function but not the definition _and_ it's likely that the call should be inlined, then the definition of the function gets added to that bitcode file.

In other words, you compile every cc file to bitcode, then in a big merge step (conceptually) inject promising inline candidates into every bitcode file, and then codegen all the bitcode files individually.
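To make that merge step concrete, here is a toy model of the importing decision; the summary shape and the instruction-count threshold are invented for illustration and are far simpler than LLVM's real summaries and heuristics.

# Toy model of the thin-link importing decision described above.
IMPORT_THRESHOLD = 100  # "likely that the call should be inlined"

def compute_import_lists(summaries, call_graph):
    # summaries: {function: (defining_module, instruction_count)}
    # call_graph: {module: set of functions it calls}
    # Returns {module: {other_module: set of functions to import}}.
    imports = {}
    for module, callees in call_graph.items():
        for fn in callees:
            def_module, size = summaries[fn]
            if def_module != module and size <= IMPORT_THRESHOLD:
                imports.setdefault(module, {}).setdefault(def_module, set()).add(fn)
    return imports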

> AFAIK that paper says nothing about job servers 

Not explicitly, but it makes the point that make should know the whole build graph in advance for reasons, and in that world you don't need a job server.
I'd guess that the Blaze implementation of ThinLTO doesn't deal with the duplicated codegen problem; is that correct?

I'm not sure how GN would, either, since it seems like the whole point is that the bitcode will vary based on which target it'll get linked into?   
Yes, it's my understanding blaze doesn't do that. As Reid says, "We don't currently have any plans to address this" either. We don't yet have data on how many bitcode files either don't get rewritten or get rewritten in exactly the same way in many binaries. Once we have the system basically working, getting that data should be fairly straightforward though; then we can reconsider if attacking this subproblem is worth it.
One thing that I can see affecting my preference for how to do this is how large the action graph would actually be if we built it all up front, if that causes any issues, and how we would address those.

More concretely, in one of my ninja all builds, Chrome had about 500 link targets and about 40,000 object files. I expect we don't get close to linking all those object files into every link target, but let's say we wanted to be able to handle something on the order of ten million combinations of object file and link target. Since each of those combinations could result in different native code being generated for the object file, the full action graph would have to have that many nodes, right? If so, how many bytes does that translate to, and would that cause problems for ninja due to having to do a lot more I/O or using a lot more memory than it does now?
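As a rough back-of-envelope (the per-path byte count and per-node overhead are guesses, not measurements):

# Back-of-envelope only; real ninja bookkeeping adds more on top.
combinations  = 10_000_000   # object file x link target pairs
path_bytes    = 80           # rough length of one per-target output path
node_overhead = 200          # guess at ninja's per-node/per-edge structures
total_gb = combinations * (path_bytes + node_overhead) / 1e9
print(f"~{total_gb:.1f} GB")  # ~2.8 GB, the same order as the ~6 GB measured later in this thread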
That's one of the questions a prototype will answer :-)
Ok. I have a simple implementation of -thinlto-index-only for the COFF linker that produces index files (I haven't checked yet if they are correct), but I think I'll see how ninja likes larger action graphs first before I do more work on lld or Clang.
@thakis - do you have any thoughts on *how* you'd actually do this in GN?

The compilation model for C++ objects is baked into the core of GN, so I'm not really seeing a way to do something without hacking on GN. That is doable, obviously, but I haven't yet thought about what you'd need to change or how you'd want to morph the generated ninja files.
Oh, I'll also mention that the compilation model is heavily baked into goma as well; goma isn't a general-purpose remote task execution thing, and so I don't know what would be involved in trying to modify it to be aware of the different compiler invocations and what resources each would need to be able to access.
Yes, goma would need to know something about this new action. In place of include scanning, goma would need to look at the index to figure out what other bitcode inputs it needs to stage back to its workers, ideally pulling them from a content addressed cache so they don't need to be sent back over the wire.

Aside from that, it's a regular compilation action: stage the inputs (binary files instead of headers now), run the same compiler binary from the same existing package, and return the object file.

I think the index is currently a binary file format, which we probably don't want to teach goma to read. It would be much better if the thin link wrote a plain text file of paths for goma to stage.
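A sketch of what that scanning step might look like on the goma side, assuming the thin link writes a plain-text imports file with one bitcode path per line (the shape lld's --thinlto-emit-imports-files option produces); the function and file-name conventions here are illustrative:

# Sketch of the extra input scanning goma would do for a ThinLTO backend action.
import os

def thinlto_extra_inputs(bitcode_obj):
    # Return the additional files to stage for codegen of |bitcode_obj|.
    inputs = [bitcode_obj, bitcode_obj + ".thinlto.bc"]  # own bitcode + index
    imports_file = bitcode_obj + ".imports"
    if os.path.exists(imports_file):
        with open(imports_file) as f:
            # Each line names another bitcode file this TU may import from.
            inputs.extend(line.strip() for line in f if line.strip())
    return inputs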
I don't think goma has any way to do anything like that now. We've talked about adding support for generic remotable operations and/or other toolchains (like java) and/or distributed linking, but all of that is future work.
To get a vague idea of what we might find if we create a ninja build graph this large, I created a build.ninja with some trivial actions, 1,000 objects, and 10,000 targets that each use all of the objects (doing it the other way around caused command lines to be too long).

On my Linux system, this caused ninja to spend about 36 seconds in ".ninja parse". For a clean build, it spent a further 24 seconds in "node stat". After building and touching one source file, "node stat" took almost 47 seconds. In both cases, "lookup node" was responsible for some 7s, "canonicalize str" for 2.5s, and "canonicalize path" for just over 1s. ninja used just under 6GB of memory.

By comparison, if I remove all the target x object nodes from the graph, I get down to 6 seconds for ".ninja parse", about 30ms for "node stat" (in both clean and incremental builds), "lookup node" becomes about 0.8s, "canonicalize str" also about 0.8s, "canonicalize path" about 0.3s, and ninja uses a little over 200MB of memory.

Of these, parsing time is my biggest concern; stat time matters to the extent that it delays the point where we actually start building (or determine that there is nothing to do). Memory usage is also up there. The rest seem ok to me.
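For reference, a script along these lines reproduces the shape of that experiment (1,000 objects, 10,000 targets that each consume all of them, trivial commands); the rule bodies and file names are guesses at the original script, not a copy of it:

# Sketch: generate a synthetic build.ninja to stress-test graph size.
N_OBJS, N_TARGETS = 1000, 10000

with open("build.ninja", "w") as f:
    f.write("rule copy\n  command = cp $in $out\n")
    f.write("rule cat\n  command = cat $in > $out\n")
    objs = []
    for i in range(N_OBJS):
        src, obj = f"src{i}.txt", f"obj{i}.o"
        open(src, "a").close()          # make sure the "source" file exists
        f.write(f"build {obj}: copy {src}\n")
        objs.append(obj)
    all_objs = " ".join(objs)
    for t in range(N_TARGETS):
        f.write(f"build target{t}: cat {all_objs}\n")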
Project Member

Comment 30 by bugdroid1@chromium.org, Aug 31

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f5149004b62288ece136432ed77d660ae38e1371

commit f5149004b62288ece136432ed77d660ae38e1371
Author: Reid Kleckner <rnk@google.com>
Date: Fri Aug 31 01:41:09 2018

Enable win_linker_timing on Windows CFI and ThinLTO bots

This passes the /time option to LLD, which prints out some statistics
about how long various linking phases took. I plan to add more timers
around the major LTO phases so we have more visibility into how ThinLTO
performs on buildbots.

Also remove two CFI bots that moved from chromium.fyi to chromium.clang.

R=dpranke@chromium.org

Bug: 877722
Change-Id: I63181af92503ac6877db3cec7b3296d3e229e5a0
Reviewed-on: https://chromium-review.googlesource.com/1197792
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Reid Kleckner <rnk@chromium.org>
Cr-Commit-Position: refs/heads/master@{#587906}
[modify] https://crrev.com/f5149004b62288ece136432ed77d660ae38e1371/tools/mb/mb_config.pyl

Cc: tejohnson@google.com
Cc: brettw@chromium.org
+brettw since we've been talking about potentially changing GN as well.
re 29: Can you attach the script? Stat time shouldn't be higher than what we currently do since we won't be stating more files. For parsing, make sure to not list the .o files as explicit inputs (they're inputs to the thin link, but not every codegen step needs to dep on all .o files, only on the one thin link output it cares about).
I think we would need to stat more files because they make up part of the thin backend codegen actions.

compile               thin backend
file1.o \           / file1.thin.o \
file2.o - thin-link - file2.thin.o - final link
file3.o /           \ file3.thin.o /

The fileN.thin.o targets are duplicated for each executable/DSO target and we need to stat each of them.
I see. But it should only double the number of .o files stat()ed, right? Since stat()ing currently takes a fraction of a second, it should then take around 1s tops, not 47s (?)
Not if you have lots of targets that all share large numbers of input files (such as our unit test binaries). Because each target's thin-link makes its own  importing decisions, devirtualization decisions, etc., each combined summary (or part of the combined summary that applies to each translation unit) can potentially be different. So the dependency graph looks like (switching to make syntax because ascii art would be unreadable):

target1.thin-link: file1.o file2.o file3.o
target2.thin-link: file1.o file2.o file3.o
target3.thin-link: file1.o file2.o file3.o

target1.obj/file1.o: target1.thin-link
target1.obj/file2.o: target1.thin-link
target1.obj/file3.o: target1.thin-link

target2.obj/file1.o: target2.thin-link
target2.obj/file2.o: target2.thin-link
target2.obj/file3.o: target2.thin-link

target3.obj/file1.o: target3.thin-link
target3.obj/file2.o: target3.thin-link
target3.obj/file3.o: target3.thin-link

target1: target1.obj/file1.o target1.obj/file2.o target1/file3.o
target1: target2.obj/file1.o target2.obj/file2.o target2/file3.o
target1: target3.obj/file1.o target3.obj/file2.o target3/file3.o

i.e. O(M*N) stats.
Sorry, the last part should have been

target1: target1.obj/file1.o target1.obj/file2.o target1.obj/file3.o
target2: target2.obj/file1.o target2.obj/file2.o target2.obj/file3.o
target3: target3.obj/file1.o target3.obj/file2.o target3.obj/file3.o

re: c#28:
> I don't think goma has any way to do anything like that now. We've talked about adding support for generic remotable operations and/or other toolchains (like java) and/or distributed linking, but all of that is future work.

Apparently the work has already been done for ChromeOS:
https://chromium.googlesource.com/infra/goma/client/+/master/client/linker/linker_input_processor/thinlto_import_processor.cc
Regarding the O(M*N) unit test behavior. We've handled this in Bazel for statically linked tests by essentially disabling any real LTO (skipping the thin link and passing in /dev/null as the index file to the ThinLTO backend jobs), and then allowing those ThinLTO backend jobs to be shared between the statically linked tests. This gives you the same optimization pipeline as the main app, but the idea is that you aren't going to get the same whole program optimizations anyway in a unit test as in the main app, so there isn't a big need to test those different whole program optimizations via the unit test.
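For illustration, the backend invocation that approach implies would look roughly like this; the flag usage mirrors the distributed-ThinLTO backend step, and the file names are made up. With /dev/null as the index there is no target-specific import information, so the same native object can be reused by every statically linked test.

# Sketch of the shared (index-less) ThinLTO backend job.
import subprocess

def shared_backend_codegen(bitcode_obj, native_obj):
    subprocess.check_call([
        "clang++", "-O2", "-c", "-x", "ir", bitcode_obj,
        "-fthinlto-index=/dev/null",   # no per-target combined summary
        "-o", native_obj,
    ])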

For tests that are linked shared (or any shared library built and linked into a unit test) this isn't an issue, since each library is LTO'ed and code generated separately during its own link, and the resulting native libraries can be shared by the unit tests.
> disabling any real LTO (skipping the thin link and passing in /dev/null as the index file to the ThinLTO backend jobs)

That doesn't sound like it would work at present with CFI (which currently requires the thin link step), right?
True, if you want to perform CFI on the tests too then you would need to do the thin link. Presumably the CFI-related intrinsics could be lowered/discarded in the back ends if the index was null, but now I remember you saying that you specifically do want to perform CFI for Chrome tests, so I guess this wouldn't help you.
