Use lld's upcoming profile-guided section layout feature on Linux
Issue description

Some folks at Sony have been working on adding a profile-guided section layout feature to lld: https://reviews.llvm.org/D36351

I think it may be worth considering deploying it in the Linux version of Chrome. We already do something similar on Android (and Windows?) using orderfiles, but the section layout feature, if properly designed, would have the advantage of being a first-class citizen of the toolchain. The call graph may also be useful for optimizing a new control flow integrity feature that we are working on (for that feature, the call graph itself would be more useful than just the orderfile).

I extended AutoFDO with support for creating a call graph profile that lld can read, used it to create a profile of the layers_overlap_3d benchmark in blink_perf.layout, and then relinked Chrome using that profile. That gave me these performance numbers:

BEFORE
Running 5 times, ignoring warm-up run (221.605 ms)
Description: Measures performance of non-overlapping 3D layers.
Time: values 177.150, 217.545, 211.005, 202.665, 200.085 ms
avg 201.690 ms, median 202.665 ms, stdev 15.361 ms
min 177.150 ms, max 217.545 ms

AFTER
Running 5 times, ignoring warm-up run (178.350 ms)
Description: Measures performance of non-overlapping 3D layers.
Time: values 172.400, 165.015, 165.850, 162.305, 152.145 ms
avg 163.543 ms, median 165.015 ms, stdev 7.372 ms
min 152.145 ms, max 172.400 ms

So a rather dramatic 20% improvement in performance on that benchmark. Of course, a browser that is optimized for running one microbenchmark isn't necessarily going to be optimized for general usage. I think the next steps would be to decide on a representative workload to profile against (maybe just the set of benchmarks that are monitored by the Chrome Linux perf bots, or whatever the Windows PGO bot does), take a profile of those, and measure the benchmark impact.
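(For context on how this kind of layout tends to work: call-chain clustering, as in the C3 heuristic from Facebook's hfsort paper, greedily merges the hottest caller/callee pairs into clusters so that frequently calling functions end up adjacent. The sketch below is a simplified illustration of that idea; the section names, sizes, and cluster-size cap are made up, and this is not lld's actual implementation.)

# Simplified sketch of greedy call-chain clustering for section layout,
# in the spirit of the C3 heuristic. Hottest call edges are visited first,
# and the callee's cluster is appended after the caller's cluster so that
# functions sit next to their most frequent callers. Names, sizes, and the
# cluster-size cap below are made up; this is not lld's implementation.

PAGE_SIZE = 4096
MAX_CLUSTER = 8 * PAGE_SIZE  # hypothetical cap on cluster growth

def cluster_sections(sizes, call_edges):
    """sizes: {section: bytes}; call_edges: [(caller, callee, weight)].
    Returns section names in layout order."""
    cluster_of = {s: i for i, s in enumerate(sizes)}
    clusters = {i: [s] for i, s in enumerate(sizes)}
    csize = {cluster_of[s]: sz for s, sz in sizes.items()}
    for caller, callee, _ in sorted(call_edges, key=lambda e: -e[2]):
        a, b = cluster_of[caller], cluster_of[callee]
        if a == b or csize[a] + csize[b] > MAX_CLUSTER:
            continue  # already placed together, or merging stops helping
        for s in clusters[b]:
            cluster_of[s] = a
        clusters[a].extend(clusters[b])  # callee cluster follows caller cluster
        csize[a] += csize[b]
        del clusters[b], csize[b]
    return [s for c in clusters.values() for s in c]

sizes = {"main": 900, "hot_helper": 300, "cold_error_path": 2000}
edges = [("main", "hot_helper", 10000), ("main", "cold_error_path", 1)]
print(cluster_sections(sizes, edges))
# ['main', 'hot_helper', 'cold_error_path']: the hot pair ends up adjacent.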
Aug 26 2017
I'm somewhat unconvinced by how PGO went on Windows -- it's slow enough that we can't use it on the perf bots, and so we've been flying blind on Windows since launching PGO. (Clang/win fixes this because the order files change less often and are generated asynchronously to the build.) If we do this, profile generation should be done asynchronously to the build. Priority-wise, I'd say this is after using lld by default on Linux, and maybe Android.
Aug 26 2017
I'm unfamiliar with Windows PGO. By asynchronous do you mean that collecting profile data and applying profile data don't have to use the exact same codebase? If so, I believe that this is how profile-guided section layout works as well (e.g. if a symbol is missing, lld will just ignore that part of the profile data).
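(To make that tolerance concrete, here is a sketch of the kind of filtering involved, assuming a simple "caller callee weight" text format for the profile; the actual on-disk format consumed by the lld patch may well differ.)

# Sketch of tolerating slightly stale profiles: entries that reference
# symbols no longer present in the binary are simply dropped, and the rest
# of the profile still applies. The "caller callee weight" line format is
# an assumption for illustration.

def load_usable_edges(profile_path, symbols_in_binary):
    edges = []
    with open(profile_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            caller, callee, weight = parts[0], parts[1], int(parts[2])
            # A profile taken from an older build may name symbols that were
            # since renamed or removed; ignore those entries rather than fail.
            if caller in symbols_in_binary and callee in symbols_in_binary:
                edges.append((caller, callee, weight))
    return edges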
Aug 26 2017
On Windows, we currently do:
1. build
2. run some benchmarks to collect profiles
3. build again, using profiles

This takes 6h. By "asynchronously", I mean: a profile collection bot builds, collects profiles, uploads them somewhere, and repeats; the actual builder downloads recent profiles and builds. (Which sounds like what you're saying too.)
Aug 28 2017
I believe profile-guided section layout has potential. Quick thoughts about Android:

* AutoFDO is designed for taking slightly stale profiles.
* One unfortunate constraint on ARM is that LBR is not available, which severely limits the optimization power of AutoFDO.
* The current orderfile slows down execution somewhat, but improves startup; to push something new we would need to evaluate both of these dimensions.
* In the Android land we are also becoming increasingly worried about the binary footprint, which means the optimization must be tuned to not expand code. Also, if we could gauge this optimization to group unused code into a special section of the binary that is supposedly unused, this would save some resident memory. If it turns out to be >1MiB of savings, that would be good; if it's >10MiB, I'll finance a beer party for everyone involved :)
Aug 28 2017
> In the Android land we are also becoming increasingly worried about the binary footprint, which means the optimization must be tuned to not expand code. Also, if we could gauge this optimization to group unused code into a special section of the binary that is supposedly unused, this would save some resident memory. If it turns out to be >1MiB of savings, that would be good; if it's >10MiB, I'll finance a beer party for everyone involved :)

I'm curious why you want to pack rarely used code (not unused, right?) into a special section. My understanding is that if you correctly identify rarely used functions and aggregate them into one place in .text, the kernel won't map the pages for those functions because they are not used at runtime. So you can save memory, no?
Aug 28 2017
That's what I am thinking about - extract rarely used functions out of .text and aggregate them into a contiguous region, say at the end of .text. I did not mean to make a special _section_ in the binary beyond .text; pardon if mentioning "section" confused the matter. The kernel will of course have the byte range mapped to virtual memory, but the hope is that if the pages stay mostly unused, this mapping won't have pages of physical RAM attached.
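(A toy way to estimate that effect: count the 4 KiB pages that contain any hot byte when hot and cold functions are interleaved, versus when cold functions are grouped at the end of .text. The function sizes and hot/cold split below are made up; this is purely illustrative.)

# Toy estimate of resident-memory savings from grouping cold functions:
# counts 4 KiB pages touched by hot code under an interleaved layout vs.
# a layout with all cold functions moved to the end of .text.

PAGE = 4096

def hot_pages(layout):
    """layout: [(name, size_bytes, is_hot)] in address order.
    Returns the set of page indices containing at least one hot byte."""
    pages, addr = set(), 0
    for _, size, is_hot in layout:
        if is_hot:
            pages.update(range(addr // PAGE, (addr + size - 1) // PAGE + 1))
        addr += size
    return pages

# Hypothetical .text: 3000 functions of 1500 bytes, every third one hot.
interleaved = [("f%d" % i, 1500, i % 3 == 0) for i in range(3000)]
grouped = [f for f in interleaved if f[2]] + [f for f in interleaved if not f[2]]

before, after = len(hot_pages(interleaved)), len(hot_pages(grouped))
print("hot pages: %d -> %d (%.1f KiB of hot working set saved)"
      % (before, after, (before - after) * PAGE / 1024))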
Aug 28 2017
> One unfortunate constraint on ARM is that LBR is not available, which severely limits the optimization power of AutoFDO.

We might consider using profiles generated on x86 when targeting ARM. Hopefully the differences between the profiles will be minor enough not to matter in practice.

> In the Android land we are also becoming increasingly worried about the binary footprint, which means the optimization must be tuned to not expand code.

Since this is purely a reordering optimization, I would not expect it to increase code size; in fact, I would expect a slight decrease in code size on ARM, because the linker would need to create fewer range extension thunks as a result of functions being placed nearer to their callers.
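(Background on the thunk point: a direct branch on 32-bit ARM can only reach a limited displacement -- roughly ±16 MiB for Thumb-2 BL, with the exact limit depending on the instruction set -- and the linker must insert a range extension thunk for each call that exceeds it. A toy way to count how many calls a given layout forces out of range; the branch range constant and data structures here are illustrative assumptions:)

# Rough count of calls needing range extension thunks under a given layout.
# BRANCH_RANGE approximates the Thumb-2 BL reach; the real limit depends on
# the target instruction set, so treat this constant as an assumption.

BRANCH_RANGE = 16 * 1024 * 1024  # ~±16 MiB

def thunk_count(layout, calls):
    """layout: [(name, size_bytes)] in address order;
    calls: [(caller, callee)] pairs, e.g. taken from call relocations."""
    addr, pos = 0, {}
    for name, size in layout:
        pos[name] = addr  # address each function gets under this layout
        addr += size
    # Every call whose displacement exceeds the branch range needs a thunk.
    return sum(1 for a, b in calls if abs(pos[a] - pos[b]) > BRANCH_RANGE)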
Aug 29 2017
> > One unfortunate constraint on ARM is that LBR is not available, which severely limits the optimization power of AutoFDO.
> We might consider using profiles generated on x86 when targeting ARM. Hopefully the differences between the profiles will be minor enough not to matter in practice.

Agreed. Most likely the lack of LBR should not matter for laying out sections in the code. But since nobody has considered LBR-less AutoFDO before, I don't have confidence in how it affects the inline heuristics.

> > In the Android land we are also becoming increasingly worried about the binary footprint, which means the optimization must be tuned to not expand code.
> Since this is purely a reordering optimization, I would not expect it to increase code size; in fact, I would expect a slight decrease in code size on ARM, because the linker would need to create fewer range extension thunks as a result of functions being placed nearer to their callers.

Fingers crossed :) In my old experiment, AutoFDO produced more compact code compared to -O2 (3 MiB saving), but bigger code compared to the "partial O2" that we ship these days (300 KiB more). Nobody knows the reasons, but I speculate that "it is all the inliner", which was not tuned for our situation.
Sep 6 2017
List of benchmarks used for AutoFDO on Chrome OS: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/telemetry_AFDOGenerate/telemetry_AFDOGenerate.py#44

I haven't had a chance to compare that list with the one used on Windows.

I collected a presumably more realistic profile using the Windows benchmark list linked above, together with blink_perf.layout. That gave me a median 5% improvement across the benchmarks in blink_perf.layout as compared to no layout.

I then made some improvements to the lld patch linked above:
- if a section is replaced with an identical section using ICF, apply the profile to the replacement section
- create a separate symbol table for looking up sections that also includes local symbols

With those changes I get a median 7.5% improvement across blink_perf.layout.
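(The first of those two fixes can be pictured like this: before the profile is used for layout, redirect any edge endpoint that ICF folded away to its identical surviving section. This is a sketch; lld's internal data structures obviously look different.)

# Sketch of applying a call-graph profile after ICF: edges that point at a
# folded-away section are redirected to the identical surviving section.
# The data structures are illustrative, not lld's.

def redirect_through_icf(edges, icf_replacement):
    """edges: [(from_sec, to_sec, weight)];
    icf_replacement: {folded_section: surviving_section}."""
    def resolve(sec):
        # Follow the replacement chain in case a replacement was itself folded.
        while sec in icf_replacement:
            sec = icf_replacement[sec]
        return sec
    return [(resolve(a), resolve(b), w) for a, b, w in edges]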
Sep 6 2017
pcc: nice! Would it be hard to get numbers for non-microbenchmarks from the list above? Since the sunset of Octane, the most important benchmarks from the list seem to be:
1. page_cycler_v2.* (but --pageset-repeat=1 may be too low for a sampling profiler)
2. speedometer
Sep 6 2017
This is probably a silly question, but the numbers look too good to me. 5% to 7.5% improvements look huge. Is it because they are micro-benchmarks?
Sep 6 2017
#11: I will take a look.

#12: Yes, they are microbenchmarks. The numbers do seem in line with those reported in the Facebook paper, though.
Nov 28 2017
Upon re-reading, this is maybe more similar to the orderfile stuff we do in win/clang builds than full PGO (?). Hans knows all about that.
Nov 28 2017
The Windows orderfile stuff is even documented nowadays: https://chromium.googlesource.com/chromium/src/+/master/docs/win_order_files.md
Jan 10
Archiving P3s older than 1 year with no owner or component.