Put code in the middle of the native library to reduce PSS and binary size
Issue description

Chrome on Android uses THUMB2 code for ARM. In this instruction set, the relative jump operand has a range of +/-16MB, per http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/Cihfddaf.html. For "far" jumps, the linker (gold here) generates a trampoline written in ARM32 code, made of three instructions followed by a 32-bit constant:

    ldr r12, [pc, #4]
    add r12, pc, r12
    bx r12
    <32-bit constant>

The far jump is then done by jumping to the trampoline (with an instruction set switch, THUMB -> ARM). This means that each far jump target takes 4x4 = 16 bytes of code, increasing binary size (and slightly lowering performance).

Looking at the encoding, the three instructions are, in hex:

    e59fc004
    e08fc00c
    e12fff1c

Taking a Chrome release build:

    $ hexdump -v -e '/4 "%08x\n"' out/Release/libchrome.so | grep e59fc004 | wc -l
    55378
    $ hexdump -v -e '/4 "%08x\n"' out/Release/libchrome.so | grep e08fc00c | wc -l
    55378
    $ hexdump -v -e '/4 "%08x\n"' out/Release/libchrome.so | grep e12fff1c | wc -l
    55378

Meaning that all the trampolines take 55378 * 16 = 886,048 bytes.

The ordered code contains both code that reaches out to the rest of the library, and hot code (think //base for instance) which is called from all over the native library. It is currently placed at the beginning of the library, meaning that the average distance between it and the rest of the code is higher than if it were in the middle of the binary.

From a local experiment, by making the orderfile exhaustive and putting the ordered code in the middle of it, we get:

Before:

    $ size out/Release/libchrome.so
        text    data     bss      dec     hex filename
    45535692 2083516 1999456 49618664 2f51ee8 out/Release/libchrome.so

After:

    $ size out/Release/libchrome.so
        text    data     bss      dec     hex filename
    44983512 2083516 1999456 49066484 2ecb1f4 out/Release/libchrome.so

That is a 45535692 - 44983512 = 552,180 bytes saving.

Since the trampolines are contiguous in memory and pretty much always paged in, this should translate to an equivalent PSS saving (clean memory).
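For anyone who wants to redo the count without hexdump, here is a minimal sketch that looks for the three-instruction sequence as a whole instead of grepping each word separately. It assumes a little-endian binary and that file offsets keep the 4-byte alignment of the code; the function names are made up for this illustration.

```python
import struct

# Encodings quoted in the issue description for gold's ARM-mode trampoline.
TRAMPOLINE_WORDS = (0xE59FC004, 0xE08FC00C, 0xE12FFF1C)
TRAMPOLINE_SIZE = 16  # three instructions plus one 32-bit constant


def count_trampolines(data: bytes) -> int:
    """Count 4-byte-aligned occurrences of the three-instruction sequence."""
    pattern = struct.pack("<3I", *TRAMPOLINE_WORDS)  # little-endian words
    count = 0
    pos = data.find(pattern)
    while pos != -1:
        if pos % 4 == 0:  # ignore matches that are not word-aligned
            count += 1
        pos = data.find(pattern, pos + 1)
    return count


def trampoline_bytes(count: int) -> int:
    """Total bytes taken by `count` trampolines."""
    return count * TRAMPOLINE_SIZE
```

Usage would be along the lines of `count_trampolines(open(path, "rb").read())`, and `trampoline_bytes(55378)` reproduces the 886,048 bytes figure above.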
,
Jan 15 2018
With LLD, it seems that the thunk is different, and more space-efficient. From https://github.com/llvm-mirror/lld/blob/5f7e346591a5c455aa3d20028836e4a5c98b5b6a/ELF/Thunks.cpp#L102, it is:

    const uint8_t Data[] = {
        0x4f, 0xf6, 0xf4, 0x7c, // P:  movw ip,:lower16:S - (P + (L1-P) + 4)
        0xc0, 0xf2, 0x00, 0x0c, //     movt ip,:upper16:S - (P + (L1-P) + 4)
        0xfc, 0x44,             // L1: add r12, pc
        0x60, 0x47,             //     bx r12
    };

Looking in the binary for 476044fc, that is:

    add r12, pc
    bx r12

yields 26719 hits on an unordered build. It would be higher for an ordered build.

From what I can tell, the thunk avoids the THUMB -> ARM switch, and takes 4 fewer bytes. Performance may be a bit better; I don't know whether an instruction set switch has a penalty on ARM.
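As a sanity check on the search pattern, a small sketch showing why the last two Thumb instructions show up as 476044fc in a 4-byte hexdump, and where the 4-byte saving comes from. It assumes the hexdump was run on a little-endian host; the helper names are invented here.

```python
# Thunk bytes as quoted from lld's Thunks.cpp in the comment above.
LLD_THUNK = bytes([
    0x4F, 0xF6, 0xF4, 0x7C,  # movw ip, :lower16:...
    0xC0, 0xF2, 0x00, 0x0C,  # movt ip, :upper16:...
    0xFC, 0x44,              # add r12, pc
    0x60, 0x47,              # bx r12
])
GOLD_TRAMPOLINE_SIZE = 16  # from the issue description


def tail_as_hexdump_word(thunk: bytes) -> str:
    """Render the last four bytes as one little-endian 32-bit word in hex."""
    return format(int.from_bytes(thunk[-4:], "little"), "08x")


def bytes_saved_per_thunk() -> int:
    """Size difference between the gold trampoline and the lld thunk."""
    return GOLD_TRAMPOLINE_SIZE - len(LLD_THUNK)
```

`tail_as_hexdump_word(LLD_THUNK)` gives the string that was grepped for, and `bytes_saved_per_thunk()` is the "4 fewer bytes".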
,
Jan 16 2018
Re-did the experiment with lld:

    $ size out/Release/lib.unstripped/libchrome.so
        text    data     bss      dec     hex filename
    41098945 2058624 1620380 44777949 2ab41dd out/Release/lib.unstripped/libchrome.so

With frequently used code in the middle:

    $ rm -f out/Release/lib.unstripped/libchrome.so && n Release libchrome
    $ size out/Release/lib.unstripped/libchrome.so
        text    data     bss      dec     hex filename
    40795341 2058624 1620380 44474345 2a69fe9 out/Release/lib.unstripped/libchrome.so

Difference:

    $ echo $((41098945 - 40795341))
    303604

So, 300k with lld. The difference is probably due to the smaller thunk, and the overall smaller code size that required fewer thunks to begin with.
,
Jan 16 2018
Wow, 300K is pretty great!
,
Jan 25 2018
Note: lld actually puts the thunks in the binary as symbols:

    $ arm-linux-gnueabihf-readelf -s out/Release/lib.unstripped/libchrome.so | grep __ThumbV7PILongThunk | wc -l
    45533

Each thunk is 12 bytes, so that is 45533 * 12 = 546,396 bytes in total.
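The same count can be sketched in a few lines of Python over saved `readelf -s` output. The sample rows below are made up for illustration (only the `__ThumbV7PILongThunk` prefix and the 12-byte size come from this thread), and the parsing assumes the usual `Num: Value Size Type Bind Vis Ndx Name` column layout.

```python
THUNK_PREFIX = "__ThumbV7PILongThunk"
THUNK_SIZE = 12  # bytes per thunk, as stated above

# Hypothetical readelf -s rows, not taken from a real build.
SAMPLE_READELF = """\
   101: 015c2001    12 FUNC    LOCAL  DEFAULT   14 __ThumbV7PILongThunk_foo
   102: 015c200d    12 FUNC    LOCAL  DEFAULT   14 __ThumbV7PILongThunk_bar
   103: 00a00001    64 FUNC    GLOBAL DEFAULT   14 main
"""


def thunk_symbol_stats(readelf_output: str) -> tuple[int, int]:
    """Return (number of thunk symbols, total thunk bytes)."""
    count = 0
    for line in readelf_output.splitlines():
        fields = line.split()
        # The symbol name is the eighth column of a symbol table row.
        if len(fields) >= 8 and fields[7].startswith(THUNK_PREFIX):
            count += 1
    return count, count * THUNK_SIZE
```

On the made-up sample this returns two thunks; with 45533 symbols the total matches the 546396 bytes above.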
,
Feb 15 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a97185b1e0c031cc0aef4f3f759007f02abb5555

commit a97185b1e0c031cc0aef4f3f759007f02abb5555
Author: Benoit Lize <lizeb@chromium.org>
Date: Thu Feb 15 16:10:18 2018

android: Add a tool to put frequently-used code in the middle.

This reduces binary size (and memory footprint) by >300kB on ARM for libchrome.so, by reducing the number of long relative jump thunks from 45k to 15k, with each thunk being 12 bytes.

See the attached bug for details. This CL only adds the script to generate a new orderfile, a forthcoming CL will include this in the build.

Bug: 801601
Change-Id: I2b37e032762df8a9ecaa1c03b602a89453d41b97
Reviewed-on: https://chromium-review.googlesource.com/899742
Commit-Queue: Benoit L <lizeb@chromium.org>
Reviewed-by: Matthew Cary <mattcary@chromium.org>
Reviewed-by: agrieve <agrieve@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537034}

[modify] https://crrev.com/a97185b1e0c031cc0aef4f3f759007f02abb5555/base/android/library_loader/anchor_functions.cc
[modify] https://crrev.com/a97185b1e0c031cc0aef4f3f759007f02abb5555/base/android/library_loader/anchor_functions.h
[add] https://crrev.com/a97185b1e0c031cc0aef4f3f759007f02abb5555/tools/cygprofile/reorder_native_library.py
,
Mar 23 2018
Thanks to Andrew's help, I now have a draft CL that compiles, at least: https://chromium-review.googlesource.com/c/chromium/src/+/978010

However, the size savings are lower than they used to be, thanks to pcc@ shrinking Chrome's size with LLD improvements. :-)

    $ ll out/Release/libchrome_partial_ordering.so
    -rwxr-x--- 1 lizeb eng 44014320 Mar 23 15:54 out/Release/libchrome_partial_ordering.so*
    $ ll out/Release/libchrome.so
    -rwxr-x--- 1 lizeb eng 43854576 Mar 23 15:55 out/Release/libchrome.so*
    $ echo $((44014320-43854576))
    159744

So, 159kB only. On the other hand, visualizations show that all the thunks are now in the middle of the binary, and that we have fewer of them. This is likely thanks to this LLD commit:

    commit 7211b2ad60d2ca2ffe550ba77406bb482efcd78d
    Author: Peter Collingbourne <peter@pcc.me.uk>
    Date: Fri Mar 9 17:54:43 2018 +0000

    ELF: Do not create multiple thunks for the same virtual address.

    This avoids creating multiple thunks for symbols with aliases or which belong to ICF'd sections. This patch reduces the size of Chromium for Android by 260KB (0.8% of .text).

    Differential Revision: https://reviews.llvm.org/D44284

    git-svn-id: https://llvm.org/svn/llvm-project/lld/trunk@327154 91177308-0d34-0410-b5e6-96231b3b80d8

So overall good news, pcc's change is strictly better than mine :-)
,
Mar 23 2018
Nice! (both of you!) 159kb still seems worth it to me. Might be able to make the first link a bit faster. E.g. will the list of symbols be available if "-Wl,--strip-debug" is used? I think I recall that nm could still list them with -g0, but that's a bit different... When we eventually turn on thinlto, the first link will also seem quick :P.
,
Mar 23 2018
It's possible that we could make a change to lld that would place the orderfile sections in the middle of the binary. That would save needing to do two steps.

That change was also just the start of a set of linker-side improvements that I made to try to reduce the number of thunks. I also wrote an lld patch that implements a section layout algorithm that tries to place sections halfway between their first reference and their last reference. (If an orderfile is passed, the algorithm would consider all sections in the orderfile as a single section.) Without an orderfile, the algorithm was actually able to remove all unnecessary thunks (necessary meaning e.g. thunking between Thumb and ARM), and with an orderfile it was able to reduce the number of thunks significantly. I did however measure a slight loss in performance in both cases. I will try to find the numbers I collected on size and performance.

I'd be curious to see the performance impact of a change that just moves the orderfile sections into the middle.
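To make the "halfway between first and last reference" idea concrete, here is a toy sketch of one pass of such a layout heuristic. This is not the lld patch: the section names, sizes, call list and reach are invented, and distances are measured between section start addresses only.

```python
def layout_addresses(order, sizes):
    """Assign consecutive start addresses to sections in the given order."""
    addresses, addr = {}, 0
    for name in order:
        addresses[name] = addr
        addr += sizes[name]
    return addresses


def count_out_of_range(order, sizes, calls, reach):
    """Count calls whose caller and callee start more than `reach` apart."""
    addr = layout_addresses(order, sizes)
    return sum(1 for src, dst in calls if abs(addr[dst] - addr[src]) > reach)


def midpoint_pass(order, sizes, calls):
    """Re-sort sections by the midpoint of their first and last caller."""
    addr = layout_addresses(order, sizes)

    def desired(name):
        callers = [addr[src] for src, dst in calls if dst == name]
        if not callers:
            return addr[name]  # unreferenced sections keep their position
        return (min(callers) + max(callers)) / 2

    return sorted(order, key=desired)  # stable sort preserves ties


# Toy input: one hot section "A" called from four others, reach of 25 bytes.
SIZES = {name: 10 for name in "ABCDE"}
CALLS = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A")]
before = ["A", "B", "C", "D", "E"]
after = midpoint_pass(before, SIZES, CALLS)
```

In this toy case the hot section moves from the front to the middle, and the number of out-of-range calls (a stand-in for thunks) drops from 2 to 0. A real implementation would iterate, since moving sections changes every address.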
,
Mar 23 2018
I forgot to mention one of the main reasons why I stopped working on my algorithm: the resulting binary does not compress well.

Numbers on code size. First, the baseline:

    $ rebuild.sh libchrome.so -fuse-ld=$HOME/src2/llvm-project/ra/bin/ld.lld; zgrep '\.text\.thunk' lib.unstripped/libchrome.so.map.gz; wc -c libchrome.so; gzip -c libchrome.so | wc -c
    015c1ff4 0003b5dc 4 <internal>:(.text.thunk)
    025cd594 0000f720 4 <internal>:(.text.thunk)
    43919488 libchrome.so
    25730711

Now with the algorithm. It is iterative, so I tried varying the number of iterations:

    $ for i in `seq 1 10`; do echo $i; THUNK_SECTION_ORDER=1 THUNK_ITERS=$i rebuild.sh libchrome.so -fuse-ld=$HOME/src2/llvm-project/ra/bin/ld.lld; zgrep '\.text\.thunk' lib.unstripped/libchrome.so.map.gz; wc -c libchrome.so; gzip -c libchrome.so | wc -c; done
    1
    015c1c90 00049930 4 <internal>:(.text.thunk)
    025db560 00000108 4 <internal>:(.text.thunk)
    43911296 libchrome.so
    26427485
    2
    015c1fec 00016368 4 <internal>:(.text.thunk)
    025a833c 000001e0 4 <internal>:(.text.thunk)
    43702400 libchrome.so
    26860645
    3
    015c1fe8 0000e64c 4 <internal>:(.text.thunk)
    025a0638 000001b0 4 <internal>:(.text.thunk)
    43669632 libchrome.so
    26973666
    4
    015c1fd0 0000e514 4 <internal>:(.text.thunk)
    025a04c4 000001bc 4 <internal>:(.text.thunk)
    43669632 libchrome.so
    26940655
    5
    015c1fc0 0000e07c 4 <internal>:(.text.thunk)
    025a0014 000001bc 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26911801
    6
    015c1ff0 0000dbd8 4 <internal>:(.text.thunk)
    0259fbc0 000001bc 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26882874
    7
    015c1ff0 0000dc08 4 <internal>:(.text.thunk)
    0259fb8c 000001bc 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26862034
    8
    015c1f84 0000deb4 4 <internal>:(.text.thunk)
    0259fdc0 000001bc 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26842628
    9
    015c1fc8 0000ded8 4 <internal>:(.text.thunk)
    0259f908 000001c8 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26827767
    10
    015c1fdc 0000e0e8 4 <internal>:(.text.thunk)
    0259fcf8 000001d4 4 <internal>:(.text.thunk)
    43665536 libchrome.so
    26810347

(The compressed size appears to be going down, but if I let it run for more iterations it eventually converges on 26760000 or so.)

So we end up with a binary that is 250KB smaller uncompressed, but >1MB larger compressed.

Probably the next thing that I would try would be to change the algorithm so that it only moves a section if it is out of range. Hopefully that should help preserve the more compressible original order, at the cost of requiring more iterations.
,
Mar 23 2018
I tried a number of approaches that involved only moving a section if it was out of range. Unfortunately they either didn't converge to a good position at all and/or increased the compressed size by too much.

I then looked at adjusting the position of the orderfile sections relative to the other sections without changing the relative positions of the other sections. It turns out that the best position is not necessarily exactly in the middle of the .text section. I found that placing the orderfile sections at 44% into .text resulted in the smallest number of thunks (it ought to be possible to teach the linker to find the best placement automatically):

    015c1ebc 00010f7c 4 <internal>:(.text.thunk)
    025a2e1c 0000d2a8 4 <internal>:(.text.thunk)
    43739264 libchrome.so
    25729757

So that is 180KB saved relative to baseline. I will run performance measurements on this layout over the weekend.
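A rough sketch of what "finding the best placement automatically" could look like: keep the other sections in their original order, try the ordered block at every insertion point, and keep the one with the fewest out-of-range calls. Everything here (names, sizes, calls, reach) is a made-up toy model, not linker code.

```python
def layout_addresses(order, sizes):
    """Assign consecutive start addresses to sections in the given order."""
    addresses, addr = {}, 0
    for name in order:
        addresses[name] = addr
        addr += sizes[name]
    return addresses


def count_out_of_range(order, sizes, calls, reach):
    """Count calls whose caller and callee start more than `reach` apart."""
    addr = layout_addresses(order, sizes)
    return sum(1 for src, dst in calls if abs(addr[dst] - addr[src]) > reach)


def best_insertion_index(ordered, others, sizes, calls, reach):
    """Return (index, out-of-range calls) for the best spot among `others`."""
    best = None
    for i in range(len(others) + 1):
        layout = others[:i] + ordered + others[i:]
        misses = count_out_of_range(layout, sizes, calls, reach)
        if best is None or misses < best[1]:
            best = (i, misses)  # first minimum wins on ties
    return best


# Toy input: section "a" is much larger than the rest, which skews the optimum.
SIZES = {"hot": 10, "a": 40, "b": 10, "c": 10, "d": 10, "e": 10}
CALLS = [(name, "hot") for name in "abcde"]
result = best_insertion_index(["hot"], list("abcde"), SIZES, CALLS, 25)
```

With these invented sizes the best slot is index 3, which puts the hot block at byte 60 of 90, so well away from the exact middle, in the same spirit as the 44% observation above.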
,
Mar 23 2018
ruiu: You might find this interesting.
,
Mar 24 2018
I wrote another patch that uses 4-byte thunks if the target function is within 16MB range of the thunk. On top of the other changes, that gives me:

    015c1e14 000072a4 4 <internal>:(.text.thunk)
    0259908c 0000b300 4 <internal>:(.text.thunk)
    43690112 libchrome.so
    25714881

So we lose another 50KB, and the total size of the thunk sections is 75KB. That's probably enough for now.
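The size decision can be sketched as below. This assumes the 4-byte thunk is a single Thumb-2 `b.w`, whose offset is a signed, even 25-bit value relative to the instruction address plus 4; the function and constant names are invented, and the real logic lives in the lld patch.

```python
SHORT_THUNK_SIZE = 4   # assumed: one Thumb-2 b.w instruction
LONG_THUNK_SIZE = 12   # movw/movt/add/bx sequence quoted earlier in the thread

# b.w offset limits, relative to PC (address of the instruction + 4).
B_W_MIN = -(1 << 24)
B_W_MAX = (1 << 24) - 2


def thunk_size(thunk_addr: int, target_addr: int) -> int:
    """Pick the short thunk when the target is reachable with one b.w."""
    offset = target_addr - (thunk_addr + 4)
    if B_W_MIN <= offset <= B_W_MAX:
        return SHORT_THUNK_SIZE
    return LONG_THUNK_SIZE


# Sizes of the two .text.thunk sections from the map output above.
total_thunk_bytes = 0x72A4 + 0xB300
```

`total_thunk_bytes` comes out at 75172, which is the 75KB quoted above.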
,
Mar 27 2018
Short thunks: https://reviews.llvm.org/D44963
,
Mar 27 2018
Thanks pcc for adding me to this bug. This is indeed very interesting. I'll review that patch.

Slightly off-topic, but I believe there's a more ambitious way of reducing the size of ARM executables. If a linker has the ability to repack instructions, it can rewrite a +/-16MB THUMB2 B instruction in place with a code sequence for a "long" jump, instead of creating a new thunk. In theory, it's doable if all offsets are represented as relocations, even if they are within the same section. It might be interesting to explore this option as well.
,
Mar 28 2018
Unfortunately that could get very complicated very quickly. One of the problems is that exception handling information (i.e. .ARM.exidx) can contain relative offsets that would need to be fixed up. Even more tricky would be instructions like cbnz which have a limited range, and there would likely need to be a host of new relocation types to handle them. Rui mentioned offline that RISC-V does something similar, so it does at least in principle seem possible, at least on some architectures. But it also isn't clear whether this would be a size win in general because you wouldn't be able to reuse thunks as easily (e.g. imagine that you have more than one function that calls an out-of-range function via bl -- you wouldn't be able to reuse the thunk because lr would end up being set to the wrong value).

One other neat trick I thought of was to move the last thunk section in order to increase the likelihood that we will be able to create 4-byte thunks: https://reviews.llvm.org/D44966
That saves 32KB.

And the patch for moving the ordered sections into the middle: https://reviews.llvm.org/D44969
which saves 60KB (I decided to forgo optimizing the position in the linker for now).

The final total size of the thunk sections is 56988 bytes.
,
Apr 3 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/9a54951119a1f2493b994e24b0524512c97584e3

commit 9a54951119a1f2493b994e24b0524512c97584e3
Author: Benoit Lize <lizeb@chromium.org>
Date: Tue Apr 03 13:18:26 2018

android: Remove cygprofile/reorder_native_library.py.

LLD optimized the layout of ordered sections, making this reordering step unnecessary. See attached bug.

Bug: 801601
Change-Id: I349d370fc847fe9be3482c764030d5b149c62c32
Reviewed-on: https://chromium-review.googlesource.com/992333
Commit-Queue: Benoit L <lizeb@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Cr-Commit-Position: refs/heads/master@{#547684}

[delete] https://crrev.com/5b84e88a070e40a1a74576a6cf41109d29a8035a/tools/cygprofile/reorder_native_library.py
,
Apr 4 2018
,
Apr 4 2018
,
Apr 17 2018
https://chromeperf.appspot.com/group_report?rev=550608 Size went down by 86KB, which can probably be partially attributed to the patch mentioned in c#13. The other changes are probably not visible on chromium.perf because, as far as I know, it does not use orderfiles.
,
Apr 18 2018
\o/
,
Jun 6 2018
Comment 1 by agrieve@chromium.org
, Jan 12 2018