Order of symbols in the binary can affect perf in microbenchmarks
Issue description

Forking this from: [gn-dev] static_library vs source_set and perf
https://groups.google.com/a/chromium.org/d/msg/gn-dev/v5nXfP7XF4E/HIXxz_CeCAAJ

TL;DR of the story: in the context of Issue 631421, it turns out that switching some targets from source_set to static_library (specifically crrev.com/2172033002) has some effect on performance, visible at least in some micro-benchmarks. No other change (optimization flags or such) is involved.

By diffing the binary, it seems that the assembly of some functions is different before and after the aforementioned change [1]. torne@ suggested that those diffs could be due to the relative position of symbols (caller vs. branch target) [2].

One extra thought:
- We tend to build perf-critical code (e.g. base) with the optimize_max config (-O2).
- Functions defined in headers get the optimization level of the source translation unit, so if we have hot functions in those headers they won't get any perf flag. Specifically for this issue, if those functions are non-leaf and call something else (allocator, STL methods, etc.), the lack of -O2 can make the call site pessimistic (see [2]) and make it more subject to symbol ordering.

Last but not least, this picture is incomplete on Android. The official internal binary (chrome_apk) has an extra orderfile step. It is a kind of link-time FDO, where the hottest functions in the startup path get reordered first, so this extra step is likely to reshuffle the binary again. pasko@ however points out that the orderfile reorders only some functions, not all of them.

[1] primiano@:
✂ ✂ ✂ ✂ ✂
So, I did diff the symbols of the build artifacts produced by the waterfall before and after Bruce's CL. I was expecting the binaries to be identical modulo symbol address reshuffling. Instead I see that some symbols somehow got compiled differently. I wonder if this is interfering with link-time optimization.
The list of the ~200 differing symbols is the following: http://pastebin.com/ih73BBNv (the last argument is the size of the symbol). It's not "lots of symbols" if you consider that there are ~812272 symbols in total and only ~200 changed, but I suppose enough to make a difference. The produced assembly looks quite different for those, for instance:

WTF::Vector<unsigned int, 0u, WTF::PartitionAllocator>::appendSlowCase
blink::protocol::Accessibility::AXValue::~AXValue()
WTF::VectorBufferBase<blink::Member<blink::MessagePort>, false, blink::HeapAllocator>::allocateBuffer(unsigned int)
✂ ✂ ✂ ✂ ✂

[2] torne@:
✂ ✂ ✂ ✂ ✂
In the latter version of this one, WTF::MemoryBarrier looks to have been placed too far away from the function to call it with a direct Thumb branch, so it has to load a PC-relative address from the literal pool and blx to it. It also inlined more functions. I would guess that the functions it inlines here are too far away to reach with a direct branch, and the resulting difference in instruction count/register usage for the branch makes it worth inlining when it wasn't in the case where a direct branch worked? The other two look like they could be similar.

Different branch instructions (conditional vs. unconditional, etc.) have different ranges in Thumb2; I forget all the combinations, but there are branches that are +/- 4 MiB and 16 MiB? Both of these are smaller than our library :) When not doing link-time optimisation, the compiler likely generates pessimistic sequences with relocations that allow very large ranges, and at link time the linker simply fills in the relocs, so you pay the big/slow branch sequence every time (or else it generates short branches and the linker has to generate branch islands to extend the range - both techniques have been used, but I don't know what's common in modern toolchains).
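The symbol comparison primiano@ describes can be sketched as follows. This is a minimal illustration, assuming the two nm-style listings have already been parsed into name → size maps; the symbol names and sizes below are made up for the example, not taken from the pastebin.

```python
def diff_symbols(before, after):
    """Return {name: (old_size, new_size)} for symbols present in both
    builds whose size changed (a rough proxy for 'compiled differently')."""
    return {name: (before[name], after[name])
            for name in before.keys() & after.keys()
            if before[name] != after[name]}

# Hypothetical name -> size maps standing in for the two waterfall builds.
before = {"WTF::Vector::appendSlowCase": 120, "AXValue::~AXValue": 48}
after  = {"WTF::Vector::appendSlowCase": 152, "AXValue::~AXValue": 48}

print(diff_symbols(before, after))
# → {'WTF::Vector::appendSlowCase': (120, 152)}
```

Same-size symbols can of course still differ in their bodies; a real comparison would also diff the disassembly, as was done here.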
This mostly means that the layout of the code in the binary has no effect, because the compiler generates the same instructions regardless of the final layout. LTO means it can optimise for the layout it actually got and use efficient short branches where possible, but it also means that the codegen depends on the exact layout, and that's likely to change wildly if you change the input object file order. All speculative, but seems plausible?
✂ ✂ ✂ ✂ ✂
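torne@'s range argument can be made concrete with a small reachability check. A minimal sketch, assuming the commonly cited 32-bit Thumb2 encodings (unconditional B.W: roughly +/- 16 MiB; conditional B.W: roughly +/- 1 MiB; consult the ARM Architecture Reference Manual for the exact per-encoding limits); the addresses are hypothetical.

```python
MIB = 1 << 20  # one mebibyte

def reachable_by_direct_branch(caller, target, conditional=False):
    """True if `target` is within direct Thumb2 branch range of `caller`.
    An out-of-range target forces a slower sequence, e.g. a literal-pool
    load of the address followed by an indirect blx."""
    limit = 1 * MIB if conditional else 16 * MIB
    return -limit <= target - caller < limit

# Same callee, two different layouts of a library larger than 16 MiB:
print(reachable_by_direct_branch(0x00100000, 0x00500000))  # → True (4 MiB away)
print(reachable_by_direct_branch(0x00100000, 0x02100000))  # → False (32 MiB away)
```

This is why object-file order can flip the codegen under LTO: the compiler knows the final distances and picks short branches (or inlining) accordingly, so reshuffling the inputs changes which calls are "near".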
Apr 26 2017
Apr 27 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
May 2 2018
@primiano, not sure this is actionable. It's more of a heads-up, "micro benchmarks are noisy/unreliable" and "compilers are unpredictable".
May 2 2018
agreed
Comment 1 by brucedaw...@chromium.org, Aug 18 2016