Investigate using -section-ordering-file for .rodata/.bss
Issue description

I just checked: LLD respects -section-ordering-file for .rodata (and potentially for .bss as well).

Nitty details: I checked without ThinLTO (comparing the linkmap was easier this way). As usual with LLD, though, we need to provide symbol names instead of input section names. For example, instead of ".rodata._ZN15android_webview12_GLOBAL__N_118kChannelIDFilenameE" in orderfile.arm.out, put "_ZN15android_webview12_GLOBAL__N_118kChannelIDFilenameE". The ordered symbols are not put at the start of .rodata (pcc: is it the middle, like with .text?).

Some more background: .rodata takes ~6MiB, and ~3MiB of that is strings whose sections are renamed by LLVM for deduplication, which complicates ordering. I believe a number of the strings among those 3MiB can be converted to character arrays, which would make them orderable directly by name. The .bss is about 1M, and .data is ~100K.

One hack that victoryang@ suggested is to order the symbols in .rodata by size (!). This technique yields real memory savings for libc.so on Android, so we should definitely try it for libchrome/libmonochrome.so!

Another possibility is to order by prefix. For example, all the symbols generated by the tracing macros (unused in production) could be given a single common prefix; then, with a single relink on official builds, we would get them ordered out into separate pages that won't be paged in normally.

I don't have intuition on these, though I would not be surprised if we save 1-2 MiB RAM with these tricks.
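A rough sketch of the two orderfile manipulations described above: turning input section names into bare symbol names, and pushing symbols with a given name prefix to the end of the order. This is only an illustration in Python. The prefix list, the function names, and the "kTraceOnly_" prefix in the test are assumptions, not anything the real orderfile tooling is known to use.

```python
# Assumption: input section names look like "<section prefix><symbol>",
# as in the ".rodata._ZN..." example above. Longer prefixes come first so
# that ".data.rel.ro." is tried before ".data.".
SECTION_PREFIXES = (".data.rel.ro.", ".rodata.", ".data.", ".bss.", ".text.")


def section_to_symbol(entry):
    """Strip a leading section prefix from an orderfile entry, if present."""
    for prefix in SECTION_PREFIXES:
        if entry.startswith(prefix):
            return entry[len(prefix):]
    return entry  # already a bare symbol name


def split_by_prefix(symbols, cold_prefix):
    """Return (hot, cold): symbols without / with cold_prefix, order preserved."""
    hot = [s for s in symbols if not s.startswith(cold_prefix)]
    cold = [s for s in symbols if s.startswith(cold_prefix)]
    return hot, cold
```

Writing `hot + cold` back out, one symbol per line, would give an order in which the prefixed symbols end up adjacent. Entries for the deduplicated string sections do not correspond to a symbol name and would need separate handling.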
Dec 24
I believe that on arm the linker would also place ordered non-code sections in the middle. I wonder whether it's worth considering instrumenting loads so that we can order data in the same way as code. The asan pass already knows how to instrument every memory access (and can even cause all memory accesses to be replaced with runtime calls) so one way to quickly prototype this might be to create a replacement "asan runtime" that would basically do the same thing as the current orderfile generator instrumentation.
Dec 24
yeah, sanitizer instrumentation is something we considered before. AFAIR it was sitting on the shelf in Gold times because we (at least I) thought that rodata ordering won't be trivial. This is what kcc@ wrote a year ago:

> You may implement your own tool using tsan's instrumentation:
> Compile the code with -fsanitize=thread; this will instrument all loads/stores
> with a function call (e.g. __tsan_read4 for a 4-byte read).
> Then implement those functions yourself and link the binary against your run-time library.
> This will give you 1-byte granularity.

I think it'd be cool to get there eventually!
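To make the post-processing side of that idea concrete, here is a minimal sketch, assuming the replacement runtime has already dumped the accessed addresses in order and that a symbol table of (name, start address, size) is available from the linkmap. Both input formats and the function name are hypothetical. It maps each address to its enclosing symbol and emits symbols in first-access order.

```python
import bisect


def order_by_first_access(symbols, accessed_addresses):
    """symbols: list of (name, start_address, size).
    accessed_addresses: iterable of ints, in the order they were recorded.
    Returns symbol names in the order they were first touched."""
    table = sorted(symbols, key=lambda s: s[1])
    starts = [s[1] for s in table]
    seen = set()
    ordered = []
    for addr in accessed_addresses:
        # Rightmost symbol whose start address is <= addr.
        i = bisect.bisect_right(starts, addr) - 1
        if i < 0:
            continue  # address lies below every known symbol
        name, start, size = table[i]
        if addr < start + size and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```

Addresses that fall into gaps between symbols (padding, or data with no symbol) are simply dropped here.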
Dec 24
+1 to instrumentation being a great first step. :) Investigating data layout is also on CrOS' list for 2019. A cursory glance indicated that both ARM and intel have perf events for sampling memory accesses; if we get a decent win from better-than-default data layouts, we'll probably be working with our CWP friends to integrate coarse data profiling somehow. (To be clear: intel's perf events looked pretty promising. I saw ARM had memory-related events in `simpleperf`, but am unsure if they're similar quality to intel's. If they aren't useful enough, the ongoing ETM effort will hopefully close that gap for us.)
Dec 24
> +1 to instrumentation being a great first step. :)

yeah, assuming we start counting from zero, right?

Ordering by size is kinda interesting :)
Dec 24
> yeah, assuming we start counting from zero, right ;)
> Ordering by size is kinda interesting :)

Right, sorry -- I should've prefixed that with "in the world of large hammers". If sorting by size consistently gets us a lot of the way to where instrumentation/sampling would get us, I agree that's pretty great. :)
Dec 25
I'd like to clarify a detail in the libc memory saving. The saving comes from sorting the bss and data sections, not the rodata section. The idea is to keep as many pages clean as possible. That said, I do agree that ordering rodata can be useful if we have instrumentation to tell us which objects to group together. Having instrumentation would also help with reducing dirty pages in the bss and data sections. For example, we can profile which objects are used during init and group them together.
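For the size-based ordering, a sketch of producing the order from (name, size) pairs, e.g. as read from a linkmap. The thread does not say whether libc sorts smallest or largest first, so the direction is left as a parameter here. The function name and input format are assumptions.

```python
def order_by_size(symbols, largest_first=False):
    """symbols: iterable of (name, size) pairs.
    Returns names sorted by size, with ties broken by name so the
    resulting order is deterministic across runs."""
    sign = -1 if largest_first else 1
    ranked = sorted(symbols, key=lambda s: (sign * s[1], s[0]))
    return [name for name, _size in ranked]
```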
Dec 25
This reminds me that we care about clean pages from code much more because the kernel does not like evicting them from memory, while for rodata there is no such problem. On the other hand, small sections in rodata, even if unused, are often mixed with stuff that is used, and hence cannot be evicted either. Fuzzy.
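One way to put a number on the "mixed with stuff that is used" effect is to count how many pages hold at least one byte of a used object under a given layout. The sketch below does that under simplifying assumptions: 4 KiB pages, objects packed back to back from offset 0, alignment padding ignored. The function name and inputs are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_touched(layout, used, page_size=PAGE_SIZE):
    """layout: ordered list of (name, size), packed back to back from offset 0.
    used: set of names that are actually accessed.
    Returns the number of distinct pages containing a byte of a used object."""
    pages = set()
    offset = 0
    for name, size in layout:
        if name in used and size > 0:
            first = offset // page_size
            last = (offset + size - 1) // page_size
            pages.update(range(first, last + 1))
        offset += size
    return len(pages)
```

Comparing the result for the default layout against a candidate order gives a rough estimate of how many pages an ordering could keep out of memory.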
Comment 1 by pasko@chromium.org, Dec 24