Issue metadata
Sign in to add a comment
|
Increased kernel stack size with clang |
||||||||||||||||||||||||
Issue descriptionAt least for some code stack size is significantly increased with clang. This was reported earlier in the context of using KASAN ( Bug 781317 ), but it also can be seen without KASAN. When building the kernel for betty with clang the following warning is raised: drivers/gpu/drm/radeon/radeon_combios.c:1452:6: error: stack frame size of 2560 bytes in function 'radeon_get_legacy_connector_info_from_table' [-Werror,-Wframe-larger-than=] bool radeon_get_legacy_connector_info_from_table(struct drm_device *dev) The high usage is 'caused' by repeated calls to this function: static struct radeon_i2c_bus_rec combios_setup_i2c_bus(struct radeon_device *rdev, enum radeon_combios_ddc ddc, u32 clk_mask, u32 data_mask) https://elixir.bootlin.com/linux/v4.4.119/source/drivers/gpu/drm/radeon/radeon_combios.c#L1452 The function returns a struct by value, with a size of 80 bytes. The calls are in different branches of a switch-case statement, so the actual overhead of copying the data is only incurred a few times at initialization. It is definitely arguable whether it is a good idea that the function returns such a large data structure by value, but in any case the compiler shouldn't blow up the stack size by sizeof(struct radeon_i2c_bus_rec) for every call. As reference, the frame size of the same function compiled with gcc is only 256 bytes, 10x less! The source file and LLVMs IR are attached.
,
Mar 1 2018
manojgupta@: could you remind me about the option(s) to generate the pre-processed file?
,
Mar 1 2018
You can pass -save-temps to the clang command line and it should dump the .ii and .bc files.
,
Mar 1 2018
The pre-processed file is attached. Command line: x86_64-cros-linux-gnu-clang -B/usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/binutils-bin/2.27.0 -Wp,-MD,drivers/gpu/drm/radeon/.radeon_combios.o.d -nostdinc -isystem /usr/lib64/clang/6.0.0/include -I/mnt/host/source/src/third_party/kernel/v4.4/arch/x86/include -Iarch/x86/include/generated/uapi -Iarch/x86/include/generated -I/mnt/host/source/src/third_party/kernel/v4.4/include -Iinclude -I/mnt/host/source/src/third_party/kernel/v4.4/arch/x86/include/uapi -Iarch/x86/include/generated/uapi -I/mnt/host/source/src/third_party/kernel/v4.4/include/uapi -Iinclude/generated/uapi -include /mnt/host/source/src/third_party/kernel/v4.4/include/linux/kconfig.h -I/mnt/host/source/src/third_party/kernel/v4.4/drivers/gpu/drm/radeon -Idrivers/gpu/drm/radeon -D__KERNEL__ -Qunused-arguments -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -DRETPOLINE -Wno-maybe-uninitialized -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-int-in-bool-context -Oz --param=allow-store-data-races=0 -Wframe-larger-than=2048 -fstack-protector-strong -target x86_64-cros-linux-gnu -gcc-toolchain /usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/binutils-bin -Wno-unused-variable -Wno-format-invalid-specifier -Wno-gnu -Wno-address-of-packed-member -Wno-duplicate-decl-specifier -Wno-tautological-compare -mno-global-merge -no-integrated-as -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -mfentry -DCC_USING_FENTRY -Werror -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fno-stack-check -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Wno-initializer-overrides -Wno-unused-value -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-uninitialized -fprofile-sample-use=/build/betty/tmp/portage/sys-kernel/chromeos-kernel-4_4-9999/work/chromeos-kernel-4_4-R66-10323.30-1519040186.gcov -fdebug-info-for-profiling -I/mnt/host/source/src/third_party/kernel/v4.4/include/drm -Iinclude/drm -I/mnt/host/source/src/third_party/kernel/v4.4/drivers/gpu/drm/amd/include -Idrivers/gpu/drm/amd/include -Wframe-larger-than=4096 -save-temps -Wno-constant-logical-operand -DMODULE -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(radeon_combios)" -D"KBUILD_MODNAME=KBUILD_STR(radeon)" -c -o drivers/gpu/drm/radeon/radeon_combios.o /mnt/host/source/src/third_party/kernel/v4.4/drivers/gpu/drm/radeon/radeon_combios.c
,
Mar 1 2018
A simplified test case:
$ cat test.c
typedef struct
{
int j;
int k;
int l[128];
} T;
T foo();
T bar();
extern int k;
T test()
{
T ret = {};
switch (k) {
case 0:
ret = foo();
break;
case 1:
ret = bar();
break;
default:
break;
}
return ret;
}
$ x86_64-cros-linux-gnu-clang -c test.c -Oz -Wframe-larger-than=2048
test.c:13:3: warning: stack frame size of 2360 bytes in function 'test' [-Wframe-larger-than=]
T test()
^
1 warning generated.
clang is reserving stack space for the calls + the return value. I would have expected that optimizations should have combined these multiple allocas for the calls to foo()/bar() in a single one eventually. But that is not happening.
*** IR Dump After Safe Stack instrumentation pass ***
; Function Attrs: minsize nounwind optsize sspstrong uwtable
define void @test(%struct.T* noalias nocapture sret) local_unnamed_addr #0 {
%2 = alloca %struct.T, align 4
%3 = alloca %struct.T, align 4
%4 = alloca %struct.T, align 4
%.0.sroa_cast = bitcast %struct.T* %2 to i8*
call void @llvm.lifetime.start.p0i8(i64 776, i8* nonnull %.0.sroa_cast)
call void @llvm.memset.p0i8.i64(i8* nonnull %.0.sroa_cast, i8 0, i64 776, i32 4, i1 false)
%5 = load i32, i32* @k, align 4, !tbaa !4
switch i32 %5, label %12 [
i32 0, label %6
i32 1, label %9
]
; <label>:6: ; preds = %1
%7 = bitcast %struct.T* %2 to i8*
call void (%struct.T*, ...) @foo(%struct.T* nonnull sret %3) #3
%8 = bitcast %struct.T* %3 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull %7, i8* nonnull %8, i64 776, i32 4, i1 false), !tbaa.struct !8
br label %12
; <label>:9: ; preds = %1
%10 = bitcast %struct.T* %2 to i8*
call void (%struct.T*, ...) @bar(%struct.T* nonnull sret %4) #3
%11 = bitcast %struct.T* %4 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull %10, i8* nonnull %11, i64 776, i32 4, i1 false), !tbaa.struct !8
br label %12
; <label>:12: ; preds = %1, %9, %6
%13 = bitcast %struct.T* %2 to i8*
%14 = bitcast %struct.T* %0 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %14, i8* nonnull %13, i64 776, i32 4, i1 false), !tbaa.struct !8
call void @llvm.lifetime.end.p0i8(i64 776, i8* nonnull %13)
ret void
}
gbiv@ Are you aware of any stack optimizations or flags to avoid this extra storage.
,
Mar 1 2018
+1 for "Why are we copying like crazy between 3 allocas" I don't know of any flag that would cause or fix this, but I can invent some reasons for why optimizations may/may not be deduping across basic block boundaries. At the very least, we should probably be putting lifetime markers around the calls to %foo and %bar. While this wouldn't solve all of our problems (or get rid of any of the crazy copies), it would presumably keep us from using stack space linear with the number of cases in the switch (!).
,
Mar 2 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7241c3050985477bd0e568e43f91030428ddfed2 commit 7241c3050985477bd0e568e43f91030428ddfed2 Author: Matthias Kaehlcke <mka@chromium.org> Date: Fri Mar 02 06:12:07 2018 CHROMIUM: drm/radeon: Increase threshold for stack size warning With the default threshold of 2048 the following warning is raised when the kernel is built with clang: drivers/gpu/drm/radeon/radeon_combios.c:1452:6: error: stack frame size of 2560 bytes in function 'radeon_get_legacy_connector_info_from_table' [-Werror,-Wframe-larger-than=] bool radeon_get_legacy_connector_info_from_table(struct drm_device *dev) clang is to blame here, with gcc the frame size is only 256 (more details in chromium:817628) This should be definitely be addressed in clang, but since warnings in kernel builds are treated as errors we don't want to be blocked waiting for a fix. As a stopgap increase the warning threshold to 4096 bytes for the affected files. This change can be reverted once clang behaves more civily. BUG= chromium:817628 TEST=build for betty with clang Change-Id: Id7885f37abd9aa7cb5e4183965b4fb5705eacbc5 Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/944610 Reviewed-by: Guenter Roeck <groeck@chromium.org> [modify] https://crrev.com/7241c3050985477bd0e568e43f91030428ddfed2/drivers/gpu/drm/radeon/Makefile
,
Mar 2 2018
My notes so far, purely looking at the stack usage: - Further-reduced test-case: https://godbolt.org/g/HKbPWr - This does not appear to occur in C++ (maybe it's possible if the code was written differently? This case in particular, though, depends on both an aggregate assignment + call happening in the same statement) - This only occurs for `T`s large enough to be RVO'ed; if the ABI tells us to return the struct directly, no issues happen (this can be seen by reducing the size of `T` to 16 bytes or less).
,
Mar 2 2018
Yeah, this only happens for types too big to return directly. This is the pattern I see: x= alloca(T) -- allocas to hold returned value from function calls y= alloca(T) ret= alloca(T) - per call - call foo(x) // store data in x memcpy(ret, x) If call + memcpy sequence can be optimized to store the data directly in the alloca for ret, the unnecessary allocas can be deleted by DCE and the extra stack should not be required.
,
Mar 6 2018
So, it looks like optimizing this away in the general case is undoable.
First, we can't optimize the memcpy into `x` if the address for `x` escapes. Doing so is a violation of at least one ABI, and (more pragmatically), we'd break code like:
struct S { char cs[64]; };
struct S foo(char *y) {
struct S ret = {}; // actually pointing to the hidden return param
char first = ret.cs[0];
++*y;
assert(ret.cs[0] == first);
}
void bar() {
struct S result = foo(&result);
}
The response to this is generally "use `nocapture`", but we don't appear to tag that on `sret` params automatically. This would also be pretty badly incorrect for C++ (see: objects with pointers to themselves), though I'm unsure how bad it would be for C. Without nocapture, could maybe optimize the first set of calls that we see, which would fix the `switch` case, but that wouldn't really help the straightline repro I have.
I'll ask about adding `nocapture` by default, but I still think the complete fix for this involves adding lifetime markers to the temporaries. I was hoping doing so would be straightforward, but it's not looking like that's the case...
,
Mar 6 2018
(er... `foo` in my example was also meant to return `ret`. Oops. :) )
,
Mar 6 2018
Apologies for the spam. Looking at the docs for `nocapture`, all it says is "This indicates that the callee does not make any copies of the pointer that outlive the callee itself". I was really hoping it would be a bit more restrictive by mentioning the use of those copies, but given that it's trivial for me to stash &ret in a global (or put it in an output param, ...), it seems like "`nocapture` `sret` params by default" won't work.
,
Mar 11 2018
With clang's r327192 and r327229 cherrypicked, it looks like radeon_get_legacy_connector_info_from_table eats 192 bytes of stack. As you'd probably expect, sys-kernel/chromeos-kernel-4_4 builds for betty with the workaround in comment 7 reverted. Happy to cherry-pick if we want; as always, there's a risk of bugs in my changes or in exposing existing bugs in code, so it may be good to let the changes sit and see if teams with more aggressive rolling schedules have problems.
,
Mar 11 2018
Thanks George, I can merge it with llvm-next and see how things go.
,
Mar 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/655a49d1a7385c14f9822a39673970a05b41a00c commit 655a49d1a7385c14f9822a39673970a05b41a00c Author: Manoj Gupta <manojgupta@google.com> Date: Tue Mar 13 04:03:06 2018 llvm-next Update to r326829. Update llvm-next to current crosstool version. In addition, add a few cherry picks to fix issues related to Linxu kernel: 1. PLT relocations: r327198. 2. Increased stack usage: r327192 and r327229. Note: Does not affect the current llvm used in Chrome OS. BUG= chromium:817628 BUG= chromium:820140 BUG=chromium:815537 TEST=USE="llvm-next" sudo emerge llvm lld works. Change-Id: I436d578030775b3ffbccb246dbd43394452e67a3 Reviewed-on: https://chromium-review.googlesource.com/957790 Trybot-Ready: Manoj Gupta <manojgupta@chromium.org> Commit-Queue: Manoj Gupta <manojgupta@chromium.org> Tested-by: Manoj Gupta <manojgupta@chromium.org> Reviewed-by: Luis Lozano <llozano@chromium.org> [add] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-devel/llvm/files/cherry/8fdc88794b44e70bdb93c6cf04baf3c1e3251d8b.patch [delete] https://crrev.com/bd99bff64464733038e8b5cdc488fd9f4941aea6/sys-devel/llvm/files/cherry/98079e294f718c14d25ccf30ab2b1938780ffe4d.patch [rename] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-devel/lld/lld-6.0_pre321490-r3.ebuild [delete] https://crrev.com/bd99bff64464733038e8b5cdc488fd9f4941aea6/sys-devel/llvm/files/cherry/30f9051d7d8b6f56c8149fd1bdcc714285f77527.patch [delete] https://crrev.com/bd99bff64464733038e8b5cdc488fd9f4941aea6/sys-devel/llvm/files/cherry/16748767563bb9bcb1e1c3e42c35d44924d464d0.patch [rename] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-devel/llvm/llvm-6.0_pre321490_p20180131-r9.ebuild [add] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-devel/llvm/files/cherry/824eedb9eb4888575924b1ed80c4250dddd5b59b.patch [delete] https://crrev.com/bd99bff64464733038e8b5cdc488fd9f4941aea6/sys-devel/llvm/files/cherry/284236c047631c8b0eabac3ddd3d0c95253f4361.patch [rename] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-libs/compiler-rt/compiler-rt-6.0_pre321490-r5.ebuild [delete] https://crrev.com/bd99bff64464733038e8b5cdc488fd9f4941aea6/sys-devel/llvm/files/cherry/ffaaef68dbab0e872c0e6013836170bb78705a81.patch [add] https://crrev.com/655a49d1a7385c14f9822a39673970a05b41a00c/sys-devel/llvm/files/cherry/c28eb6d02c5cedd40b02aa7c496f13a71763312e.patch
,
Mar 21 2018
Thanks George for the fixes. Closing this bug now.
,
Mar 21 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/125683d9491a0996cda1ca9a2009ecfacf0c9c24 commit 125683d9491a0996cda1ca9a2009ecfacf0c9c24 Author: Manoj Gupta <manojgupta@google.com> Date: Wed Mar 21 22:55:38 2018 Revert "CHROMIUM: drm/radeon: Increase threshold for stack size warning" The stack usage for radeon_get_legacy_connector_info_from_table is now 240 bytes with the current Chrome OS llvm. BUG= chromium:817628 TEST=sys-kernel/chromeos-kernel-4_4 builds on betty. Change-Id: I4db07585590d5c314905e4a5adc76e5d9d059797 Reviewed-on: https://chromium-review.googlesource.com/973941 Commit-Ready: Matthias Kaehlcke <mka@chromium.org> Tested-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Reviewed-by: Matthias Kaehlcke <mka@chromium.org> [modify] https://crrev.com/125683d9491a0996cda1ca9a2009ecfacf0c9c24/drivers/gpu/drm/radeon/Makefile |
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by manojgupta@chromium.org
, Mar 1 2018