New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 754462 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Kernels 4.9 and later fails to boot with latest sdk

Project Member Reported by cujomalainey@google.com, Aug 10 2017

Issue description

What steps will reproduce the problem?
(1) repo sync
(2) add kernel-next as remote for target kernel folder and fetch
(3) checkout chromeos-4.12-merge-170809 (for eve)
(3) build packages
(4) build image
(5) build kernel
(6) upload kernel and reboot

What is the expected result?
Device boot

What happens instead?
Device hangs after firmware, using a servo, this is the last set of messages before the hang

early console in extract_kernel
input_data: 0x00000000023ad276
input_len: 0x00000000007e0b3d
output: 0x0000000001000000
output_len: 0x00000000017d1f44
kernel_total_size: 0x0000000001bbe000
booted via startup_64()
Physical KASLR using RDRAND RDTSC...

Key difference is a 4.4 kernel will boot using startup_32()

Disabling KASLR will allow the system to boot and therefore works as a development workaround.

Manifest 9784 introduced the bug 
https://crosland.corp.google.com/log/9783.0.0..9784.0.0

A binutils change was introduced in this manifest (2.25 -> 2.27) and is suspected to be the cause, verification is being attempted
 
Labels: Build-Toolchain
Cc: rahulchaudhry@chromium.org manojgupta@chromium.org

Comment 3 by groeck@chromium.org, Aug 10 2017

kernel-next repo is at https://chromium.googlesource.com/chromiumos/third_party/kernel-next 

git remote add kernel-next https://chromium.googlesource.com/chromiumos/third_party/kernel-next 
git fetch kernel-next
git checkout -b chromeos-4.12-merge-170809 kernel-next/chromeos-4.12-merge-170809

Owner: rahulchaudhry@chromium.org
To test with older binutils (2.25):

1) Apply this patch in chromiumos-overlay *before* creating a chroot: https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/611329/

   $ cd src/third_party/chromiumos-overlay/
   $ git fetch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay refs/changes/29/611329/1 && git cherry-pick FETCH_HEAD


2) Create a new chroot with --bootstrap

   $ cros_sdk --bootstrap


3) Verify the binutils version inside the chroot:

   $ ld -v

   check that the version is 2.25.51, and the git hash starts with 082ed0...


4) Use "--nousepkg" flag on all subsequent setup_board or build_packages commands in the chroot.

   $ ~/trunk/src/scripts/setup_board --nousepkg --board=eve
   $ ~/trunk/src/scripts/build_packages --nousepkg --board=eve

Comment 6 by groeck@chromium.org, Aug 11 2017

Confirmed that v4.12 (without chromeos specific changes) also crashes with ToT chroot if CONFIG_RANDOMIZE_BASE (ie kaslr) is enabled.


Comment 7 by groeck@chromium.org, Aug 11 2017

Wondering, in this context: Is it possible to create a chroot with binutils 2.27 and the assembler changes reverted ?

If you mean assembler from 2.25 and linker from 2.27, that won't be easy to install or test.
It's much simpler to create a chroot with binutils 2.25 to test the theory that this was caused by binutils upgrade.

Comment 9 by groeck@chromium.org, Aug 11 2017

No, I mean assembler from 2.27. My understanding is that it is not the vanilla 2.27 but has been patched. I would like to test with that patch reverted.

You can see all the local patches applied on top of the upstream binutils 2.27 branch here: https://android-review.googlesource.com/#/q/status:merged+project:toolchain/binutils+branch:binutils-2_27-branch

The assembler patches are prefixed with "gas: "

Some of these were necessary to get other chromium-os packages to build, or to get some boards to boot.

To build without any local patches, you can edit the binutils-2.27.ebuild and modify the commit/tree hashes:

CROS_WORKON_COMMIT="36b7a6f647c174cf75bc38bee0dbbcdc4f620a0d"
CROS_WORKON_TREE="de020e1f5a528d2046791e6f5962c13307d8c434"

After this, follow steps 2, 3, and 4 from comment #5.

This will now build the binutils-2.27 without any of our local patches (https://android.googlesource.com/toolchain/binutils/+/36b7a6f647c174cf75bc38bee0dbbcdc4f620a0d).

I wanted to add that many of these local patches are from upstream master branch that were backported to binutils-2.27 because they fix some issue discovered after binutils-2.27 was released. All such patches are also present on later binutils releases from upstream.

I think the best course of action right now would be to do a full build with binutils-2.25, and check if the issue is still there. That would confirm that the issue is definitely related to the binutils upgrade. I don't think we have the confirmation yet, but I may be missing something.

Once we confirm that binutils change is the trigger, we can go deeper and bisect across all commits between binutils-2.25 and binutils-2.27 if needed. Since the issue is definitely related to kASLR (something to do with building a position independent kernel?), it may as well be a linker issue instead of the assembler.

#5: I had created the chroot, and it used binutils 2.25, but now it somehow reverted to 2.27. No idea what I did wrong. Any hints ?

The only thing I can think of is that you probably forgot to pass "--nousepkg" to setup_board or build_packages.

I followed the same instructions locally last evening, and have a new chroot with a chell build.
The host and cross binutils are all at 2.25, as expected.

#13: Yes, you are right. Guess that means I'll have to start from scratch ?

Unfortunately :(

Also, while you're at it, remove the old ".cache" directory as well before creating a new chroot.

verified on gnawty, reverting binutils with latest repo xml will allow the device to boot
on 4.9 kernel
Great. So we now have confirmation that the binutils upgrade triggered the issue.
I'm still waiting for a chell device so I can reproduce/debug it locally.

In the meanwhile, it would be very helpful if you can figure out what changed in the kernel binary built with binutils 2.27 vs. the one built with binutils 2.25. Needless to say, a small reproducer outside of the kernel tree would greatly help in figuring out what's going on.

Problem is confirmed to affect peppy. nokaslr workaround works.

peppy is Haswell with kernel 3.8.11.

panther is also Haswell with kernel 3.8.11, and I have a panther device with me.
I can try to reproduce the issue on it.
Doing panther builds now..

peppy is reported to boot with both v4.12 (vanilla) and chromeos-4.12-merge-170809 if "nokaslr" is provided in the boot command line. Both crash without "nokaslr".


System.map differences a one byte difference

cujomalainey2~ % colordiff -w goodmap.txt badmap.txt
81429c81429
< ffffffff81e650fd t __irf_end
---
> ffffffff81e650fe t __irf_end

Comment 23 by groeck@google.com, Aug 12 2017

May need to do some disassembling around the __irf_end code



Labels: OS-Chrome
Some interesting differences in kaslr.s:

0000000000000260 <choose_random_location>:

Binutils 2.25:
 3af:	66 90                	xchg   %ax,%ax 
 3b1:	66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 	data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)

 58b:	41 ba 01 00 00 00    	mov    $0x1,%r10d
 591:	66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 	data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)

 6d0:	90                   	nop
 6d1:	66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 	data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)

Binutils 2.27:
3af:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
 3b6:	66 2e 0f 1f 84 00 00 00 00 00 	nopw   %cs:0x0(%rax,%rax,1)

 58b:	41 ba 01 00 00 00    	mov    $0x1,%r10d
 591:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
 596:	66 2e 0f 1f 84 00 00 00 00 00 	nopw   %cs:0x0(%rax,%rax,1)

 6d0:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
 6d6:	66 2e 0f 1f 84 00 00 00 00 00 	nopw   %cs:0x0(%rax,%rax,1)

Are instructions at 3af offset in 2.27 equivalent to 2.25 ?
Binutils 2.25 vs 2.27 build files are attached at https://drive.google.com/a/google.com/file/d/0B6pPpjh2SnF_dFJLZ0Y1Vm9Qc3c/view?usp=sharing

Running meld shows only few files are different in the two builds and including kaslr.o .

$ meld 25 27 &

I switched the assembler to 2.25 while keeping rest of 2.27 binutils. The image still failed to boot.
So maybe there is a linker issue here.
#26: FWIW, I tried -O0, but that also failed. I also tried to drop -fomit-frame-pointer (which is chromeos specific), but that did not help either.

Once I end up with a bad image flashed on the device, is it possible to specify nokaslr at boot time to get it to boot?
I take back the comment #26.
Doing the same today resulted in a bootable image. Maybe I forgot to clean the old build last time. 
i.e. the problem might be in assembler.
#28: Most likely yes, but unfortunately I don't know how. I usually reinstall the image from flash after it broke.

Rahul, can you verify that applying this patch fixes the problem.
The patch stops assembler from converting BFD_RELOC_X86_64_GOTPCREL relocation to 
BFD_RELOC_X86_64_REX_GOTPCRELX/BFD_RELOC_X86_64_GOTPCRELX relocations.

diff -Nuar binutils-2.27/gas/config/tc-i386.c binutils-2.27_new/gas/config/tc-i386.c
--- binutils-2.27/gas/config/tc-i386.c
+++ binutils-2.27_new/gas/config/tc-i386.c
@@ -10665,9 +10665,10 @@
 		abort ();
 #if defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF)
 	      if (fixp->fx_tcbit2)
-		fixp->fx_r_type = (fixp->fx_tcbit
-				   ? BFD_RELOC_X86_64_REX_GOTPCRELX
-				   : BFD_RELOC_X86_64_GOTPCRELX);
+		fixp->fx_r_type = BFD_RELOC_X86_64_GOTPCREL;
+//		(fixp->fx_tcbit
+//				   ? BFD_RELOC_X86_64_REX_GOTPCRELX
+//				   : BFD_RELOC_X86_64_GOTPCRELX);
 	      else
 #endif
 		fixp->fx_r_type = BFD_RELOC_X86_64_GOTPCREL;
#31: Great catch. Wondering, though: why is this a problem for us, but not for anyone else ?

I found this in the kernel Makefile:

# To build 64-bit compressed kernel as PIE, we disable relocation
# overflow check to avoid relocation overflow error with a new linker
# command-line option, -z noreloc-overflow.
LDFLAGS += $(shell $(LD) --help 2>&1 | grep -q "\-z noreloc-overflow" \
        && echo "-z noreloc-overflow -pie --no-dynamic-linker")

Can that be related ?

I've verified that after applying the binutils patch posted by Manoj, a newly built kernel 4.12.5 boots fine.
greock@ Regarding #32, I believe that is a separate issue.
#31 Disables an assembler optimization which does not seem to be related to overflow.
I just sent https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/616227/ for the fix.
It would be great if you can verify that this fixes the issue for all impacted kernel versions and boards.
The steps to apply the fix:

$ cd src/third_party/chromiumos-overlay/
$ git fetch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay refs/changes/27/616227/1 && git cherry-pick FETCH_HEAD
$ sudo emerge cross-i686-pc-linux-gnu/binutils cross-x86_64-cros-linux-gnu/binutils
# rebuild desired kernel version, test...

You may need to clear portage cache before rebuilding kernel (e.g. for chell, "rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18").

You may also need to disable ccache in compiler wrappers:

$ ls -l /usr/x86_64-pc-linux-gnu/{i686-pc-linux-gnu,x86_64-cros-linux-gnu}/gcc-bin/4.9.x/sysroot_wrapper.hardened

In both the files, look for "use_ccache = " and change the value from True to False (as root).

Fails on Gnawty 4.9 kernel
cujomalainey@ What is the output of x86_64-cros-linux-gnu-as --version? 
Also, did you disable ccache as in #35. 
GNU assembler (binutils-2.27-r1-85fafaf039799ebc8053bf36ce1c6e6df7adbbec_cos_gg) 2.27.0.20170315

I forgot to disable ccache, I will re-run my test right now
Disabling ccache did the trick
I built a few images to conclusively verify that the configure flag does indeed fix the unbootable kernel 4.12 issue on chell.
Below are the steps I followed. None of the steps used "--nousepkg", so they're relatively quick.
Each image has a note alongside it indicating whether it boots or not.

* Initial prep with a fresh repo checkout:
  - create fresh chroot.
  - ~/trunk/src/scripts/setup_board --board=chell
  - disable ccache in /usr/x86_64-pc-linux-gnu/{i686-pc-linux-gnu,x86_64-cros-linux-gnu}/gcc-bin/4.9.x/sysroot_wrapper.hardened

* Build image for chell (without --nousepkg, so using binary packages):
  - ~/trunk/src/scripts/build_packages --board=chell
  - build_image chell: R62-9845.0.2017_08_15_2058-a1 *** BOOTS FINE ***

* Re-build kernel 3.18 locally from source:
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2113-a1 *** BOOTS FINE ***

* Re-build kernel 3.18 locally after cros_workon (-9999 ebuild):
  - cros_workon-chell start sys-kernel/chromeos-kernel-3_18
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2148-a1 *** BOOTS FINE ***

* Fetch and build kernel 4.12:
  - cd ~/trunk/src/third_party/kernel/v3.18/
  - git remote add kernel-next https://chromium.googlesource.com/chromiumos/third_party/kernel-next
  - git fetch kernel-next
  - git checkout -b chromeos-4.12-merge-170809 kernel-next/chromeos-4.12-merge-170809
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2212-a1 *** DOES NOT BOOT ***

* Re-build binutils and kernel 4.12 after applying cl/616227:
  - cd ~/trunk/src/third_party/chromiumos-overlay/
  - git fetch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay refs/changes/27/616227/1 && git cherry-pick FETCH_HEAD
  - sudo emerge cross-i686-pc-linux-gnu/binutils cross-x86_64-cros-linux-gnu/binutils
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2237-a1 *** BOOTS FINE ***

* Remove "--enable-x86-relax-relocations=no" from binutils ebuild:
  - edit ~/trunk/src/third_party/chromiumos-overlay/sys-devel/binutils/*.ebuild: remove "--enable-x86-relax-relocations=no"
  - sudo emerge cross-i686-pc-linux-gnu/binutils cross-x86_64-cros-linux-gnu/binutils
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2317-a1 *** DOES NOT BOOT ***

* Add "--disable-x86-relax-relocations" to binutils ebuild:
  - edit ~/trunk/src/third_party/chromiumos-overlay/sys-devel/binutils/*.ebuild: add "--disable-x86-relax-relocations"
  - sudo emerge cross-i686-pc-linux-gnu/binutils cross-x86_64-cros-linux-gnu/binutils
  - rm -rf /build/chell/var/cache/portage/sys-kernel/chromeos-kernel-3_18
  - emerge-chell sys-kernel/chromeos-kernel-3_18
  - build_image chell: R62-9845.0.2017_08_15_2346-a1 *** BOOTS FINE ***

From the discussion on the CL here: https://chromium-review.googlesource.com/#/c/chromiumos/overlays/chromiumos-overlay/+/616227/

Mike Frysinger:
  ... if the only thing breaking is the kernel, and we want a short term workaround, seems like adding the -m flag to the CrOS kernel build makes more sense.


Mike is referring to the -mrelax-relocations assembler flag from my earlier comment.

From "as --help":
  -mrelax-relocations=[no|yes]
                          generate relax relocations

If the kernel builds add -mrelax-relocations=no flag to the assembler, we don't need to disable them at configure time, and the rest of the packages can keep getting the benefit of this optimization.

What do you think? Is it easy (and maintainable) to add this assembler flag to all CrOS kernel builds? Can someone from kernel team test and merge that change as a fix for this issue?

Would this be set somewhere in the cros-kernel2.eclass?
If this is causing bugs here, how are we confident it's not breaking other things that we're just not aware of?

Is this a kernel bug or assembler bug?
no, it'd be in the kernel source tree under arch/x86/.  i haven't been following the bug close enough to suggest it be the Makefile and applied to all kernel objects, or the more limited boot/ set.

putting it in the eclass wouldn't help people building directly, and it kind of ignores the connection between "the kernel itself isn't handling this relocation properly".
I have no idea.

vapier@: is there a central place to set compiler and assembler flags that apply to all (amd64) kernel ebuilds, but not to other packages.
> If this is causing bugs here, how are we confident it's not breaking other things that we're just not aware of?

well, you can't prove bugs don't exist, so if we turn it off in the assembler for all objects everywhere, we'll never be able to undo it.

by only turning it off in the kernel, we can let testing continue to happen in R62+.  the fact that all our unit/vm/hwtests are all passing is a fairly strong signal.

> Is this a kernel bug or assembler bug?

that'd be for the kernel & toolchain guys to figure out.  last i looked, the kernel is in the business of processing relocs itself, so it's possible the kernel is getting it wrong in some cases.  it's not a common thing to process ELF relocs directly ...
I disagree with the idea to put a workaround for handling our toolchain into the kernel source. It would effectively mean that we can no longer build and test upstream kernels. This would be a significant new limitation.

I also don't really understand why this is being blamed on the kernel. There are literally Millions of people out there building the kernel as-is without any change. It is hard to believe that no one but us, with our toolchain, would hit a problem with it.

as of why other people are not reproducing this, I can only say that maybe they are not using the same combination of toolchain versions. We are using a pretty old compiler. 

I understand the point from groeck about not wanting to add options to the kernel. 

I think the simplest and safest is to disable this "optimization" in the assembler.
I found this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=19807 that seems to indicate that kernel did not like such optimization at some point. 

Maybe we can retest this when we move to building the kernel with clang/llvm.
People on both teams seem quite busy at this point to try to dig deeper on this. 
i don't find it surprising at all that the kernel is failing with some toolchain combos.  this isn't uncommon at all, and upstream hits it from time to time (as Luis referenced -- a bug that is still open).

i also fully understand how widely deployed the kernel is and the breadth of the users having been a kernel maintainer for years myself.  that doesn't mean all bugs get caught.  after all we're talking about a scenario that only affects kaslr (which i think you can agree is not commonly deployed -- yet?).  also as Luis points out, we're using additional hardening options that are not common in the wider world.

as i said, the fact the kernel does its own relocation processing, and newer binutils have changed the set of relocations that might be generated, it's perfectly reasonable to think that the kernel isn't handling these correctly in some scenarios.

we're talking about mitigations at this point, not the "final answer".  so in that vein, if the only thing failing atm is the kernel, it makes sense imo to deploy the workaround to the kernel until people can further narrow down where things are going wrong.  if it is the kernel, then either the reloc processing gets fixed, or the optimization gets turned off and that gets sent upstream (e.g. adding the relevant -m flag to the right Makefile).  if it's binutils, and it's a problem in upstream binutils (seems unlikely it's a CrOS specific change), then binutils would get fixed, and perhaps the kernel would continue to carry that patch until they no longer care about the old binutils.

no one is pointing fingers here, so please don't waste time "assigning blame".  that isn't useful.
It appears that ld used by cros-kernel2.eclass, which is x86_64-cros-linux-gnu-ld, does not report "-z noreloc-overflow" as supported option, and/or does not support it in the first place. This will cause the kernel Makefile to assume that the option is not supported. As a result, it won't add "-z noreloc-overflow -pie --no-dynamic-linker" to the kernel's linker options.

Interestingly, this is different to "x86_64-pc-linux-gnu-ld", which _does_ support the option.

Both linkers report the same version, but one (cros) appears to be gold, the other isn't. I have no idea if this is related to the problem, but it is at least very suspicious.

In respect to the supposedly open kernel bug, it should be noted that it _has_ been fixed in the upstream kernel. The kernel community doesn't regard bugzilla as mandatory. Many bugs listed in bugzilla as "NEW", including the one referenced here, have long since been fixed.

Also, please note that KASLR is enabled by default in the Linux kernel. This change was made in March this year.

My understanding is kernel uses bfd linker (not gold) and should be passing -fuse-ld=bfd. 

CroS bfd linker supports this option.
 $ x86_64-cros-linux-gnu-ld.bfd --help|grep noreloc
  -z noreloc-overflow         Disable relocation overflow check
#52: I don't immediatley see "-fuse-ld=bfd" in the kernel build script, but it should be easy to test by poisoning the script which sets the noreloc-overflow flag. I'll do that first thing tomorrow.

the way we pass down the linker is more complicated than that.  we pass it the right linker directly via LD, and set the linker path via -B in CC.  see cros-kernel2.eclass for more details.

if you're building the kernel by hand and not doing any of that, then you'll get gold by default, and the kernel is known to be picky with gold.
Adding some logging into arch/x86/boot/compressed/Makefile.

 $(obj)/vmlinux: $(vmlinux-objs-y) FORCE
+	@echo "Final LDFLAGS: $(LDFLAGS)"
 	$(call if_changed,ld)
 	@:

With our build system (emerge-eve chromeos-kernel-4_4):

Final LDFLAGS: -O1 -O2 --as-needed

With make command in goobuntu (binutils 2.24):

Final LDFLAGS: -m elf_x86_64

With toolchain built using buildall (gcc 7.1.0, binutils 2.28):

Final LDFLAGS: -m elf_x86_64 -z noreloc-overflow -pie --no-dynamic-linker

I have absolutely no idea what is going on here.

Further study suggests that LDFLAGS provided on a make command line overrides LDFLAGS set in a Makefile. With

-LDFLAGS := -m elf_$(UTS_MACHINE)
+override LDFLAGS = -m elf_$(UTS_MACHINE)
...
-LDFLAGS += $(shell $(LD) --help 2>&1 | grep -q "\-z noreloc-overflow" \
+override LDFLAGS += $(shell $(LD) --help 2>&1 | grep -q "\-z noreloc-overflow" \

I get

Final LDFLAGS: -m elf_x86_64 -z noreloc-overflow -pie --no-dynamic-linker

A quick test confirms that this solves the problem. Question remains why we pass LDFLAGS to kernel builds, and how this ever worked.

passing LDFLAGS is critical for some flags (like the cortex workaround).  adding override markers makes sense (short of renaming the var entirely).

in general this isn't a problem because flag vars get scopped like LDFLAGS_vmlinux and ldflags-y, and the build takes care of merging $(LDFLAGS) with the target.  i suspect LDFLAGS flies under the radar more than CFLAGS as it's a much less common setting to customize.
#57: Normally there are kernel configuration options to address various chip errata. arm64 alone has around 8 (including at least one handled as compiler flag), and there more than 20 for arm. For example, --fix-cortex-a53-843419 is enabled with CONFIG_ARM64_ERRATUM_843419. Are there any which are not configurable in the kernel ? If so, where are those set for the respective kernel builds ?

Reason for asking is that, based on my understanding, errata should be handled as configurable kernel options, and kernel builds should not depend on LDFLAGS for that purpose.

there's also the -m option, but we currently stuff that via LD, and we can probably drop it at this point as i think we've got new enough toolchains to not need it anymore

offhand, i don't recall other flags we currently rely upon
I don't find any indication in the kernel source or from the web that passing LDFLAGS to linux kernel builds is supported. On the contrary, looking through the kernel source, I would expect that doing it will likely break the build for various architectures.

Given that, I find it highly unlikely that upstream would accept an "override" patch.

I will, however, be happy to submit an RFC patch upstream to get feedback. Doing this would, however, require specific evidence that there is a situation where providing LDFLAGS to the kenrel build command line is in fact critical under some circumstances, as suggested in #57. I browsed through chromiumos-overlay, overlays, and private-overlays, but did not find anything. In addition to "-fix-cortex-a53-843419", for which providing LDFLAGS isn't needed, are there any other specific examples ?
#59: I didn't see your comment before I submitted mine, sorry. Where is '-m' being used/set ?

The following patch fixes the problem.

diff --git a/eclass/cros-kernel2.eclass b/eclass/cros-kernel2.eclass
index 90979187211..f63f087fe73 100644
--- a/eclass/cros-kernel2.eclass
+++ b/eclass/cros-kernel2.eclass
@@ -925,7 +925,6 @@ kmake() {
 
 	cw_emake \
 		ARCH=${kernel_arch} \
-		LDFLAGS="$(raw-ldflags)" \
 		CROSS_COMPILE="${cross}-" \
 		O="$(cros-workon_get_build_dir)" \
 		KCFLAGS="${kcflags}" \

i don't think that's the route we should be taking.  LDFLAGS is the same category as CFLAGS/etc...

seems like you could easily change compressed/Makefile to use ldflags-y instead of LDFLAGS since the expansion of that is the same in scripts/Makefile.lib.
vapier@: CFLAGS is a bad comparison. The kernel uses (and we set) KCFLAGS, not CFLAGS. KCFLAGS is documented in Documentation/kbuild/kbuild.txt. The use or support of LDFLAGS is not documented (though LDFLAGS_MODULE and LDFLAGS_vmlinux are documented and supported).

Specifying LDFLAGS_vmlinux instead of LDFLAGS doesn't work:

/usr/x86_64-pc-linux-gnu/x86_64-cros-linux-gnu/binutils-bin/2.27/ld.bfd.real: warning: arch/x86/boot/compressed/vmlinux.lds contains output sections; did you forget -T?

you misunderstand my comment.  LDFLAGS/CFLAGS/CPPFLAGS/CXXFLAGS are all the same when it comes to build systems -- they're expected knobs for users to turn.  the fact that the kernel build does it differently is on the kernel.  it's much less frequently used than CFLAGS of course which is probably why no one really noticed or does much about it.

have you tried ldflags-y ?
Cc: diand...@chromium.org
OK, looked at this a bunch...

===

1. How we're passing LDFLAGS

The way we're specifying LDFLAGS into the kernel is by doing the equivalent of this (AKA passing on the command line):

   make LDFLAGS=our_ld_flags

That works really differently than this (AKA passing via Environment variables):

   LDFLAGS=our_ld_flags make


Specifically the first tells make to ignore the things that set LDFLAGS in the Makefile (unless they specifically use "override") and just use the ones we pass.  I don't personally know how many packages work properly with one of those two vs. the other, but certainly any Makefile that needs to do any amount of non-trivial amount of linker work (and thus that needs to ensure some non-optional linker flags take effect) will need to choose how it expects to be called, document that choice, and stick with it.

As far as I can tell from the kernel's kbuild.txt, the kernel has chosen that if you want to pass in flags that you do it with "Environment variables" and not by passing command line arguments into make.  Thus, I believe:

	cw_emake \
		ARCH=${kernel_arch} \
		LDFLAGS="$(raw-ldflags)" \
		CROSS_COMPILE="${cross}-" \
		O="$(cros-workon_get_build_dir)" \
		KCFLAGS="${kcflags}" \
		"$@"

...should instead be (only lightly tested):

	ARCH=${kernel_arch} \
		LDFLAGS="$(raw-ldflags)" \
		CROSS_COMPILE="${cross}-" \
		KCFLAGS="${kcflags}" \
		cw_emake \
		O="$(cros-workon_get_build_dir)" \
		"$@"

It appears the "O=" is special and should actually be passed on the command line.  This is documented in kbuild.txt and is extra evidence that the documented way to work with the kernel is to for all other variables is through environment variables (not through the command line).

Mike: do you agree?

---

2. LDFLAGS documented?

So actually, the kernel doesn't document that it really wants you to pass in LDFLAGS.  AKA there's no mention in the "kbuild.txt" file.  The kernel talks about passing in LDFLAGS_MODULE and LDFLAGS_vmlinux, but these are only (I think) used in the final linking step, not for interrim ones.

In my (very tiny and very simple) tests, it _seems_ like it's OK to pass in LDFLAGS as an environment variable even if it's not documented.  Unless there are problems (Guenter, can you confirm?), I guess we can keep it the way it is...

One note is that if you take the above command and _remove_ LDFLAGS="$(raw-ldflags)" then you actually get an error because the LDFLAGS environment variable was already set to something that the kernel didn't like.  The following, however, it does work (lightly tested):

	unset LDFLAGS
	ARCH=${kernel_arch} \
		CROSS_COMPILE="${cross}-" \
		KCFLAGS="${kcflags}" \
		cw_emake \
		O="$(cros-workon_get_build_dir)" \
		"$@"


FWIW, here's an example from my kevin build with LDFLAGS passed as an environment (the "-O1 -O2 --as-needed" comes from the environment).  This was obtained with "V=1":

  /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/binutils-bin/2.25.51/ld  -O1 -O2 --as-needed   -r -o one.o two.o three.o ...

...and he's an example link with LDFLAGS fully unset:

  /usr/x86_64-pc-linux-gnu/aarch64-cros-linux-gnu/binutils-bin/2.25.51/ld     -r -o one.o two.o three.o ...



---

3. Does the kernel actually need LDFLAGS passed in?

This is a hard / weird question.  The kernel is such a massive thing that really it's a special case.  It's really tied to gcc (or something that can emulate gcc's command line arguments directly) since it needs to specify all kinds of custom flags in various places.  The kernel also tends to manage its flags by itself.  For instance instead of expecting you to pass in different warning levels through CFLAGS there's a CONFIG option in the kernel that does it for you.  Instead of expecting you to pass in a linker flag to enable an optimization workaround for a broken ARM CPU, there's a CONFIG option in the kernel that does it for you.  Basically, the kernel manages the flags in a way that is unlike most other software packages.


Specifically, one could ask whether it should have been a kernel config option to turn on "--as-needed" in the linker flags and whether the kernel ought to be aware of such a flag.  I don't know the answer to that, but it makes me wonder.

===

So assuming that Guenter agrees, I could go with either the change to pass things as environment variables or with the change to removing passing LDFLAGS to the kernel (and if we actually end up needing any flags we make a kernel change to do it).
Can you transform LDFLAGS into KCFLAGS by using -Wl,<ldflag>?
This assumes you always invokes the linker through the compiler driver.


also, it seems EXTRA_LDFLAGS was supported and replaced by ldflags-y:
http://lkml.iu.edu/hypermail/linux/kernel/0709.3/2581.html

> also, it seems EXTRA_LDFLAGS was supported and replaced by ldflags-y:

This is for use _within_ kernel Makefiles.  The "kbuild.txt" appears to document the interface to the outside world.
how many versions of the kernel will be affected by this? 
And how much testing is needed?
and is this change needed on previous branches?
(I see from the CL that all versions are affected).
Still have the question about branches where this should be applied.
If it makes a difference, binutils change was only done after latest branch point.
@73 - @75: For reasons that Guenter didn't understand, our old kernels have (amazingly) always worked OK even with the overriding of the LDFLAGS by the build system.  We don't really know why, but presumably the default linker flags are somehow enough in most cases?

Said another way:
* It has always been wrong to pass LDFLAGS (and a few other things) through command line to kernel make.
* We've been doing it wrong for years, but it's never caused a problem.  We don't quite know why.
* For whatever reason with the right set of kernel build config / kernel version / binutils, we had a problem.

So we should fix this to be right.  ...but as far as we know there would be no reason to backport it.
Project Member

Comment 77 by bugdroid1@chromium.org, Aug 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/c1440c3ce8173bad23e2e8d44ae6f49813801c4a

commit c1440c3ce8173bad23e2e8d44ae6f49813801c4a
Author: Douglas Anderson <dianders@chromium.org>
Date: Tue Aug 29 01:21:32 2017

cros-kernel2: Pass most stuff to kernel make as env vars, not cmdline args

In Makefiles, there are two ways to pass variables in from the command
line.  You can either pass them in via environment variables or via
command line arguments.  The two ways of passing in variables causes
different results if the Makefile tries to modify the variables.  For
instance, if you have a Makefile like:

FOODS += and cookies
all:
	@echo "I like $(FOODS)"

And you call it like this:
    FOODS="ice cream" make
...you'll see "I like ice cream and cookies".  ...but if you call:
    make FOODS="ice cream"
...you'll see "I like ice cream".

The fact that the two ways of passing variables behaves so differently
means that any sufficiently complex build system needs to document
when it expects you to use one method and when it expects you to use
the other.  The kernel build system in kbuild.txt [1] explicitly
documents this type of stuff.  Specifically these are listed under
"Environment variables":
* KCFLAGS
* ARCH
* CROSS_COMPILE

...and also the docs talk about "O=" being set via command line.  We
switch all the variables listed as "Environment variables" to be set
via environment variables.

The docs don't explicitly talk about a way to pass in LDFLAGS, perhaps
because the kernel expects to handle all important LDFLAGS itself.
You can pass LDFLAGS explicitly for the final link step of modules
and/or vmlinux, but not general LDFLAGS that are used for creating
temporary object files.  However:

* If you pass LDFLAGS in via command line, it messes up a few use
  cases.  See  crbug.com/754462 .
* If you pass LDFLAGS in via environment variable, we have the ability
  to pass in some custom LDFLAGS without messing up the use cases.

For now, it seems sane to pass LDFLAGS in via environment variable.
The LDFLAGS that we seem to currently specify don't seem terribly
important and don't seem to affect the build one way or the other, but
it doesn't hurt to pass them in.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/kbuild/kbuild.txt

BUG= chromium:754462 
TEST=Much testing across many machines

Change-Id: I90f21fd6b3f87b893a85a8edf9393d16631c3549
Reviewed-on: https://chromium-review.googlesource.com/624174
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>

[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/sys-kernel/chromeos-kernel-3_8/chromeos-kernel-3_8-9999.ebuild
[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/sys-kernel/chromeos-kernel-3_10/chromeos-kernel-3_10-9999.ebuild
[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/sys-kernel/chromeos-kernel-3_18/chromeos-kernel-3_18-9999.ebuild
[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/eclass/cros-kernel2.eclass
[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/sys-kernel/chromeos-kernel-3_14/chromeos-kernel-3_14-9999.ebuild
[modify] https://crrev.com/c1440c3ce8173bad23e2e8d44ae6f49813801c4a/sys-kernel/chromeos-kernel-4_4/chromeos-kernel-4_4-9999.ebuild

Labels: M-62
Status: Fixed (was: Untriaged)
Presuming no backporting to M-61 is needed.  Please yell if someone thinks it is.

Sign in to add a comment