New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 629593 link

Starred by 7 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 628505



Sign in to add a comment

NaCl broken on ARM Chromebooks

Project Member Reported by dmazz...@chromium.org, Jul 19 2016

Issue description

I'm reproducing this right now on a Daisy Chromebook running 54.0.2798.0 / platform 8707.0.0

This was originally caught as  bug 628505 , that Chrome OS TTS stopped working on ARM boards, with this blamelist:

https://chromium.googlesource.com/chromium/src/+log/54.0.2787.0..54.0.2790.0?pretty=fuller&n=10000

I suspect that it's this change, since most / all native client seems to be broken:

https://chromium.googlesource.com/native_client/src/native_client.git/+/0949e1bef9d6b25ee44eb69a54e0cc6f8a677375
"Update the nacl_bootstrap GN code to work in ChromeOS chroot builds."

To repro native client breakage, try this url:

https://gonativeclient.appspot.com/static/bullet/fullscreen.html

Alternatively, to test TTS (which uses NaCl also), press Ctrl+Alt+Z to turn on spoken feedback. If you hear sounds but no speech, it's broken. Open chrome://inspect, click on Extensions, and inspect Chrome OS built-in TTS to see the NaCl error message.

 
Labels: M-54
Labels: Proj-GN-Migration
Any update on the fix for this dev blocker?
I will help Bradley on this today. 
Since the problem seems to have started with the fix for GN nacl_bootstrap, I looked at how we where compiling this file with GYP and how we are compiling it with GN.

With GYP: (open this window wide)

[6521/24398] /usr/local/google/home/llozano/proj/gn_problem/chroot/tmp/my-chrome-src/.cros_cache/common/goma+2/gomacc armv7a-cros-linux-gnueabi-gcc -B/usr/local/google/home/llozano/proj/gn_problem/chroot/tmp/my-chrome-src/.cros_cache/chrome-sdk/tarballs/daisy+8613.0.0+target_toolchain/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/binutils-bin/2.25.51-gold -MMD -MF obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib.nacl_bootstrap.o.d -DV8_DEPRECATION_WARNINGS -D_FILE_OFFSET_BITS=64 -D_POSIX_C_SOURCE=199506 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE=1 -D__STDC_LIMIT_MACROS=1 -DCHROMIUM_BUILD -DCR_CLANG_REVISION=274369-1 -DUI_COMPOSITOR_IMAGE_TRANSPORT -DUSE_AURA=1 -DUSE_ASH=1 -DUSE_PANGO=1 -DUSE_CAIRO=1 -DUSE_CRAS=1 -DUSE_OZONE=1 -DUSE_DEFAULT_RENDER_THEME=1 -DUSE_LIBJPEG_TURBO=1 -DIMAGE_LOADER_EXTENSION=1 -DENABLE_WEBRTC=1 -DENABLE_MEDIA_ROUTER=1 -DUSE_PROPRIETARY_CODECS -DENABLE_PEPPER_CDMS -DENABLE_NOTIFICATIONS -DENABLE_WAYLAND_SERVER=1 -DUSE_UDEV -DFIELDTRIAL_TESTING_ENABLED -DENABLE_TASK_MANAGER=1 -DENABLE_EXTENSIONS=1 -DENABLE_PDF=1 -DENABLE_PLUGINS=1 -DENABLE_SESSION_SERVICE=1 -DENABLE_THEMES=1 -DENABLE_PRINTING=1 -DENABLE_BASIC_PRINTING=1 -DENABLE_PRINT_PREVIEW=1 -DENABLE_SPELLCHECK=1 -DENABLE_CAPTIVE_PORTAL_DETECTION=1 -DENABLE_APP_LIST=1 -DENABLE_SUPERVISED_USERS=1 -DENABLE_MDNS=1 -DENABLE_SERVICE_DISCOVERY=1 -DV8_USE_EXTERNAL_STARTUP_DATA -DFULL_SAFE_BROWSING -DSAFE_BROWSING_CSD -DSAFE_BROWSING_DB_LOCAL -DUSE_LIBPCI=1 -DUSE_NSS_CERTS=1 -DOS_CHROMEOS=1 -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -Igen -I../../native_client/src/third_party -I../.. -I../../native_client/src/trusted/service_runtime -pthread -fno-strict-aliasing -Wno-extra -Wno-unused-parameter -Wno-missing-field-initializers -fvisibility=hidden -pipe -fPIC -Wno-unused-local-typedefs -fno-builtin -fno-stack-protector -fno-pic -fno-PIC -fno-pie -fno-PIE -Wno-maybe-uninitialized -march=armv7-a -mtune=generic-armv7-a -mfpu=neon -mfloat-abi=hard -mthumb --sysroot=/usr/local/google/home/llozano/proj/gn_problem/chroot/tmp/my-chrome-src/.cros_cache/chrome-sdk/tarballs/daisy+8613.0.0+sysroot_chromeos-base_chromeos-chrome.tar.xz -O2 -fno-ident -fdata-sections -ffunction-sections -gsplit-dwarf -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -gsplit-dwarf  -c ../../native_client/src/trusted/service_runtime/linux/nacl_bootstrap.c -o obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib.nacl_bootstrap.o

with GN:

[760/28446] /usr/local/google/home/llozano/proj/gn_problem/chroot/tmp/my-chrome-src/.cros_cache/common/goma+2/gomacc armv7a-cros-linux-gnueabi-gcc -B/usr/local/google/home/llozano/proj/gn_problem/chroot/tmp/my-chrome-src/.cros_cache/chrome-sdk/tarballs/daisy+8613.0.0+target_toolchain/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/binutils-bin/2.25.51-gold -MMD -MF nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o.d -DOS_CHROMEOS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_FORTIFY_SOURCE=2 -D_POSIX_C_SOURCE=199506 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE=1 -D__STDC_LIMIT_MACROS=1 -I../.. -fno-builtin -fno-stack-protector -fno-pic -fno-PIC -fno-pie -fno-PIE -march=armv7-a -mfloat-abi=hard -mthumb -mtune=generic-armv7-a --sysroot=../../../.cros_cache/chrome-sdk/tarballs/daisy+8613.0.0+sysroot_chromeos-base_chromeos-chrome.tar.xz -Wall -Werror -Wno-psabi -Wno-unused-local-typedefs -Wno-maybe-uninitialized -Wno-missing-field-initializers -Wno-unused-parameter -O2 -fno-ident -fdata-sections -ffunction-sections -g2  -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -gsplit-dwarf -c ../../native_client/src/trusted/service_runtime/linux/nacl_bootstrap.c -o nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o


there is quite a bit of difference. Several defines like -D_FILE_OFFSET_BITS=64 -DCHROMIUM_BUILD -DCR_CLANG_REVISION=274369-1 -DUI_COMPOSITOR_IMAGE_TRANSPORT -DUSE_AURA=1 -DUSE_ASH=1 -DUSE_PANGO=1 -DUSE_CAIRO=1 -DUSE_CRAS=1 -DUSE_OZONE=1 -DUSE_DEFAULT_RENDER_THEME=1 -DUSE_LIBJPEG_TURBO=1 -DIMAGE_LOADER_EXTENSION=1 -DENABLE_WEBRTC=1 -DENABLE_MEDIA_ROUTER=1 and others.... 

also, compiler options like -fno-strict-aliasing, -fvisibility=hidden ... 

So, too many differences that could cause problems. 

@dpranke, @bradnelson, Is this expected? I assume it is not.

I have put the logs for the GN and GYP in X20 so that Brad can compare.

/google/data/rw/users/ll/llozano/for_bradnelson



Cc: mcgrathr@chromium.org
I would expect the flags to be quite different, but clearly we're picking up the default set of feature defines and probably don't need them.

The most obvious other thing that I noticed was that we're ending up with -gsplit-dwarf, I'm guessing because that's getting set w/ cros_extra_cxx_flags. However, the comments in the //native_client/src/trusted/service_runtime/linux file make it clear that we're trying to go out of our way to avoid split-dwarf here because of possible issues w/ older binutils (though that says clang, not gcc).

The CL in question also caused the build to flip from clang to gcc for this particular file.

Can you try compiling without the flags in cros_extra_cxx_flags that are being duplicated? I think that's how -gsplit-dwarf is getting set (and you should get that where needed already via the other gn args).
(update conflict)

and not sure if this helps at all, but this was the command before the change mentioned in the bug description:
https://chromium.googlesource.com/native_client/src/native_client.git/+/0949e1bef9d6b25ee44eb69a54e0cc6f8a677375

[28683/28738] ../../../../../../../home/llozano/chrome_root/src/third_party/llvm-build/Release+Asserts/bin/clang -MMD -MF nacl_bootstrap_arm/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o.d -DOS_CHROMEOS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_POSIX_C_SOURCE=199506 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE=1 -D__STDC_LIMIT_MACROS=1 -I../../../../../../../home/llozano/chrome_root/src -fno-builtin -fno-stack-protector -fno-pic -fno-PIC -fno-pie -fno-PIE --target=arm-linux-gnueabihf -march=armv7-a -mfloat-abi=hard -mthumb -mtune=generic-armv7-a --sysroot=../../../../../../../build/daisy -Wall -Werror -Wextra -Wno-missing-field-initializers -Wno-unused-parameter -Wno-c++11-narrowing -Wno-covered-switch-default -Wno-deprecated-register -Wno-unneeded-internal-declaration -Wno-inconsistent-missing-override -Wno-shift-negative-value -Wno-undefined-var-template -O2 -fno-ident -fdata-sections -ffunction-sections -g2  -c ../../../../../../../home/llozano/chrome_root/src/native_client/src/trusted/service_runtime/linux/nacl_bootstrap.c -o nacl_bootstrap_arm/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o

At this time, the compiler used was incorrect. There are some suspicious differences like -mfpu=neon and -gsplit-dwarf.
It just seems like a bug to me that they are stuff -gsplit-dwarf into some magic variable instead of just putting use_debug_fission=true in the build args.  Why do they do that?
@mcgrathr - these flags end up stemming from the way the CrOS build system works, which is more like a traditional unix setup where most packages are expecting to have EXTRA_CC_FLAGS, EXTRA_CPP_FLAGS, etc and the CrOS team wants to make sure those flags get propagated into the Chromium build.

This is a problematic approach in some cases because there are some chromium targets that need to very carefully control the flags that are used (as you are obviously well aware). In general we're still working on the best way to balance these requirements. GYP was in some ways less affected by this problem because you could explicitly say cflags!: -gsplit-dwarf and the flag would be removed if it was present, but not error out if it wasn't. In GN, you can't control individual flags that are set in other configs.

In particular in this case llozano@ will try getting rid of the -gsplit-dwarf flag and see if that helps.
ok, I have a fix for this in https://chromium-review.googlesource.com/#/c/363505/2
but it turns out that this only affects the simple chrome workflow. It should not happen in the ebuild workflow (I am verifying but I am 90% sure). So, this cannot be the problem that is affecting NACL.
this is what we get for nacl_bootstrap.c on the ebuild workflow.

[720/29575] armv7a-cros-linux-gnueabi-gcc -B/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/binutils-bin/2.25.51-gold -MMD -MF nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o.d -DOS_CHROMEOS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_FORTIFY_SOURCE=2 -D_POSIX_C_SOURCE=199506 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE=1 -D__STDC_LIMIT_MACROS=1 -I../../../../../chromeos-cache/distfiles/target/chrome-src-internal/src -fno-builtin -fno-stack-protector -fno-pic -fno-PIC -fno-pie -fno-PIE -march=armv7-a -mfloat-abi=hard -mthumb -mtune=generic-armv7-a --sysroot=../../../../../../../build/daisy -Wall -Werror -Wno-psabi -Wno-unused-local-typedefs -Wno-maybe-uninitialized -Wno-missing-field-initializers -Wno-unused-parameter -O2 -fno-ident -fdata-sections -ffunction-sections -g2  -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard  -c ../../../../../chromeos-cache/distfiles/target/chrome-src-internal/src/native_client/src/trusted/service_runtime/linux/nacl_bootstrap.c -o nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o

which does not containd the -gsplit-dwarf
I was doing this with simple chrome.
In the ebuild the compilation after the changes looks like this:

[41/21778] armv7a-cros-linux-gnueabi-gcc -B/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/binutils-bin/2.25.51-gold -MMD -MF nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o.d -DOS_CHROMEOS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D_FORTIFY_SOURCE=2 -D_POSIX_C_SOURCE=199506 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE=1 -D__STDC_LIMIT_MACROS=1 -I../../../../../chromeos-cache/distfiles/target/chrome-src-internal/src -fno-builtin -fno-stack-protector -fno-pic -fno-PIC -fno-pie -fno-PIE -march=armv7-a -mfloat-abi=hard -mthumb -mtune=generic-armv7-a --sysroot=../../../../../../../build/daisy -Wall -Werror -Wno-psabi -Wno-unused-local-typedefs -Wno-maybe-uninitialized -Wno-missing-field-initializers -Wno-unused-parameter -O2 -fno-ident -fdata-sections -ffunction-sections -g2  -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -fauto-profile=/build/daisy/tmp/portage/chromeos-base/chromeos-chrome-54.0.2809.0_rc-r1/work/afdo/chromeos-chrome-amd64-54.0.2808.0_rc-r1.afdo -Wno-error -c ../../../../../chromeos-cache/distfiles/target/chrome-src-internal/src/native_client/src/trusted/service_runtime/linux/nacl_bootstrap.c -o nacl_bootstrap/obj/native_client/src/trusted/service_runtime/linux/nacl_bootstrap_lib/nacl_bootstrap.o

I think the problem maybe caused by the fauto-profile option (not totally sure, found it by accident). It is possible compiler is doing some optimization that nacl_bootstrap does not like. 

dpranke is giving me a patch to avoid passing all of these options to the nacl_bootstrap option. Will try later.

Certainly -fauto-profile or any other kinds of instrumentation options will break nacl_helper_bootstrap.  It's really a losing battle to have any kinds of random options be forced into the tools for that.  That's why it has its own special toolchain definition.  The only "safe" options are the basic ABI stuff like -march, -mtune, -mthumb, -mfloat-abi, etc. and core optimization options like -O*.  Anything else that affects code generation in any way is likely to be a problem.
Cc: ihf@chromium.org shenhan@chromium.org shuchen@chromium.org vapier@chromium.org xiaoche...@chromium.org jcliang@chromium.org hashimoto@chromium.org cmt...@chromium.org chirantan@chromium.org wuyingbing@chromium.org osh...@chromium.org
 Issue 631304  has been merged into this issue.
Project Member

Comment 17 by bugdroid1@chromium.org, Jul 29 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/208cbe37d01f02d13c028a7520fa5b7ef4985c02

commit 208cbe37d01f02d13c028a7520fa5b7ef4985c02
Author: Luis Lozano <llozano@chromium.org>
Date: Wed Jul 27 17:23:47 2016

GN: remove unnecessary -gsplit-dwarf flag.

On GYP, we had to add this flag manually.
On GN, this is unnecessary and may be causing some problems.

BUG= chromium:629593 
TEST=Built with simple chrome workflow.

Change-Id: I34706e93697e44b10e377bd49476f693c303dbf9
Reviewed-on: https://chromium-review.googlesource.com/363505
Commit-Ready: Luis Lozano <llozano@chromium.org>
Tested-by: Luis Lozano <llozano@chromium.org>
Reviewed-by: Steven Bennetts <stevenjb@chromium.org>

[modify] https://crrev.com/208cbe37d01f02d13c028a7520fa5b7ef4985c02/cli/cros/cros_chrome_sdk.py

The patch in #14 only changes flags for nacl_helper_bootstrap.  I do not think that is enough to work around this issue.  I was able to get NaCl working again using the simple chrome workflow and recompiling all of nacl_helper w/out AFDO:

ninja -C out_${SDK_BOARD}/Release -j500 nacl_helper
deploy_chrome --build-dir=out_${SDK_BOARD}/Release --to=${IP}

However, I think this actually recompiles a lot of dependencies, as well.
I am trying to repro now to see all that changed.
Rebuilding nacl_helper without afdo has 9206 ninja steps, as shown in the attached build_nacl_helper_no_afdo.log.
One of these steps made a difference.

To repro:
(1) Enter simple chrome sdk
cros chrome-sdk --board=elm --chroot=~/chromeos/chroot/ --internal

(2) Checkout 54.0.2809.0
cd src
git checkout 54.0.2809.0

(2.a) Optional:  Apply testing patch: https://codereview.chromium.org/2188633004/ <= This doesn't help [*]

(3) Download and extract afdo data using gsutil:
gsutil cp gs://chromeos-prebuilt/afdo-job/canonicals/chromeos-chrome-amd64-54.0.2808.0_rc-r1.afdo.bz2 .
bunzip2 chromeos-chrome-amd64-54.0.2808.0_rc-r1.afdo.bz2

(4) Configure gn to use afdo data:

# use gsutil to download .afdo for $SDK_BOARD
rm -rf out_$SDK_BOARD/Release
gn gen out_$SDK_BOARD/Release --args="$GN_ARGS"
gn args out_$SDK_BOARD/Release
# update args to match attached args_afdo_nogoma.gn

(5) build everything w/ afdo (& no goma)
ninja -C out_${SDK_BOARD}/Release -j 96 chrome chrome_sandbox nacl_helper

(6) deploy to device
deploy_chrome --build-dir=out_${SDK_BOARD}/Release --to=${IP}

 => NaCl fails to load.

## Now, rebuild just nacl_helper w/out afdo:

(7) Configure gn to use afdo data:

# use gsutil to download .afdo for $SDK_BOARD
gn args out_$SDK_BOARD/Release
# update args to match attached args_no_afdo.gn

(8) Rebuild just nacl_helper w/out afdo
ninja -C out_${SDK_BOARD}/Release -j 96 nacl_helper

(9) deploy to device
deploy_chrome --build-dir=out_${SDK_BOARD}/Release --to=${IP}

=> NaCl now loads ok!

[*] Notice the line for nacl_bootstrap.o does NOT have "-fauto-profile=" copmile flag in attached build_full_afdo.log.
args_afdo_nogoma.gn
3.5 KB Download
args_no_afdo.gn
3.3 KB Download
build_full_afdo.log.bz2
931 KB Download
build_nacl_helper_no_afdo.log.bz2
284 KB Download
nacl_helper is a separate binary that runs sandboxed.  It's not special like nacl_helper_bootstrap so in the abstract things like FDO could be fine in nacl_helper.  But perhaps the sandbox context it runs in, or other details of what code is and isn't built into nacl_helper, interfere with the FDO runtime to break things.  I don't know enough off hand about how the FDO runtime works to offer specific advice.
Cc: sshruthi@chromium.org
Owner: dpranke@chromium.org
Okay, the current hypothesis is that

- NaCl on ARM is and always has been broken in GN w/ this set of build flags.

- The reason we didn't see an issue with GYP is that due to the separate NaCl build process in GYP, the same set of flags (AFDO) weren't getting used for the NaCL binaries.

Our options at the moment are:

- flip back to GYP for this canary build

- try to figure out what's wrong and fix it (an unknown effort)

- turn AFDO off (presumably making the build slower)

- try to figure out how to turn AFDO off for just the NaCl targets

The latter is possibly the right thing to do in the long run, and I can work on it,
but it may take a few hours before I even have something I can test. This means that
verifying it for a canary that would ship on Monday might be difficult.

- Flipping back to GYP would create a huge amount of churn and confusion, I would strongly prefer any other option.

- Turning off AFDO for NaCl targets sounds like the right thing to do to me.

If we absolutely need a Canary without a broken NaCl (which affects accessibility and IME) on Monday, I would vote for disabling AFDO entirely for the short term. Unless that has consequences beyond a small performance impact, it sounds like something we can live with for a Canary release or two while we figure this out. It certainly wouldn't be unprecedented for a Canary to go out with performance issues.,

llozano@, thoughts?

+1 to figuring out how to turn AFDO off just for Nacl targets. Out of curiosity, how much slower would we expect (ballpark guess) the build to get if we had to completely turn AFDO off?
Owner: llozano@chromium.org
okay, so, after having looked at this a bit more and refreshed my memory ...

if the problem is in nacl_helper, and not nacl_helper_bootstrap, then life is in a sense more complicated. nacl_helper is a regular linux executable that is compiled w/ the default GN toolchain (the same one that builds chrome, using the same flags and libraries).

Because -fauto-profile is getting set in cros_extra_cflags, that flag is applied unconditionally to all objects, and hence there's no way for us to disable this just for the nacl_helper files. Moreover, of the ~9000 objects needed for nacl_helper, only ~4 of them are nacl_helper-specific, and the rest are shared w/ chrome, so we either have to build twice or turn off afdo either way.

This suggests that either some change has come in that has regressed things either way (and a current GYP build would be broken, too), or maybe the binaries are just different enough between GYP and GN to tickle some bug in AFDO (they should be roughly the same, but might be slightly different and we'd have to look into how exactly). 

So, I'm punting this back to llozano@, who I think is going to try and bisect things for a while and see if he can find a change that breaks things. If not, we might have to turn off AFDO until we can figure out what's wrong.
Labels: Build-Toolchain
my apologies for late reply. I have been working very hard on this problem and I ignored email for a few hours. Been consulting with dpranke and some of his updates come from these discussions.

- Long term solution is to turn off AFDO just for the NaCL targets.
- For Monday, unless I find a solution before, I will have to TEMPORARY disable AFDO for ARM.
- AFDO brings upto %15-20 performance improvement on some benchmarks.
- I am currently bisecting objects files with and without AFDO. It takes time. Hopefully will find the set of objects where we need to disable AFDO. 
 
bisection points to this file:
base/allocator/tcmalloc/debugallocation_shim.o

I am investigating.
If that file is problematic, maybe you can try building with tcmalloc disabled (use_allocator = "none") and see if that goes away?

This might have a perf hit, but I think we've actually disabled tcmalloc by default on everything but linux (and it's disabled on linux for mipsel and arm64 as well).
Cc: primiano@chromium.org
+primiano ...

Or, possibly try setting use_experimental_allocator_shim=false.
I would prefer not to change allocators. I dont want to deal with the performance implications. 
I would prefer just to disable AFDO for this file. According to the comments I saw, Some of the functionality in this file depends on particular stack traces wich can be modified by different optimizations levels. I prefer to to just pass -fno-auto-profile and leave this file alone.

I did a little bit more of investigation and found that disabling AFDO for function do_free_with_callback fixes the problem.
Like this:

#pragma GCC optimize ("no-auto-profile")

inline void do_free_with_callback(void* ptr, void (*invalid_free_fn)(void*))
{ .... }

#pragma GCC reset_options

but I prefer not to touch the code. As I said, some of the comments in there make me nervous about enabling more optimizations for that file. 
Lets just disable AFDO for the file.
@dpranke, can you help me with this? I don't have enough knowledge of GN to do this.


use_experimental_allocator_shim should have no effect on tcmalloc files. Just changes the place where the malloc & friends symbol are introduced, but shouldn't make any change to the way debugallocation_shim.cc is build (There is a bit of naming conflict here, debugallocation_shim is something that has been in tcmalloc since forever and has nothing to do with the allocator shim which is conditioned by use_experimental... which is newer)

Note that in release builds debugallocation_shim.cc == tcmalloc.cc see https://cs.chromium.org/chromium/src/base/allocator/debugallocation_shim.cc?q=debugallocation_shim&sq=package:chromium&dr&l=8&ssfr=1


@primiano, can you comment on #29 and if you agree with just disabling this optimization for that file?
Cc: thakis@chromium.org
re #31. I cannot find any trace of that "-fauto-profile" in the GYP/GN files.
I imagine this comes from some internal CrOS magic build thing? Can't you do something at that level?

I don't have objections to auto-profile flags itself. My only worry is that I don't see any precedent for having #pragma GCC optimize in the codebase. I suggest asking some base owner about this (+thakis). As long as they are fine, I'm okay with that.
From an allocator viewpoint tcmalloc is used only on Linux/CrOS, and I never heard about -fauto-profile in Chrome before so toggling this flag should practically affect only CrOS. 
I did a bit more debugging today and discovered that:

(1) The specific nacl loader failure goes away with (set this in /etc/chrome_dev.conf):
 NACLVERBOSITY=4

(2) In fact it is even fixed with this tiny change:
  (a) NACLVERBOSITY=0
  (b) Change log level from 4->0 of this one line in NaClAppThreadLauncher in native_client/src/trusted/service_runtime/nacl_app_thread.c: 

[1]     NaClLog(0, "NaClAppThreadLauncher: entered\n");

(3) However, NaCl still fails w/ SIGSEGV  if the line [1] is moved below this block:

  NaClXMutexLock(&natp->nap->threads_mu);
  natp->thread_num = NaClAddThreadMu(natp->nap, natp);
  NaClXMutexUnlock(&natp->nap->threads_mu);

So, this suggests that there might be a race condition when launching NaCl app threads that is being exposed by the .afdo optimizations.
As I said in comment #24, because -fauto-profile is set as part of cros_extra_cflags, it is applied to every file in target and there's no way we can skip it for just the one file.

Even if we were to pull out afdo into a separate GN variable (which is probably the right thing to do), the only way to disable flags in GN is on a target-by-target basis, not a file-by-file basis. That means we'd have to disable afdo for all of tcmalloc (which might be okay).

In other words, the simplest thing to do is to add the #pragmas in comment #29, but we can figure out how to refactor things so that either we can toggle off afdo for all of tcmalloc, or toggle off all of cros_extra_cflags for tcmalloc if need be.
Cc: -llozano@chromium.org
I think it is better in this case to modify the calls to the compiler than to modify the source. I may have a hard time adding those pragmas to the source code.

My order of preference for the solution would be:
1) ADD (not skip) -fno-auto-profile at the END of the flags to the compilation of base/allocator/tcmalloc/debugallocation_shim.cc
2) ADD -f-no-auto-profile to the compilation of all of base/allocator/tcmalloc
3) remove all cros_extra_cxxflags for base/allocator/tcmalloc
4) Add the pragmas to the source code.



As I've said, you can't set flags per-file in GN (or in GYP, for that matter), so in order to do #1 in GN, you would have to move debugallocation_shim.cc into a separate target on its own. I don't think we should do that.

In order to do either #2 or #3 we would need to rework the way cros_extra_cflags is implemented, which is probably a good idea (the current implementation, which I did, was dumb). However, this is something of an invasive change so it might not land today if we go this route.

In GN we really don't like randomly tacking on -fenable-x and -fdisable-y to the ends of command lines; it makes the build very hard to reason about. 

The way to do this, which I would be much more in favor of, would be to turn AFDO into a proper build flag and not setting it via cros_extra_cflags at all. We would still need to do that for the entire tcmalloc target.

It seems like the underlying problem here is that AFDO is buggy, and so the end goal should be to fix AFDO so that we don't need a specific workaround for tcmalloc, right? 

Given that, it seems like we should add the pragmas -- which is way easier than any of these other things -- and a TODO for now, and then implement AFDO as a build flag, change the way cros_extra_cflags works, and fix gcc all as follow-on, non urgent tasks. 

I'm not sure why you're reluctant to add the pragma; it's no more difficult than changing the build files. 

All that said, it might be better for thakis@ or brettw@ as an OWNER for //base to weigh in in case there's some real gotcha that would lead us to not want the pragmas.



I am reluctant to add the pragmas to the source because usually source code owner see this as "polluting the code". 
I just asked thakis, and he said he "would prefer not to".

I dont want to call AFDO buggy. Sure, I am sure AFDO has bugs like any other piece of code but what happens in some casees is that it correctly optimizes something and the code breaks because it has wrong assumptions (like order of things in the stack as an example).

I like your idea about making this a separate flag.
But can we do something else as quick workaround? Disable cxx_extra_flags for tcmalloc?
Josafat wants a workaround for this by today since he wants to do a dev release.
I agree with Dirk. If AFDO is buggy on this one file. I don't see why adding pragmas in it is inherently bad. Changing the code in trivial ways is much better than changing the build in non-trivial ways, which I think is the alternative.

I also agree that this kind of thing should use proper build flags.
I am much more comfortable adding pragmas than anything else if we need something "quickly" (today). 

I can work on either or both of the other changes, but am less confident that we can get them in today, given the other breakages I'm also working on.

We can say "AFDO doesn't work" rather than "AFDO is buggy" if that helps ;).
Got agreement from Thakis to add pragmas to the source code under the agreement that this is only a workaround until we provide the solution dpranke is suggesting. I will send a CL for review in a few minutes.
Project Member

Comment 41 by bugdroid1@chromium.org, Aug 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a6f64234d71eeb6cd0571bd963301c8a617889f4

commit a6f64234d71eeb6cd0571bd963301c8a617889f4
Author: llozano <llozano@chromium.org>
Date: Mon Aug 01 22:39:55 2016

Workaround for issue breaking NACL on ChromeOS.

ChromeOS uses AFDO to improve Chrome performance. However, after the GN
migration, NACL broke. We triaged this to using AFDO on
tcmalloc. tcmalloc has some dependencies on specific stack traces and
section allocations that may break once AFDO is used. So, we need to
disable AFDO for the file debugallocation_shim.cc (object bisecting
pointed to it). We need a quick workaround, so we are doing this with a
GCC pragma. The real solution will come in a few days by a change in GN
to avoid passing -fauto-profile to tcmalloc files.

BUG= chromium:629593 
TEST=verified by hand problem does not reproduce after this change.

Review-Url: https://codereview.chromium.org/2198253002
Cr-Commit-Position: refs/heads/master@{#409081}

[modify] https://crrev.com/a6f64234d71eeb6cd0571bd963301c8a617889f4/base/allocator/debugallocation_shim.cc

I've posted a CL that illustrates roughly how I think we might make option #3 work, but the approach has downsides (as discussed in the CL) and it's not obvious to me that it's better than what we're doing now in total (i.e., we might introduce other bugs and weirdnesses with it).

Really the only good answer w/ GN is to push as many flags as possible, if not all of them, into the build files, and not use extra flags for anything.
Are we expecting other CLs for this bug?
There is more work to be done. If NaCl is currently working, we can close this bug, but we should file a follow-up one.
Following up on #33:

The actual failure I see during NaClAppThreadLauncher() comes from:

NaClAppThreadLauncher() ->
  natp->thread_num = NaClAddThreadMu(natp->nap, natp); ->
    DynArraySet(&nap->threads, pos, natp) ->
      new_space = realloc(dap->ptr_array, desired_space * sizeof *new_space); -...-> 
        do_realloc_with_callback() -...->
          do_malloc() ->
            ThreadCache::GetCache()

Here is ThreadCache::GetCache, from third_party/tcmalloc/chromium/src/tcmalloc.c:

inline ThreadCache* ThreadCache::GetCache() {
  ThreadCache* ptr = NULL;
  if (!tsd_inited_) {
    InitModule();
  } else {
    ptr = GetThreadHeap();
  }
  if (ptr == NULL) ptr = CreateCacheIfNecessary();
  return ptr;
}

And here is GetThreadHeap():

inline ThreadCache* ThreadCache::GetThreadHeap() {
#ifdef HAVE_TLS
  // __thread is faster, but only when the kernel supports it
  if (KernelSupportsTLS())
    return threadlocal_heap_;
#endif
  return reinterpret_cast<ThreadCache *>(
      perftools_pthread_getspecific(heap_key_));
}

HAVE_TLS is defined, and KernelSupportsTLS() returns true, so GetThreadHeap() returns __thread variable threadlocal_heap_.

However, when launching the second NaCl App thread, threadlocal_heap_ is initialized to 0xffffffff (not 0x00000000 as might be expected), and therefore a new ThreadCache is not created.

threadlocal_heap_ is defined with a massive comment:

  // If TLS is available, we also store a copy of the per-thread object
  // in a __thread variable since __thread variables are faster to read
  // than pthread_getspecific().  We still need pthread_setspecific()
  // because __thread variables provide no way to run cleanup code when
  // a thread is destroyed.
  // We also give a hint to the compiler to use the "initial exec" TLS
  // model.  This is faster than the default TLS model, at the cost that
  // you cannot dlopen this library.  (To see the difference, look at
  // the CPU use of __tls_get_addr with and without this attribute.)
  // Since we don't really use dlopen in google code -- and using dlopen
  // on a malloc replacement is asking for trouble in any case -- that's
  // a good tradeoff for us.
#ifdef HAVE_TLS
  static __thread ThreadCache* threadlocal_heap_
  // This code links against pyautolib.so, which causes dlopen() on that shared
  // object to fail when -fprofile-generate is used with it. Ideally
  // pyautolib.so should not link against this code. There is a bug filed for
  // that:
  // http://code.google.com/p/chromium/issues/detail?id=124489
  // For now the workaround is to pass in -DPGO_GENERATE when building Chrome
  // for instrumentation (-fprofile-generate).
  // For all non-instrumentation builds, this define will not be set and the
  // performance benefit of "intial-exec" will be achieved.
#if defined(HAVE___ATTRIBUTE__) && !defined(PGO_GENERATE)
   __attribute__ ((tls_model ("initial-exec")))
# endif
   ;
#endif

And... sure enough, removing the "initial-exec" attribute for threadlocal_heap_ also fixes the issue.
Cc: bradnelson@chromium.org
about #45,

thanks for this finding djkurtz. Not sure yet how this finding ties with my finding.. 

Anyway, maybe a possible solution would be for nacl_helper to not use tcmalloc?
it seems tcmalloc has some restrictions that are being broken by nacl_helper?

@bradnelson, can you comment on this? (It seems he had not been copied on this bug before?)



Should I open a different bug?

there is some discussion to be had about this issue but this bug is a P1 and is marked as a release blocker.
You cannot disable tcmalloc just for nacl_helper.

If this analysis is correct (and I have no particular reason to doubt it).

The bug mentioned in that comment became irrelevant quite some time ago (pyauto is gone).

However, it may be that the perf benefit of tls_model("initial-exec") isn't worth it, and we should just remove that altogether.

(Frankly, I'm not even sure tcmalloc is still worth it and we should probably re-confirm that belief.)
re c#47, if NaCL is working with above CLs then I would prefer to mark this as fixed and file new bug(s) to follow up on any remaining work 
Status: Fixed (was: Assigned)
we have a short term workaround. So, moving the discussion to chromium:633719.
And marking this as "Fixed".
Will comment on the new issue.
Sorry for slow reply.

Project Member

Comment 52 by bugdroid1@chromium.org, Aug 11 2016

Project Member

Comment 53 by bugdroid1@chromium.org, Aug 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2af5a28d83c844c1f58aacb59a9b8b1cb234b129

commit 2af5a28d83c844c1f58aacb59a9b8b1cb234b129
Author: mostynb <mostynb@opera.com>
Date: Tue Aug 16 16:57:03 2016

-fno-auto-profile is only available beginning with stock GCC 5

This followup to https://codereview.chromium.org/2198253002
unbreaks stock GCC 4.8/4.9 builds.

(We suspect that the chromeos GCC 4.9 toolchain has a local patch for this feature.)

BUG= 629593 

Review-Url: https://codereview.chromium.org/2244983002
Cr-Commit-Position: refs/heads/master@{#412267}

[modify] https://crrev.com/2af5a28d83c844c1f58aacb59a9b8b1cb234b129/base/allocator/debugallocation_shim.cc

Comment 54 by son...@google.com, Oct 13 2016

Status: Verified (was: Fixed)
Verified on build 8743.65.0

Sign in to add a comment