New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 633719 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Provide long term solution for problem with nacl_helper, tcmalloc and AFDO.

Project Member Reported by llozano@chromium.org, Aug 2 2016

Issue description

This is a continuation bug for chromium:629593.

Moving discussion about this problem to this issue. We want to close the original bug since we already have a workaround and we want to mark that problem as "Fixed".


 
Cc: mcgrathr@chromium.org mseaborn@chromium.org
Adding Roland + Mark for their reference.

So do I understand the current working theory is that nacl thread creation has an undetected race exposed by AFDO?

Direct link for reference:
https://bugs.chromium.org/p/chromium/issues/detail?id=629593

Labels: Arch-ARM
From comment 45 in  issue 629593 , it sounds like the cause of the problem is that a TLS variable, threadlocal_heap_, "is initialized to 0xffffffff [in a new thread] (not 0x00000000 as might be expected), and therefore a new ThreadCache is not created".

Is this a toolchain bug where AFDO is causing a thread-local variable to be initialised incorrectly, on ARM, maybe triggered by the variable being declared with the attribute tls_model("initial-exec"), which is rarely used?

I'm not familiar with how AFDO works, so I'm not sure how likely it is that it would mess up TLS variables.
Status: Assigned (was: Untriaged)
@#2 - yes, that is what my tracing showed.  Enabling AFDO when building tcmalloc for nacl_helper causes the __thread variable threadlocal_heap_ to behave strangely when it has the attribute tls_model("initial-exec").

I say behaving "strangely" because just straight compiling the existing code causes it to be consistently initialized as 0xffffffff, but re-arranging code within nacl_helper and/or tcmalloc can cause other symptoms (ie crashing even earlier) and/or cause the problem to go away.
Maybe threadlocal_heap_ is being given the same address as some other TLS variable, one that is initialised to -1?

It might be possible to test that by doing:
  printf("%p\n", &threadlocal_heap_);
before threadlocal_heap_ is accessed, and doing the same for other TLS variables.

Can you attach a copy of a nacl_helper executable that exhibits the bug (preferably one with line number debug info)?

Some additional information about this problem:

when failing, the chrome logs shows this:

/var/log/chrome/chrome:[1502:2133:0804/140832:ERROR:nacl_process_host.cc(337)] NaCl process exited with status 139 (0x8b)
/var/log/chrome/chrome:[1502:2133:0804/140832:ERROR:nacl_process_host.cc(337)] NaCl process exited with status 139 (0x8b)

From one of the bugs file against the issue, the error is:

<dt><strong>NativeClient: NaCl module load failed: Nexe crashed during startup</strong></dt>

Also, I am not sure if we have paid enough attention to the comment on thread_cache.h (see https://cs.chromium.org/chromium/src/third_party/tcmalloc/chromium/src/thread_cache.h?rcl=0&l=253) and related bug https://bugs.chromium.org/p/chromium/issues/detail?id=124489.
Is it possible that the nexe files are being dlopen ? or something related to that changed with the GN migration?
I also continued playing with the compiler. 
I found that that with AFDO, the compiler is trying to optimize calls to 
do_free_with_callback by creating an especialized clone and applying constant propagation. 
I don't know more than that. There may be an error in the clone or the change has some side effect that is exposing the problem.

I am not very motivated to continue looking at the compiler generated code. We are moving away from GCC to LLVM and this bug is actually taking time from my effort to move to LLVM. We do have a compiler workaround (do not use AFDO). 
So, we can just use that one.

But, there might be a lurking bug about the dlopen and the TLS model. Should we stop using that TLS model?

about #4
I have put a copy of the bad nacl_helper under /google/data/rw/users/ll/llozano/crosbug_633719


Labels: -Proj-GN-Migration
Project Member

Comment 9 by bugdroid1@chromium.org, Oct 3 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/bffda9f508c0376ab3ae4b9e52b0367c24f3818a

commit bffda9f508c0376ab3ae4b9e52b0367c24f3818a
Author: dpranke <dpranke@chromium.org>
Date: Mon Oct 03 20:26:38 2016

Add an `auto_profile_path` flag to GN to control AFDO.

This change adds a new top-level compiler config to GN, so that we can
control whether AFDO should be enabled or not per-target in CrOS
official builds (which is where we use AFDO).

R=brettw@chromium.org, llozano@chromium.org
BUG=633719

Review-Url: https://codereview.chromium.org/2375783002
Cr-Commit-Position: refs/heads/master@{#422513}

[modify] https://crrev.com/bffda9f508c0376ab3ae4b9e52b0367c24f3818a/base/allocator/BUILD.gn
[modify] https://crrev.com/bffda9f508c0376ab3ae4b9e52b0367c24f3818a/build/config/BUILDCONFIG.gn
[modify] https://crrev.com/bffda9f508c0376ab3ae4b9e52b0367c24f3818a/build/config/compiler/BUILD.gn

Components: Infra
Components: -Infra Infra>Client>ChromeOS
[It appears that a bunch of old cros issues bulk-added the "Infra" component recently, but they should probably be "Infra>Client>ChromeOS".]
Components: -Infra>Client>ChromeOS
Components: Tools>ChromeOS-Toolchain

Sign in to add a comment