New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 594674 link

Starred by 6 users

Issue metadata

Status: Archived
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 611473
issue 612359
issue 639039



Sign in to add a comment

Ozone X11 mash_shell static builds crash in tcmalloc

Project Member Reported by kylec...@chromium.org, Mar 14 2016

Issue description

If mash_shell is built with Ozone X11 in a static build it crashes on startup. The crash happens in multiple processes including ash_sysui and mus. The crashes do not occur with a component build. The crash seems to be related to tcmalloc and freeing memory.

Example stack trace:
../../third_party/tcmalloc/chromium/src/tcmalloc.cc:289] Attempt to free invalid pointer 0x322ecce34000 11081 
Received signal 11 SEGV_MAPERR 000000000039
#0 0x7f3469f37174 base::debug::StackTrace::StackTrace()
#1 0x7f3469f374b0 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7f3468def180 <unknown>
#3 0x7f3469e78490 <unknown>
#4 0x7f3469e7c20f tcmalloc::Log()
#5 0x7f3469e82d04 (anonymous namespace)::InvalidFree()
#6 0x7f3467c8663e SkMallocPixelRef::~SkMallocPixelRef()
#7 0x7f3467c866b1 SkMallocPixelRef::~SkMallocPixelRef()
#8 0x7f3467bf3bba SkBitmap::freePixels()
#9 0x7f3467bfbce0 SkBitmapDevice::~SkBitmapDevice()
#10 0x7f3467c3d3e9 SkCanvas::internalRestore()
<truncated>

Reproduce:
$ gn args out_mus/release
use_goma = true
is_component_build = false
is_debug = false
use_ozone = true
ozone_auto_platforms = false
ozone_platform_x11 = true
target_os = "chromeos"
$ ninja -j1000 -k50 -C out_mus/release chrome chrome_sandbox mash:all
$ out_mus/release/mojo_runner mojo:mash_shell --use-ash-sysui --ash-host-window-bounds=1024x768 --ozone-platform=x11

If the default allocator is used instead of tcmalloc the crashes no longer occur. That can be done by adding the following GN arg:

use_allocator = "none"
 
Labels: -mustash mustash1 Proj-Ozone
Components: MUS>Phase>1
Cc: vapier@chromium.org
Labels: pixelmus
Cc: -vapier@chromium.org
i know nothing about tcmalloc in Chromium ;)
Cc: osh...@chromium.org
Labels: OS-Chrome
oshima@ do you know who would be a good person to investigate this?
Cc: thestig@chromium.org
I followed the instructions at r385521, and I got:

$ out_mus/release/mojo_runner mojo:mash_shell --use-ash-sysui --ash-host-window-bounds=1024x768 --ozone-platform=x11
[WARNING:child_process_host.cc(192)] PATH: /path/to/chrome/src/out_mus/release/Mojo Applications/mash_shell/mash_shell.mojo
[WARNING:child_process_host.cc(192)] PATH: /path/to/chrome/src/out_mus/release/Mojo Applications/tracing/tracing.mojo
[ERROR:native_application_support.cc(47)] Failed to load app library (path: /path/to/chrome/src/out_mus/release/Mojo Applications/mash_shell/mash_shell.mojo)
[ERROR:child_process.cc(86)] Failure to RunNativeApplication()

What am I missing?
Sorry about that, the name of the mojo app mash_shell has been changed to mash_session since the bug was posted. I just compiled/ran at r385468 and confirms the crash still occurs with the following command line.

$ ninja -j1000 -k50 -C out_mus/release chrome chrome_sandbox mash:all
$ out_mus/release/mojo_runner mojo:mash_session --ash-host-window-bounds=1024x768 --ozone-platform=x11

You'll see the console start spewing stack traces as processes crash and get restarted.
I used the same GN arguments posted in #0. The crash appears immediately upon running the command I gave in #9.
You might want to ask the Mojo folks about this. Are you running multiple Mojo applications in the same process? What's happening is tcmalloc got reinitialized and got very confused:

- Edit base/allocator/debugallocation_shim.cc as a quick and hacky way of forcing TCMALLOC_FOR_DEBUGALLOCATION.
- static AllocMap* |alloc_map_| get initialized for the first time as expected.
- memory foo is allocated and tracked with |alloc_map_|.
- |alloc_map_| unexpectedly get initialized a second time!!
- memory foo gets freed... tcmalloc checks against the wrong |alloc_map_| and of course does not find the entry.
Very interesting. I put a printf with the pid right after alloc_map_ is created here:

https://code.google.com/p/chromium/codesearch#chromium/src/third_party/tcmalloc/chromium/src/debugallocation.cc&sq=package:chromium&type=cs&l=377

For every mojo process that gets spawned when running mojo:mash_shell the printf is output twice. So it's happening in every (mojo?) process started by mojo_runner.

Comment 13 by jam@chromium.org, Apr 8 2016

Owner: roc...@chromium.org
Status: Assigned (was: Untriaged)
Ken: can you take a look? 
Blocking: 612359
We're not running multiple mojo applications in the same process, but that doesn't matter. tcmalloc initializes itself lazily as needed, and its heap state lives in static storage.

The child process is spawned using the mojo_runner binary -- which links against base -- and it then uses dlopen to load the mojo app DSO which is also linked against base.

That makes this behavior pretty unsurprising: anything allocated from mojo_runner code is going to use the first heap and anything allocated from DSO code will use the second heap. The runner passes a mojo handle to the app through its ABI, and internally this handle the lifetime of a ref-counted object which was allocated by the runner. When the app closes it, everyone has a bad time.

What IS surprising to me, however, is that we don't run into this in static builds for Windows or even non-Chrome OS Linux.

err... "this handle controls* the lifetime"
Blocking: 611473
Blocking: 639039
Cc: primiano@chromium.org
Turns out this isn't a problem with us intentionally passing objects across the heap boundary. In fact we don't do that and there are no problems with mojo APIs + mojo_runner + DSO + tcmalloc.

The problem here is more subtle and is caused by usage of base::UncheckedMalloc, which happens to be used (almost exclusively) for skia allocations including pixel buffers.

+CC primiano for allocator shim insight

Some context:

We have mojo_runner, which is a small binary used to run Mojo "applications" loaded from DSOs at runtime. Basically we dlopen and call some standard entry point by name ("ServiceMain").

In non-component builds, mojo_runner has it's own statically linked copy of base (and therefore tcmalloc), and any given application DSO has its own statically linked copy of base (and therefore tcmalloc).

This means static mojo_runner code calling malloc/free will use one heap (call it heap A), and DSO code calling malloc/free will use another heap (heap B). This is not strictly by design, it's just a side effect of the linkage and how we load the DSOs.

base::UncheckedMalloc on Linux always goes through allocator::UncheckedAlloc when the allocator shim is enabled.

The problem seems to be that these allocations always allocate on heap A, even when calling directly from DSO code. free() on the other hand does the right thing: if DSO code calls free(), it tries to free from heap B.

We can trivially repro a crash by writing a simple Mojo service:

MojoResult ServiceMain(MojoHandle) {
  void* ptr;
  ignore_result(base::UncheckedMalloc(42, &ptr));  // this doesn't fail
  free(ptr);
  return MOJO_RESULT_OK;
}

and running it in mojo_runner. This will always crash with tcmalloc complaining that we tried to free an invalid pointer.

primiano@, any idea how this is possible? If it helps I can create a minimized repro that doesn't do any mojo stuff.
Reduced proof of concept for this failure mode: https://codereview.chromium.org/2268853002
Cc: torne@chromium.org
TL;DR if mojo_runner embeds chromium code, effectively pulling in duplicated definitions of its symbols, make sure you don't leak those symbols as global, or bad things will happen. 

> In non-component builds, mojo_runner has it's own statically linked copy of base (and therefore tcmalloc), and any given application DSO has its own statically linked copy of base (and therefore tcmalloc).

Correct. This means that when mojo_runner loads a DSO you end up with two copy of base: one in mojo_runner itself and one in the DSO (assuming the DSO is chromium code).
Which can be fine, as long as:
- you don't leak base symbols in the mojo_runner OR you don't require global (in the ELF sense) base symbols in the DSO.
- you never cross the boundaries and never pass objects around (e.g. don't malloc on one side and free on the other, don't pass base C++ objects around the boundaries)

In any other case, lot of subtle bugs will happen, because any shared state (global vars,singletons, etc) will be duplicated for each copy of base.
As long as you just call an entrypoint things should be fine (% this tcmalloc issue).

> This means static mojo_runner code calling malloc/free will use one heap (call it heap A), and DSO code calling malloc/free will use another heap (heap B).
Not really. This really depends on how mojo_runner and the DSO are built.
If the DSO has a malloc call which goes via PLT, the malloc definition in mojo_runner will win, because symbol resolution will start looking for symbols from the root executable. Effectively, I believe this (symbols of mojo_runner winning over the DSO) will happen almost always when you call exported functions (like malloc) unless you build the DSO with -Bsymbolic.
But I'd advocate against that, as it brings its own problems when your DSO calls into third party libraries.

Anyways, back to this specific problem.

>  base::UncheckedMalloc on Linux always goes through allocator::UncheckedAlloc when the allocator shim is enabled.
Hmm, assuming you are following the default chromium config (use_allocator=tcmalloc and use_experimental_allocator_shim=true) both malloc and UncheckedAlloc should end up in the same place.
The code paths are:
when code calls malloc:
 - malloc is defined in base/allocator/allocator_shim_override_libc_symbols.h and aliases to ShimMalloc.
 - ShimMalloc calls into TCMalloc in base/allocator/allocator_shim_default_dispatch_to_tcmalloc.cc, which ultimately calls tc_malloc (From third_party/tcmallolc)

when code calls base::UncheckedAlloc:
 - base::UncheckedAlloc calls into allocator::UncheckedAlloc in base/allocator/allocator_shim.cc as you say
 - allocator::UncheckedAlloc calls TCMalloc in base/allocator/allocator_shim_default_dispatch_to_tcmalloc.cc, which ultimately calls tc_malloc (From third_party/tcmallolc)

So both sites should bring you to the same tc_malloc call.
Now I suspect at this point that what is happening is that one call (either free or UncheckedMalloc) is resolved internally in the DSO and the other one goes through PLT, where the definition in mojo_runner wins.

A way to check this would be adding the following statements in mojo_runner and in the DSO and check the results:
printf("addr of malloc: %p\n", (void*)&malloc);
printf("addr of free: %p\n", (void*)&free);
printf("addr of tc_malloc: %p\n", (void*)&tc_malloc);
printf("addr of UncheckedMalloc: %p\n", (void*)&base::UncheckedMalloc);

if one of them prints the same address on both sides something is wrong.

IMHO the best approach here would be for mojo_runner to not "leak" any symbol.
In practice the best way to achieve this is a linker versioning script ( -Wl,--version-script=) which makes all the symbols of mojo_runner non-global (local: * IIRC)

----------------------
EDIT after reading #20
----------------------
First of all thanks for the repro, this definitely helps speeding up things.
I added the prints and here's the result:

MAIN malloc: 0x41f3f0
MAIN tc_malloc: 0x6695b0
MAIN free: 0x41f760
MAIN UncheckedAlloc: 0x472a50

DSO malloc: 0x41f3f0
DSO tc_malloc: 0x7f2a7aee9d90
DSO free: 0x41f760
DSO UncheckedMalloc: 0x7f2a7ad53850

As I was suspecting what's happening is that when the DSO calls malloc/free it really uses heap A, the heap from the main runner.
Even without the UncheckedMalloc problem this is already bad, because means that you are violating ODR.
Sadly, this is not the case when calling UncheckedMalloc, which goes though the heap B.
In summary, the problem here is that malloc/free go via PLT and get the definition in the mojo_runner, while UncheckedAlloc is a straight jump.

Proof:
$ gdb out/rel/libhaxlib.so
$ disassemble Hax
...
0x0000000000035cf1 <+177>:   callq  0x1a9850 <base::UncheckedMalloc(unsigned long, void**)>
...
0x0000000000035d81 <+321>:   callq  0x30570 <free@plt>

note how UncheckedMalloc is called directly, while free goes via PLT.

So, even if both libhaxlib.so and hax define malloc as a global symbol (proof: 
$ readelf -W --dyn-syms out/rel/libhaxlib.so | grep -w malloc
   459: 00000000000362b0   113 FUNC    GLOBAL DEFAULT   12 malloc

$ readelf -W --dyn-syms out/rel/hax | grep -w malloc
   703: 000000000041f3f0   113 FUNC    GLOBAL DEFAULT   13 malloc
)
the definition in hax (the exe) always wins when you call it via PLT because of symbol resolution rules.


I tried adding a linker script as suggested above myself, however I'm still missing something.
https://codereview.chromium.org/2269013002

For some reason the linker script "almost" works, but few symbols are still exported, causing the entire exercise to fail.
The problem I am facing now is the following:

$ nm -C -D --defined-only out/rel/hax 
00000000004055c0 T calloc
0000000000405510 T free
0000000000405530 T malloc
00000000004057a0 T posix_memalign
0000000000405660 T realloc

I am struggling to understand why local:* doesn't hide those symbols. More importantly, those are only some of the symbols defined in allocator_shim_override_libc_symbols.h.
So I cannot quite understand why, for instance, valloc and pvalloc are hidden, but not posix_memalign.
If you manage to solve this you have solved your problem. +torne as linker ninja.

We looked at this with torne@:
The fact that malloc/calloc & few other symbols remain defined seems to be some gold (the linker) bug. I'll file a separate bug for that.
In fact by setting use_gold=false in GN, the linker script actually does what expected (no symbols leaked from hax binary).

There is a second problem though, this is still not enough to solve your original problem. I didn't think to this in my #21 sorry.
Even if you prevent the main runner binary to leak any symbol, the symbol resolution now will pick malloc/free from glibc and not from the dlopene-ed library.
So now you are in the situation where UncheckedMalloc still uses its own internal tcmalloc implementation but when you call malloc/free you get the glibc implementation, and you get a similar failure from glibc's free (you are asking glibc to free a pointer allocated via the internal tcmalloc).
This is because when resolving the free@plt jump, the symbol resolution starts from the root executable (which this time doesn't leak any "free" symbol) and then follows the dependent .sos with a breadth-first search, following the DT_NEEDED order.
So it will first look for "free" in hax, fail to find it, and then pick "free" from glibc.so. libhaxlib.so is offering a definition of "free", but since you are dlopening there is no way to win that resolution against libc.so which will come always first in the linker resolution.

At this point I believe the only option left is building the library that you dlsym with use_allocator=none.
In essence there is no way that comes to my mind for tcmalloc to override the malloc symbol in a way that works in a dlopen()-ed library (this is true regardless of the value of use_experimental_allocator_shim). the allocator overriding relies on being able to interpose malloc/free & friend symbols, which doesn't work with dlopen-ed libraries (unless you manage to not get any dependency on glibc in your main executable, which is quite tough).

The other option would be building the library that you dlsym with -Bsymbolic, effectively ensuring that malloc/free calls in the DSO are not resolved via PLT jumps. But that IMHO just moves the problem away: as soon as you hit a malloc/free call in a third_party library (e.g., libcups.so) which does a PLT jump, they will pick the symbol from the main runner.
Gold linker bug: https://sourceware.org/bugzilla/show_bug.cgi?id=20507

I had another chat with torne@ and he explained me that in general, this is a common problem with dynamically loaded plugins. The way this is solved in general when you want to use a custom allocator with DSOs (i.e. in a non-monolithic build), is to have malloc defined only in the root executable and let the DSO depend and use that one.

In chrome things are more complicated, as usual. The TL;DR is that all the allocator logic was never designed to work in a scenario where chrome is NOT the root executable.
In fact on Android, where chrome is actually a dlopen-ed library, we do a completely different set of linker incantations (Wl,wrap and friends).

I'm afraid that, as things are today, there is no easy answer that comes to my mind other than setting use_allocator=none, if you want to dynamically load chrome as a DSO.
Happy to chat more over IM/VC.
Another option would be forcing base::UncheckedMalloc to go via PLT. This would probably solve this specific problem. The thing that makes me a bit worried though, is that even if it works it would be a terrible hack.
It would mean that this mojo + DSO + tcmalloc combination would violate ODR and would work only by means of "violating ODR in a smart way" (more practically, you would end up having tcmalloc always defined in mojo_runner and the DSO, and always picking the one in the mojo_runner).
I'd be better not violating ODR in the first place.
Cc: jonr...@chromium.org
Cc: sky@chromium.org
Thanks for the thorough responses!

Building without tcmalloc is not a good option for us unfortunately.

Also, I got a little tripped up on #22: you're saying that DSO code referencing malloc/free will always fall back on either the main binary (if it exports them) or glibc (if not)? So there's no way for the DSO to bring and use its own private malloc et al?

sky@ had one suggestion: would it be feasible to manually remove malloc et al entries from the DSOs after linking, as an extra build step? That would avoid ODR without requiring us to introduce a special snowflake //base.
> Also, I got a little tripped up on #22: you're saying that DSO code referencing malloc/free will always fall back on either the main binary (if it exports them) or glibc (if not)? 

Correct, at least in the general case where you want to just use the "malloc" symbol.
The problem is that the DSO will:
 - expose a global visibilty=default "malloc" symbol
 - have calls to malloc@plt in various sites
but this is not enough to guarantee that those malloc@plt will be linked to the above symbol.
In fact they will be linked to the first linker unit that exposes a malloc-symbol, following a breadth first search which starts from the executable, and follows the other .sos in the DT_NEEDED order (the case of dlopen is even worse comparing to a standard DT_NEEDED dynamic library dependency, as you are essentially always last in the search chain).

In other words, I believe your only chances to get malloc@plt linked to your malloc definition inside your DSO is if:
 - the main executable does NOT define the malloc symbol
 - none of the .sos that the main executable links define malloc

That's why the general safe rule is: only the root executable should define the malloc symbols.
(This is not strictly required, in fact I suspect that everything works here if you do a component build. This is because in that case malloc is defined by libbase.so and that one is usually one of the first DT_NEEDED libs required by the main executable.)


> So there's no way for the DSO to bring and use its own private malloc et al?
Generally not (generally == not just because chrome is special, it's a general problem with on bsd systems)
 unless you do some super weird tricks like we did on Android.
See https://chromium.googlesource.com/chromium/src/+/master/base/allocator/README.md especially the section "On Android: load-time symbol interposition (unlike the Linux/CrOS case) is not possible. ..."

The problem is that wl,wrap trick works only if the DSO is hermetic and you don't swap objects with external .sos, which is true on Android but not on Linux, where we depend on quite some system .so-s like libcups etc).
Note that on Android we do NOT use tcmalloc, which is why that trick is safe, because at the end both hooked and non hooked calls end up in the same place. Doing that trick to use tcmalloc in a DSO would be IMHO quite dangerous.
> sky@ had one suggestion: would it be feasible to manually remove malloc et al entries from the DSOs after linking, as an extra build step? That would avoid ODR without requiring us to introduce a special snowflake //base.

Let me se if I get this right: are you proposing to remove malloc & co from the mojo_runner executable? If you do that you end up with the problem that you will now pick malloc from glibc instead (unless you can make mojo_runner and none of its dependency to not depend on libc, which is imho extremely hard).

The easy and clean way to get that is just use a linker script like I did in https://codereview.chromium.org/2269013002 (% that gold bug mentioned in #23). But that keeps you in the state where malloc calls made from the DSO are resolved against glibc (But not the ones for UncheckedMalloc). Essentially you'd end up in a situation where you think you are using tcmalloc but you are actually not doing that (unless you call UncheckedMalloc).
Also let me clarify the "That would avoid ODR without requiring us to introduce a special snowflake //base"
The problem here is not requiring a snowflake //base.
The problem is that if you want to load chrome as a DSO AND use tcmalloc the supported way of doing that is:
- build mojo_runner WITH tcmalloc, so the symbols are exposed by the root executable and win in loader symbol lookup.
- build the DSO WITHOUT tcmalloc, to avoid ODR.
- fix the code inside of chrome (mostly the allocator shim) that assumes that everything is either built with or without tcmalloc.

Unfortunately, there is no easy way to build some targets with tcmalloc and some targets without. use_allocator is a global gn flag and the only way of doing that is generating the build files and building twice.

Comment 30 by sky@chromium.org, Aug 23 2016

> Let me se if I get this right: are you proposing to remove malloc & co from the mojo_runner executable?

Now that I read all the gory details (thanks for the excellent writeup!) I think my suggestion is if there is a way to post-process the DSO such that UncheckedAlloc ends up using PLT lookup so that it ends up in the main mojo_runner and everyone uses the same alloc? This way we don't need to build parts of chrome twice, once with the shim defined and once without it defined.
Re #29, right, I think we would ideally do that: build mojo_runner with tcmalloc and our DSOs without. But given that this is hard because of the way the build's set up, could we instead write a tool to hack up our DSOs after linking, such that tcmalloc definitions are removed from the DSOs (not from mojo_runner) and DSO code calling tcmalloc will use PLT lookup instead?
Cc: sadrul@chromium.org
Any idea why this is not a more common issue? I mean, in  issue 639039 , we get these crashes only when running mash_browser_tests. But when running chrome --mash normally, we do not see these crashes. Is there a good explanation for that?
Re #32, my guess is that when you run chrome --mash you're using a component build? This issue only affects non-component builds.
So, one thing we can do right now is making UncheckedMalloc always jump via PLT. I would do that as a default thing (it's already the case for malloc, so perf-wise I don't think that is going to change anything in a measurable way for UncheckedMalloc) and won't add special snowflakes there.
Very likely this will solve the crash, because now malloc free and uncheckdmalloc will consistently do PLT jumps, which is good.
On the bad side, though, this will not solve the ODR issue. If we fix UncheckedMalloc, what will happen is that  malloc, free (and now also UncheckedAlloc) calls from the DSO will be resolved consistently against the exported mojo_runner symbols, even if you have a definition in the DSO (here is the ODR violation).

How can this can cause problems? The risk of having that double definition of malloc/free/unchecked-alloc is if:
 - for some reason the compiler or the linker LTO decides to inline one of the calls that we expect to go through PLT. Not sure if and under which conditions this can happen.
 - more realistically, if somebody in future refactors the allocator code and breaks the assumption that all entry-points to allocator must have a PLT jump.
If such a thing happens, â„¢his will regress and crash again.

there is no easy way that comes to my ind to prevent ODR violation other than building the DSO without the allocator code in the first place. If you post-process the DSO you can at most hide the symbol definitions and make them non-public. But that is not the problem here.
The thing that you do NOT want to happen is the DSO to have allocator code, to guarantee that neither the compiler/linker nor a distracted future developer can add internal dependencies from the chromium code in the DSO to the allocator definition in the DSO.
I don't think there is any practical way to strip through some post-processing after the build. Once something gets inlined you practically lose track of it.

I can help with making UncheckedMalloc consistently PLT-ish, but no idea how to prevent the ODR issue.
So, concretely making UncheckedMalloc PLT-ish is as easy as this 2-line change:

https://codereview.chromium.org/2279473002/diff/1/base/allocator/allocator_shim.h

I confirm this solves the crash in your hax toy example (See that CL).
The problem here is that while writing this, I just realized that, as I was worrying about, ODR bites you one second later.
Now that UncheckedMalloc is solved, we have the same identical problem with SetCallNewHandlerOnMallocFailure.

The root cause of the problem is the same: SetCallNewHandlerOnMallocFailure is not PLT-ish, so when called within the DSO hits the internal code that changes the state of the tcmalloc implementation internal to the DSO.
But since malloc/UncheckedMalloc are now resolved against mojo_runner, that check never happens there. So, in a very quiet way, you end up invalidating the suicide-on-oom feature of SetCallNewHandlerOnMallocFailure. Proof: 
https://codereview.chromium.org/2279473002/diff/1/hax/haxlib.cc

Now, of course we could solve this other problem by making also SetCallNewHandlerOnMallocFailure jump via PLT. But as you can imagine, the thing here that makes me worry is: is there another case that would be silently bitten by this ODR violation that I'm not thinking about? In other words, I feel this approach is extremely fragile.


A possible alternative solution
-------------------------------
Now, while doing all this, I think I might just have found an alternative way to deal with this. It's a bit of black magic but less scary than having to audit all the possible entry points in the allocator code that we need to trampoline.

See https://codereview.chromium.org/2270923003

Can you give a shoot at using dlopen(RTLD_LAZY |  RTLD_GLOBAL |  RTLD_DEEPBIND) ?
The manpage for RTLD_DEEPBIND says:
"""Place the lookup scope of the symbols in this library ahead of the global scope. This means that a self-contained library will use its own symbols in preference to global symbols with the same name contained in libraries that have already been loaded. This flag is not specified in POSIX.1-2001."""

If I read that correctly, this means that: if both the mojo_runner and the DSO define malloc, if you load the DSO with DEEPBIND, the PLT jumps of the code inside the DSO will be resolved against the malloc implementation provided by the DSO, and not the mojo_runner.
Which means that now the chrome code in the DSO:
 -if does a straight (non PLT) jump (as it is the case for UncheckedMalloc today) will go directly to the allocator implementation inside the DSO
 -if does a PLT jump (as it is the case for malloc today) will be bound to the symbol provided by the DSO.

I gave it a shoot and it seems to work fine (without having to change anything to the declaration of UncheckedMalloc).
The only open question I have right now is: what is the semantic of GLOBAL + DEEPBIND if the chrome code calls a third party library and that third party library calls malloc? Will the malloc dependency in the third party .so be resolved against the DSO or mojo_runner? Have a try.

Intel seems to use DEEPBIND for a very similar purpose:
From [1]
"""When one creates a shared library on Linux* that links in the Intel runtime libraries statically (for example, using the -static-intel option), the expectation is that the application will run the optimized version of functions from the Intel-provided libraries, which are statically linked to the shared library. For example, it is expected that calls to math functions like cos() resolve to libimf, the Intel-provided math library which contains optimized math functions. On Linux, the default behavior is to resolve the symbol to the GNU libm version of these routines.
"""
If you replace:
 s/the Intel runtime libraries/the Chrome allocator code/
 s/cos()/malloc()/
 s/libm/glibc/
seems like you get back to the same problem of this bugf.

I've never used DEEPBIND before and feels like black magic. But maybe is the kind of black magic worth investigating for this problem.
One thing that I'm quite confident is that it doesn't work on Android. So if it works will be a Linux only thing. Maybe it's fine?

[1] https://software.intel.com/en-us/articles/ensuring-shared-library-uses-intel-math-functions

Project Member

Comment 36 by bugdroid1@chromium.org, Aug 26 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/596a0dd73a3b8f8eaacb58526540107efcba5a57

commit 596a0dd73a3b8f8eaacb58526540107efcba5a57
Author: rockot <rockot@chromium.org>
Date: Fri Aug 26 00:57:51 2016

Enable loading native libraries with RTLD_DEEPBIND

Using RTLD_DEEPBIND by default is problematic (see comments
in native_library_posix.cc), but in the case of Mojo services
where the host binary is lightweight but otherwise may export some
conflicting symbols (e.g. tcmalloc), it is desirable.

This CL adds a NativeLibraryOptions structure and a new
LoadNativeLibraryWithOptions() call. The only supported
option (|prefer_own_symbols|) only affects behavior on
non-OSX, non-Android, POSIX systems.

The default behavior of LoadNativeLibrary is unchanged.

Also adds a unit test for general native library loading since
there wasn't one before.

BUG= 594674 

Review-Url: https://codereview.chromium.org/2277863002
Cr-Commit-Position: refs/heads/master@{#414605}

[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/BUILD.gn
[add] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library.cc
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library.h
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library_ios.mm
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library_mac.mm
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library_posix.cc
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library_unittest.cc
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/native_library_win.cc
[modify] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/test/BUILD.gn
[add] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/test/native_library_test_utils.cc
[add] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/test/native_library_test_utils.h
[add] https://crrev.com/596a0dd73a3b8f8eaacb58526540107efcba5a57/base/test/test_shared_library.cc

Project Member

Comment 37 by bugdroid1@chromium.org, Aug 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8

commit e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8
Author: sky <sky@chromium.org>
Date: Tue Aug 30 17:02:27 2016

Changes around how browser_tests are launched for mash

This does roughly the same thing as we do for chrome --mash. In
particular it bundles the all the mash apps so that we aren't dll
opening them as that causes problems.

BUG= 594674 
TEST=covered by tests.
TBR=erg@chromium.org

Review-Url: https://codereview.chromium.org/2295433003
Cr-Commit-Position: refs/heads/master@{#415329}

[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/app/mash/BUILD.gn
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/app/mash/mash_runner.cc
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/BUILD.gn
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/DEPS
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/browser_tests_main.cc
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/browser_tests_main_chromeos.cc
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/mash_browser_tests_main.cc
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/mash_browser_tests_manifest.json
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/mojo_test_connector.cc
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/chrome/test/base/mojo_test_connector.h
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/mash/package/BUILD.gn
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/mash/package/DEPS
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/mash/package/mash_packaged_service.cc
[add] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/mash/package/mash_packaged_service.h
[modify] https://crrev.com/e7cc53efdf3b5981c1a01bc1f1dc1dbf658108b8/services/ui/service.cc

Cc: roc...@chromium.org
Owner: sky@chromium.org
Status: Fixed (was: Assigned)
Calling this fixed (thanks Scott).

I'll file a separate bug for the allocator shim + mojo_runner issue which we will need to solve eventually.
Cc: wfh@chromium.org
 Issue 639039  has been merged into this issue.
Labels: VerifyIn-55
 issue 642879  is the one mentioned in comment #38.

Comment 42 by dchan@google.com, Nov 19 2016

Labels: VerifyIn-56

Comment 43 by dchan@google.com, Jan 21 2017

Labels: VerifyIn-57

Comment 44 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 45 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 46 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 48 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
Components: Internals>Services>WindowService
Components: -MUS>Phase>1

Sign in to add a comment