While looking at issue chrome-os-partner:56243 (ChromeOS/ARM slowness), I found that malloc is sometimes a hotspot, and that a toolchain bug in the handling of the TLS variable threadlocal_heap_ used in tcmalloc makes the performance worse. That bug is filed as chromium:650137.
I also found an additional optimization opportunity: changing the TLS model for that variable from initial-exec to local-exec when the code is linked into the executable rather than a shared library. Below is the code for reading the TLS variable on ARM under each model:
(initial-exec)
36: 4bc3 ldr r3, [pc, #780] ; (344 <_ZN12_GLOBAL__N_19do_mallocEj+0x344>)
38: ee1d 2f70 mrc 15, 0, r2, cr13, cr0, {3}
3c: 447b add r3, pc
3e: 681b ldr r3, [r3, #0]
40: 58d6 ldr r6, [r2, r3]
(local-exec)
36: ee1d 3f70 mrc 15, 0, r3, cr13, cr0, {3}
3a: 4ac1 ldr r2, [pc, #772] ; (340 <_ZN12_GLOBAL__N_19do_mallocEj+0x340>)
3c: 589e ldr r6, [r3, r2]
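The local-exec sequence takes three instructions instead of five: the offset from the thread pointer is read directly from the literal pool, rather than being loaded indirectly through a GOT entry.
Changing the model from initial-exec to local-exec may also sidestep the toolchain bug, although I am not sure, since that bug has not been root-caused yet.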
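For illustration, here is a minimal sketch of how the model could be selected per build with the GCC/Clang tls_model attribute. It is not the actual tcmalloc change: ThreadHeap, HEAP_TLS_MODEL, threadlocal_heap and RecordAllocation are hypothetical names, and using __PIC__ without __PIE__ to detect a shared-library build is an assumption about the build flags.

```cpp
#include <cstddef>

// Hypothetical stand-in for tcmalloc's per-thread heap; not the real type.
struct ThreadHeap {
  std::size_t bytes_allocated = 0;
};

// local-exec is only valid for code linked into the executable itself.
// Assumption: __PIC__ without __PIE__ means a shared-library style build,
// so keep initial-exec there and use local-exec everywhere else.
#if defined(__PIC__) && !defined(__PIE__)
#define HEAP_TLS_MODEL "initial-exec"
#else
#define HEAP_TLS_MODEL "local-exec"
#endif

// Per-thread pointer, analogous in spirit to threadlocal_heap_.
static __thread ThreadHeap* threadlocal_heap
    __attribute__((tls_model(HEAP_TLS_MODEL))) = nullptr;

// Hypothetical fast path: each call reads the TLS pointer, which is the
// access whose instruction sequence differs between the two models.
std::size_t RecordAllocation(std::size_t size) {
  static thread_local ThreadHeap storage;  // per-thread backing object
  if (threadlocal_heap == nullptr) {
    threadlocal_heap = &storage;
  }
  threadlocal_heap->bytes_allocated += size;
  return threadlocal_heap->bytes_allocated;
}
```

Building this for ARM both with and without -fPIC and disassembling RecordAllocation should make the difference between the two access sequences visible.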
Comment 1 by benhenry@google.com, Jan 11