amd64-generic-tot-asan-informational failures related to cryptohome-path (ASAN with high ASLR randomness)
Issue description
Failure: https://build.chromium.org/p/chromiumos.chromium/builders/amd64-generic-tot-asan-informational/builds/11332

Initially there is a warning:

WARNING: Image format was not specified for '/tmp/cbuildbot-tmp6YoEnz/chromiumos_qemu_disk.bin.hX7GT2' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions

Then most tests fail, e.g.

09:05:51 INFO | autoserv| AUTOTEST_STATUS:: File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py", line 486, in CryptohomePath
09:05:51 INFO | autoserv| AUTOTEST_STATUS:: raise OSError('cryptohome-path failed: %s' % stderr)
09:05:51 INFO | autoserv| AUTOTEST_STATUS:: OSError: cryptohome-path failed: [1216/070520:ERROR:cryptohome.cc(39)] Could not get size of system salt: /home/.shadow/salt: No such file or directory

I suspect this might be a fluke / a problem with the build slave (https://build.chromium.org/p/chromiumos.chromium/buildslaves/build330-m2). I will keep an eye on the next build.
,
Dec 16 2016
FYI: I think that WARNING comes directly from qemu. I've seen that printed every time I start a VM.
,
Dec 16 2016
+derat@, +sque@ I see that there have been some recent changes to cryptohome-path, so I am wondering if those might be related? Dan: I'm still trying to find out if / where cryptohome-path sends any output. Any idea?
,
Dec 16 2016
There is some potentially relevant info in messages here: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/amd64-generic-tot-asan-informational/R57-9092.0.0-b11333/vm_test_results_1/test_harness/all/SimpleTestVerify/1_autotest_tests/results-01-security_NetworkListeners/security_NetworkListeners/sysinfo/

Lots of snippets like this:

2016-12-16T18:19:07.598075+00:00 WARNING cryptohomed[883]: TSS: Failed unix connect: /var/run/tcsd.socket - No such file or directory
2016-12-16T18:19:07.605843+00:00 WARNING cryptohomed[883]: TSS: Got a list of valid IPs
2016-12-16T18:19:07.606186+00:00 WARNING cryptohomed[883]: TSS: Could not connect to machine: localhost
2016-12-16T18:19:07.606228+00:00 ERR cryptohomed[883]: TSS: Could not connect to any machine in the list.
2016-12-16T18:19:07.606300+00:00 ERR cryptohomed[883]: TSS: Failed to send packet

Then:

2016-12-16T18:19:07.989912+00:00 NOTICE autotest[3260]: 10:19:07.630 ERROR| browser:0062| Failure while starting browser backend.
Traceback (most recent call last):
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 55, in __init__
    self._browser_backend.Start()
  File "/usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
    return func(*args, **kwargs)
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py", line 166, in Start
    self._WaitForLogin()
  File "/usr/local/telemetry/src/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 52, in traced_function
    return func(*args, **kwargs)
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/cros_browser_backend.py", line 264, in _WaitForLogin
    py_utils.WaitFor(self._IsLoggedIn, 900)
  File "/us

Then more of the TSS failures, then
the cryptohome-path failure:

2016-12-16T18:19:08.886294+00:00 NOTICE autotest[3264]: 10:19:08.870 WARNI| test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 810, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled OSError: cryptohome-path failed: Segmentation fault

Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/security_NetworkListeners/security_NetworkListeners.py", line 99, i
,
Dec 16 2016
+ngm@ who landed this cryptohome change in the blame list: https://chromium-review.googlesource.com/#/c/419802/
,
Dec 16 2016
I'm unable to find logs for cryptohome. Are more details available?
,
Dec 16 2016
cryptohomed logs go to messages, e.g. in this folder: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/amd64-generic-tot-asan-informational/R57-9092.0.0-b11333/vm_test_results_1/test_harness/all/SimpleTestVerify/1_autotest_tests/results-01-security_NetworkListeners/security_NetworkListeners/sysinfo/var/log/
,
Dec 17 2016
Sorry, I don't really know anything about cryptohome. cryptohome-path looks like it's a simple wrapper around either brillo::cryptohome::home::GetRootPath() or GetUserPath() that prints to stdout, so knowing that it segfaulted doesn't narrow things down. Is there any way to get a stack trace from the crash?
,
Dec 17 2016
cryptohome said why it failed: "Could not get size of system salt: /home/.shadow/salt: No such file or directory". This file contains salt for generating obfuscated user names in GetUserPath().
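For intuition, here is a rough sketch of how a system salt typically feeds into an obfuscated user path. This is a hypothetical illustration, not the actual cryptohome implementation: the function names, the SHA-1 choice, and the /home/user/ layout are all assumptions.

```python
import hashlib

def obfuscate_username(username: str, system_salt: bytes) -> str:
    # Hypothetical: hash the system salt together with the
    # lowercased username to get a stable, opaque directory name.
    return hashlib.sha1(system_salt + username.lower().encode()).hexdigest()

def get_user_path(username: str, system_salt: bytes) -> str:
    # Without the salt file (/home/.shadow/salt) there is nothing
    # to mix in, which is why the real GetUserPath() fails outright.
    return '/home/user/' + obfuscate_username(username, system_salt)
```

The point is only that the salt is a hard dependency of the path computation, so a missing salt file makes cryptohome-path fail before it can print anything.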
,
Dec 17 2016
The salt was not created by the cryptohomed daemon in Mount::Init(). cryptohomed initialization failed when it couldn't connect to the tcsd daemon that talks to the TPM ("2016-12-16T18:16:31.051423+00:00 WARNING cryptohomed[883]: TSS: Failed unix connect: /var/run/tcsd.socket - No such file or directory").
,
Dec 17 2016
The tcsd daemon died soon after start:

2016-12-16T18:16:24.460945+00:00 WARNING kernel: [ 7.459783] init: tcsd main process (556) terminated with status 127

Likely because the TPM is in the dictionary-attack lockout state:

2016-12-16T18:16:24.151505+00:00 NOTICE tcsd-pre-start[544]: WARNING: Non-zero dictionary attack counter found: 100
,
Dec 17 2016
What machine is this? Is this a physical board? Does it have an actual TPM chip?
,
Dec 17 2016
This is a VM, so no TPM chip
,
Dec 19 2016
I'm wondering if the build slave is in a bad state. I requested a restart to see if that clears things up, issue 675655
,
Dec 19 2016
,
Dec 20 2016
Looks like the slave restart didn't fix the problem. I'll start investigating.
,
Dec 20 2016
The slave restart happened at last puppet run: 2016-12-20 12:12:24 PST. So the first build that started afterwards is still running: https://build.chromium.org/p/chromiumos.chromium/builders/amd64-generic-tot-asan-informational/builds/11366
,
Dec 20 2016
Ah, gotcha. Thanks
,
Dec 20 2016
OK, now the restart definitely didn't fix the problem :( https://build.chromium.org/p/chromiumos.chromium/builders/amd64-generic-tot-asan-informational/builds/11366
,
Dec 20 2016
Yea :(, I'm going to start taking a closer look.
,
Dec 21 2016
I downloaded one of the failing VMs and started playing around. The very strange thing is that cryptohome-path will segfault even if you give it no arguments, i.e.,

$ cryptohome-path
Segmentation fault

I've been unsuccessful so far in my attempts to collect a core dump; cryptohome-path never segfaults when running under gdb. I'm guessing this is related to the glibc update, but that has been reverted, so this bot should have been updated.

Re earlier comments:
- Inside the VM, /home/.shadow/salt seems valid in the sense that it exists and contains some data.
- Moving /home/.shadow/salt causes cryptohome-path to emit an error message and fail; it does not segfault.
- Re the TPM dictionary attack state: 100 is the default value when we can't fetch the current state [1].

1: https://cs.corp.google.com/chromeos_public/src/third_party/trousers/init/tcsd-pre-start.sh?l=65
,
Dec 21 2016
I believe the VM I'm testing is running glibc pre-update (2.19 -> 2.23 [1]):

$ ldd --version
ldd (Gentoo 2.19-r13 p2) 2.19

1: https://chromium-review.googlesource.com/c/339261/
,
Dec 21 2016
Ah! Enabling ASLR in gdb let me repro the segfault [1].

$ gdb cryptohome-path
(gdb) set disable-randomization off
(gdb) run  # repeat until segfault

1: http://stackoverflow.com/a/4628558
,
Dec 21 2016
As expected from comment #24, the binary no longer segfaults after disabling ASLR:

$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space  # disable ASLR (restore the default with echo 2)
$ cryptohome-path  # will not segfault

Also, using LD_DEBUG=reloc shows something going wrong in the preinit section (I'm guessing some sort of static ctor? This is definitely not an area I'm very familiar with). Note: LD_DEBUG=all gives a lot more info, way too much to include here.
$ LD_DEBUG=reloc /usr/sbin/cryptohome-path
18177:
18177: relocation processing: /lib64/libc.so.6
18177:
18177: relocation processing: /lib64/libz.so.1
18177:
18177: relocation processing: /usr/lib64/libevent_core-2.0.so.5
18177:
18177: relocation processing: /lib64/libpthread.so.0
18177:
18177: relocation processing: /usr/lib64/libglib-2.0.so.0
18177:
18177: relocation processing: /lib64/libdl.so.2
18177:
18177: relocation processing: /usr/lib64/libcrypto.so.1.0.0
18177:
18177: relocation processing: /usr/lib64/libgcc_s.so.1
18177:
18177: relocation processing: /lib64/librt.so.1
18177:
18177: relocation processing: /lib64/libm.so.6
18177:
18177: relocation processing: /usr/lib64/libstdc++.so.6
18177:
18177: relocation processing: /usr/lib64/libbase-core-395517.so
18177:
18177: relocation processing: /usr/lib64/libbrillo-cryptohome-395517.so
18177:
18177: relocation processing: /usr/sbin/cryptohome-path
18177:
18177: relocation processing: /lib64/ld-linux-x86-64.so.2
18177:
18177: calling init: /lib64/libpthread.so.0
18177:
18177:
18177: calling preinit: /usr/sbin/cryptohome-path
18177:
Segmentation fault
,
Dec 22 2016
There were a lot of CLs in the first breaking build: https://build.chromium.org/p/chromiumos.chromium/builders/amd64-generic-tot-asan-informational/builds/11332 It might be worthwhile going through each of them and seeing which ones affect cryptohome-path? The crash should be somewhere in here, right: https://cs.corp.google.com/chromeos_public/src/aosp/external/libbrillo/brillo/cryptohome.cc
,
Dec 27 2016
Re comment #22: I didn't see any CLs in that list that appear relevant to the breakage. Assigning to the current gardener since I'm OOO for the rest of this week.

I need to file a bug on this, but for reproducing: you can only do an ASAN build in a non-internal cros checkout. One of the packages fails to build right now in an internal cros checkout. The easiest way to repro locally is as follows:

# from $CROS_SRC dir
$ cros_sdk
$ ./setup_board --profile=asan --board=amd64-generic
$ ./build_packages --board=amd64-generic
$ ./image_to_vm.sh --board=amd64-generic --test_image
# from $CROS_SRC/src/scripts dir
$ ./bin/cros_start_vm --board=amd64-generic
$ ssh root@localhost -p 9222 -o StrictHostKeyChecking=no

Once sshed into the virtual machine, run the cryptohome-path binary a few times. It should segfault about a third of the time.
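To put a number on "about a third of the time" once inside the VM, a small harness like this can count crashes. This is a hypothetical helper (not part of the repro instructions); it relies on Python's subprocess reporting death-by-signal as a negative return code, so SIGSEGV (signal 11) shows up as -11 on Linux.

```python
import subprocess

def segfault_rate(binary: str, runs: int = 100) -> float:
    """Run |binary| repeatedly and report the fraction of SIGSEGV exits."""
    crashes = sum(
        subprocess.run([binary],
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL).returncode == -11
        for _ in range(runs)
    )
    return crashes / runs

# e.g. segfault_rate('/usr/sbin/cryptohome-path') inside the VM
```

A rate that moves with the kernel's ASLR settings would be a quick sanity check when bisecting.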
,
Jan 3 2017
Happy New Year and welcome back, everybody! Just a *friendly* reminder from your Sheriff that this is still failing on the external waterfall.

01/03 00:09:42.771 INFO | server_job:0153| FAIL security_NetworkListeners security_NetworkListeners timestamp=1483423780 localtime=Jan 03 00:09:40 Unhandled OSError: cryptohome-path failed: Segmentation fault (core dumped)
,
Jan 3 2017
-> This week's gardener.
,
Jan 4 2017
I followed the instructions in #27 but I can't successfully build. build_packages fails building dlm-0.0.1-r10 with multiple link errors such as:

dlm-0.0.1-r10: gen/include/power_manager/proto_bindings/suspend.pb.cc:19: error: undefined reference to '__asan_report_load8'
dlm-0.0.1-r10: gen/include/power_manager/proto_bindings/suspend.pb.cc:19: error: undefined reference to '__asan_report_load8'
dlm-0.0.1-r10: gen/include/power_manager/proto_bindings/suspend.pb.cc:19: error: undefined reference to '__ubsan_handle_type_mismatch'
dlm-0.0.1-r10: gen/include/power_manager/proto_bindings/suspend.pb.cc:20: error: undefined reference to '__asan_report_load8'
dlm-0.0.1-r10: gen/include/power_manager/proto_bindings/suspend.pb.cc:20: error: undefined reference to '__asan_report_load8'

Am I missing something?
,
Jan 4 2017
Re #30: It looks like you're building with a cros checkout that has internal files. Building with a public-only checkout worked for me.
,
Jan 5 2017
I've reproduced the build and the crash with cryptohome-path.
Looking at the source for it, just running "cryptohome-path" does very little -- checks argc/argv, and prints an error message.
I altered the source to be just an empty main, with no includes, removed library dependencies from the .gyp file for it, rebuilt it and ran it, and it still crashed.
$ LD_DEBUG=reloc /usr/sbin/cryptohome-path
9734: relocation processing: /lib64/libc.so.6
9734: relocation processing: /usr/lib64/libgcc_s.so.1
9734: relocation processing: /lib64/libdl.so.2
9734: relocation processing: /lib64/libpthread.so.0
9734: relocation processing: /lib64/librt.so.1
9734: relocation processing: /lib64/libm.so.6
9734: relocation processing: /usr/lib64/libstdc++.so.6
9734: relocation processing: /usr/sbin/cryptohome-path
9734: relocation processing: /lib64/ld-linux-x86-64.so.2
9734: calling init: /lib64/libpthread.so.0
9734: calling preinit: /usr/sbin/cryptohome-path
Segmentation fault (core dumped)
An objdump -d of the (pre-stripped) executable shows lots of static initializers ... libc, gmon, cxa, sanitizer, ...
Repeating with LD_DEBUG=all and the core dump consistently happens here relative to the logged output (compare to a run where it completed without error):
[...]
13627: symbol=fork; lookup in file=/usr/lib64/libstdc++.so.6 [0]
13627: symbol=fork; lookup in file=/lib64/libm.so.6 [0]
13627: symbol=fork; lookup in file=/lib64/libpthread.so.0 [0]
13627: binding file /usr/sbin/cryptohome-path [0] to /lib64/libpthread.so.0 [0]: normal symbol `fork'
<< core dumped here consistently, if it did >>
13627: symbol=_dl_get_tls_static_info; lookup in file=/usr/lib64/libstdc++.so.6 [0]
13627: symbol=_dl_get_tls_static_info; lookup in file=/lib64/libm.so.6 [0]
13627: symbol=_dl_get_tls_static_info; lookup in file=/lib64/libpthread.so.0 [0]
[...]
It's at least a hint about where it is occurring. I'll try more tomorrow, but at least this is an update.
,
Jan 7 2017
After figuring out how to get a working pair of manifests for the amd64-generic-tot-asan-informational builds (attached), and having to do a complete rebuild of the chroot because it was going back in time too much, I've done some further triaging.
The difference between the two manifests was this short list of paths and versions:
src/aosp/system/connectivity/shill b8ab59eb2547e21fcf077e64cb1afb67d3bdcb71 -> 726fe8b5afab7612a4d09909fe15067a2c21e03d
src/overlays f424ed0aac3da4b2ad57a737e6e9afd265062f86 -> 5ddb0ecebda10a577d0cd49085951ff757e9bc42
src/platform2 101f9b690b1e09be7c9527cb328d63cc60e6c453 -> ebbf0fa486c54a53624a2bd84704dc81c9940a00
src/third_party/adhd 52aa233a3352c981d4f445edfbaea4e14425d965 -> 1efbf06defd77b96c8a534b2c15617256236f18c
src/third_party/autotest/files fbe5ff7d9df877abe401395b841fd84a28d19176 -> 0499ed2235cda86db02f84d64ae798d8032b6702
src/third_party/chromiumos-overlay 8d11c99a29f8c71a96ea11ba2522c91a43b2614d -> 446d5e4b39197162a2fbc18a67e8bd725405a4f9
src/third_party/coreboot 021145eeb6e2223d5c513e34fa808b2d062997b5 -> 0e4272c74e87a58d703a2489c943308bea2b3a4a
src/third_party/kernel/v3.18 229586317c0cb78ec95eb9275d0bb7f623b7f70c -> 1c3f9498f57741bdf2b6d0333a8270efe81021cb
It turned out that the breaking change was
src/third_party/chromiumos-overlay 63e254767c2b60939939b040630750b2ea399c37
But this was an automated submit to mark a large number of ebuilds as stable.
For someone to continue triaging from this point, they would start by syncing to the build 11332 manifest (attached), check out 63e254767c2b60939939b040630750b2ea399c37 in src/third_party/chromiumos-overlay, and start bisecting the files involved.
I think I have enough time to try the obvious candidates before I leave today, and I'll report back on this bug with which files I reverted and whether that still led to the core dump when running "cryptohome-path" in the VM. I'll be out next week, so I expect to pass the bug on to hshi@.
,
Jan 7 2017
Reverting these paths from 63e254767c2b60939939b040630750b2ea399c37 did *NOT* fix the core dump:
chromeos-base/attestation
chromeos-base/chromite
chromeos-base/cryptohome
sys-kernel/chromeos-kernel-3_18
,
Jan 9 2017
,
Jan 9 2017
Ok thanks to lpique@ I've set up my repo and can reproduce the crash. It occurs roughly 20% of the time, but if I attach a debugger then it never crashes.
,
Jan 9 2017
Re #36: See comment #24; you need to enable ASLR inside gdb:

$ gdb cryptohome-path
(gdb) set disable-randomization off
(gdb) run  # repeat until segfault
,
Jan 9 2017
Re #37: sorry I missed that! Yes this works, thanks
,
Jan 9 2017
Re #34: note that the amd64-generic ASAN build uses kernel 4.4 by default, not kernel 3.18. The list of kernel 4.4 changes in this range is:

8c372912c802 CHROMIUM: thermal: rockchip: sync the typo to upstream
f34b66cee68b UPSTREAM: thermal: rockchip: handle set_trips without the trip points
f905074bcde5 UPSTREAM: thermal: rockchip: optimize the conversion table
8147719c7b38 CHROMIUM: config: set mmap_rnd[_compats]_bits to the maximum
b58324298b8d UPSTREAM: netfilter: nfnetlink: use original skbuff when acking batches
6edfbf9581e7 UPSTREAM: thermal: rockchip: fixes invalid temperature case
e47a7da072d1 CHROMIUM: drm/rockchip: Only wait for panel ACK on PSR entry

I'm suspecting this one:

commit 8147719c7b38b5eb7713fcc6dfa660a7967d8d1e
Author: Nicolas Boichat <drinkcat@chromium.org>
Date: Tue Dec 13 15:09:13 2016 +0800

    CHROMIUM: config: set mmap_rnd[_compats]_bits to the maximum

    BUG=b:33398361
    TEST=Run CTS CtsAslrMallocTestCases module

    Change-Id: Iffdcfdd4ce4fbdf445e4ada7c20f2b6935d73a0e
    Reviewed-on: https://chromium-review.googlesource.com/418145
    Commit-Ready: Nicolas Boichat <drinkcat@chromium.org>
    Tested-by: Nicolas Boichat <drinkcat@chromium.org>
    Reviewed-by: Mattias Nissler <mnissler@chromium.org>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
,
Jan 9 2017
CC drinkcat@ and dianders@, who reviewed the kernel 4.4 change that modified the mmap_rnd[_compats]_bits config. This seems somewhat related to ASLR and the random crashes we're seeing.
,
Jan 9 2017
,
Jan 10 2017
I can confirm that the breaking change is https://chromium-review.googlesource.com/418145. On TOT I am able to reproduce this crash with about 20% probability, but with that kernel 4.4 CL reverted I am NOT able to reproduce the crash after 200 repeated runs.
,
Jan 10 2017
Please review https://chromium-review.googlesource.com/#/c/426066/ I propose that we revert the kernel patch first. Then I can either reassign this to drinkcat@ or we can start a separate bug to track re-landing the kernel 4.4 patch.
,
Jan 10 2017
As mentioned in the CL, feel free to revert, but:
1. This will break CTS on N.
2. ASAN/cryptohome is probably broken: I suspect increasing ASLR randomness just makes an underlying issue more likely to happen. Somebody should investigate.
3. AFAIK, more ASLR randomness is generally better for security. I don't think we should compromise that to pass an ASAN test.
,
Jan 10 2017
Nicolas: since the breaking tests are on the x86-64 config, is it really necessary to bump to 32 bits to pass CTS on N? According to CTS AslrMallocTest.cpp, it really only tries huge allocations of up to 2^23 bytes.
,
Jan 10 2017
The values we set in config options match what a normal Android instance would set: https://codesearch.corp.google.com/android/system/core/init/init.cpp?type=cs&q=set_mmap_rnd_bits_action&l=324 . But, yes, we can probably get away with only reverting CONFIG_ARCH_MMAP_RND_BITS change (since we only ever use 32-bit containers).
,
Jan 10 2017
Re #44: drinkcat@, I also initially thought about the possibility of cryptohome being broken, but according to comment #32 this happens even if we just build an empty executable with a main() function that does nothing and returns 0.
,
Jan 10 2017
Understood, but I suspect there are other executables that actually run fine; why is it only cryptohome that shows this issue? Reverting the ASLR patch just sweeps the issue under the carpet; there's something else going on that needs deeper investigation.
,
Jan 10 2017
Since this has been broken for a month anyway, I suggest that we hold off on the revert for a few days (we'd eventually have to re-land the patch anyway), and try to investigate the underlying issue first.
,
Jan 10 2017
I instrumented the 4.4 kernel in arch/x86/mm/mmap.c and in fs/exec.c to look at the mmap_base values for /usr/sbin/cryptohome-path. Clearly the mmap_base values are randomized, but as far as I can see there's no discernible pattern for which base address values cause crashes and which do not. I've seen small addresses, large addresses, and pretty much anything in between that either causes or does not cause crashes.
,
Jan 10 2017
Experiment shows that it is sufficient to set CONFIG_ARCH_MMAP_RND_BITS to 31 in third_party/kernel/v4.4/chromeos/config/x86_64/common.config to completely eliminate the crash. We don't need to go back to 28. Setting it to 32 however will cause crashes.
,
Jan 10 2017
The crash seems related to load_elf_binary() in fs/binfmt_elf.c. I instrumented the |load_bias| value calculated for /usr/sbin/cryptohome-path when randomization is enabled. The default load bias equals 0x555555555555ull (a hard-coded constant in the kernel, roughly 2/3 of the 47-bit user address space), and the randomized load_bias adds a random 32-bit unsigned int left-shifted by 12 bits, so the result is between 0x555555555555ull and 0x655555555554ull. In all the crashing cases, the randomized |load_bias| is greater than 0x600000000000ull, whereas values of |load_bias| below 0x5fffffffffffull do not cause the crash.
,
Jan 10 2017
+keescook
,
Jan 10 2017
More experiments confirm that the threshold of the maximum return value of arch_mmap_rnd() at which the crash begins to occur is roughly 0xAAAAAAAA (2/3 of 2^32). So it is safe to use 31 bits of randomness, but not 32 bits.
The following patch completely eliminates the crash:
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index d2dc0438d654..ccc8301860a6 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -76,7 +76,9 @@ unsigned long arch_mmap_rnd(void)
 		rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
 #endif
 	else
+		do
 		rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
+		while (rnd > 0xaaaaaaaaUL);
 
 	return rnd << PAGE_SHIFT;
 }
,
Jan 10 2017
One thing is clear: I can't reproduce the crash with other executables, such as crossystem. But pretty much every executable target in cryptohome.gyp can trigger this crash with more or less the same probability, including:
/usr/sbin/cryptohome
/usr/sbin/cryptohomed
/usr/sbin/cryptohome-path
/usr/sbin/lockbox-ccache
/usr/sbin/tpm-manager
It could be an obscure bug in one of the dependency libraries that the various cryptohome targets are linked against, or (although less likely) it could be a kernel bug.
,
Jan 10 2017
Instead of the above patch, please lower the sysctl for mmap_rnd_bits. That should solve it...
,
Jan 10 2017
Re #56: keescook@: yes, I understand; the patch in #54 is for illustrative purposes only. We can certainly reduce mmap_rnd_bits from 32 to 31, and that will completely eliminate the crash; however, we still want to find out why there's a problem with 32 bits of randomness.
,
Jan 10 2017
Okay, understood. I would examine the ranges available for ET_DYN, brk, mmap, and stack. It's possible that at the extreme ends of their ranges they can collide.
,
Jan 10 2017
For reference, here's the backtrace in gdb when SIGSEGV occurs:

Program received signal SIGSEGV, Segmentation fault.
0x0000638b1524f520 in ?? ()
(gdb) bt
#0 0x0000638b1524f520 in ?? ()
#1 0x0000638b1525d7c5 in ?? ()
#2 0x0000638b1518f698 in ?? ()
#3 0x0000638b1524284e in ?? ()
#4 0x00007346c589ab2b in _dl_init (main_map=0x7346c5ab0128, argc=1, argv=0x7ffd5d29c618, env=0x7ffd5d29c628) at dl-init.c:105
#5 0x00007346c588bcda in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#6 0x0000000000000001 in ?? ()
#7 0x00007ffd5d29e714 in ?? ()
#8 0x0000000000000000 in ?? ()
,
Jan 10 2017
can you dump the /proc/$pid/maps file for such a process too?
,
Jan 10 2017
,
Jan 10 2017
Re #60: I can't do that easily because the process crashes right away, and then the PID is already gone.
But from gdb I can do "info proc mappings", which does the same thing. For example, here's the dump from another crash:
Program received signal SIGSEGV, Segmentation fault.
0x000063999bf67520 in ?? ()
(gdb) bt
#0 0x000063999bf67520 in ?? ()
#1 0x000063999bf757c5 in ?? ()
#2 0x000063999bea7698 in ?? ()
#3 0x000063999bf5a84e in ?? ()
#4 0x000078621d111b2b in _dl_init (main_map=0x78621d327128, argc=1, argv=0x7ffca6d39488, env=0x7ffca6d39498) at dl-init.c:105
#5 0x000078621d102cda in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#6 0x0000000000000001 in ?? ()
#7 0x00007ffca6d3a714 in ?? ()
#8 0x0000000000000000 in ?? ()
(gdb) info proc mappings
process 2605
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x7fff7000 0x8fff7000 0x10000000 0x0
0x8fff7000 0x2008fff7000 0x20000000000 0x0
0x2008fff7000 0x10007fff8000 0xdfff0001000 0x0
0x600000000000 0x640000000000 0x40000000000 0x0 [heap]
0x78621b866000 0x78621bbb8000 0x352000 0x0
0x78621bbb8000 0x78621bcd5000 0x11d000 0x0 /usr/lib64/libglib-2.0.so.0.3400.3
0x78621bcd5000 0x78621bcd6000 0x1000 0x11d000 /usr/lib64/libglib-2.0.so.0.3400.3
0x78621bcd6000 0x78621bcd9000 0x3000 0x11d000 /usr/lib64/libglib-2.0.so.0.3400.3
0x78621bcd9000 0x78621bcda000 0x1000 0x120000 /usr/lib64/libglib-2.0.so.0.3400.3
0x78621bcda000 0x78621becc000 0x1f2000 0x0 /usr/lib64/libcrypto.so.1.0.0
0x78621becc000 0x78621becd000 0x1000 0x1f2000 /usr/lib64/libcrypto.so.1.0.0
0x78621becd000 0x78621beeb000 0x1e000 0x1f2000 /usr/lib64/libcrypto.so.1.0.0
0x78621beeb000 0x78621bef7000 0xc000 0x210000 /usr/lib64/libcrypto.so.1.0.0
0x78621bef7000 0x78621befb000 0x4000 0x0
0x78621befb000 0x78621c09b000 0x1a0000 0x0 /lib64/libc-2.23.so
0x78621c09b000 0x78621c29b000 0x200000 0x1a0000 /lib64/libc-2.23.so
0x78621c29b000 0x78621c29f000 0x4000 0x1a0000 /lib64/libc-2.23.so
0x78621c29f000 0x78621c2a1000 0x2000 0x1a4000 /lib64/libc-2.23.so
0x78621c2a1000 0x78621c2a6000 0x5000 0x0
0x78621c2a6000 0x78621c2bc000 0x16000 0x0 /usr/lib64/libgcc_s.so.1
0x78621c2bc000 0x78621c4bb000 0x1ff000 0x16000 /usr/lib64/libgcc_s.so.1
0x78621c4bb000 0x78621c4bc000 0x1000 0x15000 /usr/lib64/libgcc_s.so.1
0x78621c4bc000 0x78621c4bd000 0x1000 0x16000 /usr/lib64/libgcc_s.so.1
0x78621c4bd000 0x78621c4c0000 0x3000 0x0 /lib64/libdl-2.23.so
0x78621c4c0000 0x78621c6bf000 0x1ff000 0x3000 /lib64/libdl-2.23.so
0x78621c6bf000 0x78621c6c0000 0x1000 0x2000 /lib64/libdl-2.23.so
0x78621c6c0000 0x78621c6c1000 0x1000 0x3000 /lib64/libdl-2.23.so
0x78621c6c1000 0x78621c6c8000 0x7000 0x0 /lib64/librt-2.23.so
0x78621c6c8000 0x78621c8c7000 0x1ff000 0x7000 /lib64/librt-2.23.so
0x78621c8c7000 0x78621c8c8000 0x1000 0x6000 /lib64/librt-2.23.so
0x78621c8c8000 0x78621c8c9000 0x1000 0x7000 /lib64/librt-2.23.so
0x78621c8c9000 0x78621c8e0000 0x17000 0x0 /lib64/libpthread-2.23.so
0x78621c8e0000 0x78621cae0000 0x200000 0x17000 /lib64/libpthread-2.23.so
0x78621cae0000 0x78621cae1000 0x1000 0x17000 /lib64/libpthread-2.23.so
0x78621cae1000 0x78621cae2000 0x1000 0x18000 /lib64/libpthread-2.23.so
0x78621cae2000 0x78621cae6000 0x4000 0x0
0x78621cae6000 0x78621cbeb000 0x105000 0x0 /lib64/libm-2.23.so
0x78621cbeb000 0x78621cdeb000 0x200000 0x105000 /lib64/libm-2.23.so
0x78621cdeb000 0x78621cdec000 0x1000 0x105000 /lib64/libm-2.23.so
0x78621cdec000 0x78621cded000 0x1000 0x106000 /lib64/libm-2.23.so
0x78621cded000 0x78621cee3000 0xf6000 0x0 /usr/lib64/libstdc++.so.6.0.20
0x78621cee3000 0x78621d0e2000 0x1ff000 0xf6000 /usr/lib64/libstdc++.so.6.0.20
0x78621d0e2000 0x78621d0ec000 0xa000 0xf5000 /usr/lib64/libstdc++.so.6.0.20
0x78621d0ec000 0x78621d0ed000 0x1000 0xff000 /usr/lib64/libstdc++.so.6.0.20
0x78621d0ed000 0x78621d102000 0x15000 0x0
0x78621d102000 0x78621d126000 0x24000 0x0 /lib64/ld-2.23.so
0x78621d157000 0x78621d168000 0x11000 0x0
0x78621d168000 0x78621d17d000 0x15000 0x0 /lib64/libz.so.1.2.8
0x78621d17d000 0x78621d17e000 0x1000 0x14000 /lib64/libz.so.1.2.8
0x78621d17e000 0x78621d17f000 0x1000 0x15000 /lib64/libz.so.1.2.8
0x78621d17f000 0x78621d180000 0x1000 0x0
0x78621d180000 0x78621d19c000 0x1c000 0x0 /usr/lib64/libevent_core-2.0.so.5.1.9
0x78621d19c000 0x78621d19d000 0x1000 0x1c000 /usr/lib64/libevent_core-2.0.so.5.1.9
0x78621d19d000 0x78621d19e000 0x1000 0x1c000 /usr/lib64/libevent_core-2.0.so.5.1.9
0x78621d19e000 0x78621d19f000 0x1000 0x1d000 /usr/lib64/libevent_core-2.0.so.5.1.9
0x78621d19f000 0x78621d1a2000 0x3000 0x0
0x78621d1a2000 0x78621d2fb000 0x159000 0x0 /usr/lib64/libbase-core-395517.so
0x78621d2fb000 0x78621d304000 0x9000 0x158000 /usr/lib64/libbase-core-395517.so
0x78621d304000 0x78621d305000 0x1000 0x161000 /usr/lib64/libbase-core-395517.so
0x78621d305000 0x78621d307000 0x2000 0x0
0x78621d307000 0x78621d311000 0xa000 0x0 /usr/lib64/libbrillo-cryptohome-395517.so
0x78621d311000 0x78621d312000 0x1000 0x9000 /usr/lib64/libbrillo-cryptohome-395517.so
0x78621d312000 0x78621d319000 0x7000 0xa000 /usr/lib64/libbrillo-cryptohome-395517.so
0x78621d319000 0x78621d325000 0xc000 0x0
0x78621d325000 0x78621d326000 0x1000 0x23000 /lib64/ld-2.23.so
0x78621d326000 0x78621d327000 0x1000 0x24000 /lib64/ld-2.23.so
0x78621d327000 0x78621d328000 0x1000 0x0
0x7ffca6d1a000 0x7ffca6d3b000 0x21000 0x0 [stack]
0x7ffca6dbd000 0x7ffca6dbf000 0x2000 0x0 [vvar]
0x7ffca6dbf000 0x7ffca6dc1000 0x2000 0x0 [vdso]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 [vsyscall]
,
Jan 10 2017
So, one thing I have noticed: the heap is at 0x600000000000 - 0x640000000000. This falls inside the randomized |load_bias| range of 0x555555555555ull to 0x655555555554ull. I have run several hundred times so far, and the crash only happens when |load_bias| falls inside the 0x600000000000 - 0x640000000000 range.
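The overlap can be checked with a little arithmetic. This is a sketch using only numbers from this thread (the observed default load bias, the [heap] range from the maps dump, and the fact that arch_mmap_rnd() contributes rnd << PAGE_SHIFT); the helper names are mine, not kernel functions.

```python
PAGE_SHIFT = 12
ET_DYN_BASE = 0x555555555555                        # default load bias observed
HEAP_LO, HEAP_HI = 0x600000000000, 0x640000000000   # [heap] from the maps dump

def load_bias(rnd: int) -> int:
    # Randomized bias = default base + (32-bit rnd shifted by PAGE_SHIFT),
    # mirroring what fs/binfmt_elf.c does with arch_mmap_rnd().
    return ET_DYN_BASE + (rnd << PAGE_SHIFT)

def collides_with_heap(rnd: int) -> bool:
    return HEAP_LO <= load_bias(rnd) < HEAP_HI

# With 31 bits of randomness the bias tops out below the heap:
max_bias_31 = load_bias((1 << 31) - 1)   # 0x5d5555554555, below 0x600000000000
```

This reproduces both observations: 31 bits can never reach the heap (matching #53), and with 32 bits the collisions start at rnd just above 0xAAAAAAAA (matching the threshold measured in #58). The colliding window [0xAAAAAAAB, 0xEAAAAAAA] is 2^30 wide, i.e. a quarter of all 32-bit rnd values, consistent with the ~20-30% crash rate seen.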
,
Jan 10 2017
Ew, yup. That would be it. It seems the brk (heap) randomization isn't handling this correctly.
,
Jan 10 2017
Any suggestions what I should try next? I'm not very familiar with how the brk randomization works. Thanks
,
Jan 10 2017
I'm looking now too. The logic starts in fs/binfmt_elf.c with the call to arch_randomize_brk(). The brk offset should already have been bumped by the load_bias, though, so I'm scratching my head at the moment. I'll keep looking...
,
Jan 10 2017
re:#66 please note this is the chromeos-4.4 branch, so it might not have picked up any latest kernel patches from upstream yet.
,
Jan 10 2017
Dump from readelf -Wl as requested:
Elf file type is DYN (Shared object file)
Entry point 0x214f0
There are 10 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x000230 0x000230 R 0x8
INTERP 0x000270 0x0000000000000270 0x0000000000000270 0x00001c 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x12df58 0x12df58 R E 0x1000
LOAD 0x12ee00 0x000000000012fe00 0x000000000012fe00 0x005ef0 0xcf85c8 RW 0x1000
DYNAMIC 0x12f3b8 0x00000000001303b8 0x00000000001303b8 0x0002a0 0x0002a0 RW 0x8
NOTE 0x00028c 0x000000000000028c 0x000000000000028c 0x000044 0x000044 R 0x4
GNU_EH_FRAME 0x129fac 0x0000000000129fac 0x0000000000129fac 0x003fac 0x003fac R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0
TLS 0x12ee00 0x000000000012fe00 0x000000000012fe00 0x000000 0x000054 R 0x8
GNU_RELRO 0x12ee00 0x000000000012fe00 0x000000000012fe00 0x003200 0x003200 RW 0x40
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.ABI-tag .note.gnu.build-id .dynsym .dynstr .gnu.hash .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame .eh_frame_hdr
03 .data.rel.ro.local .jcr .fini_array .init_array .preinit_array .data.rel.ro .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.ABI-tag .note.gnu.build-id
06 .eh_frame_hdr
07
08 .tbss
09 .data.rel.ro.local .jcr .fini_array .init_array .preinit_array .data.rel.ro .dynamic .got .got.plt
,
Jan 10 2017
A more complete dump of debug info is uploaded for review at https://paste.googleplex.com/5242057396322304
,
Jan 10 2017
Per offline chat: I propose that we temporarily reduce the randomness from 32 to 31 bits for x86_64 only. This is still sufficient for security and for passing the CTS tests. https://chromium-review.googlesource.com/#/c/426066/ Meanwhile keescook@ will try to reproduce this locally so that this can be investigated more efficiently.
,
Jan 11 2017
Can you try backporting the following kernel changes from upstream?

ecc2bc8ac03884266cf73f8a2a42b911465b2fbc
5d22fc25d4fc8096d2d7df27ea1893d4e055e764
0036d1f7eb95bcc52977f15507f00dd07018e7e2

I don't think it'll change anything, but it does touch a lot of the same code that I'm suspicious of.
,
Jan 11 2017
Re #71: I did try 0036d1f7eb95bcc52977f15507f00dd07018e7e2 earlier yesterday, but it didn't seem to help. I haven't tried the other two, though. Can you try to set up a repro locally? It would be more efficient, as you have more context on this topic than I do.
,
Jan 11 2017
FYI: I tried all 3 patches in comment #71; they don't seem to help. However, the [heap] range of 0x600000000000 - 0x640000000000 appears to always be fixed and has nothing to do with the load_bias. For example, if I force load_bias to 0x480000000000 (by modifying ELF_ET_DYN_BASE in arch/x86/include/asm/elf.h), then the ET_DYN load_bias never collides with [heap], which is still at 0x600000000000.
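One candidate for a fixed region at exactly those addresses: ASan's 64-bit primary allocator (SizeClassAllocator64 in compiler-rt) reserves a hardcoded space on x86_64, with kSpaceBeg = 0x600000000000 and kSpaceSize = 0x40000000000 in compiler-rt of roughly that era — treat these constants as an assumption to verify against the toolchain in use. The arithmetic lines up with the observed [heap] range:

```python
# Assumed compiler-rt constants for SizeClassAllocator64 on x86_64;
# verify against the actual ASan runtime shipped with the toolchain.
K_SPACE_BEG = 0x600000000000
K_SPACE_SIZE = 0x40000000000  # 4 TiB

print(hex(K_SPACE_BEG + K_SPACE_SIZE))  # end of the reserved region
```

If that reservation is what shows up as [heap], the collision is between the kernel's randomized ET_DYN base and ASan's fixed allocator region, consistent with the "either in the kernel or ASAN" wording in the commit message.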
,
Jan 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a29136e1e89b3bb15d2b4917db856058f576c06f

commit a29136e1e89b3bb15d2b4917db856058f576c06f
Author: Haixia Shi <hshi@chromium.org>
Date: Mon Jan 09 22:32:54 2017

CHROMIUM: config: reduce mmap_rnd_bits from 32 to 31 for x86_64.

There seems to be a bug (either in the kernel or ASAN cryptohome) that causes the range of ET_DYN to collide with the heap when 32 random bits are used. Changing this to 31 would still provide plenty of randomness and allow us to pass the relevant CTS tests. Meanwhile we will continue to investigate the underlying problem.

BUG=chromium:674998
BUG=b:33398361
TEST=see instructions at http://crbug.com/674998#c27

Change-Id: I6137c5f3798e9de0ec6e57f9e4534d016ad72727
Reviewed-on: https://chromium-review.googlesource.com/426066
Commit-Ready: Haixia Shi <hshi@chromium.org>
Tested-by: Haixia Shi <hshi@chromium.org>
Reviewed-by: Haixia Shi <hshi@chromium.org>

[modify] https://crrev.com/a29136e1e89b3bb15d2b4917db856058f576c06f/chromeos/config/x86_64/common.config
,
Jan 11 2017
Builds are now turning green. See https://build.chromium.org/p/chromiumos.chromium/builders/amd64-generic-tot-asan-informational/ I'd suggest lowering this to Pri-2.
,
Jan 11 2017
Re #75: good suggestion, and thank you for all the good work! Do you want to keep working on this, or shall we give it to Kees? (Hi Kees, please feel free to chime in :)
,
Jan 11 2017
By the way, I said "thank you" because it's good progress for the team. Not because you did a favor to me or anything like that. Maybe I should have said "I am impressed by the good work". You get the idea.