Android crashes in using crc32 with ARMv8 specific optimization |
||||||
Issue descriptionConsumers of Cronet are observing crashes when using crc32 with ARMv8 specific optimization. Internal bug: b/110262915 Example crash stack: signal 7 (SIGBUS), code 2 (BUS_ADRERR), fault addr 0x7f5798327d Stack frame #00 pc 000000000000f128 /system/bin/linker64 (__dl_strcmp+140) Stack frame #01 pc 0000000000009740 /system/bin/linker64 (__dl__ZL14find_librariesP6soinfoPKPKcmPS0_PNSt3__16vectorIS0_NS6_9allocatorIS0_EEEEmiPK17android_dlextinfo+844) Stack frame #02 pc 0000000000009fcc /system/bin/linker64 (__dl__Z9do_dlopenPKciPK17android_dlextinfo+116) Stack frame #03 pc 00000000000033ec /system/bin/linker64 (__dl_dlopen+44) Stack frame #04 pc 000000000036ceac /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine get_elf_hwcap_from_getauxval at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/android_ndk/sources/android/cpufeatures/cpu-features.c:524 Stack frame #05 pc 0000000000068bc0 /system/lib64/libc.so (pthread_once+152) Stack frame #06 pc 000000000036cfec /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine android_getCpuFeatures at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/android_ndk/sources/android/cpufeatures/cpu-features.c:1078 Stack frame #07 pc 00000000003c3928 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine init_arm_features at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/zlib/arm_features.c:34 Stack frame #08 pc 0000000000068bc0 /system/lib64/libc.so (pthread_once+152) Stack frame #09 pc 0000000000402f5c /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine Cr_z_crc32 at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/zlib/crc32.c:292 Stack frame #10 pc 000000000025e8ac /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::simple_util::Crc32(char const*, int) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_util.cc:117 Stack frame #11 pc 00000000002591a8 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::Deserialize(char const*, int, base::Time*, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:537 Stack frame #12 pc 0000000000258e08 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::SyncLoadFromDisk(base::FilePath const&, base::Time*, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:495 (discriminator 2) Stack frame #13 pc 00000000002583e8 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::SyncLoadIndexEntries(net::CacheType, base::Time, base::FilePath const&, base::FilePath const&, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:403 Stack frame #14 pc 00000000001dd0f0 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::OnceCallback<void ()>::Run() && at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/callback.h:96 (discriminator 2) Stack frame #15 pc 00000000001dd354 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine void base::internal::FunctorTraits<void (*)(base::(anonymous namespace)::PostTaskAndReplyRelay), void>::Invoke<void (*)(base::(anonymous namespace)::PostTaskAndReplyRelay), base::(anonymous namespace)::PostTaskAndReplyRelay>(void (*&&)(base::(anonymous namespace)::PostTaskAndReplyRelay), base::(anonymous namespace)::PostTaskAndReplyRelay&&) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/bind_internal.h:402 (discriminator 6) Stack frame #16 pc 00000000001a19e4 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::OnceCallback<void ()>::Run() && at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/callback.h:96 (discriminator 2) Stack frame #17 pc 00000000001d0e04 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTracker::RunOrSkipTask(base::internal::Task, base::internal::Sequence*, bool) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker.cc:479 Stack frame #18 pc 00000000001f5cf4 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTrackerPosix::RunOrSkipTask(base::internal::Task, base::internal::Sequence*, bool) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker_posix.cc:23 (discriminator 4) Stack frame #19 pc 00000000001d080c /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTracker::RunAndPopNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker.cc:372 (discriminator 8) Stack frame #20 pc 00000000001cbb10 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::SchedulerWorker::RunWorker() at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/scheduler_worker.cc:313 (discriminator 8) Stack frame #21 pc 00000000001cb934 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::SchedulerWorker::RunPooledWorker() at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/scheduler_worker.cc:210 Stack frame #22 pc 00000000001f6178 /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::(anonymous namespace)::ThreadFunc(void*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/threading/platform_thread_posix.cc:76 Stack frame #23 pc 0000000000067814 /system/lib64/libc.so (_ZL15__pthread_startPv+52) Stack frame #24 pc 000000000001ee30 /system/lib64/libc.so (__start_thread+16)
,
Jun 18 2018
It looks like the Clank team has a Redmi Note 4 which I can test on. I have a theory given the weird call stack & BUS_ADRERR. __dl_strcmp __dl__ZL14find_librariesP6soinfoPKPKcmPS0_PNSt3__16vectorIS0_NS6_9allocatorIS0_EEEEmiPK17android_dlextinfo __dl__Z9do_dlopenPKciPK17android_dlextinfo __dl_dlopen get_elf_hwcap_from_getauxval (inside the NDK) pthread_once android_getCpuFeatures (our function) The NDK function android_getCpuFeatures calls get_elf_hwcap_from_getauxval, which in turn calls dlopen. OpenCV has some interesting code that I previously didn't fully comprehend. But I now thing it is related to this. They mention that calling dlopen() inside a call to dlopen() fails. And their code works around it: https://github.com/opencv/opencv/blob/master/3rdparty/cpufeatures/cpu-features.c#L519 This call stack has a base of __start_thread. My theory is some app is calling dlopen("libcronet.so") and part of Cronet's init is calling Cr_z_crc32 (which in turn is calling another dlopen because of the NDK). Later, I'll take a look at creating a repro case and then confirming this theory.
,
Jun 18 2018
Correction: android_getCpuFeatures (the API exposed by the NDK which our function calls)
,
Jun 18 2018
@cblume: interesting theory (and it makes sense). /me just back from vacations and still catching up with the never ending pile of email. I don't have access to a Redmi Note 4 device, but please let me know if there is anything we could do to help.
,
Jun 20 2018
This is off PostTaskAndReply, though, in a worker thread. Cronet initialization is an explicit function call, too, isn't it?
,
Jun 20 2018
I don't see a specific Cronet initialization here. But you are right that this thread doesn't have a prior dlopen() call. I don't think it is one-per-thread though. I think it is one-per-process. I am guessing at this but I believe if another thread is in the middle of a dlopen() call and this thread happens to also call dlopen() we'll encounter this issue. Is there a way we can check of the Cronet users could possibly have two threads both calling dlopen()? There may be some not-so-lovely workarounds available. Given the call stack, it is clear that android_getCpuFeatures() is just calling getauxval(). We could use the getauxval() code path on Android like we do on Linux. Or we could do what Chrome does and have an explicit initialization step which will call into android_getCpuFeatures() so we know the initialization steps happen in serial and there are no parallel calls to dlopen().
,
Jun 20 2018
It's very possible that we have a parallel call to dlopen(). +pauljensen@: looks like you are on sheriff duty this week, could you help with adding android_getCpuFeatures() to Cronet's initialization step?
,
Jun 20 2018
I may have said something in an unclear way. I am not sure if an initialization call to android_getCpuFeatures() will cause future calls to no longer call dlopen(). I would assume so but I don't know that. What I was suggesting is an initialization call into zlib so it can make its call into android_getCpuFeatures() and never need to call it again.
,
Jun 21 2018
All reports come from API level 23, Marshmallow. 99.69% of reports come from Redmi Note 4 and 4X.
,
Jun 21 2018
+dimitry@google.com who has worked a lot on bionic's dynamic linker. Dimitry: are you aware of any Marshmallow-only dlopen() crashes with similar call-stacks?
,
Jun 21 2018
tl;dr; if this is a bug I am thinking about this happens on unsuccessful dlopen with RTLD_NODELETE flag or if library has DF_1_NODELETE flag set and this was fixed for NYC. https://b.corp.google.com/issues/27911891 is the bug in question. I do not remember is this was regression or bug exists in mnc as well, sounds like it might have. If this is the case removing NODELETE flag should result in normal dlerror().
,
Jun 21 2018
It's Android NDK code: http://androidxref.com/6.0.0_r1/xref/ndk/sources/android/cpufeatures/cpu-features.c#506 that looks like this: dlopen("libc.so", RTLD_NOW) so I don't imagine it would be failing to load libc and it doesn't appear to use RTLD_NODELETE.
,
Jun 21 2018
Ah.. yes. It fails on any subsequent dlopens. The problem is that previous unsuccessful dlopen failed to clear internal linker structures and left them in the state where next access to get_soinfo() results in attempt to access unmmaped memory. Are there any NODELETE libraries in the project?
,
Jun 21 2018
It looks like the only NODELETE dlopen calls in Chromium are in media/ and content/ which aren't included in Cronet: https://cs.chromium.org/search/?q=RTLD_NODELETE+file:%5C.cc&sq=package:chromium&type=cs However, the app using Cronet (see b/110262915) could be using NODELETE, though a quick search didn't turn up any results.
,
Jun 21 2018
Is there any way to prevent the crash? like issuing a failed dlopen() without NODELETE?
,
Jun 21 2018
yes, removing NODELETE flag from dlopen will fix this. Come to think of it the failing dlopen with NODELETE should not crash. It should result in normal dlerror(). But all subsequent dlopens are doomed...
,
Jun 21 2018
That media call makes me wonder if something in the HAL or something for the device may be using that flag... is there any way to check?
,
Jun 21 2018
I looked quickly through Android Marshmallow source but I can't see into the device-specific HAL stuff. Unfortunately I don't know where the NODELETE dlopen call is... It's interesting that dlopen for libcronet succeeds, but then later the dlopen call from Cronet fails. I have a fix that perform the zlib initialization earlier, but I've no way of knowing if we'd hit this crash later from some other dlopen call in Cronet. Here's my proposed fix: https://chromium-review.googlesource.com/c/chromium/src/+/1110377 I'm trying to see if I can get a hold of a Redmi Note 4 to test.
,
Jun 21 2018
I'm sure you have all considered this but I haven't seen it said yet so I felt the need to say it. Since this is only happening on one set of devices for one version of Android, there is a very high chance that the device includes modifications to the AOSP which introduced a bug. Is that perhaps what you are saying with the HAL stuff? Maybe the bug is in the HAL layer?
,
Jun 22 2018
@paul: my understanding is that cronet does CPU features detection to enable optimizations? Assuming that the chain of calls allowed (i.e Android app calls dlopen() to load cronet and nothing else calls into zlib), why not pass on to zlib the CPU features information? We could have a state variable in zlib (i.e. arm_detect_cpu_features) that if set to 'false' we simply skip the call into the NDK function. Then it would be up to cronet to set the flag that enables the optimized code (https://cs.chromium.org/chromium/src/third_party/zlib/arm_features.c?l=22). One concern in zlib is that it runs in both a privileged level in Chromium (e.g. network operations) as also in a sandboxed level (i.e. in RenderProcess when we perform PNG decoding). One advantage of implementing this flag in zlib is that it would allow to perform experiments (i.e. disable/enable optimizations). Another advantage is that it would allow to avoid doing the same work twice in zlib (i.e. detect CPU features) in the context of cronet. Assuming this design is acceptable, it would require one change in cronet and one change in zlib. I would be more than happy to help on this one.
,
Jun 25 2018
> Is that perhaps what you are saying with the HAL stuff? Maybe the bug is in the HAL layer? It's possible there is a bug in the HAL, or perhaps the HAL is loading a library with RTLD_NODELETE. However I suspect updating Cronet may be the most expeditious way to avoid this crash. > @paul: my understanding is that cronet does CPU features detection to enable optimizations? Various pieces of Cronet do CPU feature detection (e.g. BoringSSL, zlib etc). We could have a way to pass CPU feature detection flags into zlib. I think I'd prefer to try my simpler less-risky fix first to see if it fixes the issue before investing more time going down this route. A simple safer fix also is more amenable to merging into release branches.
,
Jun 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fccbefeaeb30340b3f4240cec51088fb824581eb commit fccbefeaeb30340b3f4240cec51088fb824581eb Author: Paul Jensen <pauljensen@chromium.org> Date: Mon Jun 25 16:23:44 2018 [Cronet] Initialize zlib ARM CPU feature detection early Later attempts to perform this initialization seem to fail potentially because of a dlopen() bug, so perform the initialization early, right after dlopen() on libcronet succeeded. Bug: 853725 Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet Change-Id: I9132dff672f91d7962909531f01c6bada4fee964 Reviewed-on: https://chromium-review.googlesource.com/1110377 Reviewed-by: Andrei Kapishnikov <kapishnikov@chromium.org> Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Commit-Queue: Paul Jensen <pauljensen@chromium.org> Cr-Commit-Position: refs/heads/master@{#570057} [modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/DEPS [modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/android/BUILD.gn [modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/android/cronet_library_loader.cc
,
Jun 26 2018
#6 "We could use the getauxval() code path on Android like we do on Linux." Interesting idea: wonder if that'd work? Indeed, any reasons I'm missing why it would not work on Android?
,
Aug 3
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
,
Dec 21
Marking fixed per comment 22.
,
Dec 21
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by xunji...@chromium.org
, Jun 18 2018