New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 853725 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug



Sign in to add a comment

Android crashes in using crc32 with ARMv8 specific optimization

Project Member Reported by xunji...@chromium.org, Jun 18 2018

Issue description

Consumers of Cronet are observing crashes when using crc32 with ARMv8 specific optimization.

Internal bug: b/110262915
Example crash stack:

signal 7 (SIGBUS), code 2 (BUS_ADRERR), fault addr 0x7f5798327d
Stack frame #00 pc 000000000000f128  /system/bin/linker64 (__dl_strcmp+140)
Stack frame #01 pc 0000000000009740  /system/bin/linker64 (__dl__ZL14find_librariesP6soinfoPKPKcmPS0_PNSt3__16vectorIS0_NS6_9allocatorIS0_EEEEmiPK17android_dlextinfo+844)
Stack frame #02 pc 0000000000009fcc  /system/bin/linker64 (__dl__Z9do_dlopenPKciPK17android_dlextinfo+116)
Stack frame #03 pc 00000000000033ec  /system/bin/linker64 (__dl_dlopen+44)
Stack frame #04 pc 000000000036ceac  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine get_elf_hwcap_from_getauxval at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/android_ndk/sources/android/cpufeatures/cpu-features.c:524
Stack frame #05 pc 0000000000068bc0  /system/lib64/libc.so (pthread_once+152)
Stack frame #06 pc 000000000036cfec  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine android_getCpuFeatures at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/android_ndk/sources/android/cpufeatures/cpu-features.c:1078
Stack frame #07 pc 00000000003c3928  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine init_arm_features at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/zlib/arm_features.c:34
Stack frame #08 pc 0000000000068bc0  /system/lib64/libc.so (pthread_once+152)
Stack frame #09 pc 0000000000402f5c  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine Cr_z_crc32 at /b/build/slave/cronet-arm_64/build/src/out/Release/../../third_party/zlib/crc32.c:292
Stack frame #10 pc 000000000025e8ac  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::simple_util::Crc32(char const*, int) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_util.cc:117
Stack frame #11 pc 00000000002591a8  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::Deserialize(char const*, int, base::Time*, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:537
Stack frame #12 pc 0000000000258e08  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::SyncLoadFromDisk(base::FilePath const&, base::Time*, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:495 (discriminator 2)
Stack frame #13 pc 00000000002583e8  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine disk_cache::SimpleIndexFile::SyncLoadIndexEntries(net::CacheType, base::Time, base::FilePath const&, base::FilePath const&, disk_cache::SimpleIndexLoadResult*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../net/disk_cache/simple/simple_index_file.cc:403
Stack frame #14 pc 00000000001dd0f0  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::OnceCallback<void ()>::Run() && at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/callback.h:96 (discriminator 2)
Stack frame #15 pc 00000000001dd354  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine void base::internal::FunctorTraits<void (*)(base::(anonymous namespace)::PostTaskAndReplyRelay), void>::Invoke<void (*)(base::(anonymous namespace)::PostTaskAndReplyRelay), base::(anonymous namespace)::PostTaskAndReplyRelay>(void (*&&)(base::(anonymous namespace)::PostTaskAndReplyRelay), base::(anonymous namespace)::PostTaskAndReplyRelay&&) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/bind_internal.h:402 (discriminator 6)
Stack frame #16 pc 00000000001a19e4  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::OnceCallback<void ()>::Run() && at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/callback.h:96 (discriminator 2)
Stack frame #17 pc 00000000001d0e04  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTracker::RunOrSkipTask(base::internal::Task, base::internal::Sequence*, bool) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker.cc:479
Stack frame #18 pc 00000000001f5cf4  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTrackerPosix::RunOrSkipTask(base::internal::Task, base::internal::Sequence*, bool) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker_posix.cc:23 (discriminator 4)
Stack frame #19 pc 00000000001d080c  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::TaskTracker::RunAndPopNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/task_tracker.cc:372 (discriminator 8)
Stack frame #20 pc 00000000001cbb10  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::SchedulerWorker::RunWorker() at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/scheduler_worker.cc:313 (discriminator 8)
Stack frame #21 pc 00000000001cb934  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::internal::SchedulerWorker::RunPooledWorker() at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/task_scheduler/scheduler_worker.cc:210
Stack frame #22 pc 00000000001f6178  /data/app/myapp/lib/arm64/libcronet.68.0.3426.0.so (offset 0x18d000): Routine base::(anonymous namespace)::ThreadFunc(void*) at /b/build/slave/cronet-arm_64/build/src/out/Release/../../base/threading/platform_thread_posix.cc:76
Stack frame #23 pc 0000000000067814  /system/lib64/libc.so (_ZL15__pthread_startPv+52)
Stack frame #24 pc 000000000001ee30  /system/lib64/libc.so (__start_thread+16)
 
Cc: cavalcantii@chromium.org cblume@chromium.org noel@chromium.org
+cavalcantii@, noel@, cblume@:

This seems to be related to the recent patch on crc32 with Armv8 optimization: https://chromium-review.googlesource.com/c/chromium/src/+/801108

Could you take a look?

This might be device-specific. Most reports are coming from Redmi Note 4 and Redmi Note 4X.

Comment 2 by cblume@chromium.org, Jun 18 2018

Cc: ericrk@chromium.org vmi...@chromium.org
It looks like the Clank team has a Redmi Note 4 which I can test on.

I have a theory given the weird call stack & BUS_ADRERR.

__dl_strcmp
__dl__ZL14find_librariesP6soinfoPKPKcmPS0_PNSt3__16vectorIS0_NS6_9allocatorIS0_EEEEmiPK17android_dlextinfo
__dl__Z9do_dlopenPKciPK17android_dlextinfo
__dl_dlopen
get_elf_hwcap_from_getauxval (inside the NDK)
pthread_once
android_getCpuFeatures (our function)

The NDK function android_getCpuFeatures calls get_elf_hwcap_from_getauxval, which in turn calls dlopen.

OpenCV has some interesting code that I previously didn't fully comprehend. But I now thing it is related to this. They mention that calling dlopen() inside a call to dlopen() fails. And their code works around it: https://github.com/opencv/opencv/blob/master/3rdparty/cpufeatures/cpu-features.c#L519


This call stack has a base of __start_thread.

My theory is some app is calling dlopen("libcronet.so") and part of Cronet's init is calling Cr_z_crc32 (which in turn is calling another dlopen because of the NDK).



Later, I'll take a look at creating a repro case and then confirming this theory.

Comment 3 by cblume@chromium.org, Jun 18 2018

Correction:

android_getCpuFeatures (the API exposed by the NDK which our function calls)
@cblume: interesting theory (and it makes sense).

/me just back from vacations and still catching up with the never ending pile of email.

I don't have access to a Redmi Note 4 device, but please let me know if there is anything we could do to help.
This is off PostTaskAndReply, though, in a worker thread. Cronet initialization is an explicit function call, too, isn't it?


Comment 6 by cblume@chromium.org, Jun 20 2018

I don't see a specific Cronet initialization here. But you are right that this thread doesn't have a prior dlopen() call.

I don't think it is one-per-thread though. I think it is one-per-process. I am guessing at this but I believe if another thread is in the middle of a dlopen() call and this thread happens to also call dlopen() we'll encounter this issue.


Is there a way we can check of the Cronet users could possibly have two threads both calling dlopen()?

There may be some not-so-lovely workarounds available. Given the call stack, it is clear that android_getCpuFeatures() is just calling getauxval(). We could use the getauxval() code path on Android  like we do on Linux. Or we could do what Chrome does and have an explicit initialization step which will call into android_getCpuFeatures() so we know the initialization steps happen in serial and there are no parallel calls to dlopen().
Owner: pauljensen@chromium.org
It's very possible that we have a parallel call to dlopen().
+pauljensen@: looks like you are on sheriff duty this week, could you help with adding android_getCpuFeatures() to Cronet's initialization step?

Comment 8 by cblume@chromium.org, Jun 20 2018

I may have said something in an unclear way.

I am not sure if an initialization call to android_getCpuFeatures() will cause future calls to no longer call dlopen(). I would assume so but I don't know that.

What I was suggesting is an initialization call into zlib so it can make its call into android_getCpuFeatures() and never need to call it again.
All reports come from API level 23, Marshmallow.  99.69% of reports come from Redmi Note 4 and 4X.
Cc: dimitry@google.com
+dimitry@google.com who has worked a lot on bionic's dynamic linker.

Dimitry: are you aware of any Marshmallow-only dlopen() crashes with similar call-stacks?

Comment 11 by dimitry@google.com, Jun 21 2018

tl;dr; if this is a bug I am thinking about this happens on unsuccessful dlopen with RTLD_NODELETE flag or if library has DF_1_NODELETE flag set and this was fixed for NYC.

https://b.corp.google.com/issues/27911891 is the bug in question. I do not remember is this was regression or bug exists in mnc as well, sounds like it might have. If this is the case removing NODELETE flag should result in normal dlerror().


It's Android NDK code:
  http://androidxref.com/6.0.0_r1/xref/ndk/sources/android/cpufeatures/cpu-features.c#506
that looks like this:
  dlopen("libc.so", RTLD_NOW)
so I don't imagine it would be failing to load libc and it doesn't appear to use RTLD_NODELETE.

Comment 13 by dimitry@google.com, Jun 21 2018

Ah.. yes. It fails on any subsequent dlopens.

The problem is that previous unsuccessful dlopen failed to clear internal linker structures and left them in the state where next access to get_soinfo() results in attempt to access unmmaped memory.

Are there any NODELETE libraries in the project?
It looks like the only NODELETE dlopen calls in Chromium are in media/ and content/ which aren't included in Cronet:
https://cs.chromium.org/search/?q=RTLD_NODELETE+file:%5C.cc&sq=package:chromium&type=cs

However, the app using Cronet (see b/110262915) could be using NODELETE, though a quick search didn't turn up any results.
Is there any way to prevent the crash?  like issuing a failed dlopen() without NODELETE?

Comment 16 by dimitry@google.com, Jun 21 2018

yes, removing NODELETE flag from dlopen will fix this. Come to think of it the failing dlopen with NODELETE should not crash. It should result in normal dlerror(). But all subsequent dlopens are doomed...
That media call makes me wonder if something in the HAL or something for the device may be using that flag... is there any way to check?


I looked quickly through Android Marshmallow source but I can't see into the device-specific HAL stuff.  Unfortunately I don't know where the NODELETE dlopen call is...
It's interesting that dlopen for libcronet succeeds, but then later the dlopen call from Cronet fails.  I have a fix that perform the zlib initialization earlier, but I've no way of knowing if we'd hit this crash later from some other dlopen call in Cronet.

Here's my proposed fix:
https://chromium-review.googlesource.com/c/chromium/src/+/1110377
I'm trying to see if I can get a hold of a Redmi Note 4 to test.
I'm sure you have all considered this but I haven't seen it said yet so I felt the need to say it.

Since this is only happening on one set of devices for one version of Android, there is a very high chance that the device includes modifications to the AOSP which introduced a bug.

Is that perhaps what you are saying with the HAL stuff? Maybe the bug is in the HAL layer?
@paul: my understanding is that cronet does CPU features detection to enable optimizations?

Assuming that the chain of calls allowed (i.e Android app calls dlopen() to load cronet and nothing else calls into zlib), why not pass on to zlib the CPU features information?

We could have a state variable in zlib (i.e. arm_detect_cpu_features) that if set to 'false' we simply skip the call into the NDK function. Then it would be up to cronet to set the flag that enables the optimized code (https://cs.chromium.org/chromium/src/third_party/zlib/arm_features.c?l=22).

One concern in zlib is that it runs in both a privileged level in Chromium (e.g. network operations) as also in a sandboxed level (i.e. in RenderProcess when we perform PNG decoding).

One advantage of implementing this flag in zlib is that it would allow to perform experiments (i.e. disable/enable optimizations). Another advantage is that it would allow to avoid doing the same work twice in zlib (i.e. detect CPU features) in the context of cronet.

Assuming this design is acceptable, it would require one change in cronet and one change in zlib. I would be more than happy to help on this one.

> Is that perhaps what you are saying with the HAL stuff? Maybe the bug is in the HAL layer?

It's possible there is a bug in the HAL, or perhaps the HAL is loading a library with RTLD_NODELETE.  However I suspect updating Cronet may be the most expeditious way to avoid this crash.

> @paul: my understanding is that cronet does CPU features detection to enable optimizations?

Various pieces of Cronet do CPU feature detection (e.g. BoringSSL, zlib etc).

We could have a way to pass CPU feature detection flags into zlib.  I think I'd prefer to try my simpler less-risky fix first to see if it fixes the issue before investing more time going down this route.  A simple safer fix also is more amenable to merging into release branches.
Project Member

Comment 22 by bugdroid1@chromium.org, Jun 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fccbefeaeb30340b3f4240cec51088fb824581eb

commit fccbefeaeb30340b3f4240cec51088fb824581eb
Author: Paul Jensen <pauljensen@chromium.org>
Date: Mon Jun 25 16:23:44 2018

[Cronet] Initialize zlib ARM CPU feature detection early

Later attempts to perform this initialization seem to fail
potentially because of a dlopen() bug, so perform the
initialization early, right after dlopen() on libcronet
succeeded.

Bug:  853725 
Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
Change-Id: I9132dff672f91d7962909531f01c6bada4fee964
Reviewed-on: https://chromium-review.googlesource.com/1110377
Reviewed-by: Andrei Kapishnikov <kapishnikov@chromium.org>
Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org>
Commit-Queue: Paul Jensen <pauljensen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#570057}
[modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/DEPS
[modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/android/BUILD.gn
[modify] https://crrev.com/fccbefeaeb30340b3f4240cec51088fb824581eb/components/cronet/android/cronet_library_loader.cc

Comment 23 by noel@chromium.org, Jun 26 2018

#6 "We could use the getauxval() code path on Android like we do on Linux."  Interesting idea: wonder if that'd work?  Indeed, any reasons I'm missing why it would not work on Android?  

Status: Assigned (was: Untriaged)
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
Marking fixed per comment 22.
Status: Fixed (was: Assigned)

Sign in to add a comment