Crash while locking port->lock in mojo::edk::ports::Node::OnObserveClosure |
|||||||||
Issue descriptionLink to crashes: https://crash.corp.google.com/browse?q=product.name%3D%27AndroidWebView%27%20AND%20product.Version%20LIKE%20%2754.%25%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3Ainternal%3A%3ALockImpl%3A%3ALock%27%20OMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName%3D%27mojo%3A%3Aedk%3A%3Aports%3A%3ANode%3A%3AOnObserveClosure%27)%20%3D%200&ignore_case=false&enable_rewrite=true&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=mojo%3A%3Aedk%3A%3Aports%3A%3ANode%3A%3AOnObserveClosure&omit_field_opt=%3D#samplereports Occurs only on Nexus 9, so we think this may be a Tegra K1 erratum. GetPort looks up a port by name in the ports_ map. It presumably finds the port because there is a null check: int Node::OnObserveClosure(const PortName& port_name, uint64_t last_sequence_num) { scoped_refptr<Port> port = GetPort(port_name); if (!port) return OK; { base::AutoLock lock(port->lock); ... } } GetPort_Locked does this: 00000000011b846c <_ZN4mojo3edk5ports4Node14GetPort_LockedERKNS1_8PortNameE>: 11b846c: a9be7bfd stp x29, x30, [sp,#-32]! 11b8470: 91010000 add x0, x0, #0x40 11b8474: 910003fd mov x29, sp 11b8478: f9000bf3 str x19, [sp,#16] 11b847c: aa0803f3 mov x19, x8 11b8480: 97ffffca bl 11b83a8 <std::unordered_map::find> 11b8484: b5000060 cbnz x0, 11b8490 <_ZN4mojo3edk5ports4Node14GetPort_LockedERKNS1_8PortNameE+0x24> 11b8488: f9000260 str x0, [x19] 11b848c: 14000005 b 11b84a0 <_ZN4mojo3edk5ports4Node14GetPort_LockedERKNS1_8PortNameE+0x34> 11b8490: f9401000 ldr x0, [x0,#32] # x0 now points at the Port 11b8494: f9000260 str x0, [x19] # stored in the scoped_refptr<Port> in the stack frame of OnObserveClosure 11b8498: b4000040 cbz x0, 11b84a0 <_ZN4mojo3edk5ports4Node14GetPort_LockedERKNS1_8PortNameE+0x34> # we assume x0 is not 0 at this point, because of the null check below 11b849c: 97d61dda bl 73fc04 <base::subtle::RefCountedThreadSafeBase::AddRef() const> # so it is used to AddRef the Port instance, and can't be -1 at this point. 11b84a0: aa1303e0 mov x0, x19 11b84a4: f9400bf3 ldr x19, [sp,#16] 11b84a8: a8c27bfd ldp x29, x30, [sp],#32 11b84ac: d65f03c0 ret So the fact that the resulting scoped_refptr target is AddRef'ed without crashing means that at this point everything is ok. In OnObserveClosure, however, the story is different. 11b8ac0: 97fffe7c bl 11b84b0 <mojo::edk::ports::Node::GetPort(mojo::edk::ports::PortName const&)> 11b8ac4: f94037a3 ldr x3, [x29,#104] # the location of the scoped_refptr<Port> 11b8ac8: b40009e3 cbz x3, 11b8c04 <_ZN4mojo3edk5ports4Node16OnObserveClosureERKNS1_8PortNameEm+0x174> # x3 is not 0 at this point. 11b8acc: 91001060 add x0, x3, #0x4 # adding 4 produces 3, which is passed to pthread_mutex_lock, so x0, as loaded from the stack, is -1 11b8ad0: f9003fbf str xzr, [x29,#120] 11b8ad4: f9004fa0 str x0, [x29,#152] 11b8ad8: f90043bf str xzr, [x29,#128] 11b8adc: f90047bf str xzr, [x29,#136] 11b8ae0: f9004bbf str xzr, [x29,#144] 11b8ae4: 97d6c462 bl 769c6c <_ZN4base8internal8LockImpl4LockEv> pthread_mutex_lock faults at 0x3, which is produced in x0 by adding 4 to the scoped_refptr value, the stack value of which is now -1 (which passes the null check).
,
Nov 16 2016
crash in mojo::edk that's device specific, how odd.. ken should someone on mojo team look into this one? 0x0000007690480818 (libc.so + 0x00068818 ) pthread_mutex_lock 0x0000007679b60c8c (libmonochrome.so -lock_impl_posix.cc:64 ) base::internal::LockImpl::Lock 0x000000767a5afae4 (libmonochrome.so -lock.h:26 ) mojo::edk::ports::Node::OnObserveClosure 0x000000767a5b1f08 (libmonochrome.so -node.cc:349 ) mojo::edk::ports::Node::AcceptMessage 0x000000767a5a19dc (libmonochrome.so -node_controller.cc:696 ) mojo::edk::NodeController::AcceptIncomingMessages 0x000000767a5a1fb8 (libmonochrome.so -node_controller.cc:716 ) mojo::edk::NodeController::ProcessIncomingMessages 0x000000767a59fa74 (libmonochrome.so -bind_internal.h:214 ) base::internal::Invoker<base::internal::BindState<void (mojo::edk::NodeController::*)(), base::internal::UnretainedWrapper<mojo::edk::NodeController> >, void()>::Run 0x0000007679b202b8 (libmonochrome.so -callback.h:388 ) base::debug::TaskAnnotator::RunTask 0x0000007679b3987c (libmonochrome.so -message_loop.cc:488 ) base::MessageLoop::RunTask 0x0000007679b3a1cc (libmonochrome.so -message_loop.cc:497 ) base::MessageLoop::DeferOrRunPendingTask 0x0000007679b3a590 (libmonochrome.so -message_loop.cc:621 ) base::MessageLoop::DoWork 0x0000007679b3ba8c (libmonochrome.so -message_pump_libevent.cc:217 ) base::MessagePumpLibevent::Run 0x0000007679b39e1c (libmonochrome.so -message_loop.cc:451 ) base::MessageLoop::RunHandler 0x0000007679b51604 (libmonochrome.so -run_loop.cc:35 ) base::RunLoop::Run 0x000000767a736740 (libmonochrome.so -browser_thread_impl.cc:243 ) content::BrowserThreadImpl::IOThreadRun 0x000000767a736834 (libmonochrome.so -browser_thread_impl.cc:278 ) content::BrowserThreadImpl::Run 0x0000007679b6efd0 (libmonochrome.so -thread.cc:301 ) base::Thread::ThreadMain 0x0000007679b6ab34 (libmonochrome.so -platform_thread_posix.cc:71 ) ThreadFunc Happens on chrome as well where 98% is on nexus 9. Definitely got worse in M54 https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27mojo%3A%3Aedk%3A%3Aports%3A%3ANode%3A%3AOnObserveClosure%27%20AND%20product.name%3D%27Chrome_Android%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,devicemodel
,
Nov 16 2016
Also 90% of this is on N. I dunno if really 90% of nexus 9s are upgraded to N already though.
,
Nov 16 2016
Given that this happens in both Chrome and WebView, across multiple versions of M54 and later (but not earlier versions afaict), it really seems like the exact code sequence in question must be at fault; the addresses have moved around in each build, but the behaviour is always exactly the same issue. It does seem like a cpu erratum is a definite possibility here :/ Was this code either substantially different, or not executed at all on android, in M53 and earlier?
,
Nov 16 2016
I don't think it's that unlikely that 90% of actively-used-by-anyone Nexus 9's are on N by now. The OTA has been out for a while. Devices that haven't updated are probably mostly sitting turned off in a drawer and aren't submitting crash reports.
,
Nov 16 2016
,
Nov 16 2016
actually +amineer instead of RBS..
,
Nov 16 2016
There are also a small number of crashes in Node::AcceptMessage - https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20AND%20crash.Address%3D3%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27mojo%3A%3Aedk%3A%3Aports%3A%3ANode%3A%3AAcceptMessage%27%20OMIT%20RECORD%20IF%20SUM(ProductData.Key%3D%27model%27)%20%3D%200%20OR%20SUM(ProductData.Value%3D%27Nexus%209%27)%20%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=
,
Nov 16 2016
Yeah, this makes me suspect that the problematic code sequence is the one in GetPorts, and so any function in this file that calls GetPorts might crash in this way.
,
Nov 16 2016
Issue 645653 has been merged into this issue.
,
Nov 16 2016
We've seen issue 645653 for a while and I could only assume it had to be some kind of hardware issue. I've pored over the code here many times and don't think it's possible for it to be a logical error. What do we typically do in cases like this?
,
Nov 16 2016
RB-Stable for M55 for sure, this is the #10 overall WebView crash and one of our top Android N crashes as well. Is there some way to blacklist the functionality that's causing this on the impacted chipset?
,
Nov 16 2016
No, this is fundamental IPC routing logic in Chrome.
,
Nov 16 2016
I think we need test team to get a nexus 9 on N and see if they can reproduce this..
,
Nov 16 2016
Most of the user comments in the reports I saw said 'happens at random', so this is going to be hard to repro, and adding any code to log anything around this is probably going to stop it from happening, as well.
,
Nov 16 2016
I think knowing the crash rate would help. If it's once very few minutes, then I think there's a pretty good chance we can verify any change we do. If it's higher, then we are really screwed.. Either way, I dunno how to workaround cpu bugs, but I'm guessing it's similar to gpu bugs, bang your head against it until the crash goes away...
,
Nov 16 2016
OK, so my extremely vague ideas here: 1) The problem is that between 11b8494: f9000260 str x0, [x19] and 11b8ac4: f94037a3 ldr x3, [x29,#104] (which both refer to the same address), the valid value we stored from x0 has become -1 loaded into x3. This means something went wrong in "the memory system", to handwave somewhat. (everything that happens outside the registers). 2) Between those two things, we do an atomic increment on the reference count. This involves both issuing a memory barrier, and executing a load/store exclusive pair to do the actual increment. This is a nontrivial memory system operation that is a common place for CPU bugs to hide. I suspect that if this was a non-threadsafe refcounted object (and thus didn't use an atomic memory access) this bug wouldn't occur. 3) It's possible that the location of the pointer on the stack, and the actual refcount in the object, purely by coincidence of their physical addresses, are sharing a cache location and so one line evicts the other. This is also another place bugs can hide, we can't do anything about it, and it would explain why it happens at random. We might be able to make this problem go away for now (though it won't fix "the real bug") by simply perturbing this code enough that the generated code sequence happens not to trigger the erratum any more, but without knowing specifically what the bug is, this is tricky. Adding additional memory accesses (e.g. a number of CHECK()s to assert things are all valid throughout) might do it, but it depends how much the compiler inlines/reorders/etc. We should probably reach out to nvidia (via internal contacts) to ask if they know of any relevant errata for the Tegra K1. The errata documents I can find internally don't cover that chip.
,
Nov 17 2016
I'm going to attempt to build a simple stress test to repro this locally on a Nexus 9.
,
Nov 17 2016
I wrote a trivial test here which reliablu repros the crash on a Nexus 9 device: https://codereview.chromium.org/2514553002 - would anyone be able to provide some assistance/guidance on how to dig in further to test any prevailing theories? Alternatively I can take the approach of blind code perturbation and use this test as an indicator of likely success.
,
Nov 17 2016
I've amended the above CL (https://codereview.chromium.org/2514553002) with what appears to be a reliable workaround.
,
Nov 18 2016
correction to my comment 17: the atomic increment of the refcount *doesn't* include a memory barrier, we only issue a barrier when decrementing a refcount (which is fine, I just forgot). Adding an explicit barrier somewhere here might be enough to avoid the issue as it will limit what reordering/etc the cpu can do.
,
Nov 18 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/bb50f7d41c6aded291aec15a16853a1cf044c275 commit bb50f7d41c6aded291aec15a16853a1cf044c275 Author: rockot <rockot@chromium.org> Date: Fri Nov 18 20:51:40 2016 Mojo EDK: Work around Nexus 9 hardware bug Adds a memory barrier to port table lookup, which is apparently sufficient to avoid a nasty Nexus 9 bug. BUG= 665869 Review-Url: https://codereview.chromium.org/2514553002 Cr-Commit-Position: refs/heads/master@{#433291} [modify] https://crrev.com/bb50f7d41c6aded291aec15a16853a1cf044c275/mojo/edk/system/message_pipe_unittest.cc [modify] https://crrev.com/bb50f7d41c6aded291aec15a16853a1cf044c275/mojo/edk/system/ports/node.cc
,
Nov 21 2016
[Bulk edit] URGENT - PTAL ASAP. This issue is marked as a stable release blocker for Android M55. We will be cutting our stable candidate from branch 2883 on Tuesday, 11/29, so *all* blockers must be fixed on trunk by Monday, 11/28 to allow time to merge back to the branch. Please review this issue ASAP. Know that this issue shouldn't block the release? Remove the ReleaseBlock-Stable label. Unsure if the issue should block the release, or think that the issue should block the release but you know you won't be able to fix it in time? Please CC me to the bug so that we can discuss options. Thanks!
,
Nov 21 2016
I believe this issue is fixed. We won't know for sure until there are more users in the wild, but the test I added for reliable repro indicates that the bug is sufficiently worked-around.
,
Nov 21 2016
Your change meets the bar and is auto-approved for M55 (branch: 2883)
,
Nov 22 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3ffb52589a6f429c8aba0f70807cb9d322cb8ffe commit 3ffb52589a6f429c8aba0f70807cb9d322cb8ffe Author: Ken Rockot <rockot@chromium.org> Date: Mon Nov 21 23:56:50 2016 Mojo EDK: Work around Nexus 9 hardware bug Adds a memory barrier to port table lookup, which is apparently sufficient to avoid a nasty Nexus 9 bug. BUG= 665869 Review-Url: https://codereview.chromium.org/2514553002 Cr-Commit-Position: refs/heads/master@{#433291} (cherry picked from commit bb50f7d41c6aded291aec15a16853a1cf044c275) Review URL: https://codereview.chromium.org/2522743002 . Cr-Commit-Position: refs/branch-heads/2883@{#638} Cr-Branched-From: 614d31daee2f61b0180df403a8ad43f20b9f6dd7-refs/heads/master@{#423768} [modify] https://crrev.com/3ffb52589a6f429c8aba0f70807cb9d322cb8ffe/mojo/edk/system/message_pipe_unittest.cc [modify] https://crrev.com/3ffb52589a6f429c8aba0f70807cb9d322cb8ffe/mojo/edk/system/ports/node.cc |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by sheriffbot@chromium.org
, Nov 16 2016