Crash/race in Mojo cancellation system.
Reported by dyaros...@yandex-team.ru, Feb 21 2017
Issue description:

Mojo experts, help wanted. I've encountered a recurring crash in Mojo in a number of reports from users and in perf tests. Even though I haven't found this crash in Chromium itself, I believe there is a very reasonable chance of it happening in the future.

    Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
    Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000020

    base::subtle::RefCountedThreadSafeBase::AddRef() const [inlined] std::__1::__atomic_base<long, true>::fetch_add(long, std::__1::memory_order) volatile at atomic:977:17
    mojo::Watcher::CallOnHandleReady(unsigned long, unsigned int, MojoHandleSignalsState, unsigned int) [inlined] base::WeakPtr<mojo::Watcher>::WeakPtr(base::WeakPtr<mojo::Watcher> const&) + 13 at weak_ptr.h:210:7
    mojo::edk::Watcher::MaybeInvokeCallback(unsigned int, mojo::edk::HandleSignalsState const&, unsigned int) [inlined] base::Lock::Release() at lock.h:31:26
    mojo::edk::RequestContext::~RequestContext() [inlined] mojo::edk::RequestContext::~RequestContext() at request_context.cc:31:35
    mojo::edk::Core::WriteMessageNew(unsigned int, unsigned long, unsigned int) + 267 at core.cc:702:1
    mojo::Connector::Accept(mojo::Message*) [inlined] mojo::WriteMessageNew(mojo::MessagePipeHandle, mojo::ScopedHandleBase<mojo::MessageHandle>, unsigned int) + 22 at message_pipe.h:97:10
    service_manager::mojom::ConnectorProxy::Clone(mojo::InterfaceRequest<service_manager::mojom::Connector>) + 233 at connector.mojom.cc:291:1
    service_manager::ConnectorImpl::BindRequest(mojo::InterfaceRequest<service_manager::mojom::Connector>) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
    (anonymous namespace)::BindToBrowserConnector(mojo::InterfaceRequest<service_manager::mojom::Connector>) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
    base::internal::Invoker<base::internal::BindState<void (*)(mojo::InterfaceRequest<service_manager::mojom::Connector>), base::internal::PassedWrapper<mojo::InterfaceRequest<service_manager::mojom::Connector> > >, void ()>::Run(base::internal::BindStateBase*) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
    base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) [inlined] base::Callback<void (), (base::internal::CopyMode)0, (base::internal::RepeatMode)0>::~Callback() at callback_forward.h:29:7
    base::MessageLoop::RunTask(base::PendingTask*) [inlined] std::__1::vector<base::MessageLoop::TaskObserver*, std::__1::allocator<base::MessageLoop::TaskObserver*> >::empty() const at vector:638:23
    base::MessageLoop::DeferOrRunPendingTask(base::PendingTask) + 44 at message_loop.cc:432:5
    base::MessageLoop::DoWork() + 371 at message_loop.cc:525:13
    base::MessagePumpCFRunLoopBase::RunWork() + 45 at message_pump_mac.mm:302:30

These tests: https://codereview.chromium.org/2709703002/ reproduce it in Chromium (in debug mode they fail with an invalid-mutex error in the thread checker; in release, with a very similar stack trace). A self-cancelling test is legal judging by this comment: https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.cc?q=watcher.cc&l=84. Cancelling another watcher also seems to be legal, because I couldn't find anything stating otherwise.

Reasons that lead to the crash: judging by this comment: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 the Mojo cancellation system tries to support cancelling while callbacks are running.
This attempt doesn't work, because everyone uses the first created request context: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=27 - so creating a new instance doesn't seem to do anything, and the active request context has already processed all cancellations: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=49

Another piece of code that looks problematic invokes an arbitrary callback while holding a lock: https://cs.chromium.org/chromium/src/mojo/edk/system/watcher.cc?q=watcher.cc&l=20 - this can backfire while fixing this issue, or it might be the source of a not-yet-found bug.

If a Mojo expert could look at this problem it would be great.

PS: As far as I understand, the main reason for creating a new cancellation system was to pass it through C code. Wouldn't it be easier / less error-prone to pass a WeakPtr through C?
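For concreteness, here is a minimal sketch of the kind of scenario the tests above exercise. This is not the exact test from the linked CL; it assumes the current mojo::Watcher API (Start(handle, signals, callback) / Cancel()), exact signatures may differ, and the helper name CancelOtherWatcherFromCallback is made up. It also assumes it runs on a thread with a message loop, as in a unit test.

    // Sketch only: two watchers on the same pipe, where the first watcher's
    // callback cancels the second while the EDK is still dispatching
    // notifications for the same event.
    #include "base/bind.h"
    #include "mojo/public/cpp/system/message_pipe.h"
    #include "mojo/public/cpp/system/watcher.h"

    void CancelOtherWatcherFromCallback() {
      mojo::MessagePipe pipe;
      mojo::Watcher a;
      mojo::Watcher b;

      // |b| does not expect any notification once it has been cancelled.
      b.Start(pipe.handle1.get(), MOJO_HANDLE_SIGNAL_READABLE,
              base::Bind([](MojoResult) { /* should never run after Cancel() */ }));

      // |a|'s callback cancels |b|. Nothing documents this as illegal, but the
      // notification for |b| was already queued on the active RequestContext
      // when the write below signalled the pipe, so it can still be dispatched
      // after the cancellation.
      a.Start(pipe.handle1.get(), MOJO_HANDLE_SIGNAL_READABLE,
              base::Bind([](mojo::Watcher* other, MojoResult) { other->Cancel(); },
                         base::Unretained(&b)));

      // Writing on the other end signals both watchers within a single
      // top-level RequestContext (compare the WriteMessageNew frame in the
      // stack above).
      mojo::WriteMessageRaw(pipe.handle0.get(), "x", 1, nullptr, 0,
                            MOJO_WRITE_MESSAGE_FLAG_NONE);
    }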
Feb 21 2017
Thanks for the report and the repro. I'm looking into it. In reference to https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 please note that the topmost RequestContext has been reset already -- we know we're in its destructor because IsCurrent() was true -- so the fresh one established here is in fact used to capture any new notifications triggered by watcher callbacks.
Feb 21 2017
@roc...

Thanks for the report and the repro. I'm looking into it.
==
Great! A note on the repro: if you run the test with DCHECKs on, it will hit an error in the thread checker in the watcher: https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.h?l=92 - the pthread library returns 22, which as far as I understand means an invalid mutex (EINVAL). To get the attached stack you can either remove the thread checker from the watcher or build in release (the thread checker does nothing in release builds).

In reference to https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 please note that the topmost RequestContext has been reset already -- we know we're in its destructor because IsCurrent() was true -- so the fresh one established here is in fact used to capture any new notifications triggered by watcher callbacks.
==
Can you please clarify how that would work? As far as I can see, everyone uses the current context: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?l=69 - the nested one won't be current, so it won't get any notifications. Am I wrong?
Feb 21 2017
It works because it's *not* nested. The one you're calling the current context is no longer the current context, because this destructor has reset the TLS ptr to null.
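To make that concrete, here is a simplified model of the TLS handling described above. This is illustrative only, not the real request_context.cc; the names g_current, AddFinalizer and context_for_callbacks are invented for the sketch.

    #include <functional>
    #include <utility>
    #include <vector>

    class RequestContext;
    thread_local RequestContext* g_current = nullptr;  // models the TLS slot

    class RequestContext {
     public:
      RequestContext() {
        // Only the outermost context on this thread becomes "current"; contexts
        // created while one is already active are nested and stay passive.
        if (!g_current)
          g_current = this;
      }

      bool IsCurrent() const { return g_current == this; }

      ~RequestContext() {
        if (!IsCurrent())
          return;  // Nested: the top-level context will do the flushing.

        // Reset the TLS pointer *before* running any callbacks, so this context
        // is no longer "current" while they execute.
        g_current = nullptr;

        if (finalizers_.empty())
          return;

        // Establish a fresh context for the duration of the flush. Because the
        // TLS slot is null again, it is *not* nested: it becomes the new
        // top-level context, captures whatever notifications the callbacks
        // below trigger, and flushes them itself when it goes out of scope.
        RequestContext context_for_callbacks;

        for (auto& finalizer : finalizers_)
          finalizer();
      }

      static RequestContext* current() { return g_current; }

      void AddFinalizer(std::function<void()> f) {
        finalizers_.push_back(std::move(f));
      }

     private:
      std::vector<std::function<void()>> finalizers_;
    };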
Feb 21 2017
@roc Oh, I see. Thanks.
Feb 21 2017
The actual cause of the bug appears to be, more specifically:

1. Add watcher W to pipe handle A.
2. Within a single top-level RequestContext:
   2a. Close pipe handle A.
   2b. Cancel watcher W.

The issue is that closing pipe A in 2a enqueues a notification to watcher W for dispatch on RequestContext unwind. Then, because A has already been closed, the cancellation of W in 2b is necessarily a no-op: there is no longer an internal reference to W which could be used to enqueue a cancellation finalizer (which itself would normally block any pending notifications to the cancelled watcher). The net result is that the cancellation notification triggered by 2a is incorrectly dispatched.

This may be a tricky problem to solve, and while I do think it's worth solving, AFAICT we will never hit it in Chromium unless someone explicitly uses the low-level Mojo API this way. Our bindings objects always ensure that their Watchers are explicitly cancelled before their corresponding message pipe is closed.

A short-term fix may be to specify in the documentation that a mojo::Watcher must outlive the handle it's watching, and likewise that MojoCancelWatch must be called to cancel any watches added by MojoWatch before the corresponding handle is closed.
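In code, the problematic ordering looks roughly like the sketch below. This is a hedged sketch against the C watch API named above (MojoWatch/MojoCancelWatch); the callback signature is approximated from the crash stack, and g_pipe / g_watch_context are placeholder globals standing in for handle A and watcher W's context from step 1.

    #include <stdint.h>
    #include "mojo/public/c/system/core.h"

    MojoHandle g_pipe;          // pipe handle A, set up elsewhere (placeholder)
    uintptr_t g_watch_context;  // context passed to MojoWatch() for W (placeholder)

    // Invoked by the EDK for some watch, i.e. inside a top-level RequestContext.
    void OnReady(uintptr_t context, MojoResult result,
                 struct MojoHandleSignalsState state, uint32_t flags) {
      // 2a. Closing A queues a MOJO_RESULT_CANCELLED notification for W as a
      //     finalizer on the current RequestContext; it runs on unwind.
      MojoClose(g_pipe);

      // 2b. A is already closed, so the EDK no longer holds a reference to W's
      //     watch. This call is necessarily a no-op and cannot enqueue the
      //     cancellation finalizer that would normally block the notification
      //     queued in 2a.
      MojoCancelWatch(g_pipe, g_watch_context);

      // When the RequestContext unwinds, the stale notification from 2a is
      // dispatched into the already-cancelled watcher.
    }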
Feb 21 2017
Also note that in the repro description from comment #8 the order of 2a and 2b matters. If done in the reverse order, there's no problem.
Mar 1 2017
@roc Do I understand correctly that this: https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.cc?q=watcher.cc&l=63 should be a DCHECK?
Mar 1 2017
No, that should not be a DCHECK. A watch is implicitly cancelled if its handle is closed. This is conveyed by notifying the watcher with MOJO_RESULT_CANCELLED. The Watcher class defined here invalidates |handle_| if it receives that notification, and it's still legal for the user to call Cancel() after this happens. We could make it a DCHECK and then require callers to check IsWatching() before Cancel(), but that seems unnecessary.
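From the user's perspective, a small hedged sketch of the two conventions discussed here, using only the Cancel()/IsWatching() calls mentioned above (the helper name is made up):

    #include "mojo/public/cpp/system/watcher.h"

    // |watcher|'s handle has already been closed, so it has been notified with
    // MOJO_RESULT_CANCELLED and has invalidated its internal handle.
    void CancelAfterImplicitCancellation(mojo::Watcher& watcher) {
      // Today: calling Cancel() after the implicit cancellation is a legal
      // no-op.
      watcher.Cancel();

      // If Cancel() DCHECKed instead, every caller would have to guard it:
      if (watcher.IsWatching())
        watcher.Cancel();
    }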
Mar 21 2017
Based on the recent changes to the watcher API, I'm closing this as obsolete. There should no longer be any surprises in the behavior, as we no longer try to make difficult guarantees. This requires a slightly more complex API but has the benefit of being more predictable.
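If I read this right, the change referred to is the move to an explicitly armed watcher model; present-day Chromium exposes it in C++ as mojo::SimpleWatcher. A rough sketch follows, using current include paths and signatures, which may differ from what existed in 2017:

    #include "base/functional/bind.h"
    #include "base/location.h"
    #include "mojo/public/cpp/system/message_pipe.h"
    #include "mojo/public/cpp/system/simple_watcher.h"

    // Assumes a task runner is available on the current thread (e.g. inside a
    // test with a task environment).
    void WatchWithManualArming() {
      mojo::MessagePipe pipe;

      // MANUAL arming: the callback only fires while the watcher is explicitly
      // armed, so there is no implicit pending notification left to race with
      // cancellation or with closing the handle.
      mojo::SimpleWatcher watcher(FROM_HERE,
                                  mojo::SimpleWatcher::ArmingPolicy::MANUAL);
      watcher.Watch(pipe.handle1.get(), MOJO_HANDLE_SIGNAL_READABLE,
                    base::BindRepeating([](MojoResult result) {
                      // Handle the event, then re-arm when ready for the next
                      // notification.
                    }));
      watcher.ArmOrNotify();
    }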
Apr 5 2017
@roc Seems like the original crash no longer reproduces. And you've added my tests, which is great!