
Issue 694526

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
please use my google.com address
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows, Mac
Pri: 3
Type: Bug




Crash/race in mojo cancellation system.

Reported by dyaros...@yandex-team.ru, Feb 21 2017

Issue description

Mojo experts help wanted.

I've encountered a recurring crash in Mojo in a number of reports from users and in perftests. Even though I haven't found this crash in Chromium itself, I believe there is a very reasonable chance of it happening in the future.

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000020

base::subtle::RefCountedThreadSafeBase::AddRef() const [inlined] std::__1::__atomic_base<long, true>::fetch_add(long, std::__1::memory_order) volatile at atomic:977:17
mojo::Watcher::CallOnHandleReady(unsigned long, unsigned int, MojoHandleSignalsState, unsigned int) [inlined] base::WeakPtr<mojo::Watcher>::WeakPtr(base::WeakPtr<mojo::Watcher> const&) + 13 at weak_ptr.h:210:7
mojo::edk::Watcher::MaybeInvokeCallback(unsigned int, mojo::edk::HandleSignalsState const&, unsigned int) [inlined] base::Lock::Release() at lock.h:31:26
mojo::edk::RequestContext::~RequestContext() [inlined] mojo::edk::RequestContext::~RequestContext() at request_context.cc:31:35
mojo::edk::Core::WriteMessageNew(unsigned int, unsigned long, unsigned int) + 267 at core.cc:702:1
mojo::Connector::Accept(mojo::Message*) [inlined] mojo::WriteMessageNew(mojo::MessagePipeHandle, mojo::ScopedHandleBase<mojo::MessageHandle>, unsigned int) + 22 at message_pipe.h:97:10
service_manager::mojom::ConnectorProxy::Clone(mojo::InterfaceRequest<service_manager::mojom::Connector>) + 233 at connector.mojom.cc:291:1
service_manager::ConnectorImpl::BindRequest(mojo::InterfaceRequest<service_manager::mojom::Connector>) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
(anonymous namespace)::BindToBrowserConnector(mojo::InterfaceRequest<service_manager::mojom::Connector>) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
base::internal::Invoker<base::internal::BindState<void (*)(mojo::InterfaceRequest<service_manager::mojom::Connector>), base::internal::PassedWrapper<mojo::InterfaceRequest<service_manager::mojom::Connector> > >, void ()>::Run(base::internal::BindStateBase*) [inlined] scoped_refptr<mojo::ProcessResumer>::~scoped_refptr() at ref_counted.h:310:9
base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) [inlined] base::Callback<void (), (base::internal::CopyMode)0, (base::internal::RepeatMode)0>::~Callback() at callback_forward.h:29:7
base::MessageLoop::RunTask(base::PendingTask*) [inlined] std::__1::vector<base::MessageLoop::TaskObserver*, std::__1::allocator<base::MessageLoop::TaskObserver*> >::empty() const at vector:638:23
base::MessageLoop::DeferOrRunPendingTask(base::PendingTask) + 44 at message_loop.cc:432:5
base::MessageLoop::DoWork() + 371 at message_loop.cc:525:13
base::MessagePumpCFRunLoopBase::RunWork() + 45 at message_pump_mac.mm:302:30

These tests reproduce it in Chromium: https://codereview.chromium.org/2709703002/ (in debug mode they fail with an invalid mutex in the thread checker; in release, with a very similar stack trace).

The self-canceling test is legal judging by this comment:
https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.cc?q=watcher.cc&l=84
Canceling another watcher also seems to be legal, because I couldn't find anything that states otherwise.

Reasons that lead to the crash:
Judging by this comment: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 the Mojo cancellation system tries to support cancellation while callbacks are running. This attempt doesn't work because everybody uses the first-created request context: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=27 so creating a new instance doesn't seem to do anything, and the active request context has already processed all cancellations: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=49

Another piece of code that seems problematic calls an arbitrary callback while holding a lock: https://cs.chromium.org/chromium/src/mojo/edk/system/watcher.cc?q=watcher.cc&l=20 - this can backfire while fixing this issue, or it might be the source of a not-yet-found bug.
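
To illustrate the general risk (a self-contained sketch, not Mojo code; the Notifier class and its members are invented for this example): if the callback that runs under the lock re-enters the same object, it tries to re-acquire a non-recursive mutex on the same thread and deadlocks.

// Generic illustration (not Mojo code) of why invoking an arbitrary user
// callback while holding a lock is risky.
#include <functional>
#include <mutex>

class Notifier {
 public:
  void SetCallback(std::function<void()> cb) { callback_ = std::move(cb); }

  void Notify() {
    std::lock_guard<std::mutex> hold(lock_);  // lock held across the callback
    if (callback_)
      callback_();  // arbitrary user code runs under |lock_|
  }

  void Cancel() {
    std::lock_guard<std::mutex> hold(lock_);  // same lock
    callback_ = nullptr;
  }

 private:
  std::mutex lock_;  // non-recursive, like base::Lock
  std::function<void()> callback_;
};

int main() {
  Notifier notifier;
  notifier.SetCallback([&notifier] {
    // If this callback called notifier.Cancel() here, Notify() -> callback ->
    // Cancel() would try to re-lock |lock_| on the same thread: deadlock
    // (undefined behavior with std::mutex). Left as a comment on purpose.
    (void)notifier;
  });
  notifier.Notify();
  return 0;
}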

If a Mojo expert could look at this problem it would be great.

PS: As far as I understand, the main reason for creating a new cancellation system is to pass it through C code. Wouldn't it be easier/less error-prone to pass a weak_ptr through the C layer?
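
A rough, purely hypothetical illustration of that suggestion (std::weak_ptr stands in for base::WeakPtr, and the thunk/context names are invented; this is not how the EDK actually works): the weak reference is heap-allocated and smuggled through the opaque uintptr_t context that a C API would hand back to the callback.

// Hypothetical sketch: pass a weak reference through an opaque C-style
// context instead of building a separate cancellation system.
#include <cstdint>
#include <iostream>
#include <memory>

struct Watcher {
  void OnReady() { std::cout << "watcher still alive, dispatching\n"; }
};

// The kind of trampoline a C API could invoke with a uintptr_t context.
void OnHandleReadyThunk(std::uintptr_t context) {
  auto* weak = reinterpret_cast<std::weak_ptr<Watcher>*>(context);
  if (auto strong = weak->lock())
    strong->OnReady();  // watcher still alive: dispatch
  else
    std::cout << "watcher gone, notification dropped\n";
}

int main() {
  auto watcher = std::make_shared<Watcher>();
  // Heap-allocate the weak reference so it fits in a uintptr_t context.
  auto* context = new std::weak_ptr<Watcher>(watcher);

  OnHandleReadyThunk(reinterpret_cast<std::uintptr_t>(context));  // dispatched
  watcher.reset();                                                // "cancel"
  OnHandleReadyThunk(reinterpret_cast<std::uintptr_t>(context));  // dropped

  delete context;  // whoever owns the watch cleans up the context
  return 0;
}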

 

Comment 1 by lgrey@chromium.org, Feb 21 2017

Components: Internals>Mojo

Comment 2 by yzshen@chromium.org, Feb 21 2017

Cc: roc...@chromium.org yzshen@chromium.org

Comment 3 by roc...@chromium.org, Feb 21 2017

Cc: -roc...@chromium.org
Owner: roc...@chromium.org
Status: Assigned (was: Unconfirmed)

Comment 4 by roc...@chromium.org, Feb 21 2017

Thanks for the report and the repro. I'm looking into it.

In reference to https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 please note that the topmost RequestContext has been reset already -- we know we're in its destructor because IsCurrent() was true -- so the fresh one established here is in fact used to capture any new notifications triggered by watcher callbacks.

Comment 5 by dyaros...@yandex-team.ru, Feb 21 2017

@roc...

Thanks for the report and the repro. I'm looking into it.
==

Great!

A note on the repro - if you run the test with DCHECKs on, it will hit an error in the thread checker in Watcher: https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.h?l=92
The pthread library returns 22, which as far as I understood means an invalid mutex. To get the attached stack you can either remove the thread checker from Watcher or build in release (the thread checker does nothing in release).


In reference to https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?q=request_con&l=55 please note that the topmost RequestContext has been reset already -- we know we're in its destructor because IsCurrent() was true -- so the fresh one established here is in fact used to capture any new notifications triggered by watcher callbacks.

==
Can you please clarify how that would work?

As far as I can see, everyone uses the current context: https://cs.chromium.org/chromium/src/mojo/edk/system/request_context.cc?l=69 - the nested one won't be current, so it won't get any notifications. Am I wrong?

Comment 6 by roc...@chromium.org, Feb 21 2017

It works because it's *not* nested. The one you're calling the current
context is no longer the current context, because this destructor has reset
the TLS ptr to null.
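
For readers less familiar with the EDK, here is a simplified, self-contained sketch of the thread-local "current context" pattern described above. The class shape, member names, and finalizer handling are invented for illustration; this is not the actual mojo::edk::RequestContext code.

// Simplified sketch of a thread-local "current context" whose destructor
// resets the TLS pointer *before* running finalizers, so notifications
// triggered inside a finalizer land in a fresh current context.
#include <cassert>
#include <functional>
#include <iostream>
#include <vector>

class RequestContext {
 public:
  RequestContext() {
    if (!current_)         // outermost context on this thread becomes current
      current_ = this;
  }

  ~RequestContext() {
    if (current_ != this)  // not the current context: nothing to unwind
      return;
    current_ = nullptr;    // reset TLS before running finalizers, so...
    for (auto& finalizer : finalizers_) {
      // ...anything triggered inside a finalizer is captured by a fresh
      // context, which is now the current one.
      RequestContext nested_capture;
      finalizer();
    }
  }

  static RequestContext* current() { return current_; }

  void AddFinalizer(std::function<void()> f) {
    finalizers_.push_back(std::move(f));
  }

 private:
  static thread_local RequestContext* current_;
  std::vector<std::function<void()>> finalizers_;
};

thread_local RequestContext* RequestContext::current_ = nullptr;

int main() {
  RequestContext outer;
  outer.AddFinalizer([] {
    // The outer context is no longer current here; the fresh context created
    // around this finalizer call is.
    assert(RequestContext::current() != nullptr);
    std::cout << "finalizer ran inside a fresh current context\n";
  });
  return 0;  // ~RequestContext() runs the finalizer here
}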

Comment 7 by dyaros...@yandex-team.ru, Feb 21 2017

@roc

Oh, I see. Thanks.

Comment 8 by roc...@chromium.org, Feb 21 2017

Labels: OS-Windows
The actual cause of the bug appears to be more specifically:

 1. Add watcher W to pipe handle A
 2. Within a single top-level RequestContext:
   2a. Close pipe handle A.
   2b. Cancel watcher W

The issue is that closing pipe A in 2a enqueues a notification to watcher W, to be dispatched on RequestContext unwind. Then, because A has already been closed, the cancellation of W in 2b is necessarily a no-op - there is no longer an internal reference to W which can be used to enqueue a cancellation finalizer (which itself would normally block any pending notifications to the cancelled watcher). The net result is that the cancellation notification triggered by 2a is incorrectly dispatched.
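
For concreteness, a sketch of that sequence against the low-level C API. The header paths and signatures follow the 2017-era API as best as can be reconstructed here and should be treated as approximate; the context value is arbitrary and the callback body is elided.

// Sketch of the crashing sequence, using the low-level C API named above.
#include <stdint.h>

#include "mojo/public/c/system/functions.h"     // MojoWatch, MojoCancelWatch
#include "mojo/public/c/system/message_pipe.h"  // MojoCreateMessagePipe

namespace {

void OnWatchNotification(uintptr_t context,
                         MojoResult result,
                         MojoHandleSignalsState signals_state,
                         MojoWatchNotificationFlags flags) {
  // Watcher W's callback. With the ordering below it can still be invoked
  // with MOJO_RESULT_CANCELLED after the caller believes W was cancelled.
}

void ReproSequence() {
  MojoHandle a, b;
  MojoCreateMessagePipe(nullptr, &a, &b);

  // 1. Add watcher W to pipe handle A.
  const uintptr_t kWatcherContext = 1;  // arbitrary value identifying W
  MojoWatch(a, MOJO_HANDLE_SIGNAL_READABLE, &OnWatchNotification,
            kWatcherContext);

  // 2. Within a single top-level RequestContext. (For the race to occur,
  //    2a and 2b must run while a RequestContext is already on the stack,
  //    e.g. from inside another watch callback; shown inline for brevity.)
  MojoClose(a);                         // 2a. queues a CANCELLED notification
                                        //     for W, dispatched on unwind
  MojoCancelWatch(a, kWatcherContext);  // 2b. no-op: A is already closed, so
                                        //     no cancellation finalizer is
                                        //     enqueued to suppress 2a

  // On RequestContext unwind, the notification from 2a is dispatched to W
  // anyway, even though the caller just cancelled it.
  MojoClose(b);
}

}  // namespace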

This may be a tricky problem to solve, and while I do think it's worth solving, AFAICT we will not ever hit it in chromium unless someone explicitly uses the low-level Mojo API this way. Our bindings objects always ensure that their Watchers are explicitly cancelled before their corresponding message pipe is closed.

A short-term fix may be to specify in the documentation that a mojo::Watcher must outlive the handle it's watching and likewise that MojoCancelWatch must be called to cancel any watches added by MojoWatch before the corresponding handle is closed.

Comment 9 by roc...@chromium.org, Feb 21 2017

Also note that in the repro description from comment #8 the order of 2a and 2b matters. If done in the reverse order, there's no problem.
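
In terms of the sketch under comment #8 (same illustrative |a| and |kWatcherContext|), the safe ordering is simply:

// Safe ordering per comments #8/#9: cancel the watch first, then close.
MojoCancelWatch(a, kWatcherContext);  // enqueues W's cancellation finalizer,
                                      // suppressing any pending notifications
MojoClose(a);                         // no stale notification can reach W now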

Comment 10 by dyaros...@yandex-team.ru

@roc

Do I understand correctly that this: https://cs.chromium.org/chromium/src/mojo/public/cpp/system/watcher.cc?q=watcher.cc&l=63 should be a DCHECK?

Comment 11 by roc...@chromium.org

No, that should not be a DCHECK.

A watch is implicitly cancelled if its handle is closed. This is conveyed
by notifying the watcher with MOJO_RESULT_CANCELLED. The Watcher class
defined here invalidates |handle_| if it receives that notification, and
it's still legal for the user to call Cancel() after this happens.

We could make it a DCHECK and then require callers to check IsWatching()
before Cancel(), but that seems unnecessary.
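
A sketch of the contract described here, using only the mojo::Watcher methods named in this thread (Cancel(), IsWatching()); the setup is elided and assumed, so this is illustrative rather than a compiling test.

// |watcher| is assumed to be a mojo::Watcher that was started on a pipe
// handle which has since been closed.
//
// Closing the handle implicitly cancels the watch: the ready callback is
// invoked once with MOJO_RESULT_CANCELLED and the Watcher invalidates its
// internal handle, so:
DCHECK(!watcher.IsWatching());

// Calling Cancel() afterwards is still legal; it is simply a no-op rather
// than a DCHECK failure.
watcher.Cancel();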

Comment 12 by roc...@chromium.org, Mar 2017

Status: WontFix (was: Assigned)

Based on the recent changes to the watcher API I'm closing this as obsolete. There should no longer be any surprises in the behavior, as we no longer try to make difficult guarantees. This requires a slightly more complex API but has the benefit of being more predictable.

Comment 13 by dyaros...@yandex-team.ru

@roc

Seems like the original crash does not reproduce anymore.
And you've added my tests, which is great!
