
Issue 735051

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 678687




InterfaceEndpointClient::AcceptWithResponder Succeeds when there is no response

Project Member Reported by jonr...@chromium.org, Jun 20 2017

Issue description

When calling InterfaceEndpointClient::AcceptWithResponder for a synchronous call, it is able to detect whether a response was received.

A missing response indicates either a pipe closure or an unusable response.

However, the method still returns true, indicating success.

We should look at updating this, as well as consider how to signify this error state. |encountered_error_| is currently used when errors are received; however, it is itself problematic right now: due to the async nature of receiving errors, we cannot rely on it.

This mostly affects debugging, as there is no way to get the type of failure out of mojo calls.

An example of this is DirectoryProxy::OpenFileHandles (https://cs.chromium.org/chromium/src/out/Debug/gen/components/filesystem/public/interfaces/directory.mojom.cc?rcl=243e8d6094dcd543b5c4ced845fc734df0f5c4c1&l=1590)

This method will return true upon success.
However, false encompasses both pipe failures within InterfaceEndpointClient and deserialization errors caused by bugs.

One potential thought for debugging is exposing the state of the underlying message pipe on BindingState/InterfacePtrState;
then you could check ptr.internal_state()->handle().QuerySignalsState().peer_closed
 
This race condition can be quite nasty.

Service::OnStart can be called. At this time ServiceContext::binding_.handle()->QuerySignalsState().peer_closed() can be either true or false.

So that is not enough of a signal to close the service.

However, when an InterfacePtr is created the peer can already have closed by that time. So InterfacePtr->internal_state_.handle_->QuerySignalsState().peer_closed() is consistently true at the time that sync mojo calls fail.

So we can have the peer close before we complete Service::OnStart.

For asynchronous calls this isn't much of an issue: API-wise they are awaiting later responses, so connection closures arrive.

For synchronous mojo calls we end up with the vague "false" response for failures. But the peer state can be polled to differentiate correctly.
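The polling idea above can be sketched with a self-contained model. FakeHandle, SignalsState, SyncFailure, and ClassifyFailure are all hypothetical names standing in for the real Mojo types; this is an illustration of the idea, not the actual API:

```cpp
#include <cassert>

// Minimal model of a message-pipe handle's queryable signal state.
struct SignalsState {
  bool peer_closed = false;
};

// Stands in for the underlying pipe handle on InterfacePtrState.
struct FakeHandle {
  SignalsState state;
  SignalsState QuerySignalsState() const { return state; }
};

enum class SyncFailure { kPeerClosed, kBadMessage };

// After a sync call returns false, polling peer_closed would let the
// caller tell a clean shutdown apart from a likely serialization bug.
SyncFailure ClassifyFailure(const FakeHandle& handle) {
  return handle.QuerySignalsState().peer_closed ? SyncFailure::kPeerClosed
                                                : SyncFailure::kBadMessage;
}
```

Under these assumptions a caller could cleanly shut down on kPeerClosed and treat kBadMessage as a bug.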
Blocking: 678687
Cc: yzshen@chromium.org
+yzshen@ for thoughts on this issue.

My concern is that we can have synchronous mojo calls fail to receive responses, but this method says the call succeeded.

Comment 4 by yzshen@chromium.org, Aug 30 2017

The result of InterfaceEndpointClient::AcceptWithResponder() doesn't indicate whether the sync mojo call successfully received the response. Instead it indicates whether the request was successfully sent.

Actually this result is ignored entirely, please see:
https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=149


Whether a response is received is signaled in a different way: if responder->Accept() is called, it sets the value of a boolean pointer to true:
https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=75

This boolean pointer points to a boolean that we return to the user code:
https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=150
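The boolean-pointer mechanism described above can be modeled with a self-contained sketch. SyncResponder and SomeSyncCall are illustrative names, not the real generated code:

```cpp
#include <cassert>

// Simplified model of the generated sync-call plumbing: the responder
// flips a caller-owned boolean only when a response actually arrives
// and is deserialized successfully.
class SyncResponder {
 public:
  explicit SyncResponder(bool* result) : result_(result) {}
  bool Accept() {
    *result_ = true;  // Response received and usable.
    return true;
  }
 private:
  bool* result_;
};

// Models the generated stub: the send result from AcceptWithResponder
// is ignored; only whether the responder ran determines the return.
bool SomeSyncCall(bool response_arrives) {
  bool result = false;  // The boolean the responder may flip to true.
  SyncResponder responder(&result);
  // (request sent here; the send result is dropped on the floor)
  if (response_arrives)
    responder.Accept();
  return result;  // false covers pipe closure *and* deserialization bugs.
}
```

This shows why a false return conflates the two failure modes the thread is discussing.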

Ah, I hadn't realized that InterfaceEndpointClient::AcceptWithResponder was only tracking the success of having sent the message. Though if the sending fails, that isn't recorded anywhere.

So the lack of a response is signaled by responder->Accept not being called, leaving result = false.

This overlaps with responder->Accept being unable to deserialize, which also returns false, but denotes a different kind of error.

While both signal that the caller should shut down, it does make it a bit difficult to debug. I was wondering if we should expose the peer_closed concept at the InterfacePtrState level, so that call sites can better determine whether this is an expected error to receive during closure, vs. unexpected serialization errors, which normally denote bugs.

Thoughts?

Comment 6 by yzshen@chromium.org, Aug 30 2017

(I feel that this bug report has conflated a few different things. It might make sense to use different bugs for more focused discussion.)

RE #5:
If the purpose is to debug locally: when a serialization error happens we already have error logs printed to the command-line window. Normal pipe disconnection doesn't output logs. Is that sufficient?
Sorry if I've ended up conflating a few different things.

I grouped this together as right now it's difficult to determine the type of error encountered with synchronous mojo calls. This affects
   - debugging locally the error being run into*
   - reading error logs from unittests on trybots
   - actually handling the different error states
        -  cleanly shut down a process when the peer has closed
        -  throw a fatal error for serialization bugs

* Regarding debugging locally, while the error message on serialization failure is great, having synchronous mojo fail silently on a pipe disconnection is a problem:
   - you have to know that the lack of an error message means it is this type of error
   - the internal state needed to verify this is not exposed to code calling into mojo

I could see this being split into three for better discussion/tracking
   - InterfaceEndpointClient::AcceptWithResponder failing to send isn't exposed to calling code
   - There is no logging for the failure to receive a response to synchronous mojo code
   - The peer closed state is not exposed to code using mojo

Comment 8 by yzshen@chromium.org, Aug 30 2017

> "having synchronous mojo fail silently on a pipe disconnection is a problem"
I don't see why this is sync call specific?
- When pipe disconnection happens during a sync call, the sync call returns "false", and you will get your connection error handler called asynchronously (if you have registered one).
- When pipe disconnection happens when you have a pending async call, you will never receive a callback, and you will get your connection error handler called.
- Pipe disconnection could be caused by any serialization failure or the other side closes its handle to the pipe.
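The sync-call behavior listed above (the call returns false immediately, and the registered connection error handler runs later, asynchronously) can be modeled with a small self-contained sketch. FakePipe and its methods are hypothetical stand-ins, not the real Mojo API:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy model of a bound interface pointer's disconnection behavior.
class FakePipe {
 public:
  void set_connection_error_handler(std::function<void()> handler) {
    handler_ = std::move(handler);
  }
  void DisconnectPeer() { peer_closed_ = true; }

  // A sync call on a disconnected pipe fails right away, but the error
  // handler is only queued; it does not run inline.
  bool SyncCall() {
    if (peer_closed_) {
      error_pending_ = true;
      return false;
    }
    return true;
  }

  // Simulates the message loop later running queued tasks.
  void RunPendingTasks() {
    if (error_pending_ && handler_) {
      handler_();
      error_pending_ = false;
    }
  }

 private:
  std::function<void()> handler_;
  bool peer_closed_ = false;
  bool error_pending_ = false;
};
```

The key point this models: at the moment the sync call returns false, the error handler has not yet fired, so the calling method cannot rely on it having run.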

> "InterfaceEndpointClient::AcceptWithResponder failing to send isn't exposed to calling code"
We don't expose this for either sync or async calls.
Even if we return "true", it doesn't mean the request will be successfully delivered to the peer, because the peer could be destroyed at any time. I am not sure it is useful for the users; instead it only makes the call site more confusing.


> "There is no logging for the failure to receive a response to synchronous mojo code"
IMO it doesn't seem useful to log in this case in general.
We also don't log when the pipe is disconnected while there are still async calls pending (callbacks haven't run).

> The peer closed state is not exposed to code using mojo
Exposing this as a public API may be a little confusing. Because of its asynchronous nature, there is no guarantee that peer_closed == false means the peer is open.
Could you please provide an example where certain things couldn't be done without this state exposed?
> Why sync call specific

I was focusing on synchronous mojo as there have been a series of bugs caused by code written assuming that the connection state doesn't change mid-method call.

   method A
      call sync mojo
      assume mojo is still fine, so do followup
 
Whereas with async mojo, the follow-up work tends to be relegated to messages received back:

   method A
      call async mojo
      do nothing

   method B
      received as a call from mojo
      do follow up work

With the asynchronous case the connection error handler gets called.
With the synchronous case it's possible for the connection to fail mid-call, and then for the subsequent code to cause crashes. Since we are still in the same method, the connection error handler hasn't triggered yet.

> InterfaceEndpointClient::AcceptWithResponder

Ah, I didn't realize that the success on send could still fail on the other side. Thanks for that clarification.

> No logging of pipe closure during sync mojo calls

This would mostly make crashes in unittests easier to diagnose. We've had failures such as issue 738441 where the failing CHECK could either have been triggered by a synchronous mojo call failure, or by a bug with the availability of resource files.

In this particular case the CHECK wasn't even at the site of the mojo call, but several levels removed.

> Peer closed state not exposed

We have been able to fix crashes without it, but we are now masking future bugs. For example, in https://codereview.chromium.org/2949103002/ AuraInit was updated to signal an initialization failure, and to tell clients to shut down when files fail to load.

This handles cases where the service manager is shutting down the world before some other process has completed startup. But it could also mask cases where the file fails to load while the service manager is alive.

I'd feel much safer being able to detect the difference between those two failure states, and to cleanly shut down in the closing case, but to crash with a CHECK in the other.

Sorry for late reply!

> "With the synchronous case it's possible to have the connection fail mid-call, and then for the subsequent code to cause crashes. "
Even if the connection failed in the middle of a sync call, the mojo objects are in a perfectly valid state that you can operate on without crashing your code.

Connection errors never invalidate your mojo objects, even after you receive a call to your connection error handler.

I wonder how it causes your crashes? The only thing I could think of is that the sync call returns false, indicating that the output parameters of the sync call are *untouched*, while your code assumes that the output parameters were updated.
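The untouched-output failure mode described above can be illustrated with a self-contained sketch. Reply, FakeSyncCall, and CallSafely are hypothetical names, not real Mojo code:

```cpp
#include <cassert>
#include <memory>

struct Reply {
  int value = 0;
};

// Stands in for a generated sync stub: on disconnection it returns
// false and leaves |*out| untouched (still null here).
bool FakeSyncCall(bool connected, std::unique_ptr<Reply>* out) {
  if (!connected)
    return false;  // Pipe closed mid-call; output not populated.
  *out = std::make_unique<Reply>();
  (*out)->value = 42;
  return true;
}

// Correct caller: checks the return value before touching the output.
// Code that skipped this check and dereferenced |reply| directly would
// crash on a null pointer, which is the bug pattern described above.
int CallSafely(bool connected) {
  std::unique_ptr<Reply> reply;
  if (!FakeSyncCall(connected, &reply))
    return -1;
  return reply->value;
}
```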

Sync calls could lead to weird call stacks because while we are waiting for the sync call reply, you could get re-entered by any incoming sync calls on the same thread, right on the current call stack. Failure to handle re-entrancy is a common cause of bugs. But this has nothing to do with Mojo; people hit such bugs when using the old sync IPCs as well.

> "I'd feel much safer being able to detect the difference of those two failure states, and to cleanly shutdown in the closing case, but to crash with a CHECK in the other."
You are not supposed to CHECK when you receive a malformed message. Because that means a malicious sender can control exactly when to crash your process.

Sorry I haven't got time and domain expertise to look through the bug and CL links. If you have a concrete example, would you please describe it in-line? Or alternatively, would you please point me to the lines where it is relevant in the pages you linked to?





Example:

Font loading code had a check to fail if no font was found. Similar to other hard failures in Chrome, we crash right away so that bugs are filed and we fix the root cause. In the pre-mojo world this would happen at call time.

Later on a mojo font loading service was built, leading this previously written area of code to become re-entrant. So where a failure used to mean a bug, it could now mean either a bug, or that we are closing and need to handle this gracefully.

The problem in this case is that the handling of the mojo call is quite removed from where the older check was. In this particular case the Mojo handling was 5 levels deeper in the code than the check. Meanwhile there were dozens of pre-existing entry points to the code performing the check.

So without any logging of the connection failure, diagnosing these types of re-entrancy bugs becomes difficult. I'm just concerned that this could be problematic as we continue to introduce more of these interfaces.


Project Member

Comment 12 by bugdroid1@chromium.org, Sep 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/81485a85ecff15a1e40ccabf360f0e00f9fcde44

commit 81485a85ecff15a1e40ccabf360f0e00f9fcde44
Author: Yuzhu Shen <yzshen@chromium.org>
Date: Sat Sep 09 02:22:35 2017

Mojo C++ bindings: add a DVLOG for sync call returning without getting response.

BUG= 735051 

Change-Id: I8594db149a09ae9335fc7394ec26a462028d4aed
Reviewed-on: https://chromium-review.googlesource.com/658150
Reviewed-by: Jonathan Ross <jonross@chromium.org>
Commit-Queue: Yuzhu Shen <yzshen@chromium.org>
Cr-Commit-Position: refs/heads/master@{#500783}
[modify] https://crrev.com/81485a85ecff15a1e40ccabf360f0e00f9fcde44/mojo/public/cpp/bindings/lib/interface_endpoint_client.cc

Status: Fixed (was: Untriaged)
Thanks Yuzhu for landing #12!

With this additional logging information we can now differentiate between an api returning false, and a synchronous api call not receiving a response.

After discussions offline we agree that synchronous APIs which wish to handle no response separately from rejections should look to update their API, such as by considering an enum out param to signify success/failure/no_response.

We do not want to expose the peer state, as for some failure states a fake peer can replace the actual one. So we cannot always rely on the peer state to detect these failed synchronous calls.

We also do not want to set encountered_error to true at this point, as there may already be other enqueued async call replies which arrived without the error. Some processes may wish to finish processing those before shutting down.
Thanks for the summary, Jonathan!

Agreed with everything except a minor correction:

"""After discussions offline we agree that for synchronous apis which wish to handle no response separately from rejections, should look to update their API. Such as considering an enum for an out param to signify success/failure/no_response."""
Actually they could have a boolean output param to signify success/failure, because no_response is indicated by the return value already:

bool succeed = false;
bool has_response = foo->SomeSyncCall(&succeed);
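The suggested shape can be sketched end to end with a self-contained stub. SomeSyncCall and Classify are hypothetical stand-ins illustrating the pattern, not a real generated interface:

```cpp
#include <cassert>
#include <string>

// Models the suggested API shape: the return value reports whether a
// response arrived at all; the out param reports whether the operation
// itself succeeded. On no_response, |*succeed| is left untouched.
bool SomeSyncCall(bool response_arrives, bool op_succeeds, bool* succeed) {
  if (!response_arrives)
    return false;
  *succeed = op_succeeds;
  return true;
}

// A caller can now distinguish all three outcomes.
std::string Classify(bool response_arrives, bool op_succeeds) {
  bool succeed = false;
  if (!SomeSyncCall(response_arrives, op_succeeds, &succeed))
    return "no_response";
  return succeed ? "success" : "failure";
}
```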



Components: -MUS Internals>Services>WindowService
