InterfaceEndpointClient::AcceptWithResponder Succeeds when there is no response |
|||||
Issue descriptionWhen calling InterfaceEndpointClient::AcceptWithResponder with a synchronous call it is able to detect if a response is received. This indicates either a pipe closure or an unusable response. However the method still returns true, indicating a success. We should look at updating this. As well as consider how to signify this error state. |encountered_error_| is currently used when errors are received. However it is problematic itself right now. Due to the async nature of receiving errors we cannot rely on it. This mostly affects debugging, as there is no way to get the type of failure out of mojo calls. An example of this is DirectoryProxy::OpenFileHandles (https://cs.chromium.org/chromium/src/out/Debug/gen/components/filesystem/public/interfaces/directory.mojom.cc?rcl=243e8d6094dcd543b5c4ced845fc734df0f5c4c1&l=1590) This method will return true upon success. However false encompasses both pipe failures within InterfaceEndpointClient, as well as Deserialization errors cause by bugs. One potential thought for debugging is exposing the state of the underlying message pipe one BindingState/InterfacePtrState then you could check ptr.internal_state()->handle().QuerySignalsState().peer_closed
,
Aug 29 2017
,
Aug 30 2017
+yzshen@ for thoughts on this issue. My concern is that we can have synchronous mojo calls fail to receive responses, but this method says the call succeeded.
,
Aug 30 2017
The result of InterfaceEndpointClient::AcceptWithResponder() doesn't indicate whether the sync mojo call successfully receive the response. Instead it indicates whether the request is successfuly sent. Actually this result is ignored entirely, please see: https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=149 Whether a response is received is signaled using a different way: if responder->Accept() is called, it set the value of a boolean pointer to true: https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=75 This boolean pointer pointing to a boolean that we return to the user code: https://cs.chromium.org/chromium/src/mojo/public/tools/bindings/generators/cpp_templates/interface_definition.tmpl?rcl=b28a108060452565677f4c716db37ebc6b6bb46d&l=150
,
Aug 30 2017
Ah I hadn't realized that InterfaceEndpointClient::AcceptWithResponder was only tracking the success of having sent the message. Though if the sending fails that isn't recorded anywhere. So the lack of response being received is signaled because responder->Accept is not called, giving result = false. This overlaps with responder->Accept being unable to deserialize, which also returns false, but denotes a different kind of error. While both signal that the caller should shutdown, it does make it a bit difficult to debug. I was wondering if we should expose the peer_closed concept ate the InterfacePtrState level, so that call sites can better determine if this is an expect error to receive during closure, vs unexpected serialization errors which normally denote bugs. Thoughts?
,
Aug 30 2017
(I feel that this bug report has conflated a few different things. It might make sense to use different bugs for more focused discussion.) RE #5: If the purpose is to debug locally, when serialization error happens we have got error logs printed to the command line window. Normal pipe disconnection doesn't output logs. Is that sufficient?
,
Aug 30 2017
Sorry if I've ended up conflating a few different things.
I grouped this together as right now it's difficult to determine the type of error encountered with synchronous mojo calls. This affects
- debugging locally the error being ran into*
- reading error logs from unittests on trybots
- actually handling the different error states
- cleanly shut down a process when the peer has closed
- throw a fatal error for serialization bugs
* Regarding debugging locally, while the error on serialization is great, having synchronous mojo fail silently on a pipe disconnection is a problem.
- you have to know that the lack of an error message means it is this type of error
- the internal state to verify this state is not exposed to code calling into mojo
I could see this being split into three for better discussion/tracking
- InterfaceEndpointClient::AcceptWithResponder failing to send isn't exposed to calling code
- There is no logging for the failure to receive a response to synchronous mojo code
- The peer closed state is not exposed to code using mojo
,
Aug 30 2017
> "having synchronous mojo fail silently on a pipe disconnection is a problem" I don't see why this is sync call specific? - When pipe disconnection happens during a sync call, the sync call returns "false", and you will get your connection error handler called asynchronously (if you have registered one). - When pipe disconnection happens when you have a pending async call, you will never receive a callback, and you will get your connection error handler called. - Pipe disconnection could be caused by any serialization failure or the other side closes its handle to the pipe. > "InterfaceEndpointClient::AcceptWithResponder failing to send isn't exposed to calling code" We don't expose this for either sync or async calls. Even if we return "true", it doesn't mean the request will be successfully delivered to the peer. Because the peer could be destroyed at any time. I am not sure it is useful for the users, instead it only makes the call site more confusing. > "There is no logging for the failure to receive a response to synchronous mojo code" IMO it doesn't seem useful to log in this case in general. We also don't log when the pipe is disconnected while there are still async calls pending (callbacks haven't run). > The peer closed state is not exposed to code using mojo Exposing this as a public API may be a little confusing. Because its asynchronous nature, there is no guarantee that peer_closed == false means the peer is open. Could you please provide an example where certain things couldn't be done without this state exposed?
,
Aug 30 2017
> Why sync call specific
I was focusing on synchronous mojo as there have been a series of bugs caused by writing to assume that the connection state doesn't change mid-method call.
method A
call sync mojo
assume mojo is still fine, so do followup
Whereas more async mojo tends to relegate the followup work to messages received back
method A
call async mojo
do nothing
method B
received as a call from mojo
do follow up work
With the asynchronous case the connection error handler gets called.
With the synchronous case it's possible to have the connection fail mid-call, and then for the subsequent code to cause crashes. Since we are still in the same method, the connection error hasn't triggered yet.
> InterfaceEndpointClient::AcceptWirhResponder
Ah, I didn't realize that the success on send could still fail on the other side. Thanks for that clarification.
> No logging of pipe closure during sync mojo calls
This would mostly make determining cause of crashes in unittests easier to diagnose. We've had failures such as issue 738441 where the CHECK that was failing could either have been triggered by a synchronous mojo call failure, or by a bug with the availability of resource files.
In this particular case the CHECK wasn't even at the site of the mojo call, but several levels removed.
> Peer closed state not exposed
We haven't been unable to fix crashes without it, but we are now masking future bugs. For example in https://codereview.chromium.org/2949103002/ AuraInit was updated to signal an initialization failure, and to tell clients to shutdown when files fail to load.
This handles cases where the service manager is shutting down the world before some other process has completed startup. But it also could mask cases where the file fails to load, but the service manager is alive.
I'd feel much safer being able to detect the difference of those two failure states, and to cleanly shutdown in the closing case, but to crash with a CHECK in the other.
,
Sep 1 2017
Sorry for late reply! > "With the synchronous case it's possible to have the connection fail mid-call, and then for the subsequent code to cause crashes. " Even if the connection failed in the middle of a sync call, the mojo objects are in absolute valid state that you can operate on without crashing your code. Connection errors never invalidate your mojo objects, even after you receive a call to your connection error handler. I wonder how it causes your crashes? The only thing I could think up is the sync call returns false indicating that the output parameters of the sync call are *untouched*. And your code assumes that the output parameters are updated. Sync calls could lead to weird call stacks because while we are waiting for sync call reply, you could get re-entered by any incoming sync calls on the same threasd, right on the current call stack. Failure to handle re-entrancy is a common cause of bugs. But this has nothing to do with Mojo, people made bugs when using the old sync IPCs as well. > "I'd feel much safer being able to detect the difference of those two failure states, and to cleanly shutdown in the closing case, but to crash with a CHECK in the other." You are not supposed to CHECK when you receive a malformed message. Because that means a malicious sender can control exactly when to crash your process. Sorry I haven't got time and domain expertise to look through the bug and CL links. If you have a concrete example, would you please describe it in-line? Or alternatively, would you please point me to the lines where it is relevant in the pages you linked to?
,
Sep 5 2017
Example: Font loading code had a check to fail if no font was found. Similar to other hard failures in Chrome, we crash right away so that bugs are filed and we fix the root cause. In pre-mojo world this would happen at call time. Later on a mojo font loading service was built. Leading to this previously written area of code to become re-entrant. So where a failure used to mean a bug, it could now mean a bug, or that we are closing and need to handle this gracefully. The problem in this case is that the handling of the mojo call is quite removed from where the older check was. In this particular case the Mojo handling was 5 levels deeper in the code than the check. Meanwhile there were dozens of pre-existing entry points to the code performing the check. So without any logging of the connection failure, diagnosing these types of re-entrancy bugs becomes difficult. I'm just concerned that this could be problematic as we continue to introduce more of these interfaces.
,
Sep 9 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/81485a85ecff15a1e40ccabf360f0e00f9fcde44 commit 81485a85ecff15a1e40ccabf360f0e00f9fcde44 Author: Yuzhu Shen <yzshen@chromium.org> Date: Sat Sep 09 02:22:35 2017 Mojo C++ bindings: add a DVLOG for sync call returning without getting response. BUG= 735051 Change-Id: I8594db149a09ae9335fc7394ec26a462028d4aed Reviewed-on: https://chromium-review.googlesource.com/658150 Reviewed-by: Jonathan Ross <jonross@chromium.org> Commit-Queue: Yuzhu Shen <yzshen@chromium.org> Cr-Commit-Position: refs/heads/master@{#500783} [modify] https://crrev.com/81485a85ecff15a1e40ccabf360f0e00f9fcde44/mojo/public/cpp/bindings/lib/interface_endpoint_client.cc
,
Sep 13 2017
Thanks Yuzhu for landing #12! With this additional logging information we can now differentiate between an api returning false, and a synchronous api call not receiving a response. After discussions offline we agree that for synchronous apis which wish to handle no response separately from rejections, should look to update their API. Such as considering an enum for an out param to signify success/failure/no_response. We do not want to expose the peer state, as for some failure states a fake peer can replace the actual one. So we cannot always rely on the peer state to detect these failed synchronous calls. We also do not want to set encountered_error to true at this point, as there may already be other enqueued async call replies which arrived without the error. Some processes may wish to finish processing those before shutting down.
,
Sep 13 2017
Thanks for the summary, Jonathan! Agreed with everything except a minor correction: """After discussions offline we agree that for synchronous apis which wish to handle no response separately from rejections, should look to update their API. Such as considering an enum for an out param to signify success/failure/no_response.""" Actually the could have a boolean output param to signify success/failure, because no_respnose is indicated by the return value already: bool succeed = false; bool has_response = foo->SomeSyncCall(&succeed);
,
Feb 26 2018
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by jonr...@chromium.org
, Jun 22 2017