
Issue 740642

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug




Mojo: Port proxies can get stuck in an uncollapsable state

Project Member Reported by roc...@chromium.org, Jul 10 2017

Issue description

I recently built a multiprocess stress test to expose potentially buggy edge cases in the EDK. This is the only significant bug I found, and it appears to be a long-standing, fundamental issue with the ports protocol. It doesn't seem possible to have a significant impact on production today, but it is effectively blocking this stress test from landing so I'd like to resolve it.

The issue is raciness in the circulation of ObserveProxy events vs ObserveProxyAck events. A potentially bad scenario:


(A) -> (B) -> (C) -> (D)
  \                   /
    -- <-- (E) -- <--  

(A) and (D) are the receiving ports here. (B) and (C) are proxies on one side, (E) is a lone proxy on the other side.

The issue here is that the ObserveProxy[E->A] event being forwarded from (B) may race with (C) being destroyed and thus get lost forever. The sequence of events is:

1. ObserveProxy[E->A] is received by (A)
2. ObserveProxy[E->A] is forwarded to (B)
3. ObserveProxy[B->C] is received by (A)
4. ObserveProxyAck[B] is sent to (B) and (A) starts routing to (C) directly
5. ObserveProxy[C->D] is received by (A)
6. ObserveProxyAck[C] is sent to (C) and (A) starts routing to (D) directly
7. ObserveProxyAck[C] is received by (C) and (C) is destroyed
8. ObserveProxy[E->A] is received by (B) (forwarded from (A) in #2)
9. (B) attempts to forward ObserveProxy[E->A] to (C)

#9 fails because (C) has already been destroyed. There's no way for (C)'s node to know whether it should tell (E) to re-transmit its ObserveProxy[E->A], as there are legitimate scenarios (e.g. an explicitly deleted port) where an ObserveProxy recipient is gone and re-transmission must not be requested. The net result is that the proxy at (E) will never collapse.

This issue only exists when there are proxies on both sides of the port cycle.
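
The race can be replayed in a toy model. The following is a minimal sketch in Python, not EDK code: the `Port` class, `deliver()` and `replay_race()` are hypothetical names, and it assumes a proxy is removed as soon as its ObserveProxyAck arrives (the real protocol also waits for in-flight user messages to be forwarded first).

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    peer: str       # name of the next port along the route
    is_proxy: bool  # informational only in this toy model


def make_cycle():
    # (A) -> (B) -> (C) -> (D) on one side, (D) -> (E) -> (A) on the other.
    return {
        "A": Port("A", "B", False),
        "B": Port("B", "C", True),
        "C": Port("C", "D", True),
        "D": Port("D", "E", False),
        "E": Port("E", "A", True),
    }


def deliver(ports, dest, event):
    """Deliver one event and return the (dest, event) pairs it triggers.

    Returns None when the destination port no longer exists, i.e. the
    event is lost.
    """
    port = ports.get(dest)
    if port is None:
        return None
    kind, proxy, target = event
    if kind == "ObserveProxyAck":
        del ports[dest]  # simplification: the acked proxy goes away at once
        return []
    if port.peer == proxy:
        # This port currently routes to the proxy: bypass it and ack it.
        port.peer = target
        return [(proxy, ("ObserveProxyAck", proxy, target))]
    return [(port.peer, event)]  # otherwise keep circulating the event


def replay_race():
    ports = make_cycle()
    op_e = ("ObserveProxy", "E", "A")
    op_b = ("ObserveProxy", "B", "C")
    op_c = ("ObserveProxy", "C", "D")

    [(to_b, fwd)] = deliver(ports, "A", op_e)    # (A) forwards toward (B)
    deliver(ports, "A", op_b)                    # (A) acks (B), routes to (C)
    [(to_c, ack_c)] = deliver(ports, "A", op_c)  # (A) acks (C), routes to (D)
    deliver(ports, to_c, ack_c)                  # (C) is destroyed
    [(nxt, fwd2)] = deliver(ports, to_b, fwd)    # (B) finally sees the event
    lost = deliver(ports, nxt, fwd2) is None     # forwarding to (C) fails
    return ports, lost
```

Calling `replay_race()` in this model leaves (A) routing directly to (D), while (D) still routes to (E) and (E) is never acked, which is the stuck state described above.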

Making this P1 since it will inevitably affect production code, but not blocking any releases since it is unlikely to have much if any impact today.
 

Comment 1 by roc...@chromium.org, Jul 10 2017

One idea I have is to introduce a new "control sequence number" for ObserveProxy events.

All ObserveProxy events would start with an invalid control sequence number.

When passed through a receiving port, an ObserveProxy with an invalid sequence number gets the receiving port's next available control sequence number (separate count from user message sequence number) attached to it.

When an ObserveProxy event has a *valid* sequence number attached and is passing through a receiving port, the sequence number is instead stripped from the event. This ensures that an ObserveProxy event only has a valid sequence number while it travels along the side of the port cycle opposite the proxy.

ObserveProxy events would continue to be forwarded normally when they have an invalid control sequence number. Proxies forwarding ObserveProxy events with a valid sequence number would have to keep track of the highest contiguous control sequence number seen so far.

Finally, an ObserveProxyAck would convey the last control sequence number sent, in addition to the last user message sequence number, and proxy cleanup would be blocked on both sequences being fully propagated. This is somewhat simplified by the fact that an ObserveProxyAck with a valid sequence number is only ever sent by receiving ports.

In the scenario given in the issue description, this behavior would ensure that port (C) could not be destroyed until it received the ObserveProxy[E->A] which (A) has already forwarded to (B).
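
A rough sketch of the bookkeeping this proposal implies, in Python for brevity. All class and method names here are hypothetical, `None` stands in for the "invalid" control sequence number, numbering is assumed to start at 0, and the user message sequence check is left out. It is only meant to illustrate the idea, not to describe an actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObserveProxy:
    proxy: str
    control_seq: Optional[int] = None  # None models the "invalid" number


@dataclass
class ReceivingPort:
    next_control_seq: int = 0

    def pass_through(self, event):
        if event.control_seq is None:
            # Unstamped event: attach this port's next control number.
            event.control_seq = self.next_control_seq
            self.next_control_seq += 1
        else:
            # Already stamped by the other receiving port: strip it.
            event.control_seq = None
        return event

    def last_control_seq_sent(self):
        return self.next_control_seq - 1  # -1 means nothing stamped yet


@dataclass
class Proxy:
    seen: set = field(default_factory=set)
    highest_contiguous: int = -1
    acked_last_control_seq: Optional[int] = None

    def forward(self, event):
        # Only stamped events count toward the contiguous high-water mark.
        if event.control_seq is not None:
            self.seen.add(event.control_seq)
            while self.highest_contiguous + 1 in self.seen:
                self.highest_contiguous += 1
        return event

    def on_ack(self, last_control_seq):
        self.acked_last_control_seq = last_control_seq

    def can_be_removed(self):
        # Cleanup waits until every stamped event up to the acked number
        # has passed through this proxy.
        return (self.acked_last_control_seq is not None
                and self.highest_contiguous >= self.acked_last_control_seq)


def demo():
    a, b, c, d = ReceivingPort(), Proxy(), Proxy(), ReceivingPort()
    event = a.pass_through(ObserveProxy("E"))  # (A) stamps ObserveProxy[E->A]
    c.on_ack(a.last_control_seq_sent())        # the ack reaches (C) first
    blocked_before = not c.can_be_removed()
    c.forward(b.forward(event))                # the event arrives via (B)
    removable_after = c.can_be_removed()
    stripped = d.pass_through(event).control_seq is None
    return blocked_before, removable_after, stripped
```

In `demo()`, (C) stays blocked after its ack because it has not yet seen control sequence number 0. It becomes removable only once the stamped event has come through (B), and (D) strips the number again on the way to (E).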

Owner: rockot@google.com
