I recently built a multiprocess stress test to expose potentially buggy edge cases in the EDK. This is the only significant bug I found, and it appears to be a long-standing, fundamental issue with the ports protocol. It doesn't seem possible for this to have a significant impact on production today, but it is effectively blocking the stress test from landing, so I'd like to resolve it.
The issue is raciness in the circulation of ObserveProxy events vs ObserveProxyAck events. A potentially bad scenario:
(A) -> (B) -> (C) -> (D)
 \                   /
  --- <-- (E) <-- ---
(A) and (D) are the receiving ports here. (B) and (C) are proxies on one side, (E) is a lone proxy on the other side.
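To make the topology above concrete, here is a minimal sketch (all names invented for illustration, not EDK code) that models the cyclic route as a next-hop table: messages from (A) traverse the proxies (B) and (C) to reach (D), and messages from (D) traverse the lone proxy (E) to reach (A).

```python
# Hypothetical next-hop table for the port cycle in the diagram.
# (A) and (D) are the receiving ports; (B), (C), (E) are proxies.
next_hop = {
    "A": "B",  # forward direction: A -> B -> C -> D
    "B": "C",
    "C": "D",
    "D": "E",  # return direction: D -> E -> A
    "E": "A",
}

def route(src, dst):
    """Follow next_hop links from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop[path[-1]])
    return path

assert route("A", "D") == ["A", "B", "C", "D"]
assert route("D", "A") == ["D", "E", "A"]
```

The proxy-collapse machinery exists to shorten these chains: once (A) learns it can route directly to (C) or (D), the intermediate proxies can be acked and destroyed.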
The issue here is that the ObserveProxy[E->A] event being forwarded from (B) may race with (C) being destroyed and thus get lost forever. The sequence of events is:
1. ObserveProxy[E->A] is received by (A)
2. ObserveProxy[E->A] is forwarded to (B)
3. ObserveProxy[B->C] is received by (A)
4. ObserveProxyAck[B] is sent to (B) and (A) starts routing to (C) directly
5. ObserveProxy[C->D] is received by (A)
6. ObserveProxyAck[C] is sent to (C) and (A) starts routing to (D) directly
7. ObserveProxyAck[C] is received by (C) and (C) is destroyed
8. ObserveProxy[E->A] is received by (B) (forwarded from (A) in #2)
9. (B) attempts to forward ObserveProxy[E->A] to (C)
#9 fails because (C) has already been destroyed. There's no way for (C)'s node to know whether it should tell (E) to re-transmit its ObserveProxy[E->A], as there are legitimate scenarios (e.g. an explicitly deleted port) where an ObserveProxy recipient is gone and re-transmission must not be requested. The net result is that the proxy at (E) will never collapse.
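The failure mode in the sequence above can be sketched as a tiny simulation (invented names, not EDK code): deliveries to a destroyed proxy are silently dropped, and there is no re-transmission path, so the delayed ObserveProxy[E->A] is lost for good.

```python
# Illustrative model of the race: a proxy that has been destroyed drops
# any event that arrives afterward, with no way to request re-transmission.
alive = {"B": True, "C": True}
lost_events = []

def deliver(node, event):
    """Deliver event to node; record and drop it if node is gone."""
    if not alive.get(node, True):
        lost_events.append((node, event))  # silently dropped: the bug
        return False
    return True

# (A) receives ObserveProxy[E->A] and forwards it toward (B); delivery
# is asynchronous, so the event can be arbitrarily delayed in flight.
in_flight = ("B", "ObserveProxy[E->A]")

# Meanwhile (A) acks proxies (B) and (C) and starts routing around them...
deliver("B", "ObserveProxyAck[B]")
deliver("C", "ObserveProxyAck[C]")

# ...and (C), on receiving its ack, is destroyed.
alive["C"] = False

# The delayed ObserveProxy[E->A] finally reaches (B), which tries to
# forward it to the now-destroyed (C). It is lost, so the proxy at (E)
# never learns it can collapse.
node, event = in_flight
if deliver(node, event):
    deliver("C", event)

assert lost_events == [("C", "ObserveProxy[E->A]")]
```

Note that (C)'s node cannot distinguish this case from a legitimately closed port, which is why it cannot simply ask (E) to re-transmit.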
This issue only exists when there are proxies on both sides of the port cycle.
Making this P1 since it will inevitably affect production code, but not blocking any releases since it is unlikely to have much if any impact today.
Comment 1 by roc...@chromium.org, Jul 10 2017