Issue 876540

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 17
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 3
Type: Bug




[remoting host] First P2P relay connection fails

Project Member Reported by lambroslambrou@chromium.org, Aug 21

Issue description

What steps will reproduce the problem?
(1) Start host process
(2) Connect (Me2Me) using web-site with ?mods=force_relay_connection

What is the expected result?
Should connect successfully first time.
What happens instead?
The first connection attempt fails, very soon after PIN authentication (usually less than 1s, well within the 15s ICE timeout). Subsequent connection attempts are successful.

Please use labels and text to provide additional information.
Seen on Linux, connecting to a ToT host build, from same machine. It's consistently reproducible.

 
I built a host which hard-codes the ICE config lifetime to 5 seconds. This means that the host fetches a new ICE config for every connection. With this modification, the problem occurs every time - it always fails to make a force-relay connection.

From a user perspective, the problem is recoverable - they just need to hit the Reconnect button. But this could account for a significant portion of the P2P failure stats.

I suspect the same problem would happen after the ICE config expiry period, which is typically 24 hours.
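As an illustration of why a 5-second lifetime forces a fetch on every connection while a 24-hour lifetime rarely does, here is a minimal sketch of an expiry check. The function name `needs_refresh` and the timestamps are hypothetical and are not taken from the host code; the real host may track expiry differently.

```python
from datetime import datetime, timedelta

def needs_refresh(fetched_at, lifetime, now):
    """Return True if a config fetched at `fetched_at` has expired by `now`.

    Hypothetical helper for illustration only, not the host's actual logic.
    """
    return now - fetched_at >= lifetime

fetched_at = datetime(2018, 8, 21, 12, 0, 0)   # illustrative timestamp
next_connection = fetched_at + timedelta(minutes=1)

# Hard-coded 5-second lifetime: a connection one minute later refetches.
short_lifetime_refetch = needs_refresh(fetched_at, timedelta(seconds=5), next_connection)

# Typical 24-hour lifetime: the same connection reuses the cached config.
long_lifetime_refetch = needs_refresh(fetched_at, timedelta(hours=24), next_connection)
```

Under these assumptions, the modified host would hit the "fresh config" path on essentially every connection, which matches the consistent failure described above.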

The failure seems to happen at the client end, which makes sense if there's an issue with remote relay candidates.

On the client, ICE gathering seems fine; the failure appears to be in the ICE checking stage.

Maybe there is a timing issue with the ICE credentials not being immediately usable?

Looks like the problem is caused by this error from the TURN server, seen at the client (Chrome) end:
ERROR:turnport.cc(1754)] Received TURN CreatePermission error response, code=403; pruned connection.
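For anyone trying to confirm the same symptom in their own client logs, a small sketch like the following could pull out these errors. The regular expression is written against the single log line quoted above; the helper name `find_create_permission_errors` is made up for this example, and other log formats may need a different pattern.

```python
import re

# Pattern derived from the one log line quoted in this issue.
TURN_ERROR_RE = re.compile(
    r"ERROR:(?P<file>[\w.]+)\((?P<line>\d+)\)\]\s+"
    r"Received TURN CreatePermission error response, code=(?P<code>\d+)"
)

def find_create_permission_errors(log_text):
    """Return (source_file, source_line, error_code) for each match."""
    return [
        (m.group("file"), int(m.group("line")), int(m.group("code")))
        for m in TURN_ERROR_RE.finditer(log_text)
    ]

sample = (
    "ERROR:turnport.cc(1754)] Received TURN CreatePermission error response, "
    "code=403; pruned connection."
)
errors = find_create_permission_errors(sample)
```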

The problem is not reproducible when doing local development of the web site (maybe TestGAIA is the reason?).

In the host, I tried adding a 10-second delay before signaling relay candidates (so all the other gathered candidates are signaled immediately, but the relay candidates are sent after a 10-second timer).

This had no effect. The failure case still failed immediately - the PeerConnection failed within the 10s delay. The successful case connected immediately. A relayed connection was established within the 10s window, before the relay candidates were sent from host to client. In the client, the selected candidate-pair was peerreflexive/relay, but after 10s this changed to relay/relay once the remote candidate was received (which became the selected remote candidate on the client).


Owner: lambroslambrou@chromium.org
Status: Assigned (was: Untriaged)
It looks like there's a timing issue in the ICE stack in Chrome?
In the host process, I tried delaying signaling all non-"host" candidates. That means: srflx and relay candidates were signaled to the client after a 10s delay.
This experiment caused the failure to happen consistently.
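The split used in this experiment can be sketched as follows. This is not the host's C++ code; it only illustrates partitioning candidates on the `typ` field of the standard SDP candidate attribute. The function names and the candidate lines (which use documentation-range addresses) are invented for the example.

```python
def candidate_type(candidate_line):
    """Return the token that follows 'typ' in an SDP candidate line."""
    fields = candidate_line.split()
    return fields[fields.index("typ") + 1]

def split_for_delayed_signaling(candidates):
    """Partition candidates: 'host' goes out immediately, the rest are delayed."""
    immediate = [c for c in candidates if candidate_type(c) == "host"]
    delayed = [c for c in candidates if candidate_type(c) != "host"]
    return immediate, delayed

# Illustrative candidates only; addresses are from documentation ranges.
candidates = [
    "candidate:1 1 udp 2122260223 192.0.2.10 50000 typ host",
    "candidate:2 1 udp 1686052607 198.51.100.7 50000 typ srflx raddr 192.0.2.10 rport 50000",
    "candidate:3 1 udp 41885439 203.0.113.5 60000 typ relay raddr 198.51.100.7 rport 50000",
]

immediate, delayed = split_for_delayed_signaling(candidates)
delayed_types = [candidate_type(c) for c in delayed]
```

In the experiment, the `immediate` group would be signaled right away and the `delayed` group only after the 10s timer fires.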
Recall that the host and client are on the same machine, and the client (website) was applying config['iceTransportPolicy'] = 'relay'.
So:
(1) The client receives typ=host candidates from the host.
(2) The client gathers only relay candidates (no host/srflx).
(3) The client goes into ICE "checking" state.
(4) At this point, the ICE stack applies a short timeout (< 100ms) before failing.
(5) The client is waiting for the host to signal srflx/relay candidates, which are delayed by 10s.
The short timeout on waiting for more remote candidates seems to cause the failure.
It's not clear what the actual timeout is here. The time between ICE checking and ICE failure seems to vary between 10ms and 80ms.

So it looks like the trigger for the failure is: too much time between the host signaling typ=host and typ=srflx candidates.

If I tweak the host to batch the typ=host and typ=srflx candidates into a single message, the connection always succeeds. This can be done by increasing this value from 20ms to 100ms, but it's a crude hack:
https://cs.chromium.org/search/?q=kTransportInfoSendDelayMs+file:webrtc_transport.cc
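To show why the larger delay batches the candidates together, here is a toy model of delay-based batching. It assumes a one-shot timer that starts when the first pending candidate arrives and flushes everything queued when it fires; that behavior, the function name `batch_candidates`, and the 50ms srflx gathering time are all assumptions for illustration, not facts taken from webrtc_transport.cc.

```python
def batch_candidates(events, send_delay_ms):
    """Group (time_ms, candidate) events into messages.

    A batch opens when a candidate arrives with no timer running and is
    sent `send_delay_ms` later, carrying everything queued in between.
    `events` must be sorted by time. Returns [(send_time_ms, [candidates])].
    """
    messages = []
    pending = []
    deadline = None
    for time_ms, candidate in events:
        # Flush the open batch if its timer fired before this arrival.
        if deadline is not None and time_ms >= deadline:
            messages.append((deadline, pending))
            pending = []
            deadline = None
        if deadline is None:
            deadline = time_ms + send_delay_ms
        pending.append(candidate)
    if pending:
        messages.append((deadline, pending))
    return messages

# Assumed timings: host candidate at 0ms, srflx candidate at 50ms.
events = [(0, "host"), (50, "srflx")]

with_20ms = batch_candidates(events, 20)    # two separate messages
with_100ms = batch_candidates(events, 100)  # one combined message
```

Under these assumed timings, a 20ms delay sends the host candidate alone and the srflx candidate later, while a 100ms delay puts both in one message, which is the condition under which the connection was seen to succeed.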

Also, I haven't found a way to trigger the problem without the force_relay_connection mod. It's not clear if this problem could account for increased P2P failures in the wild?


Comment 5 by lambroslambrou@chromium.org, Jan 17 (6 days ago)

Status: Fixed (was: Assigned)
This issue only affected force-relay connections, because of a weird timing issue when the host generated the full set of ICE candidates but the client-side list was pruned to relay-only candidates.
Also, the issue disappeared since we pushed a server-side fix for the ICE config.
