
Issue 922588

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug-Regression




ICE fails immediately since v.73 in certain scenarios

Reported by davies...@gmail.com, Jan 16 (6 days ago)

Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3664.3 Safari/537.36

Steps to reproduce the problem:
1. Using JsSIP with Asterisk on Chrome
2. Initiate outbound call
3. SDP setup proceeds correctly / normally
4. signalling state => stable
5. iceconnectionstate => failed
session drops

What is the expected behavior?
In step 5: iceconnectionstate => connected
session proceeds

What went wrong?
Cases which WORK:
- All calls originated by the remote party
- All calls communicating with other browsers (i.e. not Asterisk)
- All calls using Chrome stable or beta
- Calls where at least one STUN/TURN server *never* responds.

Case which fails:
- Chrome initiates call
- Target is Asterisk via JsSIP
- Either no ICE servers, or all servers respond quickly (see the sketch below).
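
The failing flow boils down to something like this (a minimal sketch; the JsSIP wrapper and signalling transport are omitted, and sendOfferToAsterisk is a hypothetical helper):

async function placeCall(sendOfferToAsterisk) {
  // One of the failing cases: no ICE servers configured
  const pc = new RTCPeerConnection({ iceServers: [] });
  pc.addEventListener('iceconnectionstatechange', () =>
    console.log('iceconnectionstate =>', pc.iceConnectionState));
  pc.addTransceiver('audio');
  await pc.setLocalDescription(await pc.createOffer());
  // Non-trickle: the offer (with candidates gathered after a short wait)
  // goes to Asterisk, which returns an answer containing all of its candidates
  const answerSdp = await sendOfferToAsterisk(pc.localDescription.sdp);
  await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp });
  // Observed on 73: signalingstate => stable, then iceconnectionstate => failed
}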

Did this work before? Yes 72

Does this work in other browsers? N/A

Chrome version: 73.0.3664.3  Channel: dev
OS Version: Mint
Flash Version: 

Tested on Windows 10 and Linux; still fails on the nightly build as of 16/Jan/19.

Example of SDP:

(Chrome) setLocalDescription
type: offer, sdp: v=0
o=- 7067090028397092722 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE 0
a=msid-semantic: WMS CVSfbdtMl6jxizlmf2keCZity2kSonBVosM6
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 110 112 113 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:HDjS
a=ice-pwd:gdiVw1Mvc7jQ0tcLFivDQ5d5
a=ice-options:trickle
a=fingerprint:sha-256 A9:CB:57:40:7F:19:77:54:B1:5D:E0:29:4F:39:FA:FD:5A:CE:7F:F2:6F:10:59:0E:02:E5:36:FF:FD:EC:FD:22
a=setup:actpass
a=mid:0
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:9 urn:ietf:params:rtp-hdrext:sdes:mid
a=extmap:13 urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id
a=extmap:14 urn:ietf:params:rtp-hdrext:sdes:repaired-rtp-stream-id
a=sendrecv
a=msid:CVSfbdtMl6jxizlmf2keCZity2kSonBVosM6 e20d694d-16ef-483b-9cf3-cb0268b9c53a
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:110 telephone-event/48000
a=rtpmap:112 telephone-event/32000
a=rtpmap:113 telephone-event/16000
a=rtpmap:126 telephone-event/8000
a=ssrc:3033023710 cname:Ofs1+a650DG8V3cx
a=ssrc:3033023710 msid:CVSfbdtMl6jxizlmf2keCZity2kSonBVosM6 e20d694d-16ef-483b-9cf3-cb0268b9c53a
a=ssrc:3033023710 mslabel:CVSfbdtMl6jxizlmf2keCZity2kSonBVosM6
a=ssrc:3033023710 label:e20d694d-16ef-483b-9cf3-cb0268b9c53a

(From Asterisk) setRemoteDescription
type: answer, sdp: v=0
o=root 465229573 465229573 IN IP4 10.0.0.33
s=Asterisk PBX
t=0 0
a=group:BUNDLE 0
a=msid-semantic:WMS 1a33c281-dc00-491c-8a54-934417e24b02
m=audio 14076 RTP/SAVPF 111 8 0 126
c=IN IP4 10.0.0.33
a=rtcp:14077 IN IP4 10.0.0.33
a=rtpmap:111 opus/48000/2
a=maxptime:60
a=fmtp:111 maxplaybackrate=16000; stereo=0; sprop-stereo=0; useinbandfec=0
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:126 telephone-event/8000
a=fmtp:126 0-16
a=ptime:20
a=ice-ufrag:251b2d3411577c7663341c1271e4f359
a=ice-pwd:663af51e05a02e0c637c853761f17da5
a=candidate:Ha000021 1 UDP 2130706431 10.0.0.33 14076 typ host
a=candidate:S58cae96e 1 UDP 1694498815 xxx.xxx.233.110 51117 typ srflx raddr 10.0.0.33 rport 14076
a=connection:new
a=setup:active
a=fingerprint:SHA-256 72:C2:D7:33:94:C1:71:44:3A:68:D6:A3:18:CC:3C:E1:7C:0F:A6:65:AA:EC:EF:B1:48:50:5A:DB:49:34:D6:70
a=mid:0
a=ssrc:310494907 cname:5eca32ee-dd3e-4378-b542-c5ff198d2f8b
a=ssrc:310494907 msid:1a33c281-dc00-491c-8a54-934417e24b02 4eed43c2-2d43-4317-abfc-a67080d411f3
a=ssrc:310494907 mslabel:1a33c281-dc00-491c-8a54-934417e24b02
a=ssrc:310494907 label:4eed43c2-2d43-4317-abfc-a67080d411f3
a=rtcp-mux
a=sendrecv
a=msid:1a33c281-dc00-491c-8a54-934417e24b02 4eed43c2-2d43-4317-abfc-a67080d411f3
a=end-of-candidates
 

Comment 1 by davies...@gmail.com, Jan 16 (6 days ago)

I hope it is appropriate to attach the ok-call and failed-call Chrome logs, captured using the '--enable-logging --vmodule=*/webrtc/*=1' flags. Both have been gzipped to reduce upload size. The only difference between the two calls is the addition of a non-responsive STUN server to the working call's setup.

chrome_debug.log.fails.gz (33.6 KB)
chrome_debug.log.works.gz (39.1 KB)

Comment 2 by viswa.karala@chromium.org, Jan 17 (6 days ago)

Labels: Needs-Triage-M73 Needs-Bisect

Comment 3 by viswa.karala@chromium.org, Jan 17 (5 days ago)

Cc: viswa.karala@chromium.org
Labels: Triaged-ET Needs-Feedback
Thanks for filing the issue!

@Reporter: Could you please provide a sample test file/URL that reproduces the issue? That would help in triaging it further. If possible, please also provide a screencast of the issue to aid understanding.

Thanks!

Comment 4 by philipp....@googlemail.com, Jan 17 (5 days ago)

Components: -Blink>WebRTC Blink>WebRTC>Network
viswa.karala: it is very unlikely that a screencast is going to help with a problem that occurs at the network level.

davies147: do you use trickle ICE? (even though the error sounds like the opposite of 'non-trickle does not work with a broken STUN server')

Comment 5 by davies...@gmail.com, Jan 17 (5 days ago)

No, we do not *actually* use trickle ICE. Because the Asterisk endpoint does not support it, we collect ICE candidates for a finite period (2 seconds max), add them to the generated offer, and pass it to the far end. Asterisk then responds with all of its ICE candidates present in the SDP (as above).
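
Our gathering step is essentially the following (a sketch with invented names, not our production code):

async function offerWithGatheredCandidates(pc, maxWaitMs = 2000) {
  await pc.setLocalDescription(await pc.createOffer());
  // Wait for gathering to complete, but never longer than maxWaitMs
  await new Promise((resolve) => {
    const timer = setTimeout(resolve, maxWaitMs);
    pc.addEventListener('icegatheringstatechange', () => {
      if (pc.iceGatheringState === 'complete') {
        clearTimeout(timer);
        resolve();
      }
    });
  });
  // localDescription now contains every candidate gathered so far
  return pc.localDescription.sdp;
}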

I have just re-tried the test with all a=extmap: lines and a=ice-options:trickle removed from the initial offer, but that makes no difference.

Experimentally, with a bit of guesswork involved, it appears that setRemoteDescription() must have been called before icegatheringstate transitions from gathering => complete. If that transition occurs first, the setup fails.

Based on this theory I will attempt to "force the error" in an environment where Asterisk is not involved.

Comment 6 by sheriffbot@chromium.org (Project Member), Jan 17 (5 days ago)

Labels: -Needs-Feedback
Thank you for providing more feedback. Adding the requester to the cc list.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 7 by davies...@gmail.com, Jan 17 (5 days ago)

PS. I am happy to do a version bisect to pin down this issue to a more specific build but will need some pointers on how to do so if necessary/useful/required.

Comment 8 by davies...@gmail.com, Jan 17 (5 days ago)

1) Sadly, I have failed to replicate this in a Chrome <-> Chrome environment, even when replicating the ICE gather/collate mechanism and ensuring that the order and timing of calls to the various methods are similar to our JsSIP <-> Asterisk scenario.

2) Interestingly, I have found a curious workaround! During setup we have events as follows:

signalingstate => have-local-offer
icegatheringstate => gathering
... ice ...
iceconnectionstate => checking
icegatheringstate => complete
... setRemote ...
signalingstate => stable
iceconnectionstate => failed      <- ***
iceconnectionstate => completed

If I act upon 'failed' (at ***), then I close the connection. If I IGNORE the 'failed' report, it immediately transitions to 'completed', with only a 3 ms delay in my tests!
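
In code, the workaround amounts to something like this (a sketch; the 1-second grace period is arbitrary and endCall is a hypothetical hangup helper):

let failTimer = null;
pc.addEventListener('iceconnectionstatechange', () => {
  const state = pc.iceConnectionState;
  if (state === 'failed') {
    // Don't tear down immediately; in practice the state flips to
    // 'completed' a few milliseconds later
    failTimer = setTimeout(() => endCall('ICE failed'), 1000);
  } else if (state === 'connected' || state === 'completed') {
    clearTimeout(failTimer); // false alarm, let the call proceed
  }
});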

I am attaching a debug log of the case when I ignore the 'failed' indication and allow the call to proceed.

chrome_debug.log.workaround.gz (40.3 KB)

Comment 9 by philipp....@googlemail.com, Jan 17 (5 days ago)

Cc: jonasolsson@chromium.org
failed -> completed sounds like a very buggy transition. You might want to look at Wireshark dumps too (they are more accessible than Chrome logs :-))

jonasolson: maybe this is related to your changes too?

Comment 10 by davies...@gmail.com, Jan 17 (5 days ago)

Wireshark dump taken - timings should be pretty accurate as they use the same local system clock.

It would appear that the 'failed' event occurs in the small gap between all external STUN/TURN bindings completing and the receipt of the first host candidate's STUN response, and that it flips back to 'completed' as soon as a STUN response is received from the host candidate.

1st pcap attached, the significant events/timings are:

14:41:15.556 "checking" 
frame 490/491

14:41:15.641 "failed" (Ignored by user-agent)
frame 598

14:41:15.647 "completed"
frame 604

14:41:18.018 "closed" (By user)


I also did a really simple test with no external STUN/TURN specified. In this case 'failed' happens when Chrome sends the first host-binding request, and 'completed' happens immediately after that, even before the host-binding-success packet is received.

2nd pcap attached, the significant events/timings are:

14:56:40.682 "checking"
frame 6

14:56:40.763 "failed"  (Ignored by user-agent)
frame 17/18

14:56:40.764 "completed"
frame 17/18

14:56:42.568 "closed" (By user)


Hope this helps.
Regards,
Steve

webrtc_wireshark.pcapng.gz (86.9 KB)
webrtc_wireshark_no_ice.pcapng.gz (47.2 KB)

Comment 11 by qingsi@chromium.org, Jan 17 (5 days ago)

This issue is likely related to this WebRTC bug: https://bugs.chromium.org/p/webrtc/issues/detail?id=9218.

The libwebrtc implementation reports the 'failed' state if all currently available candidate pairs fail the connectivity check. As a result, if the STUN/TURN transactions are slow and delay the gathering and trickling process, while the gathered host candidate is not reachable from the remote peer, the remote peer will see a 'false-alarm' failure and then 'connected' if candidates signaled later turn out to be reachable.

As a result, the 'failed' state has historically not been a terminal state. My suggestion is to add some waiting logic upon 'failed', combined with an ICE restart, to be robust.
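
To sketch that suggestion (the grace period is arbitrary, sendOfferToRemote is a hypothetical signalling helper, and createOffer({ iceRestart: true }) triggers the restart):

pc.addEventListener('iceconnectionstatechange', async () => {
  if (pc.iceConnectionState !== 'failed') return;
  // Wait briefly in case this is one of the false-alarm failures
  await new Promise((resolve) => setTimeout(resolve, 2000));
  if (pc.iceConnectionState !== 'failed') return; // recovered on its own
  // Still failed: restart ICE with fresh ufrag/pwd and re-signal the offer
  await pc.setLocalDescription(await pc.createOffer({ iceRestart: true }));
  sendOfferToRemote(pc.localDescription.sdp);
});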

jonasolsson@ has recently been working on making the interpretation of states consistent with the spec (which I believe has been introduced in M73 canary). For what it's worth, there may be a caveat: if the new state signal prevents the transition from 'failed' to 'connected', forcing 'failed' to be a terminal state, web apps could get stuck.

Comment 12 by qingsi@chromium.org, Jan 17 (5 days ago)

Cc: qingsi@chromium.org jeroendb@chromium.org

Comment 13 by philipp....@googlemail.com, Jan 17 (5 days ago)

#11: does this only surface due to the recent changes in chrome?

Given the guidance we gave in the spec about restarting in https://github.com/w3c/webrtc-pc/pull/1910, shipping this would cause quite a number of undesirable restarts.

Comment 14 by hbos@chromium.org, Jan 17 (5 days ago)

Cc: hbos@chromium.org
Labels: -Needs-Bisect -Via-Wizard-API -Triaged-ET -Needs-Triage-M73 M-73
Owner: jonasolsson@chromium.org
Status: Assigned (was: Unconfirmed)
Jonas, can you take a look?

Comment 15 by juberti@google.com, Jan 17 (5 days ago)

Cc: benwright@chromium.org

Comment 16 by hbos@chromium.org, Jan 17 (5 days ago)

Labels: -Pri-2 Pri-1

Comment 17 by juberti@google.com, Jan 17 (5 days ago)

I think this behavior is incorrect per https://tools.ietf.org/html/draft-ietf-ice-trickle-21#section-8 - when trickling, we should only go to 'failed' once we have received end-of-candidates (and hence 'failed' should be a terminal state).
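
For illustration, from the app's point of view end-of-candidates signalling looks roughly like this when trickling (a sketch; signal and onEndOfCandidates are hypothetical helpers):

// Offerer: trickle candidates, then announce end-of-candidates
pc.addEventListener('icecandidate', ({ candidate }) => {
  if (candidate) {
    signal({ candidate: candidate.candidate, sdpMid: candidate.sdpMid });
  } else {
    signal({ endOfCandidates: true }); // gathering is finished
  }
});

// Answerer: an empty candidate string marks end-of-candidates
// for the media section identified by sdpMid
async function onEndOfCandidates(pc) {
  await pc.addIceCandidate({ candidate: '', sdpMid: '0' });
}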

I think we should probably roll back the offending CL until we can figure out the exact path forward here.

Comment 18 by benwright@google.com, Jan 17 (5 days ago)

Revert CL should be landing soon: https://chromium-review.googlesource.com/c/chromium/src/+/1418314

Comment 19 by juberti@google.com, Jan 17 (5 days ago)

Great. That only rolls back the thing that wires up the new logic for the ICE states, so the actual root cause CL is elsewhere - we can let Jonas track that down.

Comment 20 by philipp....@googlemail.com, Jan 17 (5 days ago)

davies147: can you try removing the 10.x host candidate from the Asterisk description? It's a long shot, but if that makes the failure go away it might provide a bit more insight.

Comment 21 by davies...@gmail.com, Jan 18 (4 days ago)

I removed the host candidate, which correctly forced the use of the relay entry (the network topology/NAT means that srflx cannot be used), but I still observed the transition via the 'failed' state.


Comment 22 by jonasolsson@chromium.org, Yesterday (42 hours ago)

Now I've had some time to dig into this, and I'm really confused.

In order for us to emit failed on the new iceconnectionstate path (which 73 should have been on until the revert in #18) this check has to evaluate to true: https://cs.chromium.org/chromium/src/third_party/webrtc/p2p/base/p2p_transport_channel.cc?rcl=74ba99062c48b278675cfe52643719202296fddc&l=409

This means that writable_ has to have been true at some point while it's currently false.

writable_ is only set by SetWritable, which also logs a line containing "Changed writable_ to ". This line occurs exactly once in each of the submitted logs, when writable_ becomes true.

Comment 23 by davies...@gmail.com, Yesterday (39 hours ago)

This definitely feels like a timing issue; one of our systems will work about 1 time in 10.

Looking at the same piece of code referred to in comment #22, just below line 409: if

(has_been_writable_ == false && gathering_state_ == kIceGatheringComplete && has_connection == false)

which I believe is possible/true, then this method can still return 'failed'. It does depend on how/when this method is triggered to re-evaluate the state.

Forgive me if the above is a naive interpretation, but it is also consistent with the observation that if I provide a broken/slow TURN server, gathering_state_ will not yet be kIceGatheringComplete?
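
In pseudocode, my reading of the two 'failed' paths is roughly (a loose JavaScript paraphrase of the linked C++, not the actual implementation):

function computeIceConnectionState({ hasBeenWritable, writable,
                                     gatheringComplete, hasConnection }) {
  // Path 1 (comment #22): we were writable once but no longer are
  if (hasBeenWritable && !writable) return 'failed';
  // Path 2 (the case above): never writable, gathering done, no candidate pairs
  if (!hasBeenWritable && gatheringComplete && !hasConnection) return 'failed';
  // ... other transitions (checking/connected/completed) omitted ...
}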

Comment 24 by jonasolsson@chromium.org, Yesterday (38 hours ago)

Never mind, some rubber duck debugging made things clearer. I think we're actually hitting this code path: https://cs.chromium.org/chromium/src/third_party/webrtc/p2p/base/p2p_transport_channel.cc?rcl=74ba99062c48b278675cfe52643719202296fddc&l=421

I.e. the gathering_state_ variable is kIceGatheringComplete, but we don't have any active connections. This seems to match the chrome_debug.log.fails log, where 
"[1:17:0116/174627.503729:INFO:p2ptransportchannel.cc(843)] P2PTransportChannel: 0, component 1 gathering complete"
occurs before AddConnection() gets called for the first time.

Is this a bug with gathering_state_, or did I misunderstand the meaning of "gathering complete"? I've implicitly assumed that once the gathering is completed no more connections will be added, which is why I made "gathering complete with no connections" map to failure.

Comment 25 by philipp....@googlemail.com, Yesterday (38 hours ago)

gathering complete only means that the local candidates have all been gathered; you can still create new pairs with newly trickled remote candidates.

Comment 26 by jonasolsson@chromium.org, Yesterday (37 hours ago)

That explains where I went wrong, thanks! Since end-of-candidates signalling isn't implemented yet (see #11), we should probably avoid transitioning to 'completed' at all for now. I'll tweak the WebRTC guts and then re-enable the new ICE connection state. If we keep running into issues, we can always revert to the old implementation again.
