RTCPeerConnection objects are released too slowly and reallocating causes exception: Cannot create so many PeerConnections
Reported by
devteam_...@netop.com,
Mar 25 2018
Issue description

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36

Steps to reproduce the problem:
1. Allocate a batch of RTCPeerConnection objects
2. Close and release allocated objects
3. Wait for a reasonable amount of time to allow for garbage collection
4. Go to 1

What is the expected behavior?
Allocation/Release cycle should work indefinitely without issue.

What went wrong?
At some point allocation fails with DOMException: Failed to construct 'RTCPeerConnection': Cannot create so many PeerConnections.

Did this work before? Yes Chrome 64
Does this work in other browsers? N/A

Chrome version: 65.0.3325.181 Channel: stable
OS Version: 10.0
Flash Version:

We are using WebRTC communication with our Chrome app. We create between 30 and 50 RTCPeerConnection at a time, per session, which are used only for their Data Channel. Sessions can be restarted, and when that happens the connections are released and recreated in a short time interval. It was never a problem before, but since Chrome 65 we noticed the "cannot create" exceptions gathering in Google Analytics. This worked before, as early as Chrome 50 or even before that.

Attached script can be run as a Snippet and reproduces the error. Depending on the memory context, the exception can happen more or less soon, but always does. At some point after the exception is thrown, it appears that some connections are released and more RTCPeerConnections can be created. Objects are not always shown as allocated in the memory analyzer dumps - and it appears that while the Javascript layer releases them, the underlying implementation does not.

Perhaps related to: https://bugs.chromium.org/p/chromium/issues/detail?id=797876
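The attached script is not reproduced in this thread. The four steps to reproduce can be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's script: the names allocateBatch, releaseBatch and runCycles, the batch size of 40 and the one-second wait are all assumptions chosen for illustration. The constructor is passed in as a parameter so the loop can also be exercised outside a browser.

```javascript
// Hypothetical reproduction sketch; NOT the script attached to the report.
// PeerCtor is injected (pass RTCPeerConnection in a browser).

// Step 1: allocate a batch. If construction fails part-way, close what was
// already created and rethrow the original error.
function allocateBatch(PeerCtor, batchSize) {
  const batch = [];
  try {
    for (let i = 0; i < batchSize; i++) {
      batch.push(new PeerCtor());
    }
  } catch (err) {
    releaseBatch(batch);
    throw err;
  }
  return batch;
}

// Step 2: close every connection and drop the references so the objects
// become eligible for garbage collection.
function releaseBatch(batch) {
  for (const pc of batch) {
    pc.close();
  }
  batch.length = 0;
}

// Steps 3 and 4: wait, then repeat. Reports how many cycles completed and
// the error that stopped the loop, if any. Defaults are assumed values.
async function runCycles(PeerCtor, { batchSize = 40, cycles = 50, waitMs = 1000 } = {}) {
  for (let cycle = 0; cycle < cycles; cycle++) {
    let batch;
    try {
      batch = allocateBatch(PeerCtor, batchSize);
    } catch (err) {
      return { completedCycles: cycle, error: err };
    }
    releaseBatch(batch);
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  return { completedCycles: cycles, error: null };
}
```

In a DevTools snippet this could be started with runCycles(RTCPeerConnection).then(console.log). On an affected build the returned error would be expected to be the DOMException quoted above.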
,
Mar 26 2018
Able to reproduce this issue on the reported version 65.0.3325.181 on Mac 10.13.3, Windows 10 and Ubuntu 14.04 using the .js file given in comment#0. Attached the .js file as a snippet in DevTools and ran it. Seeing errors as shown in comment#0.

Good Build: 65.0.3300.0
Bad Build: 65.0.3301.0

You are probably looking for a change made after 525703 (known good), but no later than 525704 (first known bad).
CHANGELOG URL: https://chromium.googlesource.com/chromium/src/+log/c2c77bad1df1c6c3973853bc7bfb9ca604d08315..87fd6420466bab9e3cf9958db08ad64818d837b8
Reviewed-on: https://chromium-review.googlesource.com/838380

Suspecting same from changelog.
@hta: Please confirm whether this is an issue or intended behavior. Adding RB-Stable for M-65. Please remove if not the case. Thanks!
,
Mar 26 2018
,
Mar 26 2018
This seems likely to be a result of the hard limit on number of PeerConnections introduced in m65 - see release notes: https://groups.google.com/forum/#!msg/discuss-webrtc/QJHpBnGQPKk/oKR0pSD-CgAJ Bug: https://bugs.chromium.org/p/webrtc/issues/detail?id=8571 The lack of collection of PeerConnections is worrying, but doesn't seem like a blocker; we would expect this to have previously caused crashes. Removing blocker label and target 65 label; let's fix this for 66.
,
Mar 26 2018
Based on the above comment from hta@, updating the target milestone to M66.
,
Mar 28 2018
Here is a simple demo: https://jsfiddle.net/0tp6tzhu/

I see 2 issues here:
1) A hard limit is OK, but 500 active connections is not high enough (I'm running a DHT node in the browser and really need as many connections as the hardware can handle, so the hard limit should be more reasonable)
2) A limit of 500 on connections ever created (including closed ones, across all tabs in the process) is simply unacceptable; it looks like there is some kind of counter that is not decremented when a connection is closed

BTW, Firefox has no problem with creating and closing 10k connections using the demo linked above (didn't bother to check more).
,
Apr 1 2018
1) re active connections: You have the source, so you can compile a Chromium with the limit at any point you want, and tell us what higher limit works. We know that when we scaled it last, the crashes occurred first on Windows 7, so extra good if you can test on that platform. The test "fast/peerconnection/RTCPeerConnection-many" is the relevant layout test.
2) the problem of closed connections not being garbage collected needs investigation. This should not happen, but there might be a loop reference problem between datachannels and peerconnections. I'll investigate that.

It looks as if the sample you provided should be easily convertible into a layout test. Thanks for contributing that!

Lowering priority to 2.
,
Apr 2 2018
Experimentation shows that the problem reproduces in content_shell, and that inserting manual garbage collection cures it. So the problem is that garbage collection is not triggered when an excess of PeerConnections is accumulated, but memory pressure is otherwise low. Calling on @haraken for advice.
,
Apr 2 2018
How large is the PeerConnection object? If it's large, you can report the size to V8 using AdjustAmountOfExternalAllocatedMemory(). Then V8 takes into account the memory usage and triggers a GC.
,
Apr 2 2018
I'd say that it's ... very large :) Threads are created per instance, codecs initialized etc. I think it would be fair to put the price per-instance in the "a few MB" category.
,
Apr 2 2018
Then you should report the size to V8 with AdjustAmountOfExternalAllocatedMemory() :)
,
Apr 2 2018
never! :D sorry for not being clear. My comment was meant to be for hta@ if he's going to be making the change. All the stuff that happens inside of WebRTC isn't obvious from the Chromium side of things, so I just wanted to give a finger-in-the-air estimate.
,
Apr 3 2018
The issue here is that we're hitting the limit on # of threads, and were depending on the idea that "garbage collection happens once in a while" to do the final freeing of the deallocated objects. In cases like the demo in the bug, there's very little memory pressure, so garbage collection never comes into play.

In order to make GC happen at the right time, it seems that we would have to calculate the size as 1/500 of available memory, which would certainly work, but would not reflect reality very well (on a 64G machine, it would return a "size" for a PeerConnection of 128 Mbytes or so), and "feels hackish" - we're running out of threads, not memory.

Some other strategies that I see as possible, but want opinions on:
- When the # of peerconnections hits 9/10 of the possible PCs, trigger a garbage collection "by hand"
- Make it possible to iterate over all peerconnections in modules/RTCPeerConnection.cpp, and make the module have the ability to finalize destruction of those peerconnections that would have been GCed

Both seem hackish in their own way - the first may have side effects because GC is now called when memory is not full, the second seems like we're redoing what GC already does. What's the least hackish way of getting the behavior that we want?
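To make the trade-off concrete, here is a toy model of the "1/500 of available memory" size and of the first strategy (request a collection at 9/10 of the limit). It is a sketch in JavaScript, not Chromium's C++ code. PeerConnectionBudget, requestGc and sizeToReport are hypothetical names, and requestGc stands in for whatever mechanism would actually run a collection and call onDestruct for unreachable objects.

```javascript
// Toy model only; every name here is hypothetical and nothing below is
// Chromium code.

// The "size" a PeerConnection would need to report so that `limit` of them
// add up to all available memory.
function sizeToReport(availableBytes, limit) {
  return availableBytes / limit;
}

class PeerConnectionBudget {
  constructor(limit, requestGc) {
    this.limit = limit;
    this.gcThreshold = Math.floor((limit * 9) / 10); // 9/10 of the limit
    this.requestGc = requestGc; // assumed hook that may trigger onDestruct()
    this.live = 0;              // constructed and not yet destructed
  }

  // Called when a PeerConnection is constructed.
  onConstruct() {
    if (this.live >= this.gcThreshold) {
      this.requestGc(); // give the collector a chance before refusing
    }
    if (this.live >= this.limit) {
      throw new Error('Cannot create so many PeerConnections');
    }
    this.live++;
  }

  // Called when a PeerConnection is finally destructed (that is, collected).
  onDestruct() {
    this.live--;
  }
}
```

With a decimal 64 GB and a limit of 500, sizeToReport(64e9, 500) gives 128e6 bytes, which matches the "128 Mbytes or so" above. The model also shows the side effect mentioned in the comment: once the count sits at or above the threshold, every construction requests a collection, whether or not memory is under pressure.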
,
Apr 3 2018
,
Apr 10 2018
ping @haraken for further comment on possible solutions.
,
Apr 11 2018
A couple of questions:
- Why is RTCPeerConnection creating so many threads?
- Do we really need to support cases where developers create so many threads? e.g., if developers create 10000 workers, it's a bug of the website, not Chrome.
- What is the current limit of # of threads?
,
Apr 11 2018
@haraken Please consider that the initial report was not referring to creating that many (10000+) active connections, but rather to repeatedly creating and releasing the connection objects during a reasonable time span. It is not that hard to reach the 500-object hard limit by creating, releasing and re-creating a set of 60 connections (or say two per object for a set of 30 objects). Given the nature of the RTCPeerConnection JS object, where the configuration is passed through the constructor, a pool-based implementation is hard or impossible to achieve, so the only way to reset state is to recreate the connections. Thank you very much for looking into this.
,
Apr 11 2018
Do you have any explicit timing where you can release resources after the connection is released? Or do you really have to wait for a GC to happen to know that the resource is safe to be released?
,
Apr 11 2018
The thread resources come from the implementation of PeerConnection in WebRTC - I think it uses around 4 threads per PeerConnection, and this creates serious problems around 700 - 1400 PeerConnections, depending on platform. There are calls that can be done on a PeerConnection in any state, including "closed", that involve jumping between the threads - so it's not safe to release the threads until we know that all references to the PeerConnection are removed - which is the same time at which the object becomes available for garbage collection - and this is the reason we went with construction / destruction as the "countable events" that lead us to block creating more PeerConnections. But destruction only happens when garbage collection happens. So our problem now is that when memory pressure is low, but PeerConnections are being used up, GC is not called. We either have to call it or to emulate its effect. Which is less horrible? Related: Tommi@ is working on reducing the number of threads required, but that project doesn't have a guaranteed end date. We don't want to depend on that.
,
Apr 11 2018
> So our problem now is that when memory pressure is low, but PeerConnections are being used up, GC is not called. We either have to call it or to emulate its effect. Which is less horrible?

How is it possible to emulate it? To emulate a GC, you need to know a timing where it's safe to destruct the objects. However, you're saying that it's not possible without causing a GC...? I think that the right solution would be to reuse physical threads. What is making it harder?
,
Apr 17 2018
See comment #19 for "what's making it harder".
,
May 24 2018
I guess the fix missed the M66 target... Would it be an option to increase or remove the hard limit until a solution is found (like before M65), perhaps allowing the resources to be released through memory pressure at some point? Thank you.
,
May 25 2018
,
Aug 30
I am experiencing the same issue and here is test code which can show you the bug:
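var i = 1;
// Create a connection, then close it and drop the reference 10 ms later.
function createAndClosePeer() {
  var pc = new RTCPeerConnection();
  setTimeout(() => {
    pc.close();
    pc = null;
  }, 10);
  console.log(i++); // running count of connections created so far
}
// Create a new connection every 20 ms.
setInterval(createAndClosePeer, 20);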
It works without any issue in Firefox. But Chrome starts to complain "Cannot create so many PeerConnections" after about 10 seconds.
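The "about 10 seconds" figure is consistent with the limit of 500 discussed earlier in this thread. The arithmetic can be checked with a small simulation. It rests on two assumptions: that the counted limit is exactly 500, and that no garbage collection runs during the test, so close() never frees a counted slot. firstFailure is a hypothetical helper, and it ignores any other PeerConnections counted in the same process.

```javascript
// Toy timeline for the snippet above. Assumed: hard limit of 500 counted
// PeerConnections and no garbage collection while the test runs.
function firstFailure({ limit = 500, intervalMs = 20 } = {}) {
  let counted = 0; // constructed and not yet destructed
  for (let call = 1; ; call++) {
    if (counted >= limit) {
      // setInterval fires call number n at roughly n * intervalMs.
      return { failedCall: call, elapsedMs: call * intervalMs };
    }
    counted++; // the close() 10 ms later does not decrement this
  }
}

console.log(firstFailure()); // { failedCall: 501, elapsedMs: 10020 }
```

Under these assumptions the 501st construction is the first to fail, roughly 10 seconds after the interval starts.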
,
Sep 3
To help you triage this issue, here are a couple of links from around the web where people are seeing the same bug: https://github.com/webtorrent/webtorrent/issues/1349 https://stackoverflow.com/questions/49732647/webrtc-wrong-active-unactive-rtcpeerconnection-limit
Comment 1 by susan.boorgula@chromium.org
, Mar 26 2018