WebSockets stop working after a while. Need to kill renderer process.
Reported by bj...@asana.com, Mar 15 2017
Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36

Steps to reproduce the problem:
1. Use a product that utilizes WebSockets - in our case, https://app.asana.com. In all cases we've seen, users have had multiple tabs open.
2. After some undetermined amount of time, WebSocket connections stop working. It's unclear whether they get disconnected all at once or not, but what we do know is that re-establishing WebSockets does not work.
3. When trying to re-establish the connection, the console logs `WebSocket connection to 'wss://<url>' failed: WebSocket is closed before the connection is established`, and the WebSocket Frames tab reports Opcode -1.
4. The problem is resolved either by restarting Chrome or by opening the Task Manager and killing all worker processes for the site [a].

[a]: When killing worker processes, we've seen multiple worker processes for the site, and some of them also have tabs from other sites. This may be normal under the current process-sharing model, but calling it out in case it's relevant.

What is the expected behavior?
Expected the WebSocket to be successfully re-established.

What went wrong?
The WebSocket connection could not be re-established, and it appears there is some bad state in the worker process(es). To my knowledge, this has happened ~10 times in an office of 150 people using the product in the last week. There are likely other cases I haven't heard of (where users restarted their browser without reporting). As of 10 days ago, I'd seen this once, ever. I thought this might be related to the 57 release, but the last occurrence was for a user on 56. Our product is also used by our customers, who are possibly also affected. As a result, we're very motivated to help diagnose this problem further. Are there any diagnostics we can run once we hit this again that would help pinpoint the problem?

Did this work before? Yes / Unknown.
Does this work in other browsers? N/A
Chrome version: 57.0.2987.98
Channel: stable
OS Version: OS X 10.10.5
Flash Version:
Mar 16 2017
- Are you using DedicatedWorker or SharedWorker? Both?
- How many WebSocket connections do you have?
- Could you provide a net-internals log? https://dev.chromium.org/for-testers/providing-network-details

+ricea@ as this may be related to the WebSocket throttling.
Mar 16 2017
> Are you using DedicatedWorker or SharedWorker? Both?

Are you referring to web workers? We're not using web workers, but I see now that I chose poor language above. Instead of "worker process", I should have said "renderer process". That is: "The problem is resolved either by restarting Chrome or by opening the Task Manager and killing all RENDERER processes for the site."

> How many websocket connections do you have?

The app uses a single WebSocket connection per tab. However, most users have multiple tabs open, so at any given time we expect as many WebSocket connections as there are open tabs to the app.

> Could you provide a net-internals log? - https://dev.chromium.org/for-testers/providing-network-details

This is currently difficult, as we don't have a repro; this happens seemingly randomly. We are trying to come up with a repro, and if we do, I'll submit a capture.
Mar 16 2017
Thank you for providing more feedback. Adding requester "yhirano@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Mar 16 2017
The `WebSocket is closed before the connection is established` error message only seems to appear in Chromium's codebase under closeInternal, so one might speculate that our application is calling .close() on the WebSocket. While this is possible (we do call .close() under certain conditions), I find it unlikely, since once you're in this state reloading the tab does not help: the application is unable to open the WebSocket until the renderer process is killed or Chrome is restarted.

I've had one report of a user who left the application running for "6 to 7 minutes" and then suddenly it was able to connect again. Our application basically retries forever, so it sounds like maybe this can unwedge itself under some other conditions (time based?). It also might be more likely to happen if a laptop has been woken up after sleeping, but this is still speculative.

In any case, we've added logging around the places we call .close() on WebSockets just to be sure, so we'll hopefully have more data on the next occurrence.
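The "logging around the places we call .close()" mentioned above could look something like the following minimal sketch. This is an illustration, not Asana's actual code; `closeWithLog` and the log shape are invented names:

```javascript
// Hypothetical logging wrapper: every .close() call is recorded with a
// reason and the socket's readyState at the time, so the next occurrence
// tells us whether our own code closed the socket while it was still
// CONNECTING (readyState 0), which is the case that produces the
// "closed before the connection is established" warning.
const closeLog = [];

function closeWithLog(ws, reason) {
  closeLog.push({ reason, readyState: ws.readyState, at: Date.now() });
  ws.close();
}
```

Routing every close through one function like this also gives a single place to count how often each close reason fires.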
Mar 16 2017
I assume you only retry after a delay? Otherwise your retries would get throttled. If your application times out waiting for onopen and calls close(), that would explain the error message you are seeing.
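The timeout-then-close pattern described here, where an application gives up waiting for onopen and closes a still-CONNECTING socket, can be sketched as follows. The function name and callback shape are invented for illustration; only the readyState check and close() behavior come from the standard WebSocket API:

```javascript
// Hypothetical connect-timeout guard. If onopen has not fired within
// timeoutMs, close() is called on a socket still in the CONNECTING state
// (readyState 0), which is exactly what triggers Chrome's
// "WebSocket is closed before the connection is established" message.
function connectWithTimeout(createSocket, timeoutMs, onResult) {
  const ws = createSocket();
  const timer = setTimeout(() => {
    if (ws.readyState === 0 /* CONNECTING */) {
      ws.close(); // closing a CONNECTING socket logs the warning
      onResult(new Error('connect timed out'), null);
    }
  }, timeoutMs);
  ws.onopen = () => {
    clearTimeout(timer);
    onResult(null, ws);
  };
}
```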
Mar 16 2017
> I assume you only retry after a delay? Otherwise your retries would get throttled.

Correct, we do exponential backoff, up to 30 seconds. However, about a month ago there were some changes in how the retries were kicked off: whether we'd call .close() and wait for onclose before reconnecting, or whether we'd just call .close(), throw the old socket away, and create a new one. Maybe that could cause half-closed WebSockets to pile up? Will look into that tomorrow.

> If your application times out waiting for onopen and calls close(), that would explain the error message you are seeing.

While we do have such logic, it looked as if we were getting the error immediately when attempting to open the socket (as opposed to after 10 seconds, when our timeout is supposed to fire). I will verify when this happens again (and the added logging will help there).

Are there docs / commits you could point me to so that I could better understand how WebSockets are throttled?
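The retry policy described above (exponential backoff capped at 30 seconds) reduces to a small pure function. The base delay below is an assumption for illustration; the 30-second cap is the one stated in the comment:

```javascript
// Hypothetical reconnect-delay calculation: doubles on each attempt and
// is capped at 30 seconds, per the policy described in the thread.
// The 500 ms base delay is an assumed value, not Asana's actual one.
function reconnectDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

Adding random jitter to each delay is a common refinement so that many tabs do not retry in lockstep.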
Mar 16 2017
The design doc is https://docs.google.com/document/d/1aw2oN5PKfk-1gLnBrlv1OwLA8K3-ykM2ckwX2lubTg4/edit The "Delay calculation" section will probably be of most interest to you.
Mar 16 2017
Thanks for the doc; that was helpful. Given that the max delay is 5 seconds and our cancellation timer is 10 seconds, this is probably not throttling.

I'm currently suspicious of a) the fact that Chrome allows a max of 255 sockets in total, and b) the fact that our reconnect logic no longer waits for the onclose event before creating a new WebSocket. I wonder whether we might be leaking half-closed connections.

Is there a way to get a list of all WebSockets? chrome://net-internals/#sockets does not seem to list the WebSocket connections. Thank you!
Mar 17 2017
You can see the list of WebSocket connections on the devtools Network tab, but devtools need to be opened before making a connection.
Mar 17 2017
s/devtools need to be opened/the Network tab needs to be opened/
Mar 17 2017
#9: Another thing you could check is whether affected users are using proxies. Behind a proxy you can only have 32 concurrent WebSocket connections.

I think in general it's fine to call close() and then discard the WebSocket. We've had reports that some servers don't close the TCP/IP connection after sending the Close frame; Chrome won't issue onclose until the connection is actually gone, which you may not want to wait for. Keeping track of how many sockets are in that state would be a good defensive programming measure.
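The defensive measure suggested here, counting sockets that have been close()d and discarded but whose onclose has not yet fired, could be sketched like this. The class and its names are invented for illustration; only addEventListener('close', ...) and close() are standard WebSocket API:

```javascript
// Hypothetical tracker for "half-closed" sockets: close() has been
// called, but the server has not yet torn down the TCP connection, so
// the browser has not fired the close event. The count exposes how many
// discarded sockets Chrome is still holding open.
class HalfClosedTracker {
  constructor() {
    this.pending = new Set();
  }
  // Close a socket we are giving up on, and track it until onclose fires.
  discard(ws) {
    this.pending.add(ws);
    ws.addEventListener('close', () => this.pending.delete(ws));
    ws.close();
  }
  get count() {
    return this.pending.size;
  }
}
```

Reporting this count alongside error logs would show whether discarded sockets are piling up toward Chrome's 255-socket limit.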
Mar 22 2017
I work with Björn and am carrying on efforts to solve this issue.

> You can see the list of WebSocket connections on the devtools Network tab, but devtools need to be opened before making a connection.

We're aware of the list of WebSocket connections on the Network tab, but it can't be looked at when we get a repro (the window wasn't open before the crash). We were wondering if chrome://net-internals/#sockets would tell us something about the number of open WebSocket connections.

> I think in general it's fine to call close() and then discard the WebSocket. We've had reports that some servers don't close the TCP/IP connection after sending the Close frame. Chrome won't issue onclose until the connection is actually gone, which you may not want to wait for. Keeping track of how many sockets are in that state would be a good defensive programming measure.

We're now tracking the number of WebSockets that we've given up on but haven't received a "close" event for yet. I'll update this thread once we have more conclusive information about this. Thank you for all the help so far!
Apr 17 2017
Thanks. Adding Needs-Feedback label.
Jun 23 2017
Hey! We just had another internal repro of this and learned some things from it:

During the incident the user had 1 Asana tab open. In any tab, all WebSockets to wss://foo.sync-beta.asana.com were stuck in the CONNECTING state. WebSocket connections to wss://foo.sync.asana.com worked fine (as did other websites that use WebSockets).

I now realize I missed a chance to grab a net-internals log, but I did grab some screenshots (attached). What we saw was that there were multiple sockets that seemed to be half-connected to sync-beta.app.asana.com. When we clicked "Flush socket pools" the problem totally disappeared.

Is there more data we can gather to help diagnose this? We hit this pretty rarely, but next time I'll take a full net-internals log. Will it still be useful even though the problem will have already started?

Thanks!
Alex
Jun 23 2017
Also my coworker Justin dug into this a bit more and thinks that the mutex at https://chromium.googlesource.com/chromium/src/net/+/master/socket/websocket_transport_connect_sub_job.cc#98 could be related, because it guards connections for an IP address + port pair. He says:

> Here's a comment saying basically that: https://chromium.googlesource.com/chromium/src/net/+/master/socket/websocket_endpoint_lock_manager.h#25
>
>> Keep track of ongoing WebSocket connections in order to satisfy the WebSocket connection throttling requirements described in RFC6455 4.1.2: If the client already has a WebSocket connection to the remote host (IP address) identified by /host/ and port /port/ pair, even if the remote host is known by another name, the client MUST wait until that connection has been established or for that connection to have failed. There MUST be no more than one connection in a CONNECTING state. If multiple connections to the same IP address are attempted simultaneously, the client MUST serialize them so that there is no more than one connection at a time running through the following steps.
>
> This is consistent with several pieces of evidence we saw today in my repro:
> - The failed URL_REQUEST event has load_state = 3 (WAITING_FOR_AVAILABLE_SOCKET), which is exactly the state returned when you're waiting for the lock. Similarly, you get net_error = -1 (ERR_IO_PENDING), which is returned only when something else is currently trying to connect over the WebSocket endpoint.
> - We saw that connections to sync.app.asana.com could be created, but not to sync-beta, probably because sync-beta was the stuck one and sync resolves to a different IP address (and therefore a different mutex).
> - Flushing the socket pool erased all WebSockets and therefore freed the mutex.
>
> So the question then becomes: what is causing this mutex not to be released? Why is a WebSocket attempting connection for such a long time before timing out?
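The RFC 6455 4.1.2 serialization quoted above can be modeled as a per-endpoint promise queue. This is a simplified behavioral sketch, not Chromium's actual C++ implementation; it illustrates why one stuck CONNECTING attempt to an endpoint blocks every later attempt to the same IP:port while attempts to other endpoints proceed:

```javascript
// Hypothetical model of a per-endpoint connect lock: at most one
// connection per "ip:port" key may be in progress at a time, so later
// attempts to the same endpoint queue behind the earlier one.
class EndpointLock {
  constructor() {
    this.tail = new Map(); // endpoint -> promise for the last queued attempt
  }
  // Runs connectFn only after all earlier attempts to the same endpoint
  // have settled. Attempts to other endpoints are not blocked.
  withLock(endpoint, connectFn) {
    const prev = this.tail.get(endpoint) || Promise.resolve();
    const run = prev.then(connectFn, connectFn);
    // Keep the chain alive even if this attempt fails.
    this.tail.set(endpoint, run.catch(() => {}));
    return run;
  }
}
```

In this model, an attempt that never settles (a socket stuck in CONNECTING with no timeout) leaves its endpoint's chain blocked forever, which matches the "Flush socket pools fixes it" observation.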
Jun 26 2017
Thanks for the extra investigation. There is supposed to be a 4 minute timeout, so after 4 minutes the first connection to sync-beta.app.asana.com will fail and release the lock. Can you confirm whether a single connection is stuck for more than 4 minutes? I am wondering if there is some particular set of circumstances in which the 4 minute timeout is not effective.
Jun 26 2017
> Can you confirm whether a single connection is stuck for more than 4 minutes?

Sure, I can try to check that next time (although I don't see repros of this very often, so it might be a while). So the expected behavior is that:

1. A WebSocket gets stuck and holds this lock.
2. 4 minutes later, the lock times out.
3. At this point the existing connections blocked on the lock succeed and we are back online.

Is that correct? If it is, does it matter that by the time this has completed our app will have created a decent number more WebSockets? As said earlier, we notice when they are stuck in CONNECTING, give up after 10 sec, call close(), and make another (with a small backoff). So if this is happening we could have a decent number of WebSockets around after the 4 min timeout hits.
Jun 27 2017
Closing the WebSocket should release the lock.

I think we may have a problem with SSL sockets. Unencrypted WebSockets use early binding, meaning the connection job is associated with the WebSocket and will be stopped if the WebSocket is closed. But SSL WebSockets may still be using late binding like HTTP does; in the case of HTTP, cancelling the request does not abort the connect job because they are not associated yet.

I will try to reproduce this. If this is indeed the problem, it will not be at all easy to fix.
Jul 4 2017
Comment 1 by manoranj...@chromium.org, Mar 15 2017