Chrome is sometimes closing "idle" HTTP/2 sessions earlier than expected
Issue description

I haven't yet tracked down the specific details to make sure it isn't something in the WebPageTest environment that is triggering it, but Wikimedia is seeing cases in their multi-step tests where the H2 connections are being closed between individual steps when there is a 5-second idle gap. That seems a bit aggressive if it is something we do regularly, particularly for H2 connections.

The WebPageTest issue is here: https://github.com/WPO-Foundation/webpagetest/issues/823

This test shows the issue on runs 2-5 but not run 1 (all have netlogs included): http://wpt.wmftest.org/result/170310_S2_B7/

From one of the netlogs:

t= 8656 [st= 342] HTTP2_SESSION_RECV_DATA
                  --> fin = false
                  --> size = 1542
                  --> stream_id = 35
t= 8656 [st= 342] HTTP2_SESSION_UPDATE_RECV_WINDOW
                  --> delta = -1542
                  --> window_size = 15722072
t= 8656 [st= 342] HTTP2_SESSION_RECV_DATA
                  --> fin = true
                  --> size = 0
                  --> stream_id = 35
t= 8661 [st= 347] HTTP2_STREAM_UPDATE_RECV_WINDOW
                  --> delta = 2954
                  --> window_size = 15725026
t= 8662 [st= 348] HTTP2_STREAM_UPDATE_RECV_WINDOW
                  --> delta = 1258
                  --> window_size = 15726284
t= 8664 [st= 350] HTTP2_STREAM_UPDATE_RECV_WINDOW
                  --> delta = 814
                  --> window_size = 15727098
t= 8666 [st= 352] HTTP2_STREAM_UPDATE_RECV_WINDOW
                  --> delta = 1542
                  --> window_size = 15728640
t=14969 [st=6655] HTTP2_SESSION_CLOSE
                  --> description = "Closing idle sessions."
                  --> net_error = -3 (ERR_ABORTED)
t=14970 [st=6656] HTTP2_SESSION_POOL_REMOVE_SESSION
                  --> source_dependency = 850 (HTTP2_SESSION)

As best as I can tell, the "Closing idle sessions." decision is a local one (working through the pool logic now to see if I can figure out what triggered it). The netlogs are a bit noisy because WebPageTest uses a localhost connection for interacting with its extension, but that is a long-lived connection, so it doesn't look like we are running out of socket pool space (which would force a purge).
Mar 10 2017
Still winding my way through the session management code to see how a decision is made to call into "closeIdleSockets", but as best as I can tell, at the H/2 layer a connection is considered "idle" if there are no currently active streams, so it will be "idle" immediately after the last pending stream completes.
Mar 10 2017
Sorry for the stream-of-consciousness. Most of the flows that close "idle" connections don't look like they would trigger here (and none appear to be timer based). The one possibility that is currently standing out is OnPurgeMemory() (i.e. memory pressure): https://cs.chromium.org/chromium/src/net/http/http_network_session.cc?type=cs&l=489

A better understanding of the memory coordinator, to see when it would decide to call that, would help. From the comments around OnPurgeMemory():

// Called to purge memory.
// This callback should free up any memory that is used as an optimization, or
// any memory whose contents can be reproduced.

The machines have at least 3.5GB of RAM and aren't doing all that much, so if we're tripping some form of memory pressure that would be concerning. That might match with why it only happens on the agents after a while, though.
Mar 10 2017
+Sami because it looks like it is getting into scheduler and timer throttling space... Following the chain, it looks like memory is purged when renderers are backgrounded and timers are suspended: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/renderer/renderer_scheduler_impl.cc?type=cs&l=1583

Given that the net stack is not tied to renderers, purging the memory in those conditions (and closing all open but not currently transmitting H2 streams) seems extreme. I still don't know if/why that would trigger in this case, since it is a single tab navigating across 3 pages on the same domain, but that feels like the most likely triggering path.
Mar 10 2017
+Kentaro, sounds like this is related to purge & suspend?
Mar 10 2017
+tasak: Would you take a look at this?
Mar 14 2017
Looking.
Mar 14 2017
I would like to know which Chrome was used for the test, e.g. Chrome version, trunk/dev/beta/stable, Windows/Linux/macOS, and so on.
Mar 14 2017
I see. From the netlog:

"cl": "0e9a9a6f3676ae439b78cd9b3f62b4193c3ac7d5-refs/branch-heads/2924@{#895}"
"name": "Google Chrome"
"official": "official"
"os_type": "Windows NT: 6.1.7601 SP1 (x86_64)"
"version": "56.0.2924.87"
"version_mod": "canary"

Command line:
... --disable-background-networking --no-default-browser-check --no-first-run --process-per-tab --new-window --silent-debugger-extension-api --disable-infobars --disable-translate --disable-notifications --disable-desktop-notifications --allow-running-insecure-content --disable-component-update --disable-background-downloads --disable-add-to-shelf --disable-client-side-phishing-detection --disable-datasaver-prompt --disable-default-apps --disable-domain-reliability --safebrowsing-disable-auto-update --disable-background-timer-throttling --host-rules="MAP cache.pack.google.com 127.0.0.1"
Mar 14 2017
I'm not able to reproduce this issue yet. I tried, but I only saw "Closing all sessions." ... I would also like to know whether it is possible to reproduce locally (for bisecting and so on).

I'm not sure whether WebPageTest is running in background tabs or not. However, as far as I investigated (by looking at the netlog), this was not caused by background timer throttling, because of --disable-background-timer-throttling. The netlog also says os_type is Windows NT, so "purge when timer queue is suspended" is disabled; the feature is currently enabled only for Android devices. Talking about purge-and-throttle, the feature was enabled with purge-and-suspend-time=30hours for 3 months (i.e. Dec 2016, Jan 2017, and Feb 2017). So does WebPageTest reuse backgrounded inactive tabs without activating them?
Mar 14 2017
As far as reproducing goes, they lost the agent that could reliably reproduce it before we could collect a trace, and it usually takes a few weeks before they see it happen again. Sounds like it might be back to the drawing board to try to figure out the code path that was invoked to get net to close all of the idle connections. Initially I thought there was a 5-second session timer somewhere, because the point where the connection got dropped was ~5 seconds after the last activity, and that could explain the random behavior they see, but I couldn't find anything in net that decides to close the idle connections on its own.

WebPageTest doesn't re-use anything. At most there was ~30 seconds in that test, and to get there it started with a completely new process and a completely clean profile directory. WPT does have an extension with a background HTML page, but there are no user-facing tabs other than the one being tested. Whatever went wrong with the throttling in 55 hit WPT pretty hard, and services like SpeedCurve had to roll back to 54 until 56 came out, so something was tripping even without background user tabs. That's when I added the flag to try to help mitigate things going forward.

Memory pressure is a possibility if there is a leak on the machine that accumulates over time, but I checked a few machines that have been running for months and don't see any memory issues. I was hoping to get a trace with memory-infra events to see if that was the case. I'll poke around the code some more to see if I can find another code path in, but a definitive cause may have to wait until they can reproduce it again.
Mar 16 2017
I see. Thank you for the explanation. Currently I'm planning to add more TRACE_EVENTs (related to memory). I expect that these events will make it possible for us to see whether an unexpected "purge" is invoked or not.
Aug 25 2017
I vote that we archive this bug, because we're just waiting for it to happen again, but I'm keeping it available for now.
Jul 30
Comment 1 by b...@chromium.org, Mar 10 2017