New issue
Advanced search Search tips

Issue 700360 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug



Sign in to add a comment

Chrome is sometimes closing "idle" HTTP/2 sessions earlier than expected

Project Member Reported by pmeenan@chromium.org, Mar 10 2017

Issue description

I haven't tracked down the specific details to make sure it's not something in the WebPageTest environment that is triggering it yet but Wikimedia is seeing cases in their multi-step tests where the H2 connections are being closed between individual steps when there is a 5 second idle gap.

That seems a bit aggressive if it is something that we do regularly, particularly for H2 connections.

The WebPageTest issue is here: https://github.com/WPO-Foundation/webpagetest/issues/823

This test shows the issue on runs 2-5 but not run 1 (all have netlogs included): http://wpt.wmftest.org/result/170310_S2_B7/

From one of the netlogs:

t= 8656 [st= 342]    HTTP2_SESSION_RECV_DATA
                     --> fin = false
                     --> size = 1542
                     --> stream_id = 35
t= 8656 [st= 342]    HTTP2_SESSION_UPDATE_RECV_WINDOW
                     --> delta = -1542
                     --> window_size = 15722072
t= 8656 [st= 342]    HTTP2_SESSION_RECV_DATA
                     --> fin = true
                     --> size = 0
                     --> stream_id = 35
t= 8661 [st= 347]    HTTP2_STREAM_UPDATE_RECV_WINDOW
                     --> delta = 2954
                     --> window_size = 15725026
t= 8662 [st= 348]    HTTP2_STREAM_UPDATE_RECV_WINDOW
                     --> delta = 1258
                     --> window_size = 15726284
t= 8664 [st= 350]    HTTP2_STREAM_UPDATE_RECV_WINDOW
                     --> delta = 814
                     --> window_size = 15727098
t= 8666 [st= 352]    HTTP2_STREAM_UPDATE_RECV_WINDOW
                     --> delta = 1542
                     --> window_size = 15728640
t=14969 [st=6655]    HTTP2_SESSION_CLOSE
                     --> description = "Closing idle sessions."
                     --> net_error = -3 (ERR_ABORTED)
t=14970 [st=6656]    HTTP2_SESSION_POOL_REMOVE_SESSION
                     --> source_dependency = 850 (HTTP2_SESSION)

As best as I can tell, the "Closing idle sessions." decision is a local one (working through the pool logic now to see if I can figure out what triggered it).

The netlogs are a bit noisy because WebPageTest uses a localhost connection for interacting with it's extension but that is a long-lived connection so it doesn't look like we are running out of socket pool space (to force a purge).
 

Comment 1 by b...@chromium.org, Mar 10 2017

Cc: b...@chromium.org
Still winding my way through the session management to see how a decision is made to call into "closeIdleSockets" but as best as I can tell, at the H/2 layer it considers a connection "idle" if there are no currently active streams so it will be "idle" immediately after the last pending stream completes.
Sorry for the stream-of-consciousness.  Most of the flows that close "idle" connections don't look like they would trigger (and there are none that appear timer based).

THe one possiblity that is currently standing out is OnPurgeMemory() (or memory pressure): https://cs.chromium.org/chromium/src/net/http/http_network_session.cc?type=cs&l=489

Better understanding of the memory coordinator to see when it would decide to call that would help.  From the comments around OnPurgeMemory():

  // Called to purge memory.
  // This callback should free up any memory that is used as an optimization, or
  // any memory whose contents can be reproduced.

The machines have at least 3.5GB of RAM and aren't doing all that much so if we're tripping some form of memory pressure that would be concerning.  That might match with why it only happens with the agents after a while though.
Cc: skyos...@chromium.org
+Sami because it looks like it is getting into scheduler and timer throttling space...

Following the chain, it looks like memory is purged when renders are backgrounded and timers are suspended: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/renderer/renderer_scheduler_impl.cc?type=cs&l=1583

Given that the net stack is not tied to renders, purging the memory in those conditions (and closing all open but not transmitting H2 streams) seems extreme.

I still don't know if/why that would trigger in this case since it is a single tab navigating across 3 pages on the same domain but that feels like the most likely triggering path.
Summary: Chrome is sometimes closing "idle" HTTP/2 sessions earlier than expected (was: Chrome is sometimes closing "idle" HTTP/2 sessions after ~5 seconds)
Cc: haraken@chromium.org
+Kentaro, sounds like this is is related to purge & suspend?
Owner: tasak@google.com
+tasak: Would you take a look at this?

Comment 8 by tasak@google.com, Mar 14 2017

Looking.

Comment 9 by tasak@google.com, Mar 14 2017

I would like to know what chrome was used for the test, e.g. chrome version, trunk/dev/beta/stable, windows/Linux/MacOS, and so on.

Comment 10 by tasak@google.com, Mar 14 2017

I see.
"cl":"0e9a9a6f3676ae439b78cd9b3f62b4193c3ac7d5-refs/branch-heads/2924@{#895}"
"... --disable-background-networking --no-default-browser-check --no-first-run --process-per-tab --new-window --silent-debugger-extension-api --disable-infobars --disable-translate --disable-notifications --disable-desktop-notifications --allow-running-insecure-content --disable-component-update --disable-background-downloads --disable-add-to-shelf --disable-client-side-phishing-detection --disable-datasaver-prompt --disable-default-apps --disable-domain-reliability --safebrowsing-disable-auto-update --disable-background-timer-throttling --host-rules=\"MAP cache.pack.google.com 127.0.0.1\" 
name":"Google Chrome","official":"official","os_type":"Windows NT: 6.1.7601 SP1 (x86_64)\",
"version":"56.0.2924.87","version_mod":"canary"

Comment 11 by tasak@google.com, Mar 14 2017

I'm not able to reproduce this issue yet. I tried, but I saw only "Closing all sessions."
... I would like to know whether it is possible to reproduce locally or not too. (for bisecting and so on...)

I'm not sure whether the WebPageTest is running on background tabs or not.
However, as far as I investigated (by looking at the netlog), this was not caused by background-timer-throttling because of --disable-background-timer-throttling.
The netlog also says, os_type is Windows NT. So "purge when timer queue is suspended" is disabled. The feature is currently enabled for only Android devices.

Talking about purge-and-throttle, the feature was enabled with purge-and-suspend-time=30hours for 3 months (i.e. Dec 2016, Jan 2017, and Feb 2017). So the WebPageTest reuses backgrounded inactive tabs without activating the tabs?
Cc: tasak@google.com
Owner: ----
As far as reproducing, they lost the agent that could currently reproduce it reliably before we could collect a trace and it takes a few weeks usually before they see it happen again.

Sounds like it might be back to the drawing board to try to figure out the code path that was invoked to get net to close all of the idle connections.  Initially I thought there was a 5-second session timer somewhere because the point where the connection got dropped was ~5 seconds after the last activity and that could explain the random behavior they see but I couldn't find anything in net that decides to close the idle connections on it's own.

WebPageTest doesn't re-use anything.  At most there was ~30 seconds in that test and to get there it started with a completely new process and a completely clean profile directory.

WPT does have an extension with a background HTML page but there are no user-facing other tabs other than the one being tested.  Whatever went wrong with the throttling in 55 bit WPT pretty hard and services like SpeedCurve had to roll back to 54 until 56 came out so something was tripping even without background user tabs.  That's when I added the flag to try to help mitigate things going forward.

Memory pressure is a possibility if there is a leak on the machine that accumulates over time but I checked a few machines that I've had running for months and don't see any memory issues.  I was hoping to get a trace with memory-infra events to see if that was the case.

I'll poke around the code some more to see if I can find another code path in but a definitive cause may have to wait until they can reproduce it again.

Comment 13 by tasak@google.com, Mar 16 2017

I see. Thank you for explanation.

Currently I'm planning to add more TRACE_EVENTs (related to memory). I expect that the events makes it possible for us to see whether unexpected "purge" is invoked or not.

Labels: OS-All
Status: Available (was: Untriaged)
I vote that we archive this bug because we're just waiting for it to happen again. But making it available for now.
Cc: -pmeenan@chromium.org

Sign in to add a comment