HTTP/2 Requests getting stalled
Issue description

Bloomberg reported seeing intermittent 60-second stalls in the middle of their waterfalls in their WebPageTest testing (HTTP/2 with Akamai on the other end). I reproduced it on both Windows and Android.

Windows (run #2): https://www.webpagetest.org/result/170517_9Q_f1ce70ea13ce17adccc75a49a613aa28/
Android (run #3): https://www.webpagetest.org/result/170517_V4_86428acf21e259308b08b632df6d2c0c/

The Windows test has a netlog (also attached to this bug) and a tcpdump with TLS session keys. The "interesting" events that I've seen so far (content clipped) are...

#549, which is the URL_REQUEST for the request right after the stall resumes:

t= 7290 [st=    0] +REQUEST_ALIVE  [dt=73488]
                    --> priority = "HIGHEST"
                    --> url = "https://assets.bwbx.io/s3/javelin/public/javelin/css/body/basic/newsletter_tout-8a5015009b.mobile.css"
t= 7291 [st=    1]  HTTP_STREAM_JOB_CONTROLLER_BOUND
                    --> source_dependency = 552 (HTTP_STREAM_JOB_CONTROLLER)
t=80399 [st=73109]  HTTP_STREAM_REQUEST_BOUND_TO_JOB
                    --> source_dependency = 553 (HTTP_STREAM_JOB)

#552 HTTP_STREAM_JOB_CONTROLLER:

t= 7291 [st=    0] +HTTP_STREAM_JOB_CONTROLLER  [dt=73108]
                    --> is_preconnect = false
                    --> url = "https://assets.bwbx.io/s3/javelin/public/javelin/css/body/basic/newsletter_tout-8a5015009b.mobile.css"
t= 7291 [st=    0]  HTTP_STREAM_JOB_CONTROLLER_BOUND
                    --> source_dependency = 549 (URL_REQUEST)
t= 7291 [st=    0]  HTTP_STREAM_REQUEST_STARTED_JOB
                    --> source_dependency = 553 (HTTP_STREAM_JOB)
t=80399 [st=73108] -HTTP_STREAM_JOB_CONTROLLER

#553 HTTP_STREAM_JOB:

t= 7291 [st=    0] +SOCKET_POOL  [dt=73106]
t= 7291 [st=    0]  SOCKET_POOL_STALLED_MAX_SOCKETS_PER_GROUP
t=80397 [st=73106]  SOCKET_POOL_BOUND_TO_CONNECT_JOB
                    --> source_dependency = 2113 (SSL_CONNECT_JOB)
t=80397 [st=73106]  SOCKET_POOL_BOUND_TO_SOCKET
                    --> source_dependency = 2120 (SOCKET)
t=80397 [st=73106] -SOCKET_POOL

There are, however, several other (previous) stream jobs for the same server that are coalesced to an existing SPDY connection. Something bad is going on with the socket pools interacting with HTTP/2 and requests getting tied to stalled connection requests.
May 17 2017
Adding socket pool wizard davidben@ for another set of eyes. I agree that this is confusing. In particular, 460 HTTP_STREAM_JOB logs a SOCKET_POOL_CONNECTING_N_SOCKETS event at t=7202, but subsequent HTTP_STREAM_JOBs, starting with 533, log a SOCKET_POOL_STALLED_MAX_SOCKETS_PER_GROUP. This does not make sense to me: they should bind to the existing socket.
May 17 2017
bnc: That's expected. There are a ton of HTTP_STREAM_JOBs for https://assets.bwbx.io/. The first 6 get the first 6 socket slots, and all the others wait patiently in line behind them. Then when we finally get a single connection 1 second later, ~14 requests (the first 6 and 8 others) start using the new H2 session immediately. The question is why all the other HTTP_STREAM_JOBs keep waiting, rather than being assigned to the H2 session at that point as well. Doesn't seem like a socket pool issue, but rather an issue with the HttpStreamFactory* classes.
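The queueing behavior described above can be sketched as follows (a minimal standalone illustration with invented names; this is not the actual ClientSocketPoolBase code): each group of connections to one origin gets at most 6 socket slots, and every request beyond that stalls in a per-group queue until a slot frees up.

#include <deque>
#include <iostream>
#include <map>
#include <string>

constexpr int kMaxSocketsPerGroup = 6;  // Chrome's per-origin connection limit.

struct Group {
  int active_slots = 0;             // Sockets connecting or in use.
  std::deque<std::string> pending;  // Requests waiting for a free slot.
};

class SocketPool {
 public:
  void RequestSocket(const std::string& group_name, const std::string& id) {
    Group& group = groups_[group_name];
    if (group.active_slots < kMaxSocketsPerGroup) {
      ++group.active_slots;
      std::cout << id << ": got a slot, connecting\n";
    } else {
      // Mirrors SOCKET_POOL_STALLED_MAX_SOCKETS_PER_GROUP in the netlog.
      group.pending.push_back(id);
      std::cout << id << ": stalled, queued behind "
                << group.pending.size() - 1 << " other(s)\n";
    }
  }

 private:
  std::map<std::string, Group> groups_;
};

int main() {
  SocketPool pool;
  // The first 6 jobs get slots; jobs 7 and 8 wait in line.
  for (int i = 1; i <= 8; ++i)
    pool.RequestSocket("ssl/assets.bwbx.io:443", "job" + std::to_string(i));
}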
May 17 2017
Also, requests made after the H2 session is created all correctly use the session, so the issue is just with all but the first 14 requests made before the H2 session was created.
May 17 2017
One thing of note is that id:553 eventually wakes up at t=80397. This is close to t=79224 which is when the H2 session at id:538 is closed (I guess the server decided to time out an idle connection).
May 17 2017
id:538 is an example HTTP_STREAM_JOB which is created before the H2 session and doesn't get mapped. It works because attempting to create a socket fails with ERR_SPDY_SESSION_ALREADY_EXISTS. That comes from here: https://cs.chromium.org/chromium/src/net/http/http_stream_factory_impl_job.cc?rcl=68b3fab8a62e3e060690a7a0ea89fa6564b2f49f&l=528

So that's probably part of the story. If the H2 session is created after we pass the OnHostResolution point, we never actually notice that a second one was created. That sounds like a bug. We should probably wake up all the applicable HTTP_STREAM_JOBs wherever they are in the connect process.

This does raise another question: why aren't those other 5 socket slots released before t=80397? The idle sockets are all closed earlier, I think due to this line: https://cs.chromium.org/chromium/src/net/http/http_stream_factory_impl_job.cc?rcl=68b3fab8a62e3e060690a7a0ea89fa6564b2f49f&l=1188

(Though fixing just the second issue would result in us making a bunch of SpdyStreams and, depending on whether we keep them both around, seems like it'd still cause this stall case.)
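As an illustration of the OnHostResolution check linked above, here is a minimal sketch (invented names and a stand-in error value; the real check also verifies that the existing session's certificate covers the new host, which is omitted here): the connect attempt is aborted when DNS resolution maps the new host to an IP that already has a live H2 session, so the request can coalesce onto that session instead.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Stand-in values; the real net error codes live in net/base/net_error_list.h.
constexpr int kOk = 0;
constexpr int kErrSpdySessionAlreadyExists = -1;

struct SpdySessionIndex {
  // IPs that already have a live H2 session (e.g. www.bosslaser.com's IP).
  std::set<std::string> ips_with_live_session;

  // Called once DNS resolution for a new connect attempt finishes. Returning
  // the error aborts the connect so the request can coalesce instead.
  int OnHostResolution(const std::vector<std::string>& resolved_ips) const {
    for (const std::string& ip : resolved_ips) {
      if (ips_with_live_session.count(ip))
        return kErrSpdySessionAlreadyExists;
    }
    return kOk;
  }
};

int main() {
  SpdySessionIndex index;
  index.ips_with_live_session.insert("203.0.113.7");
  // js.bosslaser.com resolving to the same IP as www.bosslaser.com:
  int rv = index.OnHostResolution({"203.0.113.7"});
  std::cout << (rv == kErrSpdySessionAlreadyExists ? "coalesce\n" : "connect\n");
  // Note the gap described above: a job stalled in the socket pool queue
  // never reaches host resolution, so it never runs this check at all.
}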
May 17 2017
> The idle sockets are all closed earlier

Er, no, that doesn't make sense. If they were idle, the socket pool should have passed them along. Poking further...
May 17 2017
That CloseIdleSocketsInGroup line is new, and I don't believe it's on stable yet, anyways.
May 17 2017
mgersh noticed another coincidence: id:610 gets "cancelled" at t=80400 (which is immediately after id:553 gets unblocked). Its HTTP_STREAM_JOB_CONTROLLER (id:609) is bound to URL_REQUEST id:557, but at t=80400, that URL_REQUEST didn't cancel things. It gets to send a request, but is missing HTTP_STREAM_REQUEST_BOUND_TO_JOB for some reason... The HTTP_STREAM_JOBs after id:610 seem to have the same behavior too.
May 17 2017
> So that's probably part of the story. If the H2 session is created after we pass the OnHostResolution point, we never actually notice that a second one was created. That sounds like a bug. We should probably wake up all the applicable HTTP_STREAM_JOBs wherever they are in the connect process.

Okay, this is not what's going on. We already wake things up in HttpStreamFactoryImpl::OnNewSpdySessionReady. Why that's not happening here is probably an HttpStreamFactory bug. The reason id:553 does not get to the OnHostResolution point is that it's busy waiting for a chance to grab a socket. OnHostResolution only happens if we get to TransportConnectJob. That it never gets that far is probably another (?) bug, either in HttpStreamFactory or the socket pools. As for the weird logging in comment #9, that's what happens when HttpStreamFactoryImpl::OnNewSpdySessionReady fires. HTTP/2 folks, you might want to improve the logging here.
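For context, here is a rough sketch of what the OnNewSpdySessionReady wake-up described above is meant to do (invented names; not the actual HttpStreamFactoryImpl code): when a new H2 session becomes available, every waiting stream job with a matching session key should be bound to it, regardless of how far its own connect attempt has progressed. The mystery in this bug is why jobs stuck in the socket pool queue never get this treatment.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct StreamJob {
  std::string id;
  std::string session_key;  // e.g. "ssl/assets.bwbx.io:443"
  bool bound = false;
};

class StreamFactory {
 public:
  void AddWaitingJob(StreamJob job) { waiting_.push_back(std::move(job)); }

  // Intended behavior when a new H2 session for |session_key| becomes ready:
  // bind every waiting job that can use it, wherever it is in its connect.
  void OnNewSpdySessionReady(const std::string& session_key) {
    for (StreamJob& job : waiting_) {
      if (!job.bound && job.session_key == session_key) {
        job.bound = true;
        std::cout << job.id << ": bound to the new H2 session\n";
      }
    }
  }

 private:
  std::vector<StreamJob> waiting_;
};

int main() {
  StreamFactory factory;
  factory.AddWaitingJob({"job553", "ssl/assets.bwbx.io:443"});
  factory.AddWaitingJob({"job554", "ssl/assets.bwbx.io:443"});
  factory.OnNewSpdySessionReady("ssl/assets.bwbx.io:443");
}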
May 18 2017
Thanks for looking into this. I'm able to consistently reproduce the issue on one of our test sites: https://ttfb.2015edgedemos.com/presta/presta4domain.html (WPT of it stalling at https://www.webpagetest.org/result/170518_56_91eb51f21d974b4adafc3a395bdd3de6/1/details/#waterfall_view_step1 ) Let me know if I can help in any way.
May 18 2017
Issue 723643 has been merged into this issue.
May 18 2017
Turns out this is also the cause of a 100% repro case I've been looking into for https://www.bosslaser.com/ but the test times out instead of just having a big gap (presumably because they aren't closing the H2 connection after 60 seconds like Akamai is).

Test: https://www.webpagetest.org/result/170517_BT_f762002e35781ff9ed6c90924329e68b/

Netlog attached but the signature looks identical:

283: HTTP_STREAM_JOB
https://js.bosslaser.com/
Start Time: 2017-05-17 09:22:10.920

t=  1834 [st=     0] +HTTP_STREAM_JOB  [dt=204900]
                      --> alternative_service = "unknown :0"
                      --> original_url = "https://js.bosslaser.com/"
                      --> priority = "MEDIUM"
                      --> source_dependency = 282 (HTTP_STREAM_JOB_CONTROLLER)
                      --> url = "https://js.bosslaser.com/"
t=  1834 [st=     0] +PROXY_SERVICE  [dt=0]
t=  1834 [st=     0]  PROXY_SERVICE_RESOLVED_PROXY_LIST
                      --> pac_string = "DIRECT"
t=  1834 [st=     0] -PROXY_SERVICE
t=  1834 [st=     0]  HTTP_STREAM_JOB_PROXY_SERVER_RESOLVED
                      --> proxy_server = "DIRECT"
t=  1834 [st=     0]  HTTP_STREAM_JOB_WAITING  [dt=0]
                      --> should_wait = false
t=  1834 [st=     0] +HOST_RESOLVER_IMPL_REQUEST  [dt=0]
                      --> address_family = 0
                      --> allow_cached_response = true
                      --> host = "js.bosslaser.com:443"
                      --> is_speculative = false
t=  1834 [st=     0]  HOST_RESOLVER_IMPL_IPV6_REACHABILITY_CHECK
                      --> cached = true
                      --> ipv6_available = false
t=  1834 [st=     0] -HOST_RESOLVER_IMPL_REQUEST
                      --> net_error = -804 (ERR_DNS_CACHE_MISS)
t=  1834 [st=     0] +SOCKET_POOL  [dt=204900]
t=  1834 [st=     0]  SOCKET_POOL_STALLED_MAX_SOCKETS_PER_GROUP
t=206734 [st=204900] -HTTP_STREAM_JOB
t=206734 [st=204900]  CANCELLED
t=206734 [st=204900] -SOCKET_POOL
May 18 2017
Hrm...I visited https://www.bosslaser.com/ once, and it reproduced, but I haven't been able to repro again - flushing socket pools, clearing cache, restarting Chrome, incognito mode - nothing reproduces the issue. Even tried Canary, and it didn't repro (On Windows). Maybe the OS-level DNS cache is causing us to "win" a race?
May 18 2017
Sorry, by 100% repro case I meant on WPT with traffic-shaping enabled (where the other site was intermittent). If I disable the traffic-shaping then it doesn't fail so it's very racy.
May 18 2017
I got the exact same thing with https://ttfb.2015edgedemos.com/presta/presta4domain.html (Well, I got it to hang a second time by restarting the browser *and* using incognito, but never again). I bet just inserting a 1 second delay on all DNS lookups or connection attempts would make it repro consistently.
May 18 2017
I've confirmed it's still an issue on Canary (Bug says M60, but NetLogs are all from M58, so wasn't sure). This seems sufficiently problematic (Though just a reload likely fixes it) that I'm labeling this a stable blocker, and think we'll want to merge a fix into M59, if it isn't too hairy. Also can now repro on my custom build (Added a 0.5 second delay on DNS lookups, though not clear if it was needed. Just need to clear cache, and it reproduces every time). Unfortunately, code has recently changed in this area, so I'm not sure a fix will merge cleanly, though not sure what the issue is yet. May not have too much time to dig into this until next week.
May 18 2017
One thing of note: It looks like all requests that hung should have been pooled into H2 sessions using connections to different domains. Maybe this is a session pooling issue?
May 18 2017
Other things of note: If I go to https://www.bosslaser.com and this bug reproduces, I have 6 "pending" SSL connections to js.bosslaser.com (Why don't they connect and become IDLE? I'm not sure, but there was a recent question about sockets connecting but not negotiating SSL on net-dev, which may be related).
May 18 2017
Given that this is easily reproducible with traffic shaping, pmeenan@, do you know if this is a regression?
May 18 2017
If the DNS delay for reproing could be made external to the binary rather than with a custom build, bisect-builds.py would probably do the trick.
May 18 2017
#22 - hard to say if it is a regression until we know what the cause is and if it was a result of recent changes or has always been there. It's only easily reproducible for specific pages (not all pages tested with shaping) and as noted in #19 may be tied to connection coalescing. The original report to me mentioned they have been seeing the intermittent delays since March (using Stable for testing) so it's been there since at least 57 if those were really the same failure.
May 18 2017
#23 - should be possible to introduce the delay externally easily enough with dummynet/linkconditioner/winshaper. You'd be adding delays to all traffic but that should still work (and is what triggers it in WPT). Main thing is that you'll need to flush the OS DNS and browser caches between tests (on windows that would be "ipconfig /flushdns") I can take a crack at it if nobody else wants to.
May 18 2017
I suspect the bug lies in the propagation of the OnNewSpdySessionReady notification. We could try debugging that and see if that gives us anything. A bisect will be helpful so we can know a range, or whether this has never worked. (Happy to look into this given that I did modify some of the code recently, but I am out today.)
May 18 2017
I'm not very fluent in net stack code, but what I see on https://ttfb.2015edgedemos.com/presta/presta4domain.html is:

* Requests for both the CSS and JS domains are coming in, HttpStreamFactoryImpl::Job::DoInitConnectionImpl is running on them, but spdy_session_pool->FindAvailableSession is failing for them for some reason. I'm still digging here to try to find out why that is.
* Once that happened, these requests end up opening new sockets, and after the 6th one, the rest result in SOCKET_POOL_STALLED_MAX_SOCKETS_PER_GROUP and wind up in the pending_requests_ queue.
* ProcessPendingRequest is being called occasionally (I'm guessing due to sockets going idle), but stopped being called after a while. Maybe people more familiar with ClientSocketPoolBase than myself can say if this is expected or not.

Hope this helps. Any pointers or insights would be appreciated.
May 18 2017
Ok, I've figured it out: We have 6 pending ConnectJobs for js.bosslaser.com (Which will eventually be able to use www.bosslaser.com). They've already done DNS resolution, but that doesn't really matter. Those jobs complete asynchronously (With ERR_SPDY_SESSION_ALREADY_EXISTS), and when that happens, we start the next request in the queue. The next request fails *synchronously* with ERR_SPDY_SESSION_ALREADY_EXISTS. And then we don't try to start the next pending request, so we eventually end up with n > 0 pending socket requests, and no ConnectJobs. This is a socket pool bug, but sync failure is probably rare enough that it was basically never reproducible (Or happened at all) until the H2 sync failures were added. We we discover we can reuse the www.bosslser.com connection (Which we establish before any connection to js.bosslaser.com
May 18 2017
Oops...remove that final fragment...Anyhow, I'll go ahead and fix it myself.
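To make the diagnosis above concrete, here is a minimal standalone sketch of the buggy drain pattern (invented names; not the actual ClientSocketPoolBase code): when an asynchronous ConnectJob completes, exactly one pending request is serviced, so a synchronous failure consumes the freed slot's wake-up without passing it on to the next request in line.

#include <deque>
#include <iostream>
#include <string>

struct PendingRequest {
  std::string id;
  bool fails_synchronously;  // e.g. an H2 session already covers this host.
};

// Returns true if the request was handed off to an asynchronous ConnectJob.
bool StartRequest(const PendingRequest& req) {
  if (req.fails_synchronously) {
    std::cout << req.id << ": fails synchronously (session already exists)\n";
    return false;
  }
  std::cout << req.id << ": ConnectJob started\n";
  return true;
}

// Buggy drain: when an async ConnectJob completes, service exactly one
// pending request. A synchronous failure eats the wake-up, and nobody
// hands the still-free slot to the next request in line.
void OnConnectJobComplete(std::deque<PendingRequest>& pending) {
  if (pending.empty())
    return;
  PendingRequest req = pending.front();
  pending.pop_front();
  StartRequest(req);  // Result ignored: the rest of the queue is stranded.
}

int main() {
  std::deque<PendingRequest> pending = {{"req1", true}, {"req2", false}};
  OnConnectJobComplete(pending);  // req1 fails synchronously...
  std::cout << pending.size() << " request(s) stranded\n";  // ...req2 hangs.
}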
May 18 2017
Please tag with applicable OSs. Thanks!
May 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/9d72fe409a214856ec4d9700105b4664bb4c0861

commit 9d72fe409a214856ec4d9700105b4664bb4c0861
Author: mmenke <mmenke@chromium.org>
Date: Thu May 18 22:36:07 2017

    Fix for HTTP2 request hanging bug.

    If, when a socket request completed asynchronously, the next socket
    request in the queue failed synchronously, ClientSocketPoolBase
    wouldn't continue to try to service other socket requests in the
    queue. As a result, the socket pools could end up with pending
    connect requests and free socket slots, but no ConnectJobs would be
    made to service them, if this happened 6 times in a row to a socket
    group.

    This was a relatively obscure issue, until H2 started depending on
    this path for sharing sessions when different domains map to the
    same IP.

    BUG=723748

    Review-Url: https://codereview.chromium.org/2888623011
    Cr-Commit-Position: refs/heads/master@{#472952}

[modify] https://crrev.com/9d72fe409a214856ec4d9700105b4664bb4c0861/net/socket/client_socket_pool_base.cc
[modify] https://crrev.com/9d72fe409a214856ec4d9700105b4664bb4c0861/net/socket/client_socket_pool_base_unittest.cc
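The shape of the fix, sketched in the same style as the example above (invented names; see the CL for the real change): keep draining the pending queue until some request actually takes ownership of the freed slot asynchronously, instead of stopping after the first synchronous failure.

#include <deque>
#include <iostream>
#include <string>

struct PendingRequest {
  std::string id;
  bool fails_synchronously;
};

// Returns true if the request was handed off to an asynchronous ConnectJob.
bool StartRequest(const PendingRequest& req) {
  std::cout << req.id
            << (req.fails_synchronously ? ": failed synchronously\n"
                                        : ": ConnectJob started\n");
  return !req.fails_synchronously;
}

// Fixed drain: keep servicing pending requests until one of them takes
// ownership of the freed slot asynchronously, or the queue empties.
void OnConnectJobComplete(std::deque<PendingRequest>& pending) {
  while (!pending.empty()) {
    PendingRequest req = pending.front();
    pending.pop_front();
    if (StartRequest(req))
      break;  // The slot is in use again; stop draining.
    // Synchronous failure: the slot is still free, try the next request.
  }
}

int main() {
  std::deque<PendingRequest> pending = {
      {"req1", true}, {"req2", true}, {"req3", false}};
  OnConnectJobComplete(pending);
  std::cout << pending.size() << " request(s) still queued\n";  // Prints 0.
}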
May 19 2017
I can confirm that the issue is resolved for me with this fix. Will this fix be merged into M59?
May 19 2017
Yes, I intend to merge it to M59, though I may give it a week on Canary first.
May 22 2017
Requesting M-59 merge. When this issue triggers, a page can just keep loading forever, without anything being seen (Other failure scenarios are possible, that's just what happens to the sites mentioned in this bug). Reloading will generally fix the page, but the experience seems bad enough to be worth merging a fix. More likely to run into this issue on slower connections.
May 22 2017
Your change meets the bar and is auto-approved for M59. Please go ahead and merge the CL to branch 3071 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), gkihumba@(ChromeOS), Abdul Syed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
May 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ef63c09a56bf127df9990b3e7412f543b4733b0b

commit ef63c09a56bf127df9990b3e7412f543b4733b0b
Author: Matt Menke <mmenke@chromium.org>
Date: Mon May 22 16:26:35 2017

    Fix for HTTP2 request hanging bug.

    If, when a socket request completed asynchronously, the next socket
    request in the queue failed synchronously, ClientSocketPoolBase
    wouldn't continue to try to service other socket requests in the
    queue. As a result, the socket pools could end up with pending
    connect requests and free socket slots, but no ConnectJobs would be
    made to service them, if this happened 6 times in a row to a socket
    group.

    This was a relatively obscure issue, until H2 started depending on
    this path for sharing sessions when different domains map to the
    same IP.

    BUG=723748

    Review-Url: https://codereview.chromium.org/2888623011
    Cr-Original-Commit-Position: refs/heads/master@{#472952}
    Review-Url: https://codereview.chromium.org/2896883002 .
    Cr-Commit-Position: refs/branch-heads/3071@{#654}
    Cr-Branched-From: a106f0abbf69dad349d4aaf4bcc4f5d376dd2377-refs/heads/master@{#464641}

[modify] https://crrev.com/ef63c09a56bf127df9990b3e7412f543b4733b0b/net/socket/client_socket_pool_base.cc
[modify] https://crrev.com/ef63c09a56bf127df9990b3e7412f543b4733b0b/net/socket/client_socket_pool_base_unittest.cc
Jun 14 2017
I'm still experiencing some variant of this with a Jun 11 git HEAD. Including a screenshot of a partially loaded youtube page stuck in some sort of loading forever, after browser startup (with the 'Continue where you left off' setting enabled). The video is playable but no other elements on that page are loading (after like 10 minutes now). Oh, looks like I don't have the Comment 38 commit in my git history, only have the Comment 32 one. I'll git update...
Jun 14 2017
Another stalling example that never completes: "Processing request..." (attached screenshot). chrome://net-internals/#sockets doesn't report any Stalling as true. chrome://net-internals/#http2 doesn't show youtube in the list, so maybe it's a different issue - I will stop posting here.
Jun 14 2017
The first sounds like a different issue entirely - "Processing request" means we're blocked on something outside the network stack (Like a hung renderer, for instance). Please file a new bug. Please also file a new bug for the second issue - youtube is likely using QUIC, not H2, so it can't be the same bug, since QUIC doesn't use the TCP socket pools.
Jun 14 2017
Sorry, misread that as two issues, but I think it's just one? Anyhow, like I said, please do file a new bug, with a net-internals log attached.
Jun 14 2017
Thank you @mmenke. Much appreciated feedback! Filed new issue 733196