
Issue 796199

Starred by 4 users

Issue metadata

Status: WontFix
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




HTTP2 GET requests fail with ERR_CONNECTION_RESET

Reported by niklas.h...@gmail.com, Dec 19 2017

Issue description

I suspect that Chromium cancels HTTP connections with `ERR_CONNECTION_RESET` when a DNS TTL expires.

On a website of mine where 1000 files of ~15 KB each are downloaded over 1 minute (with ~4 concurrent downloads), one of the files randomly fails with `ERR_CONNECTION_RESET`.

In the network tab, these connections are shown as `h2` in the "Protocol" column (if enabled).

The problem goes away when using `chromium-browser --disable-http2`.

I am using a DNS response with three A records (3 servers returned in random order) and a short TTL (10 seconds). For the sake of example, say these IPs are 1.1.1.1, 2.2.2.2 and 3.3.3.3.

I noticed in the network tab that most of the time, all entries immediately preceding a failing download with `ERR_CONNECTION_RESET` are against one IP address (e.g. `Remote Address: 1.1.1.1`) while the entries immediately succeeding the failed entry are against another IP address from the set (e.g. `Remote Address: 2.2.2.2`). This is what makes me suspect that it is the TTL expiry that makes the HTTP2 connection fail.

However, I have also observed cases where the IP addresses before and after the failing entry are the same IP. I suspect that this simply means that the cancelling is happening even when the new DNS refresh after TTL expiry returns the same IP.

If I had to take a blind guess at the problem, I would suspect that the logic discussed in https://groups.google.com/a/chromium.org/forum/#!topic/net-dev/VkwAL9jttcw is no longer in place: somebody implemented DNS refreshing for h2, but did so with a hard cancel of the connection instead of waiting for the currently downloading files to finish.

I have verified this behaviour across different machines in different countries, and across different operating systems (Ubuntu and Windows).

Needless to say, this breaks my page for all Chrome users.

The only reasonable, if unfortunate, server-side workaround is to disable HTTP/2.

What steps will reproduce the problem?
(1) Set up domain with short TTL that rotates between IPs
(2) Enable `Disable cache` in the network tab
(3) Download many files from that domain via HTTP2
(4) Refresh a couple times until ERR_CONNECTION_RESET happens for one of the downloads (easily spotted when also showing the Console)

Chrome Version       : 63.0.3239.84 (Official Build) Built on Ubuntu, running on Ubuntu 16.04 (64-bit)
URL : Internal, I could try to make a reproducible test case but not without significant time investment, so I'll do that on request
Behavior in Firefox: Works fine (tested on Nightly 59)
 
Components: Internals>Network>HTTP2

Comment 2 by mef@chromium.org, Dec 20 2017

Components: Internals>Network>DNS

Comment 3 by mef@chromium.org, Dec 20 2017

Labels: Needs-Feedback
Hi Niklas, could you please collect and attach a netlog by following the instructions here: https://sites.google.com/a/chromium.org/dev/for-testers/providing-network-details (start it before the problem occurs and stop after).

Comment 4 by b...@chromium.org, Jan 2 2018

Cc: b...@chromium.org
@niklas: Any chance you can attach the netlog as described in c#3 to help us get to the bottom of this?  Without more details we're not going to be able to make progress.

I will try to get this done, just short on time currently.

Comment 7 by sheriffbot@chromium.org, Jan 19 2018

Cc: mef@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "mef@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 8 by mef@chromium.org, Jan 19 2018

Labels: Needs-Feedback
niklas.hambeuchen@, are you able to get a netlog as requested by comment #3?
I have attached the netlog.

You can find the relevant line with:

    zgrep RESET chrome-net-export-log-redacted.json.gz

It is:

    {"params":{"description":"Abandoned.","net_error":"ERR_CONNECTION_RESET","stream_id":1999},"phase":0,"source":{"id":106588,"type":1},"time":"82676821","type":213},

Attachment: chrome-net-export-log-redacted.json.gz (1.0 MB)

Comment 11 by sheriffbot@chromium.org, Feb 3 2018

Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "mef@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
I can also provide a URL to reproduce this reliably (within a minute or so) to Chromium developers privately, if there is a way to do that.
Status: Untriaged (was: Unconfirmed)
Summary: HTTP2 GET requests fail with ERR_CONNECTION_RESET (was: HTTP2 GET requests get cancelled by DNS changes)
Thanks for the log!

bnc, this is one for you. This part from the end of event 100450 looks suspicious:
t=42563 [st=42245]  HTTP2_SESSION_RECV_GOAWAY
                    --> active_streams = 5
                    --> debug_data = "[0 bytes were stripped]"
                    --> error_code = "0 (NO_ERROR)"
                    --> last_accepted_stream_id = 1999
                    --> unclaimed_streams = 0
t=42563 [st=42245]  HTTP2_SESSION_RECV_HEADERS
                    --> fin = false
                    --> :status: 200
                        server: nginx
                        date: Sat, 03 Feb 2018 23:47:33 GMT
                        content-type: text/plain
                        content-length: 15718
                        last-modified: Wed, 06 Apr 2016 14:23:56 GMT
                        etag: "57051bfc-3d66"
                        content-encoding: gzip
                        expires: Thu, 31 Dec 2037 23:55:55 GMT
                        cache-control: max-age=315360000public
                        access-control-allow-origin: *
                        vary: Accept-Encoding, Origin
                    --> stream_id = 1999
t=42581 [st=42263]  HTTP2_SESSION_CLOSE
                    --> description = "Error 101 reading from socket."
                    --> net_error = -101 (ERR_CONNECTION_RESET)

The failing request is event 106588.
Components: -Internals>Network>DNS

Comment 15 by b...@chromium.org, Feb 5 2018

Cc: -b...@chromium.org
Labels: Needs-Feedback
Owner: b...@chromium.org
Status: Unconfirmed (was: Untriaged)
I looked through the attached log but did not see anything related to DNS TTL.  What caught my eye, however, is exactly the part pasted in comment #13.  Here the server first sends a GOAWAY frame with NO_ERROR.  According to RFC 7540 Section 6.8, this tells the client not to open further streams on this connection, but streams up to last_accepted_stream_id = 1999 can be expected to be served.  And indeed the next frame is the response HEADERS for this stream.  However, the connection is then reset before the response bodies could be received.

While this part of the log does not explicitly say if the connection is reset by the server or by the DNS layer of the client, and event 100449 SOCKET does not say anything explicitly either, it is worth noting that this happens 18 ms after the GOAWAY is received from the server.  This coincidence suggests that the connection is torn down by the server.

I suggest two ways forward to find the cause of this issue.  The first is to try with another browser, like Firefox.  The second, in case Firefox is able to load the page correctly, would be to use Wireshark to analyze network data simultaneously with recording a NetLog in Chrome.  This would tell for sure whether the server resets the connection.  I would be happy to do this if you provide me with a URL for reproducing the issue, as offered in comment #12.  Please send the details to bnc@chromium.org.  Thanks.
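
For example, such a capture could be recorded with something like the following (the interface and host are placeholders, not the actual reproduction URL):

    tcpdump -i eth0 -w chromium-bug-796199.pcap host example.com and tcp port 443
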
Thanks for your reply!

I cannot reproduce the issue with Firefox; I have run multiple races of Firefox (60 Nightly) and Chromium loading the page simultaneously, and in Firefox all loads succeeded even when in Chromium the error appeared.

I can do a Wireshark recording, will have a look at that in a moment.

I will email you a URL for reproducing right now. Please let me know if you can reproduce it.

Comment 17 by sheriffbot@chromium.org, Feb 13 2018

Cc: b...@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "bnc@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
OK, I've done a Wireshark recording on the desktop and a tcpdump on the server.

I can indeed see some RST coming from the server, in both the client Wireshark and the server tcpdump.

I can also see when the problem appears very easily on the client using basically `tcpdump -l -i eth0 (host 2.3.4.5) | grep '\[R\]'`. Every time it happens in chrome, 2 or 3 lines appear there.

Using this method, I just realized the issue might be happening to Firefox as well, but Firefox somehow seems to handle it gracefully:

In the Firefox Network tab, I noticed that around the time the lines appear in my tcpdump, there are 3 entries which have "0 GB" as their "Transferred" size, and "0 ms" as all their timings. They have an empty field in the "Status" column, instead of "200" as for all other entries. The details panel on the right, in the "Headers" tab, is missing the lines where other requests show "Remote address", "Status code" and "Version"; they show only "Request URL" and "Request method".
But when clicking to view the "Response" details tab for such an entry, it shows correct contents.
So Firefox seems to somehow get at the contents despite the RSTs; it's not clear to me how and when.

I did not notice this before because Firefox doesn't print anything to its console when a request fails (as Chromium does with "ERR_CONNECTION_RESET"), and because when clicking the "Status" column to sort it, Firefox doesn't sort the rows with an empty "Status" before or after the rows with Status = "200", but instead keeps them in the middle of the other requests.
Attachment: chromium-bug-796199-client-wireshark.txt (2.5 KB)
Attachment: chromium-bug-796199-server-tcpdump.txt (31.9 KB)
Ah, I just noticed, `grep '\[R\]'` is not enough, because sometimes the flag printed is only `[R.]`. Just `grep R` also does well though.
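
For the record, a single pattern matching both flag forms would be (a sketch, with the same interface and example IP as above):

    tcpdump -l -i eth0 host 2.3.4.5 | grep -E '\[R\.?\]'
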

So with this extra information, it seems my nginx might be to blame, and this is not actually a Chromium bug. Please let me know whether you think so too. If that's the case, I apologise for taking your time. You may still be interested in this though, if there is in fact a bug on the nginx side. I've tested nginx 1.12.1 and 1.13.8, with identical results.
OK, more info. I enabled nginx debug logging now, and found that it is *always* the stream id 1999 at which it breaks.

I'm not familiar with HTTP/2 on this level; is there something special about stream id 1999?
Yes, that seems to be the issue.

https://trac.nginx.org/nginx/ticket/1102 and
https://trac.nginx.org/nginx/ticket/1250

both claim that it's the browsers (both Chromium and Firefox) that don't implement GOAWAY correctly, but bug 681477 is marked as fixed -- who's right?

It also confuses me where the 1999 comes from. The default value for nginx's "http2_max_requests" setting is 1000 (http://hg.nginx.org/nginx/rev/9027991e2f37#l2.29), and there doesn't seem to be any logic in `ngx_http_v2.c` that doubles it or adds 999 in a different way.

It seems to set the "Last-Stream-ID" to `h2c->last_sid` in
http://hg.nginx.org/nginx/file/9027991e2f37/src/http/v2/ngx_http_v2.c#l2633

That value is set with `h2c->last_sid = h2c->state.sid;` in
http://hg.nginx.org/nginx/file/9027991e2f37/src/http/v2/ngx_http_v2.c#l1111

And `state.sid` is `h2c->state.sid = ngx_http_v2_parse_sid(&pos[5]);` from
http://hg.nginx.org/nginx/file/9027991e2f37/src/http/v2/ngx_http_v2.c#l786

So, as a quick guess from reading the code for a few minutes: it seems like nginx sets the "Last-Stream-ID" to, well, the last stream ID it received, so I'm not sure how that can exceed 1000.

Grepping out the IDs that nginx logs, I have a suspicion: The logged IDs start at 1 and always increase by 2. I'm not very familiar with HTTP/2 but I guess that sender and receiver side might each have even and odd IDs? That would explain where the factor 2 comes from, and why this problem triggers around 1999 instead of 1000.
I have now modified this line

  http://hg.nginx.org/nginx/file/9027991e2f37/src/http/v2/ngx_http_v2.c#l1059

to print the stream id:

        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, h2c->connection->log, 0,
                       "skipping http2 HEADERS frame %ui", h2c->state.sid);

and can see in the debug logs:

    [debug] 10499#10499: *29 http2 send GOAWAY frame: last sid 1999, error 0
    ...
    [debug] 10499#10499: *29 skipping http2 HEADERS frame 2001
    [debug] 10499#10499: *29 skipping http2 HEADERS frame 2003
    [debug] 10499#10499: *29 skipping http2 HEADERS frame 2005

This suggests that nginx sets `last_accepted_stream_id = 1999` but then receives frames with higher IDs from somewhere. Since the number of frames that appear is variable, is it possible that these are in-flight frames? And that bug 681477 was about retrying those rejected in-flight frames?

Further, do I understand you correctly that with "... next frame is response HEADERS for this stream [...] However, the connection is then reset before response bodies could be received" you're saying that the DATA frames after the HEADERS never arrive?

Looking at the attached debug log suggests that nginx did send the data frames:

    *29 http2:1999 DATA frame 0000564EFB1A9AD0 was sent
    *29 http2 frame sent: 0000564EFB1A9AD0 sid:1999 bl:0 len:8192
    *29 http2:1999 DATA frame 0000564EFB1A9D60 was sent
    *29 http2 frame sent: 0000564EFB1A9D60 sid:1999 bl:0 len:5857
    ...
    *29 SSL_shutdown: 1

I guess now we have to establish whether the data frames actually reach Chromium and just don't appear in the netlog, or if they don't make it across the network.
Attachment: nginx-debug-log.txt (12.5 KB)
The comment on https://bugs.chromium.org/p/chromium/issues/detail?id=681477#c8 and specifically the linked comment in https://bugzilla.mozilla.org/show_bug.cgi?id=1050329#c14 suggest that:

> the TCP server sent a RST which aborts all TCP processing. No retransmissions happen, no ordering problems are fixed, unread kernel buffers are simply deleted (even if acked!)

In other words, if RST is sent, there is no guarantee that any data sent just before the RST actually reaches the other side's userspace (Chromium).

Might this be what's happening here?

If yes, and if the above is true, I see no way this could possibly work without an HTTP/2 protocol change (namely, the other side acknowledging the GOAWAY before RST is sent). Am I wrong?
(Also, could somebody please CC the user from https://bugs.chromium.org/p/chromium/issues/detail?id=681477#c8, I can't figure out how to add CCs in this bug tracker.)

Comment 26 by b...@chromium.org, Feb 14 2018

Cc: patrick....@gmail.com
Adding CC per request in comment #25.
> I see no way this could possibly work without an HTTP/2 protocol change

To be a bit more precise, the server doing a `shutdown()` instead of a `close()` and waiting for the other side to close the connection (as suggested on https://bugzilla.mozilla.org/show_bug.cgi?id=1050329#c15) probably doesn't count as an HTTP/2 protocol change, only an implementation detail.

So as per my current understanding, this is likely the old problem from https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable, and the `shutdown()` approach is the just-as-old solution from its section "So, if we read that data first...".
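
For concreteness, a minimal C sketch of that pattern (error handling, and the timeout a real server would need, are omitted; `fd` is a connected TCP socket the server has finished writing to):

    #include <sys/socket.h>
    #include <unistd.h>

    /* Graceful close: send FIN, then drain the socket until the peer
     * closes, so the kernel never holds unread data and never emits RST. */
    void graceful_close(int fd)
    {
        char buf[4096];
        ssize_t n;

        shutdown(fd, SHUT_WR);               /* send FIN; peer sees EOF   */
        do
            n = read(fd, buf, sizeof(buf));  /* wait for the peer's FIN,  */
        while (n > 0);                       /* i.e. read() returning 0   */
        close(fd);                           /* safe: nothing left unread */
    }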

Comment 28 by b...@chromium.org, Feb 14 2018

Status: WontFix (was: Unconfirmed)
Hi, thank you for the tcpdump and wireshark logs.  Also thank you to other 
commenters for solving the issue.  As described above and in the linked bugs 
and blog posts, the problem seems to be that Nginx writes all relevant data, but
then closes the TCP socket without making sure data are sent.  My opinion is
that this is a bug outside of Chrome, so I am closing this issue as WontFix.

Note that since issue 681477 is fixed, requests that are already sent on a
connection but with a stream ID larger than the last_accepted_stream_id
parameter of the received GOAWAY frame with NO_ERROR error code are actually
retried.  On the other hand, requests that are sent with stream id at most
last_accepted_stream_id fail.  I believe this is the right thing to do: the
server has indicated that it has already started processing those frames, and
the requests might not be idempotent.  Also, retrying those frames on the
client would just paper over deeper issues, like this one.
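
For illustration, a minimal C sketch of that retry rule (hypothetical types and names; this is not Chromium's actual code):

    /* Decide what to do with an already-sent request after receiving
     * GOAWAY(NO_ERROR) carrying last_accepted_stream_id. */
    enum goaway_action { RETRY_ON_NEW_CONNECTION, FAIL_REQUEST };

    enum goaway_action on_goaway(unsigned stream_id,
                                 unsigned last_accepted_stream_id)
    {
        if (stream_id > last_accepted_stream_id)
            return RETRY_ON_NEW_CONNECTION;  /* server never processed it */
        return FAIL_REQUEST;  /* server may have started processing, and
                                 the request might not be idempotent */
    }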

Re comments #20 and #22: Your suspicion is right.  Nginx http2_max_requests
default value is indeed 1000, see
http://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_requests.
Client initiated streams (requests) all have odd numbers, so the first request
is stream 1, the second is stream 3, the third is stream 5, etc.  Therefore the 
1000th request is stream 1999.  

By the way, increasing http2_max_requests might mitigate the problem.  After
all, if Chrome retried the failed requests, Nginx would need to deal with that
load anyway.  Then again, this would not be a real solution.
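
For reference, that mitigation would be a one-line change in the nginx http or server block (the value here is arbitrary):

    http2_max_requests 100000;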

Re comment #23: higher layers in Nginx are processing the requests and passing 
down the HEADERS frames.  However, they are skipped because the connection is
already going away.  And yes, you understand correctly, DATA frames never arrive
for stream 1999 in comment #13, although Nginx might have written them to the
socket.
Great, we're in agreement, and thanks for answering my questions.

I have posted on the nginx bugtracker that this is an nginx problem: https://trac.nginx.org/nginx/ticket/1250#comment:4.

Do you think the proposed solution of the server using `shutdown()` instead of `close()` after GOAWAY, and waiting until the browser side closes the connection with FIN, is a good one? Would this work smoothly with Chromium?

Also, do you happen to have any insight into why server implementors would want a feature like `http2_max_requests`?

Comment 30 by b...@chromium.org, Feb 14 2018

Cc: mmenke@chromium.org davidben@chromium.org
Issue 809683 has been merged into this issue.

Comment 31 by b...@chromium.org, Feb 14 2018

Re comment #29:  I do not know enough about Unix TCP sockets to answer your question.  I am optimistic that there is a canonical way of closing the socket in such a way that delivery of data is guaranteed, but I do not know what that would be.

My guess is that http2_max_requests is there to limit the load on the server.  However, if I was configuring a server, I would set that to unlimited (the protocol limitation is 2^31, which sounds generous enough), and use SETTINGS_MAX_CONCURRENT_STREAMS instead for this purpose.  What surprises me is that Nginx documentation says "maximum number of requests" and not "maximum number of concurrent requests", and for that purpose, 1000 sounds awfully small.
Per that article, and reading man 7 socket, I believe close() is supposed to continue trying to send the data, up to a timeout. However, if one enables SO_LINGER with a timeout of zero, then it will just reset the connection. NGINX does appear to have code to do this, though it's not clear if that's being triggered here.
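
For illustration, a minimal C sketch of that abortive-close variant (illustrative only; not necessarily the code path NGINX takes here):

    #include <sys/socket.h>
    #include <unistd.h>

    /* With SO_LINGER enabled and a zero timeout, close() discards any
     * unsent data and sends RST instead of a normal FIN. */
    void abortive_close(int fd)
    {
        struct linger lg = { 1, 0 };  /* l_onoff = 1, l_linger = 0 */
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(fd);  /* peer may lose data it has not yet read, even if acked */
    }
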
@davidben I'm not sure what you mean.

As https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable describes:

> So, we set this option [SO_LINGER, even with non-zero timeout], rerun our program. And it still does not work, not all our million bytes arrive.

Because as described, even when all the data *was* sent, there's no guarantee that the data has been received by the other side and has reached userspace (Chromium).

To ensure the latter, you must *never* send RST, not even after all data has been sent or after ACKs for all data have been received. The only way to know that you will never send RST is to read() the empty string "" from the socket, which means the other side has sent FIN.
> I believe close() is supposed to continue trying to send the data,

close() is reliable in the same sense that send()/write() is reliable. Which is to say, it is reliable until a RST comes along. A RST just nukes everything.

Incoming data on a close()d socket causes the TCP stack to generate a RST.
Ah! Okay, that would do it. If the server sends a GOAWAY, the client is, of course, not going to stop sending data until it actually receives the GOAWAY.
