Implement "remote" (alternative service and origin have different hostnames) HTTP/2 Alternative Services |
|||||||||||||
Issue descriptionWhen a new TLS connection is opened to an Alternative Service for a given origin, RFC7838 [1] Section 2.3. requires that the SNI contains the hostname of the origin, not that of the Alternative Service. [1] https://tools.ietf.org/html/rfc7838
,
Jun 6 2016
Also, an important detail that, I think got mentioned in the other discussion, but is probably worth recasting in terms of HTTP/2 and HTTP/1.1 because QUIC cares less about this layering detail:
Say I connect to (1) {origin=origin.com, alternate=alternate.com} and later to (2) {origin=alternate.com, alternate=alternate.com}. The two sockets will be authenticated under their respective origins and will both end up in the socket pool, but they must do so under *different* groups. The socket pool must lookup by origin and not the alternate. At the socket pool level, if a request comes in for alternate.com, socket (1) must never be considered an option.
Only if (1) negotiated HTTP/2, then the HTTP/2 session pooling logic may check (1)'s certificate against alternate.com and then use it for alternate.com's request, as an extension of how cross-host pooling currently works. (It's basically an extension of the IP address check in today's pooling logic.) But this must happen in the HTTP/2 layer, not the socket pools which service both HTTP/1.1 and HTTP/2.
The reason for this is that cross-host pooling of sockets is only legal in HTTP/2, not HTTP/1.1. In HTTP/1.1, sites will break if we start mismatching the SNI and Host header.
,
Jun 20 2016
,
Jun 22 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/aa60ff405b46620f283640f1f6571670b65d8fe8 commit aa60ff405b46620f283640f1f6571670b65d8fe8 Author: bnc <bnc@chromium.org> Date: Wed Jun 22 19:12:42 2016 Change GroupName test to https. Alternative Services for insecure origins have been disabled in https://crbug.com/615497 , making HttpNetworkTransactionTest.GroupNameFor* tests unrealistic. This CL changes these tests to use a secure origin instead. The alternative service port is changed to ensure that group name has port of origin, not of alternative service. BUG=615413 Review-Url: https://codereview.chromium.org/2082403002 Cr-Commit-Position: refs/heads/master@{#401380} [modify] https://crrev.com/aa60ff405b46620f283640f1f6571670b65d8fe8/net/http/http_network_transaction_unittest.cc
,
Jun 22 2016
By the way, +mmenke and I chatted about this yesterday. Matt, correct me if I'm summarizing wrong:
So, the current scheme where we make two HttpStreamJobs for Alt-Svc works fine for QUIC, but it's going to require bifurcating socket pool groups at various layers which would wreak havoc on the socket pool's per-group limits. Depending on exactly how this is done, we were able to come up with situations where per-group limits become meaningless and situations where we deadlock due to interactions of limits at each stage. Both of these are, of course, pretty bad.
Instead, we propose doing something like the following:
1. In the HttpStream logic, continue to race two jobs at once, but *only* for "QUIC" and "not-QUIC". Decide whether you do the QUIC on based on whether any alternates have QUIC labelled.
2. On the not-QUIC side, all alternates go in *one* HttpStream job. This is so we *don't* have to bifurcate the socket pools. Instead, you pass the *entire* set of alternates down into the socket pool parameters.
3. Each socket pool handles the parameter accordingly. Specifically, you need that SSL over TCP straight will pass the list along. SSL over HTTP CONNECT proxy or SOCKS proxy or whatever else drops the list. (If we need to, I could imagine implementing Alt-Svc for HTTP CONNECT too, but let's not bother.)
4. At the leaf, TransportConnectJob evaluates the non-QUIC Alt-Svcs list. Right now it does a happy eyeballs and races IPv4 and IPv6 connections where the Socket loops over IP addresses. Add to the mix the Alt-Svc IP addresses in some way or another. Note, however, that TransportConnectJob shouldn't race more than a small constant (probably 2) number of connections at once because otherwise we, again, secretly bump the per-group socket limits and we don't want to do that.
5. Whichever one works, we route up the stack and that socket is used. But since only TransportConnectJob acts on the alternate name, SNI and certificate checks still use the true name. Alt-Svc is *purely* a funny DNS.
We explicitly do *not* enforce that alternate connections use HTTP/2. I believe this was simply a mistake in how the header works. QUIC vs non-QUIC have incompatible handshakes, but everything over TLS/TCP upgrades.
This should solve the "funny kind of CNAME" part of Alt-Svc. To solve the "get rid of DNS lookup in CanPool" half, we have two proposals:
Proposal 1: Just get rid of the DNS lookup in SPDY pooling. (Can we do this?) That would be simplest.
Proposal 2: Okay, if we need to keep that and want an Alt-Svc aware one, replace it with the following:
1. StreamSocket or so has a method GetAlternateUsed or something. It lets you query the name we actually resolved when connecting.
2. The TCP sockets will just pass the Alt-Svc socket along. Funny sockets like proxies will never return anything (the alternate used to connect to the proxy is *NOT* checked for pooling). SSL sockets will pass along the transport's one.
3. SpdySession forwards that value out from its underlying socket.
4. The CanPool logic checks the session key and the alternate used. If either of those intersects with {request hostname, alternates acceptable for request hostname}, we can skip the DNS check as this suffices for the IP address check.
(This is kind of a mess. Alt-Svc is trying to solve too many problems at once and is not a very well-designed header.)
,
Jun 22 2016
We didn't discuss it, but I assume we'd have to do two DNS lookups in TransportConnectJob, too? I guess we'd want to do them sequentially rather than in parallel, since we have a fairly tight limit on simultaneous DNS lookups. Maybe checking the cache for the one we'd normally do second, first. Once we have one DNS result, we can start trying to connect.
,
Jun 22 2016
Actually, we do the DNS lookups in parallel in the racing QUIC case, so maybe doing them in parallel makes sense...
,
Jun 23 2016
davidben: re comment 5, can you back of up and give a bit of context? You mention, for example, bifurcating socket pool groups but I don't understand what leads to that situation? Are you talking about the case where there is more than one alternative service?
,
Jun 23 2016
So if we one SSL group map to multiple transport client socket groups, with transport client socket groups not using alt service as part of their group names that bifurcates the pools, and you get weird situations (6 alt service requests going to the same lower level group can blockhole any requests actually targetted at that group). If you *do* key on alt service, suddenly the group limits break down, and can have 256 connections to the same server.
,
Jun 23 2016
mmenke: Actually, can the deadlock situation we were worried about happen? Isn't that what the HigherLayeredPool business is intended to solve? (Or maybe it doesn't? CloseOneIdleConnection() without any group information seems a fairly blunt instrument. You'd really want it targeted at a specific group...) Group limits definite break down with alt-svc bifurcation, but it seems the deadlock is something we'd have to solve regardless if any kind of complex high/low-level pool interactions is to be supported (and it already does happen). rch: For context, the deadlock situation is that we have 6 idle connections to (a.com, alternate.com). The transport socket pool will do group limits based on alternate.com. While the SSL socket pool will do it based on a.com. But that means that when the SSL socket pool is asked to connect to alternate.com directly, the transport socket pool can't service the request until one of a.com's idle sockets are closed. But the SSL socket pool doesn't know its alternate.com request is tied to its idle a.com sockets.
,
Jun 23 2016
The higher layered pool logic is only done if there are available socket slots in the group. The problem is if a.com, b.com, etc, use alt.com, we'll see alt.com as having 6 active connections. So any new connection attempt to use alt.com will just wait for one of those to free up. The idle logic only triggers if the group has free socket slots, but the pool itself does not.
,
Jun 23 2016
Re deadlock: Why is there a limit on the number of sockets at two places: transport socket pool and SSL socket pool? Are they not redundant? I understand that the deadlock issue is that these two limits use different groupnames (keys). Can we not just generate the groupname (based on origin, not alternative) in the SSL socket pool, and pass it down to the transport socket pool, so that identical keys would be used for the two limits, solving the deadlock issue? Re pooling logic: I'm not particularily worried about avoiding DNS lookup. IP-based pooling happens in SpdySessionPool::FindAvailableSession(), and SpdySessionPool will need to be aware of both origin and destination for each SpdySession anyway. How that information gets there is indeed more complex.
,
Jun 23 2016
That dual limit isn't relevant to the deadlock situation, and actually shouldn't affect much of anything. It does mean we can have high priority requests waiting on the SSL limit, so the lower layer pool hands things out to lower priority requests first, I suppose, but that should be pretty uncommon. The dual limit is a result of multiple SSL sockets feeding in to a single transport socket group, and the transport socket group having a limit. As long as those two things are the case, we'll have a problem.
,
Jun 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a8681534b79e6352c68fe106e2433fa92d0f5671 commit a8681534b79e6352c68fe106e2433fa92d0f5671 Author: bnc <bnc@chromium.org> Date: Mon Jun 27 12:27:48 2016 Split enable_alternative_service_with_different_host flag. Split enable_alternative_service_with_different_host flag into separate HTTP/2 and QUIC flags. Also remove * unnecessary assigments from tests, * test in DataReductionProxyIODataTest.TestConstruction: it is moot because both HTTP/2 and QUIC are disabled anyway, * assignment from UrlRequestContextBuilder, see justification below. HTTP/2 AltSvc for alternative service host different from that of the origin is currently broken, see linked bug. Therefore enable_http2_alternative_service_with_different_host is disabled in production and for all tests for the time being. It will be used in writing tests when fixing the linked bug. Also, this flag will make it easier to remove flag enable_alternative_services_for_insecure_origins from tests. Also, no clients of the network stack should be able to flip this flag directly because of the linked bug, that is why it is not added to UrlRequestContextBuilder. On the other hand, QUIC AltSvc works correctly, thus enable_quic_alternative_service_with_different_host is enabled by default both in production and tests. It is only there temporarily as a kill switch in case something goes wrong, and will eventually be removed. Therefore no clients of the network stack should have the need to flip this flag directly. Furthermore, this flag is moot if QUIC is disabled. That is why it is not added to UrlRequestContextBuilder. BUG=615413 Review-Url: https://codereview.chromium.org/2089243002 Cr-Commit-Position: refs/heads/master@{#402154} [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/components/data_reduction_proxy/core/browser/data_reduction_proxy_io_data_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/components/network_session_configurator/network_session_configurator.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/components/network_session_configurator/network_session_configurator_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/jingle/glue/proxy_resolving_client_socket.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/http/http_network_session.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/http/http_network_session.h [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/http/http_network_transaction_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/http/http_stream_factory_impl_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/quic/quic_network_transaction_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/spdy/spdy_network_transaction_unittest.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/spdy/spdy_test_util_common.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/spdy/spdy_test_util_common.h [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/url_request/url_request_context_builder.cc [modify] https://crrev.com/a8681534b79e6352c68fe106e2433fa92d0f5671/net/url_request/url_request_context_builder.h
,
Jul 8 2016
Moving this nonessential bug to the next milestone. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jul 13 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9989bd51828a155d65cd9131cee6700d2bb3164e commit 9989bd51828a155d65cd9131cee6700d2bb3164e Author: bnc <bnc@chromium.org> Date: Wed Jul 13 23:09:40 2016 Remove ALTERNATIVE_CERT_NOT_VALID_FOR_ORIGIN error code. Remove ERR_ALTERNATIVE_CERT_NOT_VALID_FOR_ORIGIN error code, because (1) right now, remote TCP alternative services (where the origin and the alternative service have different hostnames) are disabled, and (2) in the future, if an existing connection to the destination of the alternative service with an origin different from the origin of the request does not have a valid certificate for the origin of the request, then a new connection to the destination of the alternative service will be opened with the origin of the request, and if that connection does not have a valid certificate for the origin of the request, then that's simply ERR_CERT_COMMON_NAME_INVALID. So there is no need for ERR_ALTERNATIVE_CERT_NOT_VALID_FOR_ORIGIN. Also remove ValidSpdySessionPool class, because (1) right now there is no need for the extra cert verification, and (2) in the future, SpdySessionPool::FindAvailableSession() and SpdySessionPool::CreateAvailableSessionFromSocket() will take both an origin and a destination argument, and will make sure that the SpdySession they return has a certificate that is valid for the origin. BUG=615413 Review-Url: https://codereview.chromium.org/2152453002 Cr-Commit-Position: refs/heads/master@{#405342} [modify] https://crrev.com/9989bd51828a155d65cd9131cee6700d2bb3164e/net/base/net_error_list.h [modify] https://crrev.com/9989bd51828a155d65cd9131cee6700d2bb3164e/net/http/http_network_transaction_unittest.cc [modify] https://crrev.com/9989bd51828a155d65cd9131cee6700d2bb3164e/net/http/http_stream_factory_impl_job.cc [modify] https://crrev.com/9989bd51828a155d65cd9131cee6700d2bb3164e/net/http/http_stream_factory_impl_job.h
,
Jul 22 2016
> We explicitly do *not* enforce that alternate connections use HTTP/2. I believe this was simply a mistake in how the header works. QUIC vs non-QUIC have incompatible handshakes, but everything over TLS/TCP upgrades. I bumped into mnot today at IETF. He said this was the intention. We are not supposed to think of these as ALPN labels and "h2" refers to the whole HTTP/2, HTTP/1.1, TLS, TCP stack of specs. That is, it is exactly this "entry-point" notion. So I believe Matt's and my proposed rationalization is the correct one.
,
Aug 8 2016
https://tools.ietf.org/rfc/rfc7838.txt Section 2.4 last paragraph: "If the connection to the alternative service does not negotiate the expected protocol (for example, ALPN fails to negotiate h2, or an Upgrade request to h2c is not accepted), the connection to the alternative service MUST be considered to have failed."
,
Aug 8 2016
Interesting. That does not match what mnot told me. Anyway, we should not honor that text since it's nuts.
,
Aug 9 2016
> Anyway, we should not honor that text since it's nuts. O_o That doesn't sound like the best approach :> If we think this is nuts, then let's start a conversation on httpbis about fixing the language.
,
Aug 9 2016
Yeah, I'm not a thread with mnot to try to figure out what the intent was here.
,
Aug 9 2016
s/I'm not a thread/I'm on a thread/
,
Aug 16 2016
Changing title per comment #1, to avoid giving the impression that the SNI should be manipulated by itself.
,
Sep 20 2016
,
Sep 20 2016
Random question: What if we learn about a "remote" alternate service through a connection where the server sends us a self-signed cert? Seems like trusting that information could be very bad.
,
Sep 20 2016
Re #25: unfortunately we currently only check for the scheme when deciding whether to honor an AltSvc response header, at https://cs.chromium.org/chromium/src/net/http/http_network_transaction.cc?l=1219. (This is for QUIC; remote HTTP/2 alternative service is currently disabled.) I agree that remote alt-svc should not be allowed for connections with only self-signed certs. How do I check for that?
,
Sep 20 2016
It's somewhere in SSLInfo, I believe.
,
Sep 20 2016
Do we cache content if the connection uses a self-signed cert?
,
Sep 20 2016
No, we don't.
,
Sep 20 2016
Hrm...More generally, how long are we going to cache alt-service info? An attacker managing to get stuff into the alt-service table seems like it's a lot worse than getting stuff into the cache, since even if the attacker is caught and the cert revoked, the alt-service entry will still be around, potentially talking to another attacker-controller website.
,
Sep 20 2016
The Alt-Svc header itself specifies its expiration in seconds, which we currently honor, without any upper limit. (In practice, the upper limit is something like 2^31 seconds = 68 years.) It sounds like capping expiration at a reasonable value would help alliviate this security concern. Maybe a month, or a week?
,
Sep 20 2016
Keep in mind that if www.example.com (via hacking) advertises alt.example.com as an alternative, then when chrome uses alt.example.com, it will still need to present a valid certificate for www.example.com. So in the scenario you present, if the cert is revoked, then the connection to alt.example.com will fail.
,
Sep 20 2016
Ahh, perhaps I misunderstood "Only if (1) negotiated HTTP/2, then the HTTP/2 session pooling logic may check (1)'s certificate against alternate.com"...That's describing for alternate requests, not requests to the original origin.
,
Sep 20 2016
Correct. Essentially, alt-svc acts like a CNAME. At the TLS layer, we're still "connecting to the origin" (just at a different IP address). The TLS handshakes sends www.example.com in the SNI and verifies that the cert is valid for www.example.com. If later, we have a request for alt.example.com we'll notice that we have a connection to the IP address for alt.example.com and will engage the "pooling logic". That means we'll look at the certificate on the connection and if it's *also* valid for alt.example.com then we can use the existing connection for this origin.
,
Nov 7 2016
,
Nov 29 2016
,
Apr 12 2017
,
Apr 12 2017
,
Apr 13 2017
,
May 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1fc76912bc055188dc76c09f6032096c079d4d86 commit 1fc76912bc055188dc76c09f6032096c079d4d86 Author: bnc <bnc@chromium.org> Date: Thu May 18 20:43:24 2017 Fix SpdySessionKey for HTTP/2 alternative Jobs. Change HttpStreamFactoryImpl::Job::GetSpdySessionKey() to construct a SpdySessionKey from the request URL, not the destination. This is the correct behavior, because the only layers that should worry about the destination are the socket pool and below. This does not change behavior in production, because HostPortPair::FromURL(origin_url_) can only differ from |destination_| for alternative jobs, and HTTP/2 alternative jobs are disabled in https://crrev.com/2821463002. BUG=615413, 706974 Review-Url: https://codereview.chromium.org/2887773006 Cr-Commit-Position: refs/heads/master@{#472919} [modify] https://crrev.com/1fc76912bc055188dc76c09f6032096c079d4d86/net/http/http_stream_factory_impl_job.cc [modify] https://crrev.com/1fc76912bc055188dc76c09f6032096c079d4d86/net/http/http_stream_factory_impl_job_controller_unittest.cc
,
Aug 10 2017
,
Jul 18
Brad and I were talking about this a bit this week. A note on the plan in comment #5: Doing the retry purely at TransportConnectJob has a slight downside because presumably we want a certificate error at the Alt-Svc name to retry connecting to the origin directly, rather than be fatal. That implies we want to loop over Alt-Svc names at the SSLConnectJob, not the TransportConnectJob. Although, as I write this, that means bifurcating the transport half of pools... or maybe just keying them on the Alt-Svc name? I suspect the latter will result in deadlock around the limits. Maybe we do actually need to de-layer the SSL and TCP pools before we can implement Alt-Svc... Anyway, supposing that issue isn't fatal, we would then run into issue #823387 , because we can't have SSLConnectJob-level retries right now. They cause infinite loops with HTTP proxy auth. (See first half of that bug for discussion of what goes wrong.) The solution for ERR_SSL_VERSION_INTERFERENCE handling (bifurcate upper-level pools and retry at HttpNetworkTransaction) is unlikely to work because we actually want pooling behavior. Thus we probably need to make SSLConnectJob retries actually possible, by fixing the proxy auth implementation. Half-baked thought: rather than bypassing the SSLConnectJob, somehow hand back some wrapper over the old SSLConnectJob instead of dropping it on the floor. When the auth information comes in, resume both SSLConnectJob- and TransportConnectJob-level operations. I have no idea if this is actually feasible. We'd need to stare at that. But, again, this is supposing the first issue isn't fatal. |
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by davidben@chromium.org
, Jun 6 2016