Resolve complaints around caching of "bad" proxies
Issue description

Mechanisms like PAC allow specifying a list of proxy servers. Chrome prefers to use proxies in the order given, but will attempt to connect to them in a different order if it has deemed some of the proxies to be less reliable (based on recent failures).

Unfortunately these heuristics have received lots of user complaints:

* Chrome can't reliably judge what a "failed" request is; currently a single failed request marks the proxy as bad for the next 5 minutes.
* Marking a proxy as bad may be exacerbated by background or speculative requests.
* Proxies may fail only certain requests (per host, per protocol, per port, etc.), but once marked as bad they are put in the penalty box for all requests.
* Chrome doesn't special-case DIRECT when it is included as a proxy fallback option (whereas users may want it to be used only as a last-resort fallback).
* Similar to the above, but with a non-DIRECT option: even though two proxies are specified, users may expect only the first one to be used. So when it gets cached as bad and traffic goes to the second one, things break.

Either more effort needs to be invested in improving these heuristics, or we should consider removing this feature. We don't have metrics showing whether it is doing more good than harm. We also need more investigation comparing our behavior to other browsers to understand what the main disconnect is.
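The heuristic described above can be sketched roughly as follows. This is a toy model for discussion, not Chromium's code: the class name `BadProxyCache`, its methods, and the injectable `clock` are all hypothetical; only the "one failure, 5 minutes" rule and the re-ordering come from the description.

```python
import time

BAD_PROXY_TTL_SECONDS = 5 * 60  # penalty duration from the issue description


class BadProxyCache:
    """Toy model: a single failure puts a proxy in the penalty box."""

    def __init__(self, ttl=BAD_PROXY_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock      # injectable so the expiry can be tested
        self._bad_until = {}     # proxy -> time at which the penalty expires

    def mark_bad(self, proxy):
        self._bad_until[proxy] = self._clock() + self._ttl

    def is_bad(self, proxy):
        expiry = self._bad_until.get(proxy)
        return expiry is not None and self._clock() < expiry

    def reorder(self, proxies):
        """Keep the PAC order, but move currently-bad proxies to the back."""
        good = [p for p in proxies if not self.is_bad(p)]
        bad = [p for p in proxies if self.is_bad(p)]
        return good + bad
```

Note that the penalty here is keyed only on the proxy, not on the host, port, or protocol of the request that failed, which is the root of several of the complaints listed above.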
,
Jan 13 2017
Firefox also caches bad proxies, and from what I recall in a similar fashion to Chrome. We need to understand if users are unhappy with their approach or only Chrome's, and where they currently differ. (User complaints tend to say the issues are Chrome-only; sometimes the comparisons are difficult since Chrome may be issuing other background or speculative requests, which can cause a proxy to be marked as bad sooner.) As far as ranking DIRECT last, I seem to recall that used to be the case, but don't remember details.
,
Jan 13 2017
,
Jan 13 2017
Some of Chrome's internal requests use funky ports. Even just not blacklisting on requests that don't go through RDH may be an improvement, though that would certainly be a pretty hacky change.
,
Feb 9 2017
(Not a Chromium developer here, just an external bystander. Just my 2 cents.)

In my opinion the detection of bad proxies really needs work (it is the one thing that really works differently from other browsers). Therefore I think improving it is the best option to solve most complaints.

The bad proxy detection should probably be able to inspect the HTTP status code (if present) that was received on the CONNECT tunnel request. For example, in case of a 4xx error (like Squid saying 404, or some filter proxy saying 403) it should not mark anything as bad. On the other hand, if some 50x was returned, you may want to flag it as bad? Not sure how Firefox handles this.

Currently all responses other than status 200 seem to be treated as a full tunnel connection failure (by net/http/http_network_transaction). That is arguably wrong in the context of a proxy (it responds using valid HTTP, so the tunnel is alive/not really dead?).
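A minimal sketch of the status-code classification suggested in this comment. The function name is hypothetical, and treating "no HTTP response at all" (`None`) as a genuine failure is an added assumption; the commenter only proposes ignoring 4xx and possibly flagging 50x.

```python
def should_mark_proxy_bad(connect_status):
    """Decide whether a CONNECT result should count against the proxy.

    connect_status is the HTTP status code returned by the proxy, or None
    when no HTTP response arrived at all (refused, timed out, ...).
    """
    if connect_status is None:
        return True    # assumption: proxy unreachable is a real failure
    if 200 <= connect_status < 300:
        return False   # tunnel established
    if 400 <= connect_status < 500:
        return False   # proxy is alive and deliberately refused (403, 404, ...)
    if 500 <= connect_status < 600:
        return True    # proxy-side error: tentatively worth flagging
    return False       # any other valid HTTP reply: the proxy is still alive
```

Under this rule a filtering proxy answering 403 to `mtalk.google.com:5228` would never be penalized, while one that returns 502 or cannot be reached would be.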
,
Feb 13 2017
,
Feb 13 2017
Issue 668110 has been merged into this issue.
,
Feb 13 2017
Issue 680083 has been merged into this issue.
,
Feb 13 2017
,
Feb 13 2017
,
Feb 13 2017
,
Feb 13 2017
Issue 644248 has been merged into this issue.
,
Feb 13 2017
Issue 595678 has been merged into this issue.
,
Feb 13 2017
,
Feb 13 2017
Issue 581288 has been merged into this issue.
,
Feb 13 2017
Issue 537915 has been merged into this issue.
,
Feb 13 2017
Issue 531347 has been merged into this issue.
,
Feb 13 2017
Issue 411635 has been merged into this issue.
,
Feb 13 2017
Issue 312245 has been merged into this issue.
,
Feb 13 2017
,
Feb 13 2017
,
Feb 14 2017
,
Mar 14 2017
The root cause of my issue with it is that I am trying to manage an office in China so Chrome immediately marks the primary proxy as bad as it is also in China. Simply having the option to switch off bad proxy detection would make Chrome usable in our environment again.
,
Jul 11 2017
Has the fix landed in Canary? I have a case in which, after trying in Canary, the PAC file worked as expected.
,
Jul 11 2017
No code has been landed to improve the situation here.
,
Jul 18 2017
eroman@chromium.org when can we expect a solution? This problem has existed for 6 years! Add an option in chrome://net-internals/#proxy, and add the ability to administer this feature using the ADM file for GPO. I think those who write PAC files can administer a proxy, and will understand if it stops responding. This setting is needed in corporate networks: if a proxy server goes down, it does not mean that users should now be able to go to porn sites, right?
,
Jul 20 2017
RE comment #26: I haven't gotten around to working on this, and hence have no specific timeline.
,
Sep 8 2017
Please advise if there are any plans to resolve the issue. In our enterprise we face it in some locations, usually when a blocked address is accessed via Chrome. At the least it would be good to be able to control the bad proxy detection feature via GPO. Also, in the same locations where bad proxy detection occurs, Chrome appears to pick up the PAC file incorrectly (it is distributed via DHCP option 252); re-applying the settings helps in this case, but the problem may appear again. Can you advise whether this could be related to bad proxy detection, or is it another issue?
,
Sep 12 2017
,
Sep 18 2017
,
Nov 7 2017
Issue 781290 has been merged into this issue.
,
Nov 15 2017
Hello team, do you have any update about this?
,
Nov 15 2017
No one's working on this, or, to the extent of my knowledge, has any concrete plans to do so.
,
Nov 15 2017
The simplest heuristic adjustment I would suggest is to de-prioritize a proxy only after N failed requests (whereas right now Chrome does so after 1 failed request to the proxy). This should reduce how often we re-prioritize proxies when the proxy fails requests, without needing to build a bunch of one-off heuristics that try to model the (known) problematic behaviors:

* proxy blocks ports other than :80 and :443, causing the occasional requests to mtalk.google.com:5228 to mark the proxy as bad
* proxy is flaky and occasionally fails requests (with low probability)
* proxy fails/blocks requests to a particular protocol
* proxy fails/blocks requests to a particular host
* target server failures are (mis)attributed as a proxy server failure

The higher we make N, the bigger the performance hit before Chrome will fail over to other (functional) proxies in the list. I suspect something on the order of 5-10 would be sufficient, especially if we count N as the number of unique origins rather than just unique requests.

It is unfortunate to further delay the de-prioritization, as conceptually any proxy in the list should be viable for the user-agent. But given the frustrations of users in environments that make assumptions on the proxy list ordering, I think it is a worthwhile tradeoff.
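As a rough illustration of the "N unique origins" idea, here is a small sketch. The class `FailureThreshold` and its methods are hypothetical, and clearing the count on a success is an added assumption that the proposal above does not specify.

```python
class FailureThreshold:
    """Count failures per proxy by unique origin; trip after n of them."""

    def __init__(self, n=5):
        self._n = n
        self._failed_origins = {}  # proxy -> set of origins that failed

    def record_failure(self, proxy, origin):
        """Record one failure; return True once the proxy should be
        de-prioritized."""
        origins = self._failed_origins.setdefault(proxy, set())
        origins.add(origin)
        return len(origins) >= self._n

    def record_success(self, proxy):
        # Assumption: any success through the proxy clears its history.
        self._failed_origins.pop(proxy, None)
```

With this counting, repeated failures to a single origin such as `mtalk.google.com:5228` never trip the threshold on their own, whereas a proxy that is genuinely down fails for many distinct origins and is de-prioritized after the first n of them.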
,
Nov 15 2017
Issue 784463 has been merged into this issue.
,
Nov 29 2017
,
Nov 29 2017
Issue 698263 has been merged into this issue.
,
Jan 25 2018
,
Feb 7 2018
In a very large enterprise environment that is dependent on geo-IP based data, this is becoming a showstopper. After all these years of contending with issues flowing in regarding failover to new proxies from a proxy mistakenly marked as 'bad' because the proxy is doing its job blocking CONNECTs to non-standard ports, I am ready to subvert this algorithm with DNS, or roll back Chrome completely. When will something be done about this?
,
Feb 7 2018
Can you not resolve this in the PAC script? If the proxy is blocking CONNECTs purely based on the port, it is simple to route those requests differently in the PAC script and avoid sending them to that proxy in the first place.

If your proxy is being marked as bad by Chrome, it means you have some option in the proxy list which DOES permit these requests to be made. So the end effect is that the proxy configuration is allowing those requests, but the proxy you specified can't handle them. Hence from Chrome's perspective your proxy is not working.

@dakotadave2001 and others on this thread: Are you using a final fallback to DIRECT (which subsequently succeeds at requests that other proxies failed), or are you using another proxy that succeeds on some subset of requests? In other words, does your PAC script resemble:

// A
return "PROXY foo:8080; DIRECT";

vs

// B
return "PROXY foo:8080; PROXY bar:8888";
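To make the port-based routing idea concrete, here is the decision logic modelled in Python for illustration only. A real PAC file is JavaScript and implements `FindProxyForURL(url, host)`; the function name `find_proxy_for_url`, the allowed-port set, and the choice of `DIRECT` for other ports are assumptions based on scenario A above.

```python
from urllib.parse import urlsplit

# Assumption: the proxy only tunnels to these ports (see scenario A).
ALLOWED_PORTS = {80, 443}
DEFAULT_PORTS = {"http": 80, "https": 443}


def find_proxy_for_url(url):
    """Model of a PAC decision that keeps doomed requests away from the proxy."""
    parts = urlsplit(url)
    port = parts.port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    if port in ALLOWED_PORTS:
        return "PROXY foo:8080; DIRECT"
    # The proxy would refuse this port anyway, so never send it there.
    return "DIRECT"
```

Because a request to `mtalk.google.com:5228` is never sent to `foo:8080`, it cannot fail there and so cannot cause that proxy to be marked as bad.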
,
Feb 8 2018
@elawrence: Are you aware of any Microsoft documentation for their policy around failed proxies and evaluation of proxy list? (The only thing I am turning up is http://support.microsoft.com/kb/2551554 which describes BadProxyExpiresTime registry setting, but not what signals they consider for marking proxies as bad or fallback orderings).
,
Feb 8 2018
Just want to add an update to this since I had put in a ticket some time back. It appears this was an issue with older versions of Squid; I'm guessing they didn't support whatever Chrome is currently trying to do. I have since rebuilt the server on new hardware and there is no longer an issue. Currently it's on Squid 3.5.12.
,
Feb 8 2018
Re #41: Sorry, no, I don't recall this being documented, other than the timeout configuration. As far as I understand it, fallback simply chooses the next result in the FindProxyForURL string. Historically, a proxy was marked bad if it was unreachable, or, in the case of SOCKS proxies, if the proxy failed to respond to a request before the request timeout. (The latter behavior may have been fixed by now, after it led to a fiasco a half decade ago where a given component in IE made requests with a 5 second timeout and the component's server was attacked via DDOS. All corporate customers using SOCKS proxies were effectively broken as the SOCKS proxy lists were all marked "Bad" a few seconds after startup. :) I don't recall the code having the behavior of marking a proxy as bad if it refuses to satisfy a given CONNECT, but I haven't looked at this in a VERY long time.
,
Feb 8 2018
All the browsers do a form of marking proxies as bad and skipping them, albeit with differences in how failures are counted (connectivity error, timeout, protocol error, etc), how the list is re-ordered, and how long until the proxy is tried again, what settings control the timeout, etc. Other than policy differences in marking proxies bad (which lack standardization and documentation), the other big difference is what traffic the browser generates and therefore triggers failures. In the particular scenario of "non-standard ports", Chrome gets disproportionately affected by proxies that barf on CONNECT to non-standard ports, since internally it likes to send traffic to things like "mtalk.google.com:5228". So when this request fails because the proxy closes the connection, AND we subsequently succeed using a different proxy option provided in the proxy list by the PAC script, that intolerant proxy server is considered to be working poorly and hence gets de-prioritized with respect to the other proxy options in the list that did work connecting to mtalk.google.com:5228 (i.e. it is marked as bad). There isn't anything inherently wrong with this policy, but there is ambiguity on how proxy "failures" are counted, and what user preference is for how to interpret the proxy list returned by PAC. Moreover, Chrome has been at the forefront of protocol level changes, which can increase the chances of proxy servers being intolerant of its requests. The general feeling from enterprise users on this thread is that they would like a strict ordering of proxies, such that proxies are tried precisely in order each time, and past failures are not considered indicative of future failures. 
This has been discussed as a configuration option (Issue 227288), however that has a variety of drawbacks (off by default, would perform badly in the case of timeouts, one more option to maintain), and I don't think it addresses the core issue (the disconnect between user expectation of how proxy lists are interpreted and what UAs actually do, due to lack of standards).

Per comment #40, the catch-all technique you can use today, which will work across browsers, is to keep these doomed requests from being sent to the proxy in the first place by blacklisting them in the PAC script.

The other option is to provide a proxy fallback list which is tolerant of any option being used. For instance, remove the fallback to DIRECT or to another proxy that is able to handle these requests. Why include options in the proxy list which shouldn't actually be used? If the proxy is being used to filter content (and hence can fail any request seemingly at random) then there probably shouldn't be fallbacks in the list anyway, and certainly not fallbacks that would then bypass the filtering.

The option being explored on this bug thread is to modify Chrome's heuristics to work around the ambiguity in how a proxy is considered to be "bad", by adding some slop before a failing proxy is de-prioritized. This can address the specific case of proxies blocking connections on non-standard ports, as well as smooth over cases like proxies that block specific hosts, or proxies that are unreliable and fail requests sporadically. Simply adding slop is not particularly elegant, but if we can't distinguish intentional proxy failures from misbehaving proxy failures, it is the best we can do.
,
Feb 8 2018
@eroman - 'If the proxy is blocking CONNECTs purely based on its port, it is simple to route those requests differently in the PAC script and avoid sending them to that proxy in the first place.' - No, not really: no other browsers exhibit this behaviour, so I do not want to set a precedent for one-offs with Chrome.

'If your proxy is being marked as bad by Chrome, it means you have some option in the proxy list which DOES permit these requests to be made.' This is incorrect; the PAC file is as such:

var internetproxy = "PROXY proxy1.foobar.com:80; PROXY proxy2.foobar.com:80";

Proxy 1 and 2 are both configured with the exact same policy, to block CONNECTs on unwanted ports. Chrome will fail the first proxy, then allow traffic on the second proxy. I suppose this is because it does not have any other choice; there is no direct access.

Please add the disablement of this behaviour/functionality as an option in Chrome, thanks.
,
Feb 8 2018
@eroman, I appreciate the attention you are giving to this matter, my previous comment was a reply to comment 40 before I saw comment 44 - it still applies however- Thanks.
,
Feb 8 2018
Thanks for your feedback dakotadave2001. Regarding comment #45:

> This is incorrect, the pac file is as such:
> var internetproxy = "PROXY proxy1.foobar.com:80; PROXY proxy2.foobar.com:80";

That is not the case: proxy1.foobar.com:80 can only be marked as bad if proxy1.foobar.com:80 failed the request AND proxy2.foobar.com:80 succeeded at it. Specifically, here is the flow:

First Chrome will do a CONNECT to proxy1.foobar.com. In our scenario we say this fails with ERR_TUNNEL_CONNECTION_FAILED.

Next Chrome will try a CONNECT to proxy2.foobar.com, the second proxy in the list. If this also fails with ERR_TUNNEL_CONNECTION_FAILED, then the entire request will fail with ERR_TUNNEL_CONNECTION_FAILED, and none of the proxies will be marked as bad (since we didn't have a clear signal that one proxy was working better than another one). In this case, since nothing was marked as bad (you can see the bad list on chrome://net-internals/#proxy), the next attempt at proxy resolution will once again start with proxy1.foobar.com:80 as one expects (and probably end up failing both attempts yet again).

It is only if the second attempt we made at connecting to a proxy (using proxy2.foobar.com:80) succeeds that proxy1.foobar.com:80 will be marked as "bad". The effect of marking proxy1.foobar.com:80 as "bad" is that the next time we interpret the proxy list "PROXY proxy1.foobar.com:80; PROXY proxy2.foobar.com:80", we will effectively re-order it to "PROXY proxy2.foobar.com:80; PROXY proxy1.foobar.com:80" when attempting connections, since we have reason to believe that proxy2.foobar.com:80 works better than proxy1.foobar.com:80 (as it succeeded while proxy1.foobar.com:80 failed).

I did some testing and can confirm this is how Chrome behaves in this scenario, so this isn't just theoretical. Hence I think the description in comment #45 is not a correct diagnosis.

If you are seeing something different from what I described, please send me a NetLogDump (https://dev.chromium.org/for-testers/providing-network-details) and I'll take a look.
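The flow described in this comment can be summarized in a short sketch. This is an illustrative model only, not Chromium's implementation: `attempt_request` and the `connect` callback are hypothetical names, and the expiry of the "bad" state is left out for brevity.

```python
def attempt_request(proxies, connect, bad_proxies):
    """Try proxies in order; return the one that worked, or None.

    connect(proxy) -> bool stands in for issuing the CONNECT.
    bad_proxies is a set that is updated in place.
    """
    # Proxies already marked bad are tried last; sorted() is stable, so the
    # PAC order is otherwise preserved.
    ordered = sorted(proxies, key=lambda p: p in bad_proxies)
    failed = []
    for proxy in ordered:
        if connect(proxy):
            # Only now is there a clear signal that the earlier proxies
            # worked worse than this one.
            bad_proxies.update(failed)
            return proxy
        failed.append(proxy)
    return None  # every proxy failed: nothing is marked bad
```

When both proxies fail, `bad_proxies` stays empty and the next request starts with the first proxy again; when the first fails and the second succeeds, the first is marked bad and the second is tried first on the next request.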
,
Feb 9 2018
@eroman, my testing indicates otherwise, so let's agree to disagree. Can we please have a switch to disable the proxy failure behaviour if we are so inclined?
,
Feb 9 2018
If there's a bug, causing the code not to do what it should, we should identify and fix the issue. In general, we prefer to minimize options, because the number of possible configurations blows up exponentially, and adding (collectively) a lot of bloat for options used by 0.01% of users makes code harder to reason about, and adds bugs to rarely used and tested code, and increases the maintenance burden.
,
Feb 9 2018
@dakotadave2001: Rather than agree to disagree, let's resolve the source of this mismatch :) Per comment #47 please send me a netlog dump for your problematic fallback. After starting the capture go to chrome://net-internals/#proxy and clear any "bad" proxies before reproducing. My procedure in testing was to use a PAC script that returned unconditionally: return "PROXY p1; PROXY p2"; Where p1 and p2 were proxy servers that either accepted or failed on CONNECT. I confirmed that when both p1 and p2 fail CONNECT requests, neither gets marked as bad. And that when p1 fails CONNECTs but p2 accepts them, then p1 gets marked as bad and p2 then gets tried first for the duration of the "bad" timeout.
,
Feb 9 2018
It occurs to me that you may both be right. ProxyResolvingClientSocket used to try direct connections even when there was a configured proxy, and I think if the direct fallback worked, it may have marked all proxies as bad. I'm not sure under what conditions we use that class. The weird behavior has since been changed, so it only uses direct if that's allowed as a fallback, but that change hasn't made it to stable yet (it'll be in M65). It's also possible that there's some other proxy consumer doing something similar.
,
Feb 12 2018
@mmenke - thanks! So is it possible this may be fixed in M65, i.e. Mar 6th, 2018 (Mar 13th for Chrome OS)?
,
Feb 12 2018
It is possible, if my theory is correct, though it's certainly also possible something else is going on here - if this is just due to the first proxy being unreliable, M65 won't do anything about that.
,
Feb 13 2018
Proxy uptime and availability is 99.9% :) This is and always has been a matter of policy, or more precisely, interpretation (or lack thereof) of policy enforcement response.
,
Feb 14 2018
You can try the Canary channel and see if the issue goes away, if you want. Even if the proxy and network are fine, there could be other code in Chrome doing something similar to the removed logic. With the possible exception of ChromeOS, Canary will use its own configuration directory, so won't mess up settings for a stable Chrome install. Dev and beta share directories with stable, I believe, so if they modify the format of data saved on disk, you can end up losing data.
,
Feb 27 2018
Issue 815239 has been merged into this issue.
,
Feb 27 2018
Thanks for all the feedback on this thread! Based on the recent data, a reasonable summary is:

"Chrome falls back to other proxies after a failure during CONNECT, whereas other browsers just fail the request. This causes Chrome to mark the proxy as 'bad', and then use other fallback options in the proxy list, whereas other browsers continue using the proxy that failed on CONNECT."

While there isn't a standard specification for how proxy fallback lists are to be applied, Chrome is the only browser that considers failures during CONNECT. The majority of UAs only apply fallback for connection-level failures, and not for protocol-level failures.

The desktop browsers that I tested were:

* Firefox 58.0.2
* Microsoft Edge 38.14393.2068.0
* Safari 11.0.3 (13604.5.6)

The results of this testing can be reproduced by setting the PAC script to http://127.0.0.1:8081, and then launching these two servers:

python pac_server.py 8081
python http_proxy 8084

Chrome's policy of failing over on errors during CONNECT goes back to the earliest version of Chrome, over 9 years ago (https://chromium.googlesource.com/chromium/src/+/86ec30d6710923cf1c193eb88b1e6251f831e0ef). While this policy has its merits (it can fail over to other proxy options when, for instance, the proxy is unable to handle SSL traffic), it is clearly the cause of compatibility woes for the users on this list, who rely on failover not happening in this case.

To resolve this we should align Chrome with other browsers by no longer doing failover on errors during CONNECT. (This obsoletes the proposal in comment #34 to add heuristics to paper over this.)

There is some risk that changing Chrome's policy now will cause a different set of compatibility problems for those deployments dependent on Chrome's fallback policy. But aligning with other browsers is more valuable.
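The difference between the two policies summarized above can be sketched as follows. The names (`Outcome`, `resolve`, `fallback_on_tunnel_failure`) are hypothetical and the model ignores timeouts and the bad-proxy cache; it only contrasts falling back on connection-level failures with falling back on failures during CONNECT as well.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    CONNECTION_FAILED = "connection_failed"  # proxy could not be reached
    TUNNEL_FAILED = "tunnel_failed"          # proxy answered CONNECT with an error


def resolve(proxies, attempt, fallback_on_tunnel_failure):
    """Return the proxy that served the request, or None if it failed.

    attempt(proxy) -> Outcome stands in for the actual network attempt.
    """
    for proxy in proxies:
        outcome = attempt(proxy)
        if outcome is Outcome.OK:
            return proxy
        if outcome is Outcome.TUNNEL_FAILED and not fallback_on_tunnel_failure:
            # The proxy was reachable but refused: fail the request
            # without considering further fallbacks.
            return None
        # Otherwise move on to the next proxy in the list.
    return None
```

With `fallback_on_tunnel_failure=True` (the behavior described as Chrome's long-standing policy) a refused CONNECT on the first proxy lets the second proxy serve the request; with `False` (the behavior attributed to the other browsers tested) the request simply fails, while an unreachable proxy still triggers fallback in both modes.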
,
Mar 8 2018
,
Mar 29 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ad1fa4c6d75105c92733270b373425893eb8588a

commit ad1fa4c6d75105c92733270b373425893eb8588a
Author: Eric Roman <eroman@chromium.org>
Date: Thu Mar 29 14:28:48 2018

Remove proxy fallback on errors during CONNECT.

Previously when an error occurred while establishing a tunnel to the proxy, the next proxy in the list would be tried (and might succeed). Whereas now the request will immediately fail with ERR_TUNNEL_CONNECTION_FAILED without considering any proxy fallbacks.

The change in policy is in line with what other major web browsers do.

Bug: 680837
Change-Id: Ifc9833b87e413b403c05cbbf8d04f22fa5d98c21
Reviewed-on: https://chromium-review.googlesource.com/981260
Commit-Queue: Eric Roman <eroman@chromium.org>
Reviewed-by: Helen Li <xunjieli@chromium.org>
Cr-Commit-Position: refs/heads/master@{#546825}

[modify] https://crrev.com/ad1fa4c6d75105c92733270b373425893eb8588a/net/http/http_stream_factory_impl_unittest.cc
[modify] https://crrev.com/ad1fa4c6d75105c92733270b373425893eb8588a/net/http/proxy_fallback.cc
,
Mar 29 2018
Tentatively marking as fixed. I will update this thread with testing instructions as soon as the change is available on the Canary channel (likely tomorrow). Assuming testing is successful, this change would be included in Chrome 67. There is some risk of compatibility regression for Chrome users that actually _were_ relying on the fallback during CONNECT to discover a functional proxy. We can't quantify that given our inability to distinguish unexpected successes vs unexpected failures (with this change we are intentionally increasing errors), but I am optimistic it will be uncommon. We will have a better idea of impact once this reaches the Beta channel.
,
Mar 30 2018
The fix is now available on Chrome 67.0.3384.0. This can be obtained using the Canary Channel: https://www.google.com/chrome/browser/canary.html Please test this build so we can confirm the problem is resolved. (If things aren't working, double check that chrome://version/ reads 67.0.3384.0 or higher). Thanks!
,
May 22 2018
,
May 24 2018
Hello, I've asked about 4 users to install the beta version. So far there have been no complaints about the issue, so it seems to be fixed in the new version. Will keep you updated.
,
May 24 2018
Just to keep you updated: from my perspective (netops in a worldwide company) it has something to do with blocking some HTTP requests. By default our proxies block all sites categorized as "advertisements". Once I add affected people to the "don't block web advertisements" list, the issue no longer happens for them.
,
May 30 2018
,
May 30 2018
@dreathan@gmail.com: Thanks for the confirmation. Note that version 67 was released to the stable channel yesterday (so your users can return to the stable channel rather than continue using beta channel).
,
Jun 13 2018
Comment 1 by asanka@chromium.org
, Jan 13 2017