Quic protocol errors for multiple users since 2-27-18
Reported by
cruss...@kanren.net,
Mar 1 2018
Issue description
UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36
Example URL: maps.google.com
Steps to reproduce the problem:
1. Try to access Gmail, Maps, Google Calendar, or most Google services in Chrome.
2. The browser will sit and spin for an inordinately long period of time (30 seconds to 1+ minutes).
3. The browser will then proceed or fail with an "ERR_QUIC_PROTOCOL_ERR" message.
What is the expected behavior? Speedy access to Google services.
What went wrong? For at least the last 36 hours, multiple machines in our offices have had intermittent problems accessing Google services. I'd describe the problem as follows: for some indeterminate time things seem to work fine, then all of a sudden, for anywhere between 1 and 10 minutes, MOST Google services (mail, maps, calendar, search, etc.) refuse to load. After some investigation, here's what we learned.
The affected machines are on public IPs, NOT NATed, and have no firewall or state-aware devices between them and the Internet.
The affected users were using Chrome. During the interruption, if you switch to Firefox or a different browser, the same Google services load fine.
The affected devices are IPv4 and IPv6 enabled, and have publicly routable IPs for both protocols. They have IPv6-capable DNS servers configured and can resolve the IPv6 addresses for the affected Google services. I think it's therefore reasonable to assume that they're using IPv6 to access the Google services.
Traceroutes and pings to the affected Google services look fine during the problematic time periods.
Finally: occasionally a service takes long enough to load that the browser times out and an error page displays. The error page says "ERR_QUIC_PROTOCOL_ERR". Disabling the QUIC protocol in the Chrome browser has enabled one user to run error free for 8 hours now, while others (QUIC still enabled) still exhibit problems.
I've done a packet capture, and we appear to be getting QUIC protocol version 39 packets (Q039) served to us when attempting to access these services.
Does it occur on multiple sites: Yes
Is it a problem with a plugin? No
Did this work before? Yes. I believe it was the same version of Chrome, just about 36-48 hours ago.
Does this work in other browsers? Yes
Chrome version: 64.0.3282.186 Channel: stable
OS Version: 10.0
Flash Version:
,
Mar 2 2018
,
Mar 5 2018
My apologies for the slow response. I've been out of the office for a couple of days. Here is the net-export. I started the net-export, then tried to load photos.google.com in another tab and waited until it timed out with the ERR_QUIC_PROTOCOL_ERR message, then stopped it. If this didn't work as expected, or you need something else, let me know. BTW, as you might have deduced, the problem still exists for us. I had to move to Firefox to submit this ticket update, because although I could bring up this bug detail page by clicking the link in my email, I could NOT sign in to post an update. When I clicked the sign-in link, it timed out in Chrome. When I switched to Firefox, I could both load the page and sign in just fine.
,
Mar 5 2018
Thank you for providing more feedback. Adding the requester to the cc list. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 5 2018
+rch@: Can you take a look at the NetLog attached in #3? I see the ERR_QUIC_PROTOCOL_ERR, but can't tell what caused that error.
,
Mar 5 2018
An additional bit of troubleshooting information for you. On Friday morning, one of my network technicians left QUIC enabled in his browser but disabled the IPv6 stack on his Windows 10 machine, so he was using IPv4 only. He's been running that way since Friday morning with no issues. Our other network admin, directly across from him with an identical laptop, is running dual-stack (IPv6 and IPv4) and is still having the issue. This problem only seems to present itself when the user has Chrome, has the QUIC protocol enabled, and is using IPv6 connectivity (or is running dual-stack).
,
Mar 5 2018
It looks like your network is dropping a huge number of UDP packets. A large set of packets was missing, and we got a lot of retransmission timeouts.
t=114407 [st=17315] QUIC_SESSION_ACK_FRAME_SENT
--> delta_time_largest_observed_us = "72"
--> largest_observed = "71"
--> missing_packets = ["50","51","52","53","54","55","56","57","59","60","61","62","63","64","66","69","70"]
--> received_packet_times = [{"packet_number":71,"received":"426647081364"}]
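To put a rough number on the loss visible in this ACK frame, here is a minimal Python sketch that summarizes it. The values are copied from the event above; counting the window from the first missing packet up to largest_observed is a simplifying assumption made for illustration, not how Chrome itself reports loss.

```python
# Values copied from the QUIC_SESSION_ACK_FRAME_SENT event above.
largest_observed = 71
missing_packets = [50, 51, 52, 53, 54, 55, 56, 57, 59, 60, 61, 62, 63, 64, 66, 69, 70]


def loss_summary(missing, largest):
    """Summarize loss between the first missing packet and the largest observed one.

    Assumption: only the window [min(missing), largest] is considered; packets
    below that window are not described by this ACK frame excerpt.
    """
    first = min(missing)
    window = largest - first + 1
    lost = len(set(missing))
    return {
        "window": window,
        "lost": lost,
        "received": window - lost,
        "loss_ratio": lost / window,
    }


summary = loss_summary(missing_packets, largest_observed)
print(summary)
```

Under that assumption, 17 of the 22 packets numbered 50 through 71 never reached the client.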
,
Mar 5 2018
Thank you for the additional information. Since UDP isn't simple to emulate, and I can't replicate this packet loss with simple pings to the affected services, can you tell me from this diagnostic info whether the loss is in the direction FROM your servers TO my users, or upstream (FROM my users toward your servers/services)?
,
Mar 6 2018
Adding 'TE-NeedsTriageHelp' as this issue is already being investigated by the Devs. crussell@ Can you please confirm if 'Needs-Bisect' label can be removed? Thanks..
,
Mar 6 2018
yes, for as much as I understand the label, I believe the "Needs-Bisect" label can be removed.
,
Mar 6 2018
Additional troubleshooting information for you. Yesterday during the day I ran hours' worth of ping testing to multiple affected services (see IPv6 listing below) and had no packet loss whatsoever over hours of testing. This loss does not appear to affect the ICMP protocol in IPv6, only UDP.
Tested to:
2607:f8b0:4000:815::2004
2607:f8b0:4000:815::2005
2607:f8b0:4000:813::200e
2607:f8b0:4001:c14::67
2607:f8b0:4009:80d::200e
I chose these IPs by doing packet captures as I tried to access the affected services, so that I was pinging the actual anycast or CDN servers that were handing me the content.
,
Mar 6 2018
Seeing that ICMP was not affected, and since I didn't get a response from you on which direction the traffic was being dropped in, I did further testing last night around 5:00pm. We are a large research and education network with multiple BGP peering points. Last night I changed the local preference value on routes we received from Internet2 to force Google (ASN15169) traffic onto that connection and off of one of our other transit links. This only affects the path our on-net users take OUTWARD toward Google services; it does not affect the path your servers/services take back toward us. Forcing the IPv6 upstream traffic onto a different path made no difference and did NOT clear up the issue, which leads me to conclude that the packet loss is occurring in the downstream direction (FROM your servers TOWARD our users).
,
Mar 6 2018
Sorry for the delay in response. Looking at the netlogs, the connection sent packets, but packets were not received. The loss may be happening from the servers to the users, but it's also possible that the packets were lost in transit from client to server, since the server won't ping the client. crussell@: when you say affected services, do you mean Gmail, Maps, Google Calendar, or most Google services? Have you tried YouTube? The NetLog you provided was truncated, and not all events were listed for the "suspicious" connection. Could you turn on the netlog at the very beginning, then connect to different services and capture another NetLog, so that we can get more details about the connection setup? I wasn't able to reproduce the issue on my network and haven't received other similar bug reports, so it doesn't sound like a server issue but more like a local network issue.
,
Mar 6 2018
@crussell Thanks for the report. We're currently running an experiment to help with the user experience in situations like this. I'd like to confirm that this experiment helps your situation. Can you run Chrome with some command line arguments to enable this experiment?
--enable-quic --force-fieldtrials="QUIC/FlagEnabled" --force-fieldtrial-params="QUIC.FlagEnabled:retry_without_alt_svc_on_quic_errors/true"
This should cause Chrome to retry failed QUIC requests over TCP instead. *fingers crossed*
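For anyone scripting this test across several machines, a minimal Python sketch of assembling that command line follows. The flags are exactly the ones quoted in the comment above; the binary name "chrome" and the helper build_command are hypothetical and would need adjusting to the local install path.

```python
import shlex

# Hypothetical binary name; replace with the full path to Chrome on the test machine.
CHROME_BINARY = "chrome"

# Flags copied from the comment above. No shell quoting is needed because
# each argument is passed as its own list element.
EXPERIMENT_FLAGS = [
    "--enable-quic",
    "--force-fieldtrials=QUIC/FlagEnabled",
    "--force-fieldtrial-params=QUIC.FlagEnabled:retry_without_alt_svc_on_quic_errors/true",
]


def build_command(binary=CHROME_BINARY, extra_args=()):
    """Return the argv list for launching Chrome with the experiment enabled."""
    return [binary, *EXPERIMENT_FLAGS, *extra_args]


cmd = build_command()
print(shlex.join(cmd))
# To actually launch the browser: subprocess.Popen(cmd)
```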
,
Mar 6 2018
Have you been able to reproduce this on any other OS/environment besides Windows/IPv6 on your network?
,
Mar 6 2018
The most affected services have been Google search, Gmail, Calendar, Photos, and Maps. If other services have been impacted, we haven't seen them or heard user reports. I'll be happy to capture your additional netlog and run the command line arguments to gather more diagnostic information, but I'm working out of the office today and I'll be out of the office tomorrow on the road doing an install, so it may be Thursday or even Friday before I can get that information back to you. If possible, I'll have one of my other staffers in the office collect the info and relay it to me so I can post it to you. We've tested extensively with Windows/IPv4 (only) and the problem doesn't exist. We do have a Macintosh user or two in the office, but I haven't asked them specifically whether they've seen the problem when using Chrome. One of them will be back in the office tomorrow; if you'd like, I'll ask him to test IPv6 connectivity for a while and see if he has the problems with Chrome/IPv6 as well.
,
Mar 6 2018
Looking more closely at the net logs confirms your conclusion that the loss is in the server->client direction. The client sends ACKs with lots of missing packets, but all of the ACKs received by the client (sent from the server) contain no missing packets. So the server is getting everything the client is sending. Since this is v6 specific, I'm wondering if there's some sort of fragmentation issue going on here. Do you have the ability to get a packet trace with IP and UDP packet headers? If so, that could be valuable to see.
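The reasoning in this comment can be sketched in a few lines of Python: ACKs the client sends describe what it failed to receive from the server, while ACKs the client receives describe what the server failed to receive from the client. The event dictionaries below are a simplified, hypothetical shape rather than the real net-export JSON layout, the QUIC_SESSION_ACK_FRAME_RECEIVED name is assumed by analogy with the SENT event quoted earlier in the thread, and the sample data is illustrative only.

```python
def infer_loss_direction(events):
    """Infer which direction is losing packets from ACK frame events.

    Assumption: each event is a dict with a "type" string and a "params" dict
    that may contain a "missing_packets" list (a simplified log shape).
    """
    missing_in_sent_acks = 0      # packets the client never received
    missing_in_received_acks = 0  # packets the server never received
    for event in events:
        missing = len(event.get("params", {}).get("missing_packets", []))
        if event["type"] == "QUIC_SESSION_ACK_FRAME_SENT":
            missing_in_sent_acks += missing
        elif event["type"] == "QUIC_SESSION_ACK_FRAME_RECEIVED":
            missing_in_received_acks += missing

    if missing_in_sent_acks and not missing_in_received_acks:
        return "server->client"
    if missing_in_received_acks and not missing_in_sent_acks:
        return "client->server"
    if missing_in_sent_acks and missing_in_received_acks:
        return "both"
    return "none observed"


# Illustrative events only, not taken from the attached NetLog.
sample_events = [
    {"type": "QUIC_SESSION_ACK_FRAME_SENT",
     "params": {"largest_observed": "71", "missing_packets": ["50", "51", "52"]}},
    {"type": "QUIC_SESSION_ACK_FRAME_RECEIVED",
     "params": {"largest_observed": "40", "missing_packets": []}},
]

direction = infer_loss_direction(sample_events)
print(direction)
```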
,
Mar 7 2018
Additionally, could you try running with --quic-max-packet-length=1280 to see if that changes anything? (This is a bit of a shot in the dark, just trying to narrow a few things down)
,
Mar 8 2018
Gentlemen, I came in Tuesday night to try some of your suggestions. Unfortunately, the problem seems to have resolved itself. I had our office staff monitor throughout the day yesterday and again today, and none of us have experienced the issue at all since sometime during the day on Tuesday. We made no changes to the statewide backbone to precipitate the change; the problem just disappeared. Since I'm no longer able to replicate the problem, I'm unable to run your diagnostic commands. More accurately, I CAN run the commands, they just won't tell us much since Google services are now working all the time anyway. The sudden resolution suggests to me that the problem was somewhere in the path between your network (ASN15169) and ours (ASN2459). I suspect one of the providers in between most likely had an issue and quietly resolved it. We'll continue to monitor, but as of now the issue is gone (for all of our users) and I no longer see the packet loss in the net-exports.
,
Mar 9 2018
Thanks for the update. It's great it fixed itself, but it's unfortunate we don't understand the root cause. I'm going to close this; please re-open if it re-occurs.
Comment 1 by xunji...@chromium.org, Mar 1 2018
Labels: Needs-Feedback