Seeing repeated instance of "network connection unavailable" |
||||||||||||||
Issue descriptionChrome Version : 70.0.3535.2 (Android), 70.0.3534.4 (Mac) I've seen this both on my Android phone and my mac pro. The client is obviously online (can navigate to pages just fine) but the sync engine is getting stuck in a loop repeated of NETWORK_CONNECTION_UNAVAILABLE errors, retrying every 10 minutes. See attached screenshot from Android. I was able to get my mac instance to fix itself by restarting, but will keep an eye on it if it comes back. It's unclear to me if this is a Sync issue or a network layer issue, but it's definitely affecting Sync on multiple devices for me. UMA: https://uma.googleplex.com/p//timeline_v2/?sid=599c530392ffcbd245735ad8ebc51522 Specifically, since Aug 25 there's been a big spike in FAILED errors in Canary (and Aug 30 in Dev). It's at about 20% of requests, where previously it was basically 0. I know there's been a fair amount of refactoring of the network logic that Sync users recently. In particular, https://chromium-review.googlesource.com/c/chromium/src/+/1174655 seems relevant.
,
Sep 5
Looks to me like the new code is logging network errors when the old code logged errors. Unclear if different handling of HTTP errors is the main issue here. Are there any tests for HTTP error handling?
,
Sep 5
Yeah, that's a sync specific error. According to UMA, the spike Sync sees is for FAILED network errors.
I managed to grab two event logs from two more times this repro'd (see attached). By repro, I mean that sync-internals shows that a request is made, but that the response was NETWORK_CONNECTION_UNAVAILABLE.
It looks like in each case there's only one request, and it has a 401 response code. Looking at the logging code:
base::UmaHistogramSparse(
"Sync.URLFetchResponse",
fetch_state_.request_succeeded
? fetch_state_.http_response_code
: net::URLRequestStatus::FromError(fetch_state_.error_code)
.ToNetError());
I would have expected the 401 code to be logged, not FAILED. From what I can tell that shouldn't have changed with the CL, assuming they both count the response as successful. Looking at the same timeline linked above, there also appears to be a drop in logged 401 responses (going from 1.8% to 0.2% of responses) coinciding with the spike in FAILED. The drop is both proportional and in bucket count.
So far this appears to be a fairly reliable repro on my Android device, so happy to help gather more info while I can.
,
Sep 5
,
Sep 5
SimpleURLLoader by default fails HTTP error responses when headers are received (This is because a number of consumers failed to check for both HTTP errors and network errors, since they were exposed via two entirely tangential mechanisms). There's a method to call if you really want HTTP error bodies. Unfortunately, if you don't want them, it's a bit tricky to figure out if a response was failed due to an HTTP error or a network error. Most consumers that do care about the difference just care about it for logging purposes.
,
Sep 5
+treib: This looks similar to crbug/878718 -- wdyt? and thanks Nicolas for the thorough bug report and investigation!
,
Sep 5
Yes, I think this is indeed the same as bug 878718 . From the investigation here (thanks!) it also seems clear that https://chromium-review.googlesource.com/c/chromium/src/+/1174655 is the culprit. I don't know yet if the fix should be in SimpleURLLoader or in Sync's net code.
,
Sep 5
as the author of crrev.com/1174655, I would like to help with fixing this. I am reading the various descriptions of the problem (here and on the other bug), but I might be missing still a set of concrete steps to reproduce it. Would someone mind to describe steps to me?
,
Sep 5
Thanks! Assigning to you then :) I *think* this happens if you restart Chrome, and Sync's access token has expired in the meantime. AFAIK we don't have any repro steps to explicitly/immediately trigger it.
,
Sep 5
Issue 878718 has been merged into this issue.
,
Sep 5
Is this [1] the problem? I.e. Sync treating the network request as failed if there's no body (combined with SimpleURLLoader's behavior of not downloading bodies on HTTP errors). Instead, maybe we should only treat it as failed if there was an actual network error? [1] https://cs.chromium.org/chromium/src/components/sync/engine/net/http_bridge.cc?rcl=337ba90ae4627a62c35f131555c9051f10a4952a&l=427
,
Sep 5
SimpleURLLoader::SetAllowHttpErrorResults could be of help here? Matt, I believe this is what you referred to in comment #5, right?
,
Sep 5
+pnoland FYI, as this likely relates to the work you're doing around an authenticated fetcher. Matt, that does make me wonder, when you say that most consumers don't care about the http errors, and the ones that do are just logging it, does that imply a bug on the consumer side? Basically any chrome service with an authenticated channel needs to care about at least 401 errors, and therefore needs to call SetAllowHttpErrorResults.
,
Sep 5
Thanks. This is on my list of gotchas that will get taken care of by default in the authenticated fetcher.
,
Sep 5
Zea: I have no clue. Theere are ~180 consumers of SimpleURLLoader, and I have no idea how to tell if one cares about signed-in state or not. If a lot do, or should, they should be sharing code, instead of having 180 different ways of dealing with 401s (Particularly if they don't test this scenario, but still care about it). There are at least a few consumers that do have 401 magic, though I don't believe any of them care about response bodies in the 401 case.
,
Sep 6
Sorry, for the delay here, guys (I interrupted a couples of days off to tackle this). I will post updates. Basically, to summarize what we want is to be able to throw an authentication error (not a network error), regardless of whether there is a |body_response| or not. Is this correct? In the worst case, I could put an emergency CL up for review (so it could reach canary asap and you guys could test it out) and follow up with unittests next.
,
Sep 6
I'd vote for having the unit tests as part of the initial patch. It's like not taking shortcuts when you're in a hurry or already late ;-)
,
Sep 7
Correct, 401s should not be considered network errors. I'm not completely sure if it matters here, but I think it'd be best if other HTTP errors would also not be treated as network errors. It's not emergency-level urgent, but it'd be great to get this fixed (with unit tests) in time for Beta. It's definitely a stable blocker.
,
Sep 10
Ok, I am back. Fix and tests coming up for review shortly.
,
Sep 11
CL is up for review here: CL https://crrev.com/c/1217122. Sync owners, feel free to join (the party) and review.
,
Sep 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69 commit 05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69 Author: Antonio Gomes <tonikitoo@igalia.com> Date: Tue Sep 11 17:10:45 2018 Fix sync logic failing with repeated instance of "network connection unavailable" This is a follow on of [1], where the sync::HttpBridge logic switched away from URLFetcher to SimpleURLLoader, as part of the network servicification effort. [1] https://crrev.com/c/1174655/ It turns out that the way sync::HttpBridge interpreted the response of an upload request changed after the migration: By default, SimpleURLLoader does not pass a valid response body in case of either HTTP or network errors. What is more, the new logic in HttpBridge::OnURLLoadComplete relied precisely on null-checking the responsebody to verify whether the upload was successful or not. This CL changes this particular check logic so that it matches the previous behavior: only actual network errors account for sync failures (which map to "network connection unavailable" errors) whereas HTTP errors (>= 400 && < 600) still account as a successful uploads. 401 specifically is handled to request expired user credentials in [2]. [2] https://cs.chromium.org/chromium/src/components/sync/engine_impl/net/sync_server_connection_manager.cc?l=84 BUG= 844968 , 878718 , 880633 Change-Id: I9d0b93b527f103149a7c78ba405dde27cae779f2 Reviewed-on: https://chromium-review.googlesource.com/1217122 Commit-Queue: Antonio Gomes <tonikitoo@igalia.com> Reviewed-by: Matt Menke <mmenke@chromium.org> Reviewed-by: Marc Treib <treib@chromium.org> Cr-Commit-Position: refs/heads/master@{#590363} [modify] https://crrev.com/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69/components/sync/engine/net/http_bridge.cc [modify] https://crrev.com/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69/components/sync/engine/net/http_bridge_unittest.cc
,
Sep 11
Please reopen if the issue persists.
,
Sep 11
[Auto-generated comment by a script] We noticed that this issue is targeted for M-70; it appears the fix may have landed after branch point, meaning a merge might be required. Please confirm if a merge is required here - if so add Merge-Request-70 label, otherwise remove Merge-TBD label. Thanks.
,
Sep 12
Yup, this will definitely need to be merged. zea@, can you verify that this fixes the error for you? We should also watch the UMA timeline to verify things go back to normal, but that'll take a few more days to roll in.
,
Sep 13
Your change meets the bar and is auto-approved for M70. Please go ahead and merge the CL to branch 3538 manually. Please contact milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), geohsu@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Sep 13
shoud I away a confirmation for zea@ about the issue being fixed on canary before I merge into m70?
,
Sep 13
I think we're pretty confident that this fixes the issue (and doesn't break anything else), and M70 is going to Beta any minute now :) So I vote for merging immediately, maybe we can make the first Beta release.
,
Sep 13
Ok, I suppose from Gerrit I can do menu > cherry-pick (on the dialog) Branch: branch-heads/3538 ? (sorry, I am not sure I have done this before)
,
Sep 14
Yup, that's correct! Thanks!
,
Sep 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/bb324165da2232905e108c07ef42e288db468fac commit bb324165da2232905e108c07ef42e288db468fac Author: Antonio Gomes <tonikitoo@igalia.com> Date: Fri Sep 14 06:31:30 2018 Fix sync logic failing with repeated instance of "network connection unavailable" This is a follow on of [1], where the sync::HttpBridge logic switched away from URLFetcher to SimpleURLLoader, as part of the network servicification effort. [1] https://crrev.com/c/1174655/ It turns out that the way sync::HttpBridge interpreted the response of an upload request changed after the migration: By default, SimpleURLLoader does not pass a valid response body in case of either HTTP or network errors. What is more, the new logic in HttpBridge::OnURLLoadComplete relied precisely on null-checking the responsebody to verify whether the upload was successful or not. This CL changes this particular check logic so that it matches the previous behavior: only actual network errors account for sync failures (which map to "network connection unavailable" errors) whereas HTTP errors (>= 400 && < 600) still account as a successful uploads. 401 specifically is handled to request expired user credentials in [2]. [2] https://cs.chromium.org/chromium/src/components/sync/engine_impl/net/sync_server_connection_manager.cc?l=84 BUG= 844968 , 878718 , 880633 Change-Id: I9d0b93b527f103149a7c78ba405dde27cae779f2 Reviewed-on: https://chromium-review.googlesource.com/1217122 Commit-Queue: Antonio Gomes <tonikitoo@igalia.com> Reviewed-by: Matt Menke <mmenke@chromium.org> Reviewed-by: Marc Treib <treib@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#590363}(cherry picked from commit 05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69) Reviewed-on: https://chromium-review.googlesource.com/1225490 Reviewed-by: Antonio Gomes <tonikitoo@igalia.com> Cr-Commit-Position: refs/branch-heads/3538@{#400} Cr-Branched-From: 79f7c91a2b2a2932cd447fa6f865cb6662fa8fa6-refs/heads/master@{#587811} [modify] https://crrev.com/bb324165da2232905e108c07ef42e288db468fac/components/sync/engine/net/http_bridge.cc [modify] https://crrev.com/bb324165da2232905e108c07ef42e288db468fac/components/sync/engine/net/http_bridge_unittest.cc |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by mmenke@chromium.org
, Sep 5