New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 880633 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 11
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac
Pri: 1
Type: Bug-Regression



Sign in to add a comment

Seeing repeated instance of "network connection unavailable"

Project Member Reported by zea@chromium.org, Sep 4

Issue description

Chrome Version       : 70.0.3535.2 (Android), 70.0.3534.4 (Mac)

I've seen this both on my Android phone and my mac pro. The client is obviously online (can navigate to pages just fine) but the sync engine is getting stuck in a loop repeated  of NETWORK_CONNECTION_UNAVAILABLE errors, retrying every 10 minutes.

See attached screenshot from Android. I was able to get my mac instance to fix itself by restarting, but will keep an eye on it if it comes back.

It's unclear to me if this is a Sync issue or a network layer issue, but it's definitely affecting Sync on multiple devices for me.

UMA: https://uma.googleplex.com/p//timeline_v2/?sid=599c530392ffcbd245735ad8ebc51522

Specifically, since Aug 25 there's been a big spike in FAILED errors in Canary (and Aug 30 in Dev). It's at about 20% of requests, where previously it was basically 0.

I know there's been a fair amount of refactoring of the network logic that Sync users recently. In particular, https://chromium-review.googlesource.com/c/chromium/src/+/1174655 seems relevant.
 
Screenshot (Sep 4, 2018 4:47:13 PM).png
640 KB View Download
Labels: Needs-Feedback
Not sure what NETWORK_CONNECTION_UNAVAILABLE means - it's not a net error.  I think we need a repro or a net-export log to make any progress (Though even a network error code might be nice).
Looks to me like the new code is logging network errors when the old code logged errors.  Unclear if different handling of HTTP errors is the main issue here.  Are there any tests for HTTP error handling?
Yeah, that's a sync specific error. According to UMA, the spike Sync sees is for FAILED network errors.

I managed to grab two event logs from two more times this repro'd (see attached). By repro, I mean that sync-internals shows that a request is made, but that the response was NETWORK_CONNECTION_UNAVAILABLE.

It looks like in each case there's only one request, and it has a 401 response code. Looking at the logging code:

  base::UmaHistogramSparse(
      "Sync.URLFetchResponse",
      fetch_state_.request_succeeded
          ? fetch_state_.http_response_code
          : net::URLRequestStatus::FromError(fetch_state_.error_code)
                .ToNetError());


I would have expected the 401 code to be logged, not FAILED. From what I can tell that shouldn't have changed with the CL, assuming they both count the response as successful. Looking at the same timeline linked above, there also appears to be a drop in logged 401 responses (going from 1.8% to 0.2% of responses) coinciding with the spike in FAILED. The drop is both proportional and in bucket count.

So far this appears to be a fairly reliable repro on my Android device, so happy to help gather more info while I can.
net_internals_log (1)
438 KB View Download
net_internals_log
467 KB View Download
Labels: -Needs-Feedback
SimpleURLLoader by default fails HTTP error responses when headers are received (This is because a number of consumers failed to check for both HTTP errors and network errors, since they were exposed via two entirely tangential mechanisms).  There's a method to call if you really want HTTP error bodies.  Unfortunately, if you don't want them, it's a bit tricky to figure out if a response was failed due to an HTTP error or a network error.  Most consumers that do care about the difference just care about it for logging purposes.
Cc: treib@chromium.org
+treib: This looks similar to crbug/878718 -- wdyt?

and thanks Nicolas for the thorough bug report and investigation!
Cc: blundell@chromium.org
Yes, I think this is indeed the same as  bug 878718 . From the investigation here (thanks!) it also seems clear that https://chromium-review.googlesource.com/c/chromium/src/+/1174655 is the culprit. I don't know yet if the fix should be in SimpleURLLoader or in Sync's net code.
as the author of crrev.com/1174655, I would like to help with fixing this.

I am reading the various descriptions of the problem (here and on the other bug), but I might be missing still a set of concrete steps to reproduce it.

Would someone mind to describe steps to me?
Labels: -Type-Bug OS-Android OS-Chrome OS-Linux OS-Windows Type-Bug-Regression
Owner: toniki...@chromium.org
Status: Assigned (was: Unconfirmed)
Thanks! Assigning to you then :)

I *think* this happens if you restart Chrome, and Sync's access token has expired in the meantime. AFAIK we don't have any repro steps to explicitly/immediately trigger it.
Cc: msarda@chromium.org droger@chromium.org zea@chromium.org
 Issue 878718  has been merged into this issue.
Is this [1] the problem? I.e. Sync treating the network request as failed if there's no body (combined with SimpleURLLoader's behavior of not downloading bodies on HTTP errors). Instead, maybe we should only treat it as failed if there was an actual network error?

[1] https://cs.chromium.org/chromium/src/components/sync/engine/net/http_bridge.cc?rcl=337ba90ae4627a62c35f131555c9051f10a4952a&l=427
SimpleURLLoader::SetAllowHttpErrorResults could be of help here?

Matt, I believe this is what you referred to in comment #5, right?
Cc: pnoland@chromium.org
+pnoland FYI, as this likely relates to the work you're doing around an authenticated fetcher.

Matt, that does make me wonder, when you say that most consumers don't care about the http errors, and the ones that do are just logging it, does that imply a bug on the consumer side? Basically any chrome service with an authenticated channel needs to care about at least 401 errors, and therefore needs to call SetAllowHttpErrorResults.
Thanks. This is on my list of gotchas that will get taken care of by default in the authenticated fetcher.
Zea:  I have no clue.  Theere are ~180 consumers of SimpleURLLoader, and I have no idea how to tell if one cares about signed-in state or not.  If a lot do, or should, they should be sharing code, instead of having 180 different ways of dealing with 401s (Particularly if they don't test this scenario, but still care about it).

There are at least a few consumers that do have 401 magic, though I don't believe any of them care about response bodies in the 401 case.
Status: Started (was: Assigned)
Sorry, for the delay here, guys (I interrupted a couples of days off to tackle this). I will post updates.

Basically, to summarize what we want is to be able to throw an authentication error (not a network error), regardless of whether there is a |body_response| or not. Is this correct?

In the worst case, I could put an emergency CL up for review (so it could reach canary asap and you guys could test it out) and follow up with unittests next.
I'd vote for having the unit tests as part of the initial patch.
It's like not taking shortcuts when you're in a hurry or already late ;-)
Labels: ReleaseBlock-Stable
Correct, 401s should not be considered network errors. I'm not completely sure if it matters here, but I think it'd be best if other HTTP errors would also not be treated as network errors.

It's not emergency-level urgent, but it'd be great to get this fixed (with unit tests) in time for Beta. It's definitely a stable blocker.
Ok, I am back. Fix and tests coming up for review shortly.
CL is up for review here: CL https://crrev.com/c/1217122. Sync owners, feel free to join (the party) and review.
Project Member

Comment 21 by bugdroid1@chromium.org, Sep 11

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69

commit 05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69
Author: Antonio Gomes <tonikitoo@igalia.com>
Date: Tue Sep 11 17:10:45 2018

Fix sync logic failing with repeated instance of "network connection unavailable"

This is a follow on of [1], where the sync::HttpBridge logic switched
away from URLFetcher to SimpleURLLoader, as part of the network
servicification effort.

[1] https://crrev.com/c/1174655/

It turns out that the way sync::HttpBridge interpreted the response
of an upload request changed after the migration:

By default, SimpleURLLoader does not pass a valid response body in case
of either HTTP or network errors. What is more, the new logic in
HttpBridge::OnURLLoadComplete relied precisely on null-checking the responsebody
to verify whether the upload was successful or not.

This CL changes this particular check logic so that it matches the
previous behavior: only actual network errors account for sync failures
(which map to "network connection unavailable" errors) whereas HTTP errors
(>= 400 && < 600) still account as a successful uploads.
401 specifically is handled to request expired user credentials in [2].

[2] https://cs.chromium.org/chromium/src/components/sync/engine_impl/net/sync_server_connection_manager.cc?l=84

BUG= 844968 , 878718 , 880633 

Change-Id: I9d0b93b527f103149a7c78ba405dde27cae779f2
Reviewed-on: https://chromium-review.googlesource.com/1217122
Commit-Queue: Antonio Gomes <tonikitoo@igalia.com>
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Marc Treib <treib@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590363}
[modify] https://crrev.com/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69/components/sync/engine/net/http_bridge.cc
[modify] https://crrev.com/05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69/components/sync/engine/net/http_bridge_unittest.cc

Status: Fixed (was: Started)
Please reopen if the issue persists. 
Labels: Merge-TBD
[Auto-generated comment by a script] We noticed that this issue is targeted for M-70; it appears the fix may have landed after branch point, meaning a merge might be required. Please confirm if a merge is required here - if so add Merge-Request-70 label, otherwise remove Merge-TBD label. Thanks.
Labels: -Merge-TBD Merge-Request-70
Yup, this will definitely need to be merged.

zea@, can you verify that this fixes the error for you?
We should also watch the UMA timeline to verify things go back to normal, but that'll take a few more days to roll in.
Project Member

Comment 25 by sheriffbot@chromium.org, Sep 13

Labels: -Merge-Request-70 Hotlist-Merge-Approved Merge-Approved-70
Your change meets the bar and is auto-approved for M70. Please go ahead and merge the CL to branch 3538 manually. Please contact milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), geohsu@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
shoud I away a confirmation for zea@ about the issue being fixed on canary before I  merge into m70?
I think we're pretty confident that this fixes the issue (and doesn't break anything else), and M70 is going to Beta any minute now :)  So I vote for merging immediately, maybe we can make the first Beta release.
Ok, I suppose from Gerrit I can do

menu > cherry-pick
(on the dialog) Branch: branch-heads/3538

?

(sorry, I am not sure I have done this before)

Yup, that's correct! Thanks!
Project Member

Comment 30 by bugdroid1@chromium.org, Sep 14

Labels: -merge-approved-70 merge-merged-3538
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/bb324165da2232905e108c07ef42e288db468fac

commit bb324165da2232905e108c07ef42e288db468fac
Author: Antonio Gomes <tonikitoo@igalia.com>
Date: Fri Sep 14 06:31:30 2018

Fix sync logic failing with repeated instance of "network connection unavailable"

This is a follow on of [1], where the sync::HttpBridge logic switched
away from URLFetcher to SimpleURLLoader, as part of the network
servicification effort.

[1] https://crrev.com/c/1174655/

It turns out that the way sync::HttpBridge interpreted the response
of an upload request changed after the migration:

By default, SimpleURLLoader does not pass a valid response body in case
of either HTTP or network errors. What is more, the new logic in
HttpBridge::OnURLLoadComplete relied precisely on null-checking the responsebody
to verify whether the upload was successful or not.

This CL changes this particular check logic so that it matches the
previous behavior: only actual network errors account for sync failures
(which map to "network connection unavailable" errors) whereas HTTP errors
(>= 400 && < 600) still account as a successful uploads.
401 specifically is handled to request expired user credentials in [2].

[2] https://cs.chromium.org/chromium/src/components/sync/engine_impl/net/sync_server_connection_manager.cc?l=84

BUG= 844968 , 878718 , 880633 

Change-Id: I9d0b93b527f103149a7c78ba405dde27cae779f2
Reviewed-on: https://chromium-review.googlesource.com/1217122
Commit-Queue: Antonio Gomes <tonikitoo@igalia.com>
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Marc Treib <treib@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#590363}(cherry picked from commit 05b9b976f63eb1f9fcca0e01c15a6eeeb719dd69)
Reviewed-on: https://chromium-review.googlesource.com/1225490
Reviewed-by: Antonio Gomes <tonikitoo@igalia.com>
Cr-Commit-Position: refs/branch-heads/3538@{#400}
Cr-Branched-From: 79f7c91a2b2a2932cd447fa6f865cb6662fa8fa6-refs/heads/master@{#587811}
[modify] https://crrev.com/bb324165da2232905e108c07ef42e288db468fac/components/sync/engine/net/http_bridge.cc
[modify] https://crrev.com/bb324165da2232905e108c07ef42e288db468fac/components/sync/engine/net/http_bridge_unittest.cc

Sign in to add a comment