New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 657195 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: 2016-11-28
OS: Linux , Windows , Chrome , Mac
Pri: 1
Type: Bug-Regression



Sign in to add a comment

OOPIF: 20 second ERR_CACHE_LOCK_TIMEOUT on extension popup double-click

Project Member Reported by creis@chromium.org, Oct 18 2016

Issue description

Version: 56.0.2894.0, 54.0.2840.59
OS: Windows

What steps will reproduce the problem?
(1) Start Chrome with --isolate-extensions.  (On by default in M55+)
(2) Install https://chrome.google.com/webstore/detail/thumbs-up/fheglbjjlikcmbifdhdmhfjogjofhakp
(3) Click the Thumbs Up browser action, then dismiss it.
(4) Double click the Thumbs Up browser action.  It starts to open and closes right away, as expected.
(5) Click the Thumbs Up browser action one more time.

What is the expected output?
The popup should appear.

What do you see instead?
The popup stays hidden for 20 seconds, then appears.

This is a followup to  issue 655114 , since nick@ noticed it was possible to get into this state with this extension (see comment 6 on that bug).
 

Comment 1 by creis@chromium.org, Oct 18 2016

Reposting Nick's comment 6 from  issue 655114 :

On my system, I was able to repro a consistent 20 second delay before the
Tweet button shows up. This didn't happen immediately, but after clicking
the button very frequently, and now it happens consistently in this window.

In net-internals events, I see an ERR_CACHE_LOCK_TIMEOUT that is happening
with what looks like a big jump in delta-t:

12987: URL_REQUEST
https://platform.twitter.com/widgets/tweet_button.39f7ee9fffbd122b7a37a520dbdaebc6.en.html#dnt=false&hashtags=music4chrome&id=twitter-widget-0&lang=en&original_referer=chrome-extension%3A%2F%2Ffheglbjjlikcmbifdhdmhfjogjofhakp%2Fhtml%2Fpopup.html&size=l&text=Check%20out%20these%20music%20extensions%20for%20Chrome!&time=1476743285204&type=share&url=https%3A%2F%2Fplus.google.com%2F108126838540149369329
Start Time: 2016-10-17 15:28:05.179

t= 4223 [st=    0] +REQUEST_ALIVE  [dt=20150]
t= 4223 [st=    0]    DELEGATE_INFO  [dt=0]
                      --> delegate_info = "NavigationResourceThrottle"
t= 4223 [st=    0]    URL_REQUEST_DELEGATE  [dt=0]
t= 4223 [st=    0]    URL_REQUEST_START_JOB  [dt=1]
                      --> load_flags = 256 (VERIFY_EV_CERT)
                      --> method = "GET"
                      --> priority = "HIGHEST"
                      --> url =
"https://platform.twitter.com/widgets/tweet_button.39f7ee9fffbd122b7a37a520dbdaebc6.en.html#dnt=false&hashtags=music4chrome&id=twitter-widget-0&lang=en&original_referer=chrome-extension%3A%2F%2Ffheglbjjlikcmbifdhdmhfjogjofhakp%2Fhtml%2Fpopup.html&size=l&text=Check%20out%20these%20music%20extensions%20for%20Chrome!&time=1476743285204&type=share&url=https%3A%2F%2Fplus.google.com%2F108126838540149369329"
t= 4224 [st=    1]   +URL_REQUEST_START_JOB  [dt=20061]
                      --> load_flags = 256 (VERIFY_EV_CERT)
                      --> method = "GET"
                      --> priority = "HIGHEST"
                      --> url =
"https://platform.twitter.com/widgets/tweet_button.39f7ee9fffbd122b7a37a520dbdaebc6.en.html#dnt=false&hashtags=music4chrome&id=twitter-widget-0&lang=en&original_referer=chrome-extension%3A%2F%2Ffheglbjjlikcmbifdhdmhfjogjofhakp%2Fhtml%2Fpopup.html&size=l&text=Check%20out%20these%20music%20extensions%20for%20Chrome!&time=1476743285204&type=share&url=https%3A%2F%2Fplus.google.com%2F108126838540149369329"
t= 4224 [st=    1]      URL_REQUEST_DELEGATE  [dt=0]
t= 4224 [st=    1]      HTTP_CACHE_GET_BACKEND  [dt=0]
t= 4224 [st=    1]      HTTP_CACHE_OPEN_ENTRY  [dt=0]
t= 4224 [st=    1]      HTTP_CACHE_ADD_TO_ENTRY  [dt=20000]
                        --> net_error = -409 (ERR_CACHE_LOCK_TIMEOUT)
t=24224 [st=20001]     +HTTP_STREAM_REQUEST  [dt=42]
t=24224 [st=20001]        HTTP_STREAM_REQUEST_STARTED_JOB
                          --> source_dependency = 12995 (HTTP_STREAM_JOB)
t=24266 [st=20043]        HTTP_STREAM_REQUEST_BOUND_TO_JOB
                          --> source_dependency = 12995 (HTTP_STREAM_JOB)
t=24266 [st=20043]     -HTTP_STREAM_REQUEST

Comment 2 by creis@chromium.org, Oct 18 2016

Cc: alex...@chromium.org
Note that with my repro steps, I'm getting the "net_error = -409 (ERR_CACHE_LOCK_TIMEOUT)" on the https://accounts.google.com/o/oauth2/postmessageRelay URL, not the Twitter URL as above.

I can repro reliably on Windows Canary/Stable, but not Linux tip-of-tree in a debug build...  alexmos@ said he could repro on Mac Canary.

Comment 3 by creis@chromium.org, Oct 18 2016

Cc: csharrison@chromium.org clamy@chromium.org
There seems to be a race involved.  I'm guessing that dismissing the OOPIF somehow leaves the transferring network request around in the network stack, blocking future navigations to it.  They give up after 20 seconds and try again.

(This is similar to going to https://expired.badssl.com/, getting an interstitial, then going to it again in a second tab.  The second tab will hang for 20 seconds before the network stack gives up on the cache lock and tries again.)

We'll probably just need to make sure that network requests from frames get cleaned up when the frame goes away.  Note that I don't think clamy@'s recent move of CrossSiteResourceHandler to NavigationResourceThrottle is to blame, since this bug repros on M54 as well as M56.

Comment 4 by creis@chromium.org, Oct 19 2016

Side note: see issue 6697 about the 20 second cache lock.
Cc: shivanisha@chromium.org
+shivanisha FYI who is working on the cache lock.

We already should have logic that should clean up requests when frames are killed, ResourceDispatcherHostImpl::OnRenderFrameDeleted.

This is from a PostTask from the UI thread WCO notification, RenderFrameDeleted. Does anyone see anything problematic here that could cause this race?
I see that ResourceDispatcherHostImpl::OnRenderFrameDeleted doesn't cancel blocked requests that are associated with a process transfer:

void ResourceDispatcherHostImpl::CancelRequestsForRoute(
    const GlobalFrameRoutingId& global_routing_id) {
...
  // Don't clear the blocked loaders or offline policy maps if any of the
  // requests in route_id are being transferred to a new process, since those
  // maps will be updated with the new route_id after the transfer.  Otherwise
  // we will lose track of this info when the old route goes away, before the
  // new one is created.
  if (any_requests_transferring)
    return;
...
    CancelBlockedRequestsForRoute(...);

Maybe we also need to call ResourceDispatcherHostImpl::CancelBlockedRequestsForFrameFromUI later on, when we realize the frame is not coming back after a transfer.
Yes that looks right... I'm not super familiar with the transfer case though.

It looks like this is probably a race between CancelRequestsForRoute and the new frame issuing the navigation request (which should be coalesced in RDHI afaict).
Well, actually I was wondering what happens when the new frame never materializes (i.e. if the whole page / frame tree is destroyed after the transfer has started, but before the new frame issued the navigation request to resume the transfer).

One thing I am not sure about is how blocked requests can be introduced.  I initially suspected that requests might become blocked because of ResourceDispatcherHostImpl::BlockRequestsForFrameFromUI being called, but I am not seeing this when debugging.  I am not sure how else a request/loader can get blocked.
Yep I agree, I think I spoke poorly. The race is really between the frame being destroyed and the new request going out. If the frame is destroyed first, the requests will have the old route and will not be killed.

The other place requests can be blocked is in creating a new window in RenderWidgetHelper::CreateNewWindow. Note that this code is vestigial and afaik if we wanted to we could remove the blocking logic here, but it has strange interactions with the webrequest api :/

Your proposal in #6 sounds good as long as there is a point in time where we know the old frame is gone and the new frame won't take the request.
I think that "blocked requests" in RDHI (i.e. what is tracked in blocked_loaders_map_) might be a red herring:

- I can repro a 20 seconds delay in a local build (*)

- I've added printf-style debugging to ResourceDispatcherHostImpl::BlockRequestsForFrameFromUI and ResourceDispatcherHostImpl::BlockRequestsForRoute

- Neither ResourceDispatcherHostImpl::BlockRequestsFor* method is hit (and I don't see any other way to *populate* RDHI::blocked_loaders_map_;  I am not counting RDHI::UpdateRequestForTransfer, because it only "moves" existing entries around, rather than "populating" new entries).

(*) The repro is intermittent :-(  FWIW, here is a config in which I did see a repro using the repro steps from the top of this bug (although I had to double-click multiple times, not just once...)
Linux, ToT (r426829), args.gn:
is_component_build = true
is_debug = false
use_goma = true

Components: Internals>Network>Cache
I am not sure what the next steps for this bug should be.  I was hoping that somebody more familiar with the lower levels of network stack (where the 20 seconds timeout is implemented) might be able to take a look here.

One big question here is - is there a relationship between the bug and the process swaps and/or --isolate-extensions mode?  Or is this just purely accidental that the bug happened to repro in an extension popup while running in --isolate-extensions mode.
Components: Internals>Network
Status: Untriaged (was: Available)
Hm, I am surprised this isn't a blocked loaders issue. I still think this could be a content/browser/loader issue but let's change some labels so it gets on the net triagers queue.
One thing that I thought might affect blocked loaders is:
https://cs.chromium.org/chromium/src/extensions/browser/app_window/app_window_contents.cc?rcl=1478881111&l=47

Which seems to have different behavior for oop extensions.

BTW I am getting a 404 for the linked extension that repros. 
I'll look at it from HttpCache::Transaction's perspective (the layer that implements the cache lock). I will probably be able to get to it on Monday, though. I am hoping to reproduce and debug to see why the 1st resource loading attempt did not complete/get cancelled at the HttpCache::Transaction layer. 

creis@ : In comment #1, I see the netlog for the second loading attempt. Do we have the net log for the 1st one as well?
I am getting a 404 for the link:
https://chrome.google.com/webstore/detail/thumbs-up/fheglbjjlikcmbifdhdmhfjogjofhakp
Anyone aware of another reproducer?
shivanisha@:  RE: #c15

I captured chrome://net-internals log that should cover the whole repro.  I see that request with ID=443 hits ERR_CACHE_LOCK_TIMEOUT.   This request is for the following URI: https://accounts.google.com/o/oauth2/postmessageRelay?parent=chrome-extension%3A%2F%2Ffheglbjjlikcmbifdhdmhfjogjofhakp&jsh=m%3B%2F_%2Fscs%2Fapps-static%2F_%2Fjs%2Fk%3Doz.gapi.en_US.YsRf0Y2ZnQE.O%2Fm%3D__features__%2Fam%3DkQ%2Frt%3Dj%2Fd%3D1%2Frs%3DAGLTcCN4mOm3l5hJcNkvV1vMAcrcngmL1g#rpctoken=737885601&forcesecure=1

I don't see other requests for the same URI as above.  OTOH, I note that the URI includes a '#' character in a weird place and that there are 2 other requests for similar URIs.  Request w/ ID=287 seems to have succeeded normally, but request w/ ID=398 seems to have spent more than 30 seconds in request delegate / NavigationResourceThrottle.

Please see the attached net-internals-log.json


net-internals-log.json
789 KB View Download
shivanisha@:  RE: #c16

Yes - it seems that the extension used for the repro got pulled from Chrome Web Store.  I've forwarded to you the zipped up extension that we had stashed from previous repro attempts.  (maybe I should just attach the extension to the bug?...)

If you do end up using the zipped up extension for a repro, then you need the following extra steps to prepare for the repro:

1. Download and unzip the attached extension.
   (note that I had to delete _metadata subdirectory to get step 2.b. to work).
2. Go to chrome://extensions and manually install the extension:
   2.a. Enable "Developer mode"
   2.b. "Load unpacked extension..." pointing at
        "...\fheglbjjlikcmbifdhdmhfjogjofhakp\3.2_0" directory



I am unable to repro without --isolate-extensions (and so far I was always able to consistently repro on Windows, 56.0.2919.0, with isolated extensions, using repro steps from the top of this bug).
Owner: shivanisha@chromium.org
Assigning to shivanisha for now since Shivani has been investigating this and is working on code in this space.
Status: Assigned (was: Untriaged)
Looking at the net logs posted in #17, requests with ID 398 and 443 are the ones that seem to be using the same HttpCache::ActiveEntry object (their URLs are same except for the rpctoken argument which doesn't get checked while looking up in the cache). The HttpCache::Transaction associated with request ID 398, being first, becomes ActiveEntry's writer and thus it needs to complete reading the response before any other request gets to the cached resource or times out after 20 secs.This HttpCache::transaction, however never completes or gets deleted. From the net log it shows it is stuck at the NavigationResourceThrottle step and thus the 2nd request waits for the timer (20 secs) to expire and then goes to the network. This is expected behavior of the HttpCache::Transaction layer. 
Does someone more familiar with content/browser/loader want to take a look at why the correct deletion logic is not happening for the 1st request?
I wonder if the http cache lock should be held *before* a ResourceThrottle okays (i.e. okays OR defers and resumes) a WillStartRequest notification.  Maybe postponing when the lock is taken would also mitigate this bug.

OTOH, the discussion about hypothetical causes and fixes would be much easier if we had a reliable, automated repro test.  Unfortunately this is not the case.  Some time ago I've tried the following browser test but it *didn't* repro (maybe I am doing something wrong here?):

content/browser/frame_host/navigation_handle_impl_browsertest.cc:
// Checks that everything is cleaned up even when the frame tree is destroyed
// shortly after a call to NavigationThrottle::WillStartRequest.
IN_PROC_BROWSER_TEST_F(NavigationHandleImplBrowserTest,
                       FrameTreeDestroyedShortlyAfterWillStartRequest) {
  GURL url(embedded_test_server()->GetURL("/title1.html"));


  {
    std::unique_ptr<Shell> other_shell(Shell::CreateNewWindow(
        shell()->web_contents()->GetBrowserContext(),
        GURL(), nullptr, gfx::Size()));


    TestNavigationManager navigation_manager(other_shell->web_contents(), url);
    other_shell->LoadURL(url);


    // Wait until |navigation_manager| defers (pauses) the URLRequest associated
    // with the navigation and immediately destroy |other_shell|.
    //ASSERT_TRUE(navigation_manager.WaitForRequestStart());
    ASSERT_TRUE(navigation_manager.WaitForResponse());
    other_shell.reset();


    navigation_manager.WaitForNavigationFinished();
  }


  // Start a URLRequest to the same URL.  This should succeed.
  EXPECT_TRUE(NavigateToURL(shell(), url));
  EXPECT_EQ(url, shell()->web_contents()->GetLastCommittedURL());
}

Re #23: The lock is held when URLRequest::Start() is invoked which, I assume, should be after ResourceThrottle okays a WillStartRequest notification.
Correct me if I am wrong.
clamy@ / csharrison@ - would you be able to take a look at this?  (based on #c22 it seems to be a problem somewhere near NavigationResourceThrottle and/or content/browser/loader).
Yeah sure thing let me take a look tomorrow. Can you post the extension to this issue (or email it to me)?
Owner: csharrison@chromium.org
Attached the extension.

Also assigning to csharrison@ as Charles is looking at it.
fheglbjjlikcmbifdhdmhfjogjofhakp.zip
57.8 KB Download
After reproing this I can confirm that the problem is that the requests prepare for transfer (thus making them unkillable), and the transfer never successfully completes.

I'm not super familiar with the transfer code, but it seems like we might want the UI thread code to post a task to the RDHI saying "please destroy all transfer navigations associated with frame X" 
Additionally, the navigation_handle is succesfully transferred to the dest RFH, but the dest RFH is deleted before it can successfully fire off a navigation request. Let me try to write a browser test.
I seem to have a fix that works manually.

In the RFH destructor, we may have a navigation_handle_ that "is_transferring_". If we do, just post a task to RDHI::CancelRequest with the nav handles request id.

Still thinking of the best way to write a browser test for this though...
Thank you for digging into this and pointing out that the repro depends on the navigation handle being destroyed while is_transferring_ is true.  Based on this and with a little bit of help from nasko@ I was able to repro the problem in a browser test that I've uploaded to https://crrev.com/2518513003 (first patchset with extra logs that helped me verify that things are really happening in the right order;  the 2nd patchset cleaned-up and hopefully ready to patch into your https://crrev.com/2508973006).  Logs from the first patchset clearly show 1) a 20+ seconds delay during the second navigation attempted by the test and 2) a DCHECK that points out that we have leaked a URLRequest:

...
[72496:72496:1118/121046:1376026322337:ERROR:site_per_process_browsertest.cc(8609)] Next test navigation...
...
[72496:72496:1118/121107:1376046899462:ERROR:site_per_process_browsertest.cc(8611)] Next test navigation... done.
...
[74984:75012:1118/122039:1376619287761:FATAL:url_request_context.cc(106)] Check failed: false. Leaked 1 URLRequest(s). First URL: http://baz.com:52830/title1.html.
#0 0x0000005b7031 __interceptor_backtrace
#1 0x7faf32b8afd3 base::debug::StackTrace::StackTrace()
#2 0x7faf32bfdfda logging::LogMessage::~LogMessage()
#3 0x7faf2fef95fb net::URLRequestContext::AssertNoURLRequests()
#4 0x7faf2fef91c4 net::URLRequestContext::~URLRequestContext()
#5 0x7faf2fef975e net::URLRequestContext::~URLRequestContext()
#6 0x000001725ee5 content::ShellURLRequestContextGetter::~ShellURLRequestContextGetter()
#7 0x0000017260de content::ShellURLRequestContextGetter::~ShellURLRequestContextGetter()
#8 0x7faf2ff02b41 net::URLRequestContextGetter::OnDestruct()

Amazing! Thanks so much. I've just fixed the try failures in my patch so I'll try to throw up a finished patch with your test added before EOD.
Project Member

Comment 33 by bugdroid1@chromium.org, Nov 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f2466b80b0cd892d31fdf69a397905eaacbfb3b5

commit f2466b80b0cd892d31fdf69a397905eaacbfb3b5
Author: csharrison <csharrison@chromium.org>
Date: Sat Nov 19 23:57:12 2016

Kill vestigial transfer requests

When requests are marked as transferring, we mark them as protected in
the ResourceDispatcherHostImpl, so when their RFH goes away we won't
lose the request.

However, there are cases where the transfer can fail, and the navigation
abandoned, while the request on the IO thread is still marked as
protected.

This patch catches that case by checking for transferring navigation
handles in ~RFH, and cancelling them there.

Browser test written by lukasza@chromium.org

BUG= 657195 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation

Review-Url: https://codereview.chromium.org/2508973006
Cr-Commit-Position: refs/heads/master@{#433430}

[modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/navigation_handle_impl.cc
[modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/navigator_impl.cc
[modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/render_frame_host_manager.cc
[modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/site_per_process_browsertest.cc

NextAction: 2016-11-28
Setting NextAction field to +1 week, so we are reminded (after some bake time on Canary+Dev) to evaluate whether to merge the fix above to the M56 branch.
Status: Fixed (was: Assigned)
Let's also mark as fixed.

Comment 36 by creis@chromium.org, Nov 28 2016

Labels: M-56
Thanks for landing the fix!  I've confirmed it seems to work on Windows Canary (57.0.2931.0).  Can you request a merge?
Labels: Merge-Request-56
Sure!

Comment 38 by dimu@chromium.org, Nov 28 2016

Labels: -Merge-Request-56 Merge-Approved-56 Hotlist-Merge-Approved
Your change meets the bar and is auto-approved for M56 (branch: 2924)
Project Member

Comment 39 by bugdroid1@chromium.org, Nov 29 2016

Labels: -merge-approved-56 merge-merged-2924
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459

commit 50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459
Author: csharrison <csharrison@chromium.org>
Date: Tue Nov 29 15:23:55 2016

Kill vestigial transfer requests

When requests are marked as transferring, we mark them as protected in
the ResourceDispatcherHostImpl, so when their RFH goes away we won't
lose the request.

However, there are cases where the transfer can fail, and the navigation
abandoned, while the request on the IO thread is still marked as
protected.

This patch catches that case by checking for transferring navigation
handles in ~RFH, and cancelling them there.

Browser test written by lukasza@chromium.org

BUG= 657195 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation

Review-Url: https://codereview.chromium.org/2508973006
Cr-Commit-Position: refs/heads/master@{#433430}
(cherry picked from commit f2466b80b0cd892d31fdf69a397905eaacbfb3b5)
TBR=creis@chromium.org,lukasza@chromium.org
NOTRY=true
NOPRESUBMIT=true

Review-Url: https://codereview.chromium.org/2541543002
Cr-Commit-Position: refs/branch-heads/2924@{#155}
Cr-Branched-From: 3a87aecc31cd1ffe751dd72c04e5a96a1fc8108a-refs/heads/master@{#433059}

[modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/navigation_handle_impl.cc
[modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/navigator_impl.cc
[modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/render_frame_host_manager.cc
[modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/site_per_process_browsertest.cc

Comment 40 by creis@chromium.org, Nov 29 2016

Thanks!

Sign in to add a comment