Issue metadata
Sign in to add a comment
|
OOPIF: 20 second ERR_CACHE_LOCK_TIMEOUT on extension popup double-click |
||||||||||||||||||||||
Issue descriptionVersion: 56.0.2894.0, 54.0.2840.59 OS: Windows What steps will reproduce the problem? (1) Start Chrome with --isolate-extensions. (On by default in M55+) (2) Install https://chrome.google.com/webstore/detail/thumbs-up/fheglbjjlikcmbifdhdmhfjogjofhakp (3) Click the Thumbs Up browser action, then dismiss it. (4) Double click the Thumbs Up browser action. It starts to open and closes right away, as expected. (5) Click the Thumbs Up browser action one more time. What is the expected output? The popup should appear. What do you see instead? The popup stays hidden for 20 seconds, then appears. This is a followup to issue 655114 , since nick@ noticed it was possible to get into this state with this extension (see comment 6 on that bug).
,
Oct 18 2016
Note that with my repro steps, I'm getting the "net_error = -409 (ERR_CACHE_LOCK_TIMEOUT)" on the https://accounts.google.com/o/oauth2/postmessageRelay URL, not the Twitter URL as above. I can repro reliably on Windows Canary/Stable, but not Linux tip-of-tree in a debug build... alexmos@ said he could repro on Mac Canary.
,
Oct 18 2016
There seems to be a race involved. I'm guessing that dismissing the OOPIF somehow leaves the transferring network request around in the network stack, blocking future navigations to it. They give up after 20 seconds and try again. (This is similar to going to https://expired.badssl.com/, getting an interstitial, then going to it again in a second tab. The second tab will hang for 20 seconds before the network stack gives up on the cache lock and tries again.) We'll probably just need to make sure that network requests from frames get cleaned up when the frame goes away. Note that I don't think clamy@'s recent move of CrossSiteResourceHandler to NavigationResourceThrottle is to blame, since this bug repros on M54 as well as M56.
,
Oct 19 2016
Side note: see issue 6697 about the 20 second cache lock.
,
Oct 19 2016
+shivanisha FYI who is working on the cache lock. We already should have logic that should clean up requests when frames are killed, ResourceDispatcherHostImpl::OnRenderFrameDeleted. This is from a PostTask from the UI thread WCO notification, RenderFrameDeleted. Does anyone see anything problematic here that could cause this race?
,
Oct 20 2016
I see that ResourceDispatcherHostImpl::OnRenderFrameDeleted doesn't cancel blocked requests that are associated with a process transfer:
void ResourceDispatcherHostImpl::CancelRequestsForRoute(
const GlobalFrameRoutingId& global_routing_id) {
...
// Don't clear the blocked loaders or offline policy maps if any of the
// requests in route_id are being transferred to a new process, since those
// maps will be updated with the new route_id after the transfer. Otherwise
// we will lose track of this info when the old route goes away, before the
// new one is created.
if (any_requests_transferring)
return;
...
CancelBlockedRequestsForRoute(...);
Maybe we also need to call ResourceDispatcherHostImpl::CancelBlockedRequestsForFrameFromUI later on, when we realize the frame is not coming back after a transfer.
,
Oct 20 2016
Yes that looks right... I'm not super familiar with the transfer case though. It looks like this is probably a race between CancelRequestsForRoute and the new frame issuing the navigation request (which should be coalesced in RDHI afaict).
,
Oct 22 2016
Well, actually I was wondering what happens when the new frame never materializes (i.e. if the whole page / frame tree is destroyed after the transfer has started, but before the new frame issued the navigation request to resume the transfer). One thing I am not sure about is how blocked requests can be introduced. I initially suspected that requests might become blocked because of ResourceDispatcherHostImpl::BlockRequestsForFrameFromUI being called, but I am not seeing this when debugging. I am not sure how else a request/loader can get blocked.
,
Oct 22 2016
Yep I agree, I think I spoke poorly. The race is really between the frame being destroyed and the new request going out. If the frame is destroyed first, the requests will have the old route and will not be killed. The other place requests can be blocked is in creating a new window in RenderWidgetHelper::CreateNewWindow. Note that this code is vestigial and afaik if we wanted to we could remove the blocking logic here, but it has strange interactions with the webrequest api :/ Your proposal in #6 sounds good as long as there is a point in time where we know the old frame is gone and the new frame won't take the request.
,
Oct 24 2016
I think that "blocked requests" in RDHI (i.e. what is tracked in blocked_loaders_map_) might be a red herring: - I can repro a 20 seconds delay in a local build (*) - I've added printf-style debugging to ResourceDispatcherHostImpl::BlockRequestsForFrameFromUI and ResourceDispatcherHostImpl::BlockRequestsForRoute - Neither ResourceDispatcherHostImpl::BlockRequestsFor* method is hit (and I don't see any other way to *populate* RDHI::blocked_loaders_map_; I am not counting RDHI::UpdateRequestForTransfer, because it only "moves" existing entries around, rather than "populating" new entries). (*) The repro is intermittent :-( FWIW, here is a config in which I did see a repro using the repro steps from the top of this bug (although I had to double-click multiple times, not just once...) Linux, ToT (r426829), args.gn: is_component_build = true is_debug = false use_goma = true
,
Oct 24 2016
,
Nov 11 2016
I am not sure what the next steps for this bug should be. I was hoping that somebody more familiar with the lower levels of network stack (where the 20 seconds timeout is implemented) might be able to take a look here. One big question here is - is there a relationship between the bug and the process swaps and/or --isolate-extensions mode? Or is this just purely accidental that the bug happened to repro in an extension popup while running in --isolate-extensions mode.
,
Nov 12 2016
Hm, I am surprised this isn't a blocked loaders issue. I still think this could be a content/browser/loader issue but let's change some labels so it gets on the net triagers queue.
,
Nov 12 2016
One thing that I thought might affect blocked loaders is: https://cs.chromium.org/chromium/src/extensions/browser/app_window/app_window_contents.cc?rcl=1478881111&l=47 Which seems to have different behavior for oop extensions. BTW I am getting a 404 for the linked extension that repros.
,
Nov 12 2016
I'll look at it from HttpCache::Transaction's perspective (the layer that implements the cache lock). I will probably be able to get to it on Monday, though. I am hoping to reproduce and debug to see why the 1st resource loading attempt did not complete/get cancelled at the HttpCache::Transaction layer. creis@ : In comment #1, I see the netlog for the second loading attempt. Do we have the net log for the 1st one as well?
,
Nov 12 2016
I am getting a 404 for the link: https://chrome.google.com/webstore/detail/thumbs-up/fheglbjjlikcmbifdhdmhfjogjofhakp Anyone aware of another reproducer?
,
Nov 14 2016
shivanisha@: RE: #c15 I captured chrome://net-internals log that should cover the whole repro. I see that request with ID=443 hits ERR_CACHE_LOCK_TIMEOUT. This request is for the following URI: https://accounts.google.com/o/oauth2/postmessageRelay?parent=chrome-extension%3A%2F%2Ffheglbjjlikcmbifdhdmhfjogjofhakp&jsh=m%3B%2F_%2Fscs%2Fapps-static%2F_%2Fjs%2Fk%3Doz.gapi.en_US.YsRf0Y2ZnQE.O%2Fm%3D__features__%2Fam%3DkQ%2Frt%3Dj%2Fd%3D1%2Frs%3DAGLTcCN4mOm3l5hJcNkvV1vMAcrcngmL1g#rpctoken=737885601&forcesecure=1 I don't see other requests for the same URI as above. OTOH, I note that the URI includes a '#' character in a weird place and that there are 2 other requests for similar URIs. Request w/ ID=287 seems to have succeeded normally, but request w/ ID=398 seems to have spent more than 30 seconds in request delegate / NavigationResourceThrottle. Please see the attached net-internals-log.json
,
Nov 14 2016
shivanisha@: RE: #c16
Yes - it seems that the extension used for the repro got pulled from Chrome Web Store. I've forwarded to you the zipped up extension that we had stashed from previous repro attempts. (maybe I should just attach the extension to the bug?...)
If you do end up using the zipped up extension for a repro, then you need the following extra steps to prepare for the repro:
1. Download and unzip the attached extension.
(note that I had to delete _metadata subdirectory to get step 2.b. to work).
2. Go to chrome://extensions and manually install the extension:
2.a. Enable "Developer mode"
2.b. "Load unpacked extension..." pointing at
"...\fheglbjjlikcmbifdhdmhfjogjofhakp\3.2_0" directory
,
Nov 14 2016
I am unable to repro without --isolate-extensions (and so far I was always able to consistently repro on Windows, 56.0.2919.0, with isolated extensions, using repro steps from the top of this bug).
,
Nov 15 2016
Assigning to shivanisha for now since Shivani has been investigating this and is working on code in this space.
,
Nov 15 2016
,
Nov 16 2016
Looking at the net logs posted in #17, requests with ID 398 and 443 are the ones that seem to be using the same HttpCache::ActiveEntry object (their URLs are same except for the rpctoken argument which doesn't get checked while looking up in the cache). The HttpCache::Transaction associated with request ID 398, being first, becomes ActiveEntry's writer and thus it needs to complete reading the response before any other request gets to the cached resource or times out after 20 secs.This HttpCache::transaction, however never completes or gets deleted. From the net log it shows it is stuck at the NavigationResourceThrottle step and thus the 2nd request waits for the timer (20 secs) to expire and then goes to the network. This is expected behavior of the HttpCache::Transaction layer. Does someone more familiar with content/browser/loader want to take a look at why the correct deletion logic is not happening for the 1st request?
,
Nov 16 2016
I wonder if the http cache lock should be held *before* a ResourceThrottle okays (i.e. okays OR defers and resumes) a WillStartRequest notification. Maybe postponing when the lock is taken would also mitigate this bug.
OTOH, the discussion about hypothetical causes and fixes would be much easier if we had a reliable, automated repro test. Unfortunately this is not the case. Some time ago I've tried the following browser test but it *didn't* repro (maybe I am doing something wrong here?):
content/browser/frame_host/navigation_handle_impl_browsertest.cc:
// Checks that everything is cleaned up even when the frame tree is destroyed
// shortly after a call to NavigationThrottle::WillStartRequest.
IN_PROC_BROWSER_TEST_F(NavigationHandleImplBrowserTest,
FrameTreeDestroyedShortlyAfterWillStartRequest) {
GURL url(embedded_test_server()->GetURL("/title1.html"));
{
std::unique_ptr<Shell> other_shell(Shell::CreateNewWindow(
shell()->web_contents()->GetBrowserContext(),
GURL(), nullptr, gfx::Size()));
TestNavigationManager navigation_manager(other_shell->web_contents(), url);
other_shell->LoadURL(url);
// Wait until |navigation_manager| defers (pauses) the URLRequest associated
// with the navigation and immediately destroy |other_shell|.
//ASSERT_TRUE(navigation_manager.WaitForRequestStart());
ASSERT_TRUE(navigation_manager.WaitForResponse());
other_shell.reset();
navigation_manager.WaitForNavigationFinished();
}
// Start a URLRequest to the same URL. This should succeed.
EXPECT_TRUE(NavigateToURL(shell(), url));
EXPECT_EQ(url, shell()->web_contents()->GetLastCommittedURL());
}
,
Nov 16 2016
Re #23: The lock is held when URLRequest::Start() is invoked which, I assume, should be after ResourceThrottle okays a WillStartRequest notification. Correct me if I am wrong.
,
Nov 17 2016
clamy@ / csharrison@ - would you be able to take a look at this? (based on #c22 it seems to be a problem somewhere near NavigationResourceThrottle and/or content/browser/loader).
,
Nov 17 2016
Yeah sure thing let me take a look tomorrow. Can you post the extension to this issue (or email it to me)?
,
Nov 17 2016
Attached the extension. Also assigning to csharrison@ as Charles is looking at it.
,
Nov 17 2016
After reproing this I can confirm that the problem is that the requests prepare for transfer (thus making them unkillable), and the transfer never successfully completes. I'm not super familiar with the transfer code, but it seems like we might want the UI thread code to post a task to the RDHI saying "please destroy all transfer navigations associated with frame X"
,
Nov 17 2016
Additionally, the navigation_handle is succesfully transferred to the dest RFH, but the dest RFH is deleted before it can successfully fire off a navigation request. Let me try to write a browser test.
,
Nov 17 2016
I seem to have a fix that works manually. In the RFH destructor, we may have a navigation_handle_ that "is_transferring_". If we do, just post a task to RDHI::CancelRequest with the nav handles request id. Still thinking of the best way to write a browser test for this though...
,
Nov 18 2016
Thank you for digging into this and pointing out that the repro depends on the navigation handle being destroyed while is_transferring_ is true. Based on this and with a little bit of help from nasko@ I was able to repro the problem in a browser test that I've uploaded to https://crrev.com/2518513003 (first patchset with extra logs that helped me verify that things are really happening in the right order; the 2nd patchset cleaned-up and hopefully ready to patch into your https://crrev.com/2508973006). Logs from the first patchset clearly show 1) a 20+ seconds delay during the second navigation attempted by the test and 2) a DCHECK that points out that we have leaked a URLRequest: ... [72496:72496:1118/121046:1376026322337:ERROR:site_per_process_browsertest.cc(8609)] Next test navigation... ... [72496:72496:1118/121107:1376046899462:ERROR:site_per_process_browsertest.cc(8611)] Next test navigation... done. ... [74984:75012:1118/122039:1376619287761:FATAL:url_request_context.cc(106)] Check failed: false. Leaked 1 URLRequest(s). First URL: http://baz.com:52830/title1.html. #0 0x0000005b7031 __interceptor_backtrace #1 0x7faf32b8afd3 base::debug::StackTrace::StackTrace() #2 0x7faf32bfdfda logging::LogMessage::~LogMessage() #3 0x7faf2fef95fb net::URLRequestContext::AssertNoURLRequests() #4 0x7faf2fef91c4 net::URLRequestContext::~URLRequestContext() #5 0x7faf2fef975e net::URLRequestContext::~URLRequestContext() #6 0x000001725ee5 content::ShellURLRequestContextGetter::~ShellURLRequestContextGetter() #7 0x0000017260de content::ShellURLRequestContextGetter::~ShellURLRequestContextGetter() #8 0x7faf2ff02b41 net::URLRequestContextGetter::OnDestruct()
,
Nov 18 2016
Amazing! Thanks so much. I've just fixed the try failures in my patch so I'll try to throw up a finished patch with your test added before EOD.
,
Nov 20 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f2466b80b0cd892d31fdf69a397905eaacbfb3b5 commit f2466b80b0cd892d31fdf69a397905eaacbfb3b5 Author: csharrison <csharrison@chromium.org> Date: Sat Nov 19 23:57:12 2016 Kill vestigial transfer requests When requests are marked as transferring, we mark them as protected in the ResourceDispatcherHostImpl, so when their RFH goes away we won't lose the request. However, there are cases where the transfer can fail, and the navigation abandoned, while the request on the IO thread is still marked as protected. This patch catches that case by checking for transferring navigation handles in ~RFH, and cancelling them there. Browser test written by lukasza@chromium.org BUG= 657195 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation Review-Url: https://codereview.chromium.org/2508973006 Cr-Commit-Position: refs/heads/master@{#433430} [modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/navigation_handle_impl.cc [modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/navigator_impl.cc [modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/frame_host/render_frame_host_manager.cc [modify] https://crrev.com/f2466b80b0cd892d31fdf69a397905eaacbfb3b5/content/browser/site_per_process_browsertest.cc
,
Nov 21 2016
Setting NextAction field to +1 week, so we are reminded (after some bake time on Canary+Dev) to evaluate whether to merge the fix above to the M56 branch.
,
Nov 28 2016
Let's also mark as fixed.
,
Nov 28 2016
Thanks for landing the fix! I've confirmed it seems to work on Windows Canary (57.0.2931.0). Can you request a merge?
,
Nov 28 2016
Sure!
,
Nov 28 2016
Your change meets the bar and is auto-approved for M56 (branch: 2924)
,
Nov 29 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459 commit 50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459 Author: csharrison <csharrison@chromium.org> Date: Tue Nov 29 15:23:55 2016 Kill vestigial transfer requests When requests are marked as transferring, we mark them as protected in the ResourceDispatcherHostImpl, so when their RFH goes away we won't lose the request. However, there are cases where the transfer can fail, and the navigation abandoned, while the request on the IO thread is still marked as protected. This patch catches that case by checking for transferring navigation handles in ~RFH, and cancelling them there. Browser test written by lukasza@chromium.org BUG= 657195 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation Review-Url: https://codereview.chromium.org/2508973006 Cr-Commit-Position: refs/heads/master@{#433430} (cherry picked from commit f2466b80b0cd892d31fdf69a397905eaacbfb3b5) TBR=creis@chromium.org,lukasza@chromium.org NOTRY=true NOPRESUBMIT=true Review-Url: https://codereview.chromium.org/2541543002 Cr-Commit-Position: refs/branch-heads/2924@{#155} Cr-Branched-From: 3a87aecc31cd1ffe751dd72c04e5a96a1fc8108a-refs/heads/master@{#433059} [modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/navigation_handle_impl.cc [modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/navigator_impl.cc [modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/frame_host/render_frame_host_manager.cc [modify] https://crrev.com/50c9ad5f38b13fd6cce7081cc2e00d8f0acf6459/content/browser/site_per_process_browsertest.cc
,
Nov 29 2016
Thanks! |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by creis@chromium.org
, Oct 18 2016