Memory leak of stale entries in PendingSiteProcessCountTracker |
|||||
Issue descriptionThe test started failing itermittently in https://ci.chromium.org/buildbot/chromium.fyi/Site%20Isolation%20Linux/23516 On my local machine the crash is pretty consistent (i.e. I get a crash 100% of times - unlike the intermittent/flaky crash on the bot). Cmdline to repro the crash locally: $ DISPLAY=:20 third_party/WebKit/Tools/Scripts/run-webkit-tests -t $GnDir -v --additional-drt-flag=--site-per-process --no-retry-failures --iterations=10 http/tests/security/upgrade-insecure-requests/iframe-upgrade.https.html I've initially bisected the crash down to r522537
,
Dec 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a52d7d2d1782297ae4db125ee5569624c501be0c commit a52d7d2d1782297ae4db125ee5569624c501be0c Author: Lukasz Anforowicz <lukasza@chromium.org> Date: Fri Dec 08 00:04:01 2017 Crash expectations for iframe-upgrade.https.html test. Bug: 793127 Change-Id: Ie86ce44db645d1aabaea51fe3839882c46590871 No-Try: true Reviewed-on: https://chromium-review.googlesource.com/815879 Reviewed-by: Alex Moshchuk <alexmos@chromium.org> Commit-Queue: Ćukasz Anforowicz <lukasza@chromium.org> Cr-Commit-Position: refs/heads/master@{#522627} [modify] https://crrev.com/a52d7d2d1782297ae4db125ee5569624c501be0c/third_party/WebKit/LayoutTests/FlagExpectations/site-per-process
,
Dec 8 2017
I can continue investigating tomorrow, but I'll reassign to clamy@ in case she can take a look in the meantime. This might be a bug with the pending SiteProcessCountTracker not properly adding/removing expected site URLs with upgrade-insecure-request. Here's what I found so far. The problem seems to happen for the second iframe in the test. It's a cross-site iframe to http://example.test:8443/security/resources/post-origin-to-parent.html which is being upgraded to https (per "upgrade-insecure-requests" in CSP). We first get a pending navigation to http://example.test:8443/security/resources/post-origin-to-parent.html, causing us to spin up a new process (ID=5 in my local repro). The destination SiteInstance in GetFrameHostForNavigation is http://example.test/. However, we then do an AddExpectedNavigationToSite() for site URL "https://example.test/" (note HTTPS and not HTTP) for that process. That seems to be a mismatch. Next, presumably after the upgrade, we get another GetFrameHostForNavigation for same FTN, to https://example.test:8443/security/resources/post-origin-to-parent.html. This also needs a new RFH and process in GetFrameHostForNavigation, and as part of that we unregister the old process with ID 5 from the global RPH map via this path: ERR: #1 0x7f73d96920f5 content::RenderProcessHostImpl::UnregisterHost()\n ERR: #2 0x7f73d969e791 content::RenderProcessHostImpl::Cleanup()\n ERR: #3 0x7f73d9699210 content::RenderProcessHostImpl::RemoveRoute()\n ERR: #4 0x7f73d96b0de5 content::RenderWidgetHostImpl::Destroy()\n ERR: #5 0x7f73d96b1c82 content::RenderWidgetHostImpl::ShutdownAndDestroyWidget()\n ERR: #6 0x7f73d96adb6b content::RenderViewHostImpl::ShutdownAndDestroy()\n ERR: #7 0x7f73d944e41d content::FrameTree::ReleaseRenderViewHostRef()\n ERR: #8 0x7f73d947b73e content::RenderFrameHostImpl::~RenderFrameHostImpl()\n ERR: #9 0x7f73d947c07e content::RenderFrameHostImpl::~RenderFrameHostImpl()\n ERR: #10 0x7f73d949e4ad content::RenderFrameHostManager::DiscardUnusedFrame()\n ERR: #11 0x7f73d949cc38 content::RenderFrameHostManager::CleanUpNavigation()\n ERR: #12 0x7f73d949ea3f content::RenderFrameHostManager::GetFrameHostForNavigation()\n ERR: #13 0x7f73d9473bee content::NavigationRequest::OnResponseStarted()\n ERR: #14 0x7f73d9572528 content::NavigationURLLoaderImpl::NotifyResponseStarted()\n But, notice that we never called the matching RemoveExpectedNavigationToSite for that process, to remove https://example.test/ from it. With my change in r522537, this subframe uses the REUSE_PENDING_OR_COMMITTED_SITE reuse policy, so when creating the new process for it, we look for a reusable process for site URL https://example.test/ using FindRenderProcessesForSite, and we're still able to find RPH with ID=5 in SiteProcessCountTracker's |map_|. But it no longer exists in the global RPH map, so FromID doesn't find it and we hit the NOTREACHED.
,
Dec 11 2017
Some more details after further investigation. Adding the post-upgrade site URL (https://example.test) for the subframe in #3 actually seems ok, as that's what the NavigationHandle site_url_ ends up being, since we create it after calling CSPContext::ShouldModifyRequestUrlForCsp. We do end up creating a speculative RFH in the wrong http SiteInstance, but we notice that and fix it when we look up the final RFH when we call GetFrameHostForNavigation from NavigationRequest::OnResponseStarted(). That's where the bug actually occurs: when GetFrameHostForNavigation needs to destroy the old speculative RFH (http SiteInstance) and creates a new speculative RFH (https SiteInstance) in its place, the destruction of the old speculative RFH might clean up its process (*). But the call to remove the corresponding expected site doesn't occur until later, when NavigationHandleImpl::ReadyToCommitNavigation calls SetExpectedProcess. This usually does the following: // If a RenderProcessHost was expecting this navigation to commit, have it // stop tracking this site. RenderProcessHost* old_process = RenderProcessHost::FromID(expected_render_process_host_id_); if (old_process) { RenderProcessHostImpl::RemoveExpectedNavigationToSite( frame_tree_node_->navigator()->GetController()->GetBrowserContext(), old_process, site_url_); } However, in this case, the process corresponding to expected_render_process_host_id_ has already been cleaned up as part of destroying the old (http) speculative RFH, so we never decrement/remove its ID value for the corresponding site in the PendingSiteProcessCountTracker's map. I don't think this results in any wrong behavior, since we'll never use a process that doesn't exist from FindRenderProcessesForSite(). However, I think this is a memory leak, as these stale entries will be around in the PendingSiteProcessCountTracker map indefinitely. I can't tell how bad the leak gets - these trackers are used unconditionally, but I don't know how frequently the speculative RFH changes in OnResponseStarted. In addition to the upgrade-insecure-requests case, I think that can also happen in some site isolation process reuse races (issues 780661, 738634 ). If the map grows really large, it'll eventually slow down the lookup as well. To fix, we might need to find a way to call RemoveExpectedNavigationToSite() sooner, or maybe we could still call it on the old process ID without trying to look it up. clamy@: would you have time to fix this, since you're most familiar with these paths from your r474395? If not, I can probably take a crack at it, though for the moment I need to focus on a few higher-priority bugs. (*) This is also what made this test flaky - one of the other subframes in the test sometimes loads quickly enough to grab onto the same process and prevent it from going away. If that happens, the process will be around for the subsequent RemoveExpectedNavigationToSite, and the bug won't manifest itself.
,
Mar 9 2018
,
Mar 9 2018
There is another layout test that hits this problem - I'll shuffle the test expectations in FlagExpectations/site-per-process to reflect that. crbug.com/793127 http/tests/security/upgrade-insecure-requests/iframe-upgrade.https.html [ Crash ] crbug.com/793127 external/wpt/content-security-policy/securitypolicyviolation/upgrade-insecure-requests-reporting.https.html [ Crash ]
,
Mar 9 2018
Updating the title based on #c4 |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by alex...@chromium.org
, Dec 7 2017Looks like it's hitting the following: [5403:5403:1207/140504.418093:FATAL:render_process_host_impl.cc(832)] Check failed: false. #0 0x00000304595c base::debug::StackTrace::StackTrace() #1 0x00000306404c logging::LogMessage::~LogMessage() #2 0x000002c201ac content::(anonymous namespace)::SiteProcessCountTracker::FindRenderProcessesForSite() #3 0x000002c1eef8 content::RenderProcessHostImpl::FindReusableProcessHostForSite() #4 0x000002c1e389 content::RenderProcessHostImpl::GetProcessHostForSiteInstance() #5 0x000002d21c09 content::SiteInstanceImpl::GetProcess() #6 0x000002a337cb content::RenderFrameHostManager::CreateSpeculativeRenderFrameHost() #7 0x000002a32e38 content::RenderFrameHostManager::GetFrameHostForNavigation() #8 0x000002a0974e content::NavigationRequest::OnResponseStarted() #9 0x000002b00aa8 content::NavigationURLLoaderImpl::NotifyResponseStarted() Seems to be a NOTREACHED in FindRenderProcessesForSite: if (!host) { // TODO(clamy): This shouldn't happen but we are getting reports from // the field that this is happening. We need to figure out why some // RenderProcessHosts are not taken out of the map when they're // destroyed. NOTREACHED(); continue; } I'll investigate further - maybe this is a lead on the issue clamy@ mentions in that comment. It seems like this function will still work if we ever hit this case on release builds, so should be ok to disable the test while we investigate.