New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 820996 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac
Pri: 1
Type: Bug

Blocking:
issue 598073
issue 879232



Sign in to add a comment

Ensure that network process is shut down gracefully

Project Member Reported by jam@chromium.org, Mar 12 2018

Issue description

We need to make sure that the network process shuts down gracefully, i.e. all the network contexts are destructed and the disk I/O is all flushed.

 

Comment 1 by mmenke@chromium.org, Mar 12 2018

Is there any work to do here?  The NetworkService itself tears down all NetworkContext, and disk IO is done on threads that should block shutdown.  Assume the network process is neither crashing not being killed, I think everything works?

Comment 2 by jam@chromium.org, Mar 19 2018

oops sorry just saw your comment.

I ran into race conditions because of the different pipes, where the NnetworkContexts are not torn down but the network process is.

Comment 3 by mmenke@chromium.org, Mar 20 2018

Thanks for the additional context!  I had been thinking we kept the process alive until all network contexts were destroyed.
Project Member

Comment 4 by bugdroid1@chromium.org, Mar 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cc0b2a64386480ae1b7545b8cf49efb66ee89c93

commit cc0b2a64386480ae1b7545b8cf49efb66ee89c93
Author: John Abd-El-Malek <jam@chromium.org>
Date: Tue Mar 20 19:32:50 2018

Fix webview tag's using an on-disk directory with the network service enabled.

To fix the flakiness in the tests that span restarts, explicit calls are added to ensure the cookie
store is flushed.

Bug:  769401 ,  820996 
Change-Id: I8b02a0aa3882420a6055292f75730c607ce143b0
Reviewed-on: https://chromium-review.googlesource.com/967058
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Nasko Oskov <nasko@chromium.org>
Cr-Commit-Position: refs/heads/master@{#544488}
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/apps/guest_view/web_view_browsertest.cc
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/chrome_content_browser_client.cc
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/net/profile_network_context_service.cc
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/net/profile_network_context_service.h
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/sessions/better_session_restore_browsertest.cc
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/content/public/test/browser_test_utils.cc
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/content/public/test/browser_test_utils.h
[modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter

Comment 5 by jam@chromium.org, Mar 20 2018

Cc: roc...@chromium.org
@Matt: it took me a bit of time to track down all the ways the process or NetworkContext is shut down. Here it is for posterity. The summary is that NetworkContexts aren't always destructed.

1) ProfileNetworkContextService is torn down in the browser, and NetworkContext::OnConnectionError notices the pipe error in the utility process.
In this case, the NetworkContext closes its URLRequestContext which flushes data to disk (*racy, see 3).


2) The service instance in the utility process gets a pipe error due to this call in the browser:
>	content.dll!content::ServiceManagerContext::InProcessServiceManagerContext::ShutDown() Line 351	C++
 	content.dll!content::ServiceManagerContext::~ServiceManagerContext() Line 662	C++
 	content.dll!std::default_delete<content::ServiceManagerContext>::operator()(content::ServiceManagerContext * _Ptr) Line 1999	C++
 	content.dll!std::unique_ptr<content::ServiceManagerContext,std::default_delete<content::ServiceManagerContext> >::reset(content::ServiceManagerContext * _Ptr) Line 2238	C++
 	content.dll!content::BrowserMainLoop::ShutdownThreadsAndCleanUp() Line 1149	C++
 	content.dll!content::BrowserMainRunnerImpl::Shutdown() Line 223	C++
 	content.dll!content::BrowserMain(const content::MainFunctionParams & parameters) Line 50	C++

On the utility process side, the pipe error is seen and eventually UtilityServiceFactory::OnServiceQuit() is reached which will quit the process.

Note in this case network::NetworkService isn't destructed which means NetworkContext's destructors aren't called.


3) The browser process kills all child process at the end of shutdown, in BrowserProcessSubThread::IOThreadCleanUp(). See the BrowserChildProcessHostImpl::TerminateAll() call.



Things we can change:
i) we could make a new mojo call from the browser to the network process as soon as shutdown starts, to improve the chance that the NetworkContexts are torn down. e.g. https://chromium-review.googlesource.com/c/chromium/src/+/967058/3
  -however this isn't foolproof, as there are no guarantees that this will run and finish execution before BrowserChildProcessHostImpl::TerminateAll kills the process

ii) similar to i) but even more racy, since it would happen later, we could have UtilityServiceFactory::OnServiceQuit() post a task to the IO thread and destruct NetworkService (through a weakptr that it initialized in UtilityServiceFactory::CreateNetworkService()). To increase the chance that this has time to run we'd probably want to change the browser to shutdown the ServiceManager earlier so that ServiceManagerContext::~ServiceManagerContext() runs earlier.

iii) we could have the browser wait on utility process shutdown (just the one hosting network service), for up to 1s. In practice this would probably work most of the time, but on slow computers or at system shutdown it would be ineffective. It also slows down shutdown, and isn't guaranteed (since there could be state-changing mojo calls in the OS pipes).


Ken points out that in all these workarounds, we still don't have guarantees that the data will be processed and flushed. Since the majority of users are on Android, where we never have clean shutdown anyways, perhaps we don't need to solve this on desktop.

Comment 6 by mmenke@chromium.org, Mar 20 2018

With unclean shutdown, we will potentially corrupt the cache and/or cookie stores.  While both can recover from unclean states generally, I'd rather we have some guarantee that on clean shutdowns, we cleanly flush them.  Do we have to kill the network process on shutdown, rather than gracefully destroy the network contexts, and shut down the worker pool?  We may still want timeouts of some sort, but just killing processes that store on-disk data in the normal case seems less than optimal.

Comment 7 by mmenke@chromium.org, Mar 20 2018

Also, if we really want to ignore TaskShutdownBehavior::BLOCK_SHUTDOWN in child processes, perhaps we should DCHECK on it everywhere but in the main process?  If we ignore it in the main process, too, we should remove it entirely.

Comment 8 by jam@chromium.org, Mar 20 2018

The browser process always kills child processes at shutdown. That'll have to remain because we never want any child processes dangling. If we want to ensure that data stores in the network process are closed gracefully, that's possible by sending it an IPC and waiting for n seconds. It'll just have to be balanced with slowing down shutdown. I don't feel strongly either way. We can also add a timeout and gather uma, so once we're on canary we get real world data on how much it slows down shutdown.

The thing I'm most curious about is what about Android: all of our shutdowns there are unclean, and I assume the browser there shuts down much more frequently than on desktop. If it's ok on Android, curious why it wouldn't be on desktop?

Comment 9 by mmenke@chromium.org, Mar 20 2018

It's OK on Android for the same reason not using Blink is ok on iOS - we don't have a choice in the matter, I assume.

Comment 10 by jam@chromium.org, Mar 20 2018

That's reasonable. I guess this separates shutdown into 2 sub-problems:
1) not all data might be flushed: although this is true today as well, since there could be in-flight messages from renderers.
2) trying to always close the cache & cookie stores, within some reasonable timeout

You mentioned worker pool: how does the URLRequestContext's stores pick which threads to do disk I/O on?

Most things use TaskScheduler now, and use block_shutdown for disk writes.  I am not sure about the two disk cache implementations - I think one uses its own (leaked?) worker pool for individual files, and (possibly) a block_shutdown thread for the cache index itself.  I'm much less concerned about performance improving data (the cache, server metadata) and partially received data (Like files being downloaded, though those are written to in the browser process, anyways) than more user-visible state (cookie store, channel ID store, and the like).

Comment 12 by dxie@chromium.org, May 15 2018

Labels: -Pri-2 Proj-Servicification-Canary OS-All Pri-1

Comment 13 by dxie@chromium.org, May 18 2018

Labels: -OS-All OS-Windows OS-Linux OS-Mac OS-Chrome OS-Android
Project Member

Comment 14 by bugdroid1@chromium.org, May 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c1273ef38167e73d5bdb38c73aca7d67fd4632cd

commit c1273ef38167e73d5bdb38c73aca7d67fd4632cd
Author: Matt Menke <mmenke@chromium.org>
Date: Sat May 19 00:45:21 2018

Selectively re-enable NetworkContextConfigurationBrowserTest.Hsts.

What I believe is happening is that, when using the network service,
we're killing the network service before it can write to disk,
resulting in settings not being saved before restarting the browser
and trying to load the saved state.  Unfortunately, the browser
process kills subprocesses on shutdown, rather than letting them run
their BLOCK_SHUTDOWN tasks.

This CL disables the test whenever there's an on-disk cache and
the network service is enabled.

Bug:  837776 ,  820996 ,  842088 
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I80549906d3408b1a0c651475075959b83fb0dc03
Reviewed-on: https://chromium-review.googlesource.com/1054525
Commit-Queue: Matt Menke <mmenke@chromium.org>
Reviewed-by: Emily Stark <estark@chromium.org>
Cr-Commit-Position: refs/heads/master@{#560123}
[modify] https://crrev.com/c1273ef38167e73d5bdb38c73aca7d67fd4632cd/chrome/browser/net/network_context_configuration_browsertest.cc
[modify] https://crrev.com/c1273ef38167e73d5bdb38c73aca7d67fd4632cd/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter

Project Member

Comment 15 by bugdroid1@chromium.org, May 21 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5

commit c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5
Author: John Abd-El-Malek <jam@chromium.org>
Date: Mon May 21 22:00:38 2018

Wait on the network process at shutdown.

This gives the child process time to flush I/O. This can help avoid the OS restarting in the middle
of flushing data. It's also needed because swarming jobs fail when the browser quits with child
processes still alive.

An example flake is
https://chromium-swarm.appspot.com/task?id=3d8cd8e056a4b310&refresh=10&show_raw=1. Roughly half of
the layout test runs on Win Mojo are flaking as a result.

Bug:  820996 
Change-Id: I9c3b13f83c716f020f555cbe2aa14e20c8825ed3
Reviewed-on: https://chromium-review.googlesource.com/1066622
Reviewed-by: Steven Holte <holte@chromium.org>
Reviewed-by: Ken Rockot <rockot@chromium.org>
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Cr-Commit-Position: refs/heads/master@{#560368}
[modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/browser/browser_process_sub_thread.cc
[modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/browser/utility_process_host.h
[modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/common/child_process_host_impl.h
[modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/tools/metrics/histograms/histograms.xml

Comment 16 by jam@chromium.org, May 22 2018

Owner: jam@chromium.org
Status: Started (was: Available)
Project Member

Comment 17 by bugdroid1@chromium.org, May 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9edd464609b0353046c0f11a3594329e7ba958a7

commit 9edd464609b0353046c0f11a3594329e7ba958a7
Author: John Abd-El-Malek <jam@chromium.org>
Date: Wed May 23 04:24:28 2018

Fix waiting for network process at browser shutdown.

The previous fix in r560368 didn't work, as the BrowserChildProcessHostImpl::TerminateAll removed all the UtilityProcessHost objects first.

Bug:  820996 ,  820969 
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I88ba647e911c8f7e8d9458b60021aa6d915ccfa0
Reviewed-on: https://chromium-review.googlesource.com/1069668
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Ken Rockot <rockot@chromium.org>
Cr-Commit-Position: refs/heads/master@{#560938}
[modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/base/threading/thread_restrictions.h
[modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/content/browser/browser_process_sub_thread.cc
[modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter

Project Member

Comment 18 by bugdroid1@chromium.org, May 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1efb28fb17a305f4358afd55f964f63429a521a8

commit 1efb28fb17a305f4358afd55f964f63429a521a8
Author: John Abd-El-Malek <jam@chromium.org>
Date: Fri May 25 23:27:39 2018

Launch unsandboxed network service in a job object to avoid the process outliving the browser.

Currently many layout test runs are flaking because swarming is not expecting child processes to outlive the browser. This is happening because TerminateProcess fails sometimes because there's pending I/O.

Bug:  820996 
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Change-Id: I7595c131faccae03c268cde1139394c0ad48a7b4
Reviewed-on: https://chromium-review.googlesource.com/1071100
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Penny MacNeil <pennymac@chromium.org>
Reviewed-by: Tom Sepez <tsepez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#562058}
[modify] https://crrev.com/1efb28fb17a305f4358afd55f964f63429a521a8/content/browser/browser_process_sub_thread.cc
[modify] https://crrev.com/1efb28fb17a305f4358afd55f964f63429a521a8/services/service_manager/sandbox/win/sandbox_win.cc

Project Member

Comment 19 by bugdroid1@chromium.org, Jun 3 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cac74ff72c0eaa1a47338d1ba44990cc74e6b589

commit cac74ff72c0eaa1a47338d1ba44990cc74e6b589
Author: John Abd-El-Malek <jam@chromium.org>
Date: Sun Jun 03 17:13:53 2018

Fix flake in layout tests when running with network service.

The problem was that there were alive processes when tests finished, and swarming considered that a failure (e.g. [1]).

The cause was that the processes were using a job object (to fix other races), and the code in base launched processes suspended if they were in a job object and later resumed them. The race was that layout test script kills the browser when there are any test failures. It did this using "taskkill.exe /t" (recursive). Windows would delete the network process, and as it was killing the browser it would sometimes do that in between the small window where the relaunched network process was created suspended and it was resumed. Child processes have logic to kill themselves if the parent is gone, but since the child process was suspended that didn't run.

The suspension when a process is in a job object was added in r100027 [2]. However the code that was moved there (test server) didn't have this originally. Also, the documentation that describes that this is needed [3] is for functionality that ended in Windows XP [4].

Note that a few weeks ago relaunching started using this as well in [5]. I'm removing created-as-suspended for that code as well, since that seems rare and keeping code in base/ that has this race condition that leaves processes around seems much worse than the alternative race of windows sometimes not being in foreground.

[1] https://chromium-swarm.appspot.com/task?id=3dd10a30220aa810&refresh=10&show_raw=1
[2] https://codereview.chromium.org/7789018
[3] https://msdn.microsoft.com/en-us/library/windows/desktop/ms681949(v=vs.85).aspx
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms684159(v=vs.85).aspx
[5] https://chromium-review.googlesource.com/c/chromium/src/+/1064510

Bug:  820996 
Change-Id: Ie79118efffac347bf3df1d65e1124fba095acdc0
Reviewed-on: https://chromium-review.googlesource.com/1081559
Reviewed-by: Gabriel Charette <gab@chromium.org>
Reviewed-by: Greg Thompson <grt@chromium.org>
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Cr-Commit-Position: refs/heads/master@{#563967}
[modify] https://crrev.com/cac74ff72c0eaa1a47338d1ba44990cc74e6b589/base/process/launch_win.cc

Project Member

Comment 20 by bugdroid1@chromium.org, Jun 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3c39807459594a766f9f66fa780cacc157e2e82c

commit 3c39807459594a766f9f66fa780cacc157e2e82c
Author: John Abd-El-Malek <jam@chromium.org>
Date: Tue Jun 05 15:27:34 2018

Add a DCHECK to ensure CREATE_SUSPENDED isn't used as a followup to r563967.

Bug:  820996 
Change-Id: I590744ed2b08bf0e5aafcaea737512dc369cf212
Reviewed-on: https://chromium-review.googlesource.com/1086592
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#564510}
[modify] https://crrev.com/3c39807459594a766f9f66fa780cacc157e2e82c/base/process/launch_win.cc

Project Member

Comment 21 by bugdroid1@chromium.org, Jun 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3

commit 4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3
Author: John Abd-El-Malek <jam@chromium.org>
Date: Thu Jun 07 02:15:42 2018

Final fix to make layout tests not flaky on Windows with network service.

Even if a process is launched without CREATE_SUSPENDED flag, Windows internally will create it suspended and then unsuspend it right away [1]. However there's a small window where if the parent process is killed the child process is left suspended. This only shows up with processes that are unsandboxed, so that's why it's occurring with the network service only. It's also happening only with layout tests because that's the only integration test that kills the browser ofen (about a thousand times per full run).

The fix is to suspend the parent process before calling taskill on it. That ensures that it doesn't spawn another child process to replace one that's killed by taskkill.

[1] https://www.microsoftpressstore.com/articles/article.aspx?p=2233328

Bug:  820996 
Change-Id: I64e1a493744b3b9e41fdc28d5de8cb7e0c4578d1
Reviewed-on: https://chromium-review.googlesource.com/1086714
Commit-Queue: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#565145}
[modify] https://crrev.com/4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3/third_party/blink/tools/blinkpy/common/system/executive.py

Comment 22 by jam@chromium.org, Jun 7 2018

Status: Fixed (was: Started)
Blocking: 879232

Sign in to add a comment