Ensure that network process is shut down gracefully |
|||||||
Issue descriptionWe need to make sure that the network process shuts down gracefully, i.e. all the network contexts are destructed and the disk I/O is all flushed.
,
Mar 19 2018
oops sorry just saw your comment. I ran into race conditions because of the different pipes, where the NnetworkContexts are not torn down but the network process is.
,
Mar 20 2018
Thanks for the additional context! I had been thinking we kept the process alive until all network contexts were destroyed.
,
Mar 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cc0b2a64386480ae1b7545b8cf49efb66ee89c93 commit cc0b2a64386480ae1b7545b8cf49efb66ee89c93 Author: John Abd-El-Malek <jam@chromium.org> Date: Tue Mar 20 19:32:50 2018 Fix webview tag's using an on-disk directory with the network service enabled. To fix the flakiness in the tests that span restarts, explicit calls are added to ensure the cookie store is flushed. Bug: 769401 , 820996 Change-Id: I8b02a0aa3882420a6055292f75730c607ce143b0 Reviewed-on: https://chromium-review.googlesource.com/967058 Commit-Queue: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Matt Menke <mmenke@chromium.org> Reviewed-by: Nasko Oskov <nasko@chromium.org> Cr-Commit-Position: refs/heads/master@{#544488} [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/apps/guest_view/web_view_browsertest.cc [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/chrome_content_browser_client.cc [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/net/profile_network_context_service.cc [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/net/profile_network_context_service.h [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/chrome/browser/sessions/better_session_restore_browsertest.cc [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/content/public/test/browser_test_utils.cc [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/content/public/test/browser_test_utils.h [modify] https://crrev.com/cc0b2a64386480ae1b7545b8cf49efb66ee89c93/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter
,
Mar 20 2018
@Matt: it took me a bit of time to track down all the ways the process or NetworkContext is shut down. Here it is for posterity. The summary is that NetworkContexts aren't always destructed. 1) ProfileNetworkContextService is torn down in the browser, and NetworkContext::OnConnectionError notices the pipe error in the utility process. In this case, the NetworkContext closes its URLRequestContext which flushes data to disk (*racy, see 3). 2) The service instance in the utility process gets a pipe error due to this call in the browser: > content.dll!content::ServiceManagerContext::InProcessServiceManagerContext::ShutDown() Line 351 C++ content.dll!content::ServiceManagerContext::~ServiceManagerContext() Line 662 C++ content.dll!std::default_delete<content::ServiceManagerContext>::operator()(content::ServiceManagerContext * _Ptr) Line 1999 C++ content.dll!std::unique_ptr<content::ServiceManagerContext,std::default_delete<content::ServiceManagerContext> >::reset(content::ServiceManagerContext * _Ptr) Line 2238 C++ content.dll!content::BrowserMainLoop::ShutdownThreadsAndCleanUp() Line 1149 C++ content.dll!content::BrowserMainRunnerImpl::Shutdown() Line 223 C++ content.dll!content::BrowserMain(const content::MainFunctionParams & parameters) Line 50 C++ On the utility process side, the pipe error is seen and eventually UtilityServiceFactory::OnServiceQuit() is reached which will quit the process. Note in this case network::NetworkService isn't destructed which means NetworkContext's destructors aren't called. 3) The browser process kills all child process at the end of shutdown, in BrowserProcessSubThread::IOThreadCleanUp(). See the BrowserChildProcessHostImpl::TerminateAll() call. Things we can change: i) we could make a new mojo call from the browser to the network process as soon as shutdown starts, to improve the chance that the NetworkContexts are torn down. e.g. https://chromium-review.googlesource.com/c/chromium/src/+/967058/3 -however this isn't foolproof, as there are no guarantees that this will run and finish execution before BrowserChildProcessHostImpl::TerminateAll kills the process ii) similar to i) but even more racy, since it would happen later, we could have UtilityServiceFactory::OnServiceQuit() post a task to the IO thread and destruct NetworkService (through a weakptr that it initialized in UtilityServiceFactory::CreateNetworkService()). To increase the chance that this has time to run we'd probably want to change the browser to shutdown the ServiceManager earlier so that ServiceManagerContext::~ServiceManagerContext() runs earlier. iii) we could have the browser wait on utility process shutdown (just the one hosting network service), for up to 1s. In practice this would probably work most of the time, but on slow computers or at system shutdown it would be ineffective. It also slows down shutdown, and isn't guaranteed (since there could be state-changing mojo calls in the OS pipes). Ken points out that in all these workarounds, we still don't have guarantees that the data will be processed and flushed. Since the majority of users are on Android, where we never have clean shutdown anyways, perhaps we don't need to solve this on desktop.
,
Mar 20 2018
With unclean shutdown, we will potentially corrupt the cache and/or cookie stores. While both can recover from unclean states generally, I'd rather we have some guarantee that on clean shutdowns, we cleanly flush them. Do we have to kill the network process on shutdown, rather than gracefully destroy the network contexts, and shut down the worker pool? We may still want timeouts of some sort, but just killing processes that store on-disk data in the normal case seems less than optimal.
,
Mar 20 2018
Also, if we really want to ignore TaskShutdownBehavior::BLOCK_SHUTDOWN in child processes, perhaps we should DCHECK on it everywhere but in the main process? If we ignore it in the main process, too, we should remove it entirely.
,
Mar 20 2018
The browser process always kills child processes at shutdown. That'll have to remain because we never want any child processes dangling. If we want to ensure that data stores in the network process are closed gracefully, that's possible by sending it an IPC and waiting for n seconds. It'll just have to be balanced with slowing down shutdown. I don't feel strongly either way. We can also add a timeout and gather uma, so once we're on canary we get real world data on how much it slows down shutdown. The thing I'm most curious about is what about Android: all of our shutdowns there are unclean, and I assume the browser there shuts down much more frequently than on desktop. If it's ok on Android, curious why it wouldn't be on desktop?
,
Mar 20 2018
It's OK on Android for the same reason not using Blink is ok on iOS - we don't have a choice in the matter, I assume.
,
Mar 20 2018
That's reasonable. I guess this separates shutdown into 2 sub-problems: 1) not all data might be flushed: although this is true today as well, since there could be in-flight messages from renderers. 2) trying to always close the cache & cookie stores, within some reasonable timeout You mentioned worker pool: how does the URLRequestContext's stores pick which threads to do disk I/O on?
,
Mar 20 2018
Most things use TaskScheduler now, and use block_shutdown for disk writes. I am not sure about the two disk cache implementations - I think one uses its own (leaked?) worker pool for individual files, and (possibly) a block_shutdown thread for the cache index itself. I'm much less concerned about performance improving data (the cache, server metadata) and partially received data (Like files being downloaded, though those are written to in the browser process, anyways) than more user-visible state (cookie store, channel ID store, and the like).
,
May 15 2018
,
May 18 2018
,
May 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c1273ef38167e73d5bdb38c73aca7d67fd4632cd commit c1273ef38167e73d5bdb38c73aca7d67fd4632cd Author: Matt Menke <mmenke@chromium.org> Date: Sat May 19 00:45:21 2018 Selectively re-enable NetworkContextConfigurationBrowserTest.Hsts. What I believe is happening is that, when using the network service, we're killing the network service before it can write to disk, resulting in settings not being saved before restarting the browser and trying to load the saved state. Unfortunately, the browser process kills subprocesses on shutdown, rather than letting them run their BLOCK_SHUTDOWN tasks. This CL disables the test whenever there's an on-disk cache and the network service is enabled. Bug: 837776 , 820996 , 842088 Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo Change-Id: I80549906d3408b1a0c651475075959b83fb0dc03 Reviewed-on: https://chromium-review.googlesource.com/1054525 Commit-Queue: Matt Menke <mmenke@chromium.org> Reviewed-by: Emily Stark <estark@chromium.org> Cr-Commit-Position: refs/heads/master@{#560123} [modify] https://crrev.com/c1273ef38167e73d5bdb38c73aca7d67fd4632cd/chrome/browser/net/network_context_configuration_browsertest.cc [modify] https://crrev.com/c1273ef38167e73d5bdb38c73aca7d67fd4632cd/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter
,
May 21 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5 commit c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5 Author: John Abd-El-Malek <jam@chromium.org> Date: Mon May 21 22:00:38 2018 Wait on the network process at shutdown. This gives the child process time to flush I/O. This can help avoid the OS restarting in the middle of flushing data. It's also needed because swarming jobs fail when the browser quits with child processes still alive. An example flake is https://chromium-swarm.appspot.com/task?id=3d8cd8e056a4b310&refresh=10&show_raw=1. Roughly half of the layout test runs on Win Mojo are flaking as a result. Bug: 820996 Change-Id: I9c3b13f83c716f020f555cbe2aa14e20c8825ed3 Reviewed-on: https://chromium-review.googlesource.com/1066622 Reviewed-by: Steven Holte <holte@chromium.org> Reviewed-by: Ken Rockot <rockot@chromium.org> Commit-Queue: John Abd-El-Malek <jam@chromium.org> Cr-Commit-Position: refs/heads/master@{#560368} [modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/browser/browser_process_sub_thread.cc [modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/browser/utility_process_host.h [modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/content/common/child_process_host_impl.h [modify] https://crrev.com/c840aa8d5e460c9f8cd9b787fd1a4c339aaeb4a5/tools/metrics/histograms/histograms.xml
,
May 22 2018
,
May 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9edd464609b0353046c0f11a3594329e7ba958a7 commit 9edd464609b0353046c0f11a3594329e7ba958a7 Author: John Abd-El-Malek <jam@chromium.org> Date: Wed May 23 04:24:28 2018 Fix waiting for network process at browser shutdown. The previous fix in r560368 didn't work, as the BrowserChildProcessHostImpl::TerminateAll removed all the UtilityProcessHost objects first. Bug: 820996 , 820969 Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo Change-Id: I88ba647e911c8f7e8d9458b60021aa6d915ccfa0 Reviewed-on: https://chromium-review.googlesource.com/1069668 Commit-Queue: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Ken Rockot <rockot@chromium.org> Cr-Commit-Position: refs/heads/master@{#560938} [modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/base/threading/thread_restrictions.h [modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/content/browser/browser_process_sub_thread.cc [modify] https://crrev.com/9edd464609b0353046c0f11a3594329e7ba958a7/testing/buildbot/filters/mojo.fyi.network_browser_tests.filter
,
May 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1efb28fb17a305f4358afd55f964f63429a521a8 commit 1efb28fb17a305f4358afd55f964f63429a521a8 Author: John Abd-El-Malek <jam@chromium.org> Date: Fri May 25 23:27:39 2018 Launch unsandboxed network service in a job object to avoid the process outliving the browser. Currently many layout test runs are flaking because swarming is not expecting child processes to outlive the browser. This is happening because TerminateProcess fails sometimes because there's pending I/O. Bug: 820996 Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo Change-Id: I7595c131faccae03c268cde1139394c0ad48a7b4 Reviewed-on: https://chromium-review.googlesource.com/1071100 Commit-Queue: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Penny MacNeil <pennymac@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org> Cr-Commit-Position: refs/heads/master@{#562058} [modify] https://crrev.com/1efb28fb17a305f4358afd55f964f63429a521a8/content/browser/browser_process_sub_thread.cc [modify] https://crrev.com/1efb28fb17a305f4358afd55f964f63429a521a8/services/service_manager/sandbox/win/sandbox_win.cc
,
Jun 3 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cac74ff72c0eaa1a47338d1ba44990cc74e6b589 commit cac74ff72c0eaa1a47338d1ba44990cc74e6b589 Author: John Abd-El-Malek <jam@chromium.org> Date: Sun Jun 03 17:13:53 2018 Fix flake in layout tests when running with network service. The problem was that there were alive processes when tests finished, and swarming considered that a failure (e.g. [1]). The cause was that the processes were using a job object (to fix other races), and the code in base launched processes suspended if they were in a job object and later resumed them. The race was that layout test script kills the browser when there are any test failures. It did this using "taskkill.exe /t" (recursive). Windows would delete the network process, and as it was killing the browser it would sometimes do that in between the small window where the relaunched network process was created suspended and it was resumed. Child processes have logic to kill themselves if the parent is gone, but since the child process was suspended that didn't run. The suspension when a process is in a job object was added in r100027 [2]. However the code that was moved there (test server) didn't have this originally. Also, the documentation that describes that this is needed [3] is for functionality that ended in Windows XP [4]. Note that a few weeks ago relaunching started using this as well in [5]. I'm removing created-as-suspended for that code as well, since that seems rare and keeping code in base/ that has this race condition that leaves processes around seems much worse than the alternative race of windows sometimes not being in foreground. [1] https://chromium-swarm.appspot.com/task?id=3dd10a30220aa810&refresh=10&show_raw=1 [2] https://codereview.chromium.org/7789018 [3] https://msdn.microsoft.com/en-us/library/windows/desktop/ms681949(v=vs.85).aspx [4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms684159(v=vs.85).aspx [5] https://chromium-review.googlesource.com/c/chromium/src/+/1064510 Bug: 820996 Change-Id: Ie79118efffac347bf3df1d65e1124fba095acdc0 Reviewed-on: https://chromium-review.googlesource.com/1081559 Reviewed-by: Gabriel Charette <gab@chromium.org> Reviewed-by: Greg Thompson <grt@chromium.org> Commit-Queue: John Abd-El-Malek <jam@chromium.org> Cr-Commit-Position: refs/heads/master@{#563967} [modify] https://crrev.com/cac74ff72c0eaa1a47338d1ba44990cc74e6b589/base/process/launch_win.cc
,
Jun 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3c39807459594a766f9f66fa780cacc157e2e82c commit 3c39807459594a766f9f66fa780cacc157e2e82c Author: John Abd-El-Malek <jam@chromium.org> Date: Tue Jun 05 15:27:34 2018 Add a DCHECK to ensure CREATE_SUSPENDED isn't used as a followup to r563967. Bug: 820996 Change-Id: I590744ed2b08bf0e5aafcaea737512dc369cf212 Reviewed-on: https://chromium-review.googlesource.com/1086592 Commit-Queue: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#564510} [modify] https://crrev.com/3c39807459594a766f9f66fa780cacc157e2e82c/base/process/launch_win.cc
,
Jun 7 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3 commit 4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3 Author: John Abd-El-Malek <jam@chromium.org> Date: Thu Jun 07 02:15:42 2018 Final fix to make layout tests not flaky on Windows with network service. Even if a process is launched without CREATE_SUSPENDED flag, Windows internally will create it suspended and then unsuspend it right away [1]. However there's a small window where if the parent process is killed the child process is left suspended. This only shows up with processes that are unsandboxed, so that's why it's occurring with the network service only. It's also happening only with layout tests because that's the only integration test that kills the browser ofen (about a thousand times per full run). The fix is to suspend the parent process before calling taskill on it. That ensures that it doesn't spawn another child process to replace one that's killed by taskkill. [1] https://www.microsoftpressstore.com/articles/article.aspx?p=2233328 Bug: 820996 Change-Id: I64e1a493744b3b9e41fdc28d5de8cb7e0c4578d1 Reviewed-on: https://chromium-review.googlesource.com/1086714 Commit-Queue: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Cr-Commit-Position: refs/heads/master@{#565145} [modify] https://crrev.com/4db54417f0ee15c0f4abc0f8a63bfa47b45f67a3/third_party/blink/tools/blinkpy/common/system/executive.py
,
Jun 7 2018
,
Aug 31
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by mmenke@chromium.org
, Mar 12 2018