TaskScheduler experiment slightly increases slow ("hung") sync IPCs
Issue description

Hung reports of [Renderer hang] content::`anonymous namespace\'::MessageThrottlingFilter::SendThrottled

In BrowserScheduler/RedirectSequencedWorkerPools3 (60 reports):
https://crash.corp.google.com/browse?q=product.name%3D%22Chrome%22%20AND%20product.version%3D%2256.0.2914.3%22%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%22%5BRenderer%20hang%5D%20content%3A%3A%60anonymous%20namespace%5C%27%3A%3AMessageThrottlingFilter%3A%3ASendThrottled%22%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse%20OMIT%20RECORD%20IF%20SUM(custom_data.ChromeCrashProto.experiments.ids%3D%22241fff6c-bc1cfa01%22)%20%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=

In BrowserScheduler/Control_RedirectSequencedWorkerPools3 (3 reports):
https://crash.corp.google.com/browse?q=product.name%3D%22Chrome%22%20AND%20product.version%3D%2256.0.2914.3%22%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%22%5BRenderer%20hang%5D%20content%3A%3A%60anonymous%20namespace%5C%27%3A%3AMessageThrottlingFilter%3A%3ASendThrottled%22%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse%20OMIT%20RECORD%20IF%20SUM(custom_data.ChromeCrashProto.experiments.ids%3D%22241fff6c-bc1cfa01%22)%20!%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=

Our initial diagnosis:
1) The reported "hang" is a synchronous IPC from the renderer for a DOMStorage call.
2) We hypothesize that this ends up calling into DOMStorageTaskRunner: https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_task_runner.h?q=DomStorageTaskRunner&sq=package:chromium&dr=CSs
3) DOMStorageTaskRunner uses the browser's BlockingPool: https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_context_wrapper.cc?dr=C&q=file:dom_storage_context_wrapper.cc+BrowserThread::GetBlockingPool()&sq=package:chromium&l=277
4) Under the experiment, the BlockingPool is redirected to TaskScheduler along with other pools that could interfere (e.g. CachePool).

Things we want to try:
A) Add multiple groups with different SchedulerWorkerPoolParams to the BrowserScheduler experiment to see whether the number of threads etc. makes a difference here.
B) Make DOMStorageTaskRunner one of the first direct users of TaskScheduler and give it its own SequencedTaskRunner with USER_BLOCKING TaskPriority :) (a sketch of what this could look like follows below)

(B) is blocked on Issue 662053 (having TaskScheduler always on in the browser process -- ETA in a few days :)).

The real solution is (B); (A) is really just us taking advantage of having found something affected by our changes to experiment with tweaking our parameters :)
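To make (B) concrete, here is a minimal sketch of what a direct TaskScheduler user could look like, assuming the CreateSequencedTaskRunnerWithTraits() API in base/task_scheduler/post_task.h (the exact TaskTraits spelling has shifted between milestones, so treat the trait list as approximate); PrimeLocalStorageArea() is a made-up placeholder for the actual DOMStorage work, not an existing function:

  #include "base/bind.h"
  #include "base/location.h"
  #include "base/memory/ref_counted.h"
  #include "base/sequenced_task_runner.h"
  #include "base/task_scheduler/post_task.h"

  // Placeholder for the actual DOMStorage priming work (hypothetical).
  void PrimeLocalStorageArea() {}

  scoped_refptr<base::SequencedTaskRunner> CreateDOMStorageSequence() {
    // A dedicated sequence scheduled by TaskScheduler at USER_BLOCKING
    // priority, so a sync IPC waiting on it isn't starved behind
    // lower-priority BlockingPool work.
    return base::CreateSequencedTaskRunnerWithTraits(
        {base::MayBlock(), base::TaskPriority::USER_BLOCKING,
         base::TaskShutdownBehavior::BLOCK_SHUTDOWN});
  }

  void Example() {
    scoped_refptr<base::SequencedTaskRunner> runner =
        CreateDOMStorageSequence();
    runner->PostTask(FROM_HERE, base::Bind(&PrimeLocalStorageArea));
  }

Posting through a dedicated sequence like this keeps the ordering guarantees of a SequencedWorkerPool token while letting the scheduler apply the priority directly.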
Nov 18 2016
Nov 21 2016
See LocalStorage.TimeToPrimeLocalStorage and LocalStorage.RendererTimeToPrime.
Nov 22 2016
(Unfortunately I don't have any quota to look at this; I haven't been working on anything related for years...)
Dec 19 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/28fb80beda6b171647d037057543f77e5b78e170

commit 28fb80beda6b171647d037057543f77e5b78e170
Author: gab <gab@chromium.org>
Date: Mon Dec 19 20:09:28 2016

Force HostContentSettingsMap to be destroyed on its owning thread.

It uses WeakPtrFactory and as such must be destroyed on its owning thread (UI).

The rest of the changes can then be inferred from the stack trace on the bug:
1) RefcountedKeyedServiceTraits::Destruct() must check for ThreadTaskRunnerHandle::IsSet() before invoking TTRH::Get().
   ** UPDATE in PS 5: Use STR::RunsTasksOnCurrentThread() instead of STRH comparison.
2) Nothing about RefcountedKeyedService's use of its |task_runner| is thread-affine => using SequencedTaskRunner instead is in line with our desire to make all top-level API use sequence-affinity instead of thread-affinity for thread-safety.
3) Calling SequencedTaskRunnerHandle::IsSet() from the Callback's destructor requires the SequencedWorkerPool's task info to be set in that scope => brought in changes from the same requirements in another CL @ https://codereview.chromium.org/2491613004/#ps160001
   Note: intentionally not adding tests, as this change highlights that this deletion is racy in theory (deleting without owning the sequence means potentially deleting things from the same SequenceToken in parallel... any good test would fail TSAN, and I also don't want to address this issue given it's long standing and SWP is being replaced by TaskScheduler shortly).
4) Switch to ScopedMockTimeMessageLoopTaskRunner in NotificationPermissionContextTest => otherwise LSAN catches a leak in patch set 3, as the HostContentSettingsMap is sent for deletion on the wrong task runner and the task is never flushed. (ScopedMockTimeMessageLoopTaskRunner ensures the loop is flushed before replacing it, both ways.)

Precursor requirement to https://codereview.chromium.org/2576243003/ which for some reason makes this destruction race come to life.

BUG=674946, 665588, 622400
TBR=mukai (NotificationPermissionContextTest side-effects)

Review-Url: https://codereview.chromium.org/2581213002
Cr-Commit-Position: refs/heads/master@{#439534}

[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/base/threading/sequenced_worker_pool.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/chrome/browser/notifications/notification_permission_context_unittest.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/content_settings/core/browser/cookie_settings_unittest.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/content_settings/core/browser/host_content_settings_map.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/keyed_service/core/refcounted_keyed_service.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/keyed_service/core/refcounted_keyed_service.h
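For readers skimming the CL, points (1) and (2) roughly amount to the following shape for the deletion helper. This is a paraphrased sketch, not the actual diff: the member name |task_runner_| and the exact control flow are approximations of what lives in components/keyed_service/core/refcounted_keyed_service.cc.

  #include "base/location.h"
  #include "base/memory/ref_counted.h"
  #include "base/sequenced_task_runner.h"

  // Sketch: delete the service on its owning sequence unless we already run
  // on it. Checking RunsTasksOnCurrentThread() (rather than comparing against
  // a SequencedTaskRunnerHandle) avoids requiring a handle to be set in the
  // destruction scope, and using SequencedTaskRunner rather than
  // SingleThreadTaskRunner makes the API sequence-affine, not thread-affine.
  void RefcountedKeyedServiceTraits::Destruct(
      const RefcountedKeyedService* obj) {
    if (obj->task_runner_ &&
        !obj->task_runner_->RunsTasksOnCurrentThread()) {
      obj->task_runner_->DeleteSoon(FROM_HERE, obj);
    } else {
      delete obj;
    }
  }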
Dec 19 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/72b1a059ae930ebe1dda804e607f251d3ae94a70

commit 72b1a059ae930ebe1dda804e607f251d3ae94a70
Author: gab <gab@chromium.org>
Date: Mon Dec 19 21:40:50 2016

Add an experiment to redirect DOMStorageTaskRunner to TaskScheduler.

Using a bug as an opportunity to experiment with TaskScheduler (as well as confirm the issue is what we think it is: a priority inversion where a sync IPC from the renderer is blocking on a response from the BlockingPool).

The only logical difference is that TaskScheduler doesn't have a notion of named sequences, so every DOMStorageContextWrapper will have its own primary and commit sequence instead of sharing one with other DOMStorageContextWrappers. If anything that's a good thing, as it allows multiple profiles to do their DOMStorage ops in parallel? Don't think this can conflict..?

BUG=665588, 622400

Review-Url: https://codereview.chromium.org/2576243003
Cr-Commit-Position: refs/heads/master@{#439575}

[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/chrome/browser/chrome_content_browser_client.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/chrome/browser/chrome_content_browser_client.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_impl.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_impl.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_wrapper.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_task_runner.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_task_runner.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/public/browser/content_browser_client.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/public/browser/content_browser_client.h
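The "own primary and commit sequence per wrapper" part translates to something along these lines. This is a hypothetical sketch only; the struct, function, and trait choices are my guesses rather than the CL's actual code.

  #include "base/memory/ref_counted.h"
  #include "base/sequenced_task_runner.h"
  #include "base/task_scheduler/post_task.h"

  // Hypothetical: each DOMStorageContextWrapper creates its own two sequences
  // instead of looking up shared, named SequencedWorkerPool sequences, so two
  // profiles' DOMStorage work can proceed in parallel while ordering within a
  // single wrapper is preserved.
  struct DOMStorageSequences {
    scoped_refptr<base::SequencedTaskRunner> primary_sequence;
    scoped_refptr<base::SequencedTaskRunner> commit_sequence;
  };

  DOMStorageSequences CreateSequencesForWrapper() {
    DOMStorageSequences sequences;
    // The primary sequence services renderer-visible requests.
    sequences.primary_sequence = base::CreateSequencedTaskRunnerWithTraits(
        {base::MayBlock(), base::TaskPriority::USER_BLOCKING});
    // The commit sequence flushes to disk; it must survive shutdown long
    // enough to finish pending writes (the shutdown behavior is a guess).
    sequences.commit_sequence = base::CreateSequencedTaskRunnerWithTraits(
        {base::MayBlock(), base::TaskShutdownBehavior::BLOCK_SHUTDOWN});
    return sequences;
  }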
Feb 15 2017
So I dug into this further this morning: the hang is indeed on a sync IPC from DomStorageDispatcher::ProxyImpl::LoadArea() [1] -- I'm not clear on why this needs to be a synchronous IPC, but it is.

The browser-side handler of this IPC [2] doesn't appear to be blocking on DOMStorageTaskRunner (which is being redirected to TaskScheduler and hence I thought was the issue). So another possible cause (much harder to diagnose...) is that the IO thread's throughput itself is lower under the redirection. This could be because something else on the IO thread has a dependency on the BlockingPool (and thus slows down the whole queue and consequently pending sync IPCs)...

@asviktine: can the profiler dashboard help highlight which tasks are slower on the IO thread under a given experiment group? Or, I guess, if that were the case the IO thread's hang watcher should be reporting dumps-without-crash and Chirp should alert on the higher "crash rate" for that group -- haven't seen that..?

[1] https://cs.chromium.org/chromium/src/content/common/dom_storage/dom_storage_messages.h?rcl=5dbb827e3b41a97ec2d7105a0378c268a5d29fbe&l=74
[2] https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_message_filter.cc?rcl=5dbb827e3b41a97ec2d7105a0378c268a5d29fbe&l=105
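To make the suspected mechanism concrete, here is a toy illustration of why a backed-up browser IO thread surfaces as a renderer "hang" in SendThrottled(). This is not actual Chromium IPC code: HandleLoadArea() and the task-runner plumbing are made up for illustration.

  #include <string>

  #include "base/bind.h"
  #include "base/location.h"
  #include "base/single_thread_task_runner.h"
  #include "base/synchronization/waitable_event.h"

  // Stand-in for the browser-side handler of the sync LoadArea request.
  void HandleLoadArea(std::string* reply, base::WaitableEvent* done) {
    *reply = "storage area contents";
    done->Signal();
  }

  // Stand-in for the renderer's sync send: it blocks until the browser's IO
  // thread dequeues the request and signals the reply. If that queue is busy
  // (e.g. because something on it is itself waiting on the BlockingPool, or
  // it simply has more work under the redirection), this Wait() is what the
  // hang detector attributes to MessageThrottlingFilter::SendThrottled.
  std::string SendSyncLoadArea(
      base::SingleThreadTaskRunner* browser_io_runner) {
    base::WaitableEvent done(base::WaitableEvent::ResetPolicy::MANUAL,
                             base::WaitableEvent::InitialState::NOT_SIGNALED);
    std::string reply;
    browser_io_runner->PostTask(
        FROM_HERE, base::Bind(&HandleLoadArea, &reply, &done));
    done.Wait();
    return reply;
  }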
Feb 15 2017
@asviktine ^^^ ?
Feb 15 2017
+isherman Unfortunately, our dashboard doesn't support this. It would in theory be possible to add, since we have the data server-side, but the UMA task/jank profiler is currently not prioritized in the team's OKRs. I'm not sure how easy it would be to query this directly from Dremel, but that may be an avenue to pursue if you're willing to get your hands dirty... Adding Ilya in case he has more thoughts on this, as he's more familiar with some of it.
Feb 15 2017
Yeah, the profiler.go script has a "generateVariations" option, but it's not currently run by default, as there's no implemented dashboard support for it. Sorry!
Mar 23 2017
Just looked into this again; it's still the case on M59 Dev: 3 reports in Control7 versus >20 in each of the other three redirection groups. So whatever is causing this issue is still amongst the calls that have yet to be migrated directly to TaskScheduler.

Here's one crazy theory: given that the main thing the redirection does right now is prioritize CachePool, perhaps we are indeed faster at servicing cache requests, so the IO thread has more work (more responses) and is more likely to be "flooded". That's not a bad flood if all of those responses matter -- if they don't, it's a resource scheduling issue, since we can't ask the task scheduler to run things slower on purpose... or it's an IO thread scheduling issue if some responses should be processed first. That's something we can address long term, but not in v1.
Apr 20 2017
WontFix'ing this. The original hang was marked WontFix. We don't know why the redirection increases hangs from "very seldomly" to "4X as very seldomly"... but the best way to figure it out (and fix many other subtle issues like this) is to ship TaskScheduler and instrument its workload, so we're going to go ahead and do just that.
Apr 21 2017
FTR, issue 605848 is where the low rate of hangs from this sync IPC was originally marked WontFix.
Apr 21 2017
Interestingly, this hang stopped being reported altogether after 59.0.3044.1 [1]. The CL range until the next release is fairly wide though: https://chromium.googlesource.com/chromium/src/+log/59.0.3044.0..59.0.3047.0?pretty=fuller&n=10000

[1] https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BRenderer%20hang%5D%20content%3A%3A%60anonymous%20namespace%5C%27%3A%3AMessageThrottlingFilter%3A%3ASendThrottled%27%20AND%20product.Version%20CONTAINS%20%2759.0.%27&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=#samplereports:5,-productname,productversion:1000,-component,-filepath,-author,-changelist
Comment 1 by gab@chromium.org, Nov 15 2016