New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 665588 link

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug

Blocked on:
issue 605848
issue 662053



Sign in to add a comment

TaskScheduler experiment slightly increases slow ("hung") sync IPCs

Project Member Reported by gab@chromium.org, Nov 15 2016

Issue description

Hung reports of [Renderer hang] content::`anonymous namespace\'::MessageThrottlingFilter::SendThrottled

In BrowserScheduler/RedirectSequencedWorkerPools3 (60 reports): https://crash.corp.google.com/browse?q=product.name%3D%22Chrome%22%20AND%20product.version%3D%2256.0.2914.3%22%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%22%5BRenderer%20hang%5D%20content%3A%3A%60anonymous%20namespace%5C%27%3A%3AMessageThrottlingFilter%3A%3ASendThrottled%22%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse%20OMIT%20RECORD%20IF%20SUM(custom_data.ChromeCrashProto.experiments.ids%3D%22241fff6c-bc1cfa01%22)%20%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=

In BrowserScheduler/Control_RedirectSequencedWorkerPools3 (3 reports): https://crash.corp.google.com/browse?q=product.name%3D%22Chrome%22%20AND%20product.version%3D%2256.0.2914.3%22%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%22%5BRenderer%20hang%5D%20content%3A%3A%60anonymous%20namespace%5C%27%3A%3AMessageThrottlingFilter%3A%3ASendThrottled%22%20AND%20custom_data.ChromeCrashProto.malware_verdict%3Dfalse%20OMIT%20RECORD%20IF%20SUM(custom_data.ChromeCrashProto.experiments.ids%3D%22241fff6c-bc1cfa01%22)%20!%3D%200&ignore_case=false&enable_rewrite=false&omit_field_name=&omit_field_value=&omit_field_opt=


Our initial diagnosis:

 1) The reported "hang" is a synchronous IPC from the renderer for a DOMStorage call.

 2) We hypothesize that this ends up calling into DOMStorageTaskRunner : https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_task_runner.h?q=DomStorageTaskRunner&sq=package:chromium&dr=CSs

 3) DOMStorageTaskRunner uses the browser's BlockingPool: https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_context_wrapper.cc?dr=C&q=file:dom_storage_context_wrapper.cc+BrowserThread::GetBlockingPool()&sq=package:chromium&l=277

 4) Under the experiment the BlockingPool is redirected to TaskScheduler along with other pools which could interfere (e.g. CachePool).


Things we want to try:
 A) Add multiple groups with different SchedulerWorkerPoolParams in the BrowserScheduler experiment to see if number of threads/etc. make a difference here.
 B) Have DOMStorageTaskRunner be one of the first direct users of TaskScheduler and give it its own SequencedTaskRunner with USER_BLOCKING TaskPriority :)

(B) is blocked on  Issue 662053  (having TaskScheduler always on in browser process -- ETA in a few days :))

The real solution is (B), (A) is really just us taking advantage of having found something affected by our changes to experiment with tweaking our parameters :)
 

Comment 1 by gab@chromium.org, Nov 15 2016

Cc: michaeln@chromium.org marja@chromium.org
@dom_storage OWNERS: is there any UMA metric we can look at to watch the effect of this potential subtle slow down?

Comment 2 by gab@chromium.org, Nov 18 2016

Labels: -M-56 M-57
See...
LocalStorage.TimeToPrimeLocalStorage
LocalStorage.RendererTimeToPrime<<>>

Comment 4 by marja@chromium.org, Nov 22 2016

(Unfortunately I don't have any quota to look at this; I haven't been working on anything related for years...)
Project Member

Comment 5 by bugdroid1@chromium.org, Dec 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/28fb80beda6b171647d037057543f77e5b78e170

commit 28fb80beda6b171647d037057543f77e5b78e170
Author: gab <gab@chromium.org>
Date: Mon Dec 19 20:09:28 2016

Force HostContentSettingsMap to be destroyed on its owning thread.

It uses WeakPtrFactory and as such must be destroyed on its owning thread (UI).

The rest of the changes can then be inferred from the stack trace on the bug:

1) RefcountedKeyedServiceTraits::Destruct() must check for
   ThreadTaskRunnerHandle::IsSet() before invoking TTRH::Get().
  ** UPDATE in PS 5 : Use STR::RunsTasksOnCurrentThread() instead of STRH comparison.

2) Nothing about RefcountedKeyedService's use of its |task_runner| is thread-affine
   => using SequencedTaskRunner instead is inline with our desire to make all
   top-level API use sequence-affinity instead of thread-affinity for thread-safety.

3) Calling SequencedTaskRunnerHandle::IsSet() from the Callback's destructor requires
   the SequencedWorkerPool's task info to be set in that scope
   => brought changes from same requirements in another CL
      @ https://codereview.chromium.org/2491613004/#ps160001
   Note: intentionally not adding tests as this change highlights that this deletion is
   racy in theory (deleting without owning the sequence means potentially deleting
   things from the same SequenceToken in parallel... any good test would fail TSAN and
   I also don't want to address this issue given it's long standing and SWP is being
   replaced by TaskScheduler shortly).

4) Switch to ScopedMockTimeMessageLoopTaskRunner in NotificationPermissionContextTest
   => otherwise LSAN catches a leak in patch set 3 as the HostContentSettingsMap is
      sent for deletion on the wrong task runner and the task is never flushed.
      (ScopedMockTimeMessageLoopTaskRunner ensures the loop is flushed before replacing
       it, both ways).

Precursor requirement to https://codereview.chromium.org/2576243003/ which
for some reason makes this destruction race come to life.

BUG= 674946 ,  665588 ,  622400 
TBR=mukai (NotificationPermissionContextTest side-effects)

Review-Url: https://codereview.chromium.org/2581213002
Cr-Commit-Position: refs/heads/master@{#439534}

[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/base/threading/sequenced_worker_pool.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/chrome/browser/notifications/notification_permission_context_unittest.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/content_settings/core/browser/cookie_settings_unittest.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/content_settings/core/browser/host_content_settings_map.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/keyed_service/core/refcounted_keyed_service.cc
[modify] https://crrev.com/28fb80beda6b171647d037057543f77e5b78e170/components/keyed_service/core/refcounted_keyed_service.h

Project Member

Comment 6 by bugdroid1@chromium.org, Dec 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/72b1a059ae930ebe1dda804e607f251d3ae94a70

commit 72b1a059ae930ebe1dda804e607f251d3ae94a70
Author: gab <gab@chromium.org>
Date: Mon Dec 19 21:40:50 2016

Add an experiment to redirect DOMStorageTaskRunner to TaskScheduler.

Using a bug as an opportunity to experiment with TaskScheduler (as well
as confirm the issue is what we think it is: a priority inversion where
a sync IPC from the renderer is blocking on a response from the
BlockingPool).

Only logical difference is that TaskScheduler doesn't have a notion of
named sequences and as such every DOMStorageContextWrapper will have its
own primary and commit sequence instead of sharing one with other
DOMStorageContextWrapper. If anything that's a good thing as it allows
multiple profiles to do their DOMStorage ops in parallel? Don't think
this can conflict..?

BUG= 665588 ,  622400 

Review-Url: https://codereview.chromium.org/2576243003
Cr-Commit-Position: refs/heads/master@{#439575}

[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/chrome/browser/chrome_content_browser_client.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/chrome/browser/chrome_content_browser_client.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_impl.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_impl.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_context_wrapper.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_task_runner.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/browser/dom_storage/dom_storage_task_runner.h
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/public/browser/content_browser_client.cc
[modify] https://crrev.com/72b1a059ae930ebe1dda804e607f251d3ae94a70/content/public/browser/content_browser_client.h

Comment 7 by gab@chromium.org, Feb 15 2017

So I dug into this further this morning: the hang is indeed on a sync IPC from DomStorageDispatcher::ProxyImpl::LoadArea() [1] -- I'm not clear on why this needs to be a synchronous IPC but it is.

The browser side handler of this IPC [2] doesn't appear to be blocking on DomStorageTaskRunner (which is being redirected to TaskScheduler and hence I thought that was the issue).

So another possible cause (much harder to diagnose...) is that the IO thread's throughput itself is lower under the redirection... This could be because something else on the IO thread has a dependency on the BlockingPool (and thus slows down the whole queue and consequently pending sync IPCs)...

@asviktine: can the profiler dashboard help highlight which tasks are slower on the IO thread under a given experiment group? Or I guess if that were the case: the IO thread's hang watcher should be reporting dumps-without-crash and Chirp should alert of the higher "crash rate" for a given group -- haven't seen that..?

[1] https://cs.chromium.org/chromium/src/content/common/dom_storage/dom_storage_messages.h?rcl=5dbb827e3b41a97ec2d7105a0378c268a5d29fbe&l=74

[2] https://cs.chromium.org/chromium/src/content/browser/dom_storage/dom_storage_message_filter.cc?rcl=5dbb827e3b41a97ec2d7105a0378c268a5d29fbe&l=105

Comment 8 by gab@chromium.org, Feb 15 2017

Cc: asvitk...@chromium.org
@asviktine ^^^ ?
Cc: isherman@chromium.org
+isherman

Unfortunately, our dashboard doesn't support this.

It would in theory be possible to add - since we have the data server side, but the UMA task/jank profiler is currently not prioritized in the team's OKRs. I'm not sure how easy it would be to query this directly from Dremel, but that may be an avenue to pursue if you're willing to get your hangs dirty...

Adding Ilya in case he has more thoughts on this as he's more familiar with some of this.
Yeah, the profiler.go script has a "generateVariations" option, but it's not currently run by default, as there's no implemented dashboard support for it.  Sorry!

Comment 11 by gab@chromium.org, Mar 23 2017

Just looked into this again, it's still the case on M59 Dev:

3 reports in Control7 versus >20 in each of the other three redirection groups.

So whatever is causing this issue is still amongst the calls that have yet to be migrated directly to TaskScheduler.

Here's one crazy theory... given the main thing the redirection does right now is prioritize CachePool, perhaps what's happening is that we indeed are faster at servicing cache requests and hence the IO thread has more work (more responses) and as such is more likely to be "flooded" (even though that's not a bad flood if all of those responses matter -- if they don't that's a resource scheduling issue as we can't ask the task scheduler to run things slower on purpose.. or it's an IO thread scheduling issue if some responses should be processed first... that's something we can address long term but not in v1).

Comment 12 by gab@chromium.org, Apr 20 2017

Status: WontFix (was: Started)
WontFix'ing this. The original hang was marked WontFix. We don't know why the redirection increases hangs from "very seldomly" to "4X as very seldomly"... but the best way to figure it out (and fix so many other subtle issues like this) is to ship task scheduler and instrument its workload so we're going to go ahead and do just that.

Comment 13 by gab@chromium.org, Apr 21 2017

Blockedon: 605848
FTR, issue 605848 is where the low rate of hangs from this sync IPC was originally marked WontFix

Sign in to add a comment