New issue
Advanced search Search tips

Issue 779914 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 786173

Blocking:
issue 792524



Sign in to add a comment

Reduce overhead of subresource_filter safe browsing integration.

Project Member Reported by csharrison@chromium.org, Oct 31 2017

Issue description

It seems like on android this is causing perf issues.
 
So I took some traces and it seems like actual CPU contention could be an issue here, with descheduled time (caused by IPC) partially to blame.

I have a CL which will land some additional metrics to help us see. 
Labels: Restrict-View-Google
Great news, I think the culprit is basically 100% explained by my guess at #2. Take a look at the histogram:
https://uma.googleplex.com/p/chrome/variations/?sid=345a1dc1d714307cd28da84ff70a3525

This is measuring the time a nav starting to when it is requested by the net stack. Since we dispatch the IPC *before* the navigation is handed off to the net stack, I'm guessing we're delaying that from happening.

I'm so thankful we found this metric!

Solution are a bit harder though. Let me think about it.
Cc: clamy@chromium.org shivanisha@chromium.org
Components: UI>Browser>Navigation
+Navigation and clamy specifically for ideas here.

For context: I think SubresourceFilterSafeBrowsingClientRequest::Start is causing the IO thread to deschedule, since it sends an IPC to GMSCore SafetyNet. This blocks the navigation request from going out, since we run this task in response to NavigationThrottle constructor.

Ideally, we want a way to run this *after* the navigation goes out to network, since we don't want to delay at all. We could do something hacky on the IO thread like re-post a task to the end of the queue or post a delayed task. Another idea is to expose another API to NavigationThrottle like DidStart / DidRedirect, etc. which gets called after all throttles are done.

Posting a task to the IO thread from DidStart should not block the nav request from going out.
I guess more descriptive names would be:
1. StartDidProceed()
2. RedirectDidProcess()

These would need to be called
1. After the call to NavigationURLLoader::Create is made.
2. After the call to NavigationURLLoader::FollowRedirect is made

clamy: WDYT? Let me know if you see a better solution here.
Cc: mmenke@chromium.org
I chatted with +mmenke offline who confirmed that descheduling the IO thread *after* the resource loader starts the request will still allow network work to be done (in most circumstances). That is, we aren't just gaming the Navigation.TimeToURLJobStart metric.
Note that "in most circumstances" really means "in what I think are most circumstances, at least with the simple cache, and no extension that interposes on all URLRequests" - though in either of those cases, we will at least kick off the initial cache lookup / extension check, which is done off thread.
Yeah, understood. This problem is only on Android so I think we can assume simple cache / no extensions.
Rather than introducing new NavigationThrottle APIs, can't you just POST_TASK on the UI thread the call that will POST_TASK on the IO thread. I know that makes two post tasks, but I think it should work in most cases.
The problem with posting the task on the UI thread, and then posting to the IO thread is that we'll potentially be delaying a lot more. I'm looking at a trace where the task (creating and navigating a new tab) finishes 45ms after the request is sent out to the network.

In general, double posted tasks or posting delayed tasks could probably work, but I think the results will be inconsistent, or yield worse performance.

I'm open to exploring other APIs ideas though, this was just the simplest I could think of that does basically exactly what we want :)
Project Member

Comment 12 by bugdroid1@chromium.org, Nov 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/78ff41eab058af57f7629b6b1beec3628f7ce719

commit 78ff41eab058af57f7629b6b1beec3628f7ce719
Author: Charles Harrison <csharrison@chromium.org>
Date: Fri Nov 03 16:59:32 2017

[subresource_filter] Add metric for total SB check time

We have some ideas to make this feature a bit more async, to avoid
thread descheds janking navigation code on the IO thread. This metric
can help us evaluate that impact.

This CL also adds a histogram for the time it took to dispatch the
safe browsing check, which could be substantial on Android where we
are seeing IPC descheduling.

Bug:  779914 
Change-Id: I2cabba3c5979a7ea8008d294d02d1bfbf7004ebb
Reviewed-on: https://chromium-review.googlesource.com/747441
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Shivani Sharma <shivanisha@chromium.org>
Reviewed-by: Steven Holte <holte@chromium.org>
Cr-Commit-Position: refs/heads/master@{#513807}
[modify] https://crrev.com/78ff41eab058af57f7629b6b1beec3628f7ce719/components/subresource_filter/content/browser/subresource_filter_safe_browsing_activation_throttle.cc
[modify] https://crrev.com/78ff41eab058af57f7629b6b1beec3628f7ce719/components/subresource_filter/content/browser/subresource_filter_safe_browsing_activation_throttle.h
[modify] https://crrev.com/78ff41eab058af57f7629b6b1beec3628f7ce719/components/subresource_filter/content/browser/subresource_filter_safe_browsing_client_request.cc
[modify] https://crrev.com/78ff41eab058af57f7629b6b1beec3628f7ce719/tools/metrics/histograms/histograms.xml

Labels: -Restrict-View-Google
-rvg since uma links are OK
Could you link to the trace? I'd like to have a look at it before commenting on a possible solution. I'm really worried about the added complexity to the navigation API.
Oops for some reason I thought I posted it already. Here is one with lots of descehduling. The event that we want to postpone a bit is at 5,203.031 ms.
trace_usatoday_sbandroid.json
13.1 MB View Download
Note that I think the idea to post another task to get to the end of the queue could work, but I suspect it would be better to do it on the IO thread. However, it seems like for redirects, we fail to post after the related navigation task, so we may need to use a delayed task in that case.

As I said, my main worry with the post tasking solution is that it is inconsistent (and possibly fragile). I'd need to collect more traces but I'm nervous we'd be delaying sending the IPC to SafetyNet by another 10-20ms.

However, I would be willing to try it and measure the results if you aren't happy with hooking navigation further (I totally understand your hesitation).
Thanks! I won't have time to look at it today, will try tomorrow.
So I've given more thought to the issue, and here's where I am at:
- I'm not very happy with the new interface proposed for NavigationThrottles - it looks like observer methods which look weird on a throttle. We could potentially put them on WebContentsObserver (but I would prefer to avoid adding too many methods to WebContentsObserver).
- I have been wondering if moving your throttle higher up in the list of throttles that need to execute might make the issue less of a problem. This way we post the task to the IO thread while still having work to do on the UI thread, instead of just before posting the task to start on the IO thread.

What do you think?
Cc: vakh@chromium.org
Components: UI>Browser>SafeBrowsing
I think your suggestion might work, but I'm not confident. From traces I've seen it seems like actual throttles themselves are reasonably quick, so I'm not sure if things will change dramatically if we are first in line.

However, after looking at a recent histogram (SubresourceFilter.SafeBrowsing.CheckDispatchTime) which calculates wall time spent in our IO thread task, it makes me more and more concerned that this task just takes too long for it to be on the IO thread.  I have a tentative WIP CL [1] to offload it to a worker thread rather than running it on the IO thread on Android. I think that solution would make me happiest, but needs general approval from SB folks (+vakh).

[1]: https://chromium-review.googlesource.com/c/chromium/src/+/760550

Comment 20 by vakh@chromium.org, Nov 10 2017

Labels: SafeBrowsing-Triaged
Labels: Hotlist-EnamelAndFriendsFixIt
Labels: -Pri-3 M-64 Pri-2
Attached are two traces taken on my Pixel C with a custom Chrome build. One (iothread) shows current behavior. Note the subresource_filter message on the IO thread at 1974ms that blocks navigation by ~6ms. Also note that the call to BaseResourceThrottle::CheckUrl descheds and takes ~3.5ms. Note that the time spent in the base resource throttle (afaik) *does* block the URLRequest from going out to the network, since we call it after the URLRequest is created but before "Start()" is called.

The new behavior I am proposing is in the "workerthread" trace. In this patch, the safe browsing check is dispatched on a worker thread (using base::PostTask and base::MayBlock()).

I think for purposes of explanation the best metric to look at in these traces is the time from the subresource_filter task to the creation to the URL_REQUEST_START_JOB trace in net (mmenke correct me if I'm wrong).

In "iothread" trace, this takes ~11ms. In "workerthread" trace, it takes ~1ms. Unfortunately the time spent in the base resource throttle is not accounted for in Navigation.TimeToURLJobStart, but we may be able to see movements in other metrics.

trace_safe_browsing_iothread.json
1.2 MB View Download
trace_safe_browsing_workerthread.json
1.3 MB View Download
Project Member

Comment 23 by bugdroid1@chromium.org, Nov 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/aa65eacee6f10757a1fa35faefd890553834eafe

commit aa65eacee6f10757a1fa35faefd890553834eafe
Author: Charles Harrison <csharrison@chromium.org>
Date: Thu Nov 16 18:28:44 2017

Add metric to capture the time between request initiation and start

This CL adds a metric to measure the time spent from URLRequest creation
and when Start() is called on it. The purpose is to measure time spent
in various resource handlers before sending the request out to the net
stack.

Bug:  779914 
Change-Id: I8e462a7d9fb541de38589aeb926f81047a735d50
Reviewed-on: https://chromium-review.googlesource.com/771549
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Camille Lamy <clamy@chromium.org>
Reviewed-by: Matt Menke <mmenke@chromium.org>
Reviewed-by: Ilya Sherman <isherman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#517131}
[modify] https://crrev.com/aa65eacee6f10757a1fa35faefd890553834eafe/content/browser/loader/resource_loader.cc
[modify] https://crrev.com/aa65eacee6f10757a1fa35faefd890553834eafe/tools/metrics/histograms/histograms.xml

Blocking: 786173
Project Member

Comment 25 by bugdroid1@chromium.org, Nov 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a18b84fd0a3431feb0112a82f4882404a1c241f4

commit a18b84fd0a3431feb0112a82f4882404a1c241f4
Author: Charles Harrison <csharrison@chromium.org>
Date: Fri Nov 17 13:04:55 2017

[safe_browsing] Dispatch SafetyNet API request on a worker thread.

Recent data shows that the time it takes to dispatch the IPC to
Google Play Services is quite expensive.

Android canary data (so probably trends higher end), for the
subresource filter check (in navigation critical path):
50th percentile: 6ms
75th percentile: 10ms
95th percentile: 28ms
99th percentile: 100ms
Average: 11ms

Since this is directly in the critical path of navigation, it makes
sense not to hog the thread / deschedule it before the network request
goes out. It also looks like we hit some perf issues in other uses of
the API too. See the linked bug for some chrome://tracing traces.

Bug:  779914 
Change-Id: Ibc44d9f199715c3fbf7a2d66604e82ee3ccfc5e4
Reviewed-on: https://chromium-review.googlesource.com/760550
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Varun Khaneja <vakh@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#517389}
[modify] https://crrev.com/a18b84fd0a3431feb0112a82f4882404a1c241f4/components/safe_browsing/android/java/src/org/chromium/components/safe_browsing/SafeBrowsingApiBridge.java
[modify] https://crrev.com/a18b84fd0a3431feb0112a82f4882404a1c241f4/components/safe_browsing/android/java/src/org/chromium/components/safe_browsing/SafeBrowsingApiHandler.java
[modify] https://crrev.com/a18b84fd0a3431feb0112a82f4882404a1c241f4/components/safe_browsing/android/safe_browsing_api_handler_bridge.cc
[modify] https://crrev.com/a18b84fd0a3431feb0112a82f4882404a1c241f4/components/safe_browsing/android/safe_browsing_api_handler_bridge.h

Project Member

Comment 26 by bugdroid1@chromium.org, Nov 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/bbfe4e4de4a8ad10a352d1a2aa48c997bfdae42d

commit bbfe4e4de4a8ad10a352d1a2aa48c997bfdae42d
Author: Charles Harrison <csharrison@chromium.org>
Date: Mon Nov 20 15:39:54 2017

Adjust Navigation metrics to account for synchronous time in resource handlers

This CL slightly alters the timestamp in Navigation.TimeToURLJobStart to include
synchronous time spent before (potentially) starting the request. This change
makes the metric more precisely capture the time until the request reached the
network stack.

Bug:  779914 
Change-Id: I0eaf1fe7380cb6a14a2fbe328f431e35bb82efa3
Reviewed-on: https://chromium-review.googlesource.com/774938
Reviewed-by: Camille Lamy <clamy@chromium.org>
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#517829}
[modify] https://crrev.com/bbfe4e4de4a8ad10a352d1a2aa48c997bfdae42d/content/browser/loader/navigation_url_loader_impl_core.cc
[modify] https://crrev.com/bbfe4e4de4a8ad10a352d1a2aa48c997bfdae42d/content/browser/loader/navigation_url_loader_impl_core.h
[modify] https://crrev.com/bbfe4e4de4a8ad10a352d1a2aa48c997bfdae42d/content/browser/loader/navigation_url_loader_unittest.cc

Blocking: 792524
Blocking: -786173
Blockedon: 786173
Labels: -Hotlist-EnamelAndFriendsFixIt
Status: Fixed (was: Started)

Sign in to add a comment