
Issue 606022

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




SafeBrowsing hash computation/lookup should be optimized

Project Member Reported by joenotcharles@chromium.org, Apr 22 2016

Issue description

When looking up a URL on a SafeBrowsing list, it must first be converted to a list of full hashes using UrlToFullHashes. This process is often called on the critical path for resource loading, so it needs to be fast. It's also easy to call it multiple times for the same URL, since SafeBrowsing lookups are triggered from different parts of the resource loader. Passing the full set of hashes around to every site that might need them would be complicated, so it would be a good idea to optimize UrlToFullHashes (and SafeBrowsing lookup in general) as much as possible.
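
For context, a rough sketch of the kind of work this involves (illustrative only; GenerateVariants and HashVariants are made-up names, and the real code lives in components/safe_browsing_db/util.cc):

  // Rough sketch, not the actual Chromium code: expand the URL into the
  // host-suffix x path-prefix combinations the SafeBrowsing protocol checks,
  // then SHA-256 each one.
  #include <openssl/sha.h>

  #include <array>
  #include <cstddef>
  #include <string>
  #include <vector>

  std::vector<std::string> GenerateVariants(
      const std::vector<std::string>& host_suffixes,
      const std::vector<std::string>& path_prefixes) {
    std::vector<std::string> variants;
    for (const std::string& host : host_suffixes)
      for (const std::string& path : path_prefixes)
        variants.push_back(host + path);  // e.g. "c.com" + "/1/" -> "c.com/1/"
    return variants;
  }

  // One full SHA-256 per variant -- this is the cost being discussed.
  std::vector<std::array<unsigned char, SHA256_DIGEST_LENGTH>> HashVariants(
      const std::vector<std::string>& variants) {
    std::vector<std::array<unsigned char, SHA256_DIGEST_LENGTH>> hashes(
        variants.size());
    for (size_t i = 0; i < variants.size(); ++i)
      SHA256(reinterpret_cast<const unsigned char*>(variants[i].data()),
             variants[i].size(), hashes[i].data());
    return hashes;
  }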

From https://bugs.chromium.org/p/chromium/issues/detail?id=576750:

> Absolute numbers are small: on my high-end Windows desktop each call to 
> UrlToFullHashes takes around 0.04 ms, and Josh Karlin posted a tracelog 
> showing it taking 0.02 ms. But they're called from the IO thread on the 
> critical path for resource loading, so that's overhead for every 
> resource. Including resources loaded from cache, where the network 
> overhead doesn't dominate these numbers.

AFAIK there's not currently a critical issue because the only places I know of where UrlToFullHashes was called wastefully have been eliminated. But I'd still like to optimize it further.

Some possible approaches were discussed in https://groups.google.com/a/chromium.org/forum/#!topic/net-dev/nWoAaIxihVw:

Matt Menke:
> UrlToFullHashes calculates the SHA256 of entire URLs multiple times.  
> All URLs are substrings of the original URL.  SHA256 supports doing 
> incremental string updates...  I don't suppose we could optimize 
> UrlToFullHashes to take advantage of this incremental updating support?

Matt Menke:
> We could also look into improving safebrowsing to use fewer SHA2's.
>
> Given a.b.c.d.e.com/1/2/3/4/, it looks to me like we  calculate 
> something like 7 * 5 = 35 SHA-2s, which seems pretty crazy to me.  Even 
> just not trying both host/path/ and host/path would almost halve that, 
> in the worst case.

Josh Karlin:
> For instance, we could have a separate database that has only TLDs for bad urls, and only check the 35 variants if the TLD is in the database.

Ben Maurer:
> Even fancier could be a trie. For example, if foo.com/a/b/* is bad, 
> you'd have entries
>
> hash(foo.com) -- is bad: false  bad children: true
> hash(foo.com/a/) -- is bad: false  bad children: true
> hash(foo.com/a/b/) -- is bad: true  bad children: true
>
> That way, as soon as you do a lookup where there are no bad children,
> further variants can be avoided 
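
A minimal sketch of that early-exit lookup, assuming each entry carries the two flags described above (TrieEntry and LookupWithEarlyExit are made-up names; real entries would be keyed on hashes rather than plain strings):

  // Illustrative sketch of the "bad children" idea, not an existing Chromium
  // data structure.
  #include <string>
  #include <unordered_map>
  #include <vector>

  struct TrieEntry {
    bool is_bad = false;
    bool has_bad_children = false;
  };

  // `variants` must be ordered shortest-first, e.g.
  // {"foo.com", "foo.com/a/", "foo.com/a/b/c"}.
  bool LookupWithEarlyExit(const std::unordered_map<std::string, TrieEntry>& db,
                           const std::vector<std::string>& variants) {
    for (const std::string& v : variants) {
      auto it = db.find(v);
      if (it == db.end())
        return false;  // No entry at this level: nothing bad at or below it.
      if (it->second.is_bad)
        return true;   // Exact match on a bad entry.
      if (!it->second.has_bad_children)
        return false;  // Safe subtree: skip all remaining (longer) variants.
    }
    return false;
  }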

Josh Karlin:
> In SafeBrowser's WillStartRequest we now spend a total of ~.16ms per 
> request (~16ms total for the page). ~.03ms of that is now in 
> UrlToFullHashes. Roughly ~.06ms is spent per call (we call it twice) to
> SafeBrowsingDatabaseNew::PrefixSetContainsUrlHashes. So it's better to 
> optimize PrefixSetContainsUrlHashes at this point if there is any low 
> hanging fruit.

Caching results is probably not the answer. Carlos Knippschild:
> From previous discussions about safe browsing I understood we must be 
> very careful when caching its results. The safe browsing database is 
> updated very often (every 15 minutes or so?) because sites get hacked 
> all the time and one that was initially safe might become a security 
> threat.
>
> If we should go that way we'll need safe browsing folks opinion on this 
> and I'm guessing we'd at least need some sort of invalidation
> strategy.
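
Purely to make "invalidation strategy" concrete, a hypothetical sketch (VerdictCache is a made-up name, and as the quote above says, caching may well not be worth it) that simply drops all cached verdicts whenever the local database is updated:

  // Hypothetical sketch only: a verdict cache that is wiped on every
  // SafeBrowsing database update (roughly every 15 minutes, per the quote).
  #include <string>
  #include <unordered_map>

  class VerdictCache {
   public:
    // Call when the SafeBrowsing database finishes an update.
    void OnDatabaseUpdated() { cache_.clear(); }

    // Returns true and fills |is_bad| if |url| was looked up since the last
    // database update.
    bool Get(const std::string& url, bool* is_bad) const {
      auto it = cache_.find(url);
      if (it == cache_.end())
        return false;
      *is_bad = it->second;
      return true;
    }

    void Put(const std::string& url, bool is_bad) { cache_[url] = is_bad; }

   private:
    std::unordered_map<std::string, bool> cache_;
  };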
 

Comment 1 by mmenke@chromium.org, Apr 22 2016

Another thing:  Does this have to be a cryptographic hash?  In the case of collisions, I believe we contact a server to figure out if a URL is really blacklisted, so collisions do incur overhead, but I'm not sure that warrants the overhead of SHA-2.
> I don't suppose we could optimize 
> UrlToFullHashes to take advantage of this incremental updating support? 

That seems doable, since all the variations are substrings. 
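
A minimal sketch of what that could look like with OpenSSL, assuming the path-prefix variants of a given host extend a shared prefix (HashWithSharedPrefix is a made-up name; this is not the approach that landed):

  // Sketch only: a SHA256_CTX can be copied by value, so the shared prefix is
  // fed through SHA-256 once and the state is "forked" for each longer variant.
  #include <openssl/sha.h>

  #include <array>
  #include <cstddef>
  #include <string>
  #include <vector>

  // prefix = "foo.com/", suffixes = {"", "a/", "a/b/"} hashes "foo.com/",
  // "foo.com/a/" and "foo.com/a/b/" while hashing the prefix bytes only once.
  std::vector<std::array<unsigned char, SHA256_DIGEST_LENGTH>> HashWithSharedPrefix(
      const std::string& prefix, const std::vector<std::string>& suffixes) {
    SHA256_CTX base;
    SHA256_Init(&base);
    SHA256_Update(&base, prefix.data(), prefix.size());

    std::vector<std::array<unsigned char, SHA256_DIGEST_LENGTH>> out(suffixes.size());
    for (size_t i = 0; i < suffixes.size(); ++i) {
      SHA256_CTX ctx = base;  // Fork the partially-updated state.
      SHA256_Update(&ctx, suffixes[i].data(), suffixes[i].size());
      SHA256_Final(out[i].data(), &ctx);
    }
    return out;
  }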

> For instance, we could have a separate database that has only TLDs for bad urls, and only check
> the 35 variants if the TLD is in the database. 
 
This would require pushing another database to each client, and its size would be similar to the original one, since in many cases the host is blacklisted w/o any path. The cost of pushing this data would outweigh the benefit in CPU, I suspect.

> Even fancier could be a trie. For example, if foo.com/a/b/* is bad, 
> you'd have entries
> 
> hash(foo.com) -- is bad: false bad children: true 
> hash(foo.com/a/) -- is bad: false bad children: true 
> hash(foo.com/a/b/) -- is bad: true bad children: true 
> 
> That way, as soon as you do a lookup where there are no bad children, 
> further variants can be avoided
 
This would require pushing even more data to Chrome, since it'd have to have an entry for each level of the trie rather than just the final leaf. With the current data model, there's no way for Chrome to know that "all of http://foo.com/* is safe."

> Another thing: Does this have to be a cryptographic hash? 
This is dictated by the safe browsing network API, so we can't change it in Chrome w/o changing the API. 

nparker:

If most of the entries are just the host, then we can add entries for cases where only longer paths exist. E.g., only foo.com/a exists, so also add foo.com^ (or some better character). Then your lookup algorithm becomes: if (host or host^ is in the database) { scan 35 more variants } else { return false }.

So now we're down to ~2 lookups on average instead of 35.
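
A hypothetical sketch of that two-stage lookup (Hash, HostMightBeBad and the check_all_variants callback are placeholders, not Chromium APIs, and '^' stands in for whatever marker character would actually be chosen):

  // Hypothetical sketch of the two-stage lookup described above.
  #include <cstddef>
  #include <functional>
  #include <string>
  #include <unordered_set>

  using HashValue = size_t;  // Stand-in for a real full hash / prefix.

  HashValue Hash(const std::string& s) { return std::hash<std::string>{}(s); }

  bool HostMightBeBad(const std::unordered_set<HashValue>& db,
                      const std::string& host) {
    // "host^" is the sentinel entry added when only longer paths (e.g.
    // foo.com/a) are blacklisted, so a miss on both is conclusive.
    return db.count(Hash(host)) || db.count(Hash(host + "^"));
  }

  bool IsUrlBad(const std::unordered_set<HashValue>& db, const std::string& host,
                const std::function<bool()>& check_all_variants) {
    if (!HostMightBeBad(db, host))
      return false;                // ~2 hash lookups in the common (safe) case.
    return check_all_variants();   // Otherwise run the full 35-variant scan.
  }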

Status: Available (was: Untriaged)
OK, functionally I think that'd work, but you'd need to weigh the performance improvement against the increased complexity in the client and server, and demonstrate that there isn't a comparable client-side-only optimization. This would require switching to a new set of blacklists, which is non-trivial.

For any of these, we need some numbers to demonstrate where the problem is, and if there is a problem. 

Comment 5 by vakh@chromium.org, May 6 2016

Owner: joenotcharles@chromium.org
joenotcharles@ -- do you have any performance data that shows it's worth the code complexity?

Comment 6 by vakh@chromium.org, May 6 2016

Labels: SafeBrowsing-Triaged
On my Z840 with a release build I'm seeing ~11ms spent in SafeBrowsingDatabaseNew::PrefixSetContainsUrlHashes to load 100 URL requests. A typical page might have ~350 resources or so, so that's ~38.5ms (~0.11ms per request × ~350) for a page load on a fast machine.

Comment 8 by bmau...@fb.com, May 9 2016

Something to consider here from a site author perspective -- right now we take pains to avoid increasing the number of resources on the page -- spriting images, packaging things -- even though http2 is the majority of our traffic. We've seen that there's a noticeable CPU cost to adding 2x the number of resources on the page.

Every millisecond counts here.

I also see a comment in LocalSafeBrowsingDatabaseManager::HandleOneCheck that may be relevant:

  // TODO(shess): GetHashSeverestThreatListType() contains a loop,
  // GetUrlSeverestThreatListType() a loop around that loop.  Having another
  // loop out here concerns me.  It is likely that SAFE is an expected outcome,
  // which means all of those loops run to completion.  Refactoring this to
  // generate a set of sorted items to compare in sequence would probably
  // improve things.
  //
  // Additionally, the set of patterns generated from the urls is very similar
  // to the patterns generated in ContainsBrowseHashes() and other database checks,
  // which are called from this code.  Refactoring that across the checks could
  // interact well with batching the checks here.

I noticed that SBFullHashForString, called twice from UrlToFullHashes, ends up doing "new SecureHashSHA256", using the resulting object, and deleting it immediately. So I replaced it with direct calls to OpenSSL SHA256_*, using a stack-allocated context. My first try at benchmarking this suggested it would save .002 msec per call - not nothing, but maybe not worth the increased code complexity. But when I tried again a few hours later I couldn't reproduce that improvement (too much variance).
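
Roughly the shape of that change, as an illustrative sketch assuming OpenSSL's SHA256_* API (the actual patch is linked below and may differ):

  // Sketch of a stack-allocated hash, avoiding the heap-allocated
  // crypto::SecureHash object described above.
  #include <openssl/sha.h>

  #include <array>
  #include <string>

  std::array<unsigned char, SHA256_DIGEST_LENGTH> HashOnStack(
      const std::string& input) {
    SHA256_CTX ctx;     // Stack-allocated: no heap allocation and no virtual
    SHA256_Init(&ctx);  // dispatch through a SecureHash object.
    SHA256_Update(&ctx, input.data(), input.size());
    std::array<unsigned char, SHA256_DIGEST_LENGTH> digest;
    SHA256_Final(digest.data(), &ctx);
    return digest;
  }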

I attached the patch at http://crrev.com/2007983004 in case it becomes useful, but I don't plan to pursue this any more right now.

Comment 11 by Project Member bugdroid1@chromium.org, May 25 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e625cb78079a7c15a944cff3fe89b28a3779b697

commit e625cb78079a7c15a944cff3fe89b28a3779b697
Author: joenotcharles <joenotcharles@chromium.org>
Date: Wed May 25 23:26:42 2016

Only call CanonicalizeUrl once from UrlToFullHashes

BUG= 606022 

Review-Url: https://codereview.chromium.org/2010713003
Cr-Commit-Position: refs/heads/master@{#396035}

[modify] https://crrev.com/e625cb78079a7c15a944cff3fe89b28a3779b697/components/safe_browsing_db/util.cc

Comment 12 by vakh@chromium.org, Jun 8 2016

Labels: Hotlist-Fixit-Triaged

Comment 13 by vakh@chromium.org, Sep 9 2016

Status: Assigned (was: Available)
Bulk assigning bugs with identified owners.

Comment 14 by vakh@chromium.org, Mar 10 2017

Status: Fixed (was: Assigned)
Re #c10: Since this is not going to be pursued anymore and the PVer4 code -- which will launch to Stable within a few weeks -- uses UrlToFullHashes differently, I am going to mark this as WontFix.
