
Issue 678334

Starred by 4 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug




Improve GetRegistryLengthImpl() performance by seeding the DAFSA in reverse

Project Member Reported by nick@chromium.org, Jan 4 2017

Issue description

GetRegistryLengthImpl() is a function in registry_controlled_domain.cc that
determines the effective TLD of a URL. By means of the following algorithmic
improvement, we could make it faster.

Currently, the core of the function is a loop that works like this:

 s := hostname;
 while (s is not empty) {
   if (LookupInFixedSet(s, "./effective_tld_names.gperf")) {
     return s;
   } else if (s.find('.') == npos) {
     break;  // no more labels; no registry found
   } else {
     s := s.substring(s.find('.') + 1);  // trim chars up to the next dot
   }
 }
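
As a minimal, self-contained C++ sketch of that loop (the set-backed
LookupInFixedSet() stub, its contents, and FindRegistryForward() are
illustrative stand-ins for the real DAFSA lookup in
net/base/lookup_string_in_fixed_set.cc, not the actual interface):

 #include <cstddef>
 #include <functional>
 #include <iostream>
 #include <set>
 #include <string>
 #include <string_view>

 // Toy stand-in for the DAFSA-backed lookup; the entries are illustrative.
 bool LookupInFixedSet(std::string_view candidate) {
   static const std::set<std::string, std::less<>> kRegistry = {
       "com", "www.ck", "www.ro"};
   return kRegistry.find(candidate) != kRegistry.end();
 }

 // Forward-scanning search, mirroring the loop above: strip the leftmost
 // label until the remaining suffix is in the fixed set.
 std::string_view FindRegistryForward(std::string_view hostname) {
   std::string_view s = hostname;
   while (!s.empty()) {
     if (LookupInFixedSet(s))
       return s;                  // e.g. "com" for "www.google.com".
     const std::size_t dot = s.find('.');
     if (dot == std::string_view::npos)
       break;                     // Last label failed; no registry found.
     s.remove_prefix(dot + 1);    // Trim up to and including the next dot.
   }
   return {};
 }

 int main() {
   std::cout << FindRegistryForward("www.google.com") << "\n";  // "com"
 }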

Under the hood, the effective_tld_names.gperf dictionary is statically compiled
into a compact DAFSA data structure that makes each lookup fast. However, we
can eliminate the while loop (and thus multiple consultations of the DAFSA)
altogether, by turning this into a "find longest suffix from a fixed set"
query, via a different construction of the same data structure:

 1. Seed the DAFSA with the reversed string values of effective_tld_names.gperf.
 2. Feed the characters of the hostname string to the dafsa in reverse.
 3. Rather than having a "lookup string in fixed set" DAFSA interface, we make
    the DAFSA interface instead be "return longest complete match". By
    'complete', we mean that a matching substring of the input needs to
    stop at a dot boundary, or the end of the input.
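
A sketch of what steps 1-3 could look like, again self-contained and
illustrative: InReversedSet() and FindRegistryReverse() are hypothetical names,
the reversed-string set stands in for the reverse-seeded DAFSA, and the growing
reversed buffer exists only to query that stand-in set -- a real DAFSA walker
would consume characters one at a time with no copy, stopping as soon as no
edge matches:

 #include <cstddef>
 #include <functional>
 #include <iostream>
 #include <set>
 #include <string>
 #include <string_view>

 // Step 1 (illustrative): registry entries stored reversed.
 bool InReversedSet(std::string_view reversed_candidate) {
   static const std::set<std::string, std::less<>> kReversedRegistry = {
       "moc", "kc.www", "or.www"};
   return kReversedRegistry.find(reversed_candidate) !=
          kReversedRegistry.end();
 }

 // Steps 2 and 3: feed the hostname to the matcher in reverse, remembering
 // the longest 'complete' match, i.e. one ending at a dot boundary or at
 // the start of the hostname.
 std::string_view FindRegistryReverse(std::string_view hostname) {
   std::string reversed;
   std::size_t best_len = 0;  // Longest complete match seen so far.
   for (std::size_t i = hostname.size(); i-- > 0;) {
     reversed.push_back(hostname[i]);
     const bool at_boundary = (i == 0 || hostname[i - 1] == '.');
     if (at_boundary && InReversedSet(reversed))
       best_len = reversed.size();
   }
   return hostname.substr(hostname.size() - best_len);
 }

 int main() {
   std::cout << FindRegistryReverse("www.google.com") << "\n";  // "com"
 }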

As an illustrative example, 'www.google.com' is an interesting input to
think about, since it has collisions both in the forward and reverse directions.

In the current (forward-scanning) implementation, because 'www.ck' and 'www.ro'
are in the list of effective TLDs, the initial lookup for 'www.google.com' needs
to scan as far as the 'g' before the DAFSA lookup can fail for the first
iteration. Then there's a forward scan to find the dot, and a second iteration
(input 'google.com') needs to scan to the dot to eliminate the 'google' TLD
branch. Then there's another scan forward to find the dot, and a third iteration
with input 'com', which returns a match.

With the proposed backwards-scanning DAFSA, on input 'www.google.com', you'd
need to scan as far as 'moc.elgoog.' to know that .com is the TLD, because
'withgoogle.com' is in the ETLD registry.

Including the final scan to extend the '.com' registry to 'google.com', I count
32 character reads going forward vs. 17 going backward; but if you count just
the DAFSA iterations (which are presumably more expensive than the
find-next-dot scans, due to memory locality), it would be something like 15
vs. 11.

This idea originated in https://codereview.chromium.org/2568133007

csharrison says this function has been a hot spot historically, though
adding fast-paths in the calling code has reduced its call frequency.

 
Comment 2 by rsleevi@chromium.org

Cc: -sleevi@google.com rsleevi@chromium.org
That simplification isn't correct, because there are a variety of properties that still need to be considered to implement the logic described at https://publicsuffix.org/list/ - particularly wildcards and exceptions.

Granted, I may have misparsed your suggestion, but consider a domain like
pref.hokkaido.jp

The PSL entries for this (that would be applicable) are
jp
*.jp
*.hokkaido.jp
!pref.hokkaido.jp

From what I can understand of your description, this would result in worse performance/more comparisons. But perhaps I've misunderstood.

Comment 3 by nick@chromium.org, Jan 5 2017

Thanks for pointing out the wildcard and exception cases, which do add complexity to this, but I don't think they break the idea.

Steps 1 and 2 are safe because they just apply a bijective transformation to the index and the inputs. The wildcard and exception values don't affect the DAFSA graph; make_dafsa.py doesn't turn wildcards into state-machine states, but instead stores them as a kind of payload/flag.

Consider what would happen if we did only steps 1 and 2, and left the label-trimming while loop of GetRegistryLengthImpl() intact. We would see the following call order:

 - LookupInFixedSet("moc.elgoog.www") = -1
 - LookupInFixedSet("moc.elgoog") = -1
 - LookupInFixedSet("moc") = 0

The idea of step 3, in essence, is that because a DAFSA consumes the input string in order, the second and third calls above ("moc.elgoog" and "moc") are redundant computation: we had already computed them as intermediate results while computing LookupInFixedSet("moc.elgoog.www").

Reading the formal algorithm on https://publicsuffix.org/list/, I'm pretty sure that we would only need a single loop parameter to track the currently prevailing rule, and that this could handle the wildcards and exceptions. Something like the following...

===
before the loop, initialize:
  int prevailing_rule = 0;
  bool prevailing_rule_is_wildcard = true;

Then, inside the loop, each time we encounter the end of a label in the input, we'd assign code = GetReturnValue() -- i.e., what the result would be if we stopped there -- and go through the following cases:
 - if (code & kDafsaPrivateRule) and we're ignoring private rules, set code = -1.
 - if (code != -1 && code & kDafsaExceptionRule), it's an exception rule, and we're basically done, per step 5 of the formal algorithm. Trim the most recent label and return. (Note: this is actually a different behavior from ToT if the dictionary contained, say, both !pref.hokkaido.jp and a longer rule *.www.pref.hokkaido.jp, but I think such collisions don't occur in the dictionary today.)
 - if it's a wildcard or normal result (code != -1), it must have more labels than the previous prevailing_rule. So replace prevailing_rule with the current position, and update prevailing_rule_is_wildcard to (code & kDafsaWildcardRule).
 - if there's no match (code == -1) and |prevailing_rule_is_wildcard| is true, then set prevailing_rule to the current position, and set prevailing_rule_is_wildcard to false.

When we've hit either the end of the string or the final state of the DAFSA, we know we've identified the prevailing rule (i.e. the matching rule with the most labels) and that there's no matching exception rule.
===
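
Here's a compilable sketch of that bookkeeping. PrevailingRuleLabels() and
BoundaryResult are hypothetical names, the per-boundary result codes are
supplied as test data standing in for GetReturnValue(), and the kDafsa* bit
values are assumptions for illustration; only the loop logic is the point:

 #include <iostream>
 #include <vector>

 // Flag names from the description above; the bit values are assumed here.
 constexpr int kDafsaExceptionRule = 1;
 constexpr int kDafsaWildcardRule = 2;
 constexpr int kDafsaPrivateRule = 4;

 // One entry per label boundary reached while consuming the reversed
 // hostname: how many labels have been consumed so far, and what
 // GetReturnValue() would yield there (-1 for no match).
 struct BoundaryResult {
   int labels_consumed;
   int code;
 };

 // Applies the prevailing-rule bookkeeping described above; returns the
 // number of labels in the registry.
 int PrevailingRuleLabels(const std::vector<BoundaryResult>& boundaries,
                          bool include_private) {
   int prevailing_rule = 0;  // Labels covered by the best rule so far.
   bool prevailing_rule_is_wildcard = true;
   for (const BoundaryResult& b : boundaries) {
     int code = b.code;
     if (code != -1 && (code & kDafsaPrivateRule) && !include_private)
       code = -1;  // Treat private rules as misses when ignoring them.
     if (code != -1 && (code & kDafsaExceptionRule))
       return b.labels_consumed - 1;  // Exception: trim the newest label.
     if (code != -1) {
       // Normal or wildcard rule; it has more labels than the previous one.
       prevailing_rule = b.labels_consumed;
       prevailing_rule_is_wildcard = (code & kDafsaWildcardRule) != 0;
     } else if (prevailing_rule_is_wildcard) {
       // The label under a wildcard extends the match by one label.
       prevailing_rule = b.labels_consumed;
       prevailing_rule_is_wildcard = false;
     }
   }
   return prevailing_rule;
 }

 int main() {
   // pref.hokkaido.jp against {jp, *.jp, *.hokkaido.jp, !pref.hokkaido.jp}:
   const std::vector<BoundaryResult> boundaries = {
       {1, kDafsaWildcardRule},   // "jp" (also carries *.jp)
       {2, kDafsaWildcardRule},   // "hokkaido.jp" (carries *.hokkaido.jp)
       {3, kDafsaExceptionRule},  // "pref.hokkaido.jp" (the !-exception)
   };
   // Prints 2: the registry is "hokkaido.jp", per the exception rule.
   std::cout << PrevailingRuleLabels(boundaries, true) << "\n";
 }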

I noticed that the publicsuffix.org doc talks about multiple wildcards, or wildcards in the not-leftmost position (*.*.mil, or state.*.us). Our current implementation doesn't seem capable of handling these at all, and they don't occur in practice in effective_tld_names.dat, so I'm assuming we can keep ignoring them.

As for whether this is worse perf / more comparisons, we'd of course need to measure. With the existing algorithm, unless the hostname has an invalid TLD, you always have to process every character of the input at least once, and some characters are looked at multiple times; whereas consuming in reverse allows you to sometimes skip a run of leftmost characters (e.g. 'www') in the string. So that, at least, feels like it ought to be an efficiency gain for some inputs.

Are there possible efficiency losses? Sure. It's possible that while we feed fewer characters into the DAFSA on average, we wind up traversing more bytes of the dafsa overall. And it's pretty much guaranteed that it'll get slower on some subset of inputs that happen to have a particularly fast path through the graph today.

A starting point might be just to see if the generated dafsa blob gets bigger or smaller on reversed inputs. I'd imagine that smaller is likely faster.
Comment 4 by rsleevi@chromium.org

Multi-wildcard is https://bugs.chromium.org/p/chromium/issues/detail?id=459802 , and it seems you also tripped up on the other issue (exception rules trumping longer, more specific rules). The lack of multi-wildcards has prevented several recent PSL inclusion requests (notably, Amazon), so it's something that would be good to fix if we're going to be mucking about.

As mentioned, this function is perf sensitive (we took a 25% perf-hit switching to the DAFSA, but that gave us 10x space savings) - it's not just the DOM routines, but also every network request/cookie. It's hard to evaluate what the up-front cost (of reversing the string) is versus the potential performance savings.

I don't want to be discouraging; I just want to make sure that, if we do switch it up, the change is aggressively profiled as well.

Comment 5 by nick@chromium.org, Jan 5 2017

rsleevi: We don't need to reverse the string or create any copies. The idea is just to start at the end of the hostname StringPiece and work forward. operator-- instead of operator++.
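
For example (using std::string_view here in place of base::StringPiece, and a
hypothetical FeedReversed() for illustration), the reverse feed can be just an
index walking backwards over the same buffer:

 #include <cstddef>
 #include <iostream>
 #include <string_view>

 // Feed |hostname| to a reverse-seeded matcher back-to-front, without
 // building a reversed copy of the string.
 void FeedReversed(std::string_view hostname) {
   for (std::size_t i = hostname.size(); i-- > 0;) {
     // hostname[i] would be handed to the DAFSA walker here.
     std::cout << hostname[i];
   }
   std::cout << "\n";  // Prints "moc.elgoog.www" for "www.google.com".
 }

 int main() {
   FeedReversed("www.google.com");
 }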

"25% perf-hit switching to the DAFSA" how was that measured? Is there a benchmark?

The DAFSA is a tad smaller on the reversed strings: 40830 bytes, down from 41441.

Multi-wildcards on the left seems like a simple extension of the current system. What exactly did Amazon need?
Comment 6 by rsleevi@chromium.org

https://cs.chromium.org/chromium/src/net/extras/sqlite/sqlite_persistent_cookie_store.cc?rcl=1483575682&l=628 is the main hotspot that affects startup & page load times.

The test involved using a sample profile (of ~2400 cookies) and iterating several runs on that.

The major metric to watch is Cookie.TimeParseDomains if you wanted to do A/B testing of this.
Status: Available (was: Untriaged)

Comment 8 by nick@chromium.org, Jan 18 2017

Owner: nick@chromium.org
Status: Started (was: Available)
I'll implement this.

I've done a good bit of prototyping already, and am convinced it's a significant speedup. Additionally, it is compatible with exception/wildcard rules, and could be extended to support multi-wildcard (*.*.google.com) and mid-wildcard (borg.*.google.com) rules, if anybody requires them -- the idea is that we'd encode '*' in the dafsa's language directly, rather than as a result code, and since we're now decoding the dafsa incrementally, we can notice when we're at '*' edges and allow this to match an entire label, after ruling out a matching exception rule.

I have been testing against alexa1-10000-urls.json, in random order. Reversing the strings alone (without changing the algorithm) definitely results in more comparisons overall, but if we reverse the strings and eliminate the "chop first label" loop in GetRegistryLengthImpl, then the number of DAFSA operations is reduced by more than half: the average number of calls to GetNextOffset() drops from 89 to 40, and the benchmark time drops from 1090ms to 714ms.

I've noticed that both the serialized DAFSA size and the performance on likely inputs seem surprisingly sensitive to the order that things are listed in the gperf file. So it seems particularly important to use a large corpus in testing, since it's possible for small perturbations of the graph to have a large effect on performance.

To illustrate the sensitivity: the initial node in the DAFSA has 36 edges that appear in reverse alphabetical order: z-a, 9-0. This ordering makes 'www' queries fast, but 'com' queries probably unnecessarily slow. But why is it in this order? There seems to be no especially deliberate reason -- it's an artifact of the inputs being sorted, the fact that edges must be emitted in order of increasing offset, and the fact that the first few layers of the graph aren't very constrained by the topological sort.

Thus, there's a separate optimization opportunity here: we could do a smarter re-ordering of these nodes to, for example, move the nodes involved in '.com' queries to lower offsets than those involved in '.xfinity' queries. Even just using suffix/label/letter frequencies in the PSL itself could probably get us most of the way there, since 'com' and 'org' entries from the private registries seem to be most numerous there, and may correlate with the overall popularity of those domains. Alternatively, we could make changes to the encoding so that e.g. we can binary-search the edge chars, or do a trie-type jump table for the really wide nodes -- I'm not sure that's necessary; it would require a total rewrite of make_dafsa.py, and it would almost certainly inflate the encoded size (by about 5K, perhaps).

Comment 9 by bugdroid1@chromium.org, Feb 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0b80485a436c8c881164729adb8484868ecd065d

commit 0b80485a436c8c881164729adb8484868ecd065d
Author: nick <nick@chromium.org>
Date: Thu Feb 16 20:26:23 2017

Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file.

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the CHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Same: 194ms

BUG=678334

Review-Url: https://codereview.chromium.org/2641953009
Cr-Commit-Position: refs/heads/master@{#451085}

[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set.cc
[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set.h
[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set_unittest.cc


Comment 10 by bugdroid1@chromium.org, Feb 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6b6710cc340138ce79eba7a6e8d371eeacf58f56

commit 6b6710cc340138ce79eba7a6e8d371eeacf58f56
Author: nick <nick@chromium.org>
Date: Fri Feb 17 01:12:51 2017

Add some test strings to registry_controlled_domain_unittest.cc

These strings will be interesting values if we reverse the lookup order,
which occurs in the next patch set of this series.

BUG=678334

Review-Url: https://codereview.chromium.org/2648213003
Cr-Commit-Position: refs/heads/master@{#451165}

[modify] https://crrev.com/6b6710cc340138ce79eba7a6e8d371eeacf58f56/net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc

Owner: ----
Status: Available (was: Started)
This issue has been marked as started, but has no owner. Making available.
