Improve GetRegistryLengthImpl() performance by seeding the DAFSA in reverse
Issue description
GetRegistryLengthImpl() is a function in registry_controlled_domain.cc that
determines the effective TLD of a URL. The following algorithmic improvement
could make it faster.
Currently, the core of the function is a loop that works like this:
s := hostname;
while (s) {
  if (LookupInFixedSet(s, "./effective_tld_names.gperf")) {
    return s;
  } else {
    s := s.substring(s.find('.') + 1);  // trim chars up to the next dot
  }
}
Under the hood, the effective_tld_names.gperf dictionary is statically compiled
into a compact DAFSA data structure that makes each individual lookup efficient.
However, we can eliminate the while loop (and thus the repeated consultations of
the DAFSA) altogether, by turning this into a "find the longest suffix from a
fixed set" problem, via a different construction of the same data structure:
1. Seed the DAFSA with the reversed string values of effective_tld_names.gperf.
2. Feed the characters of the hostname string to the DAFSA in reverse.
3. Rather than having a "lookup string in fixed set" DAFSA interface, make
   the DAFSA interface instead be "return longest complete match". By
   'complete', we mean that a matching substring of the input needs to
   stop at a dot boundary, or at the end of the input. (A sketch of this
   interface follows the list.)
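To make step 3 concrete, here is a minimal, self-contained sketch of a "return
longest complete match" routine. This is illustrative only, not Chromium code:
ToyReversedSet is a made-up stand-in for the compiled DAFSA, kept simple at the
cost of memory; a real DAFSA answers the same two questions in constant space
by walking the graph.
===
#include <cstddef>
#include <set>
#include <string>

// Toy stand-in for the DAFSA seeded with reversed rules (step 1). Advance()
// answers "can the characters consumed so far still lead to a rule?"; and
// IsMatch() answers "do they spell a complete rule?".
class ToyReversedSet {
 public:
  explicit ToyReversedSet(const std::set<std::string>& rules) {
    for (const std::string& rule : rules) {
      std::string rev(rule.rbegin(), rule.rend());
      rules_.insert(rev);
      for (size_t n = 1; n <= rev.size(); ++n)
        prefixes_.insert(rev.substr(0, n));
    }
  }
  bool Advance(char c) {
    consumed_.push_back(c);
    return prefixes_.count(consumed_) > 0;
  }
  bool IsMatch() const { return rules_.count(consumed_) > 0; }

 private:
  std::string consumed_;
  std::set<std::string> rules_;
  std::set<std::string> prefixes_;
};

// Returns the length of the longest rule that is a suffix of |host| and that
// starts at a dot boundary or at the start of the input ("complete" in the
// sense of step 3). Feeds |host| to the set in reverse (step 2).
size_t LongestCompleteMatch(ToyReversedSet set, const std::string& host) {
  size_t longest = 0;
  size_t consumed = 0;
  for (size_t i = host.size(); i != 0; --i) {
    if (!set.Advance(host[i - 1]))
      break;  // No longer rule can possibly match; stop early.
    ++consumed;
    // A match only counts if it begins at a dot boundary or at the start
    // of the hostname.
    bool at_boundary = (i - 1 == 0) || (host[i - 2] == '.');
    if (at_boundary && set.IsMatch())
      longest = consumed;
  }
  return longest;
}
===
With rules {"com", "withgoogle.com"}, LongestCompleteMatch(set, "www.google.com")
reads 'moc.elgoog.' before the early exit and returns 3 (the length of 'com'),
matching the walkthrough below.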
As an illustrative example, 'www.google.com' is an interesting input to
think about, since it has collisions both in the forward and reverse directions.
In the current (forward-scanning) implementation, because 'www.ck' and 'www.ro'
are in the list of effective TLDs, the initial lookup for 'www.google.com' needs
to scan as far as the 'g' before the DAFSA can fail the lookup for the first
iteration. Then there's a forward scan to find the dot, and a second iteration
(input 'google.com') needs to scan to the dot to eliminate the 'google' TLD
branch. Then there's another scan forward to find the dot, and a third
iteration, with input 'com', returns a match.
With the proposed backwards-scanning DAFSA, on input 'www.google.com', you'd
need to scan as far as 'moc.elgoog.' to know that 'com' is the TLD, because
'withgoogle.com' is in the ETLD registry.
Including the final scan to extend the '.com' registry to the 'google.com'
domain, I count 32 character reads going forward vs. 17 going backward; but if
you count just the DAFSA iterations (which are presumably more expensive than
the find-next-dot scans, due to memory locality), it would be something like
15 vs. 11.
This idea originated in https://codereview.chromium.org/2568133007
csharrison says this function has been a hot spot historically, though
adding fast paths in the calling code has reduced its call frequency.
Comment 1 by rsleevi@chromium.org, Jan 4 2017
That simplification isn't correct, because there are a variety of properties
that still need to be considered to implement the logic described at
https://publicsuffix.org/list/, particularly wildcards and exceptions. Granted,
I may have misparsed your suggestion, but consider a domain like
pref.hokkaido.jp. The applicable PSL entries for this are:
jp
*.jp
*.hokkaido.jp
!pref.hokkaido.jp
From what I can understand of your description, this would result in worse
performance/more comparisons. But perhaps I've misunderstood.
Jan 5 2017
Thanks for pointing out the wildcard and exception cases, which do add complexity to this, but I don't think they break the idea.
Steps 1 and 2 are safe because they just apply a bijective transformation to the index and the inputs. The wildcard and exception markers don't affect the DAFSA graph; make_dafsa.py doesn't turn wildcards into state-machine states, but instead stores them as a kind of payload/flag.
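For illustration, here is my understanding of how the pref.hokkaido.jp rules from the previous comment end up in the gperf file; the exact flag values (1 = exception, 2 = wildcard, 4 = private) are my reading of the tld_cleanup output and the kDafsa* constants, so treat the specifics as assumptions:
===
jp, 0
hokkaido.jp, 2        (the wildcard *.hokkaido.jp, stored without the '*.')
pref.hokkaido.jp, 1   (the exception !pref.hokkaido.jp, stored without the '!')
===
Only the string part shapes the DAFSA graph; the numeric payload rides along as the lookup's return value, which is why reversing the strings doesn't disturb the wildcard/exception machinery.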
Consider what would happen if we did only steps 1 and 2, and left the while(1) loop of GetRegistryLengthImpl() intact. We would see the following call order:
- LookupInFixedSet("moc.elgoog.www") = -1
- LookupInFixedSet("moc.elgoog") = -1
- LookupInFixedSet("moc") = 0
The idea of step 3, in essence, is that because a DAFSA consumes the input string in order, the second and third calls above ("moc.elgoog" and "moc") are redundant computation: we had already computed their results as intermediate states during the computation of LookupInFixedSet("moc.elgoog.www").
Reading the formal algorithm at https://publicsuffix.org/list/, I'm pretty sure that we would only need a single loop parameter to track the currently prevailing rule, and that this could handle the wildcards and exceptions. Something like the following outline (a compilable sketch follows it)...
===
before the loop, initialize:
int prevailing_rule = 0;
bool prevailing_rule_is_wildcard = true;
Then, inside the loop, each time we encounter the end of a label in the input, we'd assign code = GetReturnValue() -- i.e., what the result would be if we stopped there, and go through the following cases:
- if (code & kDafsaPrivateRule) and we're ignoring private rules, set code = -1.
- if (code != -1 && code & kDafsaExceptionRule), it's an exception rule, and we're basically done, per step 5 of the formal algorithm: trim the most recent label and return. (Note: this is actually a different behavior from ToT if the dictionary contained, say, both !pref.hokkaido.jp and a longer rule *.www.pref.hokkaido.jp, but I think such collisions don't occur in the dictionary today.)
- if it's a wildcard or normal result (code != -1), it must have more labels than the previous prevailing_rule. So replace prevailing_rule with the current index, and update prevailing_rule_is_wildcard to (code & kDafsaWildcardRule).
- if there's no match (code == -1), and |prevailing_rule_is_wildcard| is true, then set prevailing_rule to the current substring, and set prevailing_rule_is_wildcard to false.
When we've hit either the end of the string or the final state of the DAFSA, we know we've identified the prevailing rule (i.e., the matching rule with the most labels) and that there's no matching exception rule.
===
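To show that the outline above holds together, here is a compilable sketch of it. This is illustrative only, not the code that eventually landed: lookup_suffix is a made-up stand-in for the reverse-fed DAFSA (a real implementation would obtain these codes incrementally while consuming the hostname right to left, and would stop at the first dead end), and the kDafsa* values are modeled on net/base/lookup_string_in_fixed_set.h.
===
#include <cstddef>
#include <functional>
#include <string>

// Result-code flags as used in the outline above (values are assumptions).
constexpr int kDafsaNotFound = -1;
constexpr int kDafsaExceptionRule = 1 << 0;
constexpr int kDafsaWildcardRule = 1 << 1;
constexpr int kDafsaPrivateRule = 1 << 2;

// |lookup_suffix| returns the result code for host.substr(offset), where
// |offset| is a label boundary. Assumes |host| is non-empty and has no
// trailing dot.
size_t GetRegistryLengthSketch(
    const std::string& host,
    bool include_private,
    const std::function<int(size_t)>& lookup_suffix) {
  size_t prevailing_rule = std::string::npos;
  bool prevailing_rule_is_wildcard = true;  // The PSL default rule is '*'.

  size_t pos = host.size();
  while (pos != 0) {
    // Find the start of the next label, scanning right to left.
    size_t dot = host.rfind('.', pos - 1);
    size_t label_start = (dot == std::string::npos) ? 0 : dot + 1;
    int code = lookup_suffix(label_start);
    // Ignore private-registry rules if the caller asked us to.
    if (code != kDafsaNotFound && (code & kDafsaPrivateRule) &&
        !include_private)
      code = kDafsaNotFound;
    if (code != kDafsaNotFound && (code & kDafsaExceptionRule)) {
      // Exception rule (e.g. !pref.hokkaido.jp): the registry is the
      // matched suffix minus its leftmost label, and we're done.
      size_t next_dot = host.find('.', label_start);
      return (next_dot == std::string::npos)
                 ? 0
                 : host.size() - (next_dot + 1);
    }
    if (code != kDafsaNotFound) {
      // A normal or wildcard rule; it has more labels than the previous
      // prevailing rule, so it takes over.
      prevailing_rule = label_start;
      prevailing_rule_is_wildcard = (code & kDafsaWildcardRule) != 0;
    } else if (prevailing_rule_is_wildcard) {
      // No rule matches this suffix, but the prevailing wildcard rule
      // extends one label further to the left.
      prevailing_rule = label_start;
      prevailing_rule_is_wildcard = false;
    }
    if (label_start == 0)
      break;
    pos = dot;
  }
  return (prevailing_rule == std::string::npos)
             ? 0
             : host.size() - prevailing_rule;
}
===
Against the pref.hokkaido.jp rules from the previous comment, this returns the length of 'hokkaido.jp' for 'www.pref.hokkaido.jp' (the exception rule fires) and the length of 'bar.hokkaido.jp' for 'foo.bar.hokkaido.jp' (the wildcard extends one label).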
I noticed that the publicsuffix.org doc talks about multiple wildcards, and about wildcards in a non-leftmost position (*.*.mil, or state.*.us). Our current implementation doesn't seem capable of handling these at all, and they don't occur in practice in effective_tld_names.dat, so I'm assuming we can keep ignoring them.
As for whether this is worse perf / more comparisons, we'd of course need to measure. With the existing algorithm, unless the hostname has an invalid TLD, you always have to process every character of the input at least once, and some characters are looked at multiple times. Consuming the input in reverse, by contrast, lets you sometimes skip a run of leftmost characters in the string (e.g. 'www'). That, at least, feels like it ought to be an efficiency gain for some inputs.
Are there possible efficiency losses? Sure. It's possible that while we feed fewer characters into the DAFSA on average, we wind up traversing more bytes of the DAFSA overall. And it's pretty much guaranteed that it'll get slower on some subset of inputs that happen to have a particularly fast path through the graph today.
A starting point might be just to see whether the generated DAFSA blob gets bigger or smaller on reversed inputs. I'd imagine that smaller is likely faster.
Jan 5 2017
Multi-wildcard is https://bugs.chromium.org/p/chromium/issues/detail?id=459802, and it seems like you also tripped up on the other issue (exception rules trumping longer, more specific rules). The lack of multi-wildcards has prevented several recent PSL inclusion requests (notably, Amazon), so it's something that would be good to fix if we're going to be mucking about here anyway. As mentioned, this function is perf-sensitive (we took a 25% perf hit switching to the DAFSA, but that gave us a 10x space savings); it's not just the DOM routines, but also every network request/cookie. It's hard to evaluate what the up-front cost (of reversing the string) is versus the potential performance savings. I don't want to discourage profiling, but I do want to encourage that if we switch this up, it's aggressively profiled as well.
Jan 5 2017
rsleevi: We don't need to reverse the string or create any copies. The idea is just to start at the end of the hostname StringPiece and work toward the front: operator-- instead of operator++. Re: "25% perf hit switching to the DAFSA", how was that measured? Is there a benchmark? The DAFSA is a tad smaller when built from the reversed strings: 40830 bytes, down from 41441. Multi-wildcards on the left seem like a simple extension of the current system. What exactly did Amazon need?
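A minimal sketch of that point, assuming some incremental lookup object with a bool Advance(char) method that returns false at a dead end (the names and interface here are assumptions, not the eventual patch):
===
#include <cstddef>
#include "base/strings/string_piece.h"

// Feeds |host| to an incremental DAFSA cursor from the last character toward
// the first; no reversed copy of the string is ever created.
template <typename IncrementalLookup>
size_t FeedHostInReverse(base::StringPiece host, IncrementalLookup* lookup) {
  size_t consumed = 0;
  const char* begin = host.data();
  for (const char* p = begin + host.size(); p != begin;) {
    --p;  // operator-- instead of operator++
    if (!lookup->Advance(*p))
      break;
    ++consumed;
  }
  return consumed;  // How many characters the DAFSA accepted.
}
===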
Jan 5 2017
https://cs.chromium.org/chromium/src/net/extras/sqlite/sqlite_persistent_cookie_store.cc?rcl=1483575682&l=628 is the main hotspot that affects startup and page-load times. The test involved using a sample profile (of ~2400 cookies) and iterating several runs on that. The major metric to watch is Cookie.TimeParseDomains, if you want to do A/B testing of this.
Jan 18 2017
I'll implement this. I've done a good bit of prototyping already, and am convinced it's a significant speedup. Additionally, it is compatible with exception/wildcard rules, and could be extended to support multi-wildcard (*.*.google.com) and mid-wildcard (borg.*.google.com) rules, if anybody requires them. The idea is that we'd encode '*' in the DAFSA's language directly, rather than as a result code; since we're now decoding the DAFSA incrementally, we can notice when we're at a '*' edge and allow it to match an entire label, after ruling out a matching exception rule.

I have been testing against alexa1-10000-urls.json, in random order. Reversing the strings alone (without changing the algorithm) definitely results in more comparisons overall, but if we reverse the strings and eliminate the "chop first label" loop in GetRegistryLengthImpl, the number of DAFSA operations is reduced by more than half: the average number of calls to GetNextOffset() drops from 89 to 40, and the benchmark time drops from 1090ms to 714ms.

I've noticed that both the serialized DAFSA size and the performance on likely inputs are surprisingly sensitive to the order in which entries are listed in the gperf file. So it seems particularly important to use a large corpus in testing, since it's possible for small perturbations of the graph to have a large effect on performance. To illustrate the sensitivity: the initial node in the DAFSA has 36 edges that appear in reverse alphabetical order (z-a, 9-0). This ordering makes 'www' queries fast, but 'com' queries probably unnecessarily slow. But why is it in this order? There seems to be no especially deliberate reason: it's an artifact of the inputs being sorted, the fact that edges must be emitted in order of increasing offset, and the fact that the first few layers of the graph aren't very constrained by the topological sort.

Thus, there's a separate optimization opportunity here: a smarter re-ordering of these nodes could, for example, move the nodes involved in '.com' queries to lower offsets than those involved in '.xfinity' queries. Even just using suffix/label/letter frequencies in the PSL itself could probably get us most of the way there, since 'com' and 'org' entries seem to be the most numerous among the private registries, and that may correlate with the overall popularity of those domains. Alternatively, we could make changes to the encoding so that, e.g., we can binary-search the edge characters, or use a trie-type jump table for the really wide nodes. I'm not sure that's necessary; it would require a total rewrite of make_dafsa.py, and it would almost certainly inflate the encoded size (by about 5K, perhaps).
Feb 16 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/0b80485a436c8c881164729adb8484868ecd065d

commit 0b80485a436c8c881164729adb8484868ecd065d
Author: nick <nick@chromium.org>
Date: Thu Feb 16 20:26:23 2017

Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental prefix
lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the language of the
DAFSA, and validate that it matches the body of the .gperf file. Because the
unittest exhaustively explores every node of the DAFSA on every possible
input, it should be safe to demote the CHECKs in this file to DCHECKs, which
is a performance win. Because of function inlining and the CHECK removal,
this actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):
RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
RegistryControlledDomainTest.SameDomainOrHost_Same: 194ms

BUG=678334
Review-Url: https://codereview.chromium.org/2641953009
Cr-Commit-Position: refs/heads/master@{#451085}

[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set.cc
[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set.h
[modify] https://crrev.com/0b80485a436c8c881164729adb8484868ecd065d/net/base/lookup_string_in_fixed_set_unittest.cc
Feb 17 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/6b6710cc340138ce79eba7a6e8d371eeacf58f56

commit 6b6710cc340138ce79eba7a6e8d371eeacf58f56
Author: nick <nick@chromium.org>
Date: Fri Feb 17 01:12:51 2017

Add some test strings to registry_controlled_domain_unittest.cc

These strings will be interesting values if we reverse the lookup order,
which occurs in the next patch set of this series.

BUG=678334
Review-Url: https://codereview.chromium.org/2648213003
Cr-Commit-Position: refs/heads/master@{#451165}

[modify] https://crrev.com/6b6710cc340138ce79eba7a6e8d371eeacf58f56/net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc
Jan 11
This issue has been marked as started, but has no owner. Making available.