UrlPatternIndex: Implement case insensitive matching. |
|||||
Issue descriptionUrlPatternIndex exposes the ability to match URLs in both a case-sensitive and a case-insensitive way. However, it seems that case-insensitive matching isn't implemented (See TODO at https://cs.chromium.org/chromium/src/components/url_pattern_index/url_pattern.h?type=cs&q=sensitive+file:%5Esrc/components/url_pattern_index/+package:%5Echromium$&l=68). Hence the matching is always case sensitive. Charles: Can you take a look?
,
Sep 22 2017
I wasn't aware that this wasn't implemented and hence had implemented support for it, since it's exposed as part of url pattern index. Also, I am assuming this is also a bug for subresource-filter? Will evaluate whether it's important enough for DNR and take a stab later if need be.
,
Sep 22 2017
I think it's a bug / missing feature but I'm not sure how impactful it is for us. Pavel, do you have any context?
,
Sep 22 2017
Hi. Not really, I haven't investigated how much case insensitive matching is important.
,
Sep 10
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a9c4e1d4481e67dbfc550c19b04252a8839fa35d commit a9c4e1d4481e67dbfc550c19b04252a8839fa35d Author: Karan Bhatia <karandeepb@chromium.org> Date: Mon Sep 10 23:37:47 2018 UrlPatternIndex: Implement case-insensitive matching. This fixes an existing TODO in url_pattern_index to implement case-insensitive matching. To do so: - Lower case the stored n grams. These n grams are used to find prospective matches for a particular url. - Modify URLPattern to handle case-insensitive matching. This involves lower-casing the url pattern and the url, before comparing them. BUG= 767605 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I1377c216d61392c508280141d0d947a47a3ee494 Reviewed-on: https://chromium-review.googlesource.com/1212183 Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Ned Nguyen <nednguyen@google.com> Reviewed-by: Istiaque Ahmed <lazyboy@chromium.org> Commit-Queue: Karan Bhatia <karandeepb@chromium.org> Cr-Commit-Position: refs/heads/master@{#590118} [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/chrome/browser/extensions/api/declarative_net_request/declarative_net_request_browsertest.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/subresource_filter/core/common/indexed_ruleset.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/flat/url_pattern_index.fbs [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern.h [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index.h [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index_unittest.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_unittest.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/extensions/browser/api/declarative_net_request/utils.cc [modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/tools/perf/core/default_local_state.json
,
Sep 11
See if there's any regression on ChromiumPerf/Android Nexus5 Perf/components_perftests/median_match_time.
,
Sep 12
As per https://chromeperf.appspot.com/report?sid=f3483fa43a11ac2b2d8105faa03286e1ab3f2c65c31c665315363262fdf07d1e, there seems to be some regression in "ChromiumPerf/Android Nexus5 Perf/components_perftests/median_match_time." Charlie: Do you think if this is ok or whether this is worth optimizing?
,
Sep 12
Hm looks like a ~20% regression on Win and ~33% regression on Android. It would be nice to improve this if we can. I ran a quick analysis of our corpus of URLs and 1/2 of URLs have at least one uppercase character. This yields the absolute worst performance. 1. Scan URL for lowercase for NGram matching 2. Need to lower-case the URL for NGram matching 3. match_case() false 4. HasAnyUpperAscii(url_pattern_) is (probably) false 5. HasAnyUpperAscii(url.possibly_invalid_spec()) is true 6. Lower-case the pattern 7. Lower-case the url again 8. Standard match This adds a lot of extra work :( My feeling is that subresource_filter (and DNR) may want match_case to be true by default. Otherwise, we can probably remove duplicate work (like 1/5 and 2/7). Additionally, if match_case is false for the pattern, we can lower() it at index time, removing (6) and (4). Profiling can probably find which of these is eating up the most time. My guess is lowercasing the URL is the worst offender.
,
Sep 12
Thanks for the analysis Charlie. I'll try and optimize this. As you pointed out, currently there is a lot of redundant work, which can be reduced. The url is being lower cased for every pattern, but taking care of that should be trivial. Will also look into lower casing the pattern with match_case==false at index time.
,
Sep 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03 commit e0aeb0ec65e3b74ca5f454a555a04d8c060fff03 Author: Karan Bhatia <karandeepb@chromium.org> Date: Wed Sep 12 18:57:21 2018 UrlPatternIndex: Reduce redundant computations while matching. r590118 implemented case-insensitive matching which led to some regression in url matching performance. Currently, it's possible for a url to be lower-cased each time it's matched with a pattern in UrlPattern::MatchesUrl. This CL aims to reduce this redundant computation. A thin wrapper over GURL, UrlInfo is introduced to cache the computation performed on the GURL. BUG= 767605 Change-Id: Ibe977f34e649068453c5a01e3619b93a76c9ae7d Reviewed-on: https://chromium-review.googlesource.com/1220628 Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Karan Bhatia <karandeepb@chromium.org> Cr-Commit-Position: refs/heads/master@{#590767} [modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern.cc [modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern.h [modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern_index.cc [modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern_unittest.cc
,
Sep 12
Looks like on Android the regression has dropped to only 8%. Still not amazing but not bad. I think a further improvement would be: - Default match_case to true in DNR / subresource_filter - Include an optimization which avoids any lowercasing if there are no case insensitive rules. But I don't think we should block on that.
,
Sep 12
>> Default match_case to true in DNR / subresource_filter For subresource filter, aren't most easylist rules case-insensitive by default? So we'll be changing the behavior for them (granted this was already the case, since case-insensitive matching wasn't implemented). Another option which probably won't be difficult to implement would be to lower-case the case-insensitive patterns at index time. I think that should get us to good enough performance. Thoughts?
,
Sep 12
Yes, technically subresource_filter was / will be divergent from EasyList. I think I may be OK with losing some coverage for performance though. In general, I'm not really happy EasyList is insensitive by default. It makes everything even harder to reason about. Currently the extra computation we're doing is: 1. One extra scan for lower case on the URL 2. A call ToLowerASCII on the URL if there is upper case in it 3. A call ToLowerASCII on the pattern, if match_Case is false. We'll still pay for (1) and (2) if we lower-case the patterns at index time, right? I bet that's more expensive than (3), but it depends on the patterns.
,
Sep 12
Note though, (1) and (2) are done once per URL, while (3) is done for each prospective match for a URL being matched. Hence getting rid of (3) "might" have some benefit as well. I'll go ahead and see what (3) looks like and let's take a call on the default value after that?
,
Sep 12
Yep great point. I can try to generate a profile and see where time is going on our current ruleset.
,
Sep 12
I ran a quick profile with the new code:
- 43.93% subresource_filter::IndexedRulesetMatcher::MatchedUrlRule ▒
- 43.88% subresource_filter::IndexedRulesetMatcher::MatchedUrlRule ▒
- 35.54% url_pattern_index::UrlPatternIndexMatcher::FindMatch ▒
- 11.49% url_pattern_index::(anonymous namespace)::FindMatchAmongCandidates ▒
- 8.06% url_pattern_index::UrlPattern::MatchesUrl ▒
+ 2.86% base::ToLowerASCII ▒
2.28% base::internal::find ▒
1.74% url_pattern_index::(anonymous namespace)::IsCaseSensitiveMatch ▒
1.67% url_pattern_index::(anonymous namespace)::GetLongestMatchingSubdomain ▒
- 6.03% url_pattern_index::UrlPattern::UrlInfo::UrlInfo ▒
+ 5.12% base::ToLowerASCII ▒
- 7.53% subresource_filter::FirstPartyOrigin::IsThirdParty ▒
+ 6.97% net::registry_controlled_domains::(anonymous namespace)::SameDomainOrHost ▒
There's a couple things to note:
- Less than half of the time is spent actually doing matching, mostly due to IO
- UrlInfo lower-casing is more expensive than pattern lower-casing (~5% vs 3%)
- Lower-casing is pretty expensive in general relative to other ops. In reality it is closer to 18% of MatchedUrlRule.
,
Sep 12
Thanks for the analysis Charlie, this is pretty helpful. Curious, what did use to profile the code? I think so the plan of action is to: - Flip the default for case-sensitivity. I believe this will also require changes to proto->easylist conversion, can you take this on? - Compute lower cased url on UrlInfo lazily (only when required). - Optionally, lower case patterns while indexing. This is probably not important enough after flipping the default.
,
Sep 12
Also, just making sure you were using a release build, right?
,
Sep 12
I use `perf record` and `perf report` tools from linux perf. >> perf record --call-graph=dwarf -- ./out/profiling-enabled-build/components_perftests --gtest_filter="*Match*" --gtest_repeat=10 >> perf report You need a few things in args.gn: is_debug = false enable_profiling = true enable_callgrind = true I think one plan of action could be (these can be done in any order really): 1. Lower case patterns with match_case = false during indexing. Or fail w/ a syntax error or something for DNR. 2. Flip match_case default to true 3. Alter the NGram extractor to extract lower-cased ngrams automatically w/out an allocation 4. Compute lower-cased url lazily I think if we do this we will revert back to the old performance, except for a tiny regression at (3) but I think that should be really small, since we're already scanning through the URL. (1) is definitely lower priority if we do (2), but I think for DNR it is probably a good idea to do something just in case. E.g. an ad blocker may want to have match_case false for every pattern. I can look into doing (2).
,
Sep 12
Sounds good, I'll take on (1) and (4) for now. (3) probably won't be required after (1).
,
Sep 13
I still think (3) is required even if we do (1). This would remove the need to call lower_case_spec here: https://cs.chromium.org/chromium/src/components/url_pattern_index/url_pattern_index.cc?rcl=062244e45f9d737c516b868e0762e5b0145a59bb&l=663 I can look into doing it.
,
Sep 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/03d14673330e4a12cdd520cb6bff60091f02d5c3 commit 03d14673330e4a12cdd520cb6bff60091f02d5c3 Author: Charlie Harrison <csharrison@chromium.org> Date: Thu Sep 13 20:37:02 2018 [url_pattern_index] Implement lower case ngram extraction This is implemented as an additional template parameter for readability. We currently store all ngrams as lowercase. By making ngram extraction automatically extract lower-cased ngrams, callers no longer have to lower-case their input (usually causing an additional allocation / copy). Bug: 767605 Change-Id: I466011a5143785520d2007700b01585363736153 Reviewed-on: https://chromium-review.googlesource.com/1223206 Reviewed-by: Karan Bhatia <karandeepb@chromium.org> Commit-Queue: Charlie Harrison <csharrison@chromium.org> Cr-Commit-Position: refs/heads/master@{#591139} [modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/ngram_extractor.h [modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/ngram_extractor_unittest.cc [modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/url_pattern_index.cc
,
Sep 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e177fb6a42054c27db60110eb3ee4ef5272f38ce commit e177fb6a42054c27db60110eb3ee4ef5272f38ce Author: Karan Bhatia <karandeepb@chromium.org> Date: Fri Sep 14 00:57:30 2018 UrlPatternIndex: Lazily compute lower cased url spec. This CL changes UrlPattern::UrlInfo to ensure that the lower cased url spec is computed lazily. Together with r591139, this helps prevent a string allocation for the whole lower cased url unless absolutely necessary. BUG= 767605 Change-Id: I374c268a48900322f258212b05b906c541869031 Reviewed-on: https://chromium-review.googlesource.com/1226450 Reviewed-by: Charlie Harrison <csharrison@chromium.org> Commit-Queue: Karan Bhatia <karandeepb@chromium.org> Cr-Commit-Position: refs/heads/master@{#591243} [modify] https://crrev.com/e177fb6a42054c27db60110eb3ee4ef5272f38ce/components/url_pattern_index/url_pattern.cc [modify] https://crrev.com/e177fb6a42054c27db60110eb3ee4ef5272f38ce/components/url_pattern_index/url_pattern.h
,
Sep 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8d71f6f577d973c04ed02cf8779643a6b874ac50 commit 8d71f6f577d973c04ed02cf8779643a6b874ac50 Author: Charlie Harrison <csharrison@chromium.org> Date: Fri Sep 14 14:43:26 2018 [url_pattern_index] Make case sensitive matching the default Case insensitive matching has some performance cost, mostly in copying and re-allocating the underlying URL to do lower-cased matching. This CL changes IS_MATCH_CASE to IS_CASE_INSENSITIVE in the flatbuffer format, and reverses its meaning. It also changes proto -> flatbuffer conversion to ignore the match_case proto bit. At some later date we may want to either: - Align with the proto rules and make match case false by default - Introduce new rule parsing syntax which supports case insensitivity For now, just punt with a TODO. TBR=nednguyen@chromium.org Bug: 767605 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I440d5966076a184ea1042f6f38c8562732ae803d Reviewed-on: https://chromium-review.googlesource.com/1223209 Reviewed-by: Devlin <rdevlin.cronin@chromium.org> Reviewed-by: Josh Karlin <jkarlin@chromium.org> Reviewed-by: Karan Bhatia <karandeepb@chromium.org> Commit-Queue: Charlie Harrison <csharrison@chromium.org> Cr-Commit-Position: refs/heads/master@{#591342} [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/subresource_filter/core/common/indexed_ruleset.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/flat/url_pattern_index.fbs [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/proto/rules.proto [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern.h [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index.h [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index_unittest.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_rule_util.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/flat_ruleset_indexer_unittest.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_rule.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_rule_unittest.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/utils.cc [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/common/api/declarative_net_request.idl [modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/tools/perf/core/default_local_state.json
,
Sep 18
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a06f68234f469100b47fe1d58cf17ac8f0b94474 commit a06f68234f469100b47fe1d58cf17ac8f0b94474 Author: Karan Bhatia <karandeepb@chromium.org> Date: Tue Sep 18 00:05:49 2018 UrlPatternIndex: Require case-insensitive patterns to be lower-cased. Currently case-insensitive patterns need to be lower-cased when they are matched against a url. This CL changes the UrlPatternIndex to require case-insensitive patterns to be lower-cased at index time. This allows more efficient matching for case-insensitive patterns by reducing computation during matching. TBR=nednguyen@google.com BUG= 767605 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I69f9017ee75cc85dba129adb8fd89df6b2d51686 Reviewed-on: https://chromium-review.googlesource.com/1223328 Commit-Queue: Karan Bhatia <karandeepb@chromium.org> Reviewed-by: Charlie Harrison <csharrison@chromium.org> Reviewed-by: Istiaque Ahmed <lazyboy@chromium.org> Cr-Commit-Position: refs/heads/master@{#591887} [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/subresource_filter/core/common/indexed_ruleset.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/flat/url_pattern_index.fbs [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern_index.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern_index.h [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_rule.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_rule_unittest.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/utils.cc [modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/tools/perf/core/default_local_state.json
,
Sep 18
Think that all planned optimizations have landed. And the perf dashboard shows, that the numbers for median_match_time test are now almost at previous levels. Closing this.
,
Sep 18
Thanks Karan! |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by csharrison@chromium.org
, Sep 22 2017Owner: ----
Status: Untriaged (was: Assigned)