New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 767605 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 18
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 696822
issue 609747



Sign in to add a comment

UrlPatternIndex: Implement case insensitive matching.

Project Member Reported by karandeepb@chromium.org, Sep 21 2017

Issue description

UrlPatternIndex exposes the ability to match URLs in both a case-sensitive and a case-insensitive way. However, it seems that case-insensitive matching isn't implemented (See TODO at https://cs.chromium.org/chromium/src/components/url_pattern_index/url_pattern.h?type=cs&q=sensitive+file:%5Esrc/components/url_pattern_index/+package:%5Echromium$&l=68). Hence the matching is always case sensitive. 

Charles: Can you take a look?
 
Cc: csharrison@chromium.org
Owner: ----
Status: Untriaged (was: Assigned)
Is this a requirement for DNR? I wonder how necessary it is, given it'll probably be worse performing.

I probably don't have cycles to implement this in the near future, to be honest, so moving myself to cc for now.


Status: Available (was: Untriaged)
I wasn't aware that this wasn't implemented and hence had implemented support for it, since it's exposed as part of url pattern index. Also, I am assuming this is also a bug for subresource-filter?

Will evaluate whether it's important enough for DNR and take a stab later if need be.
I think it's a bug / missing feature but I'm not sure how impactful it is for us. Pavel, do you have any context?
Hi. Not really, I haven't investigated how much case insensitive matching is important.
Project Member

Comment 5 by bugdroid1@chromium.org, Sep 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a9c4e1d4481e67dbfc550c19b04252a8839fa35d

commit a9c4e1d4481e67dbfc550c19b04252a8839fa35d
Author: Karan Bhatia <karandeepb@chromium.org>
Date: Mon Sep 10 23:37:47 2018

UrlPatternIndex: Implement case-insensitive matching.

This fixes an existing TODO in url_pattern_index to implement case-insensitive
matching. To do so:
  - Lower case the stored n grams. These n grams are used to find prospective
    matches for a particular url.
  - Modify URLPattern to handle case-insensitive matching. This involves
    lower-casing the url pattern and the url, before comparing them.

BUG= 767605 

Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I1377c216d61392c508280141d0d947a47a3ee494
Reviewed-on: https://chromium-review.googlesource.com/1212183
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Istiaque Ahmed <lazyboy@chromium.org>
Commit-Queue: Karan Bhatia <karandeepb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590118}
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/chrome/browser/extensions/api/declarative_net_request/declarative_net_request_browsertest.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/subresource_filter/core/common/indexed_ruleset.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/flat/url_pattern_index.fbs
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern.h
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index.h
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_index_unittest.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/components/url_pattern_index/url_pattern_unittest.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/extensions/browser/api/declarative_net_request/utils.cc
[modify] https://crrev.com/a9c4e1d4481e67dbfc550c19b04252a8839fa35d/tools/perf/core/default_local_state.json

NextAction: 2018-09-12
Owner: karandeepb@chromium.org
Status: Started (was: Available)
See if there's any regression on ChromiumPerf/Android Nexus5 Perf/components_perftests/median_match_time.
NextAction: ----
As per https://chromeperf.appspot.com/report?sid=f3483fa43a11ac2b2d8105faa03286e1ab3f2c65c31c665315363262fdf07d1e, there seems to be some regression in "ChromiumPerf/Android Nexus5 Perf/components_perftests/median_match_time." 

Charlie: Do you think if this is ok or whether this is worth optimizing? 
Hm looks like a ~20% regression on Win and ~33% regression on Android. It would be nice to improve this if we can.

I ran a quick analysis of our corpus of URLs and 1/2 of URLs have at least one uppercase character. This yields the absolute worst performance.
1. Scan URL for lowercase for NGram matching
2. Need to lower-case the URL for NGram matching
3. match_case() false
4. HasAnyUpperAscii(url_pattern_) is (probably) false
5. HasAnyUpperAscii(url.possibly_invalid_spec()) is true
6. Lower-case the pattern
7. Lower-case the url again
8. Standard match

This adds a lot of extra work :(

My feeling is that subresource_filter (and DNR) may want match_case to be true by default. Otherwise, we can probably remove duplicate work (like 1/5 and 2/7). Additionally, if match_case is false for the pattern, we can lower() it at index time, removing (6) and (4).

Profiling can probably find which of these is eating up the most time. My guess is lowercasing the URL is the worst offender.
Thanks for the analysis Charlie. I'll try and optimize this. As you pointed out, currently there is a lot of redundant work, which can be reduced. The url is being lower cased for every pattern, but taking care of that should be trivial. 

Will also look into lower casing the pattern with match_case==false at index time.
Project Member

Comment 10 by bugdroid1@chromium.org, Sep 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03

commit e0aeb0ec65e3b74ca5f454a555a04d8c060fff03
Author: Karan Bhatia <karandeepb@chromium.org>
Date: Wed Sep 12 18:57:21 2018

UrlPatternIndex: Reduce redundant computations while matching.

r590118 implemented case-insensitive matching which led to some regression in url
matching performance. Currently, it's possible for a url to be lower-cased each
time it's matched with a pattern in UrlPattern::MatchesUrl. This CL aims to
reduce this redundant computation. A thin wrapper over GURL, UrlInfo is
introduced to cache the computation performed on the GURL.

BUG= 767605 

Change-Id: Ibe977f34e649068453c5a01e3619b93a76c9ae7d
Reviewed-on: https://chromium-review.googlesource.com/1220628
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Commit-Queue: Karan Bhatia <karandeepb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590767}
[modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern.cc
[modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern.h
[modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern_index.cc
[modify] https://crrev.com/e0aeb0ec65e3b74ca5f454a555a04d8c060fff03/components/url_pattern_index/url_pattern_unittest.cc

Looks like on Android the regression has dropped to only 8%. Still not amazing but not bad.

I think a further improvement would be:
- Default match_case to true in DNR / subresource_filter
- Include an optimization which avoids any lowercasing if there are no
  case insensitive rules.

But I don't think we should block on that.
>> Default match_case to true in DNR / subresource_filter

For subresource filter, aren't most easylist rules case-insensitive by default? So we'll be changing the behavior for them (granted this was already the case, since case-insensitive matching wasn't implemented).

Another option which probably won't be difficult to implement would be to lower-case the case-insensitive patterns at index time. I think that should get us to good enough performance. Thoughts?
Yes, technically subresource_filter was / will be divergent from EasyList. I think I may be OK with losing some coverage for performance though.

In general, I'm not really happy EasyList is insensitive by default. It makes everything even harder to reason about.

Currently the extra computation we're doing is:
1. One extra scan for lower case on the URL
2. A call ToLowerASCII on the URL if there is upper case in it
3. A call ToLowerASCII on the pattern, if match_Case is false.

We'll still pay for (1) and (2) if we lower-case the patterns at index time, right? I bet that's more expensive than (3), but it depends on the patterns.
Note though, (1) and (2) are done once per URL, while (3) is done for each prospective match for a URL being matched.

Hence getting rid of (3) "might" have some benefit as well. I'll go ahead and see what (3) looks like and let's take a call on the default value after that?
Yep great point. I can try to generate a profile and see where time is going on our current ruleset.
I ran a quick profile with the new code:
-   43.93% subresource_filter::IndexedRulesetMatcher::MatchedUrlRule                                                                    ▒
   - 43.88% subresource_filter::IndexedRulesetMatcher::MatchedUrlRule                                                                                                                               ▒
      - 35.54% url_pattern_index::UrlPatternIndexMatcher::FindMatch                                                                                                                                 ▒
         - 11.49% url_pattern_index::(anonymous namespace)::FindMatchAmongCandidates                                                                                                                ▒
            - 8.06% url_pattern_index::UrlPattern::MatchesUrl                                                                                                                                       ▒
               + 2.86% base::ToLowerASCII                                                                                                                                                           ▒
                 2.28% base::internal::find                                                                                                                                                         ▒
                 1.74% url_pattern_index::(anonymous namespace)::IsCaseSensitiveMatch                                                                                                               ▒
              1.67% url_pattern_index::(anonymous namespace)::GetLongestMatchingSubdomain                                                                                                           ▒
         - 6.03% url_pattern_index::UrlPattern::UrlInfo::UrlInfo                                                                                                                                    ▒
            + 5.12% base::ToLowerASCII                                                                                                                                                              ▒
      - 7.53% subresource_filter::FirstPartyOrigin::IsThirdParty                                                                                                                                    ▒
         + 6.97% net::registry_controlled_domains::(anonymous namespace)::SameDomainOrHost                                                                                                          ▒


There's a couple things to note:
- Less than half of the time is spent actually doing matching, mostly due to IO
- UrlInfo lower-casing is more expensive than pattern lower-casing (~5% vs 3%)
- Lower-casing is pretty expensive in general relative to other ops. In reality it is closer to 18% of MatchedUrlRule.

Thanks for the analysis Charlie, this is pretty helpful. Curious, what did use to profile the code?

I think so the plan of action is to:
- Flip the default for case-sensitivity. I believe this will also require changes to proto->easylist conversion, can you take this on?
- Compute lower cased url on UrlInfo lazily (only when required).
- Optionally, lower case patterns while indexing. This is probably not important enough after flipping the default.

Also, just making sure you were using a release build, right?
I use `perf record` and `perf report` tools from linux perf.

>> perf record  --call-graph=dwarf  -- ./out/profiling-enabled-build/components_perftests --gtest_filter="*Match*"   --gtest_repeat=10

>> perf report

You need a few things in args.gn:
is_debug = false
enable_profiling = true
enable_callgrind = true

I think one plan of action could be (these can be done in any order really):
1. Lower case patterns with match_case = false during indexing. Or fail w/ a syntax error or something for DNR.
2. Flip match_case default to true
3. Alter the NGram extractor to extract lower-cased ngrams automatically w/out an allocation
4. Compute lower-cased url lazily

I think if we do this we will revert back to the old performance, except for a tiny regression at (3) but I think that should be really small, since we're already scanning through the URL.

(1) is definitely lower priority if we do (2), but I think for DNR it is probably a good idea to do something just in case. E.g. an ad blocker may want to have match_case false for every pattern.

I can look into doing (2).
Sounds good, I'll take on (1) and (4) for now. (3) probably won't be required after (1).
I still think (3) is required even if we do (1). This would remove the need to call lower_case_spec here:
https://cs.chromium.org/chromium/src/components/url_pattern_index/url_pattern_index.cc?rcl=062244e45f9d737c516b868e0762e5b0145a59bb&l=663

I can look into doing it.
Project Member

Comment 22 by bugdroid1@chromium.org, Sep 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/03d14673330e4a12cdd520cb6bff60091f02d5c3

commit 03d14673330e4a12cdd520cb6bff60091f02d5c3
Author: Charlie Harrison <csharrison@chromium.org>
Date: Thu Sep 13 20:37:02 2018

[url_pattern_index] Implement lower case ngram extraction

This is implemented as an additional template parameter for readability.

We currently store all ngrams as lowercase. By making ngram extraction
automatically extract lower-cased ngrams, callers no longer have to
lower-case their input (usually causing an additional allocation / copy).


Bug:  767605 
Change-Id: I466011a5143785520d2007700b01585363736153
Reviewed-on: https://chromium-review.googlesource.com/1223206
Reviewed-by: Karan Bhatia <karandeepb@chromium.org>
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591139}
[modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/ngram_extractor.h
[modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/ngram_extractor_unittest.cc
[modify] https://crrev.com/03d14673330e4a12cdd520cb6bff60091f02d5c3/components/url_pattern_index/url_pattern_index.cc

Project Member

Comment 23 by bugdroid1@chromium.org, Sep 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e177fb6a42054c27db60110eb3ee4ef5272f38ce

commit e177fb6a42054c27db60110eb3ee4ef5272f38ce
Author: Karan Bhatia <karandeepb@chromium.org>
Date: Fri Sep 14 00:57:30 2018

UrlPatternIndex: Lazily compute lower cased url spec.

This CL changes UrlPattern::UrlInfo to ensure that the lower cased url spec is
computed lazily. Together with r591139, this helps prevent a string allocation
for the whole lower cased url unless absolutely necessary.

BUG= 767605 

Change-Id: I374c268a48900322f258212b05b906c541869031
Reviewed-on: https://chromium-review.googlesource.com/1226450
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Commit-Queue: Karan Bhatia <karandeepb@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591243}
[modify] https://crrev.com/e177fb6a42054c27db60110eb3ee4ef5272f38ce/components/url_pattern_index/url_pattern.cc
[modify] https://crrev.com/e177fb6a42054c27db60110eb3ee4ef5272f38ce/components/url_pattern_index/url_pattern.h

Project Member

Comment 24 by bugdroid1@chromium.org, Sep 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8d71f6f577d973c04ed02cf8779643a6b874ac50

commit 8d71f6f577d973c04ed02cf8779643a6b874ac50
Author: Charlie Harrison <csharrison@chromium.org>
Date: Fri Sep 14 14:43:26 2018

[url_pattern_index] Make case sensitive matching the default

Case insensitive matching has some performance cost, mostly in copying
and re-allocating the underlying URL to do lower-cased matching.

This CL changes IS_MATCH_CASE to IS_CASE_INSENSITIVE in the flatbuffer
format, and reverses its meaning. It also changes proto -> flatbuffer
conversion to ignore the match_case proto bit. At some later date we
may want to either:
 - Align with the proto rules and make match case false by default
 - Introduce new rule parsing syntax which supports case insensitivity

For now, just punt with a TODO.

TBR=nednguyen@chromium.org

Bug:  767605 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I440d5966076a184ea1042f6f38c8562732ae803d
Reviewed-on: https://chromium-review.googlesource.com/1223209
Reviewed-by: Devlin <rdevlin.cronin@chromium.org>
Reviewed-by: Josh Karlin <jkarlin@chromium.org>
Reviewed-by: Karan Bhatia <karandeepb@chromium.org>
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591342}
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/subresource_filter/core/common/indexed_ruleset.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/flat/url_pattern_index.fbs
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/proto/rules.proto
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern.h
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index.h
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_pattern_index_unittest.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/components/url_pattern_index/url_rule_util.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/flat_ruleset_indexer_unittest.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_rule.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_rule_unittest.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/browser/api/declarative_net_request/utils.cc
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/extensions/common/api/declarative_net_request.idl
[modify] https://crrev.com/8d71f6f577d973c04ed02cf8779643a6b874ac50/tools/perf/core/default_local_state.json

Project Member

Comment 25 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a06f68234f469100b47fe1d58cf17ac8f0b94474

commit a06f68234f469100b47fe1d58cf17ac8f0b94474
Author: Karan Bhatia <karandeepb@chromium.org>
Date: Tue Sep 18 00:05:49 2018

UrlPatternIndex: Require case-insensitive patterns to be lower-cased.

Currently case-insensitive patterns need to be lower-cased when they are matched
against a url. This CL changes the UrlPatternIndex to require case-insensitive
patterns to be lower-cased at index time. This allows more efficient matching
for case-insensitive patterns by reducing computation during matching.

TBR=nednguyen@google.com

BUG= 767605 

Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I69f9017ee75cc85dba129adb8fd89df6b2d51686
Reviewed-on: https://chromium-review.googlesource.com/1223328
Commit-Queue: Karan Bhatia <karandeepb@chromium.org>
Reviewed-by: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Istiaque Ahmed <lazyboy@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591887}
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/subresource_filter/core/common/indexed_ruleset.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/flat/url_pattern_index.fbs
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern_index.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/components/url_pattern_index/url_pattern_index.h
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_rule.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_rule_unittest.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/indexed_ruleset_format_version_unittest.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/extensions/browser/api/declarative_net_request/utils.cc
[modify] https://crrev.com/a06f68234f469100b47fe1d58cf17ac8f0b94474/tools/perf/core/default_local_state.json

Status: Fixed (was: Started)
Think that all planned optimizations have landed. And the perf dashboard shows, that the numbers for median_match_time test are now almost at previous levels. Closing this. 
Thanks Karan!

Sign in to add a comment