Omnibox URL matching with percent-encoded Unicode characters is broken |
|
Issue descriptionChrome Version: 67 OS: All (Desktop only?) What steps will reproduce the problem? (1) Visit https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal (2) Close the tab. (3) Type into Omnibox: "wikip%c3%a9" What is the expected result? https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal shows up in search results. What happens instead? It does not. This is a pretty obscure case, but we have code to explicitly handle it and it's broken (twice!). The first part was discovered by mmenke@ during [1]. There are two bugs in url_index_private_data.cc: 1. The search term is decoded with UnescapeURLComponent's string16 overload, which as mmenke discovered, is completely nonsensical, as it decodes percent-escaped sequences as Latin-1. See discussion on Issue 831321 . "wikip%c3%a9" is decoded as "wikipé" and this *does literally match* a page with "wikipé" in the title. It should be decoded as UTF-8, which can be done by converting the search term to UTF-8, using the 8-bit version of UnescapeURLComponent, then converting back to UTF-16. mmenke is doing just this in [1]. 2. If you decode it correctly, as "wikipé", it will still not match the URL, because HistoryIdsToScoredMatches is called with lower_raw_string (the original, unescaped string), which I assume is a mistake. Changing this to lower_unescaped_string allows the match. The first issue will be fixed by [1]. I'll deal with the second issue afterwards and add a test. [1] https://chromium-review.googlesource.com/c/chromium/src/+/1004855
,
Apr 17 2018
Note: The following commit landed which fixed the first of two related bugs mentioned above (should've referenced this issue number): commit 4858757b3c659da84c0eb8d4cccc728331eba281 Author: Matt Menke <mmenke@chromium.org> Date: Fri Apr 13 19:40:43 2018 +0000 Remove UnescapeURLComponent() overload that takes a base::string16. The method has some safe-for-display safety checks that assume the input is UTF-8 / output is UTF-8. This change makes it at least a little harder to avoid those checks, and makes output no longer vary based on whether passing in a std::string or a string16 (By removing the latter option entirely). Bug: 831321 Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd Reviewed-on: https://chromium-review.googlesource.com/1004855 Reviewed-by: Matt Giuca <mgiuca@chromium.org> Reviewed-by: Justin Donnelly <jdonnelly@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/heads/master@{#550720}
,
Apr 17 2018
Regarding #2 in the original report:
>>>
2. If you decode it correctly, as "wikipé", it will still not match the URL, because HistoryIdsToScoredMatches is called with lower_raw_string (the original, unescaped string), which I assume is a mistake. Changing this to lower_unescaped_string allows the match.
>>>
Be careful when doing this. There's a comment in HistoryIdsToScoredMatches() about why the lower raw string is used:
---
// Break up the raw search string (complete with escaped URL elements) into
// 'terms' (as opposed to 'words'; see comment in HistoryItemsForTerms()).
// We only want to break up the search string on 'true' whitespace rather than
// escaped whitespace. For example, when the user types
// "colspec=ID%20Mstone Release" we get two 'terms': "colspec=id%20mstone" and
// "release".
String16Vector lower_raw_terms =
base::SplitString(lower_raw_string, base::kWhitespaceUTF16,
base::KEEP_WHITESPACE, base::SPLIT_WANT_NONEMPTY);
---
In other words, for the input "A%20B", we want to make sure that the string "A%20B" appears in the results, not merely "A" and "B". I.e., the omnibox input "A%20B" and "A B" should not be treated equivalently.
|
|
►
Sign in to add a comment |
|
Comment 1 by mgiuca@chromium.org
, Apr 12 2018