New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 831962 link

Starred by 1 user

Issue metadata

Status: Started
Owner:
Long OOO (go/where-is-mgiuca)
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows , Chrome , Mac
Pri: 3
Type: Bug

Blocked on:
issue 831321



Sign in to add a comment

Omnibox URL matching with percent-encoded Unicode characters is broken

Project Member Reported by mgiuca@chromium.org, Apr 12 2018

Issue description

Chrome Version: 67
OS: All (Desktop only?)

What steps will reproduce the problem?
(1) Visit https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal
(2) Close the tab.
(3) Type into Omnibox: "wikip%c3%a9"

What is the expected result?
https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal shows up in search results.

What happens instead?
It does not.

This is a pretty obscure case, but we have code to explicitly handle it and it's broken (twice!). The first part was discovered by mmenke@ during [1].

There are two bugs in url_index_private_data.cc:

1. The search term is decoded with UnescapeURLComponent's string16 overload, which as mmenke discovered, is completely nonsensical, as it decodes percent-escaped sequences as Latin-1. See discussion on  Issue 831321 . "wikip%c3%a9" is decoded as "wikipé" and this *does literally match* a page with "wikipé" in the title. It should be decoded as UTF-8, which can be done by converting the search term to UTF-8, using the 8-bit version of UnescapeURLComponent, then converting back to UTF-16. mmenke is doing just this in [1].

2. If you decode it correctly, as "wikipé", it will still not match the URL, because HistoryIdsToScoredMatches is called with lower_raw_string (the original, unescaped string), which I assume is a mistake. Changing this to lower_unescaped_string allows the match.

The first issue will be fixed by [1]. I'll deal with the second issue afterwards and add a test.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/1004855
 

Comment 1 by mgiuca@chromium.org, Apr 12 2018

Fix CL with test: https://chromium-review.googlesource.com/c/chromium/src/+/1009443

(I'll wait for mmenke's CL to land to avoid conflicting.)

Comment 2 by mgiuca@chromium.org, Apr 17 2018

Note: The following commit landed which fixed the first of two related bugs mentioned above (should've referenced this issue number):

commit 4858757b3c659da84c0eb8d4cccc728331eba281
Author: Matt Menke <mmenke@chromium.org>
Date:   Fri Apr 13 19:40:43 2018 +0000

    Remove UnescapeURLComponent() overload that takes a base::string16.
    
    The method has some safe-for-display safety checks that assume the input
    is UTF-8 / output is UTF-8.  This change makes it at least a little
    harder to avoid those checks, and makes output no longer vary based on
    whether passing in a std::string or a string16 (By removing the latter
    option entirely).
    
    Bug:  831321 
    Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd
    Reviewed-on: https://chromium-review.googlesource.com/1004855
    Reviewed-by: Matt Giuca <mgiuca@chromium.org>
    Reviewed-by: Justin Donnelly <jdonnelly@chromium.org>
    Commit-Queue: Matt Menke <mmenke@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#550720}

Regarding #2 in the original report:
>>>
2. If you decode it correctly, as "wikipé", it will still not match the URL, because HistoryIdsToScoredMatches is called with lower_raw_string (the original, unescaped string), which I assume is a mistake. Changing this to lower_unescaped_string allows the match.
>>>

Be careful when doing this.  There's a comment in HistoryIdsToScoredMatches() about why the lower raw string is used:
---
  // Break up the raw search string (complete with escaped URL elements) into
  // 'terms' (as opposed to 'words'; see comment in HistoryItemsForTerms()).
  // We only want to break up the search string on 'true' whitespace rather than
  // escaped whitespace.  For example, when the user types
  // "colspec=ID%20Mstone Release" we get two 'terms': "colspec=id%20mstone" and
  // "release".
  String16Vector lower_raw_terms =
      base::SplitString(lower_raw_string, base::kWhitespaceUTF16,
                        base::KEEP_WHITESPACE, base::SPLIT_WANT_NONEMPTY);
---

In other words, for the input "A%20B", we want to make sure that the string "A%20B" appears in the results, not merely "A" and "B".  I.e., the omnibox input "A%20B" and "A B" should not be treated equivalently.

Sign in to add a comment