New issue
Advanced search Search tips

Issue 831321 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 831962
issue 824715



Sign in to add a comment

Calling UnescapeURLComponent with a UTF16 string is broken

Project Member Reported by mmenke@chromium.org, Apr 10 2018

Issue description

We have two versions of UnescapeURLComponent, one which takes a std::string and one that takes a string16 result.  They use the same code, behind the scenes, which means they return different results when passed an escaped UTF-8 byte sequence.

UnescapeURLComponent("%F0%9F%94%8F") will return "\xF0\x9F\x94\x8F", which is a valid UTF-8 character.  UnescapeURLComponent(L"%F0%9F%94%8F") will return L"\xF0\x9F\x94\x8F", which is not a valid UTF8 character (The correct value is L"\u1F50F").

It's not even really clear what is intended when calling the string16 version of the string.  It looks like the only consumer of the string16 version of the method is the omnibox.  The simplest solution there is just to convert to UTF8, unescape, and convert back to UTF16, instead of reworking the underlying function to do something sane there.
 

Comment 1 by mmenke@chromium.org, Apr 10 2018

Looks like url_formatter also uses the buggy method, and that will be significantly harder to fix.  :(

Comment 2 by mmenke@chromium.org, Apr 10 2018

The method has a warning about this behavior, but it also has safety checks based on decoding the string as UTF-8, which obviously don't work if the output string is not UTF-8.
Please CC me on any changes you make to this.  We've had weird UTF-8 crashes in the omnibox in the past, and I'd like to know what's happening under the scenes here if something is going to change.

Comment 4 by mmenke@chromium.org, Apr 10 2018

Will do.  It looks like url_formatter is fine - it uses net methods that wrap the UTF8 methods, so it's just a minor omnibox change.

Comment 5 by mgiuca@chromium.org, Apr 11 2018

> UnescapeURLComponent(L"%F0%9F%94%8F") will return L"\xF0\x9F\x94\x8F", which is not a valid UTF8 character

I agree. The concept of UnescapeURLComponent(string16) is inherently meaningless, because what you get back from unescaping a percent-encoded-sequence is a byte, not a UTF-16 code unit or code point.

The technical description of the current behaviour of UnescapeURLComponent(string16) is that it assumes percent-encoded sequences represent Latin-1, and converts them into UTF-16. For example, UnescapeURLComponent(L"caf%E9") will return L"café", which is correct assuming that "%E9" is Latin-1, but that is almost always an incorrect assumption.

> (The correct value is L"\u1F50F").

I don't think it would be correct for this function to presume the text is UTF-8-encoded and decode it further. So I reach the same conclusion as you: UnescapeURLComponent(string16) has to go. I will comment further on the CL (https://crrev.com/c/1004855).

(Fun fact: I dealt with almost the exact same problem updating Python's standard library to Unicode strings, around 10 years ago. See unquote and unquote_to_bytes in https://docs.python.org/3/library/urllib.parse.html).

Comment 6 by mmenke@chromium.org, Apr 11 2018

Blocking: 824715

Comment 7 by mgiuca@chromium.org, Apr 11 2018

I've left a detailed comment on the bug. There's one outstanding issue which perhaps mpearson can help out with (the big comment thread in url_index_private_data.cc).

The CL is (kind of by accident) changing the behaviour here from "decode as Latin-1" to "decode as UTF-8". I think the old behaviour was a latent bug, and the new behaviour is probably for the best.

Comment 8 by mgiuca@chromium.org, Apr 12 2018

Blocking: 831962
Project Member

Comment 9 by bugdroid1@chromium.org, Apr 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4858757b3c659da84c0eb8d4cccc728331eba281

commit 4858757b3c659da84c0eb8d4cccc728331eba281
Author: Matt Menke <mmenke@chromium.org>
Date: Fri Apr 13 19:40:43 2018

Remove UnescapeURLComponent() overload that takes a base::string16.

The method has some safe-for-display safety checks that assume the input
is UTF-8 / output is UTF-8.  This change makes it at least a little
harder to avoid those checks, and makes output no longer vary based on
whether passing in a std::string or a string16 (By removing the latter
option entirely).

Bug:  831321 
Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd
Reviewed-on: https://chromium-review.googlesource.com/1004855
Reviewed-by: Matt Giuca <mgiuca@chromium.org>
Reviewed-by: Justin Donnelly <jdonnelly@chromium.org>
Commit-Queue: Matt Menke <mmenke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#550720}
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/components/omnibox/browser/url_index_private_data.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.h
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape_unittest.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/unescape_url_component_fuzzer.cc

Status: Fixed (was: Started)
Project Member

Comment 11 by bugdroid1@chromium.org, Apr 17 2018

Labels: merge-merged-testbranch
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4858757b3c659da84c0eb8d4cccc728331eba281

commit 4858757b3c659da84c0eb8d4cccc728331eba281
Author: Matt Menke <mmenke@chromium.org>
Date: Fri Apr 13 19:40:43 2018

Remove UnescapeURLComponent() overload that takes a base::string16.

The method has some safe-for-display safety checks that assume the input
is UTF-8 / output is UTF-8.  This change makes it at least a little
harder to avoid those checks, and makes output no longer vary based on
whether passing in a std::string or a string16 (By removing the latter
option entirely).

Bug:  831321 
Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd
Reviewed-on: https://chromium-review.googlesource.com/1004855
Reviewed-by: Matt Giuca <mgiuca@chromium.org>
Reviewed-by: Justin Donnelly <jdonnelly@chromium.org>
Commit-Queue: Matt Menke <mmenke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#550720}
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/components/omnibox/browser/url_index_private_data.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.h
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape_unittest.cc
[modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/unescape_url_component_fuzzer.cc

Project Member

Comment 12 by bugdroid1@chromium.org, Apr 19 2018

Labels: merge-merged-3396
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/22f489551e5371eb2fb50a69578defc940b78d07

commit 22f489551e5371eb2fb50a69578defc940b78d07
Author: Matt Menke <mmenke@chromium.org>
Date: Thu Apr 19 16:59:01 2018

Remove UnescapeURLComponent() overload that takes a base::string16.

The method has some safe-for-display safety checks that assume the input
is UTF-8 / output is UTF-8.  This change makes it at least a little
harder to avoid those checks, and makes output no longer vary based on
whether passing in a std::string or a string16 (By removing the latter
option entirely).

TBR=mmenke@chromium.org

(cherry picked from commit 4858757b3c659da84c0eb8d4cccc728331eba281)

Bug: 824715,  831321 
Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd
Reviewed-on: https://chromium-review.googlesource.com/1004855
Reviewed-by: Matt Giuca <mgiuca@chromium.org>
Reviewed-by: Justin Donnelly <jdonnelly@chromium.org>
Commit-Queue: Matt Menke <mmenke@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#550720}
Reviewed-on: https://chromium-review.googlesource.com/1019841
Reviewed-by: Matt Menke <mmenke@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#130}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/components/omnibox/browser/url_index_private_data.cc
[modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape.cc
[modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape.h
[modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape_unittest.cc
[modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/unescape_url_component_fuzzer.cc

Sign in to add a comment