Calling UnescapeURLComponent with a UTF16 string is broken |
||||||
Issue description
We have two versions of UnescapeURLComponent, one which takes a std::string and one that takes a string16 result. They use the same code, behind the scenes, which means they return different results when passed an escaped UTF-8 byte sequence.
UnescapeURLComponent("%F0%9F%94%8F") will return "\xF0\x9F\x94\x8F", which is a valid UTF-8 character. UnescapeURLComponent(L"%F0%9F%94%8F") will return L"\xF0\x9F\x94\x8F", which is not a valid UTF8 character (The correct value is L"\u1F50F").
It's not even really clear what is intended when calling the string16 version of the string. It looks like the only consumer of the string16 version of the method is the omnibox. The simplest solution there is just to convert to UTF8, unescape, and convert back to UTF16, instead of reworking the underlying function to do something sane there.
,
Apr 10 2018
The method has a warning about this behavior, but it also has safety checks based on decoding the string as UTF-8, which obviously don't work if the output string is not UTF-8.
,
Apr 10 2018
Please CC me on any changes you make to this. We've had weird UTF-8 crashes in the omnibox in the past, and I'd like to know what's happening under the scenes here if something is going to change.
,
Apr 10 2018
Will do. It looks like url_formatter is fine - it uses net methods that wrap the UTF8 methods, so it's just a minor omnibox change.
,
Apr 11 2018
> UnescapeURLComponent(L"%F0%9F%94%8F") will return L"\xF0\x9F\x94\x8F", which is not a valid UTF8 character I agree. The concept of UnescapeURLComponent(string16) is inherently meaningless, because what you get back from unescaping a percent-encoded-sequence is a byte, not a UTF-16 code unit or code point. The technical description of the current behaviour of UnescapeURLComponent(string16) is that it assumes percent-encoded sequences represent Latin-1, and converts them into UTF-16. For example, UnescapeURLComponent(L"caf%E9") will return L"café", which is correct assuming that "%E9" is Latin-1, but that is almost always an incorrect assumption. > (The correct value is L"\u1F50F"). I don't think it would be correct for this function to presume the text is UTF-8-encoded and decode it further. So I reach the same conclusion as you: UnescapeURLComponent(string16) has to go. I will comment further on the CL (https://crrev.com/c/1004855). (Fun fact: I dealt with almost the exact same problem updating Python's standard library to Unicode strings, around 10 years ago. See unquote and unquote_to_bytes in https://docs.python.org/3/library/urllib.parse.html).
,
Apr 11 2018
,
Apr 11 2018
I've left a detailed comment on the bug. There's one outstanding issue which perhaps mpearson can help out with (the big comment thread in url_index_private_data.cc). The CL is (kind of by accident) changing the behaviour here from "decode as Latin-1" to "decode as UTF-8". I think the old behaviour was a latent bug, and the new behaviour is probably for the best.
,
Apr 12 2018
,
Apr 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4858757b3c659da84c0eb8d4cccc728331eba281 commit 4858757b3c659da84c0eb8d4cccc728331eba281 Author: Matt Menke <mmenke@chromium.org> Date: Fri Apr 13 19:40:43 2018 Remove UnescapeURLComponent() overload that takes a base::string16. The method has some safe-for-display safety checks that assume the input is UTF-8 / output is UTF-8. This change makes it at least a little harder to avoid those checks, and makes output no longer vary based on whether passing in a std::string or a string16 (By removing the latter option entirely). Bug: 831321 Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd Reviewed-on: https://chromium-review.googlesource.com/1004855 Reviewed-by: Matt Giuca <mgiuca@chromium.org> Reviewed-by: Justin Donnelly <jdonnelly@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/heads/master@{#550720} [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/components/omnibox/browser/url_index_private_data.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.h [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape_unittest.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/unescape_url_component_fuzzer.cc
,
Apr 13 2018
,
Apr 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4858757b3c659da84c0eb8d4cccc728331eba281 commit 4858757b3c659da84c0eb8d4cccc728331eba281 Author: Matt Menke <mmenke@chromium.org> Date: Fri Apr 13 19:40:43 2018 Remove UnescapeURLComponent() overload that takes a base::string16. The method has some safe-for-display safety checks that assume the input is UTF-8 / output is UTF-8. This change makes it at least a little harder to avoid those checks, and makes output no longer vary based on whether passing in a std::string or a string16 (By removing the latter option entirely). Bug: 831321 Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd Reviewed-on: https://chromium-review.googlesource.com/1004855 Reviewed-by: Matt Giuca <mgiuca@chromium.org> Reviewed-by: Justin Donnelly <jdonnelly@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/heads/master@{#550720} [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/components/omnibox/browser/url_index_private_data.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape.h [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/escape_unittest.cc [modify] https://crrev.com/4858757b3c659da84c0eb8d4cccc728331eba281/net/base/unescape_url_component_fuzzer.cc
,
Apr 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/22f489551e5371eb2fb50a69578defc940b78d07 commit 22f489551e5371eb2fb50a69578defc940b78d07 Author: Matt Menke <mmenke@chromium.org> Date: Thu Apr 19 16:59:01 2018 Remove UnescapeURLComponent() overload that takes a base::string16. The method has some safe-for-display safety checks that assume the input is UTF-8 / output is UTF-8. This change makes it at least a little harder to avoid those checks, and makes output no longer vary based on whether passing in a std::string or a string16 (By removing the latter option entirely). TBR=mmenke@chromium.org (cherry picked from commit 4858757b3c659da84c0eb8d4cccc728331eba281) Bug: 824715, 831321 Change-Id: Ib39a2cccd71861213341e92932525e8ecafc60cd Reviewed-on: https://chromium-review.googlesource.com/1004855 Reviewed-by: Matt Giuca <mgiuca@chromium.org> Reviewed-by: Justin Donnelly <jdonnelly@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#550720} Reviewed-on: https://chromium-review.googlesource.com/1019841 Reviewed-by: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/branch-heads/3396@{#130} Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428} [modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/components/omnibox/browser/url_index_private_data.cc [modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape.cc [modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape.h [modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/escape_unittest.cc [modify] https://crrev.com/22f489551e5371eb2fb50a69578defc940b78d07/net/base/unescape_url_component_fuzzer.cc |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by mmenke@chromium.org
, Apr 10 2018