New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 829868 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

UnescapeURLWithAdjustmentsImpl unescapes invalid UTF8 characters.

Project Member Reported by mmenke@chromium.org, Apr 6 2018

Issue description

UnescapeURLWithAdjustmentsImpl unescapes part of a URL, with rules about what can and cannot safely be unescaped, based largely on what's safe to display.  However, it unconditional characters that at not valid UTF-8 (Not just byte sequences that aren't mapped to characters, but byte sequences that are illegal under UTF-8).

There seem to be two types of consumers of this method.  There are those that want everything unconditionally unescaped, which typically are not going to display the data directly (Like data URLs).  These typically use all flags to unescape everything, and are the only users of SPOOFING_AND_CONTROL_CHARS.  Then there are consumers that want to display the URL.  These never use SPOOFING_AND_CONTROL_CHARS.  It seems like we should be gating unescaping invalid unicode characters on the SPOOFING_AND_CONTROL_CHARS flag.
 

Comment 1 by mgiuca@chromium.org, Apr 16 2018

Cc: mmenke@chromium.org mgiuca@chromium.org
 Issue 833255  has been merged into this issue.

Comment 2 by mgiuca@chromium.org, Apr 16 2018

Oops I just duplicated this.

Some extra info that I wrote on that bug:

net::UnescapeURLComponent(%D8%9C%85%D8%9C%85", UnescapeRule::NORMAL)

Expected output:
"%D8%9C%85%D8%9C%85"

Actual output:
"%D8%9C\x85%D8%9C\x85"

(Source: mmenke's new test)

I think (as you suggested in https://crrev.com/c/998014): SPOOFING_AND_CONTROL_CHARS is essentially a different function (it has a different "return type": it returns a byte sequence while the normal invocation returns a text string; it just happens that both of those "types" are std::string in C++). SPOOFING_AND_CONTROL_CHARS should be removed from the UnescapeRule enum, and a separate function should be written for the non-displayable 8-bit-clean version of unescaping.

Prior art (my own!): https://docs.python.org/3/library/urllib.parse.html has both unquote() which returns a string, and unquote_to_bytes() which returns a bytes.

Comment 3 by mmenke@chromium.org, Apr 16 2018

I think I'll need to keep the return type the same, unfortunately - almost all consumers would just convert it into an std::string immediately, unfortunately, and it's not just an issue of changing a couple lines where they consume the data.

Comment 4 by mgiuca@chromium.org, Apr 17 2018

#3 Sorry I wasn't clear. I didn't actually mean to change the return type. I was making an analogy that if this were a language with separate text/bytes types (e.g., Python has "str" and "bytes"; Java has "String" and "byte[]"), then it would make sense for an 8-bit clean decoder (the SPOOFING_AND_CONTROL_CHARS variant) to return bytes/byte[], while the Unicode-aware decoded (the NORMAL variant) would return str/String.

C++ uses the same type (std::string) for both (if we ignore string16 which you just got rid of), so both variants should return std::string. But "conceptually" those are different types: one is a string with UTF-8-encoded Unicode characters, the other is a byte array. That is sufficient justification for splitting it up into two separate methods.

Comment 5 by mmenke@chromium.org, Apr 17 2018

Well, there's also std::vector<char>, used in some places.  And net/ uses IOBuffers, which are also independent of strings.
Project Member

Comment 6 by bugdroid1@chromium.org, Apr 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/efde028cbcde6e668aab08d81514b532d8915d63

commit efde028cbcde6e668aab08d81514b532d8915d63
Author: Matt Menke <mmenke@chromium.org>
Date: Fri Apr 27 23:41:02 2018

Separate out methods to unescape URLs for use as binary data and as text

In a followup CL, invalid UTF-8 characters will never be unescaped by
the text methods.

Bug: 829868
Change-Id: I71df885005f5faa9f72ac5ff6cfb2789c64ae846
Reviewed-on: https://chromium-review.googlesource.com/1024771
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: Devlin <rdevlin.cronin@chromium.org>
Reviewed-by: Varun Khaneja <vakh@chromium.org>
Reviewed-by: Julian Pastarmov <pastarmovj@chromium.org>
Reviewed-by: Taiju Tsuiki <tzik@chromium.org>
Reviewed-by: Matt Giuca <mgiuca@chromium.org>
Commit-Queue: Matt Menke <mmenke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#554562}
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/components/policy/core/common/cloud/device_management_service_unittest.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/components/safe_browsing/db/v4_protocol_manager_util.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/content/browser/web_contents/web_drag_source_mac.mm
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/extensions/browser/api/web_request/form_data_parser.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/data_url.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape.h
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape_unittest.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/test/embedded_test_server/default_handlers.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/test/embedded_test_server/request_handler_util.cc
[modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/storage/common/fileapi/file_system_util.cc

Sign in to add a comment