UnescapeURLWithAdjustmentsImpl unescapes invalid UTF8 characters. |
|
Issue descriptionUnescapeURLWithAdjustmentsImpl unescapes part of a URL, with rules about what can and cannot safely be unescaped, based largely on what's safe to display. However, it unconditional characters that at not valid UTF-8 (Not just byte sequences that aren't mapped to characters, but byte sequences that are illegal under UTF-8). There seem to be two types of consumers of this method. There are those that want everything unconditionally unescaped, which typically are not going to display the data directly (Like data URLs). These typically use all flags to unescape everything, and are the only users of SPOOFING_AND_CONTROL_CHARS. Then there are consumers that want to display the URL. These never use SPOOFING_AND_CONTROL_CHARS. It seems like we should be gating unescaping invalid unicode characters on the SPOOFING_AND_CONTROL_CHARS flag.
,
Apr 16 2018
Oops I just duplicated this. Some extra info that I wrote on that bug: net::UnescapeURLComponent(%D8%9C%85%D8%9C%85", UnescapeRule::NORMAL) Expected output: "%D8%9C%85%D8%9C%85" Actual output: "%D8%9C\x85%D8%9C\x85" (Source: mmenke's new test) I think (as you suggested in https://crrev.com/c/998014): SPOOFING_AND_CONTROL_CHARS is essentially a different function (it has a different "return type": it returns a byte sequence while the normal invocation returns a text string; it just happens that both of those "types" are std::string in C++). SPOOFING_AND_CONTROL_CHARS should be removed from the UnescapeRule enum, and a separate function should be written for the non-displayable 8-bit-clean version of unescaping. Prior art (my own!): https://docs.python.org/3/library/urllib.parse.html has both unquote() which returns a string, and unquote_to_bytes() which returns a bytes.
,
Apr 16 2018
I think I'll need to keep the return type the same, unfortunately - almost all consumers would just convert it into an std::string immediately, unfortunately, and it's not just an issue of changing a couple lines where they consume the data.
,
Apr 17 2018
#3 Sorry I wasn't clear. I didn't actually mean to change the return type. I was making an analogy that if this were a language with separate text/bytes types (e.g., Python has "str" and "bytes"; Java has "String" and "byte[]"), then it would make sense for an 8-bit clean decoder (the SPOOFING_AND_CONTROL_CHARS variant) to return bytes/byte[], while the Unicode-aware decoded (the NORMAL variant) would return str/String. C++ uses the same type (std::string) for both (if we ignore string16 which you just got rid of), so both variants should return std::string. But "conceptually" those are different types: one is a string with UTF-8-encoded Unicode characters, the other is a byte array. That is sufficient justification for splitting it up into two separate methods.
,
Apr 17 2018
Well, there's also std::vector<char>, used in some places. And net/ uses IOBuffers, which are also independent of strings.
,
Apr 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/efde028cbcde6e668aab08d81514b532d8915d63 commit efde028cbcde6e668aab08d81514b532d8915d63 Author: Matt Menke <mmenke@chromium.org> Date: Fri Apr 27 23:41:02 2018 Separate out methods to unescape URLs for use as binary data and as text In a followup CL, invalid UTF-8 characters will never be unescaped by the text methods. Bug: 829868 Change-Id: I71df885005f5faa9f72ac5ff6cfb2789c64ae846 Reviewed-on: https://chromium-review.googlesource.com/1024771 Reviewed-by: John Abd-El-Malek <jam@chromium.org> Reviewed-by: Devlin <rdevlin.cronin@chromium.org> Reviewed-by: Varun Khaneja <vakh@chromium.org> Reviewed-by: Julian Pastarmov <pastarmovj@chromium.org> Reviewed-by: Taiju Tsuiki <tzik@chromium.org> Reviewed-by: Matt Giuca <mgiuca@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/heads/master@{#554562} [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/components/policy/core/common/cloud/device_management_service_unittest.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/components/safe_browsing/db/v4_protocol_manager_util.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/content/browser/web_contents/web_drag_source_mac.mm [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/extensions/browser/api/web_request/form_data_parser.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/data_url.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape.h [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/base/escape_unittest.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/test/embedded_test_server/default_handlers.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/net/test/embedded_test_server/request_handler_util.cc [modify] https://crrev.com/efde028cbcde6e668aab08d81514b532d8915d63/storage/common/fileapi/file_system_util.cc |
|
►
Sign in to add a comment |
|
Comment 1 by mgiuca@chromium.org
, Apr 16 2018