Chrome Version: 68
OS: All
net::UnescapeURLComponent("%D8%9C%85%D8%9C%85", UnescapeRule::NORMAL)
Expected output:
"%D8%9C%85%D8%9C%85"
Actual output:
"%D8%9C\x85%D8%9C\x85"
This method is how you format a URL for display: spoofing characters, control characters, and other "non-displayable" characters should be left percent-encoded. Yet escapes that do not form well-formed UTF-8 (e.g., the lone continuation byte "%85") are decoded by this method. As a result, the returned string can contain ill-formed UTF-8 byte sequences, which may then be displayed.
The method should always return a valid UTF-8 string, so it should simply not decode those byte sequences.
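To make the proposed behavior concrete, here is a minimal sketch in Python (not the Chromium implementation). The names unescape_for_display and _is_displayable are hypothetical, and the "non-displayable" check is an assumption: it treats Unicode control (Cc) and format (Cf) characters as non-displayable, which stands in for Chromium's real list. An escape is decoded only when it is part of a well-formed UTF-8 sequence; everything else is left percent-encoded exactly as written.

```python
import re
import unicodedata

# One or more consecutive percent-escapes, e.g. "%D8%9C%85".
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _is_displayable(ch):
    # Assumption: control (Cc) and format (Cf) characters stay encoded.
    # Chromium's actual rules are more detailed than this.
    return unicodedata.category(ch) not in ("Cc", "Cf")


def _decode_run(match):
    text = match.group(0)                      # e.g. "%D8%9C%85"
    raw = bytes.fromhex(text.replace("%", ""))
    out = []
    i = 0
    while i < len(raw):
        # Find the UTF-8 sequence (1 to 4 bytes) starting at byte i, if any.
        for n in (1, 2, 3, 4):
            try:
                ch = raw[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            break
        else:
            ch, n = None, 1                    # ill-formed: skip one byte
        if ch is not None and _is_displayable(ch):
            out.append(ch)
        else:
            # Keep the original escape text (each escape is 3 characters).
            out.append(text[3 * i:3 * (i + n)])
        i += n
    return "".join(out)


def unescape_for_display(escaped):
    """Decode only escapes that form well-formed, displayable UTF-8."""
    return _ESCAPE_RUN.sub(_decode_run, escaped)


print(unescape_for_display("%D8%9C%85%D8%9C%85"))  # %D8%9C%85%D8%9C%85
print(unescape_for_display("caf%C3%A9%85"))        # café%85
```

With this approach the result is always a valid text string: "%D8%9C" stays encoded because it is U+061C (a format character), and "%85" stays encoded because it is not well-formed UTF-8 on its own.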
EXCEPTION: When called with UnescapeRule::SPOOFING_AND_CONTROL_CHARS, the function has a completely different purpose, which is to decode all escape sequences irrespective of whether they are displayable or even legal. This version should not consider UTF-8 at all, and simply return a byte sequence.
Actually, as pointed out by mmenke in https://crrev.com/c/998014, SPOOFING_AND_CONTROL_CHARS is essentially a different function (it has a different "return type": it returns a byte sequence while the normal invocation returns a text string; it just happens that both of those "types" are std::string in C++). SPOOFING_AND_CONTROL_CHARS should be removed from the UnescapeRule enum, and a separate function should be written for the non-displayable 8-bit-clean version of unescaping.
Prior art (my own!): https://docs.python.org/3/library/urllib.parse.html has both unquote(), which returns a string, and unquote_to_bytes(), which returns a bytes object.
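For illustration, this is how the two Python functions treat the input from this report. Note that unquote() does not leave the ill-formed byte percent-encoded; with its default errors="replace" it substitutes U+FFFD, so it illustrates only the text-versus-bytes split, not the exact behavior proposed above.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%D8%9C%85"

# Text version: always returns a valid str. The lone 0x85 byte is not
# well-formed UTF-8, so it becomes U+FFFD (REPLACEMENT CHARACTER).
text = unquote(escaped)
print(repr(text))       # '\u061c\ufffd'

# Bytes version: 8-bit clean, no UTF-8 interpretation at all.
raw = unquote_to_bytes(escaped)
print(raw)              # b'\xd8\x9c\x85'

# With errors="strict", the text version refuses ill-formed input.
try:
    unquote(escaped, errors="strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```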
Comment 1 by mgiuca@chromium.org, Apr 16 2018
Status: Duplicate (was: Assigned)