Discrepancies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling) |
||||||||||
Issue descriptionThis is a long-standing bug, but only exposed as part of some new V8 features (see issue 758236 ). Especially, When encountered with this byte sequence: 241 << 0b11110001 << lead of a 4-byte char 130 << 0b10000010 << cont 240 <| 0b11110000 << lead of a 4-byte char 131 <| 0b10000011 << cont 138 <| 0b10001010 << cont 131 <| 0b10000011 << cont V8 produces: invalid valid (4-byte-char) Blink produces: invalid invalid invalid invalid invalid (5 times... why? Knowing that the first 4-byte-lead swallows the next 4 bytes, it should produce 3, not 5.) This can lead to arbitrary errors in scripts (parse errors, wrong functions getting called because lazy function positions are off etc.). Most commonly stuff seems to work though, because we aren't strict about function positions. (Most function beginnings are surrounded with some whitespace, so it doesn't matter if the positions is a bit off.) --------------------------------- More info: https://hsivonen.fi/broken-utf-8/ issue 662822 The Unicode Standard, 9.0.0, chapter 3.9, p. 128: "Whenever an unconvertible offset is reached during conversion of a code unit sequence: 1. The maximal subpart at that offset should be replaced by a single U+FFFD. 2. The conversion should proceed at the offset immediately after the maximal subpart." Reference: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf The "maximal subpart" is defined as follows: The Unicode Standard, 9.0.0, chapter 3.9, p. 127: "D93b Maximal subpart of an ill-formed subsequence: The longest code unit subsequence starting at an unconvertible offset that is either: a. the initial subsequence of a well-formed code unit sequence, or b. a subsequence of length one." So here the maximal subpart would be 241 << 0b11110001 << lead of a 4-byte char 130 << 0b10000010 << cont and the V8 implementation of returning invalid + valid is correct. --------------------------------- Action items: - Unify the Blink and V8 decoders. - Add a defensive CHECK which detects if they produce strings of different length. This way we'll get crash reports and can fix further problems *before* they're exposed by some random V8 features again.
,
Sep 15 2017
,
Sep 15 2017
Debugging notes:
The first invalid char (kNonCharacter2) is produced based on the first 2 bytes (lead + cont), and the next 4 bytes produce an invalid char each.
So this is misdiagnosed - the problem is not how many invalid code points, but that Blink thinks that the second 4-byte character is invalid and V8 thinks it's valid.
static inline int DecodeNonASCIISequence(const uint8_t* sequence,
unsigned length) {
DCHECK(!IsASCII(sequence[0]));
if (length == 2) {
...
}
if (length == 3) {
...
}
DCHECK_EQ(length, 4u);
DCHECK_GE(sequence[0], 0xF0);
DCHECK_LE(sequence[0], 0xF4);
switch (sequence[0]) {
case 0xF0:
if (sequence[1] < 0x90 || sequence[1] > 0xBF) {
return kNonCharacter1; <<<<< HERE
}
break;
case 0xF4:
...
default:
...
}
}
So maybe it's the V8 decoder that is in the wrong here after all...
Action item missing from the bug:
- Use the same UTF-8 decoder in both V8 and Blink. (But where should it live?)
,
Sep 15 2017
Actually, why is this code checking against 0x90 (0b10010000)? The first legal continuation character is 0x80 ((0b10000000). And other branches are checking against 0x80. Is this just a typo?
,
Sep 15 2017
Changing the diagnosis again: this is the overlong sequence detection: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=f0+83+8a+83&mode=bytes 0xf0 0x83 0x8a 0x83 is not the shortest hex sequence for the character in question, and thus rejected by Blink but accepted by V8.
,
Sep 18 2017
,
Sep 19 2017
Right, we should only accept the shortest possible sequence. Seems like V8's validation is a bit sloppy.
,
Sep 19 2017
,
Sep 19 2017
FYI, the relevant part of WHATWG's Encoding spec is at: https://encoding.spec.whatwg.org/#utf-8-decoder
,
Sep 19 2017
Thanks - according to the encoding spec, Blink's behavior for returning 4 invalid characters when encountering an overlong 4-byte sequence is correct, and V8 needs to change.
,
Sep 20 2017
,
Sep 20 2017
Issue 764832 has been merged into this issue.
,
Sep 20 2017
,
Sep 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/6389b7e6b6e2d563125b7318438dc555bc832194 commit 6389b7e6b6e2d563125b7318438dc555bc832194 Author: Marja Hölttä <marja@chromium.org> Date: Thu Sep 21 10:44:40 2017 [unicode] Return (the correct) errors for overlong / surrogate sequences. This fix is two-fold: 1) Incremental UTF-8 decoding: Unify incorrect UTF-8 handling between V8 and Blink. Incremental UTF-8 decoding used to allow some overlong sequences / invalid code points which Blink treated as errors. This caused the decoder and the Blink UTF-8 decoder to produce a different number of bytes, resulting in random failures when scripts were streamed (especially, this was detected by the skipping inner functions feature which adds CHECKs against expected function positions). 2) Non-incremental UTF-8 decoding: return the correct amount of invalid characters. According to the encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder ), the first byte of an overlong sequence / invalid code point generates an invalid character, and the rest of the bytes are not processed (i.e., pushed back to the byte stream). When they're handled, they will look like lonely continuation bytes, and will generate an invalid character each. As a result, an overlong 4-byte sequence should generate 4 invalid characters (not 1). This is a potentially breaking change, since the (non-incremental) UTF-8 decoding is exposed via the API (String::NewFromUtf8). The behavioral difference happens when the client is passing in invalid UTF-8 (containing overlong / surrogate sequences). However, afaict, this doesn't change the semantics of any JavaScript program: according to the ECMAScript spec, the program is a sequence of Unicode code points, and there's no way to invoke the UTF-8 decoding functionalities from inside JavaScript. Though, this changes the behavior of d8 when decoding source files which are invalid UTF-8. This doesn't change anything related to URI decoding (it already throws exceptions for overlong sequences / invalid code points). BUG: chromium:765608 , chromium:758236, v8:5516 Bug: Change-Id: Ib029f6a8e87186794b092e4e8af32d01cee3ada0 Reviewed-on: https://chromium-review.googlesource.com/671020 Commit-Queue: Marja Hölttä <marja@chromium.org> Reviewed-by: Franziska Hinkelmann <franzih@chromium.org> Reviewed-by: Camillo Bruni <cbruni@chromium.org> Cr-Commit-Position: refs/heads/master@{#48105} [modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/src/unicode.cc [modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/test/cctest/parsing/test-scanner-streams.cc [modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/test/cctest/test-api.cc
,
Sep 23 2017
ClusterFuzz testcase 5526113102331904 is verified as fixed, so closing issue as verified. If this is incorrect, please add ClusterFuzz-Wrong label and re-open the issue.
,
Sep 25 2017
There's another discrepancy, but this time it won't affect streaming, since it's between the Blink decoder and the non-streaming decoder in V8. Byte sequence: e0 (0b11100000) 80 (0b10000000) (last continuation byte missing) What Blink does: Consume e0: start character Got 80: detected overlong sequence; push 80 back to the stream. Return invalid. Got 80 (the same 80 again): invalid What V8 does: Got e0 Got 80 Determine there aren't enough continuation bytes. Return invalid.
,
Sep 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/f130bfd394c9bc901a8689a93b4753722bdca53f commit f130bfd394c9bc901a8689a93b4753722bdca53f Author: Marja Hölttä <marja@chromium.org> Date: Tue Sep 26 12:08:24 2017 [unicode] Fix overlong / surrogate sequences detection some more. Follow up to https://chromium-review.googlesource.com/671020 We still didn't return the correct amount of invalid characters, according to the Encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder ), when we saw a byte sequence which was as start of an overlong / invalid sequence, but there weren't enough continuation bytes. A more rigorous test will follow in https://chromium-review.googlesource.com/c/v8/v8/+/681217 BUG= chromium:765608 Change-Id: I535670edc14d3bae144e5a9ca373f12eec78a934 Reviewed-on: https://chromium-review.googlesource.com/681674 Commit-Queue: Marja Hölttä <marja@chromium.org> Reviewed-by: Camillo Bruni <cbruni@chromium.org> Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org> Reviewed-by: Yang Guo <yangguo@chromium.org> Cr-Commit-Position: refs/heads/master@{#48165} [modify] https://crrev.com/f130bfd394c9bc901a8689a93b4753722bdca53f/src/unicode.cc [modify] https://crrev.com/f130bfd394c9bc901a8689a93b4753722bdca53f/test/cctest/test-api.cc
,
Sep 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/fcb89f55156395172cdc2da13d2b970e15dd2e19 commit fcb89f55156395172cdc2da13d2b970e15dd2e19 Author: Marja Hölttä <marja@chromium.org> Date: Fri Sep 29 13:18:52 2017 [unicode] Add tests for UTF-8 decoders + minor cleanups. Verify that both UTF-8 decoders (incremental and non-incremental one) match the expectations. Also cleanup / harden the UTF-8 handling code, as suggested in https://chromium-review.googlesource.com/c/v8/v8/+/671020/ . BUG= chromium:765608 Change-Id: I6344d62ca15b75ac8e333421c94c4aa35ab8190d Reviewed-on: https://chromium-review.googlesource.com/681217 Commit-Queue: Marja Hölttä <marja@chromium.org> Reviewed-by: Camillo Bruni <cbruni@chromium.org> Cr-Commit-Position: refs/heads/master@{#48229} [modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/src/unicode.cc [modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/src/unicode.h [modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/test/unittests/unicode-unittest.cc
,
Oct 23 2017
Fixed all discrepancies I know of.
,
Nov 9 2017
Issue v8:7055 has been merged into this issue.
,
Nov 9 2017
FYI: As far as some of the discrepancies came from calling ICU, they are going away. In ICU 60, the handling of ill-formed UTF-8 byte sequences matches the WHATWG Encoding spec. (So if you were adding code to work around ICU, that's no longer necessary.)
,
Nov 10 2017
> As far as some of the discrepancies came from calling ICU Neither Blink nor v8 uses ICU for UTF-8. (Chromium - non-Blink-portion - uses ICU for UTF-8 ; see bug 777950 ) |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by marja@chromium.org
, Sep 15 2017