New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 765608 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue v8:6836

Blocking:
issue 758236



Sign in to add a comment

Discrepancies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling)

Project Member Reported by marja@chromium.org, Sep 15 2017

Issue description

This is a long-standing bug, but only exposed as part of some new V8 features (see issue 758236 ).

Especially, When encountered with this byte sequence:

241 << 0b11110001 << lead of a 4-byte char
130 << 0b10000010 << cont
240 <| 0b11110000 << lead of a 4-byte char
131 <| 0b10000011 << cont
138 <| 0b10001010 << cont
131 <| 0b10000011 << cont

V8 produces:
invalid
valid (4-byte-char)

Blink produces:
invalid
invalid
invalid
invalid
invalid

(5 times... why? Knowing that the first 4-byte-lead swallows the next 4 bytes, it should produce 3, not 5.)

This can lead to arbitrary errors in scripts (parse errors, wrong functions getting called because lazy function positions are off etc.). Most commonly stuff seems to work though, because we aren't strict about function positions. (Most function beginnings are surrounded with some whitespace, so it doesn't matter if the positions is a bit off.)

---------------------------------

More info:

https://hsivonen.fi/broken-utf-8/
 issue 662822  

The Unicode Standard, 9.0.0, chapter 3.9, p. 128:

  "Whenever an unconvertible offset is reached during conversion of a code
   unit sequence:
   1. The maximal subpart at that offset should be replaced by a single
     U+FFFD.
   2. The conversion should proceed at the offset immediately after the
      maximal subpart."

Reference: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf

The "maximal subpart" is defined as follows:

The Unicode Standard, 9.0.0, chapter 3.9, p. 127:

"D93b Maximal subpart of an ill-formed subsequence: The longest code unit subsequence starting at an unconvertible offset that is either:
 a. the initial subsequence of a well-formed code unit sequence, or
 b. a subsequence of length one."

So here the maximal subpart would be 
241 << 0b11110001 << lead of a 4-byte char
130 << 0b10000010 << cont

and the V8 implementation of returning invalid + valid is correct.


---------------------------------


Action items:
- Unify the Blink and V8 decoders.
- Add a defensive CHECK which detects if they produce strings of different length. This way we'll get crash reports and can fix further problems *before* they're exposed by some random V8 features again.

 
test8.html
165 bytes View Download

Comment 1 by marja@chromium.org, Sep 15 2017

Components: Blink>JavaScript Blink>Internals>WTF

Comment 2 by marja@chromium.org, Sep 15 2017

Blocking: 758236

Comment 3 Deleted

Comment 4 by marja@chromium.org, Sep 15 2017

Debugging notes:

The first invalid char (kNonCharacter2) is produced based on the first 2 bytes (lead + cont), and the next 4 bytes produce an invalid char each.

So this is misdiagnosed - the problem is not how many invalid code points, but that Blink thinks that the second 4-byte character is invalid and V8 thinks it's valid.


static inline int DecodeNonASCIISequence(const uint8_t* sequence,
                                         unsigned length) {
  DCHECK(!IsASCII(sequence[0]));
  if (length == 2) {
     ...
  }
  if (length == 3) {
     ...
  }
  DCHECK_EQ(length, 4u);
  DCHECK_GE(sequence[0], 0xF0);
  DCHECK_LE(sequence[0], 0xF4);
  switch (sequence[0]) {
    case 0xF0:
      if (sequence[1] < 0x90 || sequence[1] > 0xBF) {
        return kNonCharacter1; <<<<< HERE
      }
      break;
    case 0xF4:
      ...
    default:
      ...
  }  
}

So maybe it's the V8 decoder that is in the wrong here after all...


Action item missing from the bug:
- Use the same UTF-8 decoder in both V8 and Blink. (But where should it live?)

Comment 5 by marja@chromium.org, Sep 15 2017

Actually, why is this code checking against 0x90 (0b10010000)?

The first legal continuation character is 0x80 ((0b10000000). And other branches are checking against 0x80.

Is this just a typo?

Comment 6 by marja@chromium.org, Sep 15 2017

Changing the diagnosis again: this is the overlong sequence detection:

http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=f0+83+8a+83&mode=bytes

0xf0 0x83 0x8a 0x83 is not the shortest hex sequence for the character in question, and thus rejected by Blink but accepted by V8.
Status: Assigned (was: Untriaged)

Comment 8 by yutak@chromium.org, Sep 19 2017

Right, we should only accept the shortest possible sequence. Seems like V8's validation is a bit sloppy.

Comment 9 by marja@chromium.org, Sep 19 2017

Blockedon: v8:6836

Comment 10 by yutak@chromium.org, Sep 19 2017

FYI, the relevant part of WHATWG's Encoding spec is at:
https://encoding.spec.whatwg.org/#utf-8-decoder

Comment 11 by marja@chromium.org, Sep 19 2017

Thanks - according to the encoding spec, Blink's behavior for returning 4 invalid characters when encountering an overlong 4-byte sequence is correct, and V8 needs to change.

Comment 12 by marja@chromium.org, Sep 20 2017

Summary: Discrepansies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling) (was: Discrapansies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling))

Comment 13 by marja@chromium.org, Sep 20 2017

Issue 764832 has been merged into this issue.

Comment 14 by marja@chromium.org, Sep 20 2017

Summary: Discrepancies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling) (was: Discrepansies between the Blink and V8 UTF-8 decoders (invalid UTF-8 handling))
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/6389b7e6b6e2d563125b7318438dc555bc832194

commit 6389b7e6b6e2d563125b7318438dc555bc832194
Author: Marja Hölttä <marja@chromium.org>
Date: Thu Sep 21 10:44:40 2017

[unicode] Return (the correct) errors for overlong / surrogate sequences.

This fix is two-fold:

1) Incremental UTF-8 decoding: Unify incorrect UTF-8 handling between V8 and
Blink.

Incremental UTF-8 decoding used to allow some overlong sequences / invalid code
points which Blink treated as errors. This caused the decoder and the Blink
UTF-8 decoder to produce a different number of bytes, resulting in random
failures when scripts were streamed (especially, this was detected by the
skipping inner functions feature which adds CHECKs against expected function
positions).

2) Non-incremental UTF-8 decoding: return the correct amount of invalid characters.

According to the encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder
), the first byte of an overlong sequence / invalid code point generates an
invalid character, and the rest of the bytes are not processed (i.e., pushed
back to the byte stream). When they're handled, they will look like lonely
continuation bytes, and will generate an invalid character each.

As a result, an overlong 4-byte sequence should generate 4 invalid characters
(not 1).

This is a potentially breaking change, since the (non-incremental) UTF-8
decoding is exposed via the API (String::NewFromUtf8). The behavioral difference
happens when the client is passing in invalid UTF-8 (containing overlong /
surrogate sequences).

However, afaict, this doesn't change the semantics of any JavaScript program:
according to the ECMAScript spec, the program is a sequence of Unicode code
points, and there's no way to invoke the UTF-8 decoding functionalities from
inside JavaScript. Though, this changes the behavior of d8 when decoding source
files which are invalid UTF-8.

This doesn't change anything related to URI decoding (it already throws
exceptions for overlong sequences / invalid code points).

BUG:  chromium:765608 , chromium:758236,  v8:5516 
Bug: 
Change-Id: Ib029f6a8e87186794b092e4e8af32d01cee3ada0
Reviewed-on: https://chromium-review.googlesource.com/671020
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Franziska Hinkelmann <franzih@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48105}
[modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/src/unicode.cc
[modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/test/cctest/parsing/test-scanner-streams.cc
[modify] https://crrev.com/6389b7e6b6e2d563125b7318438dc555bc832194/test/cctest/test-api.cc

Project Member

Comment 16 by ClusterFuzz, Sep 23 2017

Labels: ClusterFuzz-Verified
Status: Verified (was: Assigned)
ClusterFuzz testcase 5526113102331904 is verified as fixed, so closing issue as verified.

If this is incorrect, please add ClusterFuzz-Wrong label and re-open the issue.

Comment 17 by marja@chromium.org, Sep 25 2017

Status: Started (was: Verified)
There's another discrepancy, but this time it won't affect streaming, since it's between the Blink decoder and the non-streaming decoder in V8.

Byte sequence:
e0 (0b11100000)
80 (0b10000000)
(last continuation byte missing)

What Blink does:
Consume e0: start character
Got 80: detected overlong sequence; push 80 back to the stream. Return invalid.
Got 80 (the same 80 again): invalid


What V8 does:
Got e0
Got 80
Determine there aren't enough continuation bytes. Return invalid.


Project Member

Comment 18 by bugdroid1@chromium.org, Sep 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/f130bfd394c9bc901a8689a93b4753722bdca53f

commit f130bfd394c9bc901a8689a93b4753722bdca53f
Author: Marja Hölttä <marja@chromium.org>
Date: Tue Sep 26 12:08:24 2017

[unicode] Fix overlong / surrogate sequences detection some more.

Follow up to https://chromium-review.googlesource.com/671020

We still didn't return the correct amount of invalid characters, according to
the Encoding spec ( https://encoding.spec.whatwg.org/#utf-8-decoder ), when we
saw a byte sequence which was as start of an overlong / invalid sequence, but
there weren't enough continuation bytes.

A more rigorous test will follow in
https://chromium-review.googlesource.com/c/v8/v8/+/681217

BUG= chromium:765608 

Change-Id: I535670edc14d3bae144e5a9ca373f12eec78a934
Reviewed-on: https://chromium-review.googlesource.com/681674
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Reviewed-by: Daniel Vogelheim <vogelheim@chromium.org>
Reviewed-by: Yang Guo <yangguo@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48165}
[modify] https://crrev.com/f130bfd394c9bc901a8689a93b4753722bdca53f/src/unicode.cc
[modify] https://crrev.com/f130bfd394c9bc901a8689a93b4753722bdca53f/test/cctest/test-api.cc

Project Member

Comment 19 by bugdroid1@chromium.org, Sep 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/fcb89f55156395172cdc2da13d2b970e15dd2e19

commit fcb89f55156395172cdc2da13d2b970e15dd2e19
Author: Marja Hölttä <marja@chromium.org>
Date: Fri Sep 29 13:18:52 2017

[unicode] Add tests for UTF-8 decoders + minor cleanups.

Verify that both UTF-8 decoders (incremental and non-incremental one) match the
expectations.

Also cleanup / harden the UTF-8 handling code, as suggested in
https://chromium-review.googlesource.com/c/v8/v8/+/671020/ .


BUG= chromium:765608 

Change-Id: I6344d62ca15b75ac8e333421c94c4aa35ab8190d
Reviewed-on: https://chromium-review.googlesource.com/681217
Commit-Queue: Marja Hölttä <marja@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48229}
[modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/src/unicode.cc
[modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/src/unicode.h
[modify] https://crrev.com/fcb89f55156395172cdc2da13d2b970e15dd2e19/test/unittests/unicode-unittest.cc

Comment 20 by marja@chromium.org, Oct 23 2017

Status: Fixed (was: Started)
Fixed all discrepancies I know of.
Cc: mathias@chromium.org neis@chromium.org marja@chromium.org dcarney@chromium.org yangguo@chromium.org
 Issue v8:7055  has been merged into this issue.
FYI: As far as some of the discrepancies came from calling ICU, they are going away. In ICU 60, the handling of ill-formed UTF-8 byte sequences matches the WHATWG Encoding spec. (So if you were adding code to work around ICU, that's no longer necessary.)

Comment 23 by js...@chromium.org, Nov 10 2017

> As far as some of the discrepancies came from calling ICU

Neither Blink nor v8 uses ICU for UTF-8. (Chromium - non-Blink-portion - uses ICU for UTF-8 ; see  bug 777950  )

Sign in to add a comment