TextDecoder: streaming result for an invalid sequence is "delayed"?
Issue description
rsk@google.com observed the following and wanted to check whether I should ignore it or whether it's a symptom of a larger issue.
const decoder = new TextDecoder();
let data = decoder.decode(Uint8Array.of(0xE2), {stream: true});
// data is "" as expected
data = decoder.decode(Uint8Array.of("1".codePointAt()), {stream: true});
// data is "" while I would expect "�1"
data = decoder.decode(Uint8Array.of("1".codePointAt()), {stream: true});
// now it realizes that first byte is incomplete and returns "�11"
Dec 20 2017
I may be misreading this, but I think the error-signalling should be faster here based on reading https://encoding.spec.whatwg.org/#utf-8-decoder
Dec 20 2017
Yeah, this is likely an expectation mismatch between TextDecoder and TextCodecUTF8. I don't think we expose partial decodes in the platform anywhere else, so it wouldn't have been a "bug" in TextCodecUTF8 when originally written. (Also, FYI, I verified it still repros in ToT.) The fix in TextCodecUTF8 should just involve pawing through the state machine a bit while staying mindful of performance, plus a new WPT case. :)
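A per-chunk check along these lines could be a starting point for that test case. This is only a sketch: `decodeChunks`, `findMismatches`, and the `makeDecoder` parameter are hypothetical names, not WPT helpers, and the expected values are what the UTF-8 decoder steps in the Encoding Standard appear to require for the inputs in this issue.

```javascript
// Hypothetical helper (not part of WPT): feeds each chunk to a fresh decoder
// with {stream: true}, then flushes, and returns every intermediate result.
function decodeChunks(chunks, makeDecoder = () => new TextDecoder()) {
  const decoder = makeDecoder();
  const results = chunks.map((chunk) =>
    decoder.decode(Uint8Array.from(chunk), { stream: true }));
  results.push(decoder.decode()); // a final call without {stream: true} flushes
  return results;
}

// Per-chunk output expected for the two inputs discussed in this issue.
const expectations = [
  { chunks: [[0xE2], [0x31], [0x31]],
    expected: ['', '\uFFFD1', '1', ''] },
  { chunks: [[0xF0], [0x31], [0x32], [0x33]],
    expected: ['', '\uFFFD1', '2', '3', ''] },
];

// Returns the expectations that the given decoder fails to meet.
function findMismatches(makeDecoder) {
  return expectations.filter(({ chunks, expected }) =>
    JSON.stringify(decodeChunks(chunks, makeDecoder)) !== JSON.stringify(expected));
}
```

Calling `findMismatches()` with no argument exercises the platform's own TextDecoder; an empty array means each U+FFFD and the byte after it arrived in the same step as the offending input byte.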
Dec 20 2017
I think this isn't about exposing the partial decode; it's just a matter of outputting the '1' (and the preceding U+FFFD) in the same decode step where the corresponding input byte appeared, since the appearance of '1' in the input stream immediately forces the error and '1' does not start a multi-byte sequence. In other words, our decoder behaves as though it still honors the UTF-8 multi-byte sequence length announced by the 0xE2 even after encountering a following byte that is not allowed in the sequence.
The same thing happens for longer sequences:
const decoder = new TextDecoder();
const data = [];
data.push(decoder.decode(Uint8Array.of(0xF0), {stream: true}));
// data is [""] as expected
data.push(decoder.decode(Uint8Array.of('1'.charCodeAt()), {stream: true}));
// data is ["", ""] while I would expect ["", "�1"]
data.push(decoder.decode(Uint8Array.of('2'.charCodeAt()), {stream: true}));
// data is ["", "", ""] while I would expect ["", "�1", "2"]
data.push(decoder.decode(Uint8Array.of('3'.charCodeAt()), {stream: true}));
// data is ["", "", "", "�123"] while I would expect ["", "�1", "2", "3"];
// only now does it realize that the first byte is incomplete, and it returns "�123" all at once
data
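The expected per-chunk behavior can be sketched as a small standalone decoder. This is a minimal illustration of the UTF-8 decoder steps at https://encoding.spec.whatwg.org/#utf-8-decoder, not Blink's TextCodecUTF8. The class name `StreamingUtf8Decoder` is hypothetical, and BOM handling and fatal mode are left out. The key step is that a byte outside the allowed continuation range emits U+FFFD right away and is then reprocessed as the start of a new sequence.

```javascript
// Hypothetical, simplified model of the Encoding Standard's UTF-8 decoder.
class StreamingUtf8Decoder {
  constructor() { this.reset(); }

  reset() {
    this.codePoint = 0;
    this.bytesSeen = 0;
    this.bytesNeeded = 0;
    this.lower = 0x80; // allowed range for the next continuation byte
    this.upper = 0xBF;
  }

  decode(bytes = [], { stream = false } = {}) {
    let out = '';
    for (let i = 0; i < bytes.length; i++) {
      const byte = bytes[i];
      if (this.bytesNeeded === 0) {
        if (byte <= 0x7F) {
          out += String.fromCharCode(byte);
        } else if (byte >= 0xC2 && byte <= 0xDF) {
          this.bytesNeeded = 1;
          this.codePoint = byte & 0x1F;
        } else if (byte >= 0xE0 && byte <= 0xEF) {
          if (byte === 0xE0) this.lower = 0xA0; // reject overlong forms
          if (byte === 0xED) this.upper = 0x9F; // reject surrogates
          this.bytesNeeded = 2;
          this.codePoint = byte & 0x0F;
        } else if (byte >= 0xF0 && byte <= 0xF4) {
          if (byte === 0xF0) this.lower = 0x90; // reject overlong forms
          if (byte === 0xF4) this.upper = 0x8F; // reject > U+10FFFF
          this.bytesNeeded = 3;
          this.codePoint = byte & 0x07;
        } else {
          out += '\uFFFD'; // byte can never start a sequence
        }
        continue;
      }
      if (byte < this.lower || byte > this.upper) {
        // Invalid continuation: signal the error now, then reprocess this
        // byte from the initial state instead of waiting for more input.
        this.reset();
        out += '\uFFFD';
        i--;
        continue;
      }
      this.lower = 0x80;
      this.upper = 0xBF;
      this.codePoint = (this.codePoint << 6) | (byte & 0x3F);
      this.bytesSeen++;
      if (this.bytesSeen === this.bytesNeeded) {
        out += String.fromCodePoint(this.codePoint);
        this.reset();
      }
    }
    if (!stream) {
      // End of stream: a pending partial sequence becomes one U+FFFD.
      if (this.bytesNeeded !== 0) out += '\uFFFD';
      this.reset();
    }
    return out;
  }
}
```

Feeding this sketch 0xE2, then '1', then '1' as three streaming chunks yields "", "�1", "1", and the 0xF0 case above yields "", "�1", "2", "3".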
Jan 18 2018
iOS doesn't use Blink, so removing that platform.
Jan 19 2018
Marking "GoodFirstBug" because in theory this can be solved without much more context: (1) the test is straightforward (2) the code change will be constrained to TextCodecUTF8 and (3) there are plenty of potential reviewers.
Yesterday
(47 hours ago)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Today
(22 hours ago)
Still a good bug. (cc: domfarolino@ in case there's interest in pursuing a fix here)
Comment 1 by jsb...@chromium.org, Dec 20 2017
Firefox behaves as expected ("", "�1", "1"), so it's an interop issue at least.