Starred by 14 users

Issue metadata

Status: WorkingAsIntended
Closed: Apr 2015


Can generate (and parse) invalid UTF-8

Reported by, Sep 9 2013

Issue description

If a String contains an unpaired Unicode surrogate (U+D800 through U+DFFF), encoding it as UTF-8 produces an invalid byte sequence. This is because UTF-8 is defined (in RFC-3629) not to allow surrogates at all. (For context: this caused us problems because we were relying on a Node.js frontend to output only valid UTF-8 regardless of the validity of user input. Everything worked fine except for incoming unpaired surrogates, at which point our backend crashed with an encoding error.)

I've attached a naive fix as `generate-valid-utf8.patch`. (I say naive because it breaks the tests, and I've not figured out how best to alter them).
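For readers without the attachment, the idea can be illustrated at the string level. This is only a sketch and not the contents of the patch: `replaceLoneSurrogates` is a hypothetical helper, and it assumes that substituting U+FFFD before encoding is acceptable to the caller.

```javascript
// Hypothetical helper (not the attached patch): replace every unpaired
// surrogate code unit with U+FFFD so the result can be encoded as valid UTF-8.
function replaceLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: well-formed only if a low surrogate follows it.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += str[i] + str[i + 1];
        i++; // skip the low half of the pair
      } else {
        out += "\ufffd";
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      out += "\ufffd";
    } else {
      out += str[i];
    }
  }
  return out;
}
```

Correctly paired surrogates pass through unchanged, so only malformed input is altered.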

Relatedly, when parsing UTF-8, surrogates are accepted. This should not be allowed (according to RFC-3629 or UNICODE-TR26); instead they should be replaced by U+FFFD in the same way as other invalid byte sequences.

I've attached this approach as `parse-utf8-only.patch`.
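To show what a strict parser has to look for, here is a minimal sketch at the byte level. It is not the attached patch, and `indexOfEncodedSurrogate` is a hypothetical name. It relies on the fact that the three-byte form of U+D800 through U+DFFF always starts with 0xED followed by a byte in the range 0xA0 to 0xBF, whereas valid UTF-8 only allows 0x80 to 0x9F after 0xED.

```javascript
// Hypothetical check (not the attached patch): return the index of the first
// UTF-8-encoded surrogate in an array of byte values, or -1 if there is none.
function indexOfEncodedSurrogate(bytes) {
  for (let i = 0; i + 1 < bytes.length; i++) {
    // 0xED followed by 0xA0..0xBF is the start of an encoded U+D800..U+DFFF.
    if (bytes[i] === 0xed && bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf) {
      return i;
    }
  }
  return -1;
}
```

This covers only the surrogate case; a full decoder would also reject overlong forms, stray continuation bytes and truncated sequences.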

That said, it may be the case that people are relying on this laxness so that they can use CESU-8 (though I don't have any evidence for this). It may be more pragmatic to ignore the security recommendations in UNICODE-TR26 and continue allowing correctly paired surrogates when decoding UTF-8 so that CESU-8 continues to work. Even in that case, we should still not parse incorrectly paired surrogates, as they are not allowed in either CESU-8 or UTF-8.

I've attached this approach as `parse-utf8-or-cesu8.patch`.
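As a rough illustration of the CESU-8-tolerant rule (again a sketch, not the attached patch), the decoder would accept six bytes only when they encode a high surrogate immediately followed by a low surrogate, and combine them into one supplementary code point. `decodeThreeBytes` and `decodeCesu8Pair` are hypothetical names.

```javascript
// Decode one three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) to a 16-bit value.
function decodeThreeBytes(b0, b1, b2) {
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

// Hypothetical helper: if the six bytes starting at index i encode a high
// surrogate followed by a low surrogate, return the combined code point.
// Return null for anything else, including unpaired or reversed surrogates.
function decodeCesu8Pair(bytes, i) {
  if (i + 5 >= bytes.length) return null;
  // Both halves must use the 0xED lead byte and proper continuation bytes.
  if (bytes[i] !== 0xed || bytes[i + 3] !== 0xed) return null;
  for (const k of [1, 2, 4, 5]) {
    if ((bytes[i + k] & 0xc0) !== 0x80) return null;
  }
  const hi = decodeThreeBytes(bytes[i], bytes[i + 1], bytes[i + 2]);
  const lo = decodeThreeBytes(bytes[i + 3], bytes[i + 4], bytes[i + 5]);
  const paired = hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff;
  if (!paired) return null;
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}
```

For example, the bytes `ED A0 BD ED B8 80` are the CESU-8 form of U+1F600, so `decodeCesu8Pair` returns `0x1F600` for them, while the same two halves in reverse order yield `null`.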

More work will be needed to make any of these patches acceptable, but I'd like to get an idea of which approach you guys would prefer to take.

See also
I'd be inclined not to fix this. If the input is malformed then it seems more helpful to preserve the data than to start inserting error characters into the data stream. The current behaviour is backwards compatible with the way that JS has always worked on malformed input and does the right thing with matched surrogate pairs.

Note that the JSON standard is pre-UTF8 and is set in stone according to its author.

Did your back end really crash or did it just throw an uncaught exception?
I feared that might be the conclusion.

One much less intrusive option was proposed at which was just to fix it for JSON encoding.

The JSON RFC says that strings are a sequence of Unicode "characters", and Unicode says that surrogates are not characters (confusingly enough, the 66 noncharacters are considered characters), though I suspect that the RFC authors didn't choose that word particularly carefully.
Status: WorkingAsIntended