
Issue 893489


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Investigate usage of UTF8 adaptor without replacing surrogate

Project Member Reported by andypaicu@chromium.org, Oct 9

Issue description

The default conversion mode of WTF::StringUTF8Adaptor is currently kLenientUTF8Conversion, but arguably it should be kStrictUTF8ConversionReplacingUnpairedSurrogatesWithFFFD.

Skipping the replacement of unpaired surrogates when encoding a String as UTF-8 should only be done when we are sure that the String being converted contains no unpaired surrogates; it should not be the default.

Investigation is needed into the existing usages of the adaptor and whether the lenient mode is actually intended in any of the usages that rely on the default option. After that, the default mode should be switched to kStrictUTF8ConversionReplacingUnpairedSurrogatesWithFFFD.

Quoting annevk@:
"The UTF-8 encoder as defined by the Encoding Standard operates on scalar values only, which exclude surrogates by definition. The UTF-8 byte representation doesn't allow for surrogates either, by definition.

Assuming strings in Blink are typically 16-bit lists, you'd always have to replace lone surrogates first, unless you can be certain there are no surrogates. Therefore the default UTF-8 encoding API you'd offer should probably be the one that does the replacing. And maybe offer a more complicated faster API for when you are certain the string is without surrogates."
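
To make the behaviour described in the quote concrete, here is a minimal standalone sketch of a UTF-16 to UTF-8 encoder that replaces every unpaired surrogate with U+FFFD before encoding. It is not Blink's StringUTF8Adaptor implementation: the function name EncodeUtf8ReplacingLoneSurrogates and the use of std::u16string are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of WTF): encodes UTF-16 code units as UTF-8,
// substituting U+FFFD for every unpaired surrogate so that only scalar
// values ever reach the UTF-8 encoding step.
std::string EncodeUtf8ReplacingLoneSurrogates(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // Lead surrogate: only valid when a trail surrogate follows it.
      if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        ++i;  // Consume the trail surrogate as well.
      } else {
        cp = 0xFFFD;  // Unpaired lead surrogate.
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD;  // Trail surrogate with no preceding lead.
    }

    // Standard UTF-8 encoding of a scalar value.
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

With this sketch, a lone 0xD800 code unit encodes as the three bytes EF BF BD (U+FFFD), while a well-formed pair such as 0xD83D 0xDE00 encodes as the four-byte sequence for U+1F600. A lenient encoder would instead emit a three-byte encoding of the surrogate itself, which is not valid UTF-8.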
 
Status: Available (was: Untriaged)
This sounds like a fair request, but we probably need to care about how V8 does
this too. We basically want those conversions to be compatible with V8.
