WTF::StringUTF8Adaptor currently defaults to the kLenientUTF8Conversion mode, but arguably the default should be kStrictUTF8ConversionReplacingUnpairedSurrogatesWithFFFD.
Leaving unpaired surrogates unreplaced when encoding a String as UTF-8 is only appropriate when we are sure that the String being converted contains no unpaired surrogates; it should not be the default.
Investigation is needed into the existing usages of the adaptor, to determine whether the lenient mode is actually intended in any of the call sites that rely on the default option. After that, the default mode should be switched to kStrictUTF8ConversionReplacingUnpairedSurrogatesWithFFFD.
Quoting annevk@:
"The UTF-8 encoder as defined by the Encoding Standard operates on scalar values only, which exclude surrogates by definition. The UTF-8 byte representation doesn't allow for surrogates either, by definition.
Assuming strings in Blink are typically 16-bit lists, you'd always have to replace lone surrogates first, unless you can be certain there are no surrogates. Therefore the default UTF-8 encoding API you'd offer should probably be the one that does the replacing. And maybe offer a more complicated faster API for when you are certain the string is without surrogates."
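To make the replacing behaviour concrete, here is a minimal standalone sketch of a UTF-16 to UTF-8 encoder that substitutes U+FFFD for every unpaired surrogate. This is not Blink's implementation and does not use WTF types: the names EncodeUtf8ReplacingLoneSurrogates and AppendUtf8 are hypothetical, and std::u16string is assumed as a stand-in for a 16-bit String.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a WTF/Blink API): appends the UTF-8 bytes for one
// Unicode scalar value. The caller must never pass a surrogate code point.
void AppendUtf8(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical sketch: encodes 16-bit code units as UTF-8, replacing every
// unpaired surrogate with U+FFFD so the output is always well-formed.
std::string EncodeUtf8ReplacingLoneSurrogates(const std::u16string& input) {
  std::string out;
  out.reserve(input.size() * 3);  // Worst case: 3 bytes per code unit.
  for (std::size_t i = 0; i < input.size(); ++i) {
    char32_t cp = input[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // Lead surrogate: valid only when a trail surrogate follows it.
      if (i + 1 < input.size() && input[i + 1] >= 0xDC00 &&
          input[i + 1] <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (input[i + 1] - 0xDC00);
        ++i;  // Consume the trail surrogate as well.
      } else {
        cp = 0xFFFD;  // Unpaired lead surrogate.
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD;  // Trail surrogate with no preceding lead.
    }
    AppendUtf8(cp, out);
  }
  return out;
}
```

The extra cost over a lenient encoder is one range check per code unit, which is why a separate fast path only makes sense for callers that can prove their input contains no surrogates.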
Comment 1 by yutak@chromium.org, Oct 12