These days, most source code is assumed to be UTF-8 encoded text.
Therefore, it's reasonable to make a constant out of such text,
e.g., const char* kFoo = "I ❤ Unicode.";
Unfortunately, the WTF::String(const char*) constructor interprets the provided string literal as an 8-bit character string (e.g. ISO Latin-1) and makes a mess of it. This constructor is implicit, so it also gets called for URL construction, for EXPECT_EQ in unit tests, and in other places. It is a bit of a tripwire.
Knowing about the problematic constructor, one can properly write:
WTF::String::FromUTF8("I ❤ Unicode.") and that works.
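To illustrate the difference between the two interpretations, here is a minimal standalone sketch. It does not use WTF::String; DecodeAsLatin1, DecodeAsUtf8 and DemonstrateMismatch are hypothetical stand-ins written only for this example, and the UTF-8 decoder assumes well-formed input. The heart is spelled out as its UTF-8 bytes (E2 9D A4) so the result does not depend on how the source file is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an 8-bit (Latin-1) interpretation:
// every byte becomes one code point with the same numeric value.
std::u32string DecodeAsLatin1(const std::string& bytes) {
  std::u32string out;
  for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
  return out;
}

// Hypothetical minimal UTF-8 decoder. Assumes well-formed input and
// performs no validation; a real decoder must reject bad sequences.
std::u32string DecodeAsUtf8(const std::string& bytes) {
  std::u32string out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    unsigned char lead = static_cast<unsigned char>(bytes[i]);
    // Sequence length is determined by the lead byte.
    std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte.
    char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Usage: the same 14 bytes yield 12 code points as UTF-8, but 14 as Latin-1.
void DemonstrateMismatch() {
  const std::string bytes = "I \xE2\x9D\xA4 Unicode.";
  assert(DecodeAsUtf8(bytes).size() == 12);    // heart is one code point
  assert(DecodeAsUtf8(bytes)[2] == 0x2764);    // U+2764
  assert(DecodeAsLatin1(bytes).size() == 14);  // heart became three
  assert(DecodeAsLatin1(bytes)[2] == 0xE2);    // first byte read as U+00E2
}
```

Under the 8-bit interpretation the heart turns into three unrelated characters, which is the kind of corruption described above.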
Still, I wanted to raise this issue since maybe it's an area for improvement for the string library. E.g., there could be a DCHECK, or it could just do UTF-8 interpretation if any of the high bits are set (java.lang.String may do it this way, I think), or maybe there are other choices that I haven't thought of.
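As a rough sketch of what the high-bit test behind either suggestion could look like: the helper names below are hypothetical and not part of WTF, and a plain assert stands in for DCHECK so the snippet compiles on its own.

```cpp
#include <cassert>

// Hypothetical helper: returns true if any byte of the NUL-terminated
// string has its high bit set, i.e. the string is not pure ASCII.
bool ContainsNonAscii(const char* s) {
  for (; *s; ++s) {
    if (static_cast<unsigned char>(*s) & 0x80) return true;
  }
  return false;
}

// Hypothetical debug check a const char* constructor could run first.
// assert is used here only as a stand-in for DCHECK.
void CheckLiteralIsAscii(const char* s) {
  assert(!ContainsNonAscii(s));
}
```

Pure ASCII input is identical under Latin-1 and UTF-8, so such a check would only fire for literals where the two interpretations differ.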
I haven't done an exhaustive audit of this problem, but I ran into it myself and it was a bit of a head-scratcher. Part of the reason is that EXPECT_EQ("foo ❤", SomeMethodReturnsString()); will pass even with corruption, because the left side also goes through the implicit WTF::String constructor.
I did verify that I'm not the only one getting tricked. For example, in the test linked here, the intention is probably to have an encoded Unicode character for the CJK case, but it's actually garbage, so it doesn't test what it's trying to:
https://cs.chromium.org/chromium/src/third_party/blink/renderer/platform/weborigin/kurl_test.cc?sq=package:chromium&g=0&l=520
Comment 1 by yutak@chromium.org, Jul 24