JS URL constructor breaks on IDNA with Unicode characters
Issue description
Chrome Version: 63
OS: Linux
What steps will reproduce the problem?
(1) Open JavaScript console.
(2) Type: new URL('https://xn--abcdé')
(3) Type: new URL('https://ab--abcdé')
What is the expected result?
URL { href: "https://xn--xn--abcd-i1a/", ... }
URL { href: "https://xn--ab--abcd-i1a/", ... }
What happens instead?
Uncaught TypeError: Failed to construct 'URL': Invalid URL
Uncaught TypeError: Failed to construct 'URL': Invalid URL
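For anyone who wants to compare engines without the console aborting on the first exception, here is a minimal sketch that wraps the constructor. `tryParse` is a hypothetical helper name, not part of any API, and what it prints depends on the engine and version it runs in.

```javascript
// Hypothetical helper: report what the URL constructor does with an input
// instead of letting the TypeError propagate.
function tryParse(input) {
  try {
    return { ok: true, href: new URL(input).href };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

// The two inputs from steps (2) and (3). Results vary by engine.
for (const input of ['https://xn--abcdé', 'https://ab--abcdé']) {
  console.log(input, tryParse(input));
}
```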
Spec compliance: I don't see any reason in URL Standard [1] why this should be considered an invalid URL; certainly not the second case (starting with "ab--"). Works correctly in Firefox.
What's happening here?
The URL is tricky: the one in step (2) appears to be in IDNA form (a Punycode encoding of some Unicode domain), but it actually contains a non-ASCII character, so it is not legal Punycode. However, the URL parser in [1] *does not* attempt to parse/decode Punycode at all (that is a rendering consideration only). The parser's job is to *encode* non-ASCII characters into Punycode form.
Therefore, the correct behaviour is to detect the 'é' character, and thus encode the Unicode domain "xn--abcdé" into Punycode form: "xn--xn--abcd-i1a". (The fact that the domain starts with "xn--" should be irrelevant; it prepends a new "xn--" to the front.) Chrome seems to be trying to parse the illegal Punycode domain at URL parse time, resulting in an error.
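To make the expected strings above concrete, here is a sketch of the bare Punycode encoding step, following the pseudocode in RFC 3492 section 6.3. The names `punycodeEncode`, `adapt` and `digitToChar` are my own, overflow checks are omitted, and this is only the raw encoder: it is not the full domain-to-ASCII processing, which applies additional validity checks and may reject a label that this function happily encodes.

```javascript
// Parameter values from RFC 3492, section 5.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Map a digit value 0..35 to 'a'..'z', '0'..'9'.
const digitToChar = (d) => String.fromCharCode(d < 26 ? 97 + d : 22 + d);

// Bias adaptation, RFC 3492 section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = Math.floor(delta / (firstTime ? DAMP : 2));
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > ((BASE - TMIN) * TMAX) / 2) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Encode one label that contains at least one non-ASCII code point.
function punycodeEncode(label) {
  const input = Array.from(label, (ch) => ch.codePointAt(0));
  const basic = input.filter((c) => c < 0x80);
  // Basic (ASCII) code points are copied first, followed by the delimiter.
  let output = String.fromCodePoint(...basic);
  if (basic.length > 0) output += '-';

  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  const b = basic.length;
  let h = b; // number of code points handled so far
  while (h < input.length) {
    // Smallest code point not yet handled.
    const m = Math.min(...input.filter((c) => c >= n));
    delta += (m - n) * (h + 1);
    n = m;
    for (const c of input) {
      if (c < n) delta++;
      if (c === n) {
        // Emit delta as a generalized variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digitToChar(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digitToChar(q);
        bias = adapt(delta, h + 1, h === b);
        delta = 0;
        h++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

console.log('xn--' + punycodeEncode('ab--abcdé')); // xn--ab--abcd-i1a
console.log('xn--' + punycodeEncode('xn--abcdé')); // xn--xn--abcd-i1a
```

The "xn--" in the second input is treated like any other run of ASCII characters, which is where the doubled prefix in the expected result comes from.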
Even weirder, it seems you can put *any two unreserved characters* (ASCII alphanumeric, '-', '_', '.' or '~') before the "--", to hit the same error. Hence, strings starting with "ab--", "7~--" or even "----" result in this error. Those are perfectly legal URLs.
Found while investigating Issue 804462 .
[1] https://url.spec.whatwg.org/
Jan 23 2018
Hmm, I think the code responsible is in ICU: https://cs.chromium.org/chromium/src/third_party/icu/source/common/uidna.cpp?l=328 "step 5 : verify the sequence does not begin with ACE prefix" It explicitly errors out if trying to encode a domain that starts with "xn--". This doesn't explain why "ab--" also errors out. I also see similar logic, also mysteriously with "Step 5", in the IDNA encoder for Perl (http://cpansearch.perl.org/src/GAAS/URI-1.60/URI/_idna.pm) and Python (https://svn.python.org/projects/python/tags/r32/Lib/encodings/idna.py). So ¯\_(ツ)_/¯ maybe this is intentional??? I don't see anything about this being an error in the Punycode RFC (http://ietf.org/rfc/rfc3492.txt) so I don't know where this requirement came from.
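For what it's worth, the "step 5" numbering matches the ToASCII operation in RFC 3490 (IDNA2003), where step 5 is to verify that the sequence does not begin with the ACE prefix before Punycode-encoding it. A rough sketch of just that check follows. `beginsWithAcePrefix` is a made-up name and this is not ICU's code; it also shows that this check alone would not reject "ab--".

```javascript
// Hypothetical illustration of RFC 3490 ToASCII step 5, not ICU's code.
const ACE_PREFIX = 'xn--';

function beginsWithAcePrefix(label) {
  // RFC 3490 requires the ACE prefix to be recognized case-insensitively.
  return label.slice(0, ACE_PREFIX.length).toLowerCase() === ACE_PREFIX;
}

console.log(beginsWithAcePrefix('xn--abcdé')); // true: an encoder with step 5 fails here
console.log(beginsWithAcePrefix('ab--abcdé')); // false: step 5 does not explain this error
```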
Sep 7
Sep 10
According to this test site: https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly94bi0tYWJjZCVDMyVBOQ it is supposed to be a TypeError. (This is the reference implementation of the URL Standard.) That means Chrome is correct for the first case. However, the "ab--" case is definitely wrong. There aren't any tests for this in the URL test suite. I'll look into adding these.
Sep 10
Added tests at https://github.com/web-platform-tests/wpt/pull/12924. The relevant test is https://ab--abcdé.com/ which is expected to parse as https://xn--ab--abcd-i1a.com/, but throws in Chrome.
Sep 10
For some background on why the second case possibly throws, see https://github.com/whatwg/url/issues/53. This was fixed through the addition of a CheckHyphens flag in UTS #46, the spec for IDNA processing, which the URL Standard now explicitly turns off. (The check used to be unconditionally done.) On the implementation side, Node.js’ implementation uses ICU as well for IDNA conversion, and does so correctly as far as I can tell (I was the person who implemented it): ToASCII() in https://github.com/nodejs/node/blob/master/src/node_i18n.cc
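As a rough illustration of what the hyphen check rejects, here is a sketch of the two CheckHyphens criteria as I read them in UTS #46: a label must not have a hyphen in both the third and fourth positions, and must neither begin nor end with a hyphen. `failsCheckHyphens` is a hypothetical name, and real implementations apply the check as part of a larger validation pass rather than to the raw input.

```javascript
// Hypothetical sketch of the UTS #46 CheckHyphens criteria for one label.
function failsCheckHyphens(label) {
  const chars = Array.from(label); // iterate by code point, not UTF-16 unit
  const hyphensAt3And4 = chars[2] === '-' && chars[3] === '-';
  const hyphenAtEdge = chars[0] === '-' || chars[chars.length - 1] === '-';
  return hyphensAt3And4 || hyphenAtEdge;
}

// Any two characters followed by "--" trip the check, as reported above.
for (const label of ['ab--abcdé', '7~--x', '----', 'abcdé']) {
  console.log(label, failsCheckHyphens(label));
}
```

With the flag turned off, as the URL Standard now specifies, none of these labels would be rejected on hyphen grounds.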
Sep 10
For some additional backstory on the differences between Chrome’s IDNA processing and the spec’s (and why it’s agreed that Chrome’s current processing is not desirable), see:
- https://github.com/whatwg/url/issues/267
- https://github.com/whatwg/url/pull/309
Sep 10
Comment 1 by mgiuca@chromium.org, Jan 23 2018