Inconsistency in URL-parsing punycode handling
Reported by jfkth...@gmail.com, Sep 16 2017
Issue description
UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:57.0) Gecko/20100101 Firefox/57.0
Steps to reproduce the problem:
In DevTools console, compare the behaviors of:
(1) new URL("http://xn--google.com")
(2) new URL("http://䕮䕵䕶䕱.com")
(3) new URL("http://xn--x.com")
(4) new URL("http://xn--x.xn--google.com")
(5) new URL("http://xn--x.䕮䕵䕶䕱.com")
Also try navigating to these URLs via the address bar.
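For anyone who wants to run the five cases in one go, here is a minimal sketch that wraps each `new URL(...)` call so a failure is reported instead of aborting the run. The helper name `tryParse` is made up for illustration, and no expected output is listed because the results depend on the browser or runtime and its version.

```javascript
// Hypothetical helper: parse one input and report success or failure
// without throwing, so all five cases can be compared side by side.
function tryParse(input) {
  try {
    const url = new URL(input);
    return { input, ok: true, hostname: url.hostname, href: url.href };
  } catch (err) {
    return { input, ok: false, error: String(err) };
  }
}

// The five examples from the steps above, in the same order.
const examples = [
  "http://xn--google.com",
  "http://䕮䕵䕶䕱.com",
  "http://xn--x.com",
  "http://xn--x.xn--google.com",
  "http://xn--x.䕮䕵䕶䕱.com",
];

const results = examples.map(tryParse);
results.forEach((r, i) => {
  const outcome = r.ok ? `hostname = ${r.hostname}` : `failed: ${r.error}`;
  console.log(`(${i + 1}) ${r.input} -> ${outcome}`);
});
```

Comparing the `href` values of (1) with (2), and of (4) with (5), shows whether each pair parses to the same URL in the engine being tested.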
What is the expected behavior?
All 5 examples should create a URL object.
Note that (1) and (2) result in identical URLs, because (1) is simply the punycode representation of (2); in both cases, the resulting URL has its hostname set to "xn--google.com".
Example (3) also looks like punycode, but is not actually a valid punycode label. This does not prevent 'new URL()' from parsing it, with a resulting hostname of "xn--x.com". The difference from (1) can be seen by trying to navigate there: (1) displays as 䕮䕵䕶䕱.com in the address bar (and resolves to a parked-site page), whereas (3) results in a web search because it cannot be resolved.
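To make the difference between "xn--google" and "xn--x" concrete, the following is a sketch of the punycode decoding procedure from RFC 3492. The function names (`punycodeDecode`, `decodesAsPunycode`) are chosen for illustration only, and the sketch checks nothing beyond whether the part after "xn--" decodes; real IDNA processing applies further validity rules that are not modelled here.

```javascript
// Punycode parameters from RFC 3492, section 5.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;

// Bias adaptation from RFC 3492, section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Map a-z / A-Z to 0..25 and 0-9 to 26..35; return -1 for anything else.
function charToDigit(ch) {
  const c = ch.charCodeAt(0);
  if (c >= 97 && c <= 122) return c - 97;
  if (c >= 65 && c <= 90) return c - 65;
  if (c >= 48 && c <= 57) return c - 22;
  return -1;
}

// Decode the part of an ACE label that follows "xn--". Throws on bad input.
function punycodeDecode(encoded) {
  const split = encoded.lastIndexOf("-");
  // Everything before the last "-" is copied through as literal ASCII.
  const output = split > 0
    ? Array.from(encoded.slice(0, split), (c) => c.charCodeAt(0))
    : [];
  if (output.some((cp) => cp >= 0x80)) throw new RangeError("non-ASCII literal");
  let pos = split > 0 ? split + 1 : 0;
  let n = 128, i = 0, bias = 72;
  while (pos < encoded.length) {
    const oldI = i;
    let w = 1;
    // Read one variable-length integer (a "delta") from the digit stream.
    for (let k = BASE; ; k += BASE) {
      if (pos >= encoded.length) throw new RangeError("truncated input");
      const digit = charToDigit(encoded[pos++]);
      if (digit < 0) throw new RangeError("invalid digit");
      i += digit * w;
      const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
      if (digit < t) break;
      w *= BASE - t;
    }
    const len = output.length + 1;
    bias = adapt(i - oldI, len, oldI === 0);
    n += Math.floor(i / len);
    i %= len;
    if (n > 0x10ffff) throw new RangeError("code point out of range");
    output.splice(i++, 0, n); // insert code point n at position i
  }
  return String.fromCodePoint(...output);
}

// True if the label has the "xn--" prefix and the rest decodes cleanly.
function decodesAsPunycode(label) {
  if (!/^xn--/i.test(label)) return false;
  try {
    punycodeDecode(label.slice(4));
    return true;
  } catch (err) {
    return false;
  }
}
```

With this sketch, "google" decodes to 䕮䕵䕶䕱, while "x" fails: its single digit announces that more digits follow, and the input ends before they arrive.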
Example (4) shows that the presence of the invalid-punycode label as a subdomain does not interfere with parsing the URL as a whole, nor with navigating to the site: this will also lead to a parked-site page for 䕮䕵䕶䕱.com.
Example (5) should behave identically to (4), just like (2) behaves identically to (1).
What went wrong?
Example (5) in the DevTools console results in failure:
> Uncaught TypeError: Failed to construct 'URL': Invalid URL
I believe this is incorrect, AFAICT from reading the URL parsing algorithm[1].
The algorithm depends on a "host parser"[2] which in turn uses a "domain to ASCII"[3] algorithm based on Unicode's ToASCII[4]. This basically splits the domain on dots, and then punycode-encodes any labels that contain non-ASCII characters; but I don't see anything that requires an invalid-ACE label like "xn--accountlogin" to result in a validation failure, nor any justification for treating this differently depending on whether a separate label within the domain contained non-ASCII chars (and therefore was punycode-encoded by ToASCII).
[1] https://url.spec.whatwg.org/#concept-basic-url-parser
[2] https://url.spec.whatwg.org/#concept-host-parser
[3] https://url.spec.whatwg.org/#concept-domain-to-ascii
[4] http://www.unicode.org/reports/tr46/#ToASCII
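The "split on dots, then punycode-encode the non-ASCII labels" reading described above can be sketched as follows. This is a deliberately simplified illustration, not the spec algorithm: it omits the mapping, normalization, validity and length checks that UTS #46 ToASCII performs, and the names `punycodeEncode` and `toAsciiSketch` are made up for this example. The encoder follows RFC 3492.

```javascript
// Punycode parameters from RFC 3492, section 5.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;

// Bias adaptation from RFC 3492, section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Map a digit value 0..35 to 'a'..'z', '0'..'9'.
function digitToChar(d) {
  return String.fromCharCode(d < 26 ? 97 + d : 22 + d);
}

// Encode one label (without the "xn--" prefix) as punycode.
function punycodeEncode(label) {
  const input = Array.from(label, (ch) => ch.codePointAt(0));
  const basic = input.filter((cp) => cp < 0x80);
  // ASCII code points are copied through first, followed by a delimiter.
  let output = basic.map((cp) => String.fromCharCode(cp)).join("");
  if (basic.length > 0) output += "-";
  let handled = basic.length;
  let n = 128, delta = 0, bias = 72;
  while (handled < input.length) {
    // Next-smallest code point that has not been encoded yet.
    const m = Math.min(...input.filter((cp) => cp >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const cp of input) {
      if (cp < n) delta++;
      if (cp === n) {
        // Emit delta as a variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digitToChar(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digitToChar(q);
        bias = adapt(delta, handled + 1, handled === basic.length);
        delta = 0;
        handled++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

// Simplified per-label conversion: ASCII labels pass through untouched
// (including ones that merely look like ACE labels, such as "xn--x").
function toAsciiSketch(domain) {
  return domain
    .split(".")
    .map((label) =>
      /^[\x00-\x7f]*$/.test(label) ? label : "xn--" + punycodeEncode(label))
    .join(".");
}
```

Under this reading each label is handled independently, so `toAsciiSketch("xn--x.䕮䕵䕶䕱.com")` yields "xn--x.xn--google.com", the same string as example (4). Nothing in the sketch makes the result for one label depend on whether another label needed encoding.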
Did this work before? N/A
Does this work in other browsers? Yes
Chrome version: 60.0.3112.90 (Official Build) (64-bit) Channel: stable
OS Version: OS X 10.12
Flash Version: Shockwave Flash 23.0 r0
Note that Safari behaves as expected here (examples 4 and 5 both parse to identical URLs), as does Firefox with the fix for Mozilla bug 1399540 (just landed, to address a couple of somewhat different-but-related issues).
Sep 19 2017
jfkthame@, thanks for filing the issue. I tested this on Windows 7 and Mac OS 10.12.6 using the latest Canary (63.0.3218.0) and the latest Stable (61.0.3163.91) with the steps below:
1. Launched Chrome and opened the URLs given above.
2. Opened the Console in DevTools on each page; no Uncaught TypeError appears.
Please find the attached screencast for reference. I tried the same on Firefox and observed the same behavior. Could you please attach a screencast of the expected behavior so that we can better understand the issue? Thanks.
Sep 19 2017
Thanks for the feedback. I'm not set up to easily record a screencast right now, but am attaching a screenshot that shows the TypeError in devtools (using current Chrome stable on macOS 10.12). This results from simply entering successive "new URL(...)" commands in the console and observing the results returned, as shown in the image.
Sep 19 2017
Thank you for providing more feedback. Adding requester "susanjuniab@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Sep 20 2017
Able to reproduce this issue on Mac 10.12.6, Windows 10 and Ubuntu 14.04 using Chrome 61.0.3163.91 (reported version) and the latest Canary 63.0.3219.0. This is not a regression, as it is also observed in old builds going back to M50. Hence, marking it as untriaged to get more input from the dev team. Thanks!
Sep 22 2017
jshin@, can you take a look?
Oct 17 2017
Oct 17 2017
I don't think this is a networking issue, but rather an issue with our URL parser. I wrote a quick unit test:
TEST(GURLTest, Punycode) {
  EXPECT_EQ(GURL("http://xn--google.com"), GURL("http://䕮䕵䕶䕱.com"));
  EXPECT_EQ(GURL("http://xn--x.xn--google.com"),
            GURL("http://xn--x.䕮䕵䕶䕱.com"));
}
The first expectation succeeds. The second fails.
../../url/gurl_unittest.cc:68: Failure
Expected: GURL("http://xn--x.xn--google.com")
Which is: http://xn--x.xn--google.com/
To be equal to: GURL("http://xn--x.䕮䕵䕶䕱.com")
Which is: http://xn--x.%E4%95%AE%E4%95%B5%E4%95%B6%E4%95%B1.com/
Maybe //url OWNERS have ideas.
Oct 17 2017
Sep 27
Comment 1 by manoranj...@chromium.org, Sep 18 2017