GURL construction from punycode takes a very long time with certain strings
Issue description

The GURL(StringPiece) constructor can be fed certain strings that cause the punycode decode algorithm to consume a large amount of CPU time. The attached file contains an example URL with high-ASCII (though not Unicode-encoded) characters. Passing this string to the GURL constructor causes u_strFromPunycode to make over 65,000 insertions into an array, each one using memmove to shift ~38 KB of data by 2 bytes (over 2 GB moved in total). (This was originally found by ClusterFuzz, and causes timeouts on those bots.)
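The arithmetic behind the "over 2 GB" figure can be sanity-checked with a tiny cost model (an illustration only, not the ICU code): each of the ~65,000 insertions memmoves roughly 38 KB of existing buffer.

```python
# Simplified cost model for the reported behaviour (not the ICU code):
# every insertion shifts ~38 KB of existing buffer contents.
def total_bytes_moved(insertions, bytes_shifted_per_insert):
    return insertions * bytes_shifted_per_insert

total = total_bytes_moved(65_000, 38 * 1024)
print(total)  # 2529280000 -- over 2 GB, matching the report
```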
Jan 22 2018
From the mail thread:

Me: It doesn't seem like GURL should be trying to IDN-decode a Host component longer than 253 characters, and probably should reject the URL as invalid.

Response: If we do go that route, we should file a bug on https://url.spec.whatwg.org/ and an interop test to make sure others agree on the limits. The question of whether a URL's authority is limited to DNS hostnames is... murky... and certainly one I'd appreciate clarity on, but also wouldn't want to lead the vanguard on :) There may already be one - I recall there being some discussion on the IDN application process for unrecognized schemes (and whether or not they should be assumed to be generic schemes, and if generic, whether exclusively DNS hostnames), but it'd likely be around https://url.spec.whatwg.org/#host-writing in the state machine: add an additional criterion prior to running the Unicode process, to determine whether it's a valid domain or an opaque host (https://url.spec.whatwg.org/#concept-opaque-host-parser).
Jan 22 2018
Jan 23 2018
So I haven't tested this but I'll give my thoughts based on my reading of the supplied URL and the URL Standard [1].
First, having a look at this URL. What encoding is this file? It contains two non-ASCII bytes (both 0xBC) and it's not valid UTF-8. Is this how it came out of ClusterFuzz? Let's assume it's Latin-1, and that those two bytes therefore represent U+00BC VULGAR FRACTION ONE QUARTER (¼).
The other possibility is that ClusterFuzz is literally sending this byte string (with the 0xBC byte) into GURL, which is interpreting it as UTF-8 but crapping out because 0xBC is not a valid UTF-8 initial byte. I'll assume the '¼'.
A Python expression that exactly replicates this URL:
'ws://xn--fd¼wsw3t' + 'a' * 9651 + '¼' + 'a' * 55888 + 'Baa333aBBBBBBJBBBCBJBBBBt:/aaaBataaaaaBaa333aBBBBBBfJBBBt'
So is it the ¼ that's causing this, or the length? It's not valid for a Unicode character to appear in a Punycoded domain, so surely we should simply error out in this case (or fall back to treating this as a non-IDNA domain).
My reading of the spec suggests this should be considered a *valid* URL but not a legal Punycode domain, since it contains a Unicode character *inside* the Punycode-encoded data.
When the URL is parsed:
1. The host should be separated out from the rest of the URL:
'xn--fd¼wsw3t' + 'a' * 9651 + '¼' + 'a' * 55888 + 'Baa333aBBBBBBJBBBCBJBBBBt:'
2. The fact that it starts with "xn--" is irrelevant. Just treat it like any other Unicode domain and Punycode it, then prepend "xn--":
'xn--xn--fdwsw3t' + 'a' * 65539 + 'Baa333aBBBBBBJBBBCBJBBBBt:-56523a5oj'
3. Thus the final parsed URL should be:
'ws://xn--xn--fdwsw3t' + 'a' * 65539 + 'Baa333aBBBBBBJBBBCBJBBBBt:-56523a5oj/aaaBataaaaaBaa333aBBBBBBfJBBBt'
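The encode-and-prefix step described in (2) can be sketched with Python's built-in punycode codec (using a short stand-in label of my own, not the bug's URL):

```python
# Illustrative only: Python's 'punycode' codec implements RFC 3492, the
# same encoding used inside IDNA 'xn--' labels.
label = 'bücher'  # hypothetical example label, not taken from this bug
ace = 'xn--' + label.encode('punycode').decode('ascii')
print(ace)  # xn--bcher-kva
```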
When the URL is rendered:
1. The ASCII host should be decoded as an IDNA domain:
'xn--fd¼wsw3t' + 'a' * 9651 + '¼' + 'a' * 55888 + 'Baa333aBBBBBBJBBBCBJBBBBt:'
2. This should be displayed with the ¼ as a Unicode character. No attempt should be made to process the inner "xn--" part of the domain.
Basically, this should be treated as an ordinary URL with a non-Punycoded host with a Unicode character in it. If the fact that it begins with "xn--" is causing some Punycode-handling code to be run, that is a bug. I don't think GURL should ever attempt to decode Punycode.
If there's a problem with very long hosts (outside of Punycode) then maybe we can impose a limit, but I don't think there's anything Punycode specific about this.
> It doesn't seem like GURL should be trying to IDN-decode a Host component longer than 253 characters, and probably should reject the URL as invalid.
What made you say 253, out of interest? Is there precedent for restricting the length of an IDN host name?
I agree that if we start imposing a limit on host names, we should agree on a standard. But ideally, we'd just not have quadratic algorithms in URL processing code so any length is fine.
Jan 23 2018
253 is derived from the DNS hostname limits. RFC 3986 didn't lock the reg-name component of 'host' to DNS, and instead explicitly stated that this was out of scope. That said, it also had a SHOULD in the form of "URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length."

I'm not sure I understand your remark that "xn--" shouldn't cause Punycode to run, nor that GURL should never attempt to decode Punycode. These steps are covered in https://url.spec.whatwg.org/#host-parsing, which runs the conversion.
Jan 23 2018
> What encoding is this file?

The bytes are exactly as from ClusterFuzz; I've just stripped the other parts of the file. (It contains a feature policy declaration, where this part is being interpreted as a potential origin string, so it is being passed through WTF::String to WTF::StringUTF8Adapter to base::StringPiece and then to GURL. See https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/feature_policy/FeaturePolicy.cpp?sq=package:chromium&dr&l=184.) You could call it Latin-1, but it's just the byte sequence 77 73 3a 2f 2f 78 6e 2d 2d 66 64 bc 77 ... At some point, it is interpreted as UTF-8, widened to UTF-16, and then later, the 00bc word is replaced with "1/4" (where the / is U+2044).

The path, from what I've seen in tracing, is:
- GURL::InitCanonical (passed the complete URL as base::StringPiece "ws://xn--f...")
- url::Canonicalize (char* string is "ws://xn--f...")
- DoCanonicalize (checks scheme, breaks spec into components)
- CanonicalizeStandardURL
- DoCanonicalizeStandardURL
- CanonicalizeHost
- DoHost
- DoHostSubstring (string is "xn--f...", sees that non-ASCII characters are present)
- DoComplexHost (assumes input is UTF-8, converts to UTF-16)
- DoIDNHost
- IDNToASCII (at this point, the char16* array is 0078 006e 002d 002d 0066 0064 00bc 0077 0073 0077 0033 0074 0061 0061 0061 0061...)
- uidna_nameToASCII (converts UChar* to UnicodeString)
- IDNA::nameToASCII
- UTS46::nameToASCII
- UTS46::process (splits on '.')
- UTS46::processUnicode (replaces 0xbc with U+0031 U+2044 U+0034 "1/4")
- UTS46::processLabel (sees that label starts with "xn--")
- u_strFromPunycode

> It's not valid for a Unicode character to appear in a Punycoded domain, so surely we should simply error out in this case (or fall back to treating this as a non-IDNA domain).

Interestingly, u_strFromPunycode takes in 2-byte wide characters, so the slash at this point is 8260 (U+2044), but this value gets cast to uint8_t and used as 0x44 when decoding, so it isn't identified as an invalid character.
> So is it the ¼ that's causing this, or the length?

Both, I suppose -- without the non-ASCII character, DoHostSubstring would never call DoComplexHost; it just calls the all-ASCII fast path, DoSimpleHost. After it gets into the punycode decoder, though, it's a combination of the length and the actual string data itself; the string represents some 65,000 characters, each of which needs to be inserted, in turn, into the *beginning* of an ever-growing array.

> What made you say 253, out of interest?

Not me, but ISTR that that's the longest label that can be handled by the DNS. Hostnames can probably be longer than that with general resolvers, but that's the longest single component we should see on the public internet.
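The narrowing cast described above can be reproduced in miniature (a model of the reported behaviour, not the ICU source):

```python
# U+2044 FRACTION SLASH, the character substituted for the '/' in "1/4".
cp = 0x2044
narrowed = cp & 0xFF  # what a cast to uint8_t would keep
print(hex(narrowed), chr(narrowed))  # 0x44 D -- an innocuous ASCII byte
```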
Jan 30 2018
Jan 30 2018
RE #6: The largest single component should be 63 characters. The largest hostname should be 253 characters. As noted in #2, it's not entirely clear whether the applicable standards support placing DNS limits on Punycode, but I believe it would be reasonable to do so.
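Those two limits can be expressed as a quick check (a hypothetical helper of my own, not Chromium code):

```python
def within_dns_limits(hostname: str) -> bool:
    """Check the DNS limits cited above: 253 chars total, 63 per label."""
    if len(hostname) > 253:
        return False
    return all(0 < len(label) <= 63
               for label in hostname.rstrip('.').split('.'))

print(within_dns_limits('example.com'))       # True
print(within_dns_limits('a' * 64 + '.com'))   # False: label too long
print(within_dns_limits('a.' * 127 + 'com'))  # False: hostname too long
```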
Feb 1 2018
#6 Wow, thanks for digging in that deep. I'm having a closer look now. From your above description, it looks like there are potentially several bugs.

Before that: OK, OK, I made a mistake reading the spec. I now see where the "ToASCII" algorithm [1] detects if the string starts with "xn--" and, if so, applies the Punycode-to-Unicode algorithm before running Unicode-to-Punycode. This means any label that starts with "xn--" and contains non-ASCII characters *should* cause the whole algorithm to fail, resulting in a failed URL parse. So u_strFromPunycode should be failing as soon as it hits the U+2044. And from what I can tell, it does.

To the issues:

First issue: "At some point, it is interpreted as UTF-8, widened to UTF-16" -- but the byte 0xBC cannot be interpreted as UTF-8. It's invalid. If it's becoming U+00BC in UTF-16, then it has been interpreted as Latin-1. Either the whole string is interpreted as Latin-1, or Latin-1 is used as a fallback when invalid UTF-8 is encountered; I'm not really sure. I briefly tried calling DoComplexHost with Latin-1 characters and it had some really weird behaviour. I didn't look deeper. I think this should basically be erroring out early. But let's assume it correctly interprets it as U+00BC and move on... The replacement of ¼ with "1/4" is to spec; see https://www.unicode.org/reports/tr46/#ToASCII, which applies a mapping table (http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt) that maps 00BC to 0031 2044 0034.

Second issue: "u_strFromPunycode takes in 2-byte wide characters, so the slash at this point is 8260 (U+2044), but this value gets cast to uint8_t and used as 0x44 when decoding, so it isn't identified as an invalid character." That's surely a bug. u_strFromPunycode should fail when it hits any non-ASCII character. I'm not able to reproduce this: when I run IDNToASCII directly with this input, it fails (as expected).
I've uploaded some tests for this: https://chromium-review.googlesource.com/c/chromium/src/+/897286

If what you say is true, I would expect IDNToASCII("xn--¼") to at some point be converted to "xn--1D2" and then fail because it's invalid Punycode. Hmm.

On the fact that the very long string takes a long time: I haven't looked into it, but it could be something to do with the fact that GURL::InitCanonical uses StdStringCanonOutput (rather than std::string, for some reason). The resize function on StdStringCanonOutput doesn't do any kind of exponential expanding; it just expands to the requested size. So if the Punycode decoding algorithm is asking it to expand linearly, this will have quadratic performance...?

[1] https://www.unicode.org/reports/tr46/#ToASCII
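The growth-strategy concern can be illustrated with a toy model (an illustration of the general principle only, not StdStringCanonOutput itself): growing a buffer to exactly each requested size copies O(n^2) bytes in total, while capacity doubling copies O(n).

```python
def bytes_copied_exact(n):
    # Grow by one element at a time; each grow copies the old contents.
    copied, size = 0, 0
    for _ in range(n):
        copied += size
        size += 1
    return copied

def bytes_copied_doubling(n):
    # Double the capacity whenever it is exhausted.
    copied, cap = 0, 1
    while cap < n:
        copied += cap
        cap *= 2
    return copied

print(bytes_copied_exact(65_000))     # 2112467500
print(bytes_copied_doubling(65_000))  # 65535
```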
Feb 2 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/af4ee7a6123280659e64e113ff47eb1b01cffa16

commit af4ee7a6123280659e64e113ff47eb1b01cffa16
Author: Matt Giuca <mgiuca@chromium.org>
Date: Fri Feb 02 03:18:55 2018

URLCanonTest: Added tests for parsing labels that start with xn--.

These labels are treated specially by the parser, and need tests. We test both valid xn-- domain labels (ASCII only), as well as xn-- labels that contain Unicode characters (which are invalid).

Bug: 804462
Change-Id: I943034745bdf65ae3b6df77c5ed8cb95a590a105
Reviewed-on: https://chromium-review.googlesource.com/897286
Reviewed-by: Mike West <mkwst@chromium.org>
Commit-Queue: Matt Giuca <mgiuca@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533943}

[modify] https://crrev.com/af4ee7a6123280659e64e113ff47eb1b01cffa16/url/url_canon_unittest.cc
Feb 26 2018
mgiuca: Did your commit in comment 11 fix this issue?
Apr 9 2018
#12 No, that commit just added tests. I don't want to own this bug because it will be quite a while before I get to it (and I don't have a fix; I just did a bunch of digging, which I wrote up above). Unassigning myself.
Apr 13 2018
Apr 13 2018
See also the profiling result at https://bugs.chromium.org/p/chromium/issues/detail?id=811960#c7 . Somehow, memmove takes a very long time for a 60k-long input.
Apr 25 2018
#15 -- yes, it's not the string resizing that seems to be taking so long; it's the 64K calls to memmove. Each call is moving a 32K block over by only 1 byte to make room for the next character, so the src and dst are almost completely aliased.

I suspect that an initial pass over the string could more efficiently compute the final length and the proper location for each ASCII block in the output, and then the non-ASCII characters could simply be written into the correct places in the string, rather than requiring these expensive memmoves.

In the meantime, I'm working on an easier fix to url_canon_host.cc which will simply reject hostnames longer than 253 characters, so the inefficiency will at least be bounded.
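The two-pass idea could look something like this sketch (my own illustration, under the simplifying assumption that final output positions are known up front; the real Punycode decoder computes insertion points incrementally):

```python
def insert_all(base, insertions):
    """Build the final string in one pass instead of memmove-ing per insert.

    insertions: (final_position, char) pairs; positions refer to the output.
    """
    out = []
    src = 0
    for pos, ch in sorted(insertions):
        take = pos - len(out)        # how much of `base` precedes this slot
        out.extend(base[src:src + take])
        src += take
        out.append(ch)
    out.extend(base[src:])           # copy the remaining tail exactly once
    return ''.join(out)

print(insert_all('acef', [(1, 'b'), (3, 'd')]))  # abcdef
```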
Apr 25 2018
Apr 25 2018
> I'm working on an easier fix to url_canon_host.cc which will simply reject hostnames longer than 253

Yes, that should do it. (I came here to do that, but you beat me to it :-).) I also have a patch for ICU doing the same: https://chromium-review.googlesource.com/c/chromium/deps/icu/+/1012919

BTW, the test URLs in bug 811960 and bug 834838 are valid UTF-8 and ASCII-only ACE form, respectively (unlike the one in this bug report).
Apr 25 2018
Bugs 812011 and 834848 as well -- hopefully this should squash all of those.
Apr 25 2018
Bug 834622 is another one. I verified that limiting the length of a hostname (253) or a label (63) made the timeout go away. Doing the same in url_canon_host should have the same effect, I guess.
Apr 25 2018
Apr 25 2018
FYI: I filed an ICU bug (http://bugs.icu-project.org/trac/ticket/13727) with a suggested fix. Because length-limiting in url/ should suffice for Chromium, I won't make a change locally.
May 2 2018
Re: https://chromium-review.googlesource.com/c/chromium/src/+/1028350

From http://bugs.icu-project.org/trac/ticket/13727#comment:7 :

<quote> Note that UTS #46, which is based on NFKC_Casefold, removes characters such as Soft Hyphen and Variation Selectors. This means that it is possible for an input string to be very long but for the output string to fit into the DNS limitations. A large multiple would be kind of pathological, but if we limit the input string length we might want to allow at least a small multiple of the maximum domain name length. </quote>

So, for the Unicode-to-ACE direction (when encoding Unicode to Punycode prefixed by 'xn--', i.e. before calling IDNToASCII), the check needs to be more lenient (say, 5 times the max length). ICU will set the error flag if the result is too long for a hostname, and we do check that. See https://cs.chromium.org/chromium/src/url/url_canon_icu.cc?rcl=f9c2cf42a20b6b7ebbd1b4d005ffff9cbb623e64&l=186 (the info.error check).
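The quoted point is easy to demonstrate (a hand-rolled illustration; the real mapping is the IdnaMappingTable, not a simple replace):

```python
# U+00AD SOFT HYPHEN is deleted by the UTS #46 mapping, so a very long
# input can still map to a short, DNS-legal hostname.
raw = 'exam' + '\u00ad' * 1000 + 'ple.test'
mapped = raw.replace('\u00ad', '')  # stand-in for the real mapping step
print(len(raw), len(mapped))  # 1012 12
```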
May 10 2018
Should I add a protection in ICU with a local patch instead of waiting for https://chromium-review.googlesource.com/c/chromium/src/+/1028350 ?
May 10 2018
Sorry for the delay on that one; I was working out what to do with the test failures. It seems that a few other places in the code were trying to handle this differently, and I need to get those updated.
May 16 2018
From way back in comment #2, the issue came up of this being defined behaviour in the URL spec (https://url.spec.whatwg.org/). In that spec, the host parser calls domainToASCII without setting the beStrict flag, so DNS lengths are not checked for validity. This becomes part of the URL parser (and the basic URL parser), and so this behaviour gets invoked everywhere. This means that enforcing a 253-character hostname limit in url_canon_host.cc is a potentially non-web-compatible change.

On the other hand, actually doing that only breaks one web platform test (https://wpt.fyi/url/toascii.window.html), which appears to have wildly varying results with long hostnames on different browsers, so this is not likely to be behaviour that developers can depend on in any case.

> There may already be one - I recall there being some discussion on the IDN application process for unrecognized schemes (and whether or not they should be assumed to be generic schemes, and if generic, whether exclusively DNS hostnames), but it'd likely be around https://url.spec.whatwg.org/#host-writing in the state machine to add an additional criteria prior to running the Unicode process - to determine whether it's a valid domain or opaque host ( https://url.spec.whatwg.org/#concept-opaque-host-parser )

The spec already distinguishes between special and !special URLs before host parsing, which might make this easier:

- For parsing, as far as I can tell, Domain to ASCII is only run on special URLs, and the opaque host parser is only run on !isSpecial URLs.
- For writing, https://url.spec.whatwg.org/#host-writing already checks DNS lengths to define valid-host-string (it's the only place in the spec that does), and https://url.spec.whatwg.org/#url-writing ensures that any valid-url-string whose scheme is special has a valid-host-string.
If we update the host parser (https://url.spec.whatwg.org/#concept-host-parser) so that it always calls Domain to ASCII with strict DNS length checking, then we could enforce these limits as part of parsing hostnames. I'll see about proposing that on the WHATWG repo.

Without that, the best we can do, I suspect, is to put a sanity check into DoIDNHost in url_canon_host.cc to try to stop arbitrarily long strings from being handled (and hope that even *that* is web-compatible).
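A minimal sketch of that sanity check (hypothetical names of my own, not the actual url_canon_host.cc change; Python's legacy idna codec stands in for IDNToASCII):

```python
# Reject hosts far beyond the 253-char DNS limit before the expensive
# IDN conversion runs. The 5x slack allows for mapped-away characters.
MAX_IDN_INPUT = 253 * 5  # hypothetical limit for illustration

def maybe_idn_to_ascii(host):
    if len(host) > MAX_IDN_INPUT:
        return None  # treat the URL as invalid
    return host.encode('idna').decode('ascii')

print(maybe_idn_to_ascii('bücher.example'))  # xn--bcher-kva.example
print(maybe_idn_to_ascii('a' * 2000))        # None
```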
May 17 2018
Issue 811960 has been merged into this issue.
May 18 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/02c682276f26af4ab3221e4572847ec6786c4822

commit 02c682276f26af4ab3221e4572847ec6786c4822
Author: Ian Clelland <iclelland@chromium.org>
Date: Fri May 18 20:47:50 2018

Add a sanity check for IDN hostname lengths

The Punycode algorithm is O(n^2) in the worst case, and pathological URLs can be provided which take an arbitrarily long time to parse. (These are all invalid host strings, as hosts should not be longer than 253 characters, after escaping, by RFC1034, but may be allowed by the host parser defined in https://url.spec.whatwg.org/#concept-host-parser)

This change will cause excessively long IDN hostnames to be rejected by Chrome's parser.

Bug: 804462
Change-Id: I85d0838386e383e9cccfdff5a1670651029ca914
Reviewed-on: https://chromium-review.googlesource.com/1028350
Reviewed-by: Mike West <mkwst@chromium.org>
Reviewed-by: Steven Holte <holte@chromium.org>
Commit-Queue: Ian Clelland <iclelland@chromium.org>
Cr-Commit-Position: refs/heads/master@{#560016}

[modify] https://crrev.com/02c682276f26af4ab3221e4572847ec6786c4822/components/ukm/ukm_service_unittest.cc
[modify] https://crrev.com/02c682276f26af4ab3221e4572847ec6786c4822/url/url_canon_host.cc
May 19 2018
ClusterFuzz testcase 4896678231146496 is verified as fixed, so closing issue as verified. If this is incorrect, please add the ClusterFuzz-Wrong label and re-open the issue.
Comment 1 by iclell...@chromium.org, Jan 22 2018