Issue 595263

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2016
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug



Hostnames coming in as punycode with deviation characters should remain punycode

Project Member Reported by js...@chromium.org, Mar 16 2016

Issue description

From  bug 282677  comment 6: 

---------
With current Chrome 32, and with German as one of my browser languages, when I go to my example site http://xn--amt-golener-land-mlb.de/ the Chrome omnibox displays "amt-golßener-land.de". That might be ok, but selecting the whole URL and pasting it elsewhere yields "http://amt-golssener-land.de/" which is a different web site! I think Chrome copy-to-clipboard should copy the Punycode version of the domain, at least when it does not round-trip through the Unicode form.

Note: Firefox 26 shows "http://xn--amt-golener-land-mlb.de/" in the address bar, even after I added German to my browser languages and restarted Firefox.

I do expect to have the browser map ß to ss, but I also expect valid Punycode domain names to be reachable. (Try a web search for "amt golener land uts 46".)

-----------

I'm resolving this issue as part of a CL that revises the IDN display policy. See https://codereview.chromium.org/1258813002/
 

Comment 1 by js...@chromium.org, Mar 16 2016

Components: UI>Internationalization
From  bug 282677 :

1. Type "xn--qi7chamc2ac8a.com" in the omnibox  (Punycode version of #2 without mapping)
2. Type "ｎｙｔｉｍｅｓ.com"   (Unicode version of #1 with full-width ASCII used for nytimes) 
3. Type "nytimes.com"   (ASCII-version)

#2 and #3 go to nytimes.com as they should, but #1 gives 'host not found' error. 

Going back to German sharp-S and 'ss' case (which led me to discover this problem):

1. xn--strae-oqa.de  (punycode version of #2 without IDNA2003 folding)
2. straße.de
3. strasse.de (sharp-S converted to 'ss' per IDNA 2003) 

Currently, Chrome uses IDNA2003, so #2 and #3 go to the same site while #1 goes to a different site (the German NIC does not bundle the two domains, and they belong to separate entities). 

When canonicalizing a hostname, we don't check whether it is in punycode (we only check whether it's ASCII). If it's ASCII (even though it may be in 'punycode'), it's not subject to IDNA mapping/rules. 
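The behavior described above can be reproduced roughly with Python's standard library, whose "idna" codec follows IDNA 2003. This is only a sketch of the idea, not Chromium's canonicalizer; `canonicalize_host` is a hypothetical helper written for this illustration.

```python
def canonicalize_host(host: str) -> str:
    """Hypothetical helper: ASCII input (including Punycode) is passed through."""
    if host.isascii():
        # Punycode is plain ASCII, so no IDNA mapping is applied to it.
        return host.lower()
    # Non-ASCII input goes through IDNA 2003 mapping (Python's "idna" codec).
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(canonicalize_host("straße.de"))         # sharp s is folded to "ss"
    print(canonicalize_host("strasse.de"))        # already ASCII, unchanged
    print(canonicalize_host("xn--strae-oqa.de"))  # ASCII, so left as typed
    # The raw "punycode" codec shows what the ACE label actually encodes.
    print(b"strae-oqa".decode("punycode"))
```

The first two inputs end up at the same hostname, while the Punycode input keeps pointing at the sharp-S domain, which is the mismatch reported here.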

---------

Bug 282677 is about canonicalization of hostnames. It was resolved as WONTFIX. This issue deals with the problem from the IDN display policy angle. 

Instead of converting Punycode hostnames with deviation characters or other mapped characters to Unicode, we'll just leave them alone in the CL mentioned above. 
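A minimal sketch of that display decision, assuming only the four UTS 46 deviation characters are checked (the "other mapped characters" case is not covered here). `display_label` and `display_host` are hypothetical names for this illustration and are not the functions in url_formatter.cc.

```python
# UTS 46 deviation characters: sharp s, final sigma, ZWNJ, ZWJ.
DEVIATION_CHARS = {"\u00df", "\u03c2", "\u200c", "\u200d"}


def display_label(label: str) -> str:
    """Return the form to display for one hostname label (hypothetical helper)."""
    label = label.lower()
    if not label.startswith("xn--"):
        return label
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return label  # not valid Punycode; show it unchanged
    if any(ch in DEVIATION_CHARS for ch in decoded):
        # Decoding would not round-trip under transitional processing,
        # so the label stays in Punycode.
        return label
    return decoded


def display_host(host: str) -> str:
    """Apply display_label to every dot-separated label."""
    return ".".join(display_label(part) for part in host.split("."))


if __name__ == "__main__":
    print(display_host("xn--strae-oqa.de"))  # contains sharp s: kept as Punycode
    print(display_host("xn--bcher-kva.de"))  # no deviation character: decoded
```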


Project Member

Comment 2 by bugdroid1@chromium.org, Mar 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/62a928390ba06db29576bbb32606696b3e16a66c

commit 62a928390ba06db29576bbb32606696b3e16a66c
Author: jshin <jshin@chromium.org>
Date: Fri Mar 18 18:42:39 2016

Implement a new IDN display policy

The new policy is language-independent, implemented with
ICU's uspoof API and is as follows:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.
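Rules #4 and #5 above can be approximated with Python's standard `unicodedata` module. This is a simplified sketch under stated assumptions, not what ICU's uspoof does internally: it ignores the "invisible characters" half of rule #5, and both function names are hypothetical.

```python
import unicodedata


def mixes_numbering_systems(label: str) -> bool:
    """Rule #4 sketch: True if decimal digits from more than one system appear."""
    zeros = set()
    for ch in label:
        value = unicodedata.decimal(ch, None)
        if value is not None:
            # Digits of one numbering system share the same "zero" code point.
            zeros.add(ord(ch) - value)
    return len(zeros) > 1


def repeats_combining_mark(label: str) -> bool:
    """Rule #5 sketch: True if the same nonspacing mark occurs twice in a row."""
    for prev, cur in zip(label, label[1:]):
        if prev == cur and unicodedata.category(cur) == "Mn":
            return True
    return False


if __name__ == "__main__":
    print(mixes_numbering_systems("a1\u0661"))     # ASCII digit + Arabic-Indic digit
    print(repeats_combining_mark("a\u0301\u0301")) # two combining acute accents
```

A label for which either check returns True would be shown in Punycode under the policy described above.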

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and the additional restrictions in #1. #9 is another difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing.

Most of the domains filtered out in the ".com" TLD are filtered due to the
character set restrictions (#2 and #3), which account for 94% (2,050)
of the IDNs filtered out (0.2% of ~1 million IDNs in the com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed.

[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix"  bug 595263 

BUG= 336973 , 595263 
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*

Review URL: https://codereview.chromium.org/1258813002

Cr-Commit-Position: refs/heads/master@{#382029}

[modify] https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c/components/omnibox/browser/history_url_provider_unittest.cc
[modify] https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c/components/url_formatter/url_formatter.cc
[modify] https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c/components/url_formatter/url_formatter.h
[modify] https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c/components/url_formatter/url_formatter_unittest.cc
[modify] https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c/url/url_canon_unittest.cc

Comment 3 by js...@chromium.org, Mar 30 2016

Status: Fixed (was: Started)
