Near-homoglyph whole-script IDN spoofing
Issue description
See https://www.vgrsec.com/post20170219.html. https://www.gṃail.com (note the dot under the m) is not shown in Punycode on my machine, which is set to US English (a locale that obviously does not use the dotted m character). Shouldn't my locale cause me to see Punycode? Or perhaps I'm misremembering.
,
Mar 21 2017
This issue is something beyond what client-side counter-spoofing code can do, I'm afraid.
One possibility is to check whether the set of characters used in a given label is a subset of the exemplar or extended exemplar character set of at least one language. That requires the following steps:
1) Build the sets of exemplar characters for say 100 languages
2) For each label, build a set of characters making up the label. Call it |label_set|
3)
  found = false
  foreach |set| in all the sets built in #1
    if |label_set| is contained in |set|
      found = true; break
  if (found == false) reject
Unfortunately, the example in this bug would pass the test, I believe. So, it doesn't work for that case.
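For illustration, the three steps above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the exemplar sets are tiny hand-written stand-ins (real data would come from something like CLDR and cover far more languages), and the names EXEMPLAR_SETS, HOSTNAME_EXTRAS and label_passes are made up for this example. Treating digits and the hyphen as always allowed is also an assumption.

```python
# Hypothetical, heavily truncated exemplar sets (step 1); not real CLDR data.
ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyz")
EXEMPLAR_SETS = {
    "en": ASCII_LETTERS,
    "es": ASCII_LETTERS | set("\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc"),
    # Cyrillic a..ya (U+0430..U+044F) plus yo (U+0451).
    "ru": {chr(c) for c in range(0x0430, 0x0450)} | {"\u0451"},
}

# Assumption: digits and the hyphen are acceptable in any label.
HOSTNAME_EXTRAS = set("0123456789-")


def label_passes(label):
    """Return True if the label's characters fit at least one exemplar set."""
    # Step 2: the set of characters making up the label.
    label_set = set(label) - HOSTNAME_EXTRAS
    # Step 3: accept if label_set is contained in any one language's set.
    return any(label_set <= chars for chars in EXEMPLAR_SETS.values())
```

With these toy sets, a label mixing Latin letters with a Cyrillic "а" is rejected, while "gmáil" is accepted because every character is in the Spanish set. That second case shows why a subset test alone does not stop near-homoglyph spoofing.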
A similarity-check against a small set of high-value domains would catch it, though.
,
Mar 21 2017
It sounds as if this bug is not a dupe, and the mitigations on the other bug won't fix this. We need to either add more mitigations, or find process ways to address this at higher levels (safe browsing, registrar policies).
,
Mar 21 2017
,
Mar 21 2017
,
Mar 21 2017
> This issue is something beyond what client-side counter-spoofing code can do, I'm afraid.
Well, there is a very simple mitigation: don't decode Punycode (or, say, decode it only for the local locale for non-English/Latin locales). If it's too hard to design an acceptable algorithm to detect spoofing, then that option is always acceptable from a pure security perspective.
Plans for addressing whole-script confusables are in: https://docs.google.com/document/d/1HSiOZIVNhi_ZnTaGLT0ip_ti8FT-oN_kJwpD1k90wVY/edit#heading=h.cp8o2n5ml4a2
I don't know if that would cover this, though. CCing emilyschechter@ for visibility.
,
Mar 22 2017
,
Mar 22 2017
> find process ways to address this at higher levels (safe browsing, registrar policies).
I believe this has to be done no matter what we do here.
In the meantime, I have a sketch for "a similarity-check against a small set of high-value domains". In IDNToUnicodeWithAdjustments(), after all the labels (hostname components) are converted to Unicode (or left as punycode because they fail the test), calculate the skeleton of the last 2 or 3 components and look it up in the hash table of skeletons for the top N high-value domains. If there's a match and it's different from the 'original', show the whole domain in punycode. (The adjustment logic inside the loop has to be tweaked.)
This will increase the binary size and memory use because of the hash of skeletons of the top N domains (plus the original domains, e.g. 'scope.com' in Latin). For non-ASCII hostnames, calculating the skeleton will take time (I haven't measured how long).
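A rough Python sketch of that lookup, to make the idea concrete. It is not the Chromium code. CONFUSABLES is a tiny hypothetical stand-in for the Unicode confusables data (the real thing would use ICU's skeleton computation), TOP_DOMAINS is a made-up three-entry list, and should_show_punycode is a name invented here.

```python
# Hypothetical mini confusables table: a few Cyrillic letters mapped to their
# Latin look-alikes. Real skeletons come from the Unicode confusables data.
CONFUSABLES = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0455": "s", "\u0456": "i",
}

# Made-up stand-in for the "top N high-value domains" list.
TOP_DOMAINS = ["google.com", "apple.com", "paypal.com"]


def skeleton(text):
    """Simplified skeleton: replace each character by its look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


# Hash table from skeleton to the original domain, built once.
TOP_SKELETONS = {skeleton(domain): domain for domain in TOP_DOMAINS}


def should_show_punycode(unicode_host):
    """True if the last 2 or 3 labels imitate, but differ from, a top domain."""
    labels = unicode_host.split(".")
    for count in (2, 3):
        if len(labels) < count:
            continue
        candidate = ".".join(labels[-count:])
        original = TOP_SKELETONS.get(skeleton(candidate))
        if original is not None and original != candidate:
            return True
    return False
```

A host such as "www.аpple.com" with a Cyrillic "а" maps to the skeleton of "apple.com" but is not identical to it, so it is flagged. The genuine "www.apple.com" matches the original exactly and is left alone.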
,
Mar 22 2017
,
Mar 22 2017
Jungshik, could you estimate the timeline for the similarity check you describe in c#8? Is M59 reasonable?
,
Mar 22 2017
Emily, if I don't have to worry about memory/perf, it's certainly possible before the M59 branch. net/tools/dafsa/make_dafsa.py can be used if the skeleton of an ASCII input string (high-value domains) is always ASCII, but I just heard that this is not guaranteed, unfortunately. Let me explore this a bit more.
,
Mar 23 2017
jshin -- Would caching help w/ perf concerns?
,
Mar 23 2017
,
Mar 23 2017
Perhaps my concern about memory/perf is a bit premature. Let me prototype something dead simple and see how it goes. If the skeleton of any ASCII input is guaranteed to be ASCII, the memory representation of the skeletons of N "high-value" domains would be pretty efficient (using the same mechanism as used for eTLDs; comment 11). If not, we can also think of using what ICU has for generic UTF-16 strings (UCharTrie). As for caching, it'd save calculating the skeleton for an 'incoming' domain and looking it up. I have little idea how much it'd save, though. So, let me prototype something simple first.
,
Mar 23 2017
,
Mar 30 2017
Hmm, it turned out that http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt does not regard "Latin Small Letter M with dot below" as confusable with 'm'.
,
Mar 30 2017
> Diacritic-agnostic comparison (primary collation strength) against
> 'high-value' domains
I'll try the above along with confusable_check. The former is mainly for ASCII-Latin (virtually all high-value domains) vs Latin with diacritics (potential spoofing attempts). The latter is for non-Latin (e.g. Cyrillic) domains that look like high-value domains (again in ASCII-Latin).
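As a rough illustration of the diacritic-agnostic comparison, the sketch below approximates it in Python by decomposing to NFD and dropping combining marks. This is an approximation of primary-strength collation, not the collator itself. HIGH_VALUE is a made-up two-entry list, and strip_diacritics and resembles_high_value are names invented for this example.

```python
import unicodedata

# Made-up stand-in for the list of high-value domains.
HIGH_VALUE = {"gmail.com", "amazon.com"}


def strip_diacritics(text):
    """Decompose to NFD and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def resembles_high_value(host):
    """True if host differs from a high-value domain only by diacritics."""
    folded = strip_diacritics(host)
    return folded != host and folded in HIGH_VALUE
```

Here "gṃail.com" (U+1E43) folds to "gmail.com" and is flagged. Letters such as ł (U+0142) have no canonical decomposition, so NFD-based stripping leaves them unchanged and they would need explicit mappings.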
,
Apr 6 2017
,
Apr 12 2017
,
Apr 12 2017
lgarron or mgiuca : could you add me (jshin@chromium) to bug 708754 ? I can't see the bug. (my access is denied.) Thanks.
,
Apr 13 2017
#21: Done
,
Apr 13 2017
https://codereview.chromium.org/2784933002 is a WiP. For Alexa top-500 domains, I used two methods to check for potential spoofing domains (see comment 18) and found 717 domains in the "dot com" list (out of 1042537 IDN domains). PS #2 above works fine, but I haven't measured perf, memory, etc. Using DAFSA (of skeletons and raw domain names) would save memory and is likely to be a perf win, too. PS #2 just uses a hash set of skeletons and sortkeys.
,
Apr 13 2017
Some samples:
+1б3.com
+4sharéd.com
+5б.com
+ікеа.com
+ĸaĸaĸu.com
+aṃazon.com
+abøut.com
+ḟacebook.com
+ḟacebooḳ.com
+açcuweather.com
+adøbe.com
+adobé.com
+adobė.com
+ỵahoo.com
+ýahoo.com
+ÿahoo.com
+ẏahoo.com
+aḷiexpress.com
+ałiexpress.com
,
Apr 13 2017
hmm: 'asĸ' with U+0138 (from bug 708754 ) is somehow not filtered by PS #2 above even though ask.com is in the top 500 list.
,
Apr 14 2017
> hmm: 'asĸ' with U+0138 (from bug 708754 ) is somehow not filtered by PS #2
Actually, it is filtered. I didn't see it because it's not registered in ".com".
,
Apr 19 2017
Issue 712877 has been merged into this issue.
,
Apr 19 2017
Derestricting with Issue 683314 , since near-homoglyph attacks are also well-known, and being discussed in several places.
,
Apr 20 2017
,
Apr 20 2017
Adding R-V-G because of https://bugs.chromium.org/p/chromium/issues/detail?id=703750#c17 . I'm not sure if that's sensitive or not; if not, feel free to derestrict :)
,
Apr 24 2017
rsleevi: Are you referring to the gmail comment? It didn't seem sensitive to me, and it would be nice to open this bug so that we can refer to it. WDYT?
,
Apr 25 2017
Deleted comment 17. Below is a recap with a bit of clean-up:
comment 17 with a bit of 'sanitization':
> does not regard "Latin Small Letter M with dot below" (U+1E43) as confusable with 'm'.
And I can understand why it does not. Diacritic-agnostic comparison (primary collation strength) against 'high-value' domains would catch it, but there may be false positives. Well, whatever is used, there will always be some. We need to see how many would be flagged by this method.
We can effectively supplement the Unicode confusable list by replacing U+1E43 (ṃ ) with 'm' in the input, but there are a lot of characters like that and where should we draw a line?
The first method in comment #1 with a slight variation might work better. This alone wouldn't catch gmáil.com with U+00E1 or gmaıl.com with U+0131 or gmaíl.com with U+00ED.
1) Build the sets of exemplar characters for languages written in Latin [1]
2) For each label, build a set of Latin letters in the label. Call it |latin_label_set|
3)
  foreach |set| in all the sets built in #1
    if |latin_label_set| is contained in |set|
      return accept;
  return reject
[1] If necessary, {Latin, Greek, Cyrillic} can be used instead of Latin.
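A minimal Python sketch of this Latin-only variation, under loud assumptions. The exemplar sets are three hand-typed toy alphabets (not real CLDR data), accept_label and is_latin_letter are invented names, and "Latin letter" is approximated by checking that the Unicode character name starts with "LATIN", because the standard library exposes no script property.

```python
import unicodedata

ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyz")
# Hypothetical toy exemplar sets for languages written in Latin (step 1).
LATIN_EXEMPLARS = {
    "en": ASCII_LETTERS,
    "es": ASCII_LETTERS | set("\u00e1\u00e9\u00ed\u00f1\u00f3\u00fa\u00fc"),
    "tr": set("abc\u00e7defg\u011fh\u0131ijklmno\u00f6prs\u015ftu\u00fcvyz"),
}


def is_latin_letter(ch):
    # Approximation: rely on the Unicode character name, not the Script property.
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def accept_label(label):
    """Accept if the label's Latin letters fit one language's exemplar set."""
    # Step 2: only the Latin letters of the label are considered.
    latin_label_set = {ch for ch in label if is_latin_letter(ch)}
    # Step 3: accept as soon as one set contains them all.
    return any(latin_label_set <= chars for chars in LATIN_EXEMPLARS.values())
```

With these toy sets, "gṃail" is rejected because no set contains U+1E43, while "gmáil" and "gmaıl" are accepted through the Spanish and Turkish sets. That matches the caveat above that this check alone would not catch them.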
,
Apr 25 2017
Ryan, can you go over comment 33 (in case it still has something to worry about) and lift the restriction if it looks good to you?
,
Apr 25 2017
Removing R-V-G. I understand R-V-ST will apply unless y'all make it public :)
,
Apr 25 2017
As for comment 3 on safe-browsing, I found that SafeBrowsing already blocks a few of them with actual risk when I went over the domains flagged by my WiP CL. For instance, http://xn--cloud-mh1b.com/ ( ḭcloud.com ) leads me to an SB interstitial.
------------
Deceptive site ahead
Attackers on ḭcloud.com may trick you into doing something dangerous like installing software or revealing your personal information (for example, passwords, phone numbers, or credit cards).
Details: Google Safe Browsing recently detected phishing on ḭcloud.com. Phishing sites pretend to be other websites to trick you. Learn more.
-----------------------------
The same is true of gmaiļ.com . Apparently, SB began to take IDN similarity into account for phishing site detection. It may not be fast enough to protect users from 'targeted' attacks (register a domain with a spoofing name, set up a server right away and send a phishing email/SMS/etc to a target). Nonetheless, SB can protect users in most cases without displaying 'innocent' IDNs in punycode.
One possibility:
1. Land my WiP CL with some more tweaking (1. top 10k vs top 1k or top N? 2. ø vs o, ł vs l, etc.) for now.
2. After confirming that SB does offer protection against IDN-exploiting spoofing attempts, come up with a way to give users a very clear signal (an interstitial is not liked by many, but it'd be the strongest signal) that they may be navigating to a potential phishing site. With that, just display domain names in Unicode for the cases covered by this bug.
,
Apr 25 2017
Clarifying what I meant by #2 in the previous comment.
2. After confirming that SB does offer protection against IDN-exploiting spoofing attempts, we can display IDNs (flagged by the IDN-similar-to-top-domains detector) in Unicode. When a user tries to navigate to an IDN site NOT flagged by SB BUT flagged by Chrome's client-side detector (purely based on names + top domain list), give users a very clear signal (an interstitial is not liked by many, but it'd be the strongest signal) that they may be navigating to a potential phishing site.
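One way to read that proposal as a decision table, sketched in Python. It is purely illustrative: the Action names and the decide function are hypothetical, and the source does not say how the two signals would actually be combined in Chrome.

```python
from enum import Enum


class Action(Enum):
    SAFE_BROWSING_INTERSTITIAL = "SB interstitial"
    CLIENT_SIDE_WARNING = "Unicode display plus a clear client-side warning"
    SHOW_UNICODE = "Unicode display, no warning"


def decide(flagged_by_safe_browsing, flagged_by_client_detector):
    """Pick the user-facing signal for an IDN navigation."""
    if flagged_by_safe_browsing:
        # SB already knows the site is phishing: its interstitial applies.
        return Action.SAFE_BROWSING_INTERSTITIAL
    if flagged_by_client_detector:
        # Name resembles a top domain but SB has no verdict yet.
        return Action.CLIENT_SIDE_WARNING
    return Action.SHOW_UNICODE
```

The middle branch is the case this comment is about: a name that looks like a top domain but that Safe Browsing has not (yet) flagged.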
,
Apr 26 2017
allpublic'ing the bug.
,
May 9 2017
Issue 719773 has been merged into this issue.
,
May 11 2017
Issue 720538 has been merged into this issue.
,
May 19 2017
Issue 723956 has been merged into this issue.
,
May 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91

commit a8add0308ba6067eb3de5a8fe82f9c2f2460ad91
Author: jshin <jshin@chromium.org>
Date: Fri May 19 06:49:10 2017

Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons.

The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by Latin-Greek-Cyrillic.

BUG= 703750 , 714628 , 719199 , 722639
TEST=components_unittests --gtest_filter=*IDNToUni*

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}

[modify] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/BUILD.gn
[modify] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/idn_spoof_checker.cc
[modify] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/idn_spoof_checker.h
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/BUILD.gn
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/README
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/alexa_domains.list
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/alexa_skeletons.gperf
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/make_alexa_top_list.py
[add] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/top_domains/make_top_domain_gperf.cc
[modify] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/url_formatter.cc
[modify] https://crrev.com/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91/components/url_formatter/url_formatter_unittest.cc
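To make the extra mappings in this commit description concrete, here is a small Python sketch of the folding step. It is an illustration, not the code in idn_spoof_checker.cc. The names EXTRA_MAPPINGS, fold_host, TOP_DOMAINS and looks_like_top_domain are invented, the three-entry domain list is a stand-in for the top 10k list, and the full confusability-skeleton step is deliberately left out.

```python
import unicodedata

# Mappings named in the commit description: l-stroke, o-slash and d-stroke
# (not removed by diacritic stripping) plus the k and n look-alikes.
EXTRA_MAPPINGS = str.maketrans({
    "\u0142": "l",  # l with stroke
    "\u00f8": "o",  # o with stroke
    "\u0111": "d",  # d with stroke
    "\u043a": "k",  # Cyrillic ka
    "\u0138": "k",  # Latin kra
    "\u03ba": "k",  # Greek kappa
    "\u043f": "n",  # Cyrillic pe
})

# Tiny stand-in for the pre-calculated top-domain list.
TOP_DOMAINS = {"aliexpress.com", "adobe.com", "ask.com"}


def fold_host(host):
    """Remove diacritic marks, then apply the supplementary mappings."""
    decomposed = unicodedata.normalize("NFD", host)
    accent_free = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    return accent_free.translate(EXTRA_MAPPINGS)


def looks_like_top_domain(host):
    """True if the folded host equals a top domain it is not identical to."""
    folded = fold_host(host)
    return folded != host and folded in TOP_DOMAINS
```

Samples quoted earlier in the thread, such as ałiexpress.com, adøbe.com and asĸ.com, all fold to an entry in the toy list and would be flagged.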
,
May 19 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4eec0f46bf71277f9de364ea8f4fb2f41d894b16

commit 4eec0f46bf71277f9de364ea8f4fb2f41d894b16
Author: tsergeant <tsergeant@chromium.org>
Date: Fri May 19 07:24:38 2017

Revert of Mitigate spoofing attempt using Latin letters. (patchset #47 id:850001 of https://codereview.chromium.org/2784933002/ )

Reason for revert:
This CL is causing compile to fail on Win x64:
https://build.chromium.org/p/chromium/builders/Win%20x64/builds/11432

FAILED: obj/components/url_formatter/top_domains/make_top_domain_gperf/make_top_domain_gperf.obj
make_top_domain_gperf.cc(46): error C2220: warning treated as error - no 'object' file generated
make_top_domain_gperf.cc(46): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data

Original issue's description:
> Add checks against spoofing attempt at top domains
>
> Remove diacritic marks from a hostname and calculate the confusability
> skeleton of the accent-free name. Look it up in the pre-calculated list of
> the skeletons of top 10k domains.
>
> Removing diacritic marks from a hostname is equivalent to comparing names with
> the primary collation strength in the root locale. To make them equivalent,
> three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
> Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's
> confusables list.
>
> Binary file size increase: ~ 59kB for the DAFSA representation of top
> domain name skeletons.
>
> The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs)
> on my machine per the test run over ~1 million IDNs in com TLD.
>
> It adds about 1500 domains to the list of domains to display in Punycode out
> of ~ 1 million IDNs in com TLD. (3018 => 4571)
>
> In addition, disallow combining diacritic marks unless they're preceded by
> Latin-Greek-Cyrillic.
>
> BUG= 703750 , 714628 , 719199 , 722639
> TEST=components_unittests --gtest_filter=*IDNToUni*
>
> Review-Url: https://codereview.chromium.org/2784933002
> Cr-Commit-Position: refs/heads/master@{#473109}
> Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91

TBR=rsleevi@chromium.org,pkasting@chromium.org,nick@chromium.org,brettw@chromium.org,emilyschechter@chromium.org,jshin@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 703750 , 714628 , 719199 , 722639

Review-Url: https://codereview.chromium.org/2889303003
Cr-Commit-Position: refs/heads/master@{#473118}

[modify] https://crrev.com/4eec0f46bf71277f9de364ea8f4fb2f41d894b16/components/url_formatter/BUILD.gn
[modify] https://crrev.com/4eec0f46bf71277f9de364ea8f4fb2f41d894b16/components/url_formatter/idn_spoof_checker.cc
[modify] https://crrev.com/4eec0f46bf71277f9de364ea8f4fb2f41d894b16/components/url_formatter/idn_spoof_checker.h
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/BUILD.gn
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/README
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/alexa_domains.list
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/alexa_skeletons.gperf
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/make_alexa_top_list.py
[delete] https://crrev.com/f677dc5c2d440d6e074a1d624e8a0b7a68371e08/components/url_formatter/top_domains/make_top_domain_gperf.cc
[modify] https://crrev.com/4eec0f46bf71277f9de364ea8f4fb2f41d894b16/components/url_formatter/url_formatter.cc
[modify] https://crrev.com/4eec0f46bf71277f9de364ea8f4fb2f41d894b16/components/url_formatter/url_formatter_unittest.cc
,
May 22 2017
Issue 724740 has been merged into this issue.
,
May 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a586e96794b89bef4729b33369b8c2035564d376

commit a586e96794b89bef4729b33369b8c2035564d376
Author: jshin <jshin@chromium.org>
Date: Mon May 22 07:20:17 2017

Add checks against spoofing attempt at top domains

Original CL (https://codereview.chromium.org/2784933002) was reverted due to a compile failure on win_x64 (not detected by CQ but detected post-landing). That issue was addressed using checked_cast.

Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons.

The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by Latin-Greek-Cyrillic.

TBR=pkasting@chromium.org
BUG= 703750 , 714628 , 719199 , 722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2897873002
Cr-Commit-Position: refs/heads/master@{#473519}

[modify] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/BUILD.gn
[modify] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/idn_spoof_checker.cc
[modify] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/idn_spoof_checker.h
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/BUILD.gn
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/README
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/alexa_domains.list
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/alexa_skeletons.gperf
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/make_alexa_top_list.py
[add] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/top_domains/make_top_domain_gperf.cc
[modify] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/url_formatter.cc
[modify] https://crrev.com/a586e96794b89bef4729b33369b8c2035564d376/components/url_formatter/url_formatter_unittest.cc
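The last rule in the commit description (no combining diacritic marks unless preceded by Latin-Greek-Cyrillic) could be sketched in Python as below. Several things here are assumptions rather than facts from the commit: "combining diacritic marks" is taken to mean the U+0300..U+036F block, "preceded by" is read as "the nearest preceding non-mark character", and the script test is approximated through Unicode character names. The function name combining_marks_allowed is invented.

```python
import unicodedata

# Assumption: approximate "Latin-Greek-Cyrillic" via character-name prefixes.
ALLOWED_BASE_PREFIXES = ("LATIN", "GREEK", "CYRILLIC")


def is_combining_diacritic(ch):
    # Assumption: the Combining Diacritical Marks block, U+0300..U+036F.
    return 0x0300 <= ord(ch) <= 0x036F


def combining_marks_allowed(label):
    """False if any combining diacritic lacks a Latin/Greek/Cyrillic base."""
    base = None  # nearest preceding character that is not a combining mark
    for ch in label:
        if is_combining_diacritic(ch):
            if base is None:
                return False  # a mark with nothing before it
            if not unicodedata.name(base, "").startswith(ALLOWED_BASE_PREFIXES):
                return False
        else:
            base = ch
    return True
```

Under these assumptions, "e" followed by U+0301 is accepted, while the same mark after a Devanagari letter, or at the very start of a label, is rejected.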
,
May 24 2017
,
May 25 2017
,
May 26 2017
Issue 726738 has been merged into this issue.
,
Jun 6 2017
Issue 729444 has been merged into this issue.
,
Jun 9 2017
Issue 731745 has been merged into this issue.
,
Jul 24 2017
,
Jan 17 2018
Issue 802070 has been merged into this issue.
,
Oct 19
Comment 1 by pkasting@chromium.org
, Mar 21 2017