IDN spoofing guard: characters that look like multiple characters (font/platform variations)
Issue description

Spun off from bug 817247.

Some characters have more than one look-alike. For instance, U+0153 (œ) can arguably be mapped to 'ae', 'oe' or 'ce', and U+04CF (ӏ) can be mapped to 'i', 'l' or '1'. At the moment, U+04CF is mapped to both 'i' and 'l' (and indirectly to '1', because 'l' and the digit '1' share the same spoofing skeleton). If there is more than one of those characters with multiple 'skeletons', we don't have a good solution. What I tried does not work ( https://chromium-review.googlesource.com/c/chromium/src/+/974165/6#message-af0b0cffc6cba6bee7713fd2fc4b8532d0a0a1ba and comments thereafter ).

From bug 817247 comment 8:

[\u0131\u0269\u026A\u03B9\u0456\u04CF\u13A5\uA647\U000118C3] & [:IdentifierStatus=Allowed:] =>

ı U+0131 LATIN SMALL LETTER DOTLESS I
ι U+03B9 GREEK SMALL LETTER IOTA
і U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
ӏ U+04CF CYRILLIC SMALL LETTER PALOCHKA

Three more characters that may need a similar treatment. They're currently folded to 'i'. In addition to that, we can map them to 'l' (lowercase L) for the 2nd check and calculate the skeleton. Then it would match the digit '1' as well, because the skeleton of digit 1 is lowercase L (see bug 820068).
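To make the difficulty concrete, here is a minimal Python sketch of the "multiple skeletons" idea. It is not Chrome's or ICU's implementation: the tables `MULTI_LOOKALIKES` and `SINGLE_PROTOTYPE` and the function `candidate_skeletons` are hypothetical and contain only the examples named in this report. Each ambiguous character contributes several alternatives, so the number of candidate skeletons multiplies with every such character in a label.

```python
from itertools import product

# Hypothetical toy data built from the examples in this report.
# These are NOT the real Chrome or ICU confusable tables.
MULTI_LOOKALIKES = {
    "\u04cf": ("i", "l"),          # CYRILLIC SMALL LETTER PALOCHKA
    "\u0153": ("ae", "oe", "ce"),  # LATIN SMALL LIGATURE OE
}

# Characters with a single prototype: digit one folds to lowercase L.
SINGLE_PROTOTYPE = {"1": "l"}


def candidate_skeletons(label):
    """Return every skeleton the label could have under the toy tables."""
    choices = []
    for ch in label:
        if ch in MULTI_LOOKALIKES:
            # Ambiguous character: keep all of its look-alikes.
            choices.append(MULTI_LOOKALIKES[ch])
        else:
            # Ordinary character: exactly one prototype (or itself).
            choices.append((SINGLE_PROTOTYPE.get(ch, ch),))
    # Cartesian product over the per-character alternatives.
    return {"".join(parts) for parts in product(*choices)}
```

With these toy tables, a label with one palochka yields two candidates, while a label with two palochkas and one œ already yields 2 × 2 × 3 = 12, which is one way to see why several such characters in a single label are hard to handle.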
May 18 2018
There is no such thing as one character having multiple skeletons. Every character has exactly one skeleton (technically called a "prototype") according to the TR 39 specification: http://unicode.org/reports/tr39/tr39-1.html

The characters should be added to the same equivalence class (same prototype). Mark has some ideas on how to add more flexibility, but that's still in the early design phases.
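For reference, the single-prototype model can be sketched roughly as follows: normalize to NFD, replace each character with its one prototype, and normalize again; two strings count as confusable when their skeletons are equal. This is a simplified illustration, not the ICU code. The `PROTOTYPES` table is a hypothetical toy subset using only the foldings mentioned in this thread, and the names `skeleton` and `confusable` are made up for the example.

```python
import unicodedata

# Hypothetical toy subset of a confusables table, using only foldings
# mentioned in this thread. The real data file is far larger.
PROTOTYPES = {
    "1": "l",       # DIGIT ONE -> lowercase L
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "\u03b9": "i",  # GREEK SMALL LETTER IOTA
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u04cf": "i",  # CYRILLIC SMALL LETTER PALOCHKA
}


def skeleton(text):
    """Simplified skeleton: NFD, map each character to its prototype, NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)
```

Under this toy table, `confusable("pay\u04cf", "payi")` holds but `confusable("pay\u04cf", "payl")` does not, because the palochka has only the single prototype 'i'. That gap is the case the issue description is about.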
May 21 2018
Assigning to jshin to get out of Enamel triage queue. Please either find a good owner for this or set back to untriaged.
Jun 1 2018
> There is no such thing as one character having multiple skeletons

I know that the current spoofing data does not allow that. This bug is about how to tackle the cases in the bug report (comment 0): by a mapping data change (e.g. mapping all i-like, l-like and 1-like characters into a single skeleton would be one way, but I'm not sure of its ramifications), by changing the mapping format/structure, or by handling it at a 'higher' level (changing the spoofing detection implementation, or changing its users, such as Chrome).

Given my recent change, I'm sorry I can't work on this any more.
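The data-change option can be illustrated with a small Python sketch. Everything here is hypothetical: `MERGED_CLASS` and `merged_skeleton` are made-up names, the class holds only the characters discussed in this bug, and normalization is omitted for brevity. The sketch shows both the benefit of merging all i-like, l-like and 1-like characters into one skeleton and one possible ramification.

```python
# Hypothetical merged equivalence class: every i-like, l-like and 1-like
# character discussed in this bug folds to the single prototype "l".
MERGED_CLASS = "il1\u0131\u03b9\u0456\u04cf"
MERGED_PROTOTYPES = {ch: "l" for ch in MERGED_CLASS}


def merged_skeleton(label):
    """Fold a label using the merged class (normalization omitted)."""
    return "".join(MERGED_PROTOTYPES.get(ch, ch) for ch in label)
```

With this folding, a label ending in palochka matches the variants ending in 'i', 'l' and '1' in a single check. The cost is that plain ASCII labels such as "mail" and "mall" also collapse to the same skeleton, so a change like this would probably need an extra rule to avoid flagging ordinary names.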
Jun 29 2018
Nov 5
Issue 901578 has been merged into this issue.
Comment 1 by js...@chromium.org, May 15 2018