Grapheme boundary iterating doesn't handle degenerate Indic cases well
Issue description

Chrome Version: 63.0.3227.0 (Developer Build)

We have some utility functions in EditingUtilities.cpp to find grapheme boundaries in a Unicode string. E.g., NextGraphemeBoundary() takes a text node and an offset, and finds the next grapheme boundary. I was assigned a crash in crbug.com/769250 that turned out to be related to grapheme handling.

It turns out that if you have a string like:

"a\u094D": a्

you can stick the insertion point before the a, after the ्, or in the middle of the two characters, and select any of the characters as you would expect. However, if you try to compute the VisibleSelection, we canonicalize so that the selection endpoints are on grapheme boundaries. We have a function in StateMachineUtil.cpp called IsGraphemeBreak() that takes two Unicode code points and determines whether or not we can have a grapheme break between them. As far as I can tell, it faithfully implements the algorithm given in Unicode Standard Annex #29: Unicode Text Segmentation:

http://www.unicode.org/reports/tr29/

The specific rule we're running into here is GB9:

http://www.unicode.org/reports/tr29/#GB9

"Do not break before extending characters or ZWJ." (ZWJ = zero-width joiner).

The character in question does appear to be an extending character (Grapheme_Extend: Yes):

https://unicode.org/cldr/utility/character.jsp?a=094D

However, this character clearly fails to extend the letter a, as well as many other characters, so the algorithm in TR29 appears to be unhelpful here. Looking for a way out, I found in TR29:

http://www.unicode.org/reports/tr29/#Notation

"These rules are constrained in three ways, to make implementations significantly simpler and more efficient. ... 2. Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark."

So, this exact case is actually called out as something the spec doesn't handle.
Unfortunately, I think as long as it's possible to create a web page where you put these two characters together, in principle it's something we have to handle. I think our grapheme logic should ideally be consistent with the logic we actually use to decide that it's possible to stick your insertion point between these two characters.
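
To make the GB9 behavior described above concrete, here is a minimal sketch of that single rule in isolation. It is not the code in StateMachineUtil.cpp: the names IsExtendOrZwjForSketch and IsGraphemeBreakSketch are hypothetical, the property lookup is a tiny hard-coded stand-in for a real Unicode property query, and every TR29 rule other than GB9 is left out, so anything else falls through to "break".

```cpp
// Hypothetical stand-in for a real Grapheme_Extend / ZWJ property lookup.
// Only three code points are listed, purely for illustration.
constexpr bool IsExtendOrZwjForSketch(char32_t cp) {
  return cp == 0x094D     // DEVANAGARI SIGN VIRAMA (Grapheme_Extend: Yes)
         || cp == 0x0301  // COMBINING ACUTE ACCENT
         || cp == 0x200D; // ZERO WIDTH JOINER
}

// Simplified GB9 only: "Do not break before extending characters or ZWJ."
// The preceding code point is ignored, which is exactly why a Latin "a"
// followed by an Indic virama is reported as a single cluster.
constexpr bool IsGraphemeBreakSketch(char32_t /*prev*/, char32_t next) {
  return !IsExtendOrZwjForSketch(next);
}

// The degenerate case from this report: no break between 'a' and U+094D.
static_assert(!IsGraphemeBreakSketch(U'a', 0x094D), "GB9 forbids a break");
// The ordinary case the rule was written for: KA (U+0915) + virama.
static_assert(!IsGraphemeBreakSketch(0x0915, 0x094D), "GB9 forbids a break");
// Two plain letters: no rule in this sketch applies, so a break is allowed.
static_assert(IsGraphemeBreakSketch(U'a', U'b'), "break is allowed");
```

Because the rule looks only at the code point after the candidate position, the sketch cannot tell a real Devanagari base from an unrelated letter, which matches the "ignore degenerates" constraint quoted from TR29.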
Sep 29 2017
Yeah, I was mistaken: Chrome doesn't let you select the two code points in a् separately on Linux either.
Comment 1 by rlanday@chromium.org, Sep 29 2017 (attachment, 2.2 KB)