
Issue 770026

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Android, Windows, Chrome, Mac, Fuchsia
Pri: 3
Type: Bug




Grapheme boundary iterating doesn't handle degenerate Indic cases well

Project Member Reported by rlanday@chromium.org, Sep 29 2017

Issue description

Chrome Version: 63.0.3227.0 (Developer Build)

We have some utility functions in EditingUtilities.cpp that find grapheme boundaries in a Unicode string. For example, NextGraphemeBoundary() takes a text node and an offset and finds the next grapheme boundary. I was assigned a crash in crbug.com/769250 that turned out to be related to grapheme handling. It turns out that if you have a string like:

"a\u094D": a्

you can stick the insertion point before the a, after the ्, or between the two characters, and select either character as you would expect. However, when we compute the VisibleSelection, we canonicalize it so that the selection endpoints fall on grapheme boundaries. We have a function in StateMachineUtil.cpp called IsGraphemeBreak() that takes two Unicode code points and determines whether a grapheme break is allowed between them. As far as I can tell, it faithfully implements the algorithm given in Unicode Standard Annex #29: Unicode Text Segmentation:
http://www.unicode.org/reports/tr29/

The specific rule we're running into here is GB9:
http://www.unicode.org/reports/tr29/#GB9

"Do not break before extending characters or ZWJ." (ZWJ = zero-width joiner). The character in question does appear to be an extending character (Grapheme_Extend: Yes):
https://unicode.org/cldr/utility/character.jsp?a=094D
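
To make the effect of GB9 concrete, here is a minimal sketch of the rule in isolation. It is not the code in StateMachineUtil.cpp: the function names are hypothetical, the Grapheme_Extend lookup is reduced to U+094D plus the Combining Diacritical Marks block (a real implementation would consult the full Unicode property tables), and every other TR29 rule is omitted so that anything not covered by GB9 simply breaks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, heavily reduced stand-in for a Grapheme_Extend lookup.
// Assumption: only these code points matter for the illustration.
bool IsGraphemeExtendSketch(char32_t c) {
  return c == 0x094D                     // DEVANAGARI SIGN VIRAMA
         || (c >= 0x0300 && c <= 0x036F);  // Combining Diacritical Marks
}

// GB9 only: do not break before an extending character or ZWJ.
// All other pairs fall through to "break"; the remaining rules are omitted.
bool IsGraphemeBreakSketch(char32_t prev, char32_t next) {
  (void)prev;  // GB9 looks only at the code point after the candidate break.
  const char32_t kZeroWidthJoiner = 0x200D;
  return !(IsGraphemeExtendSketch(next) || next == kZeroWidthJoiner);
}

// Returns true if `offset` (in code points) is a boundary under the sketch.
// The start and end of the text are always treated as boundaries.
bool IsGraphemeBoundaryAtSketch(const std::u32string& text,
                                std::size_t offset) {
  assert(offset <= text.size());
  if (offset == 0 || offset == text.size()) return true;
  return IsGraphemeBreakSketch(text[offset - 1], text[offset]);
}
```

Under this sketch, offset 1 in U"a\u094D" is not a boundary, because the rule never inspects the preceding code point. The Latin letter a is treated exactly like a Devanagari consonant would be.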

However, this character clearly fails to extend the letter a, as well as many other characters, so the algorithm in TR29 appears to be unhelpful here. Looking for a way out, I found in TR29:

http://www.unicode.org/reports/tr29/#Notation

"These rules are constrained in three ways, to make implementations significantly simpler and more efficient.
...
2. Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark."

So, this exact case is actually called out as something the spec doesn't handle. Unfortunately, I think as long as it's possible to create a web page where you put these two characters together, in principle it's something we have to handle. I think our grapheme logic should ideally be consistent with the logic we actually use to decide that it's possible to stick your insertion point between these two characters.
 
"a्" actually behaves as one character for selection purposes on macOS (I will double-check the behavior on Linux; I may have tested incorrectly). Safari happily combines the a with the joining character (see attachment). So I guess there's no clearly correct behavior here.

I guess even if we changed our behavior here, it wouldn't actually help with the issue in crbug.com/769250, where the composition is normalized to become empty. There are cases where the composition tries to split a clearly legitimate grapheme cluster, and we still have to figure out how to handle those: either by normalizing the composition and then handling the case where it becomes empty, or by allowing the composition range to contain only part of a grapheme cluster.
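
As a rough illustration of how a range that covers only part of a cluster can normalize to empty, here is a sketch under stated assumptions. The shrink-inward direction is an assumption chosen for illustration, not a description of how Chromium actually canonicalizes a composition range; all names are hypothetical, and the boundary test is the same reduced GB9-only check (U+094D, U+0300..U+036F and ZWJ).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical, reduced Grapheme_Extend lookup (illustration only).
bool IsGraphemeExtendSketch(char32_t c) {
  return c == 0x094D || (c >= 0x0300 && c <= 0x036F);
}

// GB9-only boundary test; text start and end always count as boundaries.
bool IsBoundarySketch(const std::u32string& text, std::size_t offset) {
  assert(offset <= text.size());
  if (offset == 0 || offset == text.size()) return true;
  const char32_t next = text[offset];
  return !(IsGraphemeExtendSketch(next) || next == 0x200D);
}

// Assumed normalization: shrink [start, end) inward until both endpoints
// sit on grapheme boundaries. The result may be an empty range.
std::pair<std::size_t, std::size_t> ShrinkToGraphemeBoundariesSketch(
    const std::u32string& text, std::size_t start, std::size_t end) {
  assert(start <= end && end <= text.size());
  while (start < end && !IsBoundarySketch(text, start)) ++start;
  while (end > start && !IsBoundarySketch(text, end)) --end;
  return {start, end};
}
```

With U"\u0915\u094D" (ka followed by virama, a legitimate cluster), the range [0, 1) shrinks to an empty range, while [0, 2) is left untouched. A caller using this kind of normalization would therefore need to handle the empty result explicitly.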
Attachment: a्.png (2.2 KB)
Status: WontFix (was: Assigned)
Yeah, I was mistaken: Chrome doesn't let you select the two code points in a् separately on Linux either.
