New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 856947 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac , Fuchsia
Pri: 3
Type: Bug



Sign in to add a comment

Update CursorMovementIterator while getting along with ICU updating.

Project Member Reported by js...@chromium.org, Jun 27 2018

Issue description

Discovered while investigating webkit unit test failure for ICU 62 update ( bug 850334 )

third_party/blink/renderer/core/editing/state_machines/state_machine_util.cc

has various Unicode character properties hard-coded, but it's not updated for a while. 

Some character properties such as EmojiGAZ are obsolete (as of Unicode 11). Indic syllable properties have undergone quite extensive changes. 

Moreover, IsGraphemeBreak() implemented a version of UAX 29 (Unicode Text Segmentation), but UAX 29 tends to be updated every Unicode release (every year). It's prone to become obsolete. 

I wonder why these character lists were hard-coded and IsGraphemeBreak() were implemented in Blink.  ( https://codereview.chromium.org/1856603002 )

IMHO, those things had better be delegated to ICU's BreakIterator. If necessary, BreakIterator can be tailored.  

Blink layout uses ICU's break iterator. Apparently, IsGraphemeBreak is used is used in editing. 
 


 

Comment 1 by js...@chromium.org, Jun 27 2018

> I wonder why these character lists were hard-coded and IsGraphemeBreak() were implemented in Blink.  ( https://codereview.chromium.org/1856603002 )

The rationale for that is given in  bug 594923  comment 9:

> By replacing with own implementation, we can support
latest Unicode spec without waiting ICU update.
For example, grapheme boundary for the recent complex
emoji sequence is now supported.


However, moving away from ICU has now exactly the opposite effect.  Blink's grapheme cluster break implementation has to be updated every time UAX 29/Unicode is changed in addition to ICU update. 

Another rationale is as following.  This need to be addressed in a different way if it's to be addressed. 

> There is another benefit of new implementation. The
interface of new implementation does not require the
full continuous text buffer. This is good if the target
text crosses the Nodes.

Comment 2 by js...@chromium.org, Jun 27 2018

Components: Blink>Fonts>Emoji

Comment 3 by js...@chromium.org, Jun 27 2018

Description: Show this description

Comment 4 by kojii@chromium.org, Jun 27 2018

Cc: yosin@chromium.org
Cursor movement needs to be different from UAX#29. WebKit and old Blink use a forked copy of ICU rules for cursor movement.

When the forked rules weren't maintained very well, editing team decided to move to is own code by leaning Android doing so, but now the state machine is no longer maintained, so we're back to the original problem.

nona@, what Android team does to keep it up-to-date? Is the effort sharable?

Comment 5 Deleted

Comment 6 by nona@google.com, Jun 27 2018

>#4 I haven't changed any rules in Android in P. Only emoji set was updated.

Comment 7 by nona@google.com, Jun 27 2018

Sorry, I removed comment 5 due to that has internal only link and I missed #1 comment.

At that time, the emoji implementation in ICU was buggy and far from the latest spec. So we decided to implement our own. I believe this has now solved.
So, once the 2nd item is addressed by ICU, I think we can remove our state machine and switch back to the ICU implementation.

I'll submit a feature request to ICU for 2nd item.

Comment 8 by nona@google.com, Jun 27 2018

For the Android stuff, almost the same reason as WebKit, we have our implementation of grapheme breaker.

https://android.googlesource.com/platform/frameworks/minikin/+/master/libs/minikin/GraphemeBreak.cpp

I haven't changed anything in this code in P.

Comment 9 by nona@google.com, Jun 27 2018

Wait, I just looked the BreakIterator API list and found  BreakIterator::adoptText which accepts CharacterIterator instead of UChar* or UnicodeString.

http://icu-project.org/apiref/icu4c/classicu_1_1BreakIterator.html#a1fd72184b59936bb62b07af071e2b0c1

The API doc is not clear but if this works like setText, this could be an solution for the 2nd item.

Comment 10 Deleted

Comment 11 by kojii@chromium.org, Jun 28 2018

Is the code you mentioned a grapheme cluster or a caret boundary?

Comment 12 by nona@google.com, Jun 28 2018

The code is a grapheme cluster and Android use that as a caret boundary.

Comment 13 by e...@chromium.org, Jun 28 2018

Status: Available (was: Untriaged)
Blocking: 850334
Cc: aheninger@google.com mscherer@google.com
> Cursor movement needs to be different from UAX#29. WebKit and old Blink use a forked copy of ICU rules for cursor movement

Yeah, I remember that. The cursor movement is tricky and a simple break-iterator-based method may not work in some situations without an extra-layer of code, but I'm digressing. 

Back to the topic: 

A rule tailored from the latest ICU would be still easier to maintain than a completely different custom code because we can compare the tailored rule against the original and port the difference on top of the latest CLDR/ICU. 

> At that time, the emoji implementation in ICU was buggy and far from the latest spec

IIRC, in Q2 2016 (when this custom code was added to Blink),  we could have cherry-picked ICU data/ breakiterator rules from ICU 58-to-be to support the latest Emoji spec available back then. :-) (ICU 58 was released in October 2016). 

Anyway, putting aside the past,  it'd be great if what Nona mentioned in comment 9 could work for the 2nd point in comment 1. 

In the meantime, I'm marking this as a blocker to  bug 850334  (ICU update to 62) although I may be able to find a temporary work around to update ICU 'now' than later. 



> The code is a grapheme cluster and Android use that as a caret boundary.

If this is UAX#29, then we're degrading in some i18n scripts. The value is unclear if Android doesn't support it, but I saw WebKit added some more rules for i18n cursor movements last year, so we're not as great as WebKit in this regard.
> The value is unclear if Android doesn't support it

Even if a given Android device does not have a font for a new script, one can always use a web font. 

The failing tests with ICU 62.1 are as following. One is clearly about grapheme cluster, but the other two seem to be about editing.

blink_platform_unittests (retry summary)  
failures:
TextBoundariesTest.ForwardWhitespaces
 
webkit_unit_tests (retry summary)  
failures:
StateMachineUtilTest.IsGraphmeBreak_EmojiModifier
EditingUtilitiesTest.uncheckedPreviousNextOffset
> The failing tests with ICU 62.1 are as following

If this is blocking, or important, please assign to yosin@. I think I'm with you, but also we need his consensus.

> Even if a given Android device does not have a font for a new script, one can always use a web font.

True, but the point is, it doesn't work outside Blink; e.g., Android text edit.
I'm down to this test and going to take a look. 

EditingUtilitiesTest.uncheckedPreviousNextOffset

I fixed the other two tests:

Whitespaces: expected results were wrong per the latest UAX 29.  
Statemachine.IsGraph*: again expected results were wrong. I also removed unnecessary codes per the latest UAX 29). 


EditingUtilitiesTest.uncheckedPreviousNextOffset has the same issue (the latest UAX 29 does not have GB 10 any more) and I fixed the test as well. 
https://chromium-review.googlesource.com/c/chromium/src/+/1111818 has a partial fix for Blink's implementation of GB rules. I didn't venture to add GB12 (RI sequence handling). It may be done somewhere else, but I haven't looked for it). 

Moving Grapheme BI back to ICU may be better (if possible, which seems to be the case) in terms of the maintenance. 
nona@, will you update Blink's state machine to adopt new Unicode emoji?
Talked with yosin@, he's ok either way; i.e., nona@ to fix IsGraphemeBreak, or going back to CursorMovementIterator.
Project Member

Comment 23 by bugdroid1@chromium.org, Jul 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e9126f1d03725c2ae97d524985971d66089eede3

commit e9126f1d03725c2ae97d524985971d66089eede3
Author: Jungshik Shin <jshin@chromium.org>
Date: Tue Jul 10 08:28:18 2018

Update ICU to 62.1

For the actual changes, see
  https://chromium-review.googlesource.com/c/chromium/deps/icu/+/1128503

Update a few tests per CLDR 33-1/ICU 62/Unicode 11 data changes.

* Arabic number format expected results are also updated because Arabic
locales now use "European" digits by default in Chromium's ICU to match
the behavior of other Google products. The CLDR upstream is expcted to
follow suit soon.

* Blink's homegrown grapheme cluster breakiterator is partly updated
per UAX 29 (text segmentation) in Unicode 11. (see crbug.com/856947)

* Whitespaces are kept together by a new word breaking rule (
 http://unicode.org/reports/tr29/#WB3d ). Update the test to reflect
 that change.

* Fix the currency formatting code in payment to handle better cases
with a space (U+00A0 / U+0020) between a currency code and a numeric
value (e.g. "-USD 47.37" in en-AU for USD).  A test was added to
CurrencyAmounts/PaymentsCurrencyFormatterTest.IsValidCurrencyFormat for
a negative value.

Cq_Include_Trybots: master.tryserver.blink:linux_trusty_blink_rel,mac10.12_blink_rel,win10_blink_rel;master.tryserver.chromium.android:android_blink_rel

Bug:  850334 ,856947
Test: base_unittests, and all other tests
Cq-Include-Trybots: luci.chromium.try:linux_layout_tests_layout_ng
Change-Id: I6e7e76df358fc1e04c9eb95422b3ab7bfb8f78fd
Reviewed-on: https://chromium-review.googlesource.com/1111818
Commit-Queue: Jungshik Shin <jshin@chromium.org>
Reviewed-by: Seigo Nonaka <nona@chromium.org>
Reviewed-by: Kent Tamura <tkent@chromium.org>
Reviewed-by: Yoshifumi Inoue <yosin@chromium.org>
Reviewed-by: Mathieu Perreault <mathp@chromium.org>
Reviewed-by: Avi Drissman <avi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#573650}
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/DEPS
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/base/i18n/number_formatting_unittest.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/base/i18n/time_formatting_unittest.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/components/payments/core/currency_formatter.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/components/payments/core/currency_formatter_unittest.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/WebKit/LayoutTests/fast/forms/time-multiple-fields/time-multiple-fields-localization.html
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/WebKit/LayoutTests/platform/linux/fast/forms/time-multiple-fields/time-multiple-fields-localization-expected.txt
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/WebKit/LayoutTests/platform/mac-mac10.12/fast/forms/time-multiple-fields/time-multiple-fields-localization-expected.txt
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/WebKit/LayoutTests/platform/mac/fast/forms/time-multiple-fields/time-multiple-fields-localization-expected.txt
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/WebKit/LayoutTests/platform/win/fast/forms/time-multiple-fields/time-multiple-fields-localization-expected.txt
[delete] https://crrev.com/4ea4214a59dfcd1fe8f2cda5b435b96a118ee1c9/third_party/WebKit/LayoutTests/platform/win7/fast/forms/time-multiple-fields/time-multiple-fields-localization-expected.txt
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/blink/renderer/core/editing/editing_utilities_test.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/blink/renderer/core/editing/state_machines/state_machine_util.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/blink/renderer/core/editing/state_machines/state_machine_util_test.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/blink/renderer/platform/text/platform_locale_test.cc
[modify] https://crrev.com/e9126f1d03725c2ae97d524985971d66089eede3/third_party/blink/renderer/platform/text/text_boundaries_test.cc

Blocking: -850334
Owner: yoichio@chromium.org
Status: Assigned (was: Available)
Summary: Update CursorMovementIterator while getting along with ICU updating. (was: IsGraphemeBreak: uses out-dated hard-coded character property list and hard-coded rules. )
Design doc: https://goo.gl/HyigTY
Components: -Blink>Fonts>Emoji

Sign in to add a comment