New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 654839 link

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug

Blocking:
issue 212707



Sign in to add a comment

hterm: unicode: some combining characters don't render correctly

Project Member Reported by vapier@chromium.org, Oct 11 2016

Issue description

the user report:
============================================================
I was hoping to use this with a demo I was doing of how some programming languages handle Unicode string comparisons. So, I had this little program:

say ("Å" eq "Å" ? 'same' : 'different');

Those two LATIN A's with ring above are different (though, who knows if they are by the time you get them from me or if they will even come through correctly encoded at all). The first is an "A" followed by U+030A (COMBINING RING ABOVE) and the second is just the single codepoint U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE). When I cat that file in hterm, the second (U+00C5) shows correctly as an A with a ring above it, but the first written with two codepoints just shows up as a latin capital letter A with no ring. Both should appear as identical grapheme characters even though they are assembled via two different codepoint sequences.
============================================================

to reproduce easily, run on a Linux system:
printf '\U0000C5\n'   # This shows Å (U+00C5)
printf 'A\U00030A\n'  # This shows A followed by U+030A which should render as Å

i'm attaching an example text file to also show the data clearly.
$ hexdump -C data.txt
00000000  c3 85 0a 41 cc 8a 0a                              |...A...|
 
data.txt
7 bytes View Download
Labels: -OS-Chrome OS-All
Summary: hterm: unicode: some combining characters don't render correctly (was: hterm: some unicode combining characters don't render correctly)

Comment 2 by vapier@chromium.org, Apr 10 2017

poking around, seems like all combining codepoints get eaten in the hterm layer.  it will output the "A" first, then process the combining char by itself to create a wc-node, then see that the node is 0 bytes, and then delete it (i think).

Comment 3 by vapier@chromium.org, Apr 11 2017

changing hterm.TextAttributes.splitWidecharString to not split zero length chars fixes this:
-    if (c < 128 || lib.wc.charWidth(c) == 1) {
+    if (c < 128 || lib.wc.charWidth(c) <= 1) {

but then runs into cursor positioning issues.  that can be "fixed" by hacking all the node helpers to use the lib.wc variants.  e.g. hterm.TextAttributes.nodeWidth & hterm.TextAttributes.nodeSubstr & hterm.TextAttributes.nodeSubstring.

Comment 4 by vapier@chromium.org, Apr 11 2017

Blocking: 212707
so we need to fundamentally rewrite hterm.TextAttributes.splitWidecharString to operate on graphemes instead of codepoints

there is an ECMAScript API in the works to support this:
  https://github.com/tc39/proposal-intl-segmenter
although it's only at stage 2, so we're still too far off to wait for browsers to adopt it.

it's loosely based on the existing v8 non-standard Intl.v8BreakIterator API:
  https://code.google.com/archive/p/v8-i18n/wikis/BreakIterator.wiki

so instead of walking codepoints, we'd change the outer loop to use that iterator.
  let it = new Intl.v8BreakIterator(undefined, {type: 'character'});
  it.adoptText(s);
  let pos = it.first();
  while (pos != -1) {
    let next = it.next();
    if (next == -1)
      break;
    var slice = s.slice(pos, next);
    if (lib.wc.strWidth(slice) == 2)
      ... do wide node things ...
    else
      ... do narrow node things ...
    pos = next;
  }

there is a polyfill for Intl.Segmenter, but i think we'd want to use the v8 version by default since it's backed by C++ (and the assumption being it'll be faster).  have to get some perf numbers first though.

will also have to consider performance here since the loop does not allow for an ASCII-only bypass.  perhaps we run a regex over the input first, and we return early if the entire thing is within the ASCII range.
 Issue 753866  has been merged into this issue.

Comment 7 by js...@chromium.org, Aug 10 2017

Cc: js...@chromium.org littledan@chromium.org
Dan's proposal is only in stage 2. In the meantime, v8 has had Intl.v8BreakIterator since 2012(?). Dan's proposal is to standardize on v8BreakIterator with some changes. 




Comment 8 by js...@chromium.org, Aug 10 2017

Should have read all the comments before adding my comment. I missed comment 5. Sorry for the noise. 
FWIW Josh Bell's Intl.Segmenter polyfill is based on v8BreakIterator, so it should be fine. Anyway, Intl.Segmenter seems likely to reach Stage 3 pretty soon; we just have some abstract discussions to make progress on.
Project Member

Comment 10 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/apps/libapps/+/2895994208c9092d0cea88d4b7c381d3c8840e6f

commit 2895994208c9092d0cea88d4b7c381d3c8840e6f
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Aug 16 23:29:14 2017

libdot: wc: include trailing combining characters in substrings

When creating a substring, gobble up all trailing codepoints that do not
increase the width.  This lets us handle combining characters better by
not chopping them all off when they're at the edge of the substring.  We
already include them if they showed up earlier in the string.

BUG=chromium:654839

Change-Id: Ib15309532b82733ab23297b2cba8106d209f3eee
Reviewed-on: https://chromium-review.googlesource.com/599371
Reviewed-by: Brandon Gilmore <varz@google.com>
Tested-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/2895994208c9092d0cea88d4b7c381d3c8840e6f/libdot/third_party/wcwidth/lib_wc.js
[modify] https://crrev.com/2895994208c9092d0cea88d4b7c381d3c8840e6f/libdot/third_party/wcwidth/lib_wc_tests.js

Project Member

Comment 11 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/apps/libapps/+/1e98c0f67cf69a5732f6864cc6dafd4cdc603989

commit 1e98c0f67cf69a5732f6864cc6dafd4cdc603989
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Aug 16 23:29:17 2017

hterm: differentiate between non-ASCII and wide-characters

We need to treat strings specially when they're non-ASCII and not
just when wide characters.  i.e. When we're dealing with combining
codepoints.  Add an ASCII node property to the text nodes so we
can differentiate the states.

This doesn't fully fix combining characters, but should at least
improve things.  For example, they're still broken with wide chars.

BUG=chromium:654839

Change-Id: I4caeeadda4da019c557ea51e7e9fff4ec6634ef9
Reviewed-on: https://chromium-review.googlesource.com/615984
Reviewed-by: Brandon Gilmore <varz@google.com>
Tested-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_screen_tests.js
[modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_terminal.js
[modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_screen.js
[modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_text_attributes_tests.js
[modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_text_attributes.js

i've implemented this locally (not just a poc).  unfortunately, while experimenting with various test inputs, i've run into the limitations of v8BreakIterator and different scripts (e.g. Indic or Thai).

the ತ್ಯ grapheme is broken up into ತ್ (U+0CA4 U+0CCD) and ಯ (U+0CAF).
the ఫ్ట్వే grapheme is broken up into ఫ్ (U+0C2B U+0C4D) and ట్ (U+0C1F U+0C4D) and వే (U+0C35 U+0C47).

the question is whether this is something we should block on.  it's not a regression as we already mostly break things up the same way.  but once we implement issue 737454, this gets us into a pickle as we split up some based on how they render.

i guess we should just leave "improve Indic/etc... support" as a follow up (like issue 717536).
Project Member

Comment 13 by sheriffbot@chromium.org, Sep 4

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Sign in to add a comment