hterm: unicode: some combining characters don't render correctly |
||||
Issue description
the user report:
============================================================
I was hoping to use this with a demo I was doing of how some programming languages handle Unicode string comparisons. So, I had this little program:
say ("Å" eq "Å" ? 'same' : 'different');
Those two LATIN A's with ring above are different (though, who knows if they are by the time you get them from me or if they will even come through correctly encoded at all). The first is an "A" followed by U+030A (COMBINING RING ABOVE) and the second is just the single codepoint U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE). When I cat that file in hterm, the second (U+00C5) shows correctly as an A with a ring above it, but the first written with two codepoints just shows up as a latin capital letter A with no ring. Both should appear as identical grapheme characters even though they are assembled via two different codepoint sequences.
============================================================
to reproduce easily, run on a Linux system:
printf '\U0000C5\n' # This shows Å (U+00C5)
printf 'A\U00030A\n' # This shows A followed by U+030A which should render as Å
i'm attaching an example text file to also show the data clearly.
$ hexdump -C data.txt
00000000 c3 85 0a 41 cc 8a 0a |...A...|
,
Apr 10 2017
poking around, seems like all combining codepoints get eaten in the hterm layer. it will output the "A" first, then process the combining char by itself to create a wc-node, then see that the node is 0 bytes, and then delete it (i think).
,
Apr 11 2017
changing hterm.TextAttributes.splitWidecharString to not split zero length chars fixes this:
- if (c < 128 || lib.wc.charWidth(c) == 1) {
+ if (c < 128 || lib.wc.charWidth(c) <= 1) {
but then runs into cursor positioning issues. that can be "fixed" by hacking all the node helpers to use the lib.wc variants. e.g. hterm.TextAttributes.nodeWidth & hterm.TextAttributes.nodeSubstr & hterm.TextAttributes.nodeSubstring.
,
Apr 11 2017
,
Aug 2 2017
so we need to fundamentally rewrite hterm.TextAttributes.splitWidecharString to operate on graphemes instead of codepoints there is an ECMAScript API in the works to support this: https://github.com/tc39/proposal-intl-segmenter although it's only at stage 2, so we're still too far off to wait for browsers to adopt it. it's loosely based on the existing v8 non-standard Intl.v8BreakIterator API: https://code.google.com/archive/p/v8-i18n/wikis/BreakIterator.wiki so instead of walking codepoints, we'd change the outer loop to use that iterator. let it = new Intl.v8BreakIterator(undefined, {type: 'character'}); it.adoptText(s); let pos = it.first(); while (pos != -1) { let next = it.next(); if (next == -1) break; var slice = s.slice(pos, next); if (lib.wc.strWidth(slice) == 2) ... do wide node things ... else ... do narrow node things ... pos = next; } there is a polyfill for Intl.Segmenter, but i think we'd want to use the v8 version by default since it's backed by C++ (and the assumption being it'll be faster). have to get some perf numbers first though. will also have to consider performance here since the loop does not allow for an ASCII-only bypass. perhaps we run a regex over the input first, and we return early if the entire thing is within the ASCII range.
,
Aug 9 2017
Issue 753866 has been merged into this issue.
,
Aug 10 2017
Dan's proposal is only in stage 2. In the meantime, v8 has had Intl.v8BreakIterator since 2012(?). Dan's proposal is to standardize on v8BreakIterator with some changes.
,
Aug 10 2017
Should have read all the comments before adding my comment. I missed comment 5. Sorry for the noise.
,
Aug 10 2017
FWIW Josh Bell's Intl.Segmenter polyfill is based on v8BreakIterator, so it should be fine. Anyway, Intl.Segmenter seems likely to reach Stage 3 pretty soon; we just have some abstract discussions to make progress on.
,
Aug 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/apps/libapps/+/2895994208c9092d0cea88d4b7c381d3c8840e6f commit 2895994208c9092d0cea88d4b7c381d3c8840e6f Author: Mike Frysinger <vapier@chromium.org> Date: Wed Aug 16 23:29:14 2017 libdot: wc: include trailing combining characters in substrings When creating a substring, gobble up all trailing codepoints that do not increase the width. This lets us handle combining characters better by not chopping them all off when they're at the edge of the substring. We already include them if they showed up earlier in the string. BUG=chromium:654839 Change-Id: Ib15309532b82733ab23297b2cba8106d209f3eee Reviewed-on: https://chromium-review.googlesource.com/599371 Reviewed-by: Brandon Gilmore <varz@google.com> Tested-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/2895994208c9092d0cea88d4b7c381d3c8840e6f/libdot/third_party/wcwidth/lib_wc.js [modify] https://crrev.com/2895994208c9092d0cea88d4b7c381d3c8840e6f/libdot/third_party/wcwidth/lib_wc_tests.js
,
Aug 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/apps/libapps/+/1e98c0f67cf69a5732f6864cc6dafd4cdc603989 commit 1e98c0f67cf69a5732f6864cc6dafd4cdc603989 Author: Mike Frysinger <vapier@chromium.org> Date: Wed Aug 16 23:29:17 2017 hterm: differentiate between non-ASCII and wide-characters We need to treat strings specially when they're non-ASCII and not just when wide characters. i.e. When we're dealing with combining codepoints. Add an ASCII node property to the text nodes so we can differentiate the states. This doesn't fully fix combining characters, but should at least improve things. For example, they're still broken with wide chars. BUG=chromium:654839 Change-Id: I4caeeadda4da019c557ea51e7e9fff4ec6634ef9 Reviewed-on: https://chromium-review.googlesource.com/615984 Reviewed-by: Brandon Gilmore <varz@google.com> Tested-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_screen_tests.js [modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_terminal.js [modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_screen.js [modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_text_attributes_tests.js [modify] https://crrev.com/1e98c0f67cf69a5732f6864cc6dafd4cdc603989/hterm/js/hterm_text_attributes.js
,
Sep 4 2017
i've implemented this locally (not just a poc). unfortunately, while experimenting with various test inputs, i've run into the limitations of v8BreakIterator and different scripts (e.g. Indic or Thai). the ತ್ಯ grapheme is broken up into ತ್ (U+0CA4 U+0CCD) and ಯ (U+0CAF). the ఫ్ట్వే grapheme is broken up into ఫ్ (U+0C2B U+0C4D) and ట్ (U+0C1F U+0C4D) and వే (U+0C35 U+0C47). the question is whether this is something we should block on. it's not a regression as we already mostly break things up the same way. but once we implement issue 737454, this gets us into a pickle as we split up some based on how they render. i guess we should just leave "improve Indic/etc... support" as a follow up (like issue 717536).
,
Sep 4
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot |
||||
►
Sign in to add a comment |
||||
Comment 1 by vapier@chromium.org
, Apr 6 2017Summary: hterm: unicode: some combining characters don't render correctly (was: hterm: some unicode combining characters don't render correctly)