
Issue 792713 link

Starred by 2 users

Issue metadata

Status: Available
Merged: issue 675477
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 3
Type: Bug




Entering Tamil using Google Handwriting Input produces odd results

Project Member Reported by rlanday@chromium.org, Dec 6 2017

Issue description

Chrome Version: 65.0.3287.0 (Developer Build) (32-bit)
OS: Android

What steps will reproduce the problem?
(1) Set up an Android device with Google Handwriting Input configured with the Tamil language.
(2) Go to editpad.org and type a space, then use handwriting input to enter " ேவ" (\u0bc7\u0bb5) (you don't need to draw the dotted circle).

What is the expected result?

The resulting text should be "  ேவ" and the text except for the first space should have a composition underline under it.

What happens instead?

The resulting text is "  ே ேவ" and the composition underline isn't correct.

The bug is because InputMethodController::SetComposition() works by selecting the composition, replacing the text and requesting that the newly-inserted text be selected, and then using the resulting selection to set the composition range. There are some places where we canonicalize the selection (by creating a VisibleSelection), which normalizes the selection to grapheme cluster boundaries, which the IME isn't expecting.
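The effect of normalizing to grapheme cluster boundaries can be modeled outside Blink. The following is only a rough sketch, not the actual VisibleSelection code: `snapToGraphemeBoundaries` is a hypothetical helper, and widening the range outward is an assumption about how the canonicalization behaves. It uses the standard `Intl.Segmenter` API to find cluster boundaries.

```javascript
// Model only: this is NOT Blink's VisibleSelection implementation.
const boundarySegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Returns every UTF-16 offset where a grapheme cluster starts, plus text.length.
function graphemeBoundaries(text) {
  const offsets = [];
  for (const { index } of boundarySegmenter.segment(text)) offsets.push(index);
  offsets.push(text.length);
  return offsets;
}

// Hypothetical helper: widen [start, end) outward to the nearest cluster boundaries.
function snapToGraphemeBoundaries(text, start, end) {
  const bounds = graphemeBoundaries(text);
  const snappedStart = Math.max(...bounds.filter((b) => b <= start));
  const snappedEnd = Math.min(...bounds.filter((b) => b >= end));
  return [snappedStart, snappedEnd];
}

// A committed space followed by the vowel sign the IME is composing.
const editedText = " \u0BC7";
const requestedRange = [1, 2]; // the range the IME believes it is composing
const snappedRange = snapToGraphemeBoundaries(editedText, ...requestedRange);
console.log("requested", requestedRange, "snapped", snappedRange);
```

In this model the space and U+0BC7 form a single cluster, so the range the IME asked for and the range the editor ends up with no longer agree, which is consistent with the duplicated text and the wrong underline described above.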

Originally filed as b/70016473.
 

Comment 2 by yosin@chromium.org, Dec 8 2017

Does this happen on a plain TEXTAREA? It seems the TEXTAREA in editpad.org has event handlers.
I have not yet tested on a plain <textarea>, but I’m pretty sure the issue is unrelated to event handlers (I will confirm tomorrow).

Comment 4 by yosin@chromium.org, Dec 8 2017

Do we need a leading space to reproduce this issue? It seems the leading space is also duplicated.

>#c3, Yes, it seems the event handlers aren't related to this issue.

<textarea cols="100" rows="30" name="text" id="text" style="background-color: rgb(238, 232, 170); border: 1px dotted rgb(0, 0, 0); width: 1974px; height: 441px; margin: 0px;" onclick="if(document.textform.text.value==default_text)document.textform.text.value=''" 
onkeydown="return insertTab(event,this);" 
onkeyup="return insertTab(event,this);" 
onkeypress="return insertTab(event,this);">


function insertTab(event, obj) {
    var tabKeyCode = 9;
    if (event.which)
        var keycode = event.which;
    else
        var keycode = event.keyCode;
    if (keycode == tabKeyCode) {
        if (event.type == "keydown") {
            if (obj.setSelectionRange) {
                var s = obj.selectionStart;
                var e = obj.selectionEnd;
                obj.value = obj.value.substring(0, s) + "\t" + obj.value.substr(e);
                obj.setSelectionRange(s + 1, s + 1);
                obj.focus();
            } else if (obj.createTextRange) {
                document.selection.createRange().text = "\t"
                obj.onblur = function() {
                    this.focus();
                    this.onblur = null;
                }
                ;
            } else {}
        }
        if (event.returnValue)
            event.returnValue = false;
        if (event.preventDefault)
            event.preventDefault();
        return false;
    }
    return true;
}

Yes, you need a leading space. That causes the first Tamil character to try to join with the space. That's the first part of repro step (2).

Comment 6 by yosin@chromium.org, Dec 8 2017

Could you write a test case for InsertIncrementalTextCommand for this?
I suspect InsertIncrementalTextCommand is the culprit.

e.g.
Selection().SetSelection(SetSelectionTextToBody(
    "<div contenteditable> \u0BC7|</div>"));  // Note: &#xXXXX; doesn't work for SetSelectionTextToBody().
auto* const command = InsertIncrementalTextCommand::Create(GetDocument(), "\u0BC7\u0BB5");
command->Apply();
EXPECT_EQ("<div contenteditable> \u0BC7\u0BB5|</div>", GetSelectionTextFromBody());
I think the bug occurs even if we don’t do an incremental insertion. I have a test for InputMethodController in the CL I’m working on:
https://chromium-review.googlesource.com/c/chromium/src/+/812524

Maybe I can also write one for InsertTextCommand that doesn’t go through SetComposition(), if that would be helpful.

Comment 8 by yosin@chromium.org, Dec 8 2017

TL;DR: We should produce the sequence U+0BB5 U+0BC7 instead of U+0BC7 U+0BB5.

I think this should be handled in Clank rather than in Blink. It seems to be specific to handwriting.

In this case U+0BC7 should come after U+0BB5, i.e. U+0BB5 U+0BC7, just as が = か (U+304B) + the combining dakuten (U+3099).

According to [1], the glyph for U+0BC7 has a dotted circle on its right side; this means the
code point preceding U+0BC7 is rendered where the dotted circle is.

In handwriting, it is natural to write U+0BC7 and then U+0BB5, following the visual order. A keyboard
IME presumably sends the sequence U+0BB5 U+0BC7 instead.

[1] https://www.compart.com/en/unicode/U+0BC7
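To make the ordering question concrete, here is a small sketch comparing the two sequences. The helper names `codePoints` and `clusterCount` are made up for illustration; `Intl.Segmenter` is the standard API for grapheme segmentation, and the variable names "logical" and "visual" are labels chosen here, not terms from the IME.

```javascript
// Format each code point of a string as U+XXXX.
function codePoints(str) {
  return [...str].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Count the extended grapheme clusters in a string.
function clusterCount(str) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(str)].length;
}

const logicalOrder = "\u0BB5\u0BC7"; // TAMIL LETTER VA, then TAMIL VOWEL SIGN EE
const visualOrder = "\u0BC7\u0BB5"; // vowel sign first, as it is drawn left to right

console.log(codePoints(logicalOrder), "clusters:", clusterCount(logicalOrder));
console.log(codePoints(visualOrder), "clusters:", clusterCount(visualOrder));
```

In the consonant-first order the vowel sign attaches to the consonant and the pair forms one cluster; in the drawn order the vowel sign has no consonant before it, so it either stands alone or attaches to whatever precedes it in the text.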

Comment 9 by yosin@chromium.org, Dec 8 2017

Note: There are other combining code points for which the base code point (the code point before the combining code point) is rendered on the right side, e.g. U+0BC6, U+0BC7, U+0BC8, and others for which the base code point is rendered in the middle of the combining code point: U+0BCA, U+0BCB, U+0BCC. Unicode may have more
such cases.

[1] https://en.wikipedia.org/wiki/Tamil_script
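As a side illustration of the two-part vowel signs mentioned above, the canonical decompositions show where the "left" and "right" pieces come from. This is just a sketch using the standard `String.prototype.normalize`; the helper name `toCodePoints` is made up here.

```javascript
// Format each code point of a string as U+XXXX.
function toCodePoints(str) {
  return [...str].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Tamil vowel signs that surround the base consonant.
const twoPartVowelSigns = ["\u0BCA", "\u0BCB", "\u0BCC"];

for (const sign of twoPartVowelSigns) {
  // NFD splits each sign into a left-side part and a right-side part.
  console.log(toCodePoints(sign), "->", toCodePoints(sign.normalize("NFD")));
}
```

Each of these decomposes into a left-side sign (U+0BC6 or U+0BC7) followed by a right-side mark, which is why the base consonant ends up visually in the middle.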


I don’t speak Tamil, so I can’t speak to the particulars of how it’s typically entered via keyboard or handwriting input; I’m just going off of what the bug reporter (who I believe does speak Tamil) described the bug as being. He says the behavior is different when there’s a leading space, and it shouldn’t be. The CL I have up fixes the behavior so handwriting input works the same way in this case as it does in an Android EditText widget (the CL needs revision since it’s currently breaking some test cases, but it does make Tamil handwriting input work better).

I don’t think we’re supposed to re-order the characters, since that’s not what Android EditText does. The reason for the reverse ordering is not immediately apparent to me since I’m not familiar with Tamil, but the bug reporter didn’t say there’s anything wrong with it.
See https://jsfiddle.net/8cuweanw/2/ for example.

The sequence U+00A0 U+0BC7 U+0BB5 looks similar to the sequence U+0BB5 U+0BC7, but the former
has an extra space in the middle.
Attachment: cr792713.png (19.2 KB)
Perhaps what happened was that the original reporter meant to enter the sequence in 1. but entered the characters in the wrong order and would’ve ended up with the unusual sequence in 2. had she been typing into an Android EditText widget. I suspect that the bug does not occur if you actually input the sequence in 1.

If this is the case and the bug only comes up when incorrectly using the handwriting input, maybe it's not high-pri. I think it is still a bug though since:

- It’s a behavioral difference from EditText
- The original behavior of not showing the first character input was confusing people (such as the bug reporter)
- The current behavior on trunk (duplicating the first character) looks obviously buggy
I'm not sure about impact of this bug. Please estimate impact.

Labels: -Pri-2 Pri-3
I checked how many Tamil users we have on Chrome for Android; it's a fairly small number. The bug does seem to only occur when using handwriting (I tried mashing on Gboard set to Tamil and was unable to trigger buggy behavior). Further, my understanding is that the bug only occurs when entering characters in the opposite order from what the handwriting IME is expecting.

Taking all of that into account, this seems fairly low-pri. Fixing this currently seems more work than it's worth.


Note that Gboard is integrating handwriting input, so handwriting input bugs may become more important in the future:
https://9to5google.com/2017/11/27/gboard-6-8-beta-handwriting-keyboard/

Comment 15 by talo@chromium.org, Dec 11 2017

Ah, yes. To give a bit more context, this is a high priority for us for our India user base, particularly as we look forward to where we expect our growth to be.
Labels: -Pri-3 Pri-2
Apparently this also affects Malayalam and possibly some other similar languages/scripts as well. I’m going to spend some more time on this to try to fix the issues with the current CL.

Comment 17 by kojii@chromium.org, Dec 12 2017

Cc: kojii@chromium.org
Can the handwriting IME emit ZWNJ (I haven't tested whether it works), a space character, or use some other method that prevents forming a grapheme cluster with the previously committed string? I think fixing this in Chrome is one possible solution, but since the IME relies on behavior that is not clearly defined, I guess it will fail in other applications too, such as Firefox.

Does Android's EditText handle this case correctly?
The bug does not occur in the Android EditText widget. From my perspective, whenever we have these discrepancies between Chrome/WebView and EditText (which is sort of the “reference implementation” of the Android input APIs), the correct thing to do is to fix Chrome to match EditText. The alternatives are:

1. Ask IME authors to support both the EditText behavior and the Chrome behavior: this doesn’t seem right, since it would be a lot of work for IME authors (they may not even realize they need to test in Chrome).
2. Get the Android team to change EditText to match Chrome: this also doesn’t seem right, because it takes much, much longer for people to start using updated versions of Android vs. updated Chrome/WebView. This also would usually require IME authors to update their software as well, since they usually test against EditText.

So in my mind, the issue is not whether or not the behavior is clearly defined (the correct behavior, for plain text editing cases at least, is always defined by what EditText does); only whether or not the behavioral discrepancy is worth fixing. I initially thought maybe it wasn’t, but I realized that this is potentially important from an emerging markets perspective. I think that even if it only occurs when writing the characters out of order, this bug seems to make it difficult for people to realize they’re writing the characters in the wrong order.

I can test Firefox tomorrow, but its market share is so small on Android (0.65% according to this link):
https://www.netmarketshare.com/browser-market-share.aspx?options=%7B%22filter%22%3A%7B%22%24and%22%3A%5B%7B%22platform%22%3A%7B%22%24in%22%3A%5B%22Android%22%5D%7D%7D%5D%7D%2C%22dateLabel%22%3A%22Trend%22%2C%22attributes%22%3A%22share%22%2C%22group%22%3A%22browser%22%2C%22sort%22%3A%7B%22share%22%3A-1%7D%2C%22id%22%3A%22browsersDesktop%22%2C%22dateInterval%22%3A%22Monthly%22%2C%22dateStart%22%3A%222016-12%22%2C%22dateEnd%22%3A%222017-11%22%2C%22segments%22%3A%22-1000%22%7D

I don’t think it really matters what Firefox does since its behavior is hardly affecting anyone.

Comment 19 by kojii@chromium.org, Dec 12 2017

Ah, sorry, it looks like I didn't express my intention clearly.

What I meant was: change both Chrome and the IME. For Chrome, I think the expected behavior in this issue is correct and worth fixing.

For the IME, this case poses a challenge to all native text editor developers. I suppose there are more native text editors than the EditText widget, Blink, and Gecko, and it would be better not to challenge them all. But it's up to the Android/IME team to make the call; I'm fine either way.
I just tested Firefox; Firefox *does* handle this case correctly (exactly the same way as the Android TextView). It must not be normalizing to grapheme cluster boundaries the way we are.

It looks like a recent fix I landed (https://chromium-review.googlesource.com/c/chromium/src/+/801979) changed the behavior here again. Now the behavior for entering the characters "backwards" is improved (it inserts the correct text, but still only puts the composition underline under the second character instead of both of them), but entering the characters in the correct order is now broken.

So now I think we definitely need to fix this issue in a stable way and add some test cases, since otherwise the behavior's going to randomly get better and worse as we change editing code around.
Labels: -Pri-2 Pri-1
To clarify: I think the current behavior cannot go out in M65, as it would be a bad regression for these languages. We either need to come up with a permanent fix for the issue, or revert my CL I linked.

Comment 22 by kojii@chromium.org, Dec 14 2017

yosin@ is OOO until Jan, if urgent, please ask xiaochenghu@ to review. If you can wait until yosin@ is back, that's good.

> it would be a bad regression for these languages

Just asking for clarification; I agree that a typing regression is critical for any language. This issue is only for handwriting, not typing, correct?

Comment 23 by rlan...@gmail.com, Dec 14 2017

I forgot to test the current behavior on Tamil-language Gboard; I can check this tomorrow.
This particular sequence of two characters seems to still work fine on Gboard. So this may be a handwriting-only bug (although, I'm not familiar with Tamil, and I don't know for sure that there's not some other sequence of characters that triggers the bug on Gboard).

I still think this would be good to fix from an emerging markets perspective.
I'm not sure that the behavior for the "enter characters in the correct order" case actually got worse in master; rather, it seems the IME is inconsistent for some reason about whether or not two characters should be combined or if a space should be inserted between them (this is reproducible in an EditText widget).
Mergedinto: 675477
Status: Duplicate (was: Started)
Turns out this bug has also been reported for Linux Tamil IMEs in crbug.com/675477. My tentative fix for this bug fixes that one as well. Merging these bugs.
Labels: -Pri-1 Pri-3
Status: Assigned (was: Duplicate)
Unmerging because I came up with a fix for the other issue (which seems way worse than this one) that doesn't fix this one.

The issue here comes up because we're hitting rule GB9a ("Do not break before SpacingMarks") in the Unicode grapheme cluster boundary algorithm:
http://unicode.org/reports/tr29/#GB9a

The handwriting IME seems to be invoking known pathological behavior by sending these spacing mark characters (e.g. U+0BC7: ே) without a preceding no-break space for them to attach to. See:

http://www.unicode.org/versions/Unicode10.0.0/ch07.pdf
(Section 7.9 Combining Marks, subheading "Marks as Spacing Characters")
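The GB9a behavior, and the effect of a preceding no-break space, can be checked with a short sketch. This uses the standard `Intl.Segmenter` API rather than Blink's own segmentation code, so it only illustrates the Unicode rule; the helper names `clusters` and `describe` are made up here.

```javascript
// Split a string into extended grapheme clusters.
function clusters(str) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(str)].map((s) => s.segment);
}

// Show each cluster as a list of U+XXXX code points so whitespace stays visible.
function describe(str) {
  return clusters(str).map((cluster) =>
    [...cluster]
      .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
      .join(" ")
  );
}

// What the handwriting IME sends after the user's committed space.
const withoutNbsp = " \u0BC7\u0BB5";
// The same input with a no-break space for the vowel sign to attach to.
const withNbsp = " \u00A0\u0BC7\u0BB5";

console.log(describe(withoutNbsp));
console.log(describe(withNbsp));
```

Without the no-break space, the vowel sign joins the user's own space into one cluster; with it, the user's space stays a separate cluster and the vowel sign attaches to the no-break space instead.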

I left a comment on the Google-internal bug (b/70016473) to see if we can get the handwriting IME updated to stop doing this. If we *really* want to match the native Android EditText widget behavior, we'll probably have to modify our editing code so we support opening a composition that doesn't start at a grapheme cluster boundary, which is somewhat involved.

One approach would be to modify VisiblePosition normalization to add a mode that doesn't normalize to grapheme cluster boundaries. Then IMEs could replace arbitrary substrings of Unicode code points.

The other approach would be to modify InsertTextCommand to work with non-normalized positions, which itself is relatively straightforward (e.g. see proof of concept CL at https://chromium-review.googlesource.com/c/chromium/src/+/823613); but then we also have to update DeleteSelectionCommand (since it's called by InsertTextCommand under certain circumstances), and that looks like it would be a huge mess to update.


I'm starting to think that maybe allowing IMEs to edit on a sub-grapheme cluster boundary level so we can try to copy undocumented platform-specific Unicode behavior is just a bad idea, especially once we get into rich text editing cases. I left a comment on the Google-internal bug (b/70016473) asking if we can fix the handwriting IME to not insert these characters without a preceding no-break space.
Cc: rlanday@chromium.org
Owner: ----
Status: Available (was: Assigned)
