Improve script resolution for common script characters
Reported by
acquado...@gmail.com,
Oct 29 2017
Issue description
UserAgent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Example URL: http://aghja.net/pecita-latin.xhtml
Steps to reproduce the problem:
1. Open http://aghja.net/pecita-latin.xhtml
2. Look at the Language section
3. 1 7 and Z are displayed in a lang=fr paragraph
What is the expected behavior?
The default glyphs should be substituted by their French variants.
What went wrong?
It is the case for the letter Z, not for the figures. The left side of the attached file shows the correct rendering in Firefox and the right side shows the bug in Chromium.
Does it occur on multiple sites: N/A
Is it a problem with a plugin? N/A
Did this work before? N/A
Does this work in other browsers? Yes
Chrome version: Chromium 61.0.3163.100 (Developer Build) built on Debian 9.1, running on Debian 9.1 (64-bit)
Revision 7accc8730b0f99b5e7c0702ea89d1fa7c17bfe33-
OS Linux
JavaScript V8 6.1.534.41
Flash 22.0.0.209 /usr/lib/pepperflashplugin-nonfree/libpepflashplayer.so
User Agent Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Command Line /usr/lib/chromium/chromium --show-component-extension-options --ignore-gpu-blacklist --no-default-browser-check --disable-pings --media-router=0 --enable-remote-extensions --ppapi-flash-path=/usr/lib/pepperflashplugin-nonfree/libpepflashplayer.so --ppapi-flash-version=22.0.0.209 --flag-switches-begin --flag-switches-end
Executable Path /usr/lib/chromium/chromium
Profile Path /home/philippe/.config/chromium/Profile 1
Channel: stable
OS Version: Debian 9.1
Flash Version: Shockwave Flash 22.0 r0
The corresponding lookup table is GSUB single, feature locl, script latn, language FRA.
,
Oct 30 2017
,
Oct 31 2017
I remember now: the problem already existed in 2016. See Issue 504178.
,
Oct 31 2017
$ ../harfbuzz/util/hb-shape --language="FR" Pecita-Latin.otf "17 Z"
[one.FRA=0+470|seven.FRA=1+470|space=2+470|Zstroke=3+480]
HarfBuzz handles this correctly. So something is wrong with passing the language information for the paragraph to the shaping process. I assume the language information does not arrive at HarfBuzzShaper. Any ideas, Koji?
,
Oct 31 2017
We pass "fr" to hb, and hb returns glyph id 18 for "1" and 24 for "7". Can hb-shape dump the glyph id?
,
Oct 31 2017
Thanks for investigating. hb-shape shows glyph id when using the "--no-glyph-names" parameter:
$ ../harfbuzz/util/hb-shape --language="FR" --no-glyph-names Pecita-Latin.otf "17 Z"
[504=0+470|511=1+470|1=2+470|258=3+480]
$ ../harfbuzz/util/hb-shape --no-glyph-names Pecita-Latin.otf "17 Z"
[18=0+470|24=1+470|1=2+470|59=3+480]
At which level did you see the "fr" being passed? Do we perhaps lose it somewhere and it does not reach HarfBuzzShaper::ShapeRange [1]?
[1] https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/fonts/shaping/HarfBuzzShaper.cpp?q=HarfBuzzShaper&sq=package:chromium&l=192
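To compare the two hb-shape runs above mechanically, here is a minimal Python sketch. It assumes each item has the plain `glyph=cluster+advance` form shown in this thread (hb-shape can also print offsets, which this sketch does not handle); `parse_hb_shape` is a hypothetical helper name, not part of HarfBuzz.

```python
def parse_hb_shape(output):
    """Parse one hb-shape output line '[glyph=cluster+advance|...]' into tuples."""
    items = []
    for item in output.strip().strip("[]").split("|"):
        glyph, rest = item.split("=", 1)
        cluster, advance = rest.split("+", 1)
        items.append((glyph, int(cluster), int(advance)))
    return items


# The two outputs quoted in the comment above.
with_fr = parse_hb_shape("[504=0+470|511=1+470|1=2+470|258=3+480]")
without_lang = parse_hb_shape("[18=0+470|24=1+470|1=2+470|59=3+480]")

# Clusters whose glyph id differs between the two runs.
changed_clusters = [a[1] for a, b in zip(with_fr, without_lang) if a[0] != b[0]]
```

The glyph ids 18 and 24 from the run without a language are the same ids Chromium reported for "1" and "7", which suggests the French language system is not taking effect for the digits there.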
,
Nov 1 2017
1. We call |ToHarfbuzLanguage| for the given "lang" attribute value ("fr" in this case,) which calls |hb_language_from_string| and caches it.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/LayoutLocale.cpp?type=cs&sq=package:chromium&l=107
2. It looks like it stores "fr" as is in |hb_language_impl_t|.
https://cs.chromium.org/chromium/src/third_party/harfbuzz-ng/src/hb-common.cc?type=cs&sq=package:chromium&l=167
3. This |hb_language_impl_t| is passed to |ShapeRange|, which passes it to |hb_buffer_set_language| before we call |hb_shape|.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/fonts/shaping/HarfBuzzShaper.cpp?type=cs&sq=package:chromium&l=192
I thought we might need to convert the lang code to an OT tag, but the |hb_buffer_set_language| documentation
https://behdad.github.io/harfbuzz/harfbuzz-Buffers.html#hb-buffer-set-language
says we should convert ISO 639 to |hb_language_t| using |hb_language_from_string|.
Can you identify what we miss?
,
Nov 1 2017
BTW, we pass HB_SCRIPT_COMMON for |hb_buffer_set_script| because they are digits. Don't know if the script affects 'locl' feature but just in case.
,
Nov 1 2017
Thank you for your efficiency.
Yes, HB_SCRIPT_COMMON is the problem.
The script contains (is hierarchically superior to) the local language.
In the font, the locl feature is assigned to "latin{FRA }", and I worked around the issue by adding "DFLT{FRA }".
I think that I am right to specify only the Latin script, but I am not sure. Maybe it is a user error.
Please, ask Behdad Esfahbod or Khaled Hosny to have the good answer.
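The following Python toy model illustrates why the script matters here. It is not HarfBuzz's actual lookup code: the font layout in `FONT_GSUB` is an assumption reconstructed from this report (locl registered only under latn / FRA), and `features_for` is a hypothetical helper that assumes a Common-script run is looked up under the 'DFLT' script.

```python
# Toy model of OpenType script -> language system -> feature selection.
# Assumed layout: 'locl' exists only under script 'latn', language system 'FRA '.
FONT_GSUB = {
    "latn": {"dflt": set(), "FRA ": {"locl"}},
    "DFLT": {"dflt": set()},
}

# The workaround described above: also register French under 'DFLT'.
PATCHED_GSUB = {
    "latn": {"dflt": set(), "FRA ": {"locl"}},
    "DFLT": {"dflt": set(), "FRA ": {"locl"}},
}


def features_for(gsub, script_tag, lang_tag):
    """Return the feature set of the language system chosen for a run.

    Falls back to the 'DFLT' script when the script is missing, and to the
    default language system when the language is missing under that script.
    """
    script = gsub.get(script_tag) or gsub.get("DFLT", {})
    return script.get(lang_tag, script.get("dflt", set()))


latin_run = features_for(FONT_GSUB, "latn", "FRA ")     # run resolved as Latin
common_run = features_for(FONT_GSUB, "DFLT", "FRA ")    # digits shaped as Common
patched_run = features_for(PATCHED_GSUB, "DFLT", "FRA ")
```

In this model the language is only consulted inside the script that was selected, so a run shaped as Common never reaches the French language system unless the font repeats it under 'DFLT'.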
,
Nov 1 2017
Behdad, do you have answer to #9?
,
Nov 1 2017
The script itemization should take the whole paragraph into account (just like bidi itemization), and if done correctly this should have resolved the numbers as Latin script given that the paragraph contains other Latin characters.
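A minimal Python sketch of that idea follows. It assumes a very crude script guess based on Unicode character names (real implementations use the Unicode Script property) and a simple rule where Common characters inherit from the nearest preceding real script, or from the first real script in the text when nothing precedes them. `raw_script` and `resolve_scripts` are hypothetical names.

```python
import unicodedata


def raw_script(ch):
    """Rough script guess from the character name; illustrative only."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CJK"):
        return "Han"
    return "Common"


def resolve_scripts(text):
    """Resolve Common characters using the whole text as context."""
    scripts = [raw_script(ch) for ch in text]
    # Leading Common characters borrow the first real script that follows.
    current = next((s for s in scripts if s != "Common"), "Common")
    resolved = []
    for s in scripts:
        if s != "Common":
            current = s
        resolved.append(current)
    return resolved
```

With this rule, `resolve_scripts("17 Z")` assigns Latin to the digits because a Latin letter appears later in the same text, while a text made only of digits and punctuation stays Common.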
,
Nov 1 2017
Khaled, thank you once again. It's a bit of what I sensed ... Clearly expressed! The granularity of the language is at the level of the sentence, not the glyph, when a glyph is shared across scripts. For my font, I will go back and remove the French language part from the default script, which does not make sense there (but I admit that this architecture is confusing). Koji, do you agree?
,
Nov 1 2017
What if it's:
<p lang="fr">17</p>
or what if:
<p lang="fr">17!!!</p>
or even:
<p lang="fr">Some French 17 A-FEW-CHINESE-CHARACTERS and French</p>
?
I agree the heuristic script detection algorithms in each browser can be improved further, but a heuristic is a heuristic. I'm Japanese, and I know detecting Chinese and Japanese would never be 100% accurate. I think it's a question that, do you want your fonts and typography to rely on heuristic, or to make it deterministic. As long as digits are defined as COMMON in Unicode, it can never be perfect, just as the bidi algorithm is not perfect and sometimes requires additional markup from the author.
We can keep this bug as a feature request to improve the heuristic algorithm. I think we had a bug a few years ago but I can't find it.
,
Nov 2 2017
> I think it's a question that, do you want your fonts and typography to rely on heuristic, or to make it deterministic. Without any hesitation I prefer a heuristic script detection algorithm!!! Why ? Exactly to be able to make versatile character fonts, consequently with many features! Adding a language in a font is a nightmare! You want to do it for three characters and you have to plug it on to all the lookups tables. If in addition it is necessary to multiply by the number of scripts it becomes grotesque. If lang = "ja" can apply to different scripts, lang = "en" always implies the latin script. The problem is thus in the articulation of the ISO 639 standard to unicode, articulation that my knowledge is not standardized and nothing prevents common sense! Can you find the algorithm used by Firefox? That would be a good reference, right?
,
Nov 2 2017
Changed the summary as requested. A heuristic for Latin and digits is easy; doing a good job for all scripts is hard. I'm not a fan of heuristics, for that reason and for their non-interoperable nature, but I'm not opposed if contributors want to add one.
,
Nov 2 2017
I'm a bit surprised that even when we write something like abc17def the script does not get resolved to Latin, at least I could not get the right digits to show up with this experiment. We should take a look at what ScriptRunIterator does with that.
,
Nov 2 2017
,
Nov 2 2017
> I'm a bit surprised that even when we write something like abc17def the script does not get resolved to Latin Attached file
,
Nov 2 2017
FYI: https://www.microsoft.com/typography/otspec/chapter2.htm#slTbl_sRec
> The 'DFLT' script would still be used if the text contained only the neutral characters, however.
,
Nov 2 2017
(To #19) This link is relevant (thank you for quoting it) for the structure of the font. I do not think that it defines what assigns a script to a text. For example: "A script table with the script tag 'DFLT' (default) may be used in a font to define features that are not script-specific" or "A Script table identifies each language system that defines how to use the glyphs in a script for a particular language. It also references a default language system that defines how to use the script's glyphs in the absence of language-specific knowledge.". Nothing there addresses the association of a language with a particular script, and nothing forbids considering that specifying the lang attribute on an HTML element also implicitly forces a corresponding script, and that this applies to all the text. I think that this is the question.
,
Nov 7 2017
On a very general level, splitting a piece of text into script runs may work very well (e.g. splitting Arabic and Devanagari, because they need completely different processing), or not really very well (because it may be overkill when just putting glyphs side-by-side, and some character interactions across scripts (e.g. kerning) may be missed). On a more specific level, if different rendering stacks do the splitting into script runs differently (e.g. around default script characters), and fonts are built with one or the other assumption, then things won't work very well. There may be some spec pieces missing. In the case at hand, I'd try to include French shapes of the numbers 1 and 7 labeled as default script. Then unless there is some logic somewhere that throws out the information that this is French, it will find the right shapes independent of whether the numbers are included in a Latin run or in a default script run.
,
Nov 7 2017
I understand now the dilemma of Koji. There is an urgent need for a "script" attribute in HTML for languages where it is not clear which script they belong to (although, except for Chromium, text editors know how to cope without it). However, for "my" "French" problem I think that it is not a feature request but a regression (see comment #3).
,
Dec 11 2017
I just found a big argument! From http://www.adobe.com/devnet/opentype/afdko/topic_feature_file_syntax.html "The only permitted language tag for the 'DFLT' script is 'dflt'." If the default script (the numbers, among others) cannot have a language, then its characters must inherit the script of the current language! Please reopen this issue as a bug.
,
Dec 11 2017
Thank you for the investigation. My personal opinion is that you should raise this topic in the OpenType spec discussion. Its langsys system has some limitations, and I think your use case is valid. I would like this kind of glyph choice to be deterministic and interoperable across browsers. I also personally don't think "17" and "17%" and "17z" showing different glyphs is a good experience.
> Please reopen this issue as a bug.
This issue is still open. It's just that no one has volunteered to work on it actively yet.
,
Feb 8 2018
We should implement script detection heuristic that Doug Felt implemented in Google internally. I thought drott already did that, no?
,
Feb 9 2018
Yes, that's in, compare issue 526095 and https://codereview.chromium.org/1323513006
,
Feb 9 2018
Then why are we not resolving Common script to Latin?
,
Feb 12 2018
Adding
TEST_F(ScriptRunIteratorTest, LatinNumbers) {
CHECK_SCRIPT_RUNS({{"abc17def", USCRIPT_LATIN}});
}
to ScriptRunIteratorTest passes, and abc17def renders correctly. Looking at the screenshots, my comment #16 seems a little misguided.
Script segmentation happens after the word caching mechanism splits the text by spaces. So for numbers that stand on their own, surrounded by spaces, the script resolution lacks context.
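The effect of splitting before segmenting can be sketched in a few lines of Python. This is a simplified stand-in, not Blink's word cache or ScriptRunIterator: `word_script` is a hypothetical helper that calls a word Latin if it contains any Latin letter (judged by Unicode character name) and Common otherwise.

```python
import unicodedata


def word_script(word):
    """Resolve one word in isolation: Latin if it has a Latin letter, else Common."""
    has_latin = any(unicodedata.name(ch, "").startswith("LATIN") for ch in word)
    return "Latin" if has_latin else "Common"


def scripts_per_word(text):
    """Split on spaces first, then resolve each word without outside context."""
    return [(word, word_script(word)) for word in text.split(" ") if word]


joined = scripts_per_word("abc17def")      # digits share a word with letters
separate = scripts_per_word("abc 17 def")  # digits stand alone between spaces
```

In the first case the digits sit in a word that also contains Latin letters; in the second, "17" is resolved with no Latin neighbours at all, which matches the behaviour described above.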
,
Dec 12
#23:
> From http://www.adobe.com/devnet/opentype/afdko/topic_feature_file_syntax.html
> "The only permitted language tag for the 'DFLT' script is 'dflt'."
I learned from a font expert at Adobe that this sentence was removed from the most recent [AFDKO doc], in [afdko github issue 438]. Adobe seems to be using the technique for the `locl` feature for punctuation; see "Language-sensitive Features" in the [Ten Mincho blog].
To clarify my point, I do agree that improving script logic is a good thing, and with our new engine under development, we will be able to handle not only "abc17" but also "abc 17". But it's unlikely that we can solve "17." or "17!" without fonts having the DFLT script table.
[AFDKO doc]: https://github.com/adobe-type-tools/afdko/blob/develop/docs/OpenTypeFeatureFileSpecification.html
[afdko github issue 438]: https://github.com/adobe-type-tools/afdko/issues/438
[Ten Mincho blog]: https://blogs.adobe.com/CCJKType/2017/11/ten-mincho.html
Comment 1 by ajha@chromium.org, Oct 30 2017
Components: Blink>Fonts
Labels: M-64 Needs-Triage-M61 OS-Mac OS-Windows
Status: Untriaged (was: Unconfirmed)