
Issue 725389

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug




PDF Hebrew text is rendered LTR instead of RTL

Reported by kidronar...@gmail.com, May 23 2017

Issue description

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36

Steps to reproduce the problem:
1. Open the attached file in Chrome.

What is the expected behavior?
To behave as it does in Acrobat - run JS and display text correctly.

What went wrong?
I got this file by mail. When I click it in Chrome, it opens, but JS isn't running and the Form Fields aren't shown. When I save the file to the local drive and open it with Chrome, JS runs and the Form Fields are shown, but the encoding is wrong. The values of the Form Fields, which are UTF-16BE encoded, are shown as if they were iso-8859-1 or windows-1250 encoded. In Acrobat it works fine.
I made JS show the values of the Form Fields when the document is opened. That works fine: I can see the correct values in these alert boxes, but not in the document itself.

Did this work before? No 

Chrome version: 58.0.3029.110  Channel: stable
OS Version: 10.0
Flash Version:
 
black_enc2.pdf
1.1 MB Download
Labels: Needs-Triage-M58

Comment 2 by rtoy@chromium.org, May 24 2017

Cc: rtoy@chromium.org
Components: -Blink Internals>Plugins>PDF
Labels: -Needs-Triage-M58 Needs-Feedback
I tried with Acrobat X and it displays as much gibberish as Chrome's PDF Viewer. See attached screenshot. You say Acrobat works fine. What does that mean? Can you attach a screenshot to show what it looks like for you?
acrobat_vs_chrome.png
86.7 KB View Download
Note that in your screenshots, thestig, the Acrobat presentation and the Chrome presentation aren't identical. Look at the two uppermost fields: Acrobat prints the text correctly, while Chrome reverses it, printing it left-to-right instead of right-to-left.

I'm attaching two screenshots. The first one is what I see when I get the file by mail and open it directly in Chrome. No fields are shown, and no JavaScript seems to be running. The second is what I see after I save the file to the local drive and open it with Chrome. It's different from what your Chrome seems to produce. The upper two fields, instead of showing the correct Unicode characters that your Chrome does (even in reverse order), simply show the wrong chars, just like the other fields. For example, instead of the letter Bet (U+05D1, or %D7%91 in UTF-8) it shows the char "accented a", which is U+00E1, because 0xE1 is really the letter Bet in windows-1255 or iso-8859-8.
screenshot_before_download.png
193 KB View Download
Screenshot_after_download.png
167 KB View Download
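
The byte values discussed in this comment can be checked with a short Python sketch. It is only an illustration of how one byte, 0xE1, reads under different single-byte encodings, and of how the letter Bet is encoded in UTF-8 and UTF-16BE. The variable names are made up for the example, and nothing here reflects how PDFium itself decodes field values.

```python
# One byte, three interpretations (illustrative only; not PDFium code).
raw = bytes([0xE1])

as_windows_1255 = raw.decode("cp1255")    # Hebrew code page
as_iso_8859_8 = raw.decode("iso8859_8")   # ISO Hebrew
as_latin1 = raw.decode("latin-1")         # ISO-8859-1

bet = "\u05d1"  # HEBREW LETTER BET

print(hex(ord(as_windows_1255)))  # code point under windows-1255
print(hex(ord(as_iso_8859_8)))    # code point under iso-8859-8
print(hex(ord(as_latin1)))        # code point under iso-8859-1

# The same letter in the Unicode encodings mentioned in this report.
print(bet.encode("utf-8").hex())
print(bet.encode("utf-16-be").hex())
```

If the bytes of a field are stored in a Hebrew code page but interpreted as Latin-1, every Hebrew letter turns into an accented Latin character, which matches the second screenshot.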
Project Member

Comment 5 by sheriffbot@chromium.org, May 28 2017

Cc: thestig@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "thestig@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Is someone working on this bug report?
Status: Available (was: Unconfirmed)
Not yet. Too many bug reports and this one fell through the cracks. Thanks for the reply with more details and the reminder.

So there are actually two issues here:
1) In my screenshot, the Hebrew text is rendered LTR instead of RTL, like in Acrobat. For the record, I wasn't referring to the Hebrew text as gibberish. Though being unfamiliar with the alphabet did make it harder to spot the difference.
2) On your computer, all the text is encoded wrong, whereas for me, some of the Hebrew text actually rendered correctly.
Exactly.

I'll just note that the internal encoding of each field is really different (that is, it's the same Hebrew letters but in different encoding), so I didn't expect all the fields to show readable Hebrew (although that could be a nice feature). The problem was that *none* of them showed the correct characters (for me).
Cc: npm@chromium.org

Comment 10 by npm@chromium.org, Jun 20 2017

* The "get the file by mail, and open it directly in Chrome" description is not accurate. Are you previewing it in Gmail? In that case, what you see is just an image of the PDF, so JS won't run.
* Can you update what you see once your Chrome version is 59? The garbled vs Hebrew could have been solved recently because I was able to reproduce what you see on an older Chrome version.
* The RTL rendering of Hebrew is a problem. Here the overall text layout is LTR even in Adobe (shown by the ending "..."), but they still reorder the Hebrew characters. But then the question is what happens if we have something like "<Hebrew>...<Hebrew>", what's the right order for that if the overall text layout is LTR?
* Many of my users aren't computer experts. If they get the file by mail (e.g. Gmail) they tend to simply click it, knowing nothing about previewing, and then they complain that it doesn't work. So I still need to solve this problem. Isn't there at least an option to display an alert telling the user that it's only a preview, or to cancel this auto-previewing?

* My Chrome claims to be version 59.0.3071.109, and the Hebrew still isn't shown. Only the gibberish. Maybe it's something in the settings? Maybe it's set to the wrong encoding or something like that?

* I'm not an expert in RTL-LTR issues, but I know that it was a major issue about 20 years ago. Back then, some Web pages written in RTL languages tended to be displayed with each word's letters in reverse order, while others were displayed with the words themselves in reverse order (and each word shown correctly). This had to do with Visual vs. Logical encodings. I don't know how, but this problem seems to have been completely solved. For many years now, Web pages have been displayed correctly.

I think some editors guess the text direction from the first letter of the line / paragraph (though "first" here may be a little ambiguous). For example, these two lines are identical, except for the leading English letter in the second line:
ראשון first שני second
gראשון first שני second
I don't know how these will be viewed in your browser, but when I post it, they look identical except this first 'g'. However, typing in Gmail's 'Compose Message' box, the order of the words in each line is different (and so is the alignment).
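
The "guess by the first letter" behaviour described above can be sketched in a few lines of Python. This is a simplified version of the first-strong-character heuristic, assuming the standard `unicodedata` bidirectional classes; the function name `first_strong_direction` is made up for the example, and real editors may apply additional rules.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess a paragraph's base direction from its first strongly
    directional character (simplified first-strong heuristic)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":           # strong left-to-right (e.g. Latin)
            return "ltr"
        if bidi_class in ("R", "AL"):   # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return default                      # no strong character found


line1 = "ראשון first שני second"
line2 = "g" + line1

print(first_strong_direction(line1))
print(first_strong_direction(line2))
```

Under this heuristic the two example lines get different base directions, which would explain why the word order and alignment differ in Gmail's compose box.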

Comment 12 by npm@chromium.org, Jun 21 2017

* Ok, let's not discuss the Gmail problem here as this is the chromium bug tracker.

* The encoding probably isn't the problem. This is an embedded font so my guess is that Chrome running on your machine is somehow failing to load it, and when using a substitute it will produce those characters. But I'm not sure why that would happen. We're supposed to ship our Freetype version with Chrome on Windows. Of course, unless we figure out how to reproduce the problem we can only guess.

* We do have code for RTL vs LTR, although it is not perfect. PDFs are very different from websites, but it is true that we could learn some tips from how these issues are handled in the browser.
* OK. I reported the bug to Gmail.

* I've downloaded and re-installed Chrome. It still claims its version is 59.0.3071.109 (64 bit, official version) but no significant change occurs. I'm running Hebrew Windows 10 Pro, 64 bit, if it helps.

Comment 14 by npm@chromium.org, Jun 22 2017

I was able to reproduce using pdfium_test on PDFium's M59 branch but looks fixed on ToT. If it also reproduces like this on Linux I can figure out what fixed this.

Comment 15 by npm@chromium.org, Jun 22 2017

The garbledness was fixed by https://pdfium-review.googlesource.com/c/5610/ which made it to M60. Spacing between characters fix will make it to M61. So the remaining issue here is the LTR vs RTL.
I noticed the alerts that pop up on load handle RTL correctly. It's probably the code in the cpwl_edit* files that doesn't understand RTL.
I don't understand. Is it fixed? Why doesn't it work for me then? Should I update Chromium somehow?
Chrome 59 has no fixes.
Chrome 60 fixes the garbled text.
Chrome 61 fixes the spacing.
We still need to fix the text direction.

See https://en.wikipedia.org/wiki/Google_Chrome_version_history for the version to channel mapping. (Kudos to whoever spends the time to keep it up to date.)
Thanks for the explanation.

If I understand correctly,  https://pdfium-review.googlesource.com/c/5610/ is about the ToUnicode map being ignored. However, this map doesn't exist in my file.
By "my file" do you mean your local copy of Chromium source code that contains third_party/pdfium/core/fpdfapi/font/cpdf_truetypefont.cpp? If so, without knowing how you got your source code, we don't know why either. It's *your* file. :)
By "my file" I mean the pdf file I uploaded when reporting this bug, black_enc2.pdf.

Comment 22 by npm@chromium.org, Jun 27 2017

It will also avoid using FXFT_Get_Char_Index(m_Font.GetFace(), charcode) to get the glyph index even for the case when ToUnicode does not exist. I verified that PDFium TOT renders Hebrew and TOT + revert of that doesn't. Did you try opening it on Chrome Beta to verify if it was fixed? Probably easier.
I now have 60.0.3112.50 (Beta version). The Hebrew chars show correctly in the Text Fields, but the order is reversed (each word's letters are reversed, as well as the order of the words). Space between chars is incorrect (the chars collide with each other). I guess that's what you expected.
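
The reversal described above (each word's letters reversed, and the word order reversed too) is what one would expect if the glyphs are laid out left-to-right in logical (storage) order. The Python sketch below illustrates this for a pure-Hebrew string only. It is an assumption about the symptom, not a description of PDFium's layout code, and all names are made up for the example.

```python
# Logical (storage) order of two Hebrew words, written with escapes so the
# order is unambiguous regardless of how an editor displays this file.
logical = "\u05e8\u05d0\u05e9\u05d5\u05df \u05e9\u05e0\u05d9"

# Correct rendering of a pure RTL run: the first logical character is the
# rightmost glyph, so the left-to-right glyph sequence is the reverse.
correct_left_to_right = logical[::-1]

# Assumed buggy rendering: glyphs placed left-to-right in logical order.
buggy_left_to_right = logical

# A Hebrew reader scans right-to-left, so this is what they end up reading.
what_reader_reads = buggy_left_to_right[::-1]

logical_words = logical.split(" ")
read_words = what_reader_reads.split(" ")

# The words come out in reverse order, each with its letters reversed.
print(read_words == [word[::-1] for word in reversed(logical_words)])
```

Mixed Hebrew and Latin text needs the full bidirectional algorithm; a plain reversal like this only covers the single-direction case.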
Project Member

Comment 24 by sheriffbot@chromium.org, Jun 29 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 25 by npm@chromium.org, Jun 29 2018

Status: Fixed (was: Untriaged)
I looked at this on Chrome Windows vs Acrobat and it seems to coincide so marking this as fixed.

Comment 26 by npm@chromium.org, Jun 29 2018

Status: Available (was: Fixed)
Summary: PDF Hebrew text is rendered LTR instead of RTL (was: PDF Form Fields, Encoding and JavaScript broken in Chrome Plugin)
Oh actually the LTR vs RTL problem is still there.
