Issue 743522 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Sep 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows , Mac
Pri: 2
Type: Compat




Text copied from PDF has invisible non-printing characters

Reported by khym.cha...@gmail.com, Jul 15 2017

Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36

Example URL:

Steps to reproduce the problem:
1. Open the attached PDF inside of Chrome
2. Select and copy "average film length for"
3. Paste into a text area
4. Place text cursor behind the "f" in "for"
5. Press the left arrow key enough times to move the cursor behind the ending "h" of "length".

What is the expected behavior?
Should only have to press the left arrow twice to move cursor behind "h".

What went wrong?
Have to press left arrow three times, one of the times not visibly changing the position of the cursor.

Does it occur on multiple sites: N/A

Is it a problem with a plugin? No 

Did this work before? N/A 

Does this work in other browsers? N/A

Chrome version: 59.0.3071.115  Channel: stable
OS Version: Fedora 25
Flash Version: Shockwave Flash 26.0 r0

On Linux, if you paste the same text into a Konsole terminal the extra character is invisible, but if you paste it into console (non-GUI) vim in insert mode it shows up as the special character "<200b>".

The same bug happens at any point in the attached PDF file where bold text changes back to non-bold text.

The PDF producer for the file is Skia/PDF m61

The file format is PDF version v 1.5
 
bugg_pdf.pdf
58.7 KB Download
Components: Internals>Plugins>PDF Blink
Labels: Needs-Triage-M59
Cc: brajkumar@chromium.org
Labels: Needs-Feedback
Unable to reproduce this issue on Ubuntu 14.04 with the latest stable Chrome, 59.0.3071.115, by following the steps in the original comment. Pressing the left arrow twice moved the cursor behind "h", as expected.

Reporter@ Are you able to reproduce this issue in incognito mode as well? I am attaching a screencast for reference; please take a look and let me know whether this is the expected behavior for this issue.

Thanks!
743522.ogv
2.7 MB View Download
I'm no longer able to reproduce the issue when pasting text into a text area or text field, but the problem still shows up when pasting it into vim.

I also verified that the vim problem reproduces with experiments turned off, extensions turned off, under incognito, with unstable Chrome (v 61.0.3153.4), with a new OS-level user account, and with a new desktop environment (Xfce instead of KDE).
Project Member

Comment 4 by sheriffbot@chromium.org, Jul 18 2017

Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "brajkumar@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Components: -Blink Blink>Editing>Selection
Can reproduce Ubuntu 14.04, Chrome Stable 59.0.3071.115
Components: -Blink>Editing>Selection
Owner: dsinclair@chromium.org
Status: Assigned (was: Unconfirmed)
The character being added, 0x200b, is ZERO WIDTH SPACE according to unicodemap.org. When I run the PDF through pdftotext to look at the text content, the 0x200b is converted to a '_'. There are '_'s around all of the bold regions in the converted text, so I think this is an artifact of how the document was generated.

Performing the same copy operation on the PDF opened in Evince 3.10.3, the Gnome Document Viewer, I get the same output in vim, i.e. the 0x200b character appears at the end of the bold region. Thus I am pretty sure that the character is actually in the text content of the PDF.
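
The check described above can also be done without vim or pdftotext by listing the code points of the pasted string. The following is a minimal sketch, not part of the original report: the helper name `describe_code_points` is hypothetical, and the sample string assumes the zero width space sits between the space after "length" and the "f" of "for".

```python
import unicodedata


def describe_code_points(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Assumed reconstruction of the pasted text (the position of U+200B is an
# assumption based on where the report says the extra character appears).
pasted = "length \u200bfor"

for code_point, name in describe_code_points(pasted):
    print(code_point, name)
```

In the printed list, the invisible character stands out as its own entry named ZERO WIDTH SPACE.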

I am going to send this over to dsinclair to determine if copying non-printing characters like this is the correct behaviour.
I also recommend trying this in Acrobat Reader to see how it behaves.

Comment 9 by weili@chromium.org, Aug 18 2017

Labels: OS-Mac OS-Windows
This is a platform-independent issue. Basically, PDFium faithfully returns this character, while some viewers, such as Acrobat Reader, do not pass it on during copy and paste.
Owner: ----
Status: Untriaged (was: Assigned)
Setting PDF bugs assigned to me back to untriaged so they can get re-assigned as needed.
Status: WontFix (was: Untriaged)
I think we _could_ filter characters like this out, but it's not necessarily the right thing to do in all cases. I don't think PDFium _must_ filter it out though, and arguably as a library it should return all it can to cover the broadest uses.

Marking as WontFix, please reopen if you disagree.
Owner: rharrison@chromium.org
Status: Started (was: WontFix)
I agree with hnakashima@ that this isn't a PDFium issue. 0x200B is not a control/non-printing character, which we filter out, but an ignorable printing character, so it is included in the text string. It is the responsibility of the caller to figure out what that means in their context.

That being said, there is an argument to be made that Chromium, from an embedder/UI perspective, should be filtering out these characters. I will take a look at doing that.
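
The distinction drawn above between control characters and ignorable printing characters can be illustrated with Unicode general categories. This is only a sketch: mapping "control" to category Cc and U+200B's class to category Cf is an assumption made for illustration, and it is not a description of how PDFium's filter is actually implemented. The helper names are hypothetical.

```python
import unicodedata


def is_control(ch):
    """True if ch is in Unicode general category Cc (control)."""
    return unicodedata.category(ch) == "Cc"


def is_format(ch):
    """True if ch is in Unicode general category Cf (format)."""
    return unicodedata.category(ch) == "Cf"


# A filter keyed only on the control category (an assumed model of the
# library-side filter) drops BEL but lets U+200B through.
sample = "a\x07b\u200bc"
kept = "".join(ch for ch in sample if not is_control(ch))

print([f"U+{ord(ch):04X}" for ch in kept])
```

Under this model the zero width space survives library-side filtering, which is consistent with the caller having to decide what to do with it.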

Project Member

Comment 13 by bugdroid1@chromium.org, Sep 6

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a7a26d22d49a44ade6bef4702c453623bb199fce

commit a7a26d22d49a44ade6bef4702c453623bb199fce
Author: Ryan Harrison <rharrison@chromium.org>
Date: Thu Sep 06 20:22:37 2018

Strip Zero Width Whitespace from PDFium text strings

When getting text from PDFium, the library does not filter ZWW
(0x200B), since it is a valid non-control character. It is ignorable
though, so the embedder aka Chrome, has the option of whether or not
to display this character. Given that it shouldn't have any visual
display, including it in the displayed text can lead to weird UI
situations. Like the length of text being longer than the number of
characters displayed or navigating the cursor requires multiple key
presses to get over the ZWW.

BUG= chromium:743522 

Change-Id: I5312a3aad4a752659fb4455853cd1030f0660bd9
Reviewed-on: https://chromium-review.googlesource.com/1210966
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#589271}
[modify] https://crrev.com/a7a26d22d49a44ade6bef4702c453623bb199fce/pdf/pdfium/pdfium_range.cc
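
The idea behind this change can be sketched in a few lines. The snippet below is an illustration in Python, not the code from pdfium_range.cc; the function name `strip_zero_width_space` and the sample string are assumptions.

```python
ZERO_WIDTH_SPACE = "\u200b"


def strip_zero_width_space(text):
    """Return text with every U+200B removed (embedder-side filtering)."""
    return text.replace(ZERO_WIDTH_SPACE, "")


# Assumed form of the string returned for the selection in the report.
raw = "average film length \u200bfor"
clean = strip_zero_width_space(raw)

print(len(raw))    # 24 code points, one more than is visible
print(len(clean))  # 23 code points, matching what is displayed
```

After stripping, the length of the string matches the number of displayed characters, which is the mismatch the commit message describes.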

Cc: susan.boorgula@chromium.org
Labels: Needs-Feedback
Tried to reproduce the issue on Windows 10 and Mac OS 10.13.3 on build 69.0.3457.0, which does not contain the fix, and was unable to reproduce it by following the steps below.

1. Launched Chrome and opened the given attached pdf.
2. Copied the text 'average film length ​for' from the pdf to a text area and placed the cursor before "f" in "for".
3. Could not observe any issue when pressing the left arrow key to move the cursor before 'h'.
A screencast is attached for reference.

rharrison@ Please check whether we missed anything while verifying the issue, and help us verify the fix on the latest M-71 build.

Thanks..
743522.mp4
569 KB View Download
Status: Fixed (was: Started)
Gmail might be doing something smart about the zero width space. Additionally, because you place the cursor to the left of the f, I am not sure whether it ends up on the left or the right of the non-displaying space.

I repro this by copying the text into the Omnibox, since I know it doesn't do anything smart with 0x200B, putting the cursor to the right of f, and then using the left arrow to move the cursor. When moving over the space to the left of f, it takes two key presses instead of one. This consistently repros without the patch for me.
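
The key-press count in this repro can be modeled with a short sketch. It assumes that each left-arrow press moves the caret back by exactly one code point and that the zero width space sits between the space and the "f"; both are assumptions for illustration, and the function name `left_arrow_trace` is hypothetical.

```python
ZERO_WIDTH_SPACE = "\u200b"


def left_arrow_trace(text, caret, target):
    """Simulate left-arrow presses from caret back to target.

    Returns one boolean per press: True if the caret visibly moved,
    False if the press only stepped over a zero width space.
    """
    moves = []
    while caret > target:
        crossed = text[caret - 1]
        moves.append(crossed != ZERO_WIDTH_SPACE)
        caret -= 1
    return moves


text = "length \u200bfor"
start = text.index("f") + 1   # caret just to the right of "f"
target = text.index("h") + 1  # caret just to the right of "h"

trace = left_arrow_trace(text, start, target)
print(len(trace), trace.count(False))
```

With the zero width space present the model needs three presses, one of which produces no visible movement; with it stripped, the same walk takes two.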
Labels: -Needs-Feedback
Project Member

Comment 17 by bugdroid1@chromium.org, Sep 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/12f2cd31853e9b5b0322207b336ce089087dd368

commit 12f2cd31853e9b5b0322207b336ce089087dd368
Author: Ryan Harrison <rharrison@chromium.org>
Date: Mon Sep 10 17:45:26 2018

Merge definitions of Zero Width Whitespace in pdf/

I introduced a second definition of ZWW in the PDF plugin code without
realizing it in a previous CL. This CL merges the two definitions
together.

BUG= chromium:743522 

Change-Id: Id5389bffc9aca70458c4aa934eb3163bf6ad503a
Reviewed-on: https://chromium-review.googlesource.com/1213543
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#589970}
[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdf_engine.h
[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdfium/pdfium_engine.cc
[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdfium/pdfium_range.cc

Labels: TE-Verified-71.0.3549.0 TE-Verified-M71
Able to reproduce this issue on Windows 10, Mac OS 10.13.3 and Ubuntu 14.04 on the build without fix 69.0.3457.0 and the issue is fixed on the latest M-71 build 71.0.3549.0 as per comment #15
On copying the text 'average film length ​for' from the attached pdf to omnibox and placed the cursor before "f" in "for", then using the left arrow to move the cursor, it takes only one key press.

Attached is the screen cast for reference. 

Hence adding TE verified labels as the fix is working as intended.

Thanks..
743522-M71.mp4
799 KB View Download
