Text copied from PDF has invisible non-printing characters
Reported by
khym.cha...@gmail.com,
Jul 15 2017
Issue description
UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
Example URL:
Steps to reproduce the problem:
1. Open the attached PDF inside of Chrome
2. Select and copy "average film length for"
3. Paste into a text area
4. Place text cursor behind the "f" in "for"
5. Press the left arrow key enough times to move the cursor behind the ending "h" of "length".
What is the expected behavior?
Should only have to press the left arrow twice to move cursor behind "h".
What went wrong?
Have to press left arrow three times, one of the times not visibly changing the position of the cursor.
Does it occur on multiple sites: N/A
Is it a problem with a plugin? No
Did this work before? N/A
Does this work in other browsers? N/A
Chrome version: 59.0.3071.115 Channel: stable
OS Version: Fedora 25
Flash Version: Shockwave Flash 26.0 r0
In Linux, if you paste the same text into a konsole text console the extra character will be invisible, but if you paste it into console (non-GUI) vim in insert mode it will show up as a special character "<200b>".
The same bug happens at any point in the attached PDF file where bold text changes back to non-bold text.
The PDF producer for the file is Skia/PDF m61.
The file format is PDF version 1.5.
,
Jul 17 2017
Unable to reproduce this issue on Ubuntu 14.04 using the latest stable Chrome, 59.0.3071.115, by following the steps in the original comment. Pressing the left arrow twice moved the cursor to behind "h" as expected. Reporter@ Are you able to reproduce this issue in incognito mode as well? Attaching a screen cast for reference; please take a look and let me know whether this is the expected behavior. Thanks!
,
Jul 18 2017
I'm no longer able to reproduce the issue when pasting text into a text area or text field, but the problem still shows up when pasting it into vim. I also verified that the vim problem reproduces with experiments turned off, extensions turned off, under incognito, with unstable Chrome (v 61.0.3153.4), with a new OS level user account, and with a new desktop environment (Xfce instead of KDE).
,
Jul 18 2017
Thank you for providing more feedback. Adding requester "brajkumar@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jul 18 2017
Can reproduce on Ubuntu 14.04, Chrome Stable 59.0.3071.115
,
Jul 31 2017
,
Aug 1 2017
The character being added in, 0x200b, is a ZERO WIDTH SPACE according to unicodemap.org. If I run the PDF through pdftotext to look at the text content, the 0x200b is converted to a '_'. There are '_'s appearing around all of the bold regions in the converted text, so I think this is an artifact of how the document is generated. Performing the same copy operation on the PDF opened in Evince 3.10.3, the Gnome Document Viewer, I get the same output in vim, i.e. the 0x200b character appearing at the end of the bold region. Thus I am pretty sure that the character is actually in the text content of the PDF. I am going to send this over to dsinclair to determine if copying non-printing characters like this is the correct behaviour.
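For anyone wanting to check pasted text without going through vim, here is a minimal Python sketch that lists invisible characters by position. It is not part of any tool mentioned in this thread: the function name is hypothetical, the sample string is made up (the exact position of the 0x200b in the copied text is an assumption), and "invisible" is approximated here as the Unicode general categories Cc (control) and Cf (format).

```python
import unicodedata

def describe_invisible(text):
    """Return (index, code point, Unicode name) for each control or format character."""
    found = []
    for index, ch in enumerate(text):
        # Cc = control characters, Cf = format characters such as U+200B.
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # Control characters have no Unicode name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

# Hypothetical sample: a zero width space placed right after "length".
pasted = "length\u200b for"
print(describe_invisible(pasted))
# [(6, 'U+200B', 'ZERO WIDTH SPACE')]
```

Running this on the real clipboard contents would show whether the extra character is present regardless of how the receiving application renders it.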
,
Aug 17 2017
I also recommend trying this in Acrobat Reader and see how it behaves.
,
Aug 18 2017
This is a platform-independent issue. Basically, PDFium just faithfully returns this character, while some viewers, such as Acrobat Reader, don't pass it on during copy and paste.
,
Sep 4
Setting PDF bugs assigned to me back to untriaged so they can get re-assigned as needed.
,
Sep 5
I think we _could_ filter characters like this out, but it's not necessarily the right thing to do in all cases. I don't think PDFium _must_ filter it out though, and arguably as a library it should return all it can to cover the broadest uses. Marking as WontFix, please reopen if you disagree.
,
Sep 6
I agree with hnakashima@ that this isn't a PDFium issue. 0x200B is not a control/non-printing character, which we filter out, but an ignorable printing character, thus it is included in the text string. It is the responsibility of the caller to figure out what that means in their context. That being said, there is an argument to be made that Chromium, from an embedder/UI perspective, should be filtering out these characters. I will take a look at doing that.
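The distinction above can be illustrated with a small Python sketch. This is an assumption-laden stand-in, not PDFium's logic: the thread does not show how PDFium classifies characters, so Unicode general categories are used instead (Cc for control characters, Cf for format characters, which is where U+200B falls), and the function name is hypothetical.

```python
import unicodedata

def drop_control_characters(text):
    """Remove only general-category Cc characters; everything else passes through."""
    # Note: this also drops tab and newline, which are Cc as well.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

print(unicodedata.category("\u0007"))  # Cc (BELL, a control character)
print(unicodedata.category("\u200b"))  # Cf (ZERO WIDTH SPACE, a format character)

# A filter that targets control characters leaves the zero width space in place.
print(repr(drop_control_characters("a\u0007b\u200bc")))  # 'ab\u200bc'
```

Under this reading, a library that strips only control characters would still hand 0x200B to its caller, which matches the behaviour described in the comment.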
,
Sep 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a7a26d22d49a44ade6bef4702c453623bb199fce

commit a7a26d22d49a44ade6bef4702c453623bb199fce
Author: Ryan Harrison <rharrison@chromium.org>
Date: Thu Sep 06 20:22:37 2018

Strip Zero Width Whitespace from PDFium text strings

When getting text from PDFium, the library does not filter ZWW (0x200B), since it is a valid non-control character. It is ignorable though, so the embedder aka Chrome, has the option of whether or not to display this character. Given that it shouldn't have any visual display, including it in the displayed text can lead to weird UI situations. Like the length of text being longer then number of characters displayed or navigating the cursor requires multiple key presses to get over the ZWW.

BUG= chromium:743522

Change-Id: I5312a3aad4a752659fb4455853cd1030f0660bd9
Reviewed-on: https://chromium-review.googlesource.com/1210966
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#589271}

[modify] https://crrev.com/a7a26d22d49a44ade6bef4702c453623bb199fce/pdf/pdfium/pdfium_range.cc
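As a rough illustration of the idea in this commit message, here is a Python sketch of stripping the zero width space on the embedder side. It is not the code from pdfium_range.cc (which is C++ and not shown in this thread); the constant and function names are hypothetical and the sample string is made up.

```python
ZERO_WIDTH_SPACE = "\u200b"  # hypothetical name for the 0x200B constant

def strip_zero_width_space(text):
    """Return text with every U+200B removed; all other characters are kept."""
    return text.replace(ZERO_WIDTH_SPACE, "")

raw = "length\u200b for"  # made-up sample, as a text library might return it
clean = strip_zero_width_space(raw)

# The raw string is one code point longer than what is visibly displayed,
# which is why one arrow key press appeared to do nothing.
print(len(raw), len(clean))  # 11 10
```

The sketch removes only U+200B, in line with the commit title; whether other ignorable characters should be treated the same way is not addressed in this thread.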
,
Sep 7
Tried to reproduce the issue on Windows 10 and Mac OS 10.13.3 on the build without the fix, 69.0.3457.0, and was unable to reproduce it by following the steps below. 1. Launched Chrome and opened the attached PDF. 2. Copied the text 'average film length for' from the PDF to a text area and placed the cursor before "f" in "for". 3. Could not observe any issue when pressing the left arrow key to move the cursor before 'h'. Attached is the screen cast for reference. rharrison@ Could you check whether we missed anything in verifying the issue, and help us verify the fix on the latest M-71 build? Thanks.
,
Sep 7
GMail might be doing something smart about the zero width space. Additionally, because you place the cursor to the left of the f, I am not sure if it is going to be on the left or right of the non-displaying space. How I repro this is by copying the text into the OmniBox, since I know it doesn't do something smart with 0x200B, and putting the cursor to the right of f, then using the left arrow to move the cursor. When moving over the space to the left of f, it takes two key presses instead of one. This consistently repros without the patch for me.
,
Sep 7
,
Sep 10
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/12f2cd31853e9b5b0322207b336ce089087dd368

commit 12f2cd31853e9b5b0322207b336ce089087dd368
Author: Ryan Harrison <rharrison@chromium.org>
Date: Mon Sep 10 17:45:26 2018

Merge definitions of Zero Width Whitespace in pdf/

I introduced a second definition of ZWW in the PDF plugin code without realizing it in a previoud CL. This CL merges the two definitions together.

BUG= chromium:743522

Change-Id: Id5389bffc9aca70458c4aa934eb3163bf6ad503a
Reviewed-on: https://chromium-review.googlesource.com/1213543
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#589970}

[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdf_engine.h
[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdfium/pdfium_engine.cc
[modify] https://crrev.com/12f2cd31853e9b5b0322207b336ce089087dd368/pdf/pdfium/pdfium_range.cc
,
Sep 11
Able to reproduce this issue on Windows 10, Mac OS 10.13.3 and Ubuntu 14.04 on the build without the fix, 69.0.3457.0, and the issue is fixed on the latest M-71 build, 71.0.3549.0, as per comment #15. On copying the text 'average film length for' from the attached PDF to the omnibox and placing the cursor before "f" in "for", then using the left arrow to move the cursor, it takes only one key press. Attached is the screen cast for reference. Hence adding TE verified labels, as the fix is working as intended. Thanks.
Comment 1 by nyerramilli@chromium.org, Jul 17 2017
Labels: Needs-Triage-M59