
Issue 788799


Issue metadata

Status: Fixed
Owner:
Closed: Sep 25
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug

Blocked on:
issue pdfium:1031




Searching for hyphenated text doesn't work in PDFs

Project Member Reported by rharrison@chromium.org, Nov 27 2017

Issue description

Searching from the find box for a word that is broken across lines with a hyphen does not work correctly.

In the example PDF, the word "applications" is line-broken as "ap-plications". Searching for "application" doesn't find the hyphenated instance.

This is likely related to how PDFium handles soft hyphens in text, and probably has never worked correctly.
 
Attachment: spanner.pdf (57.2 KB)
Where is the hyphenated 'ap-plication' in spanner.pdf? Am I missing it? I see 7 results for 'application'.
It is actually ap-plications, not ap-plication. It appears in the second column in the first full paragraph on the line that begins 'that Bigtable can be difficult to use for some kinds of'
Components: UI>Browser>FindInPage
Ah, thanks. It does work in one other PDF viewer I tried on Linux. I also tried r350005 which didn't work, so I suspect you are correct in that it never worked.
I suspect hyphenation will work in some cases, but not all. Specifically, if the word is broken across text boxes/objects, then PDFium will include the '-' in the text string returned by the public API instead of removing it. If the word is just broken over two lines in the same box, it should be handled correctly, though that might be buggy also.
Blockedon: pdfium:1031
Labels: -Pri-3 Pri-2
Status: Started (was: Assigned)
Project Member

Comment 8 by bugdroid1@chromium.org, Sep 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/189eca26a74224d31e302bf0075f99cb6b2873f0

commit 189eca26a74224d31e302bf0075f99cb6b2873f0
Author: Ryan Harrison <rharrison@chromium.org>
Date: Tue Sep 25 17:47:57 2018

Find strings in PDFs that have been broken by a soft hyphen

Currently if a search term in the PDF text has been broken over two
lines by a soft hyphen, find will not correctly identify it as a
match. This is rooted in the fact that the result of FPDF_GetText
includes a marker for soft-hyphens, 0xFFFE, which causes the match to
fail.

This CL adds in filtering this character from the text being
searched over, so that these matches can pass. This requires changes
in the SearchUsingICU method to strip ignorable characters from the
string before searching, and correctly converting the results back
into the non-stripped index space. Ranges also have had filtering for
0xFFFE added in, so that the highlights created by searching are
properly placed.

BUG= chromium:788799 

Change-Id: I06c8181358cdebe6454c36437065592820637808
Reviewed-on: https://chromium-review.googlesource.com/1234998
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Reviewed-by: Lei Zhang <thestig@chromium.org>
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593993}
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdf_engine.h
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/findtext_unittest.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_engine.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_range.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_range.h
[add] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/test/data/spanner.pdf

Status: Fixed (was: Started)
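
For anyone reading the commit message above, here is a rough sketch of the strip-then-remap idea it describes: drop the 0xFFFE markers before searching, then convert match positions back into the unstripped index space so highlights line up. This is not the code from the CL. `StripMarkers`, `FindIgnoringMarkers`, and `Match` are hypothetical names, and a plain `std::u16string::find` stands in for the ICU-based search that the commit message says SearchUsingICU performs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Marker that, per the commit message, FPDF_GetText emits for soft hyphens.
constexpr char16_t kSoftHyphenMarker = 0xFFFE;

// Hypothetical result type: a match expressed in the ORIGINAL text's indices.
struct Match {
  size_t start;   // index of the first matched character in the original text
  size_t length;  // span in the original text, including any skipped markers
};

// Hypothetical helper: returns `text` without markers and records, for each
// kept character, the index it had in `text`.
std::u16string StripMarkers(const std::u16string& text,
                            std::vector<size_t>* original_index) {
  std::u16string stripped;
  original_index->clear();
  for (size_t i = 0; i < text.size(); ++i) {
    if (text[i] == kSoftHyphenMarker)
      continue;
    stripped.push_back(text[i]);
    original_index->push_back(i);
  }
  return stripped;
}

// Finds non-overlapping, case-sensitive matches of `term`, ignoring markers,
// and reports them in the unstripped index space.
std::vector<Match> FindIgnoringMarkers(const std::u16string& text,
                                       const std::u16string& term) {
  std::vector<Match> matches;
  if (term.empty())
    return matches;

  std::vector<size_t> original_index;
  const std::u16string stripped = StripMarkers(text, &original_index);

  for (size_t pos = stripped.find(term); pos != std::u16string::npos;
       pos = stripped.find(term, pos + term.size())) {
    // Map the first and last matched characters back to the original text.
    const size_t first = original_index[pos];
    const size_t last = original_index[pos + term.size() - 1];
    matches.push_back({first, last - first + 1});
  }
  return matches;
}
```

With the text "ap" + 0xFFFE + "plications", searching for "application" yields one match starting at index 0 with length 12 in the original text: the 11 characters of the term plus the one skipped marker. Keeping the marker inside the reported span is a choice made for this sketch; the commit message only says that ranges also filter 0xFFFE so that highlights are placed properly.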
