Searching for hyphenated text doesn't work in PDFs
Issue description
Searching from the find box for a word that is broken across lines with a hyphen does not work correctly. In the example PDF, the word "applications" is line-broken as "ap-plications". Searching for "application" doesn't find the hyphenated occurrence. This is likely related to how PDFium handles soft hyphens in text, and it has probably never worked correctly.
,
Dec 14 2017
,
Dec 14 2017
It is actually ap-plications, not ap-plication. It appears in the second column in the first full paragraph on the line that begins 'that Bigtable can be difficult to use for some kinds of'
,
Dec 15 2017
Ah, thanks. It does work in one other PDF viewer I tried on Linux. I also tried r350005 which didn't work, so I suspect you are correct in that it never worked.
,
Mar 8 2018
I suspect hyphenation will work in some cases, but not all. Specifically, if the word is broken across text boxes/objects, then PDFium will include the - in the text string returned from the public API instead of removing it. If the word is just broken over two lines in the same box, it should be handled correctly, though that might be buggy as well.
,
Mar 8 2018
,
Sep 19
,
Sep 25
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/189eca26a74224d31e302bf0075f99cb6b2873f0

commit 189eca26a74224d31e302bf0075f99cb6b2873f0
Author: Ryan Harrison <rharrison@chromium.org>
Date: Tue Sep 25 17:47:57 2018

Find strings in PDFs that have been broken by a soft hyphen

Currently if a search term in the PDF text has been broken over two lines by a soft hyphen, find will not correctly identify it as a match. This is rooted in the fact that the result of FPDF_GetText includes a marker for soft-hyphens, 0xFFFE, which causes the match to fail.

This CL adds in filtering this character from the text being searched over, so that these matches can pass. This requires changes in the SearchUsingICU method to strip ignorable characters from the string before searching, and correctly converting the results back into the non-stripped index space. Ranges also have had filtering for 0xFFFE added in, so that the highlights created by searching are properly placed.

BUG= chromium:788799
Change-Id: I06c8181358cdebe6454c36437065592820637808
Reviewed-on: https://chromium-review.googlesource.com/1234998
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Reviewed-by: Lei Zhang <thestig@chromium.org>
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593993}

[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdf_engine.h
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/findtext_unittest.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_engine.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_range.cc
[modify] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/pdfium/pdfium_range.h
[add] https://crrev.com/189eca26a74224d31e302bf0075f99cb6b2873f0/pdf/test/data/spanner.pdf
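
For readers who want the idea without opening the CL, the strip-then-remap approach the commit message describes might look roughly like the sketch below. This is not the code from the change: FindIgnoringSoftHyphens, Match, and kSoftHyphenMarker are hypothetical names made up for illustration, and std::u16string::find stands in for the ICU-based search that the real SearchUsingICU method performs. The only detail taken from the commit message is that the extracted text contains 0xFFFE where a soft hyphen was.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Marker that, per the commit message, appears in the extracted text for a
// soft hyphen. The constant name is an assumption for this sketch.
constexpr char16_t kSoftHyphenMarker = 0xFFFE;

// Hypothetical result type: a match expressed in the ORIGINAL text's indices.
struct Match {
  std::size_t start;   // index of the first matched character in the original
  std::size_t length;  // span in the original, including any skipped markers
};

// Strips the marker, searches the stripped text, then maps each hit back to
// the original (non-stripped) index space so highlights land correctly.
std::vector<Match> FindIgnoringSoftHyphens(const std::u16string& text,
                                           const std::u16string& term) {
  std::u16string stripped;
  std::vector<std::size_t> original_index;  // stripped[i] came from text[original_index[i]]
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] == kSoftHyphenMarker)
      continue;
    stripped.push_back(text[i]);
    original_index.push_back(i);
  }

  std::vector<Match> matches;
  if (term.empty())
    return matches;

  // Plain substring search as a stand-in for the real ICU search.
  std::size_t pos = stripped.find(term);
  while (pos != std::u16string::npos) {
    const std::size_t first = original_index[pos];
    const std::size_t last = original_index[pos + term.size() - 1];
    matches.push_back({first, last - first + 1});
    pos = stripped.find(term, pos + 1);
  }
  return matches;
}
```

With the text "ap", then 0xFFFE, then "plications", a search for "application" reports a single match that starts at index 0 and spans 12 characters of the original text, one more than the search term because the marker falls inside the match.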
,
Sep 25
Comment 1 by thestig@chromium.org
, Dec 14 2017