Some text links in pdf files are not recognized
Reported by
m...@issuu.com,
Nov 17 2017
Issue description
UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Example URL: Save "data:text/plain,google.com www.google.com" as a PDF and open it in Chrome
Steps to reproduce the problem:
1. Save "data:text/plain,google.com www.google.com" as a PDF
2. Open the PDF in Chrome
3. Hover the mouse over the two links
What is the expected behavior? I would be able to click both the link to google.com and the link to www.google.com
What went wrong? "google.com" is not recognized as a link. Only "www.google.com" is.
Does it occur on multiple sites: N/A
Is it a problem with a plugin? Yes, the built-in PDF viewer
Did this work before? No
Does this work in other browsers? No. Neither pdf.js (Firefox) nor Adobe Reader recognizes any text links
Chrome version: 61.0.3163.100 Channel: n/a
OS Version: OS X 10.11.6
Flash Version:
,
Nov 20 2017
Paste the "data:..." URL into Chrome's location bar and hit enter. This will show an HTML page with the content from the data URL. Save this page as a PDF file using a PDF printer, then open the resulting PDF in Chrome. Sorry for not being completely clear about this. I saw the technique used in another ticket and thought it was a good way of generating test files.
,
Nov 20 2017
Thank you for providing more feedback. Adding requester "hnakashima@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Nov 20 2017
Thank you for clarifying. I reproduced the behavior described. google.com does not show as a link while www.google.com does because neither is an actual link in the PDF (which is why other PDF viewers don't create a link there); PDFium makes a best-effort pass to add links for strings that are obviously URLs. That means www.anything.com and https://anything.com turn into links, but anything.com does not, because the heuristic is conservative. Marking as WAI, but thank you for the report.
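To make the behavior described above concrete, here is a rough Python sketch of a conservative rule of that kind. It is not PDFium's code (the real logic is C++ and may differ in detail); the names `looks_like_obvious_url` and `find_links` and the exact regular expression are assumptions made up for illustration. It only accepts tokens that carry an explicit http(s) scheme or a `www.` prefix.

```python
import re
from typing import List

# Hypothetical illustration only; NOT PDFium's implementation.
# Assumption: a token is "obviously a URL" if it starts with an
# http(s) scheme or with "www." and contains at least one more dot-separated part.
_OBVIOUS_URL = re.compile(r"^(?:https?://|www\.)\S+\.\S+$", re.IGNORECASE)


def looks_like_obvious_url(token: str) -> bool:
    """Return True only for tokens with an explicit scheme or a www. prefix."""
    return _OBVIOUS_URL.match(token) is not None


def find_links(text: str) -> List[str]:
    """Split plain text on whitespace and keep the tokens that pass the check."""
    return [token for token in text.split() if looks_like_obvious_url(token)]


if __name__ == "__main__":
    # The text from the report: only the www. form is picked up.
    print(find_links("google.com www.google.com"))
```

Under this rule, a bare `anything.com` is skipped simply because nothing marks it as a URL other than its suffix.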
,
Nov 21 2017
Thanks for the quick reply. If I may ask, is there a (technical) reason for this behaviour? Even with a conservative heuristic, I don't see any reason not to show text like google.com as a link (monorail, for instance, easily recognises it), and as a user it feels weird that some obvious links are shown as links while other obvious links are not.
,
Dec 6 2017
Hmm ... apparently I may not. I guess this is just not a priority :-(
,
Dec 6 2017
It may be possible to make the heuristic less conservative, but it quickly becomes a hard problem, as you can't just match on the suffix with so many TLDs available now. If you want to try, we're happy to accept patches, but this isn't something we're looking into at the moment.
,
Dec 6 2017
Sorry, I missed your comment. We could do matching for some common suffixes like .com and .org; I think that would be reasonable. The relevant code is at core/fpdftext/cpdf_linkextract.cpp. I'll actually leave this open and work on it when I have cycles.
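As a rough idea of what matching on a few common suffixes could look like, here is a minimal Python sketch. It is only an illustration of the proposal, not a patch for cpdf_linkextract.cpp; the `COMMON_SUFFIXES` list, the hostname pattern, and the function name `is_bare_domain_link` are all assumptions. It handles bare host names only (no paths or ports) and strips trailing sentence punctuation before checking.

```python
import re

# Assumption: a short allowlist taken from the suffixes named in the comment.
# A real change would have to decide which suffixes count as "common".
COMMON_SUFFIXES = (".com", ".org")

# Hypothetical hostname shape: dot-separated labels of letters, digits and
# inner hyphens, ending in an alphabetic label of at least two characters.
_HOST = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)


def is_bare_domain_link(token: str) -> bool:
    """Return True if the token is a bare host name ending in an allowed suffix."""
    # Drop trailing punctuation so "google.com." at the end of a sentence still matches.
    host = token.rstrip(".,;:!?)").lower()
    return _HOST.match(host) is not None and host.endswith(COMMON_SUFFIXES)


if __name__ == "__main__":
    for candidate in ("google.com", "file.txt", "example.dev"):
        print(candidate, is_bare_domain_link(candidate))
```

The allowlist keeps strings such as `file.txt` from turning into links, at the cost of missing valid domains under other TLDs, which is the trade-off raised in the previous comment.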
,
Dec 7 2017
Thanks, that sounds really great. Adding support for common domains would be a great improvement. I found the relevant monorail link recognition here for inspiration: https://chromium.googlesource.com/infra/infra/+/master/appengine/monorail/features/autolink.py
,
Oct 12
Comment 1 by hnakashima@chromium.org
, Nov 20 2017