
Issue 786306

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows , Chrome , Mac
Pri: 3
Type: Compat




Some text links in pdf files are not recognized

Reported by m...@issuu.com, Nov 17 2017

Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36

Example URL:
Save "data:text/plain,google.com www.google.com" as a pdf and open in chrome

Steps to reproduce the problem:
1. Save "data:text/plain,google.com www.google.com" as a pdf
2. open the pdf in chrome
3. hover the mouse over the two links

What is the expected behavior?
I would be able to click both links: google.com and www.google.com.

What went wrong?
"google.com" is not recognized as a link. Only "www.google.com" is.

Does it occur on multiple sites: N/A

Is it a problem with a plugin? Yes, the built-in PDF viewer

Did this work before? No 

Does this work in other browsers? No
 Neither pdf.js (Firefox) nor Adobe Reader recognizes any text links

Chrome version: 61.0.3163.100  Channel: n/a
OS Version: OS X 10.11.6
Flash Version:
 
Labels: Needs-Feedback
I don't quite get how to reproduce this. When you say 'Save "data:text/plain,google.com www.google.com" as a pdf and open in chrome', does that mean pasting the text "data:text/plain,google.com www.google.com" into a file and renaming it to .pdf? Or do you mean pasting it into an HTML file, opening that in Chrome, and printing it to a PDF?

Comment 2 by m...@issuu.com, Nov 20 2017

Paste the "data:..." url in the location bar of Chrome and hit enter. This
will show an html page with the content from the data-url. Save this page
as a pdf file using a pdf printer. Open the resulting pdf in Chrome.

Sorry for not being completely clear about this. I saw the technique used
in another ticket and thought it was a good way of generating test files.

Project Member

Comment 3 by sheriffbot@chromium.org, Nov 20 2017

Cc: hnakashima@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "hnakashima@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Components: Internals>Plugins>PDF
Status: WontFix (was: Unconfirmed)
Thank you for clarifying. I reproduced the behavior described.

The reason google.com does not show as a link and www.google.com does is that neither is actually a link in the PDF (which is why other PDF viewers don't create a link there). PDFium makes a best-effort attempt to add links for strings that are obviously URLs. That means www.anything.com and https://anything.com turn into links, but anything.com does not, because the heuristic is conservative.

Marking as WAI (working as intended), but thank you for the report.
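
To make the described behavior concrete, here is a minimal Python sketch of a prefix-only heuristic. It is an approximation written for this thread, not PDFium's actual code: the function names, the regular expression, and the whitespace tokenization are all assumptions.

```python
import re
from typing import List

# Hypothetical approximation of the conservative behavior described above:
# a token counts as a link only if it starts with http://, https:// or www.
_PREFIX_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def looks_like_link_conservative(token: str) -> bool:
    """Return True only for tokens with an explicit scheme or a www. prefix."""
    return bool(_PREFIX_RE.match(token))


def find_links(text: str) -> List[str]:
    """Split on whitespace and keep the tokens the heuristic accepts."""
    return [tok for tok in text.split() if looks_like_link_conservative(tok)]


if __name__ == "__main__":
    # Same text as the reproduction case: only the www. form is picked up.
    print(find_links("google.com www.google.com"))
```

With the text from the reproduction steps, only www.google.com passes the check, which matches what the reporter observed.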

Comment 5 by m...@issuu.com, Nov 21 2017

Thanks for the quick reply. If I may ask, is there a (technical) reason for this behaviour? Even while being conservative, I don't see any reason not to show text like google.com as a link (monorail, for instance, easily recognises the link), and as a user it feels weird that some obvious links are shown as links while other obvious links are not.

Comment 6 by m...@issuu.com, Dec 6 2017

Hmm ... apparently I may not. I guess this is just not a priority :-(
It may be possible to make the heuristic less conservative, but it quickly becomes a hard problem, as you can't just match on the suffix with so many TLDs available now.

If you want to try, we're happy to accept patches, but this isn't something we're looking into at the moment.
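
A small Python sketch of why plain suffix matching gets hard: several real TLDs are also everyday file extensions, so a rule of the form "name.tld" starts linking file names. The TLD set, regular expression, and function name here are illustrative assumptions, not anything taken from PDFium.

```python
import re

# Illustrative subset of real TLDs; md, py, sh and zip double as file extensions.
TLDS = {"com", "org", "md", "py", "sh", "zip"}

# Dot-separated labels ending in an alphabetic label of two or more letters.
_BARE_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$", re.IGNORECASE)


def naive_is_domain(token: str) -> bool:
    """Accept any dotted token whose last label is a known TLD."""
    m = _BARE_RE.match(token)
    return bool(m) and m.group(2).lower() in TLDS


if __name__ == "__main__":
    for token in ["google.com", "readme.md", "main.py", "backup.zip", "hello"]:
        print(token, naive_is_domain(token))
```

Here google.com is accepted as intended, but so are readme.md, main.py, and backup.zip, which in running text are far more likely to be file names than hosts.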
Cc: -hnakashima@chromium.org
Labels: -Pri-2 OS-Chrome OS-Linux OS-Windows Pri-3
Owner: hnakashima@chromium.org
Status: Assigned (was: WontFix)
Sorry, I missed your comment. We could match some common suffixes like .com and .org; I think that would be reasonable.

The relevant code is in core/fpdftext/cpdf_linkextract.cpp

I'll actually leave this open and work on it when I have cycles.
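
As a rough illustration of the proposal, the following Python sketch keeps the existing prefix rule and additionally accepts bare hosts that end in a short allowlist of common suffixes. It is only a sketch under assumptions: the allowlist contents beyond .com and .org, the host regular expression, and all names are hypothetical, and the real change would live in the C++ file named above.

```python
import re

# Assumed allowlist; the comment above names only .com and .org as examples.
COMMON_SUFFIXES = (".com", ".org")

# Existing behavior: explicit scheme or www. prefix.
_PREFIX_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

# A host is two or more labels of letters, digits and inner hyphens.
_LABEL = r"[a-z0-9]([a-z0-9-]*[a-z0-9])?"
_HOST_RE = re.compile(r"^" + _LABEL + r"(\." + _LABEL + r")+$", re.IGNORECASE)


def looks_like_link(token: str) -> bool:
    """Accept prefixed URLs, plus bare hosts with an allowlisted suffix."""
    if _PREFIX_RE.match(token):
        return True
    host = token.split("/", 1)[0]  # ignore any path when checking the suffix
    return bool(_HOST_RE.match(host)) and host.lower().endswith(COMMON_SUFFIXES)


if __name__ == "__main__":
    for token in ["google.com", "www.google.com", "example.org/path", "readme.md"]:
        print(token, looks_like_link(token))
```

An explicit allowlist trades coverage for safety: google.com and example.org/path become links, while tokens such as readme.md are still left alone.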

Comment 9 by m...@issuu.com, Dec 7 2017

Thanks, that sounds really great. Adding support for common domains would be a great improvement.

I found the relevant monorail link recognition here for inspiration: https://chromium.googlesource.com/infra/infra/+/master/appengine/monorail/features/autolink.py
Owner: ----
Status: Available (was: Assigned)
