PDF text sometimes renders wrong in test
Issue description
Chrome Version: tip
OS: Linux x86-64

What steps will reproduce the problem?
I was unable to reproduce this issue locally. It sometimes shows up on the CFI Linux bot, with the following test cases failing:
https://build.chromium.org/p/chromium.fyi/builders/CFI%20Linux/builds/7651
PDFExtensionTest.PdfAccessibilityEnableLater
PDFExtensionTest.PdfAccessibility
PDFExtensionTest.PdfAccessibilityInOOPIF
PDFExtensionTest.PdfAccessibilityInIframe

There is a lot of log spam in the test output, but the immediate cause of the failures looks like the following:

../../chrome/browser/pdf/pdf_extension_test.cc:659: Failure
Value of: kExpectedPDFAXTree == ax_tree_dump
Actual: false
Expected: true

Expected:
embeddedObject
group
region 'Page 1'
paragraph
staticText '1 First Section '
inlineTextBox '1 '
inlineTextBox 'First Section '
paragraph
staticText 'This is the first section. 1'
inlineTextBox 'This is the first section. '
inlineTextBox '1'
region 'Page 2'
paragraph
staticText '1.1 First Subsection '
inlineTextBox '1.1 '
inlineTextBox 'First Subsection '
paragraph
staticText 'This is the first subsection. 2'
inlineTextBox 'This is the first subsection. '
inlineTextBox '2'
region 'Page 3'
paragraph
staticText '2 Second Section '
inlineTextBox '2 '
inlineTextBox 'Second Section '
paragraph
staticText '3'
inlineTextBox '3'

Actual:
embeddedObject
group
region 'Page 1'
paragraph
staticText '1 First Section '
inlineTextBox '1 '
inlineTextBox 'First Section '
paragraph
staticText 'This is the rst section. 1'
inlineTextBox 'This is the rst section. '
inlineTextBox '1'
region 'Page 2'
paragraph
staticText '1.1 First Subsection '
inlineTextBox '1.1 '
inlineTextBox 'First Subsection '
paragraph
staticText 'This is the rst subsection. 2'
inlineTextBox 'This is the rst subsection. 
'
inlineTextBox '2'
region 'Page 3'
paragraph
staticText '2 Second Section '
inlineTextBox '2 '
inlineTextBox 'Second Section '
paragraph
staticText '3'
inlineTextBox '3'

As you can see, the main difference is "This is the rst subsection." instead of "This is the first subsection.". It reminds me of a race condition, but I am not sure that is what is happening here. If anyone has any hints, please post them here.
,
Mar 14 2017
If this test were reproducible, this would be the way to build it:
GYP_DEFINES='buildtype=Official' gclient sync
gn gen out/cfi '--args=is_debug=false is_cfi=true is_component_build=false' --check
ninja -C out/cfi browser_tests # Will take ~40 minutes at the last link step
./out/cfi/browser_tests --gtest_filter=PDFExtensionTest.PdfAccessibility
I believe this test failure has nothing to do with CFI and is related to timing. No evidence, though.
,
Mar 14 2017
dmazzoni added these tests. Over to him.
,
Mar 15 2017
This is still a problem, as many bots are red due to this bug: https://build.chromium.org/p/chromium.fyi/builders/CFI%20Linux%20ToT https://build.chromium.org/p/chromium.fyi/builders/CFI%20Linux https://build.chromium.org/p/chromium.fyi/builders/CFI%20Linux%20Full
,
Mar 15 2017
This is a very mysterious failure. I can't think of what could cause this type of corruption. Is it reasonable to assume these changes are CFI-related?
,
Mar 15 2017
CFI (in this incarnation) does a very simple thing: if it does not like a virtual call, it simply aborts the process with a UD2 instruction. Not only do I not observe any aborts here, but CFI failures are also very deterministic. Another possibility is a miscompilation of some sort, but such issues are deterministic as well. What happens while the test output is generated? Are there threads or processes that communicate with each other, or is it just a series of functions invoked one after another?
,
Mar 15 2017
Yes, this test is asynchronous. Data is sent from the PDF process to the render process, and from the render process to the browser process. Almost everything is done with strings and simple data structures, though, so it's not clear how we could get errors such as that one. I'll try to reproduce locally with those compile flags.
,
Mar 16 2017
Great news! The test is now failing on 'ThinLTO Linux ToT' bot, which has nothing to do with CFI: https://build.chromium.org/p/chromium.fyi/builders/ThinLTO%20Linux%20ToT/builds/1307 The failure message is the same. It really feels like a race condition of a sort.
,
Mar 17 2017
I plan to disable these tests on all buildbots, as the tests are broken and no action has been taken for 3 days:
PDFExtensionTest.PdfAccessibilityEnableLater
PDFExtensionTest.PdfAccessibility
PDFExtensionTest.PdfAccessibilityInOOPIF
PDFExtensionTest.PdfAccessibilityInIframe
I will create a CL for that soon. Please object if there are reasons not to do that.
,
Mar 17 2017
I sent https://codereview.chromium.org/2751973009/ for a review.
,
Mar 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a5082d6ce45219eba13fae950a5fbdda07fe3442 commit a5082d6ce45219eba13fae950a5fbdda07fe3442 Author: krasin <krasin@chromium.org> Date: Sat Mar 18 00:12:51 2017 Disable 4 PDFExtensionTest test cases as they fail on multiple bots. BUG=701427 Review-Url: https://codereview.chromium.org/2751973009 Cr-Commit-Position: refs/heads/master@{#457908} [modify] https://crrev.com/a5082d6ce45219eba13fae950a5fbdda07fe3442/chrome/browser/pdf/pdf_extension_test.cc
,
Mar 20 2017
Interesting - I can reproduce this locally, but when I open the PDF in Chrome it's broken in a similar way. See attached screenshot. Visually it shows "This is the rst section" instead of "This is the first section". Possibly a font issue? I don't understand why this would work on some bots but not others. Either way, it looks like the bug is not in the accessibility code or in the test; rather, the accessibility test is surfacing a real error somewhere else. Reassigning to raymes@ to triage and tell me whether this looks like a real bug or a known issue to work around. If the latter, I'll modify the test to make it tolerant of this issue.
,
Mar 20 2017
Specifically this looks like an issue with the "fi" ligature.
,
Mar 20 2017
Hi Dominic, thank you for digging into this. That definitely moves us one step closer to the understanding of the real issue.
,
Mar 20 2017
I created this change to re-enable the tests. https://codereview.chromium.org/2760053002
,
Mar 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9d1abef69b6441eba82f137f2b38ef3d6935182a commit 9d1abef69b6441eba82f137f2b38ef3d6935182a Author: dmazzoni <dmazzoni@chromium.org> Date: Wed Mar 22 19:24:09 2017 Re-enable 4 PDF accessibility tests by making them more robust. Work around issues where the PDF plug-in is returning inconsistent string on different platforms, regarding whitespace and "fi" ligatures. BUG=701427 Review-Url: https://codereview.chromium.org/2760053002 Cr-Commit-Position: refs/heads/master@{#458835} [modify] https://crrev.com/9d1abef69b6441eba82f137f2b38ef3d6935182a/chrome/browser/pdf/pdf_extension_test.cc
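For readers following along, here is a rough sketch of what "tolerant of whitespace and fi ligatures" could mean when comparing two tree dumps. It is not the code from the CL above; the function name `normalize_ax_dump` and the exact normalization rules are assumptions made for illustration only.

```python
import re


def normalize_ax_dump(dump: str) -> str:
    """Return a comparison-friendly form of an accessibility tree dump.

    Hypothetical helper, not taken from pdf_extension_test.cc.
    """
    # Drop both the precomposed ligature (U+FB01) and the plain letter pair,
    # so 'first', '\ufb01rst' and 'rst' all reduce to the same string.
    text = dump.replace("\ufb01", "").replace("fi", "")
    # Collapse every run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


expected = "staticText 'This is the first subsection. 2'"
actual = "staticText 'This is the rst subsection. \n2'"
print(normalize_ax_dump(expected) == normalize_ax_dump(actual))  # prints True
```

The obvious cost of this approach is that it also hides real regressions that happen to involve the letters "fi", so it only makes sense as a stopgap while the underlying rendering issue is open.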
,
Mar 24 2017
Thanks for looking at it dmazzoni. This seems like a rendering issue in the plugin then. Assigning to dsinclair.
,
Mar 27 2017
npm@ can you check if this is something weird in the font code?
,
Mar 27 2017
I'm unable to reproduce. When I open test-bookmark.pdf on Chrome 57.0.2987.110 or 59.0.3047.0, I see the correct text. On which Chrome version and OS were you able to see a problem?
,
Mar 27 2017
,
Mar 28 2017
See above, it was reproducing consistently on some of our bots. On my Linux workstation, the bug reproduces with a vanilla open-source Chromium build, but not with an official Google Chrome build. That may be a coincidence but I wonder if Chromium doesn't include something useful for dealing with ligatures...
,
Mar 28 2017
I've looked at this and don't see a problem. I also don't see how the rendering could be flaky.
* The fonts are embedded, so character rendering should be pretty consistent.
* The only strange thing about the "fi" is that it is represented by "\014" in the PDF. This is allowed under Table 3.2 of PDF spec 1.7; it's octal for charcode 12. But we handle that correctly.
The only thing I can think of is this: freetype was updated for the bots, it does not like the embedded fonts anymore, and we have to find substitutes. We then fail to render properly with these. But that still doesn't explain the rendering in #12. I'm stuck until I can reproduce that.
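To make the "\014" point concrete, here is a minimal sketch of decoding octal escapes in a PDF literal string. It handles only the `\ddd` form and is not PDFium's parser; the function name `decode_octal_escapes` and the sample string are made up for illustration.

```python
import re


def decode_octal_escapes(literal: str) -> str:
    """Decode backslash-octal escapes (one to three digits) in a string."""

    def replace(match: re.Match) -> str:
        # Interpret the digits as base 8 and keep only the low 8 bits.
        return chr(int(match.group(1), 8) & 0xFF)

    return re.sub(r"\\([0-7]{1,3})", replace, literal)


# Illustrative content string: the ligature glyph is addressed by charcode 12.
raw = r"This is the \014rst section."
decoded = decode_octal_escapes(raw)
print(ord(decoded[12]))  # prints 12
```

Charcode 12 only selects a glyph in the embedded font's encoding. If the embedded font cannot be loaded and a substitute is used, nothing guarantees that code 12 maps to an "fi" glyph there, which would be consistent with the pair going missing.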
,
Mar 28 2017
Anything you'd like me to check locally since I was able to reproduce it?
,
Mar 28 2017
Bad rendering reproduces for you on a clean ToT build, correct? If it does reproduce consistently, a bisect would help (probably something close to the date of this bug report?). What do you get if you run: freetype-config --ftversion
,
Mar 29 2017
I just tried bisecting. The builds I got from the archive all worked fine - everything was "good". But when I try my own trunk build of Chrome from the same machine, it fails.
> freetype-config --ftversion
2.5.2
My gn args:
is_component_build = true
is_debug = false
use_goma = true
,
Mar 29 2017
As another data point, I tried to reproduce this bug on many machines (my desktops and Google Compute Engine instances of various sorts), to no avail. This is something about the system.
,
Mar 30 2017
That's also my ftversion. So I don't know what the problem is.
,
Mar 30 2017
This is something wacky with freetype. Using the system freetype (the default on linux) I have this example failing. The failure is in FT_New_Memory_Face returning Freetype Error Code 2 (which I believe is Unknown_File_Format). If I then set pdf_bundle_freetype = true to force the use of our internal freetype the file works correctly and I get the 'fi'. As far as I can tell (from the dpkg version) my system freetype is the same as npm@'s so I don't know why it would fail for me and work for npm@. Adding drott@ in case there is something about freetype that we're missing here? (It looks like my system freetype is 2.5.2-1ubuntu2.6 and the internal one is listed as VER-2-7-1-updates)
,
Mar 30 2017
I'm guessing your FreeType needs type1 and/or specifically type1cid module support. I saw issues with these accessibility tests when moving to a shared FreeType in Chromium, because the test file seems to use Type1 fonts. Without Type1 font support it at least fell back to Arial or some other sans-serif font. I did not check what the difference is if I disable only type1cid. Even if you have identical version numbers of FreeType, perhaps your system FreeTypes differ in module configuration and in the things they compile in? You can experiment with this by removing modules and files from third_party/BUILD.gn to force-reproduce the same error. The difference between Chromium's FreeType and PDFium's was in type1.c, type1cid.c and psaux.c, and
FT_USE_MODULE( FT_Driver_ClassRec, t1_driver_class )
FT_USE_MODULE( FT_Driver_ClassRec, t1cid_driver_class )
FT_USE_MODULE( FT_Module_Class, psaux_module_class )
respectively.
,
Mar 31 2017
Can we just fail all PDF tests if you try to build with system freetype? We could just have the test suite fail with an error message saying to rebuild with our built-in FreeType. That should probably be the bot configuration. Alternatively, could we at least spew a message to the console when this happens, explaining that we didn't get Type-1 font support and that PDF bugs should be expected unless you fix FreeType?
,
Mar 31 2017
We can't fail tests if using system freetype, that's the default for Linux. Unless we start shipping Freetype on Linux as well, I think it makes sense to test with system freetype. I think it is reasonable to add a message when Freetype fails to load an embedded font with Unknown_File_Format. But as far as I know we don't have this message spewing set up for internal PDFium methods. For now, your test could probably check for "first" on OS!=Linux, but keep "*rst" on Linux.
,
Apr 5 2017
,
Jun 13 2017
I do not understand the effect on white space or "fi" ligatures, but I do have a program that reads PDFs created by Chrome and converted to text by PDFBox, and it has broken with Chrome updates. The first time was sometime around October 2016; I have lost the details. The last time was around 6/9. The creator of the reports ran on 6/10 and 6/12, and in that time the report format changed (I suspect with Chrome 59.0.3071.86 (Official Build)).
The October change involved x'C2A0' changing to a space or vice versa. The 59.0.3071.86 change was similar, but I am more familiar with it. To correct my program, I had to make two changes to get similar results in either format. I changed the text converted from the PDF as follows:
change x'20C2A00A' to x'0A', then
change x'C2A0' to x'20'
In java:
final String TAGC2A0 = "\u00A0"; // Assumed definition: NBSP, which is x'C2A0' in UTF-8
text = text.replace(" " + TAGC2A0 + "\n", "\n"); // Replace SP + NBSP + LF with LF
text = text.replace(TAGC2A0, " "); // Replace NBSP with SP
It then works as before.
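The same two replacements can also be written directly on the UTF-8 bytes, which matches the hex values quoted above. This is only a sketch; the function name `normalize_nbsp` and the sample text are illustrative assumptions, not part of the reporter's program.

```python
def normalize_nbsp(data: bytes) -> bytes:
    """Apply the two byte-level replacements described above, in order."""
    # x'20C2A00A' -> x'0A': drop a trailing "space + NBSP" before a line feed.
    data = data.replace(b"\x20\xc2\xa0\x0a", b"\x0a")
    # x'C2A0' -> x'20': turn every remaining NBSP into a plain space.
    return data.replace(b"\xc2\xa0", b"\x20")


# Made-up sample text containing both patterns.
sample = "Total \u00a0\nAmount\u00a0due".encode("utf-8")
print(normalize_nbsp(sample))  # prints b'Total\nAmount due'
```

The order matters: if the NBSP-to-space replacement ran first, the line would end in two spaces instead of being trimmed.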
I have no control over the created report (only read). It was created on a windows machine.
In looking at the logs of changes for this chrome release, I found this bug report. Have no idea if related, but only saw this and one other report related to PDFs.
,
Jan 11 2018
I hope it's okay to jump in here. I have some more details and a different repro case that might shed some light. When using a web font, I see the same behaviour where certain characters appear to automatically be treated as a pair in the browser (e.g. "fi" or "fl"). If you try to select them with your mouse, they can only be selected as a pair. If you print this page to PDF, the pair does not appear to get saved as text. The rendering of the PDF looks correct, however, so perhaps it's being converted to an image instead? Unfortunately the real impact of this issue is that it breaks the ability to search the PDF for words containing those character pairs.
Chrome: Version 63.0.3239.132
OS: Windows 10
To test this:
1. Open the attached HTML in Chrome, which uses a web font from fonts.googleapis.com
2. Notice you can select the word "verification" letter by letter on the first line (no font specified) but that the characters "fi" get selected as a pair in the second line (using a Google web font).
3. Print the page to PDF
4. Open the resulting PDF and search/find for "verification". Note that the web font version is not found.
5. If you select the web font "verification" and copy/paste it into a text editor, the "fi" pair is missing.
I've attached the PDF as well if you just want to inspect that.
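One possible workaround on the consuming side, sketched here under the assumption that the extracted text contains the precomposed ligature character (U+FB01 for "fi", U+FB02 for "fl"), is to fold ligatures with Unicode compatibility normalization before searching. It cannot help when the pair is missing from the extracted text altogether, as in step 5. The function name `fold_ligatures` is illustrative.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", text)


# Made-up extraction result containing the single-codepoint "fi" ligature.
extracted = "veri\ufb01cation"
print("verification" in extracted)                  # prints False
print("verification" in fold_ligatures(extracted))  # prints True
```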
,
Jan 14
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by dsinclair@chromium.org
, Mar 14 2017