Missing headings followed by images with captions
Issue description

Version: M51

What steps will reproduce the problem?
(1) Enable DOM distiller
(2) Distill http://www.vox.com/2015/9/5/9265501/refugee-crisis-europe-syria

What is the expected output?
Correct extraction of contents.

What do you see instead?
The headings "The war and repression driving this unprecedented crisis" and "How the Arab Spring jump-started the refugee crisis" are missing.

These headings are labelled as content at first. However, each is followed by an image with a caption, and that caption is labelled as non-content. In the heading fusion pass, the headings are then relabelled as non-content.

One possible solution is to extract the <figure> as a whole, so the caption wouldn't interfere with the content identification.
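The interaction described above can be illustrated with a small model. This is only a sketch under stated assumptions, not DOM Distiller's actual code: `Block`, `fuse_headings`, and `kept_text` are hypothetical names, and the rule "a heading survives only if the next block is content" is a simplification of the heading fusion pass as the report describes it.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block with the two labels the report talks about."""
    text: str
    is_content: bool
    is_heading: bool = False


def fuse_headings(blocks):
    """Simplified heading fusion: demote a heading unless the next block is content."""
    result = []
    for i, block in enumerate(blocks):
        keep = block.is_content
        if block.is_heading and block.is_content:
            next_block = blocks[i + 1] if i + 1 < len(blocks) else None
            keep = next_block is not None and next_block.is_content
        result.append(Block(block.text, keep, block.is_heading))
    return result


def kept_text(blocks):
    """Return the text of every block still labelled as content."""
    return [b.text for b in blocks if b.is_content]


HEADING = "The war and repression driving this unprecedented crisis"

# Caption classified on its own: it is non-content, so the heading is demoted.
separate = [
    Block(HEADING, True, is_heading=True),
    Block("(image caption)", False),
    Block("(article paragraph)", True),
]

# <figure> extracted as a whole: the caption is no longer a separate text block.
whole_figure = [
    Block(HEADING, True, is_heading=True),
    Block("(article paragraph)", True),
]

print(kept_text(fuse_headings(separate)))
print(kept_text(fuse_headings(whole_figure)))
```

In the first case the heading is dropped because its neighbour is the non-content caption. In the second it is kept, which is the effect the proposed whole-<figure> extraction aims for.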
Aug 12 2016
Using https://codereview.chromium.org/2020403002 as is on http://www.vox.com/2015/9/5/9265501/refugee-crisis-europe-syria leads to other issues. In some cases, the escaped content of <noscript> is shown inside the <figcaption>.
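One way to keep <noscript> fallback markup out of a caption is to skip that subtree while collecting the caption text. The sketch below assumes the caption is gathered by walking the <figure> markup. `CaptionText` and `caption_text` are hypothetical names, the sample HTML is made up for illustration, and this is not the fix that landed in DOM Distiller.

```python
from html.parser import HTMLParser


class CaptionText(HTMLParser):
    """Collect text inside <figcaption>, ignoring anything under <noscript>."""

    def __init__(self):
        super().__init__()
        self.in_caption = 0
        self.in_noscript = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "figcaption":
            self.in_caption += 1
        elif tag == "noscript":
            self.in_noscript += 1

    def handle_endtag(self, tag):
        if tag == "figcaption" and self.in_caption:
            self.in_caption -= 1
        elif tag == "noscript" and self.in_noscript:
            self.in_noscript -= 1

    def handle_data(self, data):
        # Keep caption text only when we are not inside a <noscript> subtree.
        if self.in_caption and not self.in_noscript:
            self.parts.append(data)


def caption_text(html):
    """Return the whitespace-normalized caption text of the given markup."""
    parser = CaptionText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up sample: the <noscript> holds escaped fallback markup for the image.
sample = (
    '<figure><img src="a.jpg">'
    "<figcaption>Refugees at a border crossing."
    '<noscript>&lt;img src="a.jpg"&gt;</noscript>'
    "</figcaption></figure>"
)

print(caption_text(sample))
```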
Aug 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/d1730009565e394ff8c0ad04121d2e44d0c3d166

commit d1730009565e394ff8c0ad04121d2e44d0c3d166
Author: wychen <wychen@chromium.org>
Date: Fri Aug 12 23:10:17 2016

Roll DOM Distiller JavaScript distribution package

Diff since last roll:
https://github.com/chromium/dom-distiller/compare/6c16f14405...91f9f016e0

Picked up changes:
91f9f01 Fix figcaption generation
365c44e Add support for figure element
f8f3308 Update distillability modeling scripts to predict long articles
8a12e18 Decrease mismatches in feature extraction
4d7ab13 Extract image URLs in WebTables
8d8063a Extract image URLs in srcset as well
34c4a18 Re-enable tests containing <track> in CI
0d4286b The display style of WebText root element should never be inline

BUG=531545,539851,595120,610944,613374,625621,631086,637170

Review-Url: https://codereview.chromium.org/2245763002
Cr-Commit-Position: refs/heads/master@{#411811}

[modify] https://crrev.com/d1730009565e394ff8c0ad04121d2e44d0c3d166/DEPS
[modify] https://crrev.com/d1730009565e394ff8c0ad04121d2e44d0c3d166/third_party/dom_distiller_js/README.chromium
Aug 15 2016
Comment 1 by marcelor...@hp.com, May 31 2016