
Issue 613374

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Missing headings followed by images with captions

Project Member Reported by wychen@chromium.org, May 19 2016

Issue description

Version: M51

What steps will reproduce the problem?
(1) Enable DOM distiller
(2) Distill http://www.vox.com/2015/9/5/9265501/refugee-crisis-europe-syria

What is the expected output?
Correct extraction of contents.

What do you see instead?
The headings "The war and repression driving this unprecedented crisis" and "How the Arab Spring jump-started the refugee crisis" are missing.

These headings are initially labelled as content. However, they are followed by images with captions, and those captions are labelled as non-content. The heading fusion pass then relabels the headings as non-content as well.

One possible solution is to extract the <figure> as a whole, so the caption wouldn't interfere with the content identification.
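
To make the failure mode concrete, here is a toy model of the behaviour described above. It is not the DOM Distiller code: the `Block` class, the `in_figure` flag, and the rule "a content heading directly followed by a non-content block is demoted" are simplifying assumptions, chosen only to reproduce the symptom and to show why pulling the <figure> out as a whole would avoid it.

```python
from dataclasses import dataclass, replace


@dataclass
class Block:
    """Hypothetical stand-in for a text block seen by the content classifier."""
    kind: str            # "heading", "caption", or "text" (illustrative labels)
    text: str
    is_content: bool
    in_figure: bool = False


def heading_fusion(blocks):
    """Assumed rule: demote a content heading whose next block is non-content."""
    out = [replace(b) for b in blocks]  # copy so the input list is left untouched
    for prev, cur in zip(out, out[1:]):
        if prev.kind == "heading" and prev.is_content and not cur.is_content:
            prev.is_content = False
    return out


def extract_figures(blocks):
    """Split off blocks that belong to a <figure> so they are handled separately."""
    figures = [b for b in blocks if b.in_figure]
    rest = [b for b in blocks if not b.in_figure]
    return figures, rest


blocks = [
    Block("heading", "The war and repression driving this unprecedented crisis", True),
    Block("caption", "(image caption)", False, in_figure=True),
    Block("text", "(article paragraph)", True),
]

# Caption stays in the block sequence: the heading is lost.
naive = heading_fusion(blocks)

# Figure extracted as a whole first: the heading is now followed by content.
figures, rest = extract_figures(blocks)
fixed = heading_fusion(rest)

print(naive[0].is_content, fixed[0].is_content)
```

In this sketch the caption no longer sits between the heading and the paragraph once the figure is extracted, so the heading keeps its content label.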

 

Comment 1 by marcelor...@hp.com, May 31 2016

Code review for this issue: https://codereview.chromium.org/2020403002

Comment 2 by wychen@chromium.org, Aug 12 2016

Applying https://codereview.chromium.org/2020403002 as is to http://www.vox.com/2015/9/5/9265501/refugee-crisis-europe-syria leads to other issues: in some cases, the escaped content of <noscript> is shown as <figcaption>.

Comment 3 by bugdroid1@chromium.org, Aug 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d1730009565e394ff8c0ad04121d2e44d0c3d166

commit d1730009565e394ff8c0ad04121d2e44d0c3d166
Author: wychen <wychen@chromium.org>
Date: Fri Aug 12 23:10:17 2016

Roll DOM Distiller JavaScript distribution package

Diff since last roll:
https://github.com/chromium/dom-distiller/compare/6c16f14405...91f9f016e0

Picked up changes:
91f9f01 Fix figcaption generation
365c44e Add support for figure element
f8f3308 Update distillability modeling scripts to predict long articles
8a12e18 Decrease mismatches in feature extraction
4d7ab13 Extract image URLs in WebTables
8d8063a Extract image URLs in srcset as well
34c4a18 Re-enable tests containing <track> in CI
0d4286b The display style of WebText root element should never be inline

BUG=531545,539851,595120,610944,613374,625621,631086,637170

Review-Url: https://codereview.chromium.org/2245763002
Cr-Commit-Position: refs/heads/master@{#411811}

[modify] https://crrev.com/d1730009565e394ff8c0ad04121d2e44d0c3d166/DEPS
[modify] https://crrev.com/d1730009565e394ff8c0ad04121d2e44d0c3d166/third_party/dom_distiller_js/README.chromium

Comment 4 by wychen@chromium.org, Aug 15 2016

Owner: wychen@chromium.org
Status: Fixed (was: Untriaged)
