
Issue 647098

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug




Headings are missing after distillation

Project Member Reported by wychen@chromium.org, Sep 15 2016

Issue description

Version: M55

What steps will reproduce the problem?
(1) Distill http://www.electoral-vote.com/evp2016/Pres/Maps/Sep14.html

What is the expected output?
The whole main content.

What do you see instead?
<h4> headers are gone.

 
Cc: noyau@chromium.org
I have also experienced this (headers missing and replaced by "ElectoralVote" as the heading).
Labels: OS-Android OS-iOS

Comment 4 by wychen@chromium.org, Sep 27 2016

Cc: k...@chromium.org
Labels: -OS-Android -OS-iOS OS-All
The root cause here is the (arguably) improper use of heading tags. All the tags are <h4>, while they should've been <h1>, or at least <h2>. We only recognize <h1> to <h3> as headings right now.

After some software archaeology, it turns out boilerpipe also only recognizes <h1> to <h3>.
https://github.com/kohlschutter/boilerpipe/blob/2c78035a830282e2435c466f3f14d6d4104d0a94/boilerpipe-common/src/main/java/com/kohlschutter/boilerpipe/sax/DefaultTagActionMap.java#L78

However, the original paper did mention all of <h1> to <h6>.
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf

I guess it's probably OK to add all headings, including h4 to h6.
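
To illustrate the difference between the two cutoffs, here is a minimal sketch in Python. It is not the DOM Distiller or boilerpipe code; the names HeadingCollector, collect_headings, and max_level, and the sample markup, are all made up for this example. It only shows how a page whose headings are all <h4> yields no headings when the cutoff is <h3>, and yields all of them when the cutoff is <h6>.

```python
import re
from html.parser import HTMLParser

# Matches h1..h6 and captures the level digit.
HEADING_RE = re.compile(r"^h([1-6])$")


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collects the text of headings up to max_level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level
        self.headings = []
        self._open_tag = None  # heading tag currently being read, if any
        self._buf = []

    def _is_heading(self, tag):
        m = HEADING_RE.match(tag)
        return bool(m) and int(m.group(1)) <= self.max_level

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if self._open_tag is None and self._is_heading(tag):
            self._open_tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.headings.append("".join(self._buf).strip())
            self._open_tag = None


def collect_headings(html, max_level):
    """Return heading texts for <h1>..<h{max_level}> in document order."""
    parser = HeadingCollector(max_level)
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up page that, like the reported one, uses only <h4> for headings.
SAMPLE = "<h4>First section</h4><p>Body text.</p><h4>Second section</h4>"

print(collect_headings(SAMPLE, 3))  # cutoff at <h3>
print(collect_headings(SAMPLE, 6))  # cutoff at <h6>
```

With the cutoff at 3, the <h4> elements are never treated as headings, which matches the symptom reported here; raising the cutoff to 6 picks them up.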
Thanks for looking into this so quickly! 

Comment 6 by wychen@chromium.org, Oct 20 2016

Issue 657682 has been merged into this issue.
Project Member

Comment 7 by bugdroid1@chromium.org, Oct 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a22aa4380f04d2d8aa18f1f3098393516646f181

commit a22aa4380f04d2d8aa18f1f3098393516646f181
Author: wychen <wychen@chromium.org>
Date: Mon Oct 24 19:38:13 2016

Roll DOM Distiller JavaScript distribution package

Diff since last roll:
https://github.com/chromium/dom-distiller/compare/d16a68c1b8...072fe57b48

Picked up changes:
072fe57 Recognize H4 to H6 as headings as well
52047b4 Avoid using getClassName() to avoid issues with <svg>
8cf93ce Bump ChromeDriver version to 2.24
d876125 Add gen_mhtml_corpus.py to convert MHTML to eval corpus
8b33c8b Amend "Fix partially hidden article"
3fd2017 Strip unwanted classNames from all nodes

BUG=593457,599121,647098,658038

Review-Url: https://codereview.chromium.org/2447453002
Cr-Commit-Position: refs/heads/master@{#427118}

[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/DEPS
[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/third_party/dom_distiller_js/README.chromium

Comment 8 by wychen@chromium.org, Oct 24 2016

Status: Fixed (was: Untriaged)
