Headings are missing after distillation
Issue description
Version: M55
What steps will reproduce the problem?
(1) Distill http://www.electoral-vote.com/evp2016/Pres/Maps/Sep14.html
What is the expected output? The whole main content.
What do you see instead? <h4> headers are gone.
Sep 27 2016
I have also experienced this (headers missing and replaced by "ElectoralVote" as the heading).
Sep 27 2016
The root cause here is the (arguably) improper use of heading tags. All the tags are <h4>, while they should've been <h1>, or at least <h2>. We only recognize <h1> to <h3> as headings right now.
After some software archaeology, it turns out boilerpipe also only recognizes <h1> to <h3>:
https://github.com/kohlschutter/boilerpipe/blob/2c78035a830282e2435c466f3f14d6d4104d0a94/boilerpipe-common/src/main/java/com/kohlschutter/boilerpipe/sax/DefaultTagActionMap.java#L78
However, the original paper did mention all of h1 to h6:
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
I guess it's probably OK to add all headings, including h4 to h6.
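To make the failure mode concrete, here is a minimal Python sketch of a heading collector whose set of recognized tags is configurable. It is not the DOM Distiller or boilerpipe code: the names `HeadingCollector`, `extract_headings`, `HEADINGS_H1_TO_H3` and `HEADINGS_H1_TO_H6`, as well as the sample page, are made up for illustration. It only shows why a page that uses <h4> everywhere loses its headings when the recognized set stops at <h3>.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for illustration; not taken from DOM Distiller or boilerpipe.
HEADINGS_H1_TO_H3 = frozenset({"h1", "h2", "h3"})
HEADINGS_H1_TO_H6 = frozenset({"h1", "h2", "h3", "h4", "h5", "h6"})


class HeadingCollector(HTMLParser):
    """Collects the text of elements whose tag is in `heading_tags`."""

    def __init__(self, heading_tags):
        super().__init__()
        self.heading_tags = heading_tags
        self.headings = []
        self._current = None  # text parts of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.heading_tags:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.heading_tags and self._current is not None:
            self.headings.append("".join(self._current).strip())
            self._current = None


def extract_headings(html, heading_tags):
    """Return the heading texts found in `html` for the given tag set."""
    parser = HeadingCollector(heading_tags)
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up sample page that, like the reported one, uses only <h4>.
page = "<h4>Senate polls</h4><p>Body text.</p><h4>News</h4>"
print(extract_headings(page, HEADINGS_H1_TO_H3))  # []
print(extract_headings(page, HEADINGS_H1_TO_H6))  # ['Senate polls', 'News']
```

With the narrower set, the <h4> text is not treated as a heading at all, which matches the symptom in the report. Widening the set to h1 through h6 is the change proposed above.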
Sep 28 2016
Thanks for looking into this so quickly!
Oct 20 2016
Issue 657682 has been merged into this issue.
Oct 24 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a22aa4380f04d2d8aa18f1f3098393516646f181

commit a22aa4380f04d2d8aa18f1f3098393516646f181
Author: wychen <wychen@chromium.org>
Date: Mon Oct 24 19:38:13 2016

Roll DOM Distiller JavaScript distribution package

Diff since last roll:
https://github.com/chromium/dom-distiller/compare/d16a68c1b8...072fe57b48

Picked up changes:
072fe57 Recognize H4 to H6 as headings as well
52047b4 Avoid using getClassName() to avoid issues with <svg>
8cf93ce Bump ChromeDriver version to 2.24
d876125 Add gen_mhtml_corpus.py to convert MHTML to eval corpus
8b33c8b Amend "Fix partially hidden article"
3fd2017 Strip unwanted classNames from all nodes

BUG=593457,599121,647098,658038
Review-Url: https://codereview.chromium.org/2447453002
Cr-Commit-Position: refs/heads/master@{#427118}

[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/DEPS
[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/third_party/dom_distiller_js/README.chromium
Comment 1 by wychen@chromium.org, Sep 15 2016