Distiller misses the main content due to wrong table classification
Issue description

Version: M55
OS: All

What steps will reproduce the problem?
(1) Distill http://www.bpl.org/general/history.htm

What is the expected output?
See the main content.

What do you see instead?
Only the footer is available.

The MHTML snapshot is attached, in case the page changes its content in the future.
Oct 7 2016
Diagnosis: According to the table classification heuristics defined here: java/org/chromium/distiller/TableClassifier.java

The first rule that matches is:

11) Table having >=5 columns is data table.

The algorithm is taken from: http://asurkov.blogspot.com/2011/10/data-vs-layout-table.html

Some possible ideas: We could add a new rule before rule #11, and probably after rule #9.

1. If the textContent of the table is long enough, treat it as a layout table. This might misclassify really large data tables.
2. If the table contains many <p> elements, treat it as a layout table, since data tables usually don't use <p>. We got lucky that this page uses <p> in the article properly. Some articles just use <div> with lots of <br> for paragraphs.
3. If the table contains many <h[1-6]> heading tags, treat it as a layout table. We also got lucky that this page uses <h?> properly. Some would use <font> to make headings look larger without proper semantics.
4. Augment the 1-row 1-col rule: if a row is >95% of the table height (0.986 in our case on Nexus 6P), or if a cell is >95% of the table width (not useful in our case), treat it as a layout table. Two-column tables like pro vs. con with a long list in the cells might wrongly fall into this category.

The precision and recall should be tested on a larger, representative corpus to make sure the negative effect (if any) is acceptable. If there's no output change across all the corpora, our approach might be too specific.
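
The four ideas above could be sketched as separate predicates, so each one can be switched on and measured on its own. This is only an illustration, not code from TableClassifier.java: the TableStats class, the method names, and every threshold except the 95% ratio from idea 4 are assumptions made for the example, and the real classifier works on DOM elements rather than on a precomputed summary.

```java
/** Hypothetical, DOM-free summary of one <table>; not part of dom-distiller. */
final class TableStats {
    final int textLength;           // length of the table's textContent
    final int paragraphCount;       // number of <p> descendants
    final int headingCount;         // number of <h1>..<h6> descendants
    final double maxRowHeightRatio; // tallest row height / table height
    final double maxCellWidthRatio; // widest cell width / table width

    TableStats(int textLength, int paragraphCount, int headingCount,
               double maxRowHeightRatio, double maxCellWidthRatio) {
        this.textLength = textLength;
        this.paragraphCount = paragraphCount;
        this.headingCount = headingCount;
        this.maxRowHeightRatio = maxRowHeightRatio;
        this.maxCellWidthRatio = maxCellWidthRatio;
    }
}

/** Candidate layout-table rules that would run before the ">=5 columns" rule. */
final class LayoutTableHeuristics {
    // Placeholder thresholds (assumptions); they would need tuning on a corpus.
    static final int MIN_TEXT_LENGTH = 5000;
    static final int MIN_PARAGRAPHS = 5;
    static final int MIN_HEADINGS = 3;
    // Taken from idea 4: a row or cell covering more than 95% of the table.
    static final double DOMINANT_RATIO = 0.95;

    // Idea 1: very long text suggests an article wrapped in a table.
    static boolean hasLongText(TableStats s) {
        return s.textLength >= MIN_TEXT_LENGTH;
    }

    // Idea 2: many <p> elements.
    static boolean hasManyParagraphs(TableStats s) {
        return s.paragraphCount >= MIN_PARAGRAPHS;
    }

    // Idea 3: many heading tags.
    static boolean hasManyHeadings(TableStats s) {
        return s.headingCount >= MIN_HEADINGS;
    }

    // Idea 4: one row or one cell dominates the table's height or width.
    static boolean hasDominantRowOrCell(TableStats s) {
        return s.maxRowHeightRatio > DOMINANT_RATIO
            || s.maxCellWidthRatio > DOMINANT_RATIO;
    }

    // Treat the table as layout if any candidate rule fires.
    static boolean looksLikeLayoutTable(TableStats s) {
        return hasLongText(s) || hasManyParagraphs(s)
            || hasManyHeadings(s) || hasDominantRowOrCell(s);
    }
}
```

With the row ratio reported above (0.986), hasDominantRowOrCell returns true regardless of the other fields, while a table whose rows and cells are all small and whose text is short falls through to the existing rules.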
Oct 7 2016
I'll create a large corpus of non-mobile-friendly distillable pages to test the ideas.
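
For reference, scoring one candidate rule against such a corpus might look like the sketch below. It assumes each table in the corpus has been hand-labeled as layout or data; the Sample class and the method names are made up for this example and do not correspond to any existing tooling.

```java
import java.util.List;

final class RuleEvaluation {
    /** One hand-labeled table plus the candidate rule's verdict (hypothetical). */
    static final class Sample {
        final boolean isLayout;        // human label
        final boolean predictedLayout; // what the candidate rule returned

        Sample(boolean isLayout, boolean predictedLayout) {
            this.isLayout = isLayout;
            this.predictedLayout = predictedLayout;
        }
    }

    // Precision: of the tables the rule called layout, the fraction that really are.
    static double precision(List<Sample> samples) {
        int truePositives = 0;
        int falsePositives = 0;
        for (Sample s : samples) {
            if (s.predictedLayout) {
                if (s.isLayout) truePositives++; else falsePositives++;
            }
        }
        int predicted = truePositives + falsePositives;
        // Returning 0 when the rule never fires is a convention chosen here.
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Recall: of the real layout tables, the fraction the rule caught.
    static double recall(List<Sample> samples) {
        int truePositives = 0;
        int falseNegatives = 0;
        for (Sample s : samples) {
            if (s.isLayout) {
                if (s.predictedLayout) truePositives++; else falseNegatives++;
            }
        }
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }
}
```

A rule that never fires on the corpus produces no output change, which matches the "too specific" concern raised earlier in the thread.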
Oct 11 2016
It turned out that usage of <p> is a noisy signal. Web pages produced by MS Word are polluted by extra tags, and it is quite common to see data tables containing <p> in this case. Sadly, the percentage is even higher among pages with a non-mobile-friendly layout, which are our target.
Feb 15 2018
Comment 1 by klo...@chromium.org, Oct 7 2016