Issue 654058

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug

Blocked on:
issue 593457



Distiller misses the main content due to wrong table classification

Project Member Reported by wychen@chromium.org, Oct 7 2016

Issue description

Version: M55
OS: All

What steps will reproduce the problem?
(1) Distill http://www.bpl.org/general/history.htm

What is the expected output?
See the main content.

What do you see instead?
Only the footer is available.

The MHTML snapshot is attached, in case the page changes its content in the future.
 
Attachment: BPL - History and Description.mhtml.zip (129 KB)
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
Diagnosis:

According to the table classification heuristics defined here:
  java/org/chromium/distiller/TableClassifier.java
The first rule that matches is:
  11) Table having >=5 columns is data table.

The algorithm is taken from:
http://asurkov.blogspot.com/2011/10/data-vs-layout-table.html

Some possible ideas:

We could add a new rule before rule #11, and probably after rule #9.

1. If the textContent of the table is long enough, treat it as a layout table. This might misclassify really large data tables.
2. If the table contains many <p> elements, treat it as a layout table, since data tables usually don't use <p>. We got lucky that this page uses <p> properly in the article; some articles just use <div> with lots of <br> for paragraphs.
3. If the table contains many <h[1-6]> heading tags, treat it as a layout table. We also got lucky that this page uses <h?> properly; some pages use <font> to make headings look larger, without proper semantics.
4. Augment the 1-row 1-col rule: if a row is >95% of the table height (0.986 in our case on Nexus 6P), or if a cell is >95% of the table width (not useful in our case), treat it as a layout table. Two-column tables, such as pro vs. con with a long list in each cell, might wrongly fall into this category.
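
To make the four ideas concrete, here is a minimal Java sketch of where such a rule could sit relative to rule #11. It is not the code in TableClassifier.java: TableStats, LayoutTableSketch, and every method name are hypothetical, and the text, <p>, and heading thresholds are made-up placeholders. Only the >=5 column check and the 95% ratio come from this issue.

```java
import java.util.function.Predicate;

/** Hypothetical per-table measurements; not taken from TableClassifier.java. */
final class TableStats {
    final int columnCount;
    final int textLength;           // length of the table's textContent
    final int paragraphCount;       // number of <p> descendants
    final int headingCount;         // number of <h1>..<h6> descendants
    final double maxRowHeightRatio; // tallest row height / table height
    final double maxCellWidthRatio; // widest cell width / table width

    TableStats(int columnCount, int textLength, int paragraphCount, int headingCount,
               double maxRowHeightRatio, double maxCellWidthRatio) {
        this.columnCount = columnCount;
        this.textLength = textLength;
        this.paragraphCount = paragraphCount;
        this.headingCount = headingCount;
        this.maxRowHeightRatio = maxRowHeightRatio;
        this.maxCellWidthRatio = maxCellWidthRatio;
    }
}

final class LayoutTableSketch {
    enum Type { LAYOUT, DATA, UNDECIDED }

    // Placeholder thresholds (assumptions); they would need tuning on a corpus.
    static final int MIN_TEXT_LENGTH = 2000;
    static final int MIN_PARAGRAPHS = 5;
    static final int MIN_HEADINGS = 3;
    // Ratio from idea 4 in this issue.
    static final double DOMINANT_RATIO = 0.95;

    // Idea 1: long textContent.
    static boolean hasLongText(TableStats t) {
        return t.textLength >= MIN_TEXT_LENGTH;
    }

    // Idea 2: many <p> elements.
    static boolean hasManyParagraphs(TableStats t) {
        return t.paragraphCount >= MIN_PARAGRAPHS;
    }

    // Idea 3: many heading tags.
    static boolean hasManyHeadings(TableStats t) {
        return t.headingCount >= MIN_HEADINGS;
    }

    // Idea 4: one row or one cell dominates the table.
    static boolean hasDominantRowOrCell(TableStats t) {
        return t.maxRowHeightRatio > DOMINANT_RATIO
                || t.maxCellWidthRatio > DOMINANT_RATIO;
    }

    /** First matching rule wins; the candidate rule is checked before rule #11. */
    static Type classify(TableStats t, Predicate<TableStats> candidateRule) {
        if (candidateRule.test(t)) {
            return Type.LAYOUT;
        }
        if (t.columnCount >= 5) { // rule #11
            return Type.DATA;
        }
        return Type.UNDECIDED;    // all other rules are omitted from this sketch
    }
}
```

Passing the candidate as a Predicate keeps the ideas independent, so each one can be tried on its own, for example classify(stats, LayoutTableSketch::hasDominantRowOrCell), before deciding whether any of them should be combined.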

The precision and recall might need to be tested on a larger, representative corpus to make sure the negative effect (if any) is acceptable. If the output does not change for any of the corpora, our approach might be too specific.
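
As a rough illustration of that evaluation, the following Java sketch computes precision and recall for the "layout table" label over a hand-labeled set of tables, and counts how many classifications change between two versions of the classifier. ClassifierEval, Sample, and changedOutputs are hypothetical names, and how the labeled corpus would be obtained is outside this sketch.

```java
import java.util.List;

final class ClassifierEval {
    /** One labeled table; true means "layout table". */
    static final class Sample {
        final boolean actualLayout;
        final boolean predictedLayout;

        Sample(boolean actualLayout, boolean predictedLayout) {
            this.actualLayout = actualLayout;
            this.predictedLayout = predictedLayout;
        }
    }

    /** TP / (TP + FP); NaN if nothing was predicted as layout. */
    static double precision(List<Sample> samples) {
        int tp = 0;
        int fp = 0;
        for (Sample s : samples) {
            if (s.predictedLayout) {
                if (s.actualLayout) tp++; else fp++;
            }
        }
        return tp + fp == 0 ? Double.NaN : (double) tp / (tp + fp);
    }

    /** TP / (TP + FN); NaN if the corpus has no layout tables. */
    static double recall(List<Sample> samples) {
        int tp = 0;
        int fn = 0;
        for (Sample s : samples) {
            if (s.actualLayout) {
                if (s.predictedLayout) tp++; else fn++;
            }
        }
        return tp + fn == 0 ? Double.NaN : (double) tp / (tp + fn);
    }

    /** Number of tables whose classification differs between two runs. */
    static int changedOutputs(List<Boolean> before, List<Boolean> after) {
        if (before.size() != after.size()) {
            throw new IllegalArgumentException("runs must cover the same tables");
        }
        int changed = 0;
        for (int i = 0; i < before.size(); i++) {
            if (!before.get(i).equals(after.get(i))) changed++;
        }
        return changed;
    }
}
```

A changedOutputs result of zero across every corpus would be the "too specific" case: the new rule never fires on anything except the page that motivated it.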
Blockedon: 593457
I'll create a large corpus with non-mobile-friendly distillable pages to test the ideas.

Comment 4 by wychen@chromium.org, Oct 11 2016

It turned out that the use of <p> is a noisy signal. Web pages produced by MS Word are polluted by extra tags, and data tables containing <p> are quite common in such pages. Sadly, the percentage is even higher among pages with non-mobile-friendly layouts, which are our target.

Comment 5 by k...@chromium.org, Feb 15 2018

Cc: -k...@chromium.org