Fast article element detection should ignore small elements
Issue description
Version: M54

What steps will reproduce the problem?
(1) Run DOM distiller on http://japanese.engadget.com/2016/09/09/3dcg-saya2016/

What is the expected output?
Extracted content.

What do you see instead?
No data is extracted.

In the fast path, the only detected article element is this one:
<header class="header container" itemscope="" itemtype="http://schema.org/BlogPosting">
Its dimensions are around 400x100 px on mobile, or around 800x70 px on desktop. We should filter out these small elements.
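The proposed filter could be sketched roughly as follows. This is only an illustration of the idea, not DOM Distiller's actual code: the `Candidate` type, the `filter_small` helper, and the `MIN_AREA_PX` threshold are all hypothetical names and values, chosen so that both header sizes quoted above are rejected.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A detected article element and its rendered size in CSS pixels."""
    tag: str
    width: float
    height: float


# Hypothetical threshold, not taken from DOM Distiller: picked only so that
# both header sizes quoted in the report (400x100 and 800x70) fall below it.
MIN_AREA_PX = 100_000


def filter_small(candidates, min_area=MIN_AREA_PX):
    """Keep only candidates whose rendered area is at least min_area."""
    return [c for c in candidates if c.width * c.height >= min_area]


candidates = [
    Candidate("header", 400, 100),    # mobile size from the report
    Candidate("header", 800, 70),     # desktop size from the report
    Candidate("article", 800, 3000),  # hypothetical full-size article body
]
kept = filter_small(candidates)
```

With these assumed values, only the large `article` candidate survives. An area check is used here for simplicity; a real fix might instead apply separate width and height limits, or scale the threshold to the viewport.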
,
Feb 1 2017
,
Feb 2 2017
Naively making article detection more accurate would adversely affect the quality evaluation. The key difference is that the title is usually no longer within the root element, so the "expand to title" step no longer works properly.
,
Feb 2 2017
Comment 1 by wychen@chromium.org, Sep 10 2016