Issue 692596

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug


Fuzzy title matching in DOM distiller

Project Member Reported by wychen@chromium.org, Feb 15 2017

Issue description

The title matching algorithm used in DOM distiller requires an exact match, apart from the publisher name, which is stripped away. However, some sites use slightly different titles in <title> and <h1>, causing the match to fail.

Example: https://www-marketwatch-com.cdn.ampproject.org/v/www.marketwatch.com/amp/story/guid/E6CA6E62-F220-11E6-82ED-7800910FCE87?amp_js_v=7

What's in <title>:
Tesla could decide to tap capital markets as its shares rally analyst says - MarketWatch

What's in <h1>:
Tesla could decide to tap capital markets as its shares rally, analyst says

If the edit distance is small enough (1 in this example, once the publisher suffix is stripped), the titles should still match.
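
A rough sketch of the idea in Python, for illustration only. This is not the DOM distiller code. The helper names (`edit_distance`, `strip_publisher`, `titles_match`), the " - " separator, and the threshold of 1 are all assumptions; the real publisher-stripping logic is likely more involved.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def strip_publisher(title: str, separator: str = " - ") -> str:
    """Hypothetical simplification: drop the text after the last separator."""
    head, sep, _ = title.rpartition(separator)
    return head if sep else title


def titles_match(page_title: str, heading: str, max_distance: int = 1) -> bool:
    """Treat the <title> and <h1> text as the same title if they nearly agree."""
    candidate = strip_publisher(page_title).strip()
    return edit_distance(candidate, heading.strip()) <= max_distance


page_title = ("Tesla could decide to tap capital markets as its shares rally "
              "analyst says - MarketWatch")
heading = ("Tesla could decide to tap capital markets as its shares rally, "
           "analyst says")
print(titles_match(page_title, heading))
```

A fixed threshold is the simplest choice; scaling it with title length might be more robust for very short or very long titles, but that is a design question left open here.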

Comment 1 by wychen@chromium.org, Feb 23 2017

If we don't get title matching right, the distilled content shows the extracted title followed by the unmatched <h1>, resulting in two nearly identical titles.

Comment 2 by wychen@chromium.org, Mar 16 2017

Labels: Hotlist-GoodFirstBug

Comment 3 by k...@chromium.org, Feb 15 2018

Cc: -k...@chromium.org