
Issue 593457 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug

Blocking:
issue 654058




Corpus with representative sample for DOM distiller evaluation

Project Member Reported by wychen@chromium.org, Mar 9 2016

Issue description

Currently we use the dataset "reader-mode-golden-data" for performance evaluation, but it is not representative of what users see, so trade-offs based on that dataset might be biased. We should build another dataset that reflects the real-world distribution.

Since it would only be for performance evaluation, we don't really need the "golden answer" part. This way, creating the dataset can be automated.
 
To be more accurate, it would be nice to have full MHTML as input, but this shouldn't block the creation of this dataset.
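
For reference, MHTML is a MIME multipart/related container, so the Python standard library `email` package can read it. The following is only a minimal sketch of pulling the main HTML document out of an MHTML snapshot; the helper name `main_html_from_mhtml` is hypothetical, and this is not meant to describe what any existing tool in the repository does.

```python
import email
from email import policy


def main_html_from_mhtml(mhtml_bytes):
    """Return the first text/html part of an MHTML snapshot, or None.

    Hypothetical helper: assumes the first text/html part is the main
    document, which is typical for saved pages but not guaranteed.
    """
    msg = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the
            # bytes using the part's declared charset.
            return part.get_content()
    return None
```

Subresources (images, stylesheets) are the remaining parts of the same message, so a fuller version could map each part's Content-Location header to its payload.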

Comment 2 by wychen@chromium.org, Mar 11 2016

All the articles in "reader-mode-golden-data" have <meta property="og:type" content="article" />, so markup_parsing time would be biased.
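
One way to quantify this bias would be to count how many pages in a candidate corpus carry the same markup. A rough sketch, assuming the pages are available as HTML strings; the names `OgTypeParser`, `has_article_og_type`, and `article_fraction` are made up for illustration:

```python
from html.parser import HTMLParser


class OgTypeParser(HTMLParser):
    """Records the content of the first <meta property="og:type"> tag."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta" or self.og_type is not None:
            return
        attr = dict(attrs)
        if (attr.get("property") or "").lower() == "og:type":
            self.og_type = (attr.get("content") or "").strip().lower()


def has_article_og_type(html):
    """True if the page declares og:type as "article"."""
    parser = OgTypeParser()
    parser.feed(html)
    return parser.og_type == "article"


def article_fraction(pages):
    """Fraction of pages (HTML strings) that declare og:type "article"."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(has_article_og_type(p) for p in pages) / len(pages)
```

A fraction near 1.0, as in the golden dataset, would suggest the corpus over-represents pages with this markup compared with what users actually visit.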
Cc: mdjones@chromium.org k...@chromium.org
Owner: wychen@chromium.org
Status: Started (was: Untriaged)
Summary: Corpus with representative sample for DOM distiller evaluation (was: Corpus with representative sample for DOM distiller performance evaluation)
Besides performance evaluation, the corpus can also be used for output difference detection. Since a recent bug (issue 654058) could really use a representative corpus with high coverage, and we can support MHTML in our eval server, now might be a good time to make it happen.
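
As a sketch of what output difference detection could look like: run the distiller before and after a change, then diff the extracted text per page. This assumes the outputs are already collected as dicts mapping URL to distilled text; the function name `changed_outputs` and that data layout are hypothetical, not part of the eval server.

```python
import difflib


def changed_outputs(before, after):
    """Return {url: unified diff} for pages whose distilled text differs.

    `before` and `after` map URL -> distilled text (an assumed layout).
    Only URLs present in both runs are compared.
    """
    diffs = {}
    for url in sorted(before.keys() & after.keys()):
        if before[url] == after[url]:
            continue
        diff = difflib.unified_diff(
            before[url].splitlines(),
            after[url].splitlines(),
            fromfile=url + " (before)",
            tofile=url + " (after)",
            lineterm="",
        )
        diffs[url] = "\n".join(diff)
    return diffs
```

No golden answer is needed for this: any page whose diff is non-empty is simply flagged for manual review.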
Blocking: 654058
Non-mobile-friendly distillable corpus is here: cl/135527281

Comment 6 by bugdroid1@chromium.org, Oct 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a22aa4380f04d2d8aa18f1f3098393516646f181

commit a22aa4380f04d2d8aa18f1f3098393516646f181
Author: wychen <wychen@chromium.org>
Date: Mon Oct 24 19:38:13 2016

Roll DOM Distiller JavaScript distribution package

Diff since last roll:
https://github.com/chromium/dom-distiller/compare/d16a68c1b8...072fe57b48

Picked up changes:
072fe57 Recognize H4 to H6 as headings as well
52047b4 Avoid using getClassName() to avoid issues with <svg>
8cf93ce Bump ChromeDriver version to 2.24
d876125 Add gen_mhtml_corpus.py to convert MHTML to eval corpus
8b33c8b Amend "Fix partially hidden article"
3fd2017 Strip unwanted classNames from all nodes

BUG=593457,599121, 647098 , 658038 

Review-Url: https://codereview.chromium.org/2447453002
Cr-Commit-Position: refs/heads/master@{#427118}

[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/DEPS
[modify] https://crrev.com/a22aa4380f04d2d8aa18f1f3098393516646f181/third_party/dom_distiller_js/README.chromium

Comment 7 by wychen@chromium.org, Mar 16 2017

Cc: noyau@chromium.org
We might want a corpus representative of the iOS Reading List, if we want to measure performance changes.

Comment 8 by k...@chromium.org, Feb 15 2018

Cc: -k...@chromium.org
