
Issue 598184

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Feature



Allow lower-quality matches from HQP, but score them very low

Reported by pdk...@gmail.com, Mar 27 2016

Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.108 Safari/537.36

Steps to reproduce the problem:
1. https://groups.google.com/forum/?#!forum/google-cloud-sdk
2. enter cloud-sdk in the omnibar

What is the expected behavior?

What went wrong?
The URL isn't suggested.

Did this work before? N/A 

Chrome version: 49.0.2623.108  Channel: stable
OS Version: Ubuntu 14.04
Flash Version: 

It's suggested when searching for forum. The following, slightly different URL, is suggested.

https://groups.google.com/forum/#!forum/google-cloud-sdk

It seems that searching for query in http://site?query yields no results, but searching for site does.
 
Components: -UI UI>Browser>Omnibox
Labels: -OS-Linux -Type-Bug -Pri-2 OS-All Pri-3 Type-Feature
Status: Untriaged (was: Unconfirmed)
Feature request for Omnibar suggestions to take query parameters into account from history and bookmarks.

Forwarding onto Omnibox team for further triage.
Owner: mpear...@chromium.org
I seriously hope "https://groups.google.com/forum/?#!forum/google-cloud-sdk" is not the way that Groups or any other site publishes real URLs... that's just terrible.

->Mark for his input on what we do for query params, but this is somewhere between P3 and WontFix for me.
FYI every group link in the emails sent out (in the footer to the post thread) that I have seen uses the query param into hash + exclamation to get the old Google bot to crawl their SPA pages. So, it is basically their default it seems. While the emails *show* a clean URL, they redirect to this querified version.

Comment 4 by phistuck@gmail.com, Apr 3 2016

#3 -
For me, it goes to a querified URL with an actual (analytical) query -
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/discuss-webrtc/FYJksGrlVm4/z1KgmBFkKwAJ
I believe Peter refers to a question mark without any actual query after it as a terrible thing.

Comment 5 by pdk...@gmail.com, Apr 3 2016

Yes, the URL posted is reduced. I'm surprised that queries get special
treatment, in that they are apparently ignored. That's a major bug
IMO, for this particular feature at least.

Comment 6 by pdk...@gmail.com, Apr 3 2016

Any hash following a query is ignored. And apparently all query items
but the first are ignored. Or so I thought, but it seems to be more
confusing than that.

This is a real URL, less the edited project name.

https://console.cloud.google.com/appengine?project=name&moduleId=default&duration=PT1H

It finds project and name, but not the other query items, such as
PT1H. Except when you just enter p, it highlights both p in project
and P in PT1H. Now press t, and the omnibar is blank. No entries.
I don't know what our query indexing rules are but in general queries have a ridiculous amount of crap in them that you almost never want to match for almost any URLs.

Comment 8 by pdk...@gmail.com, Apr 4 2016

It can be weighted then. Not showing any match (when there are no
other non-query matches) is a bug.
What the omnibox does is complicated.  Here roughly are the rules that apply here:
* Allow terms to match after the "?", but only if they're at what looks like a word boundary.
* These query-param term matches are given a score of 0.5.  For comparison, title matches are given a score of 0.8 and hostname matches are given a score of 1.0.  These scores are summed for individual terms.
* "cloud-sdk" and "cloud sdk" are both treated as two separate terms.
* The average term score must be at least 0.8 for a result to be displayed (regardless of how often the page is visited).  We have found via side-by-side comparisons that this threshold is a good level to get rid of off-topic results without losing on-topic results.

In this example, the page isn't meeting the topicality threshold; its term score is 0.5 and 0.5 and thus averages 0.5.  The easy way to verify this is to notice that "cloud-sdk groups" does return the result because it brings the score above the threshold.
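
As a rough illustration of the threshold arithmetic described above, here is a minimal C++ sketch.  It is not Chromium's implementation: the constant and function names are hypothetical, the weights are the approximate values quoted in this comment, and the 1.8 score used for "groups" assumes that the term matches both the hostname and the stored title "Google Groups".

```cpp
#include <numeric>
#include <vector>

// Approximate per-match weights as quoted in the comment above.
// (Hypothetical names; the real code uses different internal units.)
constexpr double kQueryMatchScore = 0.5;
constexpr double kTitleMatchScore = 0.8;
constexpr double kHostnameMatchScore = 1.0;
constexpr double kTopicalityThreshold = 0.8;

// Returns the mean of the per-term scores, or 0 when there are no terms.
double AverageTermScore(const std::vector<double>& term_scores) {
  if (term_scores.empty())
    return 0.0;
  const double sum =
      std::accumulate(term_scores.begin(), term_scores.end(), 0.0);
  return sum / static_cast<double>(term_scores.size());
}

// A result is displayed only if its average term score reaches the threshold.
bool MeetsTopicalityThreshold(const std::vector<double>& term_scores) {
  return AverageTermScore(term_scores) >= kTopicalityThreshold;
}
```

Under these assumptions, "cloud-sdk" yields term scores {0.5, 0.5}, which average 0.5 and fail the check, while "cloud-sdk groups" yields {0.5, 0.5, 1.8}, which average roughly 0.93 and pass.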

Usually, in my experience, when people want a match in the ?query part of a URL, the text they want also appears in the title.  Yet, in this case, it does not; the title that gets stored in the history system is simply "Google Groups". :-(  You end up seeing "google-cloud-sdk - Google Groups" as a title in the title bar because of some weird redirection or javascript thing going on (I don't have the time or skills to investigate this right now), but to Chrome's history system, the title doesn't have the group name in it, so we can't use it for scoring.

If someone cares enough, they should morph this bug into a complaint about the history system and why the title it stores isn't the title displayed on the tab.  Purely in terms of scoring, this is a WontFix, as the key threshold here has already been tuned.
Mark, rather than a hard threshold, is there a way where we could simply score things differently, e.g. score things at 0.8 highly enough to appear with the "normal" results, and score other things very low, so they'll show up below any search suggestions (or whatever)?

I realize people rarely click on things low down in the box, so this would have little benefit on our metrics, but it seems like it might help to surface some of these.
Often HQP results with matches like this get outscored by search query suggestions anyway even if they would normally appear.  The reason the threshold was added was precisely for the situation when there are few search suggestions; these matches look stupid.
Why is showing no matches better than showing "stupid" matches?  Does it actually improve people's interactions with the box in a measurable way, i.e. we'd regress something by changing this?

Comment 13 by pdk...@gmail.com, Apr 11 2016

I think there should be an ELSE-type clause at the very least, to show results that would otherwise not be shown. It could even just show the results unordered, perhaps ranked only by visits or another simple metric. If user confusion is an argument against it, I argue that not showing results causes more user confusion.
>>>
Does it actually improve people's interactions with the box in a measurable way, i.e. we'd regress something by changing this?
>>>
It's hard to say.  The effects were so minor that they weren't statistically significant in most cases, but the metrics did move in the correct direction when we ran the experiment.  See bug 306198 and especially comments #22 and #23 and the links within them.  As one example of the change: it seems like removing these often-stupid results increased clicks higher in the omnibox and didn't show any decrease in clicks on the results from this provider (HISTORY_QUICK).  I.e., basically all the time these stupid results showed, they were pushing out results people wanted.
That suggests that maybe we should be willing to return these, but with a score of 1, or some other way to ensure they can't push out any other results.  Do you think it's plausible that might not regress any metrics?

Comment 16 by pdk...@gmail.com, Apr 11 2016

I should perhaps point out that I have search suggestions disabled. I interpret the comments as Google (naturally) wanting users to click on search suggestions, which they do slightly less when the missing results are shown.
We don't care about what users click on as long as it's the most useful thing we can show them.

The omnibox is not a vehicle to push people into using a search engine against their will.
pkasting@
>>>
That suggests that maybe we should be willing to return these, but with a score of 1, or some other way to ensure they can't push out any other results.  Do you think it's plausible that might not regress any metrics?
>>>
Yes, that's plausible.  It might regress the side-by-side metrics where we show people the omnibox with various inputs and ask them what they think, but behavioral metrics might be fine.  If you or someone else wants to add code to try to experiment with this, I am happy to review it and help examine the metrics.

pdknsk@,
You make a good point about experiments.  We always run experiments over the whole population of users, which means that we end up selecting improvements that help the most given typical user configurations.  For users like you who have unusual setups, these changes may not work as well (or indeed work worse).

That said, I want to be clear that when I look at omnibox experiment metrics, I do NOT consider increased use of searches a plus.  Indeed, if that happens, I usually look skeptically at the experiment because the result usually means the user isn't seeing the URL suggestion they wanted and had to resort to a search to find the URL.  This extra step is a negative in my book.

Cc: pkasting@chromium.org mpear...@chromium.org
Owner: ----
Status: Available (was: Untriaged)
Summary: Allow lower-quality matches from HQP, but score them very low (was: omnibar oblivious to some history entries, queries apparently)
It seems like there are potentially a couple changes here:
* In general being more willing to return low-quality matches (things below the thresholds today), but scoring them very low.
* Being able to match at non-word boundaries more?  Maybe we already do this outside the query portion?

It also seems like Mark doesn't intend to take action here and thus isn't a good owner.  I don't have time to implement this myself so I'm not a good owner either.  This is probably "looking for an external contributor" :(

Comment 20 by pdk...@gmail.com, Apr 11 2016

I'd like to show another case (which could actually be an entirely different bug), which I just encountered preparing another omnibar-related bug report.

Enter both URLs in omnibar.

https://bugs.chromium.org/p/monorail/issues/detail?id=1240#c1
https://bugs.chromium.org/p/monorail/issues/detail?id=1240#c2

Type 1240. It only returns the first URL. Remove the first URL from history. It returns no results.

The reason is that only the first URL is stored with title (which has 1240 in it), while the second isn't, and thus the missing results bug is triggered. (That it doesn't store the title could be a separate bug.)
I would file this separately, and focus on that they're not both stored with the correct title.

Comment 22 by pdk...@gmail.com, Apr 11 2016

PS. There is no #c2 on that issue, but it doesn't matter. When you visit the URLs in reverse order, then #c2 is stored with title, and #c1 isn't. Anyway, that's a separate bug.

Comment 24 by pdk...@gmail.com, Aug 22 2016

I made this small patch for me.

--- a/components/omnibox/browser/scored_history_match.cc
+++ b/components/omnibox/browser/scored_history_match.cc
@@ -474,9 +474,9 @@ float ScoredHistoryMatch::GetTopicalityScore(
   // Loop through all URL matches and score them appropriately.
   // First, filter all matches not at a word boundary and in the path (or
   // later).
-  url_matches = FilterTermMatchesByWordStarts(
-      url_matches, terms_to_word_starts_offsets, word_starts.url_word_starts_,
-      end_of_hostname_pos, std::string::npos);
+  // url_matches = FilterTermMatchesByWordStarts(
+  //     url_matches, terms_to_word_starts_offsets, word_starts.url_word_starts_,
+  //     end_of_hostname_pos, std::string::npos);
   if (colon_pos != std::string::npos) {
     // Also filter matches not at a word boundary and in the scheme.
     url_matches = FilterTermMatchesByWordStarts(
@@ -497,13 +497,13 @@ float ScoredHistoryMatch::GetTopicalityScore(
     if ((question_mark_pos != std::string::npos) &&
         (url_match.offset > question_mark_pos)) {
       // The match is in a CGI ?... fragment.
-      DCHECK(at_word_boundary);
-      term_scores[url_match.term_num] += 5;
+      // DCHECK(at_word_boundary);
+      term_scores[url_match.term_num] += at_word_boundary ? 5 : 1;
     } else if ((end_of_hostname_pos != std::string::npos) &&
                (url_match.offset > end_of_hostname_pos)) {
       // The match is in the path.
-      DCHECK(at_word_boundary);
-      term_scores[url_match.term_num] += 8;
+      // DCHECK(at_word_boundary);
+      term_scores[url_match.term_num] += at_word_boundary ? 8 : 1;
     } else if ((colon_pos == std::string::npos) ||
                (url_match.offset > colon_pos)) {
       // The match is in the hostname.
@@ -521,10 +521,10 @@ float ScoredHistoryMatch::GetTopicalityScore(
     } else {
       // The match is in the protocol (a.k.a. scheme).
       // Matches not at a word boundary should have been filtered already.
-      DCHECK(at_word_boundary);
+      // DCHECK(at_word_boundary);
       match_in_scheme = true;
       if (allow_scheme_matches_)
-        term_scores[url_match.term_num] += 10;
+        term_scores[url_match.term_num] += at_word_boundary ? 10 : 0;
     }
   }
   // Now do the analogous loop over all matches in the title.
@@ -542,7 +542,7 @@ float ScoredHistoryMatch::GetTopicalityScore(
     while ((next_word_starts != end_word_starts) &&
            (*next_word_starts < (title_match.offset + term_offset))) {
       ++next_word_starts;
-      ++word_num;
+      // ++word_num;
     }
     if (word_num >= num_title_words_to_allow_)
       break;  // only count the first ten words
@@ -577,9 +578,9 @@ float ScoredHistoryMatch::GetTopicalityScore(
   const float final_topicality_score = topicality_score / num_terms;
 
   // Demote the URL if the topicality score is less than threshold.
-  if (final_topicality_score < topicality_threshold_) {
-    return 0.0;
-  }
+  // if (final_topicality_score < topicality_threshold_) {
+  //   return 0.0;
+  // }
 
   return final_topicality_score;
 }

Thanks for looking into this, but posting patches on bugs isn't the right way to get them looked at.  Follow the steps on https://www.chromium.org/developers/contributing-code , especially in "Uploading a change for review"; once you set some reviewers and Publish+mail to notify them, they can start looking at your proposal.

Comment 26 by pdk...@gmail.com, Aug 22 2016

I didn't mean to get it into Chrome. Just posting for those
interested, or as a starting point.
It would still be more readable on the code review site, and who knows, maybe we would actually like something about it enough to land a modified version.

Like, I wouldn't mind Mark glancing at your scoring changes.  But it'd be easier for him to do there than here.
It looks like this change does two things:
1) allow non-word-boundary matches in the /path and ?cgi components of a URL, scoring such matches low
2) remove the threshold that prevents low-scoring URLs from being suggested

The two examples on this thread are both solved by (2).  I'm not convinced (1) will buy much.  I think it might be worth experimenting with removing the threshold in (2).  We previously added the threshold as part of a large change; it's not clear what effect the threshold had on metrics.  It did fix some examples we had ("gg" no longer showing the suggestion for a web site whose hostname included the string "suggest").
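
To make the word-boundary distinction in (1) concrete, here is a small C++ sketch.  It is a simplification and not the omnibox's actual word-breaking logic: the function name is hypothetical, matching is case-sensitive, and a "word boundary" is assumed to be the start of the string or any position preceded by a non-alphanumeric character.  The hostname "suggest.example.com" is a made-up example.

```cpp
#include <cctype>
#include <string>

// Returns true if `term` occurs in `text` starting at a word boundary, i.e.
// at offset 0 or immediately after a non-alphanumeric character.
// (Hypothetical helper; a simplified stand-in for the real word-start logic.)
bool MatchesAtWordBoundary(const std::string& text, const std::string& term) {
  if (term.empty())
    return false;
  for (std::string::size_type pos = text.find(term);
       pos != std::string::npos; pos = text.find(term, pos + 1)) {
    if (pos == 0 ||
        !std::isalnum(static_cast<unsigned char>(text[pos - 1]))) {
      return true;
    }
  }
  return false;
}
```

With this definition, "gg" is found in "suggest.example.com" only in the middle of a word, so it is rejected, whereas "cloud" and "sdk" both start words in "groups.google.com/forum/?#!forum/google-cloud-sdk".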

Cc: -mpear...@chromium.org
Owner: mpear...@chromium.org
Status: Started (was: Available)
Assigning to myself for tracking purposes, as I'm currently running an experiment that exactly tackles this issue.  Initial results are promising! :-)
Labels: Hotlist-OmniboxRanking
Project Member

Comment 31 by bugdroid1@chromium.org, Aug 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fbc743ee8228f50c12e466a440f2920f3c86d5b8

commit fbc743ee8228f50c12e466a440f2920f3c86d5b8
Author: Mark Pearson <mpearson@chromium.org>
Date: Tue Aug 01 20:08:33 2017

Omnibox - Launch Aggressively Suggest Infrequently Visited URLs

Sets default values for all relevant parameters to the ones we decided
to launch.  In particular, this change:
* makes it so that URLs that match the input--even if only visited
  once or twice--are likely to score above low quality query suggestions
  that come from the server.
* boosts a URL suggestion if it appears the user is clearly
  seeking that URL.  In particular, if the omnibox input only matches that
  single URL from history, it gets a 3x boost (in effect we count it as
  having three times as many visits).  This boost decreases as the number of
  matching URLs increases, so that if the user input matches five or more
  items from history, nothing gets a boost.
* lowers the threshold for how well a URL must match the input in order
  to be displayed.  Previously, for example, we wouldn't return URLs that
  match a word in the input if the word matches in the ?query or #hash
  section of the URL.  Now we do.
* reduces the relative weight of a "typed visit" (a time the URL is selected
  from the omnibox) compared with a regular visit (click on a link).
  It used to be that the former was worth 20x the latter.  Now it's only
  1.5x.
* changes to a scoring model in which additional visits to a URL are
  guaranteed to increase its score.  Previously we used a model based on
  the average quality of a visit, which means that if a URL has many
  typed visits and then gets a new untyped visit, its score (the average)
  will go down.  Now we use simply a sum, which means the score will
  definitely increase.

Precisely, in terms of code / config, we're launching the following settings:
  "HQPExperimentalScoringBuckets": "0.0:550,1:625,9.0:1300,90.0:1399",
  "HQPTypedValue": "1.5",
  "HQPFreqencyUsesSum": "true",
  "HQPNumMatchesScores": "1:3,2:2.5,3:2,4:1.5",
  "HQPExperimentalScoringTopicalityThreshold": "0.5"

In the process, removes some of the flags for frequency scoring that
I don't think are useful (not the right model for scoring) and aren't
worth going back to.

Bug: 695560,  327085 ,  369989 , 508262,  580688 , 591981,  598184 
Change-Id: Id349c5aaa2e09e6b5284c55fc5790f4b14b8fa7b
Reviewed-on: https://chromium-review.googlesource.com/585377
Commit-Queue: Mark Pearson <mpearson@chromium.org>
Reviewed-by: Peter Kasting <pkasting@chromium.org>
Cr-Commit-Position: refs/heads/master@{#491089}
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/history_quick_provider_unittest.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/in_memory_url_index_unittest.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/omnibox_field_trial.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/omnibox_field_trial.h
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match.h
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match_unittest.cc
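
For readers unfamiliar with the "HQPNumMatchesScores" setting in the commit message above, the following C++ sketch shows one way such a "count:multiplier" string could be read and applied.  It is an illustration only: the function names are hypothetical, this is not Chromium's parser, and the fallback multiplier of 1.0 for counts not listed is an assumption based on the statement that five or more matches get no boost.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parses comma-separated "count:multiplier" pairs such as "1:3,2:2.5".
// Entries without a colon are skipped; std::stoi/std::stod throw on
// non-numeric input, which this sketch does not handle.
std::map<int, double> ParseNumMatchesScores(const std::string& spec) {
  std::map<int, double> table;
  std::istringstream pairs(spec);
  std::string pair;
  while (std::getline(pairs, pair, ',')) {
    const std::string::size_type colon = pair.find(':');
    if (colon == std::string::npos)
      continue;
    table[std::stoi(pair.substr(0, colon))] =
        std::stod(pair.substr(colon + 1));
  }
  return table;
}

// Returns the boost for the given number of matching history URLs, or 1.0
// (no boost) when the count is not listed in the table.
double BoostForNumMatches(const std::map<int, double>& table,
                          int num_matches) {
  const auto it = table.find(num_matches);
  return it == table.end() ? 1.0 : it->second;
}
```

Applied to the launched value "1:3,2:2.5,3:2,4:1.5", a single matching URL would be treated as having three times as many visits, and five or more matching URLs would receive no boost.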

Status: Fixed (was: Started)
This bug has been fixed!  It solves all the various issues (mostly involving matching in query parameters) mentioned on this bug.  It does this by lowering the minimum threshold required for matches as pdknsk@gmail.com and pkasting@ suggested.  Thanks!

The fix is submitted for Chrome 62.  I also rolled it out via Chrome's experiment framework to all stable channel users.  (I used the framework to prove that this change is good for users.  Now I turned the "experiment" up to 100%.)  The next time you restart your browser, you should see the new behavior.  Please file new bugs if you have additional feedback/complaints.  Thanks!

Comment 33 by pdk...@gmail.com, Aug 3 2017

I haven't tried it yet, but it seems non-boundary matches in paths are
still not shown. I'm not sure if I get the experiments in Chromium. I think
yes, as I set fieldtrial_testing_like_official_build.
You are correct: if a term only matches at a non-word-boundary in the path section of a URL, the URL will not be suggested.  I know you said you wanted that, but it wasn't necessary for any of the examples you provided.  That's why I called this bug fixed. :-)
