Allow lower-quality matches from HQP, but score them very low
Reported by pdk...@gmail.com, Mar 27 2016
Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.108 Safari/537.36

Steps to reproduce the problem:
1. https://groups.google.com/forum/?#!forum/google-cloud-sdk
2. enter cloud-sdk in the omnibar

What is the expected behavior?

What went wrong?
The URL isn't suggested.

Did this work before? N/A

Chrome version: 49.0.2623.108 Channel: stable
OS Version: Ubuntu 14.04
Flash Version:

It's suggested when searching for forum. The following, slightly different URL, is suggested.
https://groups.google.com/forum/#!forum/google-cloud-sdk
It seems that searching for query in http://site?query yields no results, but searching for site does.
,
Apr 3 2016
I seriously hope "https://groups.google.com/forum/?#!forum/google-cloud-sdk" is not the way that Groups or any other site publishes real URLs... that's just terrible. ->Mark for his input on what we do for query params, but this is somewhere between P3 and WontFix for me.
,
Apr 3 2016
FYI every group link in the emails sent out (in the footer to the post thread) that I have seen uses the query param into hash + exclamation to get the old Google bot to crawl their SPA pages. So, it is basically their default it seems. While the emails *show* a clean URL, they redirect to this querified version.
,
Apr 3 2016
#3 - For me, it goes to a querified URL with an actual (analytical) query: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/discuss-webrtc/FYJksGrlVm4/z1KgmBFkKwAJ I believe what Peter calls terrible is a question mark with no actual query after it.
,
Apr 3 2016
Yes, the URL posted is reduced. I'm surprised that queries get special treatment, in that they are apparently ignored. That's a major bug IMO, for this particular feature at least.
,
Apr 3 2016
Any hash following a query is ignored. And apparently all query items but the first are ignored. Or so I thought, but it seems to be more confusing than that. This is a real URL, less the edited project name. https://console.cloud.google.com/appengine?project=name&moduleId=default&duration=PT1H It finds project and name, but not the other query items, such as PT1H. Except when you just enter p, it highlights both p in project and P in PT1H. Now press t, and the omnibar is blank. No entries.
,
Apr 4 2016
I don't know what our query indexing rules are but in general queries have a ridiculous amount of crap in them that you almost never want to match for almost any URLs.
,
Apr 4 2016
It can be weighted then. Not showing any match (when there are no other non-query matches) is a bug.
,
Apr 5 2016
What the omnibox does is complicated. Here roughly are the rules that apply here:
* Allow terms to match after the "?", but only if they're at what looks like a word boundary.
* These query-param term matches are given a score of 0.5. For comparison, title matches are given a score of 0.8 and hostname matches are given a score of 1.0. These scores are summed for individual terms.
* "cloud-sdk" and "cloud sdk" are both treated as two separate terms.
* The average term score must be at least 0.8 for a result to be displayed (regardless of how often the page is visited). We have found via side-by-side comparisons that this threshold is a good level to get rid of off-topic results without losing on-topic results.

In this example, the page isn't meeting the topicality threshold; its term scores are 0.5 and 0.5 and thus average 0.5. The easy way to verify this is to notice that "cloud-sdk groups" does return the result because it brings the score above the threshold.

Usually in my experience the times people want a match in the ?query part of a URL, the text they want also appears in the title. Yet, in this case, it does not; the title that gets stored in the history system is simply "Google Groups". :-( You end up seeing "google-cloud-sdk - Google Groups" as a title in the title bar because of some weird redirection or javascript thing going on (I don't have the time or skills to investigate this right now), but to Chrome's history system, the title doesn't have the group name in it, so we can't use it for scoring. If someone cares enough, they should morph this bug into a complaint about the history system and why the title it stores isn't the title displayed on the tab.

Purely in terms of scoring, this is a WontFix, as the key threshold here has already been tuned.
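The arithmetic in the rules above can be sketched in a few lines. This is only an illustration of the comment's simplified description, not Chromium's scoring code: the constant and function names are hypothetical, and the weights are the rounded values quoted in the comment.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative weights taken from the comment's description (hypothetical names).
constexpr double kQueryParamMatch = 0.5;
constexpr double kTitleMatch = 0.8;
constexpr double kHostnameMatch = 1.0;
constexpr double kTopicalityThreshold = 0.8;

// Each element is the summed score of one input term across all of its matches.
double AverageTermScore(const std::vector<double>& term_scores) {
  assert(!term_scores.empty());
  const double total =
      std::accumulate(term_scores.begin(), term_scores.end(), 0.0);
  return total / static_cast<double>(term_scores.size());
}

// A result is displayed only if the average term score reaches the threshold.
bool PassesTopicalityThreshold(const std::vector<double>& term_scores) {
  return AverageTermScore(term_scores) >= kTopicalityThreshold;
}
```

For "cloud-sdk", the two terms each score 0.5, so the average is 0.5 and the result is filtered out. For "cloud-sdk groups", if we assume "groups" is credited for both the hostname and the stored title "Google Groups" (1.0 + 0.8), the term scores {0.5, 0.5, 1.8} average to about 0.93 and the result passes. That assumption is one reading of "these scores are summed for individual terms"; the comment itself only states that the extra term lifts the score above the threshold.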
,
Apr 5 2016
Mark, rather than a hard threshold, is there a way where we could simply score things differently, e.g. score things at 0.8 highly enough to appear with the "normal" results, and score other things very low, so they'll show up below any search suggestions (or whatever)? I realize people rarely click on things low down in the box, so this would have little benefit on our metrics, but it seems like it might help to surface some of these.
,
Apr 5 2016
Often HQP results with matches like this get outscored by search query suggestions anyway even if they would normally appear. The reason the threshold was added was precisely for the situation when there are few search suggestions; these matches look stupid.
,
Apr 5 2016
Why is showing no matches better than showing "stupid" matches? Does it actually improve people's interactions with the box in a measurable way, i.e. we'd regress something by changing this?
,
Apr 11 2016
I think there should be an ELSE type-of clause at the very least, to show results that would otherwise not be shown. It can just show the results unordered even, perhaps only ranked by visits or another simple metric. If user-confusion is an argument against it, I argue that not showing results causes more user-confusion.
,
Apr 11 2016
>>> Does it actually improve people's interactions with the box in a measurable way, i.e. we'd regress something by changing this? >>>

It's hard to say. The effects were so minor that they weren't statistically significant in most cases, but the metrics did move in the correct direction when we ran the experiment. See bug 306198 and especially comments #22 and #23 and the links within them. As one example of the change: it seems like removing these often-stupid results increased clicks higher in the omnibox and didn't show any decrease in clicks on the results from this provider (HISTORY_QUICK). I.e., basically all the time these stupid results showed, they were pushing out results people wanted.
,
Apr 11 2016
That suggests that maybe we should be willing to return these, but with a score of 1, or some other way to ensure they can't push out any other results. Do you think it's plausible that might not regress any metrics?
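A minimal sketch of the difference between the hard threshold and the "score of 1" idea, assuming a relevance of 0 means "not shown". The function names and the constant are hypothetical and are not taken from the Chromium source.

```cpp
#include <cassert>

// Hypothetical lowest relevance that still lets a match be displayed.
constexpr int kDemotedRelevance = 1;

// Behavior discussed in the thread: below-threshold matches are dropped (0).
int RelevanceWithHardThreshold(double topicality, double threshold,
                               int relevance) {
  return topicality < threshold ? 0 : relevance;
}

// Proposed alternative: keep the match but pin it to the lowest relevance,
// so it cannot outrank any normally scored suggestion.
int RelevanceWithDemotion(double topicality, double threshold, int relevance) {
  assert(relevance >= kDemotedRelevance);
  return topicality < threshold ? kDemotedRelevance : relevance;
}
```

With a topicality of 0.5 against a threshold of 0.8, the first function returns 0 (no suggestion) while the second returns 1, so the match would appear only when nothing better fills the slot.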
,
Apr 11 2016
I should perhaps point out that I have search suggestions disabled. I interpret the comments as Google (naturally) wanting users to click on search suggestions, which they do slightly less when the missing results are shown.
,
Apr 11 2016
We don't care about what users click on as long as it's the most useful thing we can show them. The omnibox is not a vehicle to push people into using a search engine against their will.
,
Apr 11 2016
pkasting@
>>> That suggests that maybe we should be willing to return these, but with a score of 1, or some other way to ensure they can't push out any other results. Do you think it's plausible that might not regress any metrics? >>>

Yes, that's plausible. It might regress the side-by-side metrics where we show people the omnibox with various inputs and ask them what they think, but behavioral metrics might be fine. If you or someone else wants to add code to try to experiment with this, I am happy to review it and help examine the metrics.

pdknsk@,
You make a good point about experiments. We always run experiments over the whole population of users, which means that we end up selecting improvements that help the most given typical user configurations. For users like you who have unusual setups, these changes may not work as well (or may indeed work worse). That said, I want to be clear that when I look at omnibox experiment metrics, I do NOT consider increasing use of searches a plus. Indeed, if that happens, I usually look skeptically at the experiment because the result usually means the user isn't seeing the URL suggestion they wanted and had to resort to a search to find the URL. This extra step is a negative in my book.
,
Apr 11 2016
It seems like there are potentially a couple changes here:
* In general being more willing to return low-quality matches (things below the thresholds today), but scoring them very low.
* Being able to match at non-word boundaries more? Maybe we already do this outside the query portion?

It also seems like Mark doesn't intend to take action here and thus isn't a good owner. I don't have time to implement this myself so I'm not a good owner either. This is probably "looking for an external contributor" :(
,
Apr 11 2016
I'd like to show another case (which could actually be an entirely different bug), which I just encountered preparing another omnibar-related bug report. Enter both URLs in omnibar. https://bugs.chromium.org/p/monorail/issues/detail?id=1240#c1 https://bugs.chromium.org/p/monorail/issues/detail?id=1240#c2 Type 1240. It only returns the first URL. Remove the first URL from history. It returns no results. The reason is that only the first URL is stored with title (which has 1240 in it), while the second isn't, and thus the missing results bug is triggered. (That it doesn't store the title could be a separate bug.)
,
Apr 11 2016
I would file this separately, and focus on that they're not both stored with the correct title.
,
Apr 11 2016
PS. There is no #c2 on that issue, but it doesn't matter. When you visit the URLs in reverse order, then #c2 is stored with title, and #c1 isn't. Anyway, that's a separate bug.
,
Apr 11 2016
Two additional omnibar requests, and the other bug. https://bugs.chromium.org/p/chromium/issues/detail?id=602346 https://bugs.chromium.org/p/chromium/issues/detail?id=602402 https://bugs.chromium.org/p/chromium/issues/detail?id=602416
,
Aug 22 2016
I made this small patch for me.
--- a/components/omnibox/browser/scored_history_match.cc
+++ b/components/omnibox/browser/scored_history_match.cc
@@ -474,9 +474,9 @@ float ScoredHistoryMatch::GetTopicalityScore(
   // Loop through all URL matches and score them appropriately.
   // First, filter all matches not at a word boundary and in the path (or
   // later).
-  url_matches = FilterTermMatchesByWordStarts(
-      url_matches, terms_to_word_starts_offsets, word_starts.url_word_starts_,
-      end_of_hostname_pos, std::string::npos);
+  // url_matches = FilterTermMatchesByWordStarts(
+  //     url_matches, terms_to_word_starts_offsets, word_starts.url_word_starts_,
+  //     end_of_hostname_pos, std::string::npos);
   if (colon_pos != std::string::npos) {
     // Also filter matches not at a word boundary and in the scheme.
     url_matches = FilterTermMatchesByWordStarts(
@@ -497,13 +497,13 @@ float ScoredHistoryMatch::GetTopicalityScore(
     if ((question_mark_pos != std::string::npos) &&
         (url_match.offset > question_mark_pos)) {
       // The match is in a CGI ?... fragment.
-      DCHECK(at_word_boundary);
-      term_scores[url_match.term_num] += 5;
+      // DCHECK(at_word_boundary);
+      term_scores[url_match.term_num] += at_word_boundary ? 5 : 1;
     } else if ((end_of_hostname_pos != std::string::npos) &&
                (url_match.offset > end_of_hostname_pos)) {
       // The match is in the path.
-      DCHECK(at_word_boundary);
-      term_scores[url_match.term_num] += 8;
+      // DCHECK(at_word_boundary);
+      term_scores[url_match.term_num] += at_word_boundary ? 8 : 1;
     } else if ((colon_pos == std::string::npos) ||
                (url_match.offset > colon_pos)) {
       // The match is in the hostname.
@@ -521,10 +521,10 @@ float ScoredHistoryMatch::GetTopicalityScore(
     } else {
       // The match is in the protocol (a.k.a. scheme).
       // Matches not at a word boundary should have been filtered already.
-      DCHECK(at_word_boundary);
+      // DCHECK(at_word_boundary);
       match_in_scheme = true;
       if (allow_scheme_matches_)
-        term_scores[url_match.term_num] += 10;
+        term_scores[url_match.term_num] += at_word_boundary ? 10 : 0;
     }
   }
   // Now do the analogous loop over all matches in the title.
@@ -542,7 +542,7 @@ float ScoredHistoryMatch::GetTopicalityScore(
     while ((next_word_starts != end_word_starts) &&
            (*next_word_starts < (title_match.offset + term_offset))) {
       ++next_word_starts;
-      ++word_num;
+      // ++word_num;
     }
     if (word_num >= num_title_words_to_allow_)
       break;  // only count the first ten words
@@ -577,9 +578,9 @@ float ScoredHistoryMatch::GetTopicalityScore(
   const float final_topicality_score = topicality_score / num_terms;
   // Demote the URL if the topicality score is less than threshold.
-  if (final_topicality_score < topicality_threshold_) {
-    return 0.0;
-  }
+  // if (final_topicality_score < topicality_threshold_) {
+  //   return 0.0;
+  // }
   return final_topicality_score;
 }
,
Aug 22 2016
Thanks for looking into this, but posting patches on bugs isn't the right way to get them looked at. Follow the steps on https://www.chromium.org/developers/contributing-code , especially in "Uploading a change for review"; once you set some reviewers and Publish+mail to notify them, they can start looking at your proposal.
,
Aug 22 2016
I didn't mean to get it into Chrome. Just posting for those interested, or as a starting point.
,
Aug 22 2016
It would still be more readable on the code review site, and who knows, maybe we would actually like something about it enough to land a modified version. Like, I wouldn't mind Mark glancing at your scoring changes. But it'd be easier for him to do there than here.
,
Sep 28 2016
It looks like this change does two things:
1) allows non-word-boundary matches in the /path and ?cgi components of a URL, scoring such matches low
2) removes the threshold that prevents low-scoring URLs from being suggested
The two examples on this thread are both solved by (2). I'm not convinced (1) will buy much. I think it might be worth experimenting with removing the threshold in (2). We previously added the threshold as part of a large change; it's not clear what effect the threshold had on metrics. It did fix some examples we had ("gg" no longer showing the suggestion for a web site whose hostname included the string "suggest").
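To make the word-boundary distinction in (1) concrete, here is a rough sketch of a word-start check. It is an assumption-laden illustration, not the omnibox's tokenizer: the helper names are hypothetical, a "word start" is taken to be any position not preceded by an ASCII letter or digit, and the input is assumed to be lower-cased already.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if `pos` is the start of the text or follows a non-alphanumeric char.
bool IsWordStart(const std::string& text, std::size_t pos) {
  return pos == 0 ||
         !std::isalnum(static_cast<unsigned char>(text[pos - 1]));
}

// True if `term` occurs in `text` at least once at a word start.
bool MatchesAtWordStart(const std::string& text, const std::string& term) {
  if (term.empty()) return false;
  for (std::size_t pos = text.find(term); pos != std::string::npos;
       pos = text.find(term, pos + 1)) {
    if (IsWordStart(text, pos)) return true;
  }
  return false;
}
```

Under this definition, "gg" occurs inside a made-up hostname such as "suggest.example.com" but never at a word start, so it would not count, whereas "sdk" in "google-cloud-sdk" follows a hyphen and would.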
,
Jun 15 2017
Assigning to myself for tracking purposes, as I'm currently running an experiment that exactly tackles this issue. Initial results are promising! :-)
,
Jul 19 2017
,
Aug 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/fbc743ee8228f50c12e466a440f2920f3c86d5b8

commit fbc743ee8228f50c12e466a440f2920f3c86d5b8
Author: Mark Pearson <mpearson@chromium.org>
Date: Tue Aug 01 20:08:33 2017

Omnibox - Launch Aggressively Suggest Infrequently Visited URLs

Sets default values for all relevant parameters to the ones we decided to launch.

In particular, this change:
* makes it so that URLs that match the input--even if only visited once or twice--are likely to score above low quality query suggestions that come from the server.
* boosts a URL suggestion if it appears the URL suggestion is clearly seeking that URL. In particular, if the omnibox input only matches that single URL from history, it gets a 3x boost (in effect we count it as having three times as many visits). This boost decreases as the number of matching URLs increases, so that if the user input matches five or more items from history, nothing gets a boost.
* lowers the threshold for how well a URL must match the input in order to be displayed. Previously, for example, we wouldn't return URLs that match a word in the input if the word matches in the ?query or #hash section of the URL. Now we do.
* reduces the relative weight of a "typed visit" (a time the URL is selected from the omnibox) compared with a regular visit (click on a link). It used to be that the former was worth 20x the latter. Now it's only 1.5x.
* changes to a scoring model in which additional visits to a URL are guaranteed to increase its score. Previously we used a model based on the average quality of a visit, which means that if a URL has many typed visits and then gets a new untyped visit, its score (the average) will go down. Now we use simply a sum, which means the score will definitely increase.

Precisely, in terms of code / config, we're launching the following settings:
"HQPExperimentalScoringBuckets": "0.0:550,1:625,9.0:1300,90.0:1399",
"HQPTypedValue": "1.5",
"HQPFreqencyUsesSum": "true",
"HQPNumMatchesScores": "1:3,2:2.5,3:2,4:1.5",
"HQPExperimentalScoringTopicalityThreshold": "0.5"

In the process, removes some of the flags for frequency scoring that I don't think are useful (not the right model for scoring) and aren't worth going back to.

Bug: 695560, 327085, 369989, 508262, 580688, 591981, 598184
Change-Id: Id349c5aaa2e09e6b5284c55fc5790f4b14b8fa7b
Reviewed-on: https://chromium-review.googlesource.com/585377
Commit-Queue: Mark Pearson <mpearson@chromium.org>
Reviewed-by: Peter Kasting <pkasting@chromium.org>
Cr-Commit-Position: refs/heads/master@{#491089}

[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/history_quick_provider_unittest.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/in_memory_url_index_unittest.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/omnibox_field_trial.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/omnibox_field_trial.h
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match.cc
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match.h
[modify] https://crrev.com/fbc743ee8228f50c12e466a440f2920f3c86d5b8/components/omnibox/browser/scored_history_match_unittest.cc
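As an illustration of how the "HQPNumMatchesScores" value could be read, the sketch below parses the "count:multiplier" pairs and looks up the boost for a given number of matching URLs. It is a guess at the semantics based only on the commit message; the function names are hypothetical and this is not the parser Chromium uses. The fallback of 1.0 for unlisted counts follows the statement that five or more matches get no boost.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Parses a spec such as "1:3,2:2.5,3:2,4:1.5" into {match count -> multiplier}.
std::map<int, double> ParseNumMatchesScores(const std::string& spec) {
  std::map<int, double> boosts;
  std::istringstream stream(spec);
  std::string pair;
  while (std::getline(stream, pair, ',')) {
    const std::size_t colon = pair.find(':');
    if (colon == std::string::npos) continue;  // Skip entries without a colon.
    boosts[std::stoi(pair.substr(0, colon))] =
        std::stod(pair.substr(colon + 1));
  }
  return boosts;
}

// Returns the multiplier for `num_matches`, or 1.0 (no boost) if unlisted.
double BoostForNumMatches(const std::map<int, double>& boosts,
                          int num_matches) {
  const auto it = boosts.find(num_matches);
  return it == boosts.end() ? 1.0 : it->second;
}
```

With the launched value, a single matching URL gets 3x, two get 2.5x, three get 2x, four get 1.5x, and anything beyond that falls through to 1.0.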
,
Aug 2 2017
This bug has been fixed! It solves all the various issues (mostly involving matching in query parameters) mentioned on this bug. It does this by lowering the minimum threshold required for matches as pdknsk@gmail.com and pkasting@ suggested. Thanks! The fix is submitted for Chrome 62. I also rolled it out via Chrome's experiment framework to all stable channel users. (I used the framework to prove that this change is good for users. Now I turned the "experiment" up to 100%.) The next time you restart your browser, you should see the new behavior. Please file new bugs if you have additional feedback/complaints. Thanks!
,
Aug 3 2017
I haven't tried it yet, but it seems non-boundary matches in paths are still not shown. I'm not sure if I get the experiments in Chromium. I think yes, as I set fieldtrial_testing_like_official_build.
,
Aug 3 2017
You are correct, if a term only matches at a non-word-boundary in the path section of a URL, the URL will not be suggested. I know you said you wanted that, but in all the examples you provided, that wasn't necessary for any of them. That's why I called this bug fixed. :-)
Comment 1 by jonathan.garbee@chromium.org, Apr 2 2016
Labels: -OS-Linux -Type-Bug -Pri-2 OS-All Pri-3 Type-Feature
Status: Untriaged (was: Unconfirmed)