
Issue 678398

Starred by 7 users

Issue metadata

Status: Duplicate
Merged: issue 765006
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 704388




Popular Indian websites not offering translation suggestion

Project Member Reported by mdw@chromium.org, Jan 4 2017

Issue description

Chrome Version: 55.0.2883.95
OS: OS X

What steps will reproduce the problem?
(1) Visit any of the URLs below.
(2) Wait for the page to load.


What is the expected result?

I am offered the option to translate the content into English.

What happens instead?

These sites are predominantly in Indian languages. None of them appear to trigger a translation assist. There is no apparent UI to force Chrome to translate the content for me.

Here are some popular Indian websites that do not offer translate suggestions:

http://www.tupaki.com/  (Telugu)
http://www.eenadu.net/  (Telugu)
http://www.amarujala.com/  (Hindi)
http://tamil.oneindia.com/  (Tamil)

(These are all among the top 100 Indian sites according to Chrome history logs.)


 
Cc: djweiss@chromium.org riesa@chromium.org
Some debugging information: this doesn't appear to be a language detection issue. If you open chrome://translate-internals/#detection-logs in one tab, load the pages in separate tabs, and then go back to the first tab, you can see that the languages are detected correctly.

However, in the first three cases, the HTML lang attribute is set to English ("en"), and this discrepancy (I think) leads to "und" (unknown) being adopted as the page language.

In the last case, the not-translate flag is set, which results in Translate not triggering.
Cc: abakalov@chromium.org

Comment 3 by riesa@chromium.org, Jan 13 2017

Cc: groby@chromium.org
Owner: zkoch@chromium.org
Correct -- not a LangID issue. LangID model CLD3 correctly identifies the language of all pages. Current Translate UI triggering heuristics prevent triggering since there is a mismatch between what LangID identifies and the content language claimed by the HTML "lang" attribute in the source code.

Currently these pages all explicitly claim to be English '<html ... lang="en">'. 

Assigning to zkoch@ who can triage and decide if revisiting triggering heuristics is important to do after the currently ongoing underlying LangID model change to CLD3 is completed.

Triggering heuristics start here:
https://cs.chromium.org/chromium/src/components/translate/core/language_detection/language_detection_util.cc?rcl=0&l=279
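
For anyone who wants to check what a page declares, here is a minimal Python sketch that reads the lang attribute from the <html> tag. The names declared_language and _HtmlLangParser are hypothetical, and this is not how Chromium does it; it looks only at the <html> tag and ignores any other place a language could be declared.

```python
from html.parser import HTMLParser


class _HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(markup):
    """Return the lang value of the <html> tag, or None if it has none."""
    parser = _HtmlLangParser()
    parser.feed(markup)
    return parser.lang
```

Feeding it the markup of one of the pages above should return "en" if the page still declares lang="en" on its <html> tag.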

Comment 4 by zkoch@chromium.org, Jan 17 2017

Cc: yyushkina@chromium.org
Ah, interesting. I wonder why these sites are marking these pages as "en". Mdw, any idea?


Not sure what is happening in these particular cases, but in general, if English is more popular than language X in India, developers might be tempted to mark a page in X as English, hoping that search engines will rank it higher than a similar page marked as X.

Comment 6 by riesa@google.com, Jan 18 2017

I think it may also be that this could be a preset value in many HTML editors and is left as-is before publishing.

Comment 7 by mdw@chromium.org, Jan 18 2017

Completely agree that this is probably a default value in HTML templates and not indicative of developer intent. Do we have any signals (e.g., maybe from the crawl) in terms of how often the detected language and the document declared language are a mismatch? Should we not prefer the detected language over the lang attribute in markup?

Comment 8 by zkoch@chromium.org, Jan 18 2017

> Do we have any signals (e.g., maybe from the crawl) in terms of how often the detected language and the document declared language are a mismatch? 

This would be an interesting thing to pull. I'm not exactly sure how to do this.

> Should we not prefer the detected language over the lang attribute in markup?

We tend to prefer to give developers ways to assert influence over our predictions for assistive features, which is why I was hesitant to suggest this. That said, there is some precedent here for doing what's best for the user, especially if the use of the markup by the developer is unintentional or the side effects unknown (e.g. autocomplete=off for autofill).

Groby, what's your take on this?
Currently, we have a list of well-known wrong language codes, which is similar to what you are discussing.
https://cs.chromium.org/chromium/src/components/translate/core/language_detection/language_detection_util.cc?q=kWellKnownCodesOnWrongConfiguration

If the language CLD detects differs from the language the page declares, and the declared language is on this list, we trust CLD's result.

This list was created from the results of a doc-join, an offline analysis of all existing web pages. However, the well-known misconfigured languages may be nearly the same as the most commonly used languages, which would mean that a page declaring "en" is always suspect when CLD says otherwise.
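
The rule described above can be sketched roughly as follows in Python. This is a simplified illustration based only on this thread, not the actual Chromium code: the function name adopted_language is hypothetical, and the entries in WELL_KNOWN_WRONG_CODES are placeholders chosen for the example, not the contents of the real list.

```python
UNKNOWN = "und"

# Placeholder entries for illustration only (assumption); see
# kWellKnownCodesOnWrongConfiguration in the Chromium source for the real list.
WELL_KNOWN_WRONG_CODES = frozenset({"es", "pt"})


def adopted_language(declared, detected, wrong_codes=WELL_KNOWN_WRONG_CODES):
    """Pick the language to adopt for a page.

    declared: language the page claims, e.g. from <html lang="...">
    detected: language reported by the language detector (CLD)
    """
    if declared == detected:
        return detected
    if declared in wrong_codes:
        # The declared code is known to be misconfigured often: trust CLD.
        return detected
    # Mismatch with a declared code that is not on the list: per the
    # discussion above, the page ends up with an unknown language.
    return UNKNOWN
```

With these placeholder entries, a page that declares "en" but is detected as Telugu ("te") comes out as "und", which matches the behavior reported for the first three sites.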

Comment 10 by mdw@chromium.org, Jan 23 2017

How long ago was the wrong language list generated? Does it make sense to refresh it?

Also, rather than weighting all pages equally, you might consider weighting that analysis by the number of clicks. The pages I was looking at were in the top 100 most popular pages in India.

My suggestion would be to redo the analysis for popular pages in our target EM countries (at least IN and ID, if not more), weight by page popularity, and see if revising the list of languages or the rule for deciding which language the page is would make sense in light of the results. This is a pretty important problem for us to address for EM users.

Thanks!
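
To make the weighting suggestion concrete, here is a rough Python sketch of a popularity-weighted mismatch rate per declared language. The function name weighted_mismatch_rate, the input shape, and the sample numbers are all made up for illustration; they are not real measurements.

```python
from collections import defaultdict


def weighted_mismatch_rate(pages):
    """Compute, per declared language, the visit-weighted share of pages
    whose detected language differs from the declared one.

    pages: iterable of (declared, detected, visits) tuples.
    """
    total = defaultdict(float)
    mismatched = defaultdict(float)
    for declared, detected, visits in pages:
        total[declared] += visits
        if declared != detected:
            mismatched[declared] += visits
    return {lang: mismatched[lang] / total[lang]
            for lang in total if total[lang] > 0}


# Hypothetical sample data: (declared, detected, visits).
sample_pages = [
    ("en", "te", 900),
    ("en", "en", 100),
    ("hi", "hi", 50),
]
rates = weighted_mismatch_rate(sample_pages)
```

A declared language with a high weighted rate in a target country would be a candidate for the wrong-language list, whereas an unweighted count could hide a small number of very popular mislabeled pages.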


Comment 11 by mdw@chromium.org, Feb 27 2017

Friendly ping.

Cc: napper@chromium.org
Adding Jon, TL for the language team.
Blockedon: 704388
Cc: -riesa@chromium.org -abakalov@chromium.org
Owner: yyushkina@chromium.org
Status: Assigned (was: Untriaged)
Cc: ftang@chromium.org kavvaru@chromium.org
Issue 586053 has been merged into this issue.
Cc: andrewhayden@chromium.org riesa@chromium.org
Issue 573304 has been merged into this issue.
Cc: -riesa@chromium.org -ftang@chromium.org -andrewhayden@chromium.org
Cc: durga.behera@chromium.org abakalov@chromium.org riesa@chromium.org hdodda@chromium.org
Issue 678287 has been merged into this issue.
Components: -UI>Browser>Translate UI>Browser>Language>Translate
Update on this: we're checking which other languages, if any, should be added to the "well-known wrong language pairs list". Metrics for this have landed, but we have to wait until they reach a stable release before we can make a decision.
Mergedinto: 765006
Status: Duplicate (was: Assigned)
