Popular Indian websites not offering translation suggestion
Issue description

Chrome Version: 55.0.2883.95
OS: OS X

What steps will reproduce the problem?
(1) Visit any of the URLs below.
(2) Wait for the page to load.

What is the expected result?
I am offered the option to translate the content into English.

What happens instead?
These sites are predominantly in Indian languages. None of them appear to trigger a translation assist. There is no apparent UI to force Chrome to translate the content for me.

Here are some popular Indian websites that do not offer translate suggestions:
http://www.tupaki.com/ (Telugu)
http://www.eenadu.net/ (Telugu)
http://www.amarujala.com/ (Hindi)
http://tamil.oneindia.com/ (Tamil)

(These are all among the top 100 Indian sites according to Chrome history logs.)
,
Jan 13 2017
,
Jan 13 2017
Correct -- not a LangID issue. The LangID model, CLD3, correctly identifies the language of all of these pages. The current Translate UI triggering heuristics suppress the prompt because the language LangID identifies does not match the content language claimed by the HTML "lang" attribute in the page source. Currently these pages all explicitly claim to be English: '<html ... lang="en">'. Assigning to zkoch@, who can triage and decide whether revisiting the triggering heuristics is important once the ongoing change of the underlying LangID model to CLD3 is completed. Triggering heuristics start here: https://cs.chromium.org/chromium/src/components/translate/core/language_detection/language_detection_util.cc?rcl=0&l=279
,
Jan 17 2017
Ah, interesting. I wonder why these sites are marking these pages as "en". Mdw, any idea?
,
Jan 17 2017
Not sure what is happening in these particular cases but in general it could be the case that if English is more popular than language X in India, developers might be tempted to mark a page in X as English hoping that search engines will rank it higher than a similar page in X marked as X.
,
Jan 18 2017
I think it may also be that this could be a preset value in many HTML editors and is left as-is before publishing.
,
Jan 18 2017
Completely agree that this is probably a default value in HTML templates and not indicative of developer intent. Do we have any signals (e.g., maybe from the crawl) in terms of how often the detected language and the document declared language are a mismatch? Should we not prefer the detected language over the lang attribute in markup?
,
Jan 18 2017
> Do we have any signals (e.g., maybe from the crawl) in terms of how often the detected language and the document declared language are a mismatch?

This would be an interesting thing to pull. I'm not exactly sure how to do this.

> Should we not prefer the detected language over the lang attribute in markup?

We tend to prefer to give developers ways to assert influence over our predictions for assistive features, which is why I was hesitant to suggest this. That said, there is some precedent here for doing what's best for the user, especially if the use of the markup by the developer is unintentional or the side effects unknown (e.g. autocomplete=off for autofill). Groby, what's your take on this?
,
Jan 20 2017
Currently, we have a well-known wrong language pairs list, which is similar to what you discussed. https://cs.chromium.org/chromium/src/components/translate/core/language_detection/language_detection_util.cc?q=kWellKnownCodesOnWrongConfiguration If the language CLD detects and the language the page declares differ, and the page-declared language is on this list, we trust CLD's result. The list was created from the result of a doc-join, an offline analysis of all existing web pages. However, the well-known misconfigured languages may be nearly the same as the most commonly used languages, which means a page declaring "en" would always be suspicious if CLD says it isn't English.
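For readers following along, the rule described in this comment can be sketched roughly as below. This is an illustration only, not the code in language_detection_util.cc: the function name, the "und" marker for an undetermined language, and the contents of the wrong-configuration set are all assumptions made for the example.

```python
# Hypothetical sketch of the mismatch rule discussed in this thread.
# Names and list contents are illustrative; they are NOT Chromium's.

UNKNOWN = "und"  # assumed marker for "language could not be determined"


def determine_page_language(declared, detected, wrong_declared_codes):
    """Pick the page language from the declared and detected codes.

    declared: value of the HTML lang attribute ("" if absent).
    detected: code returned by the language detector (UNKNOWN if unsure).
    wrong_declared_codes: declared codes known to be frequently wrong.
    """
    # Compare only the primary subtag, e.g. "en-US" -> "en".
    declared = declared.strip().lower().split("-")[0]
    detected = detected.strip().lower().split("-")[0]

    if not declared:
        return detected          # nothing declared: use the detector
    if detected == UNKNOWN:
        return declared          # detector unsure: use the markup
    if declared == detected:
        return declared          # both agree
    if declared in wrong_declared_codes:
        return detected          # declared code is on the list: trust the detector
    return UNKNOWN               # unexplained mismatch: no translate prompt


# A Telugu page that declares lang="en":
print(determine_page_language("en", "te", set()))    # und
print(determine_page_language("en", "te", {"en"}))   # te
```

Under these assumptions, a Telugu page marked lang="en" resolves to an undetermined language unless "en" is on the list, which matches the behavior reported in this issue.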
,
Jan 23 2017
How long ago was the wrong language list generated? Does it make sense to refresh it? Also, rather than weighting all pages equally, you might consider weighting that analysis by the number of clicks. The pages I was looking at were in the top 100 most popular pages in India. My suggestion would be to redo the analysis for popular pages in our target EM countries (at least IN and ID, if not more), weight by page popularity, and see if revising the list of languages or the rule for deciding which language the page is would make sense in light of the results. This is a pretty important problem for us to address for EM users. Thanks!
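A rough sketch of the popularity-weighted analysis suggested here might look like the following. The record layout (declared code, detected code, click count), the function name, and the sample numbers are all made up for illustration; they are not real crawl or Chrome data.

```python
# Hypothetical sketch: how often does each declared language disagree with
# the detected one, counting pages equally vs. weighting by clicks?
from collections import defaultdict


def mismatch_share(pages, weight_by_clicks=True):
    """Return {declared_code: share of (weighted) pages whose detected
    language differs from the declared one}."""
    total = defaultdict(float)
    mismatched = defaultdict(float)
    for declared, detected, clicks in pages:
        weight = clicks if weight_by_clicks else 1
        total[declared] += weight
        if detected != declared:
            mismatched[declared] += weight
    return {code: mismatched[code] / total[code]
            for code in total if total[code] > 0}


# Made-up sample: one very popular mislabeled page, one rarely visited
# correctly labeled page.
sample = [
    ("en", "te", 900),
    ("en", "en", 100),
]
print(mismatch_share(sample, weight_by_clicks=False))
print(mismatch_share(sample, weight_by_clicks=True))
```

In this toy sample the unweighted share for "en" is one half, while the click-weighted share is much higher, which is the kind of difference the weighting is meant to surface.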
,
Feb 27 2017
Friendly ping.
,
Feb 27 2017
Adding Jon, TL for language team.
,
Apr 14 2017
,
Apr 17 2017
,
Apr 17 2017
,
Apr 17 2017
,
Apr 17 2017
Issue 678287 has been merged into this issue.
,
Apr 27 2017
,
Apr 28 2017
Update on this: we're checking to see which other languages, if any, should be added to the "well-known wrong language pairs list". Metrics for this have landed, but we have to wait for them to reach a stable release before we can make a decision.
,
Nov 14 2017
Comment 1 by abakalov@chromium.org
, Jan 13 2017