Use UTF-8 as default for plaintext content
Issue description

Chrome Version: 61.0.3163.100

When a web server returns “Content-Type: text/plain” in its HTTP response headers without explicitly stating a character encoding, Chrome currently seems to assume that the text is in a legacy 8-bit encoding such as ISO 8859-1. In the early days of the web, this was probably a reasonable fallback. In 2017, however, UTF-8 has become the predominant character encoding on the web, so it would make sense to switch Chrome’s default to UTF-8.

See the attached chart, which shows the current distribution of character encodings in Google’s web index. (Publication of this chart approved by Chuck Wu, Google’s director for web data.) In the attached version, the Y axis has been hidden because we consider the absolute numbers confidential; those with access to Google-internal data can go to http://shortn.corp.google.com/_hrWta9MqyC for an interactive view including absolute numbers.
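To make the symptom concrete, here is a minimal Python sketch of what a reader sees when UTF-8 bytes are decoded with a legacy 8-bit encoding. It only illustrates the mismatch described above; it is not Chrome's decoding path, and the sample string and the `decode_as` helper are made up for this example.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding (hypothetical helper)."""
    return raw.decode(encoding)


# Sample text with non-ASCII characters, as a server might send it
# in a text/plain body encoded as UTF-8 but without a charset label.
original = "Zürich café"
raw = original.encode("utf-8")

# Decoding with the right encoding round-trips the text.
as_utf8 = decode_as(raw, "utf-8")

# Decoding the same bytes as ISO 8859-1 turns every two-byte UTF-8
# sequence into two unrelated Latin-1 characters (mojibake).
as_latin1 = decode_as(raw, "iso-8859-1")

print(as_utf8)    # Zürich café
print(as_latin1)  # ZÃ¼rich cafÃ©
```

ASCII-only files look identical under both encodings, which is why the problem only shows up once a plain text file contains non-ASCII characters.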
Oct 30 2017
TextResourceDecoder takes care of not just HTML but also plain text, JSON, XML, etc. if they are not labelled. This did not change when the new encoding detector replaced the ICU-based one. The standard states that JSON/XML files are assumed to be UTF-8 by default if unlabelled, while plain text has no such guideline. Issue 748440 handled that. I take it that this bug is suggesting the UTF-8 default encoding be extended to plain text documents. One question: does the UTF-8 adoption in the graph indicate that they are all properly labeled, or is it a combination of sniffed results and proper labeling as UTF-8? Currently Chrome does not guess UTF-8: https://codereview.chromium.org/2697213002/ Please see Issue 691985 for the background. I think this behavior needs to be taken into account in the proposed change.
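The defaulting rule discussed in this comment can be sketched roughly as follows. This is a simplified Python model, not the TextResourceDecoder code: the function name, the MIME type set, and the `windows-1252` fallback are all assumptions chosen for illustration, and the real decoder also involves an encoding detector that this sketch leaves out.

```python
from typing import Optional

# Assumption: an illustrative set of types treated as UTF-8 when unlabelled,
# following the comment's statement about JSON and XML. Not Chrome's table.
UTF8_BY_DEFAULT = {"application/json", "application/xml", "text/xml"}


def fallback_encoding(
    mime_type: str,
    declared_charset: Optional[str],
    plain_text_default: str = "windows-1252",  # assumed legacy fallback
) -> str:
    """Pick an encoding for a resource (hypothetical helper)."""
    # An explicit label always wins.
    if declared_charset:
        return declared_charset.lower()
    # Unlabelled JSON/XML are assumed to be UTF-8.
    if mime_type.lower() in UTF8_BY_DEFAULT:
        return "utf-8"
    # Everything else, including text/plain, gets the configured fallback.
    return plain_text_default


# Behaviour as described in the report: unlabelled plain text is legacy.
print(fallback_encoding("text/plain", None))
# What this bug proposes: the plain text fallback becomes UTF-8.
print(fallback_encoding("text/plain", None, plain_text_default="utf-8"))
```

In this model the proposal amounts to changing one default value, which is why the interaction with the "does not guess UTF-8" behaviour matters: the fallback is only reached when no label is present.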
Oct 30 2017
> One question: does the UTF-8 adoption in the graph indicate that they are all properly labeled, or is it a combination of sniffed results and proper labeling as UTF-8?
It covers all the docs (regardless of content type and whether they're labelled or not). Perhaps we need to get stats on the following:
* Encodings assigned to all text/plain, with and without a charset label in the HTTP header
* Encodings assigned to text/plain without a charset label in the HTTP header (and without a UTF-8 BOM)
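As a rough illustration of how responses could be sorted into the groups listed above, here is a small Python sketch. The bucket names and both helper functions are hypothetical; it uses only the standard library's header parser and the UTF-8 BOM constant, and it says nothing about how Google's index actually gathers these stats.

```python
import codecs
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return charset.lower() if isinstance(charset, str) else None


def bucket(content_type: str, body: bytes) -> str:
    """Classify one response into a stats bucket (hypothetical names)."""
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/plain":
        return "not-text-plain"
    if declared_charset(content_type):
        return "labelled"
    # A leading EF BB BF marks the body as UTF-8 even without a label.
    if body.startswith(codecs.BOM_UTF8):
        return "unlabelled-with-bom"
    return "unlabelled-no-bom"


print(bucket("text/plain; charset=utf-8", b"hello"))
print(bucket("text/plain", codecs.BOM_UTF8 + b"hello"))
print(bucket("text/plain", b"hello"))
```

The last bucket is the one a changed default would affect, since labelled and BOM-prefixed documents already identify their encoding.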
Oct 30 2017
> Currently Chrome does not guess UTF-8 https://codereview.chromium.org/2697213002/ Please see Issue 691985
Thank you for the reference. Sorry that I haven't gotten to that issue. In principle, I agree with what the Mozilla/WHATWG folks said in that bug.
Oct 30 2017
From the bug report:
> (Publication of this chart approved by Chuck Wu, Google’s director for web data)
Per this comment, I'm opening up this bug.
Nov 20 2017
(not sure if this should be available or unconfirmed, but seems like the right people are already on CC)
Mar 26 2018
Issue 823900 has been merged into this issue.
Mar 26 2018
Apr 3 2018
Comment 1 by js...@chromium.org, Oct 30 2017
Components: Blink>Loader
Labels: OS-Android OS-Chrome OS-Fuchsia OS-iOS OS-Linux OS-Mac OS-Windows