
Issue 778994

Starred by 8 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , iOS , Chrome , Mac , Fuchsia
Pri: 3
Type: Bug




Use UTF-8 as default for plaintext content

Project Member Reported by sascha@google.com, Oct 27 2017

Issue description

Chrome Version: 61.0.3163.100

When a web server returns “Content-Type: text/plain” in its HTTP response headers without explicitly stating a character encoding, Chrome currently seems to assume that the text is in a legacy 8-bit encoding such as ISO 8859-1. In the early days of the web, this was probably a reasonable fallback. In 2017, however, UTF-8 has become the predominant character encoding on the web, so it would make sense to switch Chrome’s default to UTF-8.
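To make the symptom concrete, here is a minimal Python sketch (not Chrome code) of what happens when UTF-8 bytes are decoded with a legacy 8-bit fallback. The sample string is made up for illustration, and ISO 8859-1 stands in for whatever legacy encoding the browser would actually pick.

```python
# Bytes a server might send for a UTF-8 text file, with no charset label.
# The sample text is an illustrative assumption.
body = "naïve café".encode("utf-8")

# Legacy 8-bit fallback: every byte becomes one character, so each
# two-byte UTF-8 sequence turns into two unrelated characters.
as_legacy = body.decode("iso-8859-1")

# Proposed default: decode the same bytes as UTF-8.
as_utf8 = body.decode("utf-8")

print(as_legacy)
print(as_utf8)
```

Only the non-ASCII characters are affected; pure ASCII text decodes identically under both choices, which is why the problem tends to go unnoticed on English-only files.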

See the attached chart, which shows the current distribution of character encodings in Google’s web index. (Publication of this chart approved by Chuck Wu, Google’s director for web data.) In the attached version, the Y axis has been hidden because we’re considering the absolute numbers confidential; those with access to Google-internal data can go to http://shortn.corp.google.com/_hrWta9MqyC for an interactive view including absolute numbers.
 
Attachment: UTF8_adoption.png (83.7 KB)

Comment 1 by js...@chromium.org, Oct 30 2017

Cc: jinsuk...@chromium.org js...@chromium.org
Components: Blink>Loader
Labels: OS-Android OS-Chrome OS-Fuchsia OS-iOS OS-Linux OS-Mac OS-Windows
Jinsuk, how do we determine the encoding of a plain text file (text/plain), as opposed to text/html, in the *new* world?


TextResourceDecoder takes care of not just HTML but also plain text, JSON, XML, etc. if they are not labelled. This did not change when the new encoding detector replaced the ICU-based one.

The standard states that JSON/XML files are assumed to be UTF-8 by default if unlabelled, while plain text has no such guideline. Issue 748440 handled it. I take it that this bug is suggesting the UTF-8 default encoding be extended to plain text documents.

One question: does the UTF-8 adoption in the graph mean that the documents are all properly labelled, or is it a combination of sniffed results and proper labelling as UTF-8?

Currently Chrome does not guess UTF-8 (see https://codereview.chromium.org/2697213002/ ). Please see Issue 691985 for the background. I think this behavior needs to be taken into account in the proposed change.
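As a rough illustration of the decision order discussed in this comment, the Python sketch below picks an encoding from the Content-Type value alone. It is not TextResourceDecoder's actual logic: the function names, the set of JSON/XML MIME types, and the `plaintext_default` parameter are hypothetical, and content sniffing is deliberately left out.

```python
from email.message import Message

# MIME types treated as UTF-8 when unlabelled (illustrative assumption).
UTF8_BY_DEFAULT = {"application/json", "application/xml", "text/xml"}


def parse_content_type(value):
    """Split a Content-Type value into (mime type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def choose_encoding(content_type, plaintext_default="utf-8"):
    """Hypothetical decision order; returns None when left to detection."""
    mime, label = parse_content_type(content_type)
    # 1. An explicit charset label is used as given.
    if label:
        return label
    # 2. Unlabelled JSON/XML is assumed to be UTF-8.
    if mime in UTF8_BY_DEFAULT:
        return "utf-8"
    # 3. Unlabelled plain text: the default this bug proposes to change.
    if mime == "text/plain":
        return plaintext_default
    # 4. Everything else is out of scope for this sketch.
    return None


print(choose_encoding("text/plain"))
print(choose_encoding("text/plain", plaintext_default="windows-1252"))
```

Passing the plain text default as a parameter keeps the proposed change to a single switch, which makes it easy to compare the current and proposed behavior on the same inputs.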

Comment 3 by js...@chromium.org, Oct 30 2017

> One question: does the UTF-8 adoption in the graph mean that the documents are all properly labelled, or is it a combination of sniffed results and proper labelling as UTF-8?

For all the docs (regardless of content-type and whether they're labelled or not). 

Perhaps, we need to get stats on the following:

* Encodings assigned to all text/plain documents, with and without a charset label in the HTTP header

* Encodings assigned to text/plain documents without a charset label in the HTTP header (and without a UTF-8 BOM)
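One way such stats could be gathered is sketched below in Python. It sorts text/plain responses into buckets and counts them. The bucket names and the sample responses are made up for illustration, the charset check is a simple substring match rather than a full header parser, and strict UTF-8 validation stands in for a real encoding detector.

```python
import codecs
from collections import Counter


def bucket(content_type, body):
    """Classify one text/plain response (hypothetical bucket names)."""
    # Simplification: a substring test instead of real header parsing.
    if "charset=" in content_type.lower():
        return "labelled"
    if body.startswith(codecs.BOM_UTF8):
        return "unlabelled, UTF-8 BOM"
    try:
        body.decode("utf-8")  # strict validation; pure ASCII also passes
    except UnicodeDecodeError:
        return "unlabelled, not valid UTF-8"
    return "unlabelled, valid UTF-8"


# Made-up sample responses: (Content-Type value, body bytes).
samples = [
    ("text/plain; charset=utf-8", b"hello"),
    ("text/plain", codecs.BOM_UTF8 + b"hello"),
    ("text/plain", "caf\u00e9".encode("utf-8")),
    ("text/plain", "caf\u00e9".encode("iso-8859-1")),
]

counts = Counter(bucket(ct, body) for ct, body in samples)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

The last two buckets are the ones that matter for this bug: they show how many unlabelled documents would render correctly under a UTF-8 default and how many would regress.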



Comment 4 by js...@chromium.org, Oct 30 2017

> Currently Chrome does not guess UTF-8 (see https://codereview.chromium.org/2697213002/ ). Please see Issue 691985

Thank you for the reference. Sorry that I haven't gotten to that issue. In principle, I agree with what the Mozilla/WHATWG folks said in that bug.

Comment 5 by js...@chromium.org, Oct 30 2017

Labels: allpublic
From the bug report: 
> Publication of this chart approved by Chuck Wu, Google’s director for web data

Per this comment, I'm opening up this bug. 

Comment 6 by pkl@chromium.org, Nov 20 2017

Status: Available (was: Unconfirmed)
(not sure if this should be available or unconfirmed, but seems like the right people are already on CC)

Comment 7 by jsb...@chromium.org, Mar 26 2018

Issue 823900 has been merged into this issue.

Comment 8 by jsb...@chromium.org, Mar 26 2018

Components: Blink>TextEncoding
Cc: pnangunoori@chromium.org
Issue 820767 has been merged into this issue.
