
Issue 691985

Starred by 14 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug-Regression


Participants' hotlists:
Encoding-Detection



Encoding detector causing compat issues

Project Member Reported by annevank...@gmail.com, Feb 14 2017

Issue description

Chrome somewhat recently shipped an encoding detector without an intent to ship or corresponding standards discussion. There is some standards discussion now at https://github.com/whatwg/encoding/issues/68, but it is progressing rather slowly, and as can be seen in https://bugzilla.mozilla.org/show_bug.cgi?id=1338797, Chrome's new behavior causes compatibility issues for browsers that don't have the detector, since pages will now simply omit encoding declarations.

This is bad.
 

Comment 1 by rtoy@chromium.org, Feb 14 2017

Components: -Blink Blink>TextEncoding

Comment 2 by jsb...@chromium.org, Feb 15 2017

Cc: jinsuk...@chromium.org js...@chromium.org jsb...@chromium.org
Labels: Hotlist-Interop
Cc: aelias@chromium.org
Owner: js...@chromium.org
Please note that an encoding detector has always been part of the Chrome browser. The statement "somewhat recently shipped an encoding detector" is not entirely correct. The recent change updated it with a better version in terms of accuracy and speed, and removed the encoding menu.

Granted, this particular issue has to do with one of the many things discussed in the WHATWG link above: "never guess UTF-8" - the idea being to encourage web publishers to specify the UTF-8 label correctly, either in the HTTP header or in their documents, rather than simply relying on browser encoding detection.

The encoding detector is in place to deal with legacy web sites, not to encourage or allow modern sites (which would in almost all cases use UTF-8) to go without specifying an encoding label. 

Assigning to jshin@ to get his thoughts. Does it make sense to drop the UTF-8 encoding detection capability - i.e. use the default encoding (based on locale or TLD) if the detected result is UTF-8?
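
For illustration, a minimal sketch of the proposed policy (not the actual Blink/CED code; the Detector callable and the function name are hypothetical): if the content-based detector's best guess is UTF-8, report no result so that the caller falls back to the locale/TLD-based default.

#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a CED-style content-based detector.
using Detector = std::function<std::optional<std::string>(std::string_view)>;

std::optional<std::string> GuessEncodingForUnlabeledDocument(
    std::string_view bytes, const Detector& detect) {
  std::optional<std::string> best_guess = detect(bytes);
  if (!best_guess)
    return std::nullopt;  // The detector had no confident answer.
  if (*best_guess == "UTF-8")
    return std::nullopt;  // Never guess UTF-8; force a fallback to the default.
  return best_guess;      // Legacy encodings may still be guessed.
}

int main() {
  // Fake detector that always claims UTF-8, just to exercise the policy.
  Detector always_utf8 = [](std::string_view) {
    return std::optional<std::string>("UTF-8");
  };
  auto result =
      GuessEncodingForUnlabeledDocument("some unlabeled bytes", always_utf8);
  std::cout << (result ? *result : std::string("use locale/TLD default"))
            << "\n";
}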


Comment 4 by aelias@chromium.org, Feb 15 2017

Cc: tkent@chromium.org
Labels: ReleaseBlock-Stable M-57
Owner: jinsuk...@chromium.org
Status: Assigned (was: Untriaged)
It's a good point.  We weren't thinking about the indirect effects on web developer behavior when we made this change.

Developers were already relying on the system-locale-based default legacy encoding and serving websites lacking headers for those encodings.  Turning on the autodetector by default didn't really change the interop space there.  But they were not doing so for UTF-8, and now they may start because of our detector.

Jinsuk, please go ahead and make the proposed change to "never guess UTF-8"; it sounds like a good suggestion to me.  Let's also cherry-pick it to M57 to try to head this off before it becomes common.
Status: Started (was: Assigned)

Comment 6 by aelias@chromium.org, Feb 16 2017

Reading the thread more carefully, we should also do this:

- In addition to UTF-8, let's exclude autodetector outcomes from the entire "second set" listed by hsivonen@.  We should try to do this in a way that we fall back to the next-most-likely encoding produced by CED.

- Not blocking M57, but we can further discuss other possible detection approaches such as TLD-based and entire-HTML-based.  There is a tradeoff space here between 1. precision, 2. performance and 3. heuristicness/predictability.  We found a good tradeoff between 1 and 2, but we were not sufficiently weighing 3.  The correct tradeoff is far from self-evident though -- changing this further would probably require some more "try to ship this kind of thing and see how much backlash we get".

Comment 7 by aelias@chromium.org, Feb 16 2017

On second look, the "second set" does not all call for immediate exclusion, since it includes common encodings like EUC-JP, and hsivonen@ says, if I understand correctly, that Gecko also ships an autodetector targeting encodings in that set by default, at least for certain system locales.

For M57, let's start by excluding:
- UTF-8, to avoid trap of modern content developing dependency on autodetector.
- ISO-2022-JP, due to a security concern.  (Likely to get user backlash on this one, but that's better than waiting for a possible security crisis, removing it then, and getting the backlash anyway.  We will have to recommend installing the extension for JP users who care.)

And then for M58, we should do another triage pass on the CED encoding list and only whitelist 1) pre-existing locale encodings plus 2) encodings like EUC-JP that are known to be prevalent, because there is no need to take on potential compat and security risks for completely obscure and unused encodings.  We can then propose the list we come up with on https://github.com/whatwg/encoding, since it would become a web platform surface with interop consequences.

Comment 8 by aelias@chromium.org, Feb 16 2017

In addition to ISO-2022-JP, let's also exclude the other 7-bit encodings for 57: ISO-2022-{KR,CN}, HZ-GB and UTF-7; it seems they may all have a similar security concern.
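
To make the exclusion idea concrete, here is a minimal sketch (not actual Chromium code; the function names and exact label spellings are illustrative) of filtering a detector guess against a small exclusion set. The real change would ideally fall back to CED's next-most-likely guess rather than giving up entirely.

#include <optional>
#include <string>
#include <unordered_set>

// Hypothetical helper: true if a detector outcome should be discarded.
bool IsExcludedFromAutodetection(const std::string& encoding_label) {
  static const std::unordered_set<std::string> kExcluded = {
      "UTF-8",        // Don't let modern content come to rely on guessing.
      "UTF-7",        // 7-bit; known attack PoCs exist.
      "ISO-2022-JP",  // Escape-based; structurally risky.
      "ISO-2022-KR",
      "ISO-2022-CN",
      "HZ-GB-2312",
  };
  return kExcluded.count(encoding_label) > 0;
}

// Wraps a raw detector guess: excluded outcomes become "no result", so the
// caller falls back to its default encoding instead.
std::optional<std::string> FilterDetectorGuess(
    const std::optional<std::string>& guess) {
  if (guess && IsExcludedFromAutodetection(*guess))
    return std::nullopt;
  return guess;
}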
Related  Issue 647582  - the other 7-bit encodings are already treated as 7-bit ASCII. 
Please apply appropriate OS labels. Thank you.
I think you might want to consider the "second set" after all. What Gecko has is a detector for Japanese, picking between EUC-JP and Shift_JIS, and a detector for Russian and Ukrainian, which Henri is not sure adds value (rather than misfiring) and is looking into removing at https://bugzilla.mozilla.org/show_bug.cgi?id=845791.

I think what our shared goal should be is that 1) we should try to make this deterministic, 2) we should try to end up with the minimum viable solution rather than something complicated, and 3) we should aim to standardize this detection step (and the detector logic).

(What I meant in OP is that the encoding detector you shipped is a significantly more complicated piece of code than what you used to ship, without much data, cross-browser consideration, and standardization discussion.)
Labels: OS-All
> I think you might want to consider the "second set" after all. What Gecko has is a detector for Japanese, picking between EUC-JP and Shift_JIS. 

Yeah, I'm mostly saying EUC-JP is one of the encodings that was listed in the "second set" and we definitely want to preserve autodetection for that one at a minimum.

> I think what our shared goal should be is that  1) we should try to make this deterministic, 2) we should try to end up with the minimum viable solution rather than something complicated, and 3) we should aim to standardize this detection step (and the detector logic).

Agreed.

> (What I meant in OP is that the encoding detector you shipped is a significantly more complicated piece of code than what you used to ship, without much data, cross-browser consideration, and standardization discussion.)

Yeah, I acknowledge that and I'm sorry.  Some background is that encoding has been unowned in Chromium for some time, and Jinsuk and I took on making Chrome for Android have parity with the other OSes here.  We determined that neither the manual encoding selection nor the slow ICU-based autodetector was an adequate solution on a phone, so we had to try something different.

We weren't familiar with all of the history, and it wasn't brought to our attention that this was an area other browsers were at all interested in coordinating on.  When no header is specified, vendors each have a different arbitrary mix of autodetectors, system settings, and user settings, with no obvious preexisting interoperability in place that we would be breaking with a change like this.

I think that as long as system settings and user settings were involved in making the decision, it was somewhat out of the realm of standardizable things.  Although now that Chromium has switched to an approach based 100% on what is served by the website, I agree it would be fruitful to standardize it.

We did gather data (see the table at http://crbug.com/518968#c29 ), although ultimately it has been hard to use data to really drive a decision here because, by definition, there is no way to verify automatically whether a given autodetection result is correct.  We went with CED mostly by virtue of trusting it to be high quality, because it has been refined for years against Google's websearch corpus.  Now that CED is shipped in Chrome, we have the opportunity to try simpler heuristics and compare them to the CED result, treating it as a golden correct autodetection result, whereas we had no source for that kind of information before (the data in the link above was gathered with the ICU autodetector).
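
For what it's worth, a hypothetical offline harness along these lines (not Chromium code; the detectors are passed in as callables, and corpus loading is omitted) could measure how often a simpler candidate heuristic agrees with the CED answer treated as golden:

#include <functional>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical detector interface: bytes in, encoding label out.
using Detector = std::function<std::string(std::string_view)>;

// Fraction of pages on which the candidate detector agrees with the golden one.
double AgreementRate(const std::vector<std::string>& corpus,
                     const Detector& golden,
                     const Detector& candidate) {
  if (corpus.empty())
    return 0.0;
  size_t agree = 0;
  for (const std::string& page : corpus) {
    if (golden(page) == candidate(page))
      ++agree;
  }
  return static_cast<double>(agree) / corpus.size();
}

int main() {
  // Toy stand-ins: a "golden" detector and a trivial candidate heuristic.
  Detector golden = [](std::string_view) { return std::string("windows-1252"); };
  Detector candidate = [](std::string_view) { return std::string("windows-1252"); };
  std::vector<std::string> corpus = {"<html>...</html>"};
  std::cout << "agreement: " << AgreementRate(corpus, golden, candidate) << "\n";
}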
A friendly reminder that the M57 Stable launch is coming VERY soon! Your bug is labelled as a Stable ReleaseBlock; please make sure to land the fix and get it merged into the release branch (2987) ASAP so it gets enough baking time in Beta (before Stable promotion). Thank you!
Project Member

Comment 14 by bugdroid1@chromium.org, Feb 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/89967f02f2402870acb8322ad50c391b3a0971e7

commit 89967f02f2402870acb8322ad50c391b3a0971e7
Author: jinsukkim <jinsukkim@chromium.org>
Date: Fri Feb 17 02:38:21 2017

Do not guess UTF8 encoding

Makes the text encoding detector return false if the detected
encoding is UTF8. UTF8 auto-detection can allow/encourage web publishers
to neglect proper encoding labelling and rely on browser-side encoding
detection. This CL helps prevent that.

BUG=691985

Review-Url: https://codereview.chromium.org/2697213002
Cr-Commit-Position: refs/heads/master@{#451194}

[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/chrome/browser/browser_encoding_browsertest.cc
[delete] https://crrev.com/625c0cc94b7ee0441ac0debc5f295727f526eafd/chrome/test/data/encoding_tests/auto_detect/UTF-8_with_no_encoding_specified.html
[delete] https://crrev.com/625c0cc94b7ee0441ac0debc5f295727f526eafd/chrome/test/data/encoding_tests/auto_detect/expected_results/expected_UTF-8_saved_from_no_encoding_specified.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/editing/spelling/delete-misspelled-word.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/editing/spelling/move-cursor-to-misspelled-word.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/editing/spelling/spelling-insert-newline-between-multi-word-misspelling.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/external/wpt/html/syntax/parsing-html-fragments/the-input-byte-stream-015.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/css3-text/css3-text-decoration/text-underline-position/text-underline-position-under-vertical-expected.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/css3-text/css3-text-decoration/text-underline-position/text-underline-position-under-vertical.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/css3-text/css3-word-break/word-break-break-all-in-span-expected.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/css3-text/css3-word-break/word-break-break-all-in-span.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/dom/Window/invalid-protocol.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/LayoutTests/fast/text/ellipsis-at-edge-of-ltr-text-in-rtl-flow.html
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/89967f02f2402870acb8322ad50c391b3a0971e7/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp

Labels: Merge-Request-57
> Please note that encoding detector has always been there as a part of Chrome Browser.

Interesting. I thought that in WebKit content-based sniffing was only for Japanese (using the ICU sniffer). Have I understood wrong? If not, when did Chrome broaden sniffing compared to WebKit?

> - UTF-8, to avoid trap of modern content developing dependency on
> autodetector.
> - ISO-2022-JP, due to security concern.

Thanks. This is a good start.

> (Likely to get user backlash on this
> one, but that's better than waiting until possible security crisis and then 
> removing it then and then getting the backlash anyway.  We will have to 
> recommend installing the extension for JP users who care.)

Or perhaps not. ISO-2022-JP is rare on the Web. I don't know why browsers support it, but my best guess is that it's leakage from Trident and Gecko being used in email clients.

As for the path forward beyond not guessing UTF-8 or ASCII-incompatible encodings:

There are conflicting goals with unlabeled content:
 1) Avoiding the display of mojibake.
 2) Supporting incremental rendering.
 3) Having deterministic behavior in face of timing differences and differences in how data is split into network packets.

Since Firefox 4, I've been trying hard to prioritize #3, then #2 and only then #1. AFAICT, it's impossible to always have all three.

It bothers me quite a bit that (per https://github.com/whatwg/encoding/issues/68) Chrome decided to give the lowest priority to #3 without prior discussion at the WHATWG.

Chrome seems to treat #1 as though it were a major problem that needs addressing, but Firefox's telemetry suggests that mojibake isn't really that big of a problem on the Web these days. With TLD-based guessing (UI locale-based guessing for .com/net/org) in place (and content-based guessing only for Japanese, Russian and Ukrainian locales), in desktop Firefox 51, the character encoding override menu has not been invoked even once in 99.997% of the sessions. When the override menu is used, 46.73% of the time it is used to override a label (i.e. a situation that in Chrome doesn't involve sniffing and has no UI recourse). 9% of the override uses are for file: URLs, in which case it would make sense to scan *all the bytes* to see if they are UTF-8. 28.81% of the override uses are the user re-overriding a previous user override, which suggests that users don't have much success choosing the right override on the first try.

Unlabeled non-file: URL docs account for only 11.3% of the override menu invocations. (And, again, the menu is completely unused in 99.997% of the sessions.)

Which makes me wonder: What makes Chrome developers believe that a sniffer is needed at all or why a sniffer couldn't be limited to sniffing Shift_JIS vs. EUC-JP when the base guess before sniffing would be Shift_JIS?

Could Chrome get away with sniffing exactly 1024 bytes regardless of timing and network buffer boundaries? Firefox sniffs exactly 1024 bytes for <meta>. This means that incremental rendering stalls for pages that are unlabeled and are deliberately served in deferred chunks. In practice, those are rare enough for Firefox to get away with requiring the authors of those to label their stuff.
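
To make that concrete, a rough sketch (not Blink code; DetectEncoding is a hypothetical stand-in for the real detector call) of deterministic fixed-prefix sniffing: buffer incoming chunks and run the detector exactly once, on exactly the first 1024 bytes, or on the whole document if it ends sooner, so the result cannot depend on packet boundaries or timing.

#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

class DeterministicSniffer {
 public:
  static constexpr size_t kSniffLimit = 1024;

  // Feed a network chunk. Returns a guess only once the prefix is complete.
  std::optional<std::string> OnData(std::string_view chunk) {
    if (done_)
      return std::nullopt;
    const size_t take = std::min(chunk.size(), kSniffLimit - buffer_.size());
    buffer_.append(chunk.data(), take);
    if (buffer_.size() == kSniffLimit)
      return Finish();
    return std::nullopt;  // Keep buffering; never guess early.
  }

  // Call when the document ends before 1024 bytes have arrived.
  std::optional<std::string> OnEndOfStream() {
    return done_ ? std::nullopt : Finish();
  }

 private:
  std::optional<std::string> Finish() {
    done_ = true;
    return DetectEncoding(buffer_);  // Hypothetical content-based detector.
  }

  // Hypothetical stand-in for the real detector; always "no result" here.
  static std::optional<std::string> DetectEncoding(std::string_view) {
    return std::nullopt;
  }

  std::string buffer_;
  bool done_ = false;
};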
> Interesting. I thought that in WebKit content-based sniffing was only for Japanese (using the ICU sniffer). Have I understood wrong? If not, when did Chrome broaden sniffing compared to WebKit?

I'm not sure when exactly Chromium diverged from WebKit, but the status prior to M55 (for several years) is that Chromium, by default, did no sniffing whatsoever and just used a system locale default.  Secondly, if the user ever clicked "Autodetect" in the encoding menu, this acted as a permanent setting, and in that case ICU autodetector would run on 100% of page loads, overriding all headers, and supporting the entire set of ICU encodings.

Starting at M55, we removed all menus and all influence of system locale, and started to run CED autodetector by default but only affecting pages without headers.

To my knowledge, Chromium has never shipped a Japanese-specific sniffing configuration.

> There are conflicting goals with unlabeled content:
> 1) Avoiding the display of mojibake.
> 2) Supporting incremental rendering.
> 3) Having deterministic behavior in face of timing differences and differences in how data is split into network packets.

> Since Firefox 4, I've been trying hard to prioritize #3, then #2 and only then #1. AFAICT, it's impossible to always have all three.

> It bothers me quite a bit that (per https://github.com/whatwg/encoding/issues/68) Chrome decided to give the lowest priority to #3 without prior discussion at the WHATWG.

I would add a concern #4 of high importance to Chromium: avoiding reliance on obscure settings (whether browser menus or system settings).  In our internal debates, we were mostly trying to balance #4 and #1.  It was hard to reach internal consensus on deleting the menu without some kind of mitigation, so that it didn't seem we were throwing mojibake-concerned users under the bus.  I agree #3 is important and we didn't consider it enough, and we're still willing to consider adjusting it.

Thanks for the telemetry data, it's more detailed than what we had gathered on this.

> why a sniffer couldn't be limited to sniffing Shift_JIS vs. EUC-JP when the base guess before sniffing would be Shift_JIS

The one thing I dislike about this policy is the importance of UI-locale-based guessing as a backstop.  It's quite an unusual/surprising factor to influence the rendering result, and anecdotally there are a lot of users with English OS installations whose native language is not English.  Content-based detection still strikes me as the lesser evil, if we make it deterministic.  And in order to eliminate UI-locale guessing, we need to configure our autodetector to target all the world's legacy encodings, not just Japanese.

> Could Chrome get away with sniffing exactly 1024 bytes regardless of timing and network buffer boundaries?

That sounds very reasonable.
Project Member

Comment 18 by sheriffbot@chromium.org, Feb 18 2017

Labels: -Merge-Request-57 Hotlist-Merge-Approved Merge-Approved-57
Your change meets the bar and is auto-approved for M57. Please go ahead and merge the CL to branch 2987 manually. Please contact milestone owner if you have questions.
Owners: amineer@(clank), cmasso@(bling), ketakid@(cros), govind@(desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cherry-picking to the M57 release branch; WIP: https://crrev.com/2705913002
Please merge your change to M57 branch 2987 by 5:00 PM PT Tuesday (02/21) so we can pick it up for this week's beta release. Thank you.
Project Member

Comment 21 by sheriffbot@chromium.org, Feb 21 2017

This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 22 by bugdroid1@chromium.org, Feb 21 2017

Labels: -merge-approved-57 merge-merged-2987
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/50dce2837d94d319260d6ff269c09514d0691fb0

commit 50dce2837d94d319260d6ff269c09514d0691fb0
Author: jinsukkim <jinsukkim@chromium.org>
Date: Tue Feb 21 19:29:21 2017

Do not guess UTF8 encoding

Makes the text encoding detector return false if the detected
encoding is UTF8. UTF8 auto-detection can allow/encourage web publishers
to neglect proper encoding labelling and rely on browser-side encoding
detection. This CL helps prevent that.

BUG=691985
NOTRY=true
NOPRESUBMIT=true

Review-Url: https://codereview.chromium.org/2697213002
Cr-Commit-Position: refs/heads/master@{#451194}
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:closure_compilation

Review-Url: https://codereview.chromium.org/2705913002
Cr-Commit-Position: refs/branch-heads/2987@{#617}
Cr-Branched-From: ad51088c0e8776e8dcd963dbe752c4035ba6dab6-refs/heads/master@{#444943}

[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/chrome/browser/browser_encoding_browsertest.cc
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/chrome/browser/resources/welcome/welcome.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/chrome/browser/resources/welcome/win10/inline.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/chrome/browser/resources/welcome/win10/sectioned.html
[delete] https://crrev.com/2a7a93cf5bfa2d3db02105fb7153fd8dd5dc4a74/chrome/test/data/encoding_tests/auto_detect/UTF-8_with_no_encoding_specified.html
[delete] https://crrev.com/2a7a93cf5bfa2d3db02105fb7153fd8dd5dc4a74/chrome/test/data/encoding_tests/auto_detect/expected_results/expected_UTF-8_saved_from_no_encoding_specified.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/editing/spelling/delete-misspelled-word.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/editing/spelling/move-cursor-to-misspelled-word.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/editing/spelling/spelling-insert-newline-between-multi-word-misspelling.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/fast/css3-text/css3-word-break/word-break-break-all-in-span-expected.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/fast/css3-text/css3-word-break/word-break-break-all-in-span.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/fast/dom/Window/invalid-protocol.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/LayoutTests/fast/text/ellipsis-at-edge-of-ltr-text-in-rtl-flow.html
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/50dce2837d94d319260d6ff269c09514d0691fb0/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp

Please mark as fixed if there is no pending work here. Thank you.
Labels: -Hotlist-Merge-Approved -ReleaseBlock-Stable -M-57
Leaving this open to track the remaining issues: A. remove all other unnecessary encodings from the supported list; B. autodetect based on exactly 1024 bytes or some other fixed number.  Removing the release label as these are less urgent.

Comment 25 by tkent@chromium.org, Feb 22 2017

> - ISO-2022-JP, due to security concern.

What's the concern about ISO-2022-JP?  ISO-2022-JP doesn't map an ASCII byte to another ASCII character. It doesn't have the same kind of risk as UTF-7.

hsivonen@ said on https://github.com/whatwg/encoding/issues/68#issuecomment-274018028 : "ISO-2022-JP has a structure that's known-dangerous on a general level and shown-dangerous for other encodings with similar structure, and ISO-2022-JP is just waiting for someone to demonstrate an attack."  I can't find a very clear explanation of the nature of the problem on the web, although I can find a history of exploits for ISO-2022-KR and ISO-2022-CN.
> ISO-2022-JP doesn't map an ASCII byte to another ASCII character.

It maps non-ASCII characters to ASCII bytes.

> I can find a history of exploits for ISO-2022-KR and ISO-2022-CN

Right. I don't have an attack for -JP, and I failed to find one with a cursory attempt.

The structure of the attack is that the attacker generates some non-ASCII that encodes to ASCII bytes such that interpreted as ASCII those bytes become active content.

Attack PoCs exist for at least ISO-2022-KR, ISO-2022-CN, UTF-7, UTF-16 and HZ. Again, it could be that ISO-2022-JP by luck isn't exploitable, but without an explanation of unexploitability, the reasonable expectation is that eventually someone who tries hard enough finds an exploit as has happened with the other encodings that map non-ASCII characters to ASCII bytes.
> It maps non-ASCII characters to ASCII bytes.

I still don't know how to exploit it precisely.  However, can we avoid it by disabling form submission in pages with auto-detected ISO-2022-JP?

Removing the encoding menu hurt Japanese users. Removing ISO-2022-JP auto-detection might damage Japanese users more.

That would require changing https://encoding.spec.whatwg.org/#get-an-output-encoding (also affects URLs) and might end up breaking pages as well. I think you're correct that it would also reduce (theoretical) risk though.

As for auto-detecting ISO-2022-JP, I believe only Chrome does that and only because it recently started doing so.
> As for auto-detecting ISO-2022-JP, I believe only Chrome does that and only because it recently started doing so.

I think Chrome supported ISO-2022-JP detection even before switching to CED, but auto-detection wasn't enabled by default.

I checked other browser behavior:

IE11: Auto-detect ISO-2022-JP by default
Edge: Auto-detect ISO-2022-JP by default if Windows language is Japanese
Firefox: Auto-detect ISO-2022-JP if Auto-detect>Japanese is enabled.
Safari: Auto-detect ISO-2022-JP if HTTP header or meta charset is one of Shift_JIS, EUC-JP, ISO-2022-JP.  The code looks to auto-detect ISO-2022-JP in other cases, but I couldn't confirm the behavior.
> However, can we avoid it by disabling form submission in pages with auto-detected ISO-2022-JP?

I think it would be bad to invent complex special rules like this--especially when form submission is unlikely to be the only way to exploit ISO-2022-JP if ISO-2022-JP indeed turns out to be exploitable. If ISO-2022-JP indeed is exploitable, one should expect it to be suitable for using script to exfiltrate session cookies via an image load, for example.

Comment 32 by ddw@google.com, Apr 4 2017

My issue 707687 has been merged into this issue; it is about a UTF-8 document that displays as mojibake when viewed in Chrome from a local file (using file://...).
Can someone include it in any testing of a future bug fix? I have attached the file.

The language is obscure and low-resource (Sango), so it is not likely to be identified correctly by an ML model, but the only non-ASCII characters are vowels with a circumflex or diaeresis, so it clearly "looks" like UTF-8.

When correctly viewed, it should look like "Kêtê töngasô, mbênî yê asungba na lê tî ndüzü. Päsä tî mbï, mbï tï na sêse fadë hîo. Gï fadë na pekônî mo bâa na yâ tî li tî mbï ânyama tî terê tî Sanfûlamör ndêngê na ndêngê. Lo sungba fadë awe. Âyê sô atûku na terê tî mbï bîanî ayeke mabôko tî lo tî kôlï, mbênî lê tî lo ôko, mbêni ndurü kâmba tî yâ tî lo, da tî înön tî lo, biö tî gerê tî lo, gbâ tî pëmbë tî lo na mëngä tî lo ngâ. Kôlï ngangü sô akâi bîanî bîanî awe."

When viewed locally, it looks like:
Kêtê töngasô, mbênî yê asungba na lê tî ndüzü. Päsä tî mbï, mbï tï na sêse fadë hîo. Gï fadë na pekônî mo bâa na yâ tî li tî mbï ânyama tî terê tî Sanfûlamör ndêngê na ndêngê. Lo sungba fadë awe. Âyê sô atûku na terê tî mbï bîanî ayeke mabôko tî lo tî kôlï, mbênî lê tî lo ôko, mbêni ndurü kâmba tî yâ tî lo, da tî înön tî lo, biö tî gerê tî lo, gbâ tî pëmbë tî lo na mëngä tî lo ngâ. Kôlï ngangü sô akâi bîanî bîanî awe.
Attachment: src.utf8 (513 bytes)
Rephrasing my personal input in comment 12 of  issue 704800 .

I don't think the practice of "return 'false' when detecting UTF-8" (referred to as "this practice" below) is good.

From what I gathered in this thread, our goals include: 

1) encourage websites to label their content's encoding/charset;
2) do as little autodetection as possible, because it's slow and unstable;
3) make the process more deterministic, so it doesn't rely on the UI locale or system locale.

I understand this practice is meant to discourage unlabelled UTF-8 content, and I totally agree with it in theory. 

However, at the same time, we still do auto-detection for legacy ("ANSI") encodings and return the result unmodified. This means that content using an unlabelled legacy encoding will show just fine (most of the time), but ANY unlabelled UTF-8 content will be shown as mojibake. 

This discrimination substantially encourages web developers to use unlabelled legacy encodings, which of course is not our intent.

Also, this practice doesn't help at all with the second goal. We still run autodetection on UTF-8 files; we just don't return the result. Why bother then?

What about this: just kill the autodetection and assume anything unlabeled is UTF-8? That would help both goals (force developers to label their content, and get rid of autodetection entirely).

jinsu..@chromium.org mentioned in  issue 704800  #c24 that "There are lots of websites left without updates, with documents in legacy text encoding but unlabelled. Running encoding detector is required for them." Yeah, sure.. but what about websites left without updates with documents in UTF-8 text encoding but unlabelled? Why do they deserve to be f*cked but not the other kind of legacy websites? This argument doesn't make any sense at all.


The third point is trivial, but I'd like to mention it: currently, all unlabeled UTF-8 content is shown as mojibake, but the mojibake is UI-locale dependent, because Chrome will try to use the legacy encoding corresponding to the UI locale to decode the (actually UTF-8) content. It's mojibake regardless; it just feels kind of ironic, since one of our goals is to make this independent of the UI locale.


Another fact: while most HTML files are already correctly labelled with their charset (and we want to continue to encourage that), it's very rare for websites to label their JS/CSS files. For example, I found that almost all the JS files from google.com or chromium.org are not charset-labelled (I know they're mostly just using ASCII characters, but we want devs to label the charset no matter what). Given that, maybe we could treat HTML files and other files differently? Just like we're already going to treat local files differently.

> file://

File URLs have different considerations. I think it makes sense for browsers to detect UTF-8 for file:// URLs by examining the entire file before starting parsing (not possible in the HTTP case; opening infinite special "files" like /dev/urandom via file: URLs probably shouldn't be supported).
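
For illustration, whole-file UTF-8 detection for file: URLs could look roughly like this (a sketch only, not actual browser code): read the entire file and treat it as UTF-8 only if every byte sequence is valid, non-overlong UTF-8 with no surrogates and nothing above U+10FFFF; otherwise fall back to the usual default.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Strict UTF-8 validation per the Unicode well-formedness table:
// rejects overlong forms, surrogates, and code points above U+10FFFF.
bool IsValidUtf8(const std::string& bytes) {
  const auto* p = reinterpret_cast<const uint8_t*>(bytes.data());
  const size_t n = bytes.size();
  size_t i = 0;
  while (i < n) {
    const uint8_t b = p[i];
    if (b <= 0x7F) { ++i; continue; }            // ASCII.
    size_t len = 0;
    uint8_t lo = 0x80, hi = 0xBF;                // Range for the first trail byte.
    if (b >= 0xC2 && b <= 0xDF)      len = 2;
    else if (b == 0xE0)              { len = 3; lo = 0xA0; }  // No overlongs.
    else if (b >= 0xE1 && b <= 0xEC) len = 3;
    else if (b == 0xED)              { len = 3; hi = 0x9F; }  // No surrogates.
    else if (b >= 0xEE && b <= 0xEF) len = 3;
    else if (b == 0xF0)              { len = 4; lo = 0x90; }  // No overlongs.
    else if (b >= 0xF1 && b <= 0xF3) len = 4;
    else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // Cap at U+10FFFF.
    else return false;                            // Invalid lead byte.
    if (i + len > n) return false;                // Truncated sequence.
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (size_t k = 2; k < len; ++k)
      if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
    i += len;
  }
  return true;
}

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1], std::ios::binary);
  std::string bytes((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
  std::cout << (IsValidUtf8(bytes) ? "decode as UTF-8\n"
                                   : "fall back to default encoding\n");
}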

> What about this way: just kill the autodetection, and assume any unlabeled as UTF-8? It can help both goals (force the developers to label their content, and get rid of autodetection entirely).

That would break legacy content on the Web.

> Yeah sure.. what about "websites left without updates with documents in utf-8 text encoding but unlabelled"? Why they deserved to be f*cked but not the other kind of legacy websites? This argument doesn't make any sense at all.

Such sites are likely substantially fewer, because previously sites (with non-ASCII content) couldn't get away with not labeling UTF-8.

> Another fact is that, while most of HTML files are already correctly labelled with their charset (and we want to continue to encourage so), it's very rare for websites to label their JS / CSS files.

Unlabeled JS/CSS inherits the encoding from the HTML that references them.
> That would break legacy content on the Web.

Disabling UTF-8 detection will also break some legacy content on the web, as well as some not-so-legacy content.

> Unlabeled JS/CSS inherits the encoding from the HTML that references them.

It doesn't if you open them separately. 

I'm glad you mentioned that, because that is exactly what brought me here. I tried to read the comments in one JS file (irrelevant to the discussion, but the file is https://wikiplus-app.smartgslb.com/Main.js ); however, I can't, since they are all mojibake.
Did ISO-2022-JP actually get excluded? The code comments seem to say it's specifically included in guessing.

Comment 37 by js...@chromium.org, Oct 30 2017

> > Interesting. I thought that in WebKit content-based sniffing was only for Japanese (using the ICU sniffer). Have I understood wrong? If not, when did Chrome broaden sniffing compared to WebKit?

> I'm not sure when exactly Chromium diverged from WebKit, but the status prior to M55 (for several years) is that Chromium, by default, did no sniffing whatsoever and just used a system locale default.  Secondly, if the user ever clicked "Autodetect" in the encoding menu, this acted as a permanent setting, and in that case ICU autodetector would run on 100% of page loads, overriding all headers, and supporting the entire set of ICU encodings.

Blink was different from WebKit in a couple of ways even before Jinsuk's change:

1) It got rid of the Japanese-only encoding detection (which was often wrong) in WebKit. In WebKit, the Japanese encoding detection was (and perhaps still is) always ON. 

2) Blink was more like Firefox on the following points:

   - The encoding detection was OFF by default. Users had to turn it on explicitly. 

   - When auto-detection was off, the encoding assumed for unlabeled documents came from a user preference. 

   - The default value of that user preference was UI-locale dependent (e.g. Shift_JIS if the UI language is Japanese, EUC-KR if the UI language is Korean, Windows-1252 for Western European locales, KOI8-R for Russian, etc.). 

   - However, a user was free to change its value in Settings (i.e. English-UI users could set the assumed encoding for unlabelled documents to Shift_JIS or Big5, or whatever). 

   - As an escape hatch, a user could override the encoding manually via the UI (however, it was broken in Chrome in that its effect was persistent instead of being applied only to the current page). 

With Jinsuk's recent changes, a couple of things have changed:

  - The encoding detector: the ICU encoding detector was replaced by CED (Compact Encoding Detector).
  - Auto-detection is ON for unlabelled documents. There's NO UI to turn it off. 
  - With the above change, the UI to set the assumed encoding for unlabelled documents became obsolete and was removed. 
  - The encoding override UI was also dropped. 

Comment 38 by js...@chromium.org, Oct 30 2017

> With Jinsuk's recent changes, a couple of things have changed:

I have to add a caveat "as far as I remember" regarding Jinsuk's changes (my memory has gotten a bit fuzzy over time). 

Comment 39 by hsivo...@gmail.com, Nov 13 2017

> IE11: Auto-detect ISO-2022-JP by default
> Edge: Auto-detect ISO-2022-JP by default if Windows language is Japanese
> Firefox: Auto-detect ISO-2022-JP if Auto-detect>Japanese is enabled.
> Safari: Auto-detect ISO-2022-JP if HTTP header or meta charset is one of Shift_JIS, EUC-JP, ISO-2022-JP.  The code looks to auto-detect ISO-2022-JP in other cases, but I couldn't confirm the behavior.

Somehow I had managed to miss the above-quoted comment. Sorry.

Is there a link to the test data used for testing IE11 and Edge?

I was able to get IE11 to autodetect EUC-JP but not ISO-2022-JP when choosing the Japanese autodetection option from the encoding context menu. (Tried with U.S. English as the primary Windows language and Japanese as the primary Windows language on a system installed from en-US install media.)

As for Edge, does "Windows language" mean that the system was installed from ja-JP install media? I was unable to get Edge to detect ISO-2022-JP by setting the primary UI language to Japanese and rebooting when the system was installed from en-US install media.

The Safari result is interesting in the sense that if the description is complete, Safari's ISO-2022-JP detection has no overlap with Chrome's and Firefox's ISO-2022-JP detection, which suggests that maybe ISO-2022-JP autodetection isn't essential for Web compat.

Comment 40 by phistuck@gmail.com, Nov 13 2017

#39 - perhaps "Windows language" means the regional settings rather than the user interface language.
Cc: bsittler@chromium.org
#16 - apologies for reviving such an old comment, but are there numbers that quantify "ISO-2022-JP is rare on the Web"?

Comment 42 by js...@chromium.org, Nov 21 2017

re comment 41: see bug 778994 and an internal link in comment 0 there. 
ISO-2022-JP is in the noise. 
