
Issue 704800 link

Starred by 9 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug




charset / encoding detection can't decode UTF-8 without BOM correctly

Reported by human.p...@gmail.com, Mar 24 2017

Issue description

UserAgent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.33 Safari/537.36

Example URL:

Steps to reproduce the problem:
The problem is twofold. 

First, some plain-text files are displayed correctly when they're served from the web, but not when they're opened as local files.

Open this file: https://gist.githubusercontent.com/fireattack/90ea2aef01c0da9b27558a19f369b060/raw/edd3945952ed96f87a21ac551dfdd13dac208812/test.txt

It displays "我是中文" just fine.

Now download this file and re-open the local copy with Chrome.

This time Chrome can't decode it correctly: http://imgur.com/j7SVifz

Second, some files can't be decoded correctly whether they're on the web or local.

Example: https://wikiplus-app.smartgslb.com/Main.js 

All the comments in the file are garbled.

What is the expected behavior?

What went wrong?
The charset detection failed.

Does it occur on multiple sites: Yes

Is it a problem with a plugin? No 

Did this work before? N/A 

Does this work in other browsers? Yes

Chrome version: 58.0.3029.33  Channel: beta
OS Version: 6.1 (Windows 7, Windows Server 2008 R2)
Flash Version:
 
test.txt
12 bytes

Comment 1 by kochi@chromium.org, Mar 24 2017

Components: -Blink Blink>TextEncoding
Status: Untriaged (was: Unconfirmed)
The reason the test.txt on the web was detected correctly was that the
web server added "Content-type: text/plain; charset=utf-8", so Chrome
used that as an encoding hint.
The JS file was served with "Content-type: application/javascript" without
any encoding information.

When you load the test.txt as a local file, as no meta information is given, 
Chrome tried to detect the encoding and failed.
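
To illustrate the difference between the two responses described above, here is a minimal sketch that extracts the charset parameter from a Content-Type value. It uses Python's standard email.message.Message parser; the helper name charset_from_content_type is hypothetical, and this is not how Chrome itself parses headers.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None when missing.
    return msg.get_content_charset()


# Header reported for test.txt on the web: a charset hint is present.
print(charset_from_content_type("text/plain; charset=utf-8"))
# Header reported for Main.js: no charset, so the browser has to guess.
print(charset_from_content_type("application/javascript"))
```

A local file has no such header at all, which is why the detector is the only source of encoding information in that case.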

Can someone take a look at this case and check whether this is a quality
issue with our encoding detector?

That makes sense. But I don't remember having trouble reading comments in JS files before. Has anything changed around this?
Can't repro in M56, so this is a regression.

Comment 4 by kochi@chromium.org, Mar 27 2017

I heard that in M57 Chrome introduced logic that prefers an encoding based on
the UI language in the automatic encoding detector.  It might have affected this.
(This is a very wild guess, though.)

Well, it does have some effect, to some degree.

With English UI language, it shows "我是中文"
With Chinese UI language, it shows "鎴戞槸涓枃"

So, as you can see, it made different guesses, but none of them is right.

After all, UTF-8 is language neutral.
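
The garbling reported above can be reproduced outside the browser. The sketch below is only an illustration: it takes the UTF-8 bytes of the sample text and decodes them with a wrong legacy encoding. The helper name misdecode is hypothetical, and the choice of cp1252 and gbk as the "wrong" encodings is an assumption about what a locale-biased guess might pick.

```python
def misdecode(text, wrong_encoding):
    """Encode text as UTF-8, then decode the bytes with a different encoding."""
    raw = text.encode("utf-8")
    # errors="replace" keeps going if a byte sequence has no mapping
    # in the wrong encoding, instead of raising an exception.
    return raw.decode(wrong_encoding, errors="replace")


original = "我是中文"
print(len(original.encode("utf-8")))   # size in bytes of the sample text
print(misdecode(original, "cp1252"))   # a Western-European guess
print(misdecode(original, "gbk"))      # a Simplified-Chinese guess
```

Both results differ from the original text, and the gbk result begins with the same characters as the garbled string quoted above.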

Comment 6 by jsb...@chromium.org, Mar 28 2017

Cc: jinsuk...@chromium.org
Cc: -jinsuk...@chromium.org tkent@chromium.org aelias@chromium.org jsb...@chromium.org
Owner: jinsuk...@chromium.org
Status: Assigned (was: Untriaged)
The test.txt case is an unfortunate side effect of https://codereview.chromium.org/2697213002.

I think the encoding detector should still return UTF-8 for local files.

The CL was introduced to discourage exactly the cases like the second - unlabelled resources. I'm afraid this should be kept this way. Please consider using a manual encoding extension (such as the one mentioned here: https://bugs.chromium.org/p/chromium/issues/detail?id=597488#c70) to view it.
>to discourage exactly the cases like the second - unlabelled resources

Fair enough for html resources - but can you label charset for CSS and JS files? Even if you can, did ANY website actually do that?

I quickly checked all the JS files on this very website (bugs.chromium.org); none of them seems to provide an explicit charset.

Comment 9 by aelias@chromium.org, Mar 29 2017

> Fair enough for html resources - but can you label charset for CSS and JS files? Even if you can, did ANY website actually do that?

Certainly such files can have an HTTP header just like any other.  We want to encourage websites to add such headers moving forward to avoid ambiguity and mandatory autodetection (which is slow and unreliable).  As this is a developer use case, you can install the extension linked above if it bothers you.
I know. I'm just saying that even chromium.org isn't following that practice (using an HTTP header to specify the charset), which is kind of ironic.

Comment 11 Deleted

Another thing I didn't quite get after reading the whole of issue 691985 is that someone said:

>The one thing I dislike about this policy is the importance of UI-locale based guessing as a backstop

But the reality is, it still seems to rely on UI-locale, as I demonstrated in comment 5.

Isn't it the opposite of what we're trying to do here? We want to do as little autodetection as possible, so why not just default everything to UTF-8 and do no autodetection AT ALL, instead of depending on the UI locale and messing up the (albeit unlabelled) UTF-8 content?

By returning false after detecting UTF-8, we not only fail to avoid autodetection (the autodetector still runs once), but also return the wrong result in the end (some arbitrary charset depending on the UI locale, which we dislike). That sounds like a lose-lose situation to me.

Or let me put it this way: which one should we discourage *more*, unlabelled UTF-8 content or unlabelled content in an arbitrary language-specific encoding (be it GB2312, Shift_JIS, EUC-JP, whatever)? The current practice actually favors the second one: if I have unlabelled GB2312 content, it will show just fine because the encoding detector can successfully return the correct encoding. So by discouraging unlabelled UTF-8 content, we are effectively encouraging an (IMHO) worse practice: unlabelled ANSI encodings.
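
One reason the "try UTF-8 first" idea in this comment is feasible is that UTF-8 is self-validating: bytes in a legacy multi-byte encoding rarely form a valid UTF-8 sequence. The following is a minimal sketch of such a check, not the detector Chrome uses; the function name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict mode raises on any malformed sequence
        return True
    except UnicodeDecodeError:
        return False


text = "我是中文"
print(is_valid_utf8(text.encode("utf-8")))  # unlabelled UTF-8 content
print(is_valid_utf8(text.encode("gbk")))    # the same text in a legacy encoding
```

Pure-ASCII input also passes this check, so a real detector would still need a separate rule for content that contains no non-ASCII bytes.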
I understand that it certainly looks ironic. In that sense bugs.chromium.org is also one of those 'legacy' web sites that are not completely modern. I believe this went unnoticed because all the JS files work just fine (they contain ASCII only).

Please help build new web sites (if https://wikiplus-app.smartgslb.com is under your control) that follow the good, modern convention being promoted here going forward.
Cc: kkaluri@chromium.org jinsuk...@chromium.org
 Issue 704422  has been merged into this issue.
 Issue 703006  has been merged into this issue.
Project Member

Comment 16 by bugdroid1@chromium.org, Mar 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/31472d1f58ffa076a5697e22aded98f2adde994a

commit 31472d1f58ffa076a5697e22aded98f2adde994a
Author: jinsukkim <jinsukkim@chromium.org>
Date: Wed Mar 29 23:24:14 2017

Respect UTF-8 detection result for local file resources

https://crbug.com/2697213002 (not guessing UTF8 encoding)
doesn't have to be applied to local file resources. This
CL makes such cases an exception to the policy.

BUG= 704800 

Review-Url: https://codereview.chromium.org/2784483003
Cr-Commit-Position: refs/heads/master@{#460573}

[modify] https://crrev.com/31472d1f58ffa076a5697e22aded98f2adde994a/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/31472d1f58ffa076a5697e22aded98f2adde994a/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
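
The policy described in this commit message can be sketched roughly as follows. This is an illustration based only on the description in this thread, not the code in TextEncodingDetector.cpp; the function name accept_detected_encoding and its parameters are hypothetical.

```python
from urllib.parse import urlparse


def accept_detected_encoding(detected, url):
    """Decide whether to use the detector's guess (hypothetical policy sketch).

    detected: encoding name guessed by the detector, or None if unknown.
    url: URL of the resource being decoded.
    """
    if detected is None:
        return False
    is_local_file = urlparse(url).scheme == "file"
    # A UTF-8 guess is rejected for unlabelled network resources,
    # but respected for local files, which cannot carry a label.
    if detected.lower() == "utf-8" and not is_local_file:
        return False
    return True


print(accept_detected_encoding("UTF-8", "file:///C:/tmp/test.txt"))
print(accept_detected_encoding("UTF-8", "https://example.com/Main.js"))
print(accept_detected_encoding("GBK", "https://example.com/Main.js"))
```

The example URLs are placeholders chosen for the sketch.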

Labels: Merge-Request-58
Project Member

Comment 18 by sheriffbot@chromium.org, Mar 30 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 19 by bugdroid1@chromium.org, Mar 31 2017

Labels: -merge-approved-58 merge-merged-3029
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f99097de4e9282a1426f184be1078b84f6a08f18

commit f99097de4e9282a1426f184be1078b84f6a08f18
Author: Jinsuk Kim <jinsukkim@chromium.org>
Date: Fri Mar 31 00:43:36 2017

Respect UTF-8 detection result for local file resources

https://crbug.com/2697213002 (not guessing UTF8 encoding)
doesn't have to be applied to local file resources. This
CL makes such cases an exception to the policy.

BUG= 704800 
NOTRY=true
NOPRESUBMIT=true
TBR=tkent@chromium.org

Review-Url: https://codereview.chromium.org/2784483003
Cr-Commit-Position: refs/heads/master@{#460573}
(cherry picked from commit 31472d1f58ffa076a5697e22aded98f2adde994a)

Review-Url: https://codereview.chromium.org/2781363003 .
Cr-Commit-Position: refs/branch-heads/3029@{#505}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/f99097de4e9282a1426f184be1078b84f6a08f18/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/f99097de4e9282a1426f184be1078b84f6a08f18/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp

Status: Fixed (was: Assigned)
Thanks for the fix!

Is there a proper place to discuss this further, especially my concern in comment 12?

Cc: durga.behera@chromium.org
Issue 707687 has been merged into this issue.
human.peng@ There are lots of websites that are no longer updated and that serve documents in legacy text encodings without labels. Running the encoding detector is required for them.

Modern websites are in most cases built with UTF-8 by default, so there's much less concern in that regard. We just want to make sure they come labelled as such.

I believe the rationale behind all this was already mentioned in Issue 691985. Please use that bug entry to give your further input.

Comment 25 by ddw@google.com, Apr 4 2017

This issue was mistakenly closed. My issue 707687 has been merged into this issue but it is still not working.
Note that in the Chromium open source project, marking a bug "fixed" refers to the issue having been resolved in the source. It may be some time (more than 6 weeks) before the fix appears in a stable release and is pushed to users.

Labels: TE-Verified-M58 TE-Verified-58.0.3029.54
Verified the issue on Windows 7 & 10 using Chrome M58 (58.0.3029.54) and observed that characters are rendered correctly when following the steps in comment #0.

Attached screencast for reference. Adding TE-Verified labels.

Thanks!
704800.mp4
156 KB
