charset / encoding detection can't decode UTF-8 without BOM correctly
Reported by
human.p...@gmail.com,
Mar 24 2017
Issue description

UserAgent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.33 Safari/537.36

Example URL:

Steps to reproduce the problem:
The problem is twofold.

Firstly, some plain text files display correctly when they are served from the web, but not when they are opened as local files. Open this file:
https://gist.githubusercontent.com/fireattack/90ea2aef01c0da9b27558a19f369b060/raw/edd3945952ed96f87a21ac551dfdd13dac208812/test.txt
It displays "我是中文" just fine. Now download the file and re-open the local copy with Chrome. Chrome can't decode it correctly: http://imgur.com/j7SVifz

Secondly, some files can't be decoded correctly whether they are on the web or local. Example: https://wikiplus-app.smartgslb.com/Main.js
All the comments are a mess.

What is the expected behavior?

What went wrong?
The charset detection failed.

Does it occur on multiple sites: Yes
Is it a problem with a plugin? No
Did this work before? N/A
Does this work in other browsers? Yes

Chrome version: 58.0.3029.33 Channel: beta
OS Version: 6.1 (Windows 7, Windows Server 2008 R2)
Flash Version:
,
Mar 24 2017
That makes sense. But I don't remember having trouble reading comments in JS files before. Has anything changed around this?
,
Mar 24 2017
can't repro in m56, so this is a regression
,
Mar 27 2017
I heard that in M57 Chrome introduced logic in the automatic encoding detector to prefer an encoding based on the UI language. It might have affected this. (This is a very wild guess, though.)
,
Mar 27 2017
Well, it does have some effect, to some degree. With the English UI language, it shows "æˆ‘æ˜¯ä¸æ–‡". With the Chinese UI language, it shows "鎴戞槸涓枃". So, as you can see, it made different guesses, but neither of them is right. After all, UTF-8 is language neutral.
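The two garbled strings are consistent with the same UTF-8 bytes being decoded with two different legacy encodings. A minimal Python sketch of that effect, assuming the guesses were windows-1252 (English UI) and GBK (Chinese UI); the thread does not say which encodings the detector actually picked, and `decode_guesses` is a hypothetical helper name:

```python
# UTF-8 bytes of the test file's text (no BOM).
RAW = "我是中文".encode("utf-8")

def decode_guesses(raw: bytes) -> dict:
    """Decode the same bytes three ways and return the results by label."""
    return {
        # Assumed guess under an English UI: windows-1252.
        "western": raw.decode("cp1252"),
        # Assumed guess under a Chinese UI: GBK. errors="replace" keeps the
        # demo from raising if a byte pair has no mapping in Python's codec.
        "chinese": raw.decode("gbk", errors="replace"),
        # The correct, language-neutral decoding.
        "utf8": raw.decode("utf-8"),
    }

results = decode_guesses(RAW)
```

Only the `utf8` entry reproduces the original text; the other two are different wrong readings of identical bytes, which is why switching the UI language changes the garbage but does not fix it.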
,
Mar 28 2017
,
Mar 28 2017
test.txt is an unfortunate side effect of https://codereview.chromium.org/2697213002. I think the encoding detector should still return UTF-8 for local files. The CL was introduced to discourage exactly cases like the second one: unlabelled resources. I'm afraid that part should be kept this way. Please consider using a manual encoding extension (one is mentioned here: https://bugs.chromium.org/p/chromium/issues/detail?id=597488#c70) to view it.
,
Mar 29 2017
> to discourage exactly the cases like the second - unlabelled resources

Fair enough for HTML resources, but can you label the charset for CSS and JS files? Even if you can, does ANY website actually do that? I quickly checked all the JS files on this very website (bugs.chromium.org), and none of them seems to provide an explicit charset.
,
Mar 29 2017
> Fair enough for html resources - but can you label charset for CSS and JS files? Even if you can, did ANY website actually do that?

Certainly such files can have an HTTP header just like any other. We want to encourage websites to add such headers moving forward to avoid ambiguity and mandatory autodetection (which is slow and unreliable). As this is a developer use case, you can install the extension linked above if it bothers you.
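As an illustration of the labelling being asked for, here is a minimal Python sketch of a server-side helper that builds a `Content-Type` value with an explicit `charset` parameter for text resources. The helper name `content_type_for` and the extension table are hypothetical and not taken from any particular server:

```python
import os

# Hypothetical mapping from file extension to MIME type for text resources.
TEXT_TYPES = {
    ".js": "text/javascript",
    ".css": "text/css",
    ".txt": "text/plain",
    ".html": "text/html",
}

def content_type_for(path: str, charset: str = "utf-8") -> str:
    """Return a Content-Type header value, labelling the charset for text."""
    ext = os.path.splitext(path)[1].lower()
    mime = TEXT_TYPES.get(ext)
    if mime is None:
        # Non-text or unknown resources get no charset parameter.
        return "application/octet-stream"
    return f"{mime}; charset={charset}"
```

A response for `Main.js` would then carry `Content-Type: text/javascript; charset=utf-8`, so the browser does not need to guess the encoding.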
,
Mar 29 2017
I know. I'm just saying that even chromium.org is not following that practice (using an HTTP header to specify the charset), which is kinda ironic.
,
Mar 29 2017
Another thing I didn't quite get after reading the whole of issue 691985 is that someone said:

> The one thing I dislike about this policy is the importance of UI-locale based guessing as a backstop

But the reality is that it still seems to rely on the UI locale, as I demonstrated in comment 5. Isn't that the opposite of what we're trying to do here? We want to do as little autodetection as possible, so why not just default everything to UTF-8 and do no autodetection AT ALL, instead of depending on the UI locale and messing up the (albeit unlabelled) UTF-8 content?

By returning false after detecting UTF-8, we not only fail to avoid autodetection (the autodetector still runs once), but also return a wrong result in the end (some arbitrary charset depending on the UI locale, which we dislike). That sounds like a lose-lose situation to me.

Or let me put it this way: which one should we discourage *more*, unlabelled UTF-8 content or unlabelled content in an arbitrary language-specific encoding (be it GB2312, Shift-JIS, EUC-JP, whatever)? The current practice actually favors the second one: if I have unlabelled GB2312 content, it will display just fine because the encoding autodetector can successfully return the correct encoding. So by discouraging unlabelled UTF-8 content, we are in effect encouraging an (IMHO) worse practice: unlabelled ANSI encodings.
,
Mar 29 2017
I understand that it certainly looks ironic. In that sense bugs.chromium.org is also one of those 'legacy' web sites that are not completely modern. I believe it went unnoticed because all the JS files work just fine (they are ASCII-only). Please help build new web sites (if https://wikiplus-app.smartgslb.com is under your control) that follow the good, modern convention being promoted here going forward.
,
Mar 29 2017
,
Mar 29 2017
Issue 703006 has been merged into this issue.
,
Mar 29 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/31472d1f58ffa076a5697e22aded98f2adde994a

commit 31472d1f58ffa076a5697e22aded98f2adde994a
Author: jinsukkim <jinsukkim@chromium.org>
Date: Wed Mar 29 23:24:14 2017

Respect UTF-8 detection result for local file resources

https://crbug.com/2697213002 (not guessing UTF8 encoding) doesn't have to be applied to local file resources. This CL makes such cases an exception to the policy.

BUG= 704800
Review-Url: https://codereview.chromium.org/2784483003
Cr-Commit-Position: refs/heads/master@{#460573}

[modify] https://crrev.com/31472d1f58ffa076a5697e22aded98f2adde994a/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/31472d1f58ffa076a5697e22aded98f2adde994a/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
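The policy described in this thread (discard a UTF-8 guess for unlabelled resources, except for local files) can be sketched roughly as follows. This is an illustrative Python sketch only, not the code in TextEncodingDetector.cpp; the function name, its parameters, and the use of `None` to mean "fall back to the default encoding" are all assumptions:

```python
from typing import Optional
from urllib.parse import urlparse

def accept_detected_encoding(detected: Optional[str], url: str) -> Optional[str]:
    """Return the encoding to use, or None to fall back to the default.

    Hypothetical model of the policy discussed above:
    - a detected UTF-8 result is discarded for non-local resources,
      which are expected to be labelled explicitly;
    - a detected UTF-8 result is respected for file: URLs;
    - any other detected encoding is passed through unchanged.
    """
    if detected is None:
        return None
    is_utf8 = detected.lower().replace("-", "") == "utf8"
    is_local = urlparse(url).scheme == "file"
    if is_utf8 and not is_local:
        return None
    return detected
```

Under this model, `accept_detected_encoding("UTF-8", "file:///C:/test.txt")` keeps the UTF-8 result, while the same guess for an `https:` URL is dropped and a legacy guess such as GBK is kept in both cases.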
,
Mar 29 2017
,
Mar 30 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 31 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f99097de4e9282a1426f184be1078b84f6a08f18

commit f99097de4e9282a1426f184be1078b84f6a08f18
Author: Jinsuk Kim <jinsukkim@chromium.org>
Date: Fri Mar 31 00:43:36 2017

Respect UTF-8 detection result for local file resources

https://crbug.com/2697213002 (not guessing UTF8 encoding) doesn't have to be applied to local file resources. This CL makes such cases an exception to the policy.

BUG= 704800
NOTRY=true
NOPRESUBMIT=true
TBR=tkent@chromium.org
Review-Url: https://codereview.chromium.org/2784483003
Cr-Commit-Position: refs/heads/master@{#460573}
(cherry picked from commit 31472d1f58ffa076a5697e22aded98f2adde994a)
Review-Url: https://codereview.chromium.org/2781363003 .
Cr-Commit-Position: refs/branch-heads/3029@{#505}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/f99097de4e9282a1426f184be1078b84f6a08f18/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/f99097de4e9282a1426f184be1078b84f6a08f18/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
,
Apr 3 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/c93c2e8927446064827541e3b029d7312686dfd1

commit c93c2e8927446064827541e3b029d7312686dfd1
Author: jinsukkim <jinsukkim@chromium.org>
Date: Mon Apr 03 01:07:26 2017

Replace the type of hint url for blink::detectTextEncoding

This helps the method avoid creating a new instance of KURL every time it is invoked.

BUG= 704800
Review-Url: https://codereview.chromium.org/2786913002
Cr-Commit-Position: refs/heads/master@{#461355}

[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp
[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.h
[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/core/html/parser/TextResourceDecoderForFuzzing.h
[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/platform/text/TextEncodingDetector.h
[modify] https://crrev.com/c93c2e8927446064827541e3b029d7312686dfd1/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
,
Apr 4 2017
,
Apr 4 2017
Thanks for the fix! Is there a proper place to discuss this further, especially my concern in comment 12?
,
Apr 4 2017
Issue 707687 has been merged into this issue.
,
Apr 4 2017
human.peng@ There are lots of websites left without updates, with documents in legacy text encodings but unlabelled. Running the encoding detector is required for them. Modern websites are in most cases built with UTF-8 by default, so there's much less concern in that regard. We just want to make sure they come labelled as such. I believe the rationale behind all this was already mentioned in Issue 691985. Please use that bug entry to give your further input.
,
Apr 4 2017
This issue was mistakenly closed. My issue 707687 has been merged into this issue but it is still not working.
,
Apr 4 2017
Note that in the Chromium open source project, marking a bug "fixed" refers to the issue having been resolved in the source. It may be some time (more than 6 weeks) before the fix appears in a stable release and is pushed to users.
,
Apr 5 2017
Verified the fix on Windows 7 & 10 using Chrome M58 #58.0.3029.54 and observed that characters are rendered correctly as per the steps mentioned in comment #0. Attached screencast for reference. Adding TE-Verified labels. Thanks!
Comment 1 by kochi@chromium.org, Mar 24 2017
Status: Untriaged (was: Unconfirmed)