Chrome uses two character encodings in a single document
Reported by
bengarre...@gmail.com,
Mar 5 2017
Issue description

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36

Steps to reproduce the problem:
1. Open the test case using the `file:///` protocol in a browser tab.
2. Scroll to line 207, column 77 (3rd character from the right)

What is the expected behavior?
The last 10 characters on line 207 should be ÛÛßßÜÜÛÛÛÛ

What went wrong?
The last 10 characters on line 207 return ÛÛßßÜÜÛллл
Chrome handles the text until line 207 as Windows-1252. But at col 77 it switches character encoding to ISO-8859-5 and uses that for the remainder of the document.

Did this work before? Yes Does work in 50

Does this work in other browsers? N/A

Chrome version: 56.0.2924.87 Channel: stable
OS Version: 10.0
Flash Version:

I built a web extension that converts MS-DOS era plain text documents encoded with CP-437 to UTF-8 HTML5 to view in a browser tab. When using the `file:///` protocol for these files, web browsers such as Firefox manage them as Windows-1252. But Chrome will often handle them with ISO-8859-5 and occasionally GBK, CP-1256, etc. This two character encoding bug though only seems to happen with text that includes ANSI/ECMA-48 escaped control sequences.
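The mismatch described above can be reproduced outside the browser. The following is a minimal Python sketch, not taken from the test case: the byte values are inferred from the expected Windows-1252 text, and `decode_with_switch` is a hypothetical helper that imitates a decoder changing encoding partway through the data.

```python
# Bytes inferred from the expected Windows-1252 rendering of the end of
# line 207; the actual test file is not part of this thread.
TAIL = "ÛÛßßÜÜÛÛÛÛ".encode("cp1252")  # hex: db db df df dc dc db db db db


def decode_with_switch(data: bytes, switch_at: int) -> str:
    """Decode as Windows-1252 up to switch_at, then as ISO-8859-5.

    Hypothetical helper that imitates the reported mid-document switch.
    """
    head = data[:switch_at].decode("cp1252")
    tail = data[switch_at:].decode("iso8859_5")
    return head + tail


expected = TAIL.decode("cp1252")        # 'ÛÛßßÜÜÛÛÛÛ'
observed = decode_with_switch(TAIL, 7)  # 'ÛÛßßÜÜÛллл'
original = TAIL.decode("cp437")         # '██▀▀▄▄████' (CP-437 block art)
```

The same byte 0xDB is Û in Windows-1252, л in ISO-8859-5, and a full block in CP-437, which is why an extension that expects Windows-1252 text produces the wrong glyphs once the browser switches encoding.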
,
Mar 6 2017
Can you attach the test case?
,
Mar 6 2017
Thank you for providing more feedback. Adding requester "jsbell@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 6 2017
Hi, sorry it seems Monorail doesn't like the upload, I'll try with a renamed file.
,
Mar 7 2017
,
Mar 7 2017
Thanks! (Also, the pictures are extra helpful.)
,
Mar 7 2017
Able to reproduce this issue on Windows 7, Windows 10, Mac 10.12.3, and Linux Ubuntu 14.04 with Chrome stable version 56.0.2924.87 and Canary 59.0.3033.0.

Manual Bisect:
Good - 54.0.2803.0 - Revision 406716
Bad - 54.0.2804.0 - Revision 407025

Per-revision Bisect Tool Info:
You are probably looking for a change made after 407004 (known good), but no later than 407005 (first known bad).

CHANGELOG URL:
The script might not always return a single CL as the suspect, as some perf builds might be missing due to failures.
https://chromium.googlesource.com/chromium/src/+log/79f7b784a97cbb22f11064a05b621b0def87eab3..f0829bf6d80a9109b399580fe48d8c3e1c66eeed

Review-Url: https://codereview.chromium.org/1894913002

jinsukkim@ Kindly take a look, and please help us reassign this issue to the right owner if it is not related to this change.

Note: Tried the bisect on 2 machines (Mac & Windows) and got the same CL.

Thanks!
,
Mar 7 2017
,
Mar 8 2017
What happened is:
- The text is "broken" from the encoding detector's POV; it returned the most likely text encoding (CP932).
- Blink refused to accept it since it is not in the WHATWG encoding standard. The text encoding remained the default (windows-1252).
- Blink kept trying detection.
- At the last chunk, the detector returned "ISO-8859-5", which was accepted and caused the bug.

The issue can be resolved by extending the list of encodings not accepted by Blink so that the detector will return ASCII in place of the detected encoding, which prevents this kind of surprising result from happening. On it now...
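The sequence above can be illustrated with a toy model. This Python sketch is not Blink's or CED's code: the detector is replaced by a stubbed list of per-chunk guesses, `ACCEPTED` is a hypothetical stand-in for the set of names the consumer accepts, and the assumption that an accepted result stops further detection is mine. The one grounded detail is that the WHATWG Encoding Standard treats the ASCII label as windows-1252.

```python
DEFAULT = "windows-1252"

# Hypothetical subset of accepted names; not Blink's actual list.
ACCEPTED = {"windows-1252", "ISO-8859-5", "US-ASCII"}

# Map the names used here to Python codec names.
PY_CODEC = {"windows-1252": "cp1252", "ISO-8859-5": "iso8859_5"}


def decode_chunks(chunks, guesses, map_unsupported_to_ascii):
    """Decode chunks while a stubbed detector offers one guess per chunk.

    Assumption: once a guess is accepted, detection stops for the rest
    of the document.
    """
    encoding, locked, out = DEFAULT, False, []
    for chunk, guess in zip(chunks, guesses):
        if not locked:
            if guess not in ACCEPTED and map_unsupported_to_ascii:
                guess = "US-ASCII"  # the fix: report ASCII instead
            if guess in ACCEPTED:
                # The WHATWG standard maps the ASCII label to windows-1252.
                encoding = DEFAULT if guess == "US-ASCII" else guess
                locked = True
        out.append(chunk.decode(PY_CODEC[encoding]))
    return "".join(out)


chunks = [b"\xdb\xdb\xdf\xdf\xdc\xdc\xdb", b"\xdb\xdb\xdb"]
guesses = ["CP932", "ISO-8859-5"]  # stubbed detector output per chunk

before_fix = decode_chunks(chunks, guesses, map_unsupported_to_ascii=False)
after_fix = decode_chunks(chunks, guesses, map_unsupported_to_ascii=True)
```

In this model `before_fix` is 'ÛÛßßÜÜÛллл' because the rejected CP932 guess leaves detection running until ISO-8859-5 arrives, while `after_fix` is 'ÛÛßßÜÜÛÛÛÛ' because the first guess settles on the default and the raw bytes are decoded consistently.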
,
Mar 8 2017
,
Mar 8 2017
Awesome analysis jinsukkim@! Thanks for taking this on.
,
Mar 8 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a86006db0adbe5bda6789d2a14d0805ce6273596

commit a86006db0adbe5bda6789d2a14d0805ce6273596
Author: jinsukkim <jinsukkim@chromium.org>
Date: Wed Mar 08 22:34:18 2017

Convert non-WHATWG text encoding to ASCII

CED is returning text encodings not supported by WHATWG standard, which Blink refused to accept. It can cause an unexpected bug. This CL converts those encoding to ASCII so that raw bytes of the text remain intact.

BUG= 698605
Review-Url: https://codereview.chromium.org/2737033003
Cr-Commit-Position: refs/heads/master@{#455560}

[modify] https://crrev.com/a86006db0adbe5bda6789d2a14d0805ce6273596/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/a86006db0adbe5bda6789d2a14d0805ce6273596/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
,
Mar 9 2017
,
Mar 9 2017
,
Mar 9 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions. Owners: amineer@(clank), cmasso@(bling), bhthompson@(cros), govind@(desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 9 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/4c33f0d70750323bef4101380076570d1f2e803c

commit 4c33f0d70750323bef4101380076570d1f2e803c
Author: Jinsuk Kim <jinsukkim@chromium.org>
Date: Thu Mar 09 23:25:13 2017

Convert non-WHATWG text encoding to ASCII

CED is returning text encodings not supported by WHATWG standard, which Blink refused to accept. It can cause an unexpected bug. This CL converts those encoding to ASCII so that raw bytes of the text remain intact.

BUG= 698605
NOTRY=true
NOPRESUBMIT=true
TBR=tkent@chromium.org
Review-Url: https://codereview.chromium.org/2737033003
Cr-Commit-Position: refs/heads/master@{#455560}
(cherry picked from commit a86006db0adbe5bda6789d2a14d0805ce6273596)
Review-Url: https://codereview.chromium.org/2742873002 .
Cr-Commit-Position: refs/branch-heads/3029@{#97}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/4c33f0d70750323bef4101380076570d1f2e803c/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/4c33f0d70750323bef4101380076570d1f2e803c/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp
,
Mar 10 2017
Tested the issue on Windows 7, Mac 10.12.3, and Linux Ubuntu 14.04 using Chrome version 58.0.3029.14 with the steps mentioned in comment #0. Observed that the last 10 characters on line 207 are displayed as "ÛÛßßÜÜÛÛÛÛ". Hence adding the TE-Verified labels. Please find the attached screencast for the same. Thank you!
,
Mar 14 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/dd643ddc92e12037e88c4d9e004a868e90c3e5ca

commit dd643ddc92e12037e88c4d9e004a868e90c3e5ca
Author: jinsukkim <jinsukkim@chromium.org>
Date: Tue Mar 14 03:19:01 2017

Enable CED HTML5 mode

Set CED(Compact Encoding Detector) to HTML5 mode in which the detector always returns WHATWG-compliant text encoding as output. Call sites now don't have to do post-detection conversion, which is removed in this CL together.

BUG= 698605
Review-Url: https://codereview.chromium.org/2746843002
Cr-Commit-Position: refs/heads/master@{#456610}

[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/base/i18n/encoding_detection.cc
[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/third_party/ced/BUILD.gn
Comment 1 by nyerramilli@chromium.org
, Mar 6 2017