Issue 698605

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Windows, Mac
Pri: 1
Type: Bug-Regression




Chrome uses two character encodings in a single document

Reported by bengarre...@gmail.com, Mar 5 2017

Issue description

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36

Steps to reproduce the problem:
1. Open the test case using the `file:///` protocol in a browser tab.
2. Scroll to line 207, column 77 (3rd character from the right)

What is the expected behavior?
The last 10 characters on line 207 should be ÛÛßßÜÜÛÛÛÛ

What went wrong?
The last 10 characters on line 207 are displayed as ÛÛßßÜÜÛллл

Chrome handles the text up to line 207 as Windows-1252. But at column 77 of that line it switches the character encoding to ISO-8859-5 and uses that for the remainder of the document.
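
The symptom is consistent with one run of bytes being decoded with two different tables. A minimal Python sketch of that, assuming the last ten bytes of line 207 are 0xDB, 0xDF and 0xDC (an assumption inferred from the expected output above; the bytes were not read from the attached test file):

```python
# Assumed bytes for the last 10 characters of line 207 (inferred, not read from test.txt).
tail = bytes([0xDB, 0xDB, 0xDF, 0xDF, 0xDC, 0xDC, 0xDB, 0xDB, 0xDB, 0xDB])

# Expected: the whole run decoded as Windows-1252.
expected = tail.decode("cp1252")

# Observed: the first 7 bytes as Windows-1252, the last 3 as ISO-8859-5.
observed = tail[:7].decode("cp1252") + tail[7:].decode("iso8859_5")

print(expected)  # ÛÛßßÜÜÛÛÛÛ
print(observed)  # ÛÛßßÜÜÛллл
```

The byte 0xDB maps to Û in Windows-1252 and to л in ISO-8859-5, which matches the reported switch at column 77.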

Did this work before? Yes. It works in Chrome 50.

Does this work in other browsers? N/A

Chrome version: 56.0.2924.87  Channel: stable
OS Version: 10.0
Flash Version: 

I built a web extension that converts MS-DOS era plain text documents encoded with CP-437 to UTF-8 HTML5 to view in a browser tab.

When using the `file:///` protocol for these files, web browsers such as Firefox treat them as Windows-1252. But Chrome will often handle them as ISO-8859-5 and occasionally GBK, CP-1256, etc.

This two-encoding bug, though, only seems to happen with text that includes ANSI/ECMA-48 escaped control sequences.
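
To illustrate why a mid-document encoding switch matters for a converter like the one described, here is a minimal Python sketch. It assumes the converter recovers the original bytes from text the browser decoded as Windows-1252 and then decodes those bytes as CP-437. The function name `reinterpret_as_cp437` is hypothetical and is not taken from the extension.

```python
def reinterpret_as_cp437(text: str) -> str:
    """Recover the raw bytes of text decoded as Windows-1252, then decode them as CP-437."""
    return text.encode("cp1252").decode("cp437")

print(reinterpret_as_cp437("ÛÛßßÜÜÛÛÛÛ"))  # ██▀▀▄▄████

# Characters produced by an ISO-8859-5 decode have no Windows-1252 byte,
# so the round trip fails for the affected part of the document.
try:
    reinterpret_as_cp437("ÛÛßßÜÜÛллл")
except UnicodeEncodeError as err:
    print("cannot recover bytes:", err.reason)
```

Python's `cp1252` codec leaves a few byte values undefined, so a real converter would likely need a fuller mapping than this sketch uses.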

 
Labels: Needs-Triage-M56
Labels: Needs-Feedback
NextAction: 2017-03-13
Can you attach the test case?

Comment 3 Deleted

Project Member

Comment 4 by sheriffbot@chromium.org, Mar 6 2017

Cc: jsb...@chromium.org
Labels: -Needs-Feedback
Thank you for providing more feedback. Adding requester "jsbell@chromium.org" to the cc list and removing "Needs-Feedback" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Hi, sorry, it seems Monorail doesn't like the upload. I'll try with a renamed file.
test.txt (33.2 KB)
chrome.png (78.3 KB)
notepad.png (68.3 KB)
NextAction: ----
Thanks! (Also, the pictures are extra helpful.)
Cc: jmukthavaram@chromium.org
Labels: -Pri-2 -Needs-Triage-M56 hasbisect-per-revision M-59 OS-Linux OS-Mac Pri-1
Owner: jinsuk...@chromium.org
Status: Assigned (was: Unconfirmed)
Able to reproduce this issue on Windows 7, Windows 10, Mac 10.12.3, and Linux Ubuntu 14.04 with Chrome stable 56.0.2924.87 and Canary 59.0.3033.0.
Manual Bisect:
Good: 54.0.2803.0 (revision 406716)
Bad: 54.0.2804.0 (revision 407025)

Per revision Bisect Tool Info:
You are probably looking for a change made after 407004 (known good), but no later than 407005 (first known bad).
CHANGELOG URL:
The script might not always return a single CL as the suspect, as some perf builds might be missing due to failures.
https://chromium.googlesource.com/chromium/src/+log/79f7b784a97cbb22f11064a05b621b0def87eab3..f0829bf6d80a9109b399580fe48d8c3e1c66eeed

Review-Url: https://codereview.chromium.org/1894913002

jinsukkim@, kindly take a look, and please help us reassign this issue to the right owner if it is not related to this change.

Note: Tried the bisect on 2 machines (Mac & Windows) and got the same CL.
Thanks!
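
For readers unfamiliar with bisecting, the narrowing from the manual range (406716 to 407025) to the adjacent pair 407004/407005 can be sketched as a binary search. This is a minimal Python illustration, not the Chromium bisect tool; `bisect_revisions` and `reproduces_bug` are hypothetical names, and the predicate simply encodes the result reported above.

```python
from typing import Callable, Tuple

def bisect_revisions(good: int, bad: int, is_bad: Callable[[int], bool]) -> Tuple[int, int]:
    """Narrow (good, bad) to adjacent revisions, assuming every revision
    from the first bad one onward is bad."""
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return good, bad

# Stand-in for "build this revision and open the test case"; the threshold
# encodes the result reported above and is not an independent measurement.
def reproduces_bug(revision: int) -> bool:
    return revision >= 407005

print(bisect_revisions(406716, 407025, reproduces_bug))  # (407004, 407005)
```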
Status: Started (was: Assigned)
What happened is:

- The text is "broken" from the encoding detector's POV; it returned the most likely text encoding (CP932).
- Blink refused to accept it since it is not in the WHATWG Encoding Standard. The text encoding remained the default (windows-1252).
- Blink kept trying detection; at the last chunk, the detector returned "ISO-8859-5", which was accepted and caused the bug.

The issue can be resolved by extending the list of encodings not accepted by Blink so that the detector returns ASCII in place of the detected encoding, which prevents this kind of surprising result from happening. On it now...
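
The sequence described above can be modeled with a small Python sketch. It is a hypothetical simulation, not Blink or CED code: the function name, the per-chunk guesses, and the tiny accepted set are all assumptions made for illustration, as is the premise that detection stops once a guess is accepted.

```python
# Hypothetical model of the behaviour described above; names are illustrative
# and do not correspond to Blink or CED APIs.
DEFAULT_ENCODING = "windows-1252"
ACCEPTED = {"windows-1252", "ISO-8859-5", "US-ASCII"}  # tiny illustrative subset

def encodings_used(guesses, map_unsupported_to_ascii):
    """Return the encoding applied to each chunk, given the detector's guess per chunk."""
    accepted = None
    used = []
    for guess in guesses:
        if accepted is None:  # keep detecting until a guess is accepted
            if guess not in ACCEPTED and map_unsupported_to_ascii:
                guess = "US-ASCII"
            if guess in ACCEPTED:
                # The WHATWG Encoding Standard treats the ASCII labels as windows-1252.
                accepted = DEFAULT_ENCODING if guess == "US-ASCII" else guess
        used.append(accepted or DEFAULT_ENCODING)
    return used

guesses = ["CP932", "CP932", "ISO-8859-5"]  # assumed per-chunk detector output
print(encodings_used(guesses, map_unsupported_to_ascii=False))
# ['windows-1252', 'windows-1252', 'ISO-8859-5']
print(encodings_used(guesses, map_unsupported_to_ascii=True))
# ['windows-1252', 'windows-1252', 'windows-1252']
```

In this model, mapping the rejected guess to ASCII ends detection on the first chunk, so a later guess can no longer change the encoding partway through the document.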
Cc: tkent@chromium.org
Awesome analysis, jinsukkim@! Thanks for taking this on.
Project Member

Comment 13 by bugdroid1@chromium.org, Mar 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a86006db0adbe5bda6789d2a14d0805ce6273596

commit a86006db0adbe5bda6789d2a14d0805ce6273596
Author: jinsukkim <jinsukkim@chromium.org>
Date: Wed Mar 08 22:34:18 2017

Convert non-WHATWG text encoding to ASCII

CED is returning text encodings not supported by WHATWG
standard, which Blink refused to accept. It can cause
an unexpected bug. This CL converts those encoding to ASCII
so that raw bytes of the text remain intact.

BUG= 698605 

Review-Url: https://codereview.chromium.org/2737033003
Cr-Commit-Position: refs/heads/master@{#455560}

[modify] https://crrev.com/a86006db0adbe5bda6789d2a14d0805ce6273596/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/a86006db0adbe5bda6789d2a14d0805ce6273596/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp

Status: Fixed (was: Started)
Labels: Merge-Request-58
Project Member

Comment 16 by sheriffbot@chromium.org, Mar 9 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(clank), cmasso@(bling), bhthompson@(cros), govind@(desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Project Member

Comment 17 by bugdroid1@chromium.org, Mar 9 2017

Labels: -merge-approved-58 merge-merged-3029
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4c33f0d70750323bef4101380076570d1f2e803c

commit 4c33f0d70750323bef4101380076570d1f2e803c
Author: Jinsuk Kim <jinsukkim@chromium.org>
Date: Thu Mar 09 23:25:13 2017

Convert non-WHATWG text encoding to ASCII

CED is returning text encodings not supported by WHATWG
standard, which Blink refused to accept. It can cause
an unexpected bug. This CL converts those encoding to ASCII
so that raw bytes of the text remain intact.

BUG= 698605 
NOTRY=true
NOPRESUBMIT=true
TBR=tkent@chromium.org

Review-Url: https://codereview.chromium.org/2737033003
Cr-Commit-Position: refs/heads/master@{#455560}
(cherry picked from commit a86006db0adbe5bda6789d2a14d0805ce6273596)

Review-Url: https://codereview.chromium.org/2742873002 .
Cr-Commit-Position: refs/branch-heads/3029@{#97}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/4c33f0d70750323bef4101380076570d1f2e803c/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/4c33f0d70750323bef4101380076570d1f2e803c/third_party/WebKit/Source/platform/text/TextEncodingDetectorTest.cpp

Labels: TE-Verified-M58 TE-Verified-58.0.3029.14
Tested the issue on Windows 7, Mac 10.12.3, and Linux Ubuntu 14.04 using Chrome version 58.0.3029.14 with the steps mentioned in comment #0. Observed that the last 10 characters on line 207 are displayed as "ÛÛßßÜÜÛÛÛÛ". Hence adding TE-Verified labels.
Please find the attached screencast for the same.
Thank you!

698605.mp4 (2.9 MB)
Project Member

Comment 19 by bugdroid1@chromium.org, Mar 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dd643ddc92e12037e88c4d9e004a868e90c3e5ca

commit dd643ddc92e12037e88c4d9e004a868e90c3e5ca
Author: jinsukkim <jinsukkim@chromium.org>
Date: Tue Mar 14 03:19:01 2017

Enable CED HTML5 mode

Set CED(Compact Encoding Detector) to HTML5 mode in which
the detector always returns WHATWG-compliant text encoding
as output. Call sites now don't have to do post-detection
conversion, which is removed in this CL together.

BUG= 698605 

Review-Url: https://codereview.chromium.org/2746843002
Cr-Commit-Position: refs/heads/master@{#456610}

[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/base/i18n/encoding_detection.cc
[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp
[modify] https://crrev.com/dd643ddc92e12037e88c4d9e004a868e90c3e5ca/third_party/ced/BUILD.gn
