Issue 650377

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug


Hotlists containing this issue:
Encoding-Detection


charset detection is unstable

Project Member Reported by toyoshim@chromium.org, Sep 26 2016

Issue description

Version: 54.0.2840.27 (Official Build) beta (64-bit)
OS: Linux

What steps will reproduce the problem?
(1) Visit http://www.mt.cs.keio.ac.jp/person/narita/lv/index_ja.html
(2) Do shift-Reload multiple times

What is the expected output?
Chrome shows the content the same way every time.

What do you see instead?
Sometimes Chrome shows it in Japanese, but sometimes it does not (it shows broken, strange characters, e.g., $BCx:n8"I=<().
It seems that Chrome always shows it correctly in Japanese if it comes from the disk cache or if the network is fast enough.
 
Components: Blink>HTML>Parser
Labels: Hotlist-Loading
I do not know which component owns charset detection, so let me set Blink>HTML>Parser tentatively.

Comment 2 by tkent@chromium.org, Sep 30 2016

Cc: jinsuk...@chromium.org
Components: -Blink>HTML>Parser Blink>TextEncoding
Cc: -jinsuk...@chromium.org
Owner: jinsuk...@chromium.org
Status: Assigned (was: Untriaged)
Cc: toyoshim@chromium.org
The page is encoded in ISO-2022-JP but unlabelled. The detector needs at least 1772 bytes to detect the right encoding for the page. In ordinary cases the data chunk is big enough (2305, 3749, etc.), but repeated forced refreshes produce a smaller chunk (979 bytes), which causes the detector to consume all the bytes and still return a wrong result (ASCII_7BIT). The symptom seems to depend on the site's bandwidth. Other similar (unlabelled ISO-2022-JP) sites such as http://itojun.org/paper/keio-doctor97.html do not suffer from this issue.
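
To illustrate why a short first chunk can look like plain ASCII, here is a minimal Python sketch. It is not the detector Chrome uses; the sample page, the 50-byte chunk size, and the helper name has_iso2022jp_escape are made up for illustration. It relies only on the fact that ISO-2022-JP is a 7-bit encoding that switches character sets with escape sequences.

```python
# Escape sequences that switch ISO-2022-JP away from ASCII
# (JIS X 0208-1983, JIS X 0208-1978, JIS X 0201 Roman).
JP_ESCAPES = (b"\x1b$B", b"\x1b$@", b"\x1b(J")


def has_iso2022jp_escape(chunk: bytes) -> bool:
    """Hypothetical helper: True if the chunk contains an ISO-2022-JP escape."""
    return any(seq in chunk for seq in JP_ESCAPES)


# Made-up page: ASCII markup and padding first, Japanese text later.
page = (
    "<html><head><title>test</title></head><body>"
    + " " * 100
    + "日本語のページ</body></html>"
).encode("iso2022_jp")

# Every byte is below 0x80, so the stream is valid 7-bit ASCII as well.
is_seven_bit = all(b < 0x80 for b in page)

# A short first chunk ends before the first escape sequence.
first_chunk = page[:50]

# What a reader sees if the bytes are rendered as ASCII (ESC is invisible).
visible = page.decode("ascii").replace("\x1b", "")

print(is_seven_bit)
print(has_iso2022jp_escape(first_chunk), has_iso2022jp_escape(page))
print(visible[-40:])
```

The last line prints text starting with $B, which matches the kind of broken output described in the report.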

Experiments show that |isReliable|, an output flag of the detector API, is not a good indicator for this. What can be used to detect this situation is the other output, |consumedBytes|: when it equals the input |length|, the output encoding should not be trusted.
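
A rough sketch of that check, in Python for brevity. DetectionResult and can_trust are hypothetical names that only mirror the |isReliable| and |consumedBytes| outputs mentioned above; this is not the real detector API, and the consumed count for the large chunk is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    """Hypothetical container for the detector outputs discussed above."""
    encoding: str
    is_reliable: bool
    consumed_bytes: int


def can_trust(result: DetectionResult, input_length: int) -> bool:
    """Distrust the result if the detector consumed the entire input.

    Consuming everything suggests the detector ran out of data before it
    found decisive evidence, regardless of what is_reliable says.
    """
    return result.consumed_bytes < input_length


# Small chunk from a forced refresh: all 979 bytes consumed, wrong answer.
short = DetectionResult("ASCII_7BIT", True, 979)

# Larger chunk (2305 bytes); assume the detector stopped after 1772 bytes.
full = DetectionResult("ISO_2022_JP", True, 1772)

print(can_trust(short, 979))
print(can_trust(full, 2305))
```

With this check the caller could wait for more data before committing to an encoding, instead of relying on is_reliable alone.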


Status: Started (was: Assigned)
On a related note, there was a suggestion (in the form of a question) in Issue 647582 about dropping auto-detection of ISO-2022-JP, which is the only 7-bit encoding supported by Blink (conforming to WHATWG). Some stats would be useful for making the right decision.
Thank you for the investigation.

Yeah, we can still find a few legacy pages written in ISO-2022-JP like this one, but that is rare now, and modern sites never use it.
The workaround I mention in #4 comes with a caveat: for such a short input, it can't tell actual US-ASCII text from ISO-2022-JP-encoded text. The detector consumes the entire input and responds 'reliable US-ASCII' for both.
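
The caveat can be reproduced with a toy stand-in for the detector. toy_detect below is a deliberately naive assumption, not the real detector; it only looks for the ESC $ B escape sequence, and both sample pages are made up.

```python
def toy_detect(chunk: bytes):
    """Toy detector: returns (encoding, is_reliable, consumed_bytes)."""
    pos = chunk.find(b"\x1b$B")
    if pos >= 0:
        # Stop right after the escape sequence that proves ISO-2022-JP.
        return ("ISO-2022-JP", True, pos + 3)
    # No evidence found: the whole input was scanned.
    return ("US-ASCII", True, len(chunk))


# A page that really is plain ASCII.
ascii_page = b"<html><body>" + b"plain text " * 20 + b"</body></html>"

# An ISO-2022-JP page whose Japanese text starts after a long ASCII prefix.
jp_page = ("<html><body>" + "x" * 300 + "日本語</body></html>").encode("iso2022_jp")

# Simulate a short network chunk of the same size as the ASCII page.
truncated = jp_page[: len(ascii_page)]

r_ascii = toy_detect(ascii_page)
r_truncated = toy_detect(truncated)

print(r_ascii)
print(r_truncated)
print(toy_detect(jp_page)[0])
```

Both short inputs produce the same result with the whole input consumed, so a check on consumed bytes flags the genuine ASCII page as untrustworthy too.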

Considering the limited situations in which the bug happens, I think I'll leave the bug open for now, as it is not affecting most users. I will keep track of Issue 647582 in the meantime.


Status: Assigned (was: Started)
