charset detection is unstable
Issue description
Version: 54.0.2840.27 (Official Build) beta (64-bit)
OS: Linux

What steps will reproduce the problem?
(1) Visit http://www.mt.cs.keio.ac.jp/person/narita/lv/index_ja.html
(2) Do shift-Reload multiple times

What is the expected output?
Chrome shows the content the same way every time.

What do you see instead?
Sometimes Chrome shows the page in Japanese, but sometimes it does not (it shows broken, strange characters, e.g., $BCx:n8"I=<(). Chrome seems to always show it in Japanese correctly if the page comes from the disk cache or the network is fast enough.
Sep 30 2016
The page is encoded in ISO-2022-JP but unlabelled. The detector needs at least 1772 bytes to detect the right encoding for the page. In ordinary cases the data chunk is big enough (2305 bytes, 3749 bytes, etc.), but repeated forced refreshes produce a smaller chunk (979 bytes), which causes the detector to consume all the bytes and still return a wrong result (ASCII_7BIT). The symptom seems to depend on the site's bandwidth: other similar (unlabelled ISO-2022-JP) sites such as http://itojun.org/paper/keio-doctor97.html do not suffer from this issue. Experiments show that |isReliable|, an output flag of the detector API, is not a good indicator for this. What can be used to detect this situation is the other output, |consumedBytes|: when it equals the input |length|, the output encoding should not be trusted.
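The |consumedBytes| check described above can be sketched roughly as follows. This is only an illustration in Python, not the real detector: `toy_detect`, `DetectionResult` and `is_trustworthy` are hypothetical names, the toy logic merely looks for the ISO-2022-JP escape sequence ESC $ B, and the 1769-byte ASCII header is an assumption chosen so that the toy needs 1772 bytes, mirroring the figure reported for the page.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    encoding: str
    is_reliable: bool
    consumed_bytes: int


def toy_detect(chunk: bytes) -> DetectionResult:
    """Hypothetical stand-in for the real detector (assumption, not its API)."""
    esc = chunk.find(b"\x1b$B")  # ESC $ B switches to JIS X 0208
    if esc != -1:
        # Stopped early: the escape sequence settles the question.
        return DetectionResult("ISO-2022-JP", True, esc + 3)
    # Saw nothing but 7-bit bytes and ran out of input.
    return DetectionResult("ASCII_7BIT", True, len(chunk))


def is_trustworthy(result: DetectionResult, length: int) -> bool:
    """Distrust a result when the detector consumed the whole input."""
    return result.consumed_bytes < length


# Illustrative page: a long ASCII header, then Japanese text, then padding.
page = b"x" * 1769 + "日本語".encode("iso2022_jp") + b"y" * 1000

small = toy_detect(page[:979])   # chunk size seen after forced refreshes
large = toy_detect(page[:2305])  # chunk size seen in ordinary loads

print(small.encoding, is_trustworthy(small, 979))
print(large.encoding, is_trustworthy(large, 2305))
```

In this sketch both results report `is_reliable=True`, so only the comparison of `consumed_bytes` against the input length separates the truncated case from the ordinary one.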
Sep 30 2016
On a related note, there was a suggestion (in the form of a question) about dropping auto-detection of ISO-2022-JP, which is the only 7-bit encoding supported by Blink (conforming to WHATWG); see Issue 647582. Some statistics would be useful for making the right decision.
Sep 30 2016
Thank you for the investigation. Yeah, we can still find a few legacy pages written in ISO-2022-JP like this one, but it rarely happens now, and modern sites never use it.
Oct 3 2016
The workaround I mentioned in #4 comes with a caveat: for such short input it can't tell actual US-ASCII text from ISO-2022-JP-encoded text. The detector consumes the entire input and responds 'reliable US-ASCII' for both. Considering the limited situations in which the bug happens, I think I'll leave the bug open for now, as it does not affect most users. I will keep track of Issue 647582 in the meantime.
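The ambiguity behind this caveat can be shown with Python's standard `iso2022_jp` codec. The snippet is a minimal sketch under assumptions: the two sample documents and the helper `looks_like_plain_ascii` are made up for illustration and are not part of Chrome or its detector.

```python
def looks_like_plain_ascii(data: bytes) -> bool:
    """True when data has only 7-bit bytes and no ESC (hypothetical helper)."""
    return all(b < 0x80 for b in data) and b"\x1b" not in data


header = b"<html><body>"
plain_ascii = header + b"Hello</body></html>"
jp_page = header + "日本語".encode("iso2022_jp") + b"</body></html>"

# A chunk that ends before the first escape sequence is byte-for-byte
# identical in both documents, so no detector can tell them apart.
cut = len(header)
print(plain_ascii[:cut] == jp_page[:cut])
print(looks_like_plain_ascii(jp_page[:cut]))

# Once the escape sequence is visible, the ambiguity disappears.
print(looks_like_plain_ascii(jp_page))

# Decoding the ISO-2022-JP bytes as ASCII yields "$B..." debris similar
# to what the issue description reports.
print(repr(jp_page.decode("ascii")))
print(jp_page.decode("iso2022_jp"))
```

Because the truncated chunk carries no evidence either way, any fix based on the chunk alone has to guess or wait for more data.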
Nov 30 2016
Comment 1 by toyoshim@chromium.org, Sep 26 2016
Labels: Hotlist-Loading