WTF::kEntitiesForUnencodables and related substitutions don't work with ISO-2022-JP
Issue description
Chrome Version: (copy from chrome://version)
OS: (e.g. Win7, OSX 10.9.5, etc...)
What steps will reproduce the problem?
(1) Open this URL:
data:text/html;charset=UTF-8,<title>ISO-2022-JP</title><style>*{margin:0;padding:0;border:none;width:100%}</style><body onload=document.forms[0].submit()><form action="data:text/html;charset=ISO-2022-JP,<body style='font-family:monospace' onload='document.body.innerText=document.all.data.innerText.substr(7)+String.fromCharCode(10,10)+location.href.split(String.fromCharCode(61)).pop()'><plaintext id=data style='display:none'>?" accept-charset="ISO-2022-JP" target="output"><label for="input">Unicode</label><br><input readonly name=input value="ABC~%C2%A4%E2%80%A2%E2%98%85%E6%98%9F%F0%9F%8C%9F%E6%98%9F%E2%98%85%E2%80%A2%C2%A4~XYZ"><br><label for="output">ISO-2022-JP</label><br><iframe name=output id=output></iframe></form>
a.k.a.
data:text/html;charset=utf-8,%3Chtml%3E%3Chead%3E%3Ctitle%3EISO-2022-JP%3C%2Ftitle%3E%3Cstyle%3E*%7Bmargin%3A0%3Bpadding%3A0%3Bborder%3Anone%3Bwidth%3A100%25%7D%3C%2Fstyle%3E%3C%2Fhead%3E%3Cbody%20onload%3D%22document.forms%5B0%5D.submit()%22%3E%3Cform%20action%3D%22data%3Atext%2Fhtml%3Bcharset%3DISO-2022-JP%2C%26lt%3Bbody%20style%3D'font-family%3Amonospace'%20onload%3D'document.body.innerText%3Ddocument.all.data.innerText.substr(7)%2BString.fromCharCode(10%2C10)%2Blocation.href.split(String.fromCharCode(61)).pop()'%26gt%3B%26lt%3Bplaintext%20id%3Ddata%20style%3D'display%3Anone'%26gt%3B%3F%22%20accept-charset%3D%22ISO-2022-JP%22%20target%3D%22output%22%3E%3Clabel%20for%3D%22input%22%3EUnicode%3C%2Flabel%3E%3Cbr%3E%3Cinput%20readonly%3D%22%22%20name%3D%22input%22%20value%3D%22ABC~%C2%A4%E2%80%A2%E2%98%85%E6%98%9F%F0%9F%8C%9F%E6%98%9F%E2%98%85%E2%80%A2%C2%A4~XYZ%22%3E%3Cbr%3E%3Clabel%20for%3D%22output%22%3EISO-2022-JP%3C%2Flabel%3E%3Cbr%3E%3Ciframe%20name%3D%22output%22%20id%3D%22output%22%3E%3C%2Fiframe%3E%3C%2Fform%3E%3C%2Fbody%3E%3C%2Fhtml%3E
What is the expected result?
The ISO-2022-JP shift state is reset to ASCII before each numeric character reference is inserted, so the reference is interpreted as ASCII.
Unicode
ABC~¤•★星🌟星★•¤~XYZ
ISO-2022-JP
ABC~¤•★星🌟星★•¤~XYZ
ABC%7E%26%23164%3B%26%238226%3B%1B%24B%21z%401%1B%28B%26%23127775%3B%1B%24B%401%21z%1B%28B%26%238226%3B%26%23164%3B%7EXYZ
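The expected behavior can be sketched outside Blink. The following Python is only an illustration of the idea, not the TextCodecICU implementation: it assumes Python's built-in iso2022_jp codec, and encode_iso2022jp_with_ncr is a hypothetical helper name. Each run of encodable characters is encoded as a complete unit, which ends back in the ASCII state, and only then is the numeric character reference appended.

```python
def encode_iso2022jp_with_ncr(text: str) -> bytes:
    """Encode text as ISO-2022-JP, writing unencodable characters as
    decimal numeric character references while in the ASCII state."""
    out = bytearray()
    run = []  # pending characters that ISO-2022-JP can represent

    def flush_run():
        if run:
            # A complete encode() call returns to ASCII (ESC ( B) at the end
            # if it had switched to a two-byte character set.
            out.extend("".join(run).encode("iso2022_jp"))
            run.clear()

    for ch in text:
        try:
            ch.encode("iso2022_jp")  # probe: is this character encodable?
            run.append(ch)
        except UnicodeEncodeError:
            flush_run()  # reset the shift state before the replacement
            out.extend(f"&#{ord(ch)};".encode("ascii"))
    flush_run()
    return bytes(out)


print(encode_iso2022jp_with_ncr("ABC~¤•★星🌟星★•¤~XYZ"))
```

With the input from this report, the result should correspond to the percent-encoded string above once that string is percent-decoded.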
What happens instead?
The ISO-2022-JP shift state is not reset before the numeric character reference is inserted, so the reference is interpreted as two-byte data rather than ASCII, and the remainder of the string is misinterpreted as well (at least until the next shift or reset).
Unicode
ABC~¤•★星🌟星★•¤~XYZ
ISO-2022-JP
ABC~¤•★星ΞζΈ¦η₯ε¦ιΈι’ζοΌθθΈΞθζ~XYZ
ABC%7E%26%23164%3B%26%238226%3B%1B%24B%21z%401%26%23127775%3B%401%21z%26%238226%3B%26%23164%3B%1B%28B%7EXYZ
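To see why the receiver misreads the bytes, the two percent-encoded strings from this report can be decoded side by side. This is a rough sketch that assumes Python's iso2022_jp codec and html.unescape as stand-ins for the browser's decoder and HTML parser; interpret is a hypothetical helper name, and the exact garbage produced for the buggy bytes may differ from what a browser shows.

```python
import html
from urllib.parse import unquote_to_bytes

# Percent-encoded submissions copied from this report.
EXPECTED = ("ABC%7E%26%23164%3B%26%238226%3B%1B%24B%21z%401%1B%28B"
            "%26%23127775%3B%1B%24B%401%21z%1B%28B%26%238226%3B%26%23164%3B%7EXYZ")
ACTUAL = ("ABC%7E%26%23164%3B%26%238226%3B%1B%24B%21z%401"
          "%26%23127775%3B%401%21z%26%238226%3B%26%23164%3B%1B%28B%7EXYZ")


def interpret(percent_encoded: str) -> str:
    """Decode a submission the way the receiving page would see it."""
    raw = unquote_to_bytes(percent_encoded)
    # errors="replace" because the buggy bytes may contain pairs that are
    # not assigned in the two-byte character set.
    text = raw.decode("iso2022_jp", errors="replace")
    # Expand numeric character references that survived as ASCII text.
    return html.unescape(text)


print(interpret(EXPECTED))
print(interpret(ACTUAL))
```

In the buggy byte string the reference for U+1F31F sits between ESC $ B and ESC ( B, so its bytes are consumed in pairs as two-byte characters and never reach the HTML parser as a reference.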
Note that WTF::kEntitiesForUnencodables is not alone in having this bug - other WTF::*ForUnencodables replacement modes have this problem, too.
Nov 8 2017
BTW Firefox gets this one right:
Unicode
ABC~¤•★星🌟星★•¤~XYZ
ISO-2022-JP
ABC~¤•★星🌟星★•¤~XYZ
ABC%7E%26%23164%3B%26%238226%3B%1B%24B%21z%401%1B%28B%26%23127775%3B%1B%24B%401%21z%1B%28B%26%238226%3B%26%23164%3B%7EXYZ
Nov 9 2017
Safari also gets this right, and produces exactly the expected output.
Nov 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3d17f1f120dc22992fd938afcb7a3cd7edd6aae0

commit 3d17f1f120dc22992fd938afcb7a3cd7edd6aae0
Author: Benjamin C. Wiley Sittler <bsittler@chromium.org>
Date: Fri Nov 10 02:23:31 2017

TextCodecICU: respect encoder state when replacing unencodables

This also fixes the IgnorableCodePoint test to use inputs which have nontrivial ISO-2022-JP shift states; previously this test did not actually ensure correct behavior when non-ASCII characters representable in ISO-2022-JP and non-ASCII characters not representable in ISO-2022-JP occur in sequence.

This also adds WPT coverage.

Prior to this fix, encoder state was not respected, leading to incorrect interpretation of the replacements and sometimes following bytes too, depending on whether the replacement lengths were even or odd, and on whether the active state of the ISO-2022-JP G0 character set was one-byte or two-byte.

An example, with results transcribed in Unicode for readability:

Input: ABC~¤•★星🌟星★•¤~XYZ
Old output:
  Bytes: ABC~&#164;&#8226;␛$B!z@1&#127775;@1!z&#8226;&#164;␛(B~XYZ
  Meaning: ABC~¤•★星ΞζΈ¦η₯ε¦ιΈι’ζοΌθθΈΞθζ~XYZ
New output:
  Bytes: ABC~&#164;&#8226;␛$B!z@1␛(B&#127775;␛$B@1!z␛(B&#8226;&#164;~XYZ
  Meaning: ABC~¤•★星🌟星★•¤~XYZ

Bug: 782565
Change-Id: If2a7b76b99ce77cbec433af5384ed5c4d2e3c581
Reviewed-on: https://chromium-review.googlesource.com/758405
Commit-Queue: Benjamin Wiley Sittler <bsittler@chromium.org>
Reviewed-by: Jungshik Shin <jungshik@google.com>
Reviewed-by: Joshua Bell <jsbell@chromium.org>
Reviewed-by: Emil A Eklund <eae@chromium.org>
Cr-Commit-Position: refs/heads/master@{#515425}

[add] https://crrev.com/3d17f1f120dc22992fd938afcb7a3cd7edd6aae0/third_party/WebKit/LayoutTests/external/wpt/encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp-encode-form-errors-stateful.html
[modify] https://crrev.com/3d17f1f120dc22992fd938afcb7a3cd7edd6aae0/third_party/WebKit/Source/platform/wtf/text/TextCodecICU.cpp
[modify] https://crrev.com/3d17f1f120dc22992fd938afcb7a3cd7edd6aae0/third_party/WebKit/Source/platform/wtf/text/TextCodecICUTest.cpp
Nov 10 2017
bsittler@: It seems like you landed a CL that fixes this. Please reopen if I guessed incorrectly.
Nov 10 2017
Correct! Thanks for closing; I had intended to do so today if the fix had not been reverted (apparently it has not!)
Comment 1 by bsittler@chromium.org, Nov 8 2017