Encoding problem on Windows (-)
Reported by
julesroh...@googlemail.com,
Sep 29 2017
Issue description
UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Example URL:
Steps to reproduce the problem: This is a rendering issue
What is the expected behavior? Ignore the strange character
What went wrong? We are rendering text from a PDF in the browser. When checking the response data from the server in the Network tab, the text displays a strange character (please see the attached screenshot). On OSX the character appears to be ignored, but on Windows two commas (,,) are rendered in its place. The character is also ignored in stdout logging in the OSX terminal. Firefox and IE on Windows both appear to ignore this character and do not render anything in its place. Please see the following resources:
http://www.i18nqa.com/debug/bug-iso8859-1-vs-windows-1252.html
http://www.i18nqa.com/debug/utf8-debug.html
Does it occur on multiple sites: N/A
Is it a problem with a plugin? N/A
Did this work before? N/A
Does this work in other browsers? Yes
Chrome version: 61.0.3163.100 Channel: stable
OS Version: 7/10
Flash Version:
,
Oct 2 2017
Hovering over the red ellipsis in Chrome displays (\u84)
,
Oct 3 2017
julesrohanveling@ - Thanks for filing the issue...!! Could you please provide a sample URL so that we can test the issue from the TE end? This will help us triage the issue further. Thanks...!!
,
Oct 4 2017
Hi, thanks for your response... Please go to http://www.sciencedirect.com/science/article/pii/S2214647416300393 and click on the download pdf link at the top of the page. Thanks, Julian
,
Oct 4 2017
Thank you for providing more feedback. Adding requester "krajshree@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 4 2017
For those trying to follow the steps:
1. Go to http://www.sciencedirect.com/science/article/pii/S2214647416300393
2. Click the "Download PDF" link at the top, then "Article"
3. This will open a new tab. Note that this is NOT downloading a PDF; as the OP notes, it's rendering the PDF to HTML on the server.
4. Open DevTools, go to the Network tab
5. Reload the page
6. In the DevTools Network tab, click on the document entry (first one)
7. In the DevTools Network tab, click on the Response header - it's big; on my powerful machine it took several seconds to appear.
The red dot appears in the <title> as per the screenshot. Here's what I get when I copy/paste it on Windows: Cyclo(His-Pro)
,
Oct 4 2017
And FYI, here are the raw bytes from curling the URL and piping the output through hexdump -C:
00000010 20 20 3c 68 74 6d 6c 20 6c 61 6e 67 3d 22 65 6e | <html lang="en|
00000020 22 3e 0a 20 20 3c 68 65 61 64 3e 0a 20 20 20 20 |">. <head>. |
00000030 3c 74 69 74 6c 65 3e 4d 65 74 61 62 6f 6c 69 63 |<title>Metabolic|
00000040 20 72 65 6c 61 74 69 6f 6e 73 68 69 70 20 62 65 | relationship be|
00000050 74 77 65 65 6e 20 64 69 61 62 65 74 65 73 20 61 |tween diabetes a|
00000060 6e 64 20 41 6c 7a 68 65 69 6d 65 72 26 61 70 6f |nd Alzheimer&apo|
00000070 73 3b 73 20 44 69 73 65 61 73 65 20 61 66 66 65 |s;s Disease affe|
00000080 63 74 65 64 20 62 79 20 43 79 63 6c 6f 28 48 69 |cted by Cyclo(Hi|
00000090 73 2d c2 84 50 72 6f 29 20 70 6c 75 73 20 7a 69 |s-..Pro) plus zi|
000000a0 6e 63 20 74 72 65 61 74 6d 65 6e 74 3c 2f 74 69 |nc treatment</ti|
000000b0 74 6c 65 3e 0a 20 20 20 20 3c 6d 65 74 61 20 63 |tle>. <meta c|
000000c0 68 61 72 73 65 74 3d 22 55 54 46 2d 38 22 3e 0a |harset="UTF-8">.|
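For anyone who wants to find the odd bytes without eyeballing the dump, here is a minimal sketch that scans one row for anything outside 7-bit ASCII. The row string is copied from offset 0x90 above; the helper name findNonAscii and the variable names are made up for illustration.

```javascript
// Row 0x90 of the hexdump above, copied as a hex string.
const rowOffset = 0x90;
const rowHex = "73 2d c2 84 50 72 6f 29 20 70 6c 75 73 20 7a 69";

// Parse the space-separated hex pairs into bytes.
const rowBytes = Uint8Array.from(rowHex.split(" "), (h) => parseInt(h, 16));

// Hypothetical helper: collect every byte >= 0x80 together with its
// absolute offset in the dump.
function findNonAscii(buf, base) {
  const hits = [];
  buf.forEach((b, i) => {
    if (b >= 0x80) hits.push({ offset: base + i, byte: b });
  });
  return hits;
}

const nonAscii = findNonAscii(rowBytes, rowOffset);
for (const { offset, byte } of nonAscii) {
  // Print in the same style as hexdump: 8-digit offset, then the byte.
  console.log(offset.toString(16).padStart(8, "0"), byte.toString(16));
}
```

On this row it should flag only the c2 84 pair that sits between "His-" and "Pro".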
document.characterSet is "UTF-8"
The encoded bytes in question are c2 84:
new TextDecoder("UTF-8").decode(new Uint8Array([0xc2, 0x84])).charCodeAt(0).toString(16)
>> "84"
So what's there is U+0084 (as noted in comment #2)
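A slightly stricter version of the same check, as a sketch: it decodes with fatal: true so malformed UTF-8 would throw rather than be replaced with U+FFFD, then re-encodes to confirm the round trip. The helper toUPlus and the variable names are illustrative only.

```javascript
// The two bytes in question from the hexdump.
const suspect = new Uint8Array([0xc2, 0x84]);

// fatal: true makes decode() throw a TypeError on malformed UTF-8
// instead of silently substituting U+FFFD.
const strict = new TextDecoder("utf-8", { fatal: true });
const decoded = strict.decode(suspect);

// Hypothetical helper: format a code point in U+XXXX notation.
function toUPlus(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

const label = toUPlus(decoded.codePointAt(0));
console.log(decoded.length, label);

// Round trip: encoding the string again should give back the same bytes.
const reencoded = Array.from(new TextEncoder().encode(decoded));
console.log(reencoded.map((b) => b.toString(16)).join(" "));

// For contrast, a lone 0x84 (as it would appear in a single-byte encoding)
// is not valid UTF-8 on its own, so the strict decoder rejects it.
let loneByteIsValid = true;
try {
  strict.decode(new Uint8Array([0x84]));
} catch {
  loneByteIsValid = false;
}
console.log(loneByteIsValid);
```

The pair decodes cleanly to a single code point, which is consistent with the server sending well-formed UTF-8 that happens to contain a control character.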
U+0084 is a Unicode control character. Different fonts render that differently; on Windows I get the double-comma. On Linux I get a wide underscore.
So this doesn't look like an encoding issue. We're decoding the incoming UTF-8 exactly as expected.
Note that Chrome doesn't filter control characters out, per issue 530342. In non-web content areas (window tabs, DevTools) and the clipboard, we're likely just going to rely on what the system font provides.
Moving to Blink > Fonts, but I think this is WAI (working as intended).
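If the site wants the character ignored regardless of browser or font, one option is to drop control characters before the text reaches the page. This is only a sketch of a possible site-side workaround, not something prescribed in this thread; stripControls and KEEP are hypothetical names, and keeping tab, LF and CR is an assumption about what the site would want to preserve.

```javascript
// Control characters that are normally meaningful in text and worth keeping
// (an assumption; adjust to taste).
const KEEP = new Set(["\t", "\n", "\r"]);

// Hypothetical helper: remove every character in Unicode general category
// Cc (the C0 and C1 controls, which includes U+0084) except those in KEEP.
function stripControls(text) {
  return text.replace(/\p{Cc}/gu, (ch) => (KEEP.has(ch) ? ch : ""));
}

// The title fragment from the hexdump, with the U+0084 written as an escape.
const rawTitle = "Cyclo(His-\u0084Pro) plus zinc treatment";
const cleanTitle = stripControls(rawTitle);

console.log(/\p{Cc}/u.test(rawTitle));   // control character present before
console.log(/\p{Cc}/u.test(cleanTitle)); // and absent after
console.log(cleanTitle);
```

Doing this at the point where the PDF text is converted to HTML would keep the stray character out of the title, the tab strip, and anything copied to the clipboard.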
,
Oct 4 2017
Yes, this was an intentional change agreed to by the CSS WG to match the Unicode specification. The other browsers are either already doing the same or are in the process of doing so.
Comment 1 by julesroh...@googlemail.com
, Oct 2 2017