Version: 56.0.2905.0 (Official Build) canary (64-bit)
OS: ChromeOS
What steps will reproduce the problem?
(1) Open:
data:text/html;charset=utf-8,<form accept-charset="windows-1252" method=POST enctype="multipart/form-data" action="https://echo.getpostman.com/post"><ol><li><a download="U&%23xE2;&%23x98;&%23xBA;%E2%98%BA.U&%23xE2;&%23x98;&%23xBA;%E2%98%BA" href="data:application/octet-stream,U%C3%A2%C2%98%C2%BA%E2%98%BA">download</a><li><input type=file name="q"><li><input type=submit value=upload></ol></form>
(2) Follow instructions: download, then choose downloaded file, then upload
(3) Note the upload filename as seen by the server (right after "files"). (NOTE: this CGI always labels the response as UTF-8 even though in this case it actually isn't; however, the test data and filename have been modified to work around that.)
What is the expected output?
Unrepresentable characters converted to HTML numeric character references "&#NNNNNNN;":
{"args":{},"data":{},"files":{"U☺&#9786;.U☺&#9786;":...
Firefox does something like this, except that it also prefixes each uploaded filename with tmp_<number> and suffixes it with a different <number>, which we probably don't need to do:
{"args":{},"data":{},"files":{"tmp_7163-U☺&#9786;1556538326.U☺&#9786;":...
What do you see instead?
Unrepresentable characters are converted to "?":
{"args":{},"data":{},"files":{"U☺?.U☺?":...
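To make the difference concrete, here is a minimal Python sketch of the two fallback behaviors. It assumes Python's `cp1252` codec is close enough to a browser's windows-1252 encoder for these characters, and it uses U+02DC (the character windows-1252 encodes as byte 0x98) in the filename. The variable names are illustrative only; this is not how any browser implements form submission.

```python
# Filename from the test: U, â, ˜, º (all representable in windows-1252)
# followed by ☺ (U+263A, which is not representable).
filename = "U\u00e2\u02dc\u00ba\u263a"

# Observed behavior: unrepresentable characters are replaced with "?".
lossy = filename.encode("cp1252", errors="replace")

# Expected behavior: unrepresentable characters become "&#NNNN;".
ncr = filename.encode("cp1252", errors="xmlcharrefreplace")

# The echo service labels its response as UTF-8, so the windows-1252
# bytes E2 98 BA happen to display as ☺.
print(lossy.decode("utf-8"))  # U☺?
print(ncr.decode("utf-8"))    # U☺&#9786;
```

This is why the first ☺ in each output survives in both cases and only the second one distinguishes the two behaviors.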
A variant of the URI in step (1) uses &%23x2DC; instead of &%23x98; in the download filename; the steps and outputs are otherwise the same as above:
data:text/html;charset=utf-8,<form accept-charset="windows-1252" method=POST enctype="multipart/form-data" action="https://postman-echo.com/post"><ol><li><a download="U&%23xE2;&%23x2DC;&%23xBA;%E2%98%BA.U&%23xE2;&%23x2DC;&%23xBA;%E2%98%BA" href="data:application/octet-stream,U%C3%A2%C2%98%C2%BA%E2%98%BA">download</a><li><input type=file name="q"><li><input type=submit value=upload></ol></form>
Components: -Blink>Forms Blink>Forms>Submission Labels: Hotlist-Interop Status: Available (was: Untriaged)
Edge:
It seems the Content-Disposition (C-D) header is encoded in the page encoding. So, if the page encoding is UTF-8, UTF-8-encoded bytes are sent.
If the page encoding is windows-1252, â˜º are sent as is, and ☺ is encoded as &#9786;.
So, we should use numeric references for interoperability.
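A rough sketch of the behavior described for Edge, assuming the filename is simply encoded in the page encoding with numeric character references as the fallback. `part_header` is a hypothetical helper written for this illustration, not a browser API, and Python's `cp1252` codec stands in for windows-1252.

```python
def part_header(field: str, filename: str, page_encoding: str) -> bytes:
    """Build the Content-Disposition line for one multipart/form-data part.

    Hypothetical helper: escaping of '"' and line breaks is omitted.
    """
    # Characters the page encoding cannot represent become "&#NNNN;".
    encoded = filename.encode(page_encoding, errors="xmlcharrefreplace")
    return (
        b'Content-Disposition: form-data; name="'
        + field.encode("ascii")
        + b'"; filename="'
        + encoded
        + b'"'
    )


# UTF-8 page: the raw UTF-8 bytes of the filename are sent.
utf8_header = part_header("q", "U\u263a.txt", "utf-8")

# windows-1252 page: U+263A is sent as a numeric character reference.
cp1252_header = part_header("q", "U\u263a.txt", "cp1252")
```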
At first, I thought it'd be best to use RFC 5987 (always with UTF-8), but RFC 7578 has the following:
NOTE: The encoding method described in [RFC5987], which would add a
"filename*" parameter to the Content-Disposition header field, MUST
NOT be used.
RFC 5987 has a similar note.
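For reference, this is roughly what the ruled-out form would look like. The sketch below builds an RFC 5987 style `filename*` parameter using the standard library's `urllib.parse.quote`; `filename_star` is a name made up for this illustration, and the sketch percent-encodes more characters than RFC 5987 strictly requires.

```python
from urllib.parse import quote


def filename_star(filename: str) -> str:
    # RFC 5987 ext-value: charset, empty language tag, then the
    # percent-encoded UTF-8 bytes of the value.
    return "filename*=UTF-8''" + quote(filename, safe="")


param = filename_star("U\u263a.txt")
print(param)  # filename*=UTF-8''U%E2%98%BA.txt
```

Since RFC 7578 says this form MUST NOT be used in multipart/form-data, the plain `filename` parameter is the only place where the encoding question can be settled.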
Great! Looks like Edge and Firefox have matching, less lossy behavior. If Chrome switches to that, we have broader consensus and can see about updating the HTML spec to agree. What do you think?
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.
Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
FYI, original repro steps bitrotted. Here's the updated URI using the moved postman-echo service:
data:text/html;charset=utf-8,<form accept-charset="windows-1252" method=POST enctype="multipart/form-data" action="https://postman-echo.com/post"><ol><li><a download="U&%23xE2;&%23x98;&%23xBA;%E2%98%BA.U&%23xE2;&%23x98;&%23xBA;%E2%98%BA" href="data:application/octet-stream,U%C3%A2%C2%98%C2%BA%E2%98%BA">download</a><li><input type=file name="q"><li><input type=submit value=upload></ol></form>
It would probably also be a good idea to test precomposed and decomposed combining sequences and similar cases of canonical and de-facto equivalence, as this is an area where real-life data loss is more or less to be expected if an implementation is insufficiently lenient.
For example,
- "Chữ Nôm" and "Chữ Nôm" may be treated interchangeably or canonicalized by some filesystem layers
- two or more of "한글", "한글", "ㅎㅏㄴㄱㅡㄹ", and "하ᄂ그ᄅ" may be treated interchangeably or canonicalized by some filesystem layers
- བྷྲཱྀནྟྲཱནཱེནྡྷི may end up being de-facto equivalent to on a system using a Chinese locale and implementing the proposed Private Use Area mapping for Extended Tibetan Set A (these are at least visually identical on the Chromebook I'm using right now)
One more example:
- '' may end up being de-facto equivalent to 'ཧྭོ' on some systems but on others it's not, and is visually more like '🍎︎' with a bite taken out of it
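One way such test filenames could be generated, sketched with Python's standard `unicodedata` module. This covers canonical (NFC/NFD) equivalence only; it says nothing about what any particular filesystem layer does, and it does not cover the Private Use Area cases. The helper name `canonically_equivalent` is made up for this example.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if they share one NFC form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed forms, written with escapes so the code points are unambiguous.
viet = "Ch\u1eef N\u00f4m"   # Chữ Nôm
hangul = "\ud55c\uae00"      # 한글

# Decomposed forms: base letters plus combining marks / conjoining jamo.
viet_nfd = unicodedata.normalize("NFD", viet)
hangul_nfd = unicodedata.normalize("NFD", hangul)

# Compatibility jamo (ㅎㅏㄴㄱㅡㄹ) are NOT canonically equivalent to 한글,
# so any equivalence a system applies to them is de facto only.
compat_jamo = "\u314e\u314f\u3134\u3131\u3161\u3139"

print(len(viet), len(viet_nfd))      # 7 10
print(len(hangul), len(hangul_nfd))  # 2 6
print(canonically_equivalent(viet, viet_nfd))       # True
print(canonically_equivalent(hangul, compat_jamo))  # False
```

Uploading both the precomposed and decomposed variant of each name and comparing what the server receives would show whether a normalization step sits anywhere between the download and the upload.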
Comment 1 by bsittler@chromium.org, Nov 3 2016