New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 795733 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
OoO until Feb 4th
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

'&' and ';' should not be percent-encoded in URL queries

Project Member Reported by raphael....@intel.com, Dec 18 2017

Issue description

We're currently failing a few WPT tests because we're percent-encoding U+0026 ('&') and U+003B (';') in URL queries instead of just adding them as-is.

This behavior seems to date back to at least 2008 with https://trac.webkit.org/changeset/31089/webkit, whereas the URL spec has specified that those two code points shouldn't be percent-encoded since 2012 (https://github.com/whatwg/url/commit/bbeacfe00317554daab0ece983f42692d42487fc).

At the moment, Firefox implements the spec correctly, WebKit and Blink percent-encode them, and Edge doesn't seem to do any encoding.

The current behavior is hardcoded in both WTF as well as //url itself. At the moment I'm focusing on Blink, but I guess we'd need to determine whether //url needs to be updated too.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Dec 19 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/27eaa9817b9837aa478e0667158292a0295985df

commit 27eaa9817b9837aa478e0667158292a0295985df
Author: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Date: Tue Dec 19 19:21:02 2017

Do not percent-encode '&' and ';' in URL queries.

The URL spec has specified that U+0026 ('&') and U+003B (';') should not be
percent-encoded since
https://github.com/whatwg/url/commit/bbeacfe00317554daab0ece983f42692d42487fc,
so follow suit a few years late and match what Gecko has already been doing.

Bug:  795733 
Change-Id: Icc28699fc14b63ef92751e81d94fb3eeee876781
Reviewed-on: https://chromium-review.googlesource.com/829313
Reviewed-by: Joshua Bell <jsbell@chromium.org>
Commit-Queue: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Cr-Commit-Position: refs/heads/master@{#525101}
[delete] https://crrev.com/70e9f46252e6732b25744ca62f7cf4af73d028aa/third_party/WebKit/LayoutTests/external/wpt/XMLHttpRequest/open-url-encoding-expected.txt
[delete] https://crrev.com/70e9f46252e6732b25744ca62f7cf4af73d028aa/third_party/WebKit/LayoutTests/external/wpt/encoding/big5-encoder-expected.txt
[delete] https://crrev.com/70e9f46252e6732b25744ca62f7cf4af73d028aa/third_party/WebKit/LayoutTests/external/wpt/encoding/gbk-encoder-expected.txt
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/LayoutTests/fast/url/query-expected.txt
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/LayoutTests/fast/url/script-tests/query.js
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/LayoutTests/http/tests/uri/escaped-entity-expected.txt
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/Source/platform/wtf/text/TextCodec.cpp
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/Source/platform/wtf/text/TextCodec.h
[modify] https://crrev.com/27eaa9817b9837aa478e0667158292a0295985df/third_party/WebKit/Source/platform/wtf/text/TextCodecTest.cpp

I think we should have a Blink intent for this change. Can you do that?

I had a hard time following this. As written, the description doesn't make any sense since & in queries must absolutely be escaped in some circumstances, and absolutely not escaped in others, and both of these currently happen.

I also don't understand what "edge doesn't seem to do any encoding" means. It's definitely doing encoding. Are you saying it doesn't do any non-UTF-8 query encoding? Or something else?

From what I can gather this is about how the first and last character of "&#1234;" is escaped when a character is encountered outside of the character set of the target page. This is delicate because queries contain ampersands with specific meaning. If true, the bug and CL descriptions should have made this much more clear.

Chrome's old behavior I think implies that the unescaping is done after the query is parsed, while the new behavior is that unescaping is done before the query is parsed. If my understanding is correct, this change certainly has the ability to break things. Firefox may not have enough usage in Asian countries where e.g. Big5 might be encountered, and I can't tell what other browsers do from the description (Edge and IE are relevant).

The change in the repo has a one-line description and a link to an IRC discussion from 2012. This doesn't help my concern that the spec'ed behavior isn't sufficiently thought out.

This is why I think there needs to more discussion and an analysisof what servers actually do with this type of input (i.e. how does Apache handle this).
Thanks for the comment, Brett. I'm reverting the CL I landed in comment #1 before M65 branches; after that, there should be plenty of time to investigate the issue.
Sorry for the extra work :-/
And apologies for not being as critical reviewer as I should have been up front. :(

We really, REALLY appreciate the work to align browsers/specs, and pushing on these cases that trip up web developers and lead to random breakages for users. Thanks for your persistence in resolving these, even when it requires more than just code changes!
Project Member

Comment 6 by bugdroid1@chromium.org, Jan 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7851ea3411dd04c74549373df930b4688576218e

commit 7851ea3411dd04c74549373df930b4688576218e
Author: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Date: Mon Jan 08 22:12:24 2018

Revert "Do not percent-encode '&' and ';' in URL queries."

This reverts commit 27eaa9817b9837aa478e0667158292a0295985df.

Reason for revert: brettw would like this to be discussed more widely before landing.

Original change's description:
> Do not percent-encode '&' and ';' in URL queries.
> 
> The URL spec has specified that U+0026 ('&') and U+003B (';') should not be
> percent-encoded since
> https://github.com/whatwg/url/commit/bbeacfe00317554daab0ece983f42692d42487fc,
> so follow suit a few years late and match what Gecko has already been doing.
> 
> Bug:  795733 
> Change-Id: Icc28699fc14b63ef92751e81d94fb3eeee876781
> Reviewed-on: https://chromium-review.googlesource.com/829313
> Reviewed-by: Joshua Bell <jsbell@chromium.org>
> Commit-Queue: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
> Cr-Commit-Position: refs/heads/master@{#525101}

TBR=jsbell@chromium.org,brettw@chromium.org,raphael.kubo.da.costa@intel.com,jshin@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  795733 
Change-Id: I0c6f0d1163ae2eccfcb108a7ecfb65fd47d61321
Reviewed-on: https://chromium-review.googlesource.com/852500
Commit-Queue: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Reviewed-by: Raphael Kubo da Costa <raphael.kubo.da.costa@intel.com>
Cr-Commit-Position: refs/heads/master@{#527778}
[add] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/external/wpt/encoding/big5-encoder-expected.txt
[add] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/external/wpt/encoding/gbk-encoder-expected.txt
[add] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/external/wpt/xhr/open-url-encoding-expected.txt
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/fast/url/query-expected.txt
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/fast/url/script-tests/query.js
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/LayoutTests/http/tests/uri/escaped-entity-expected.txt
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/Source/platform/wtf/text/TextCodec.cpp
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/Source/platform/wtf/text/TextCodec.h
[modify] https://crrev.com/7851ea3411dd04c74549373df930b4688576218e/third_party/WebKit/Source/platform/wtf/text/TextCodecTest.cpp

Comment 7 by js...@chromium.org, Jan 9 2018

Cc: erikv@google.com

Comment 8 by erikv@google.com, Jan 9 2018

Let's start with a simple HTML document in ASCII, containing a form with a text field that will be sent to the server as /path?text=value. If the user types #&; into the text field, it is obvious that the & must be escaped (because & is the separator in the query) and the # must be escaped (because # is the beginning of the fragment).

Now let's suppose the document is encoded in iso-8859-1. In the old days, Unicode was not so popular yet, and the servers expected URLs to be encoded in the same encoding as the original HTML document. So that's what the browsers did. (They would send such a URL in iso-8859-1, %-encoded.)

But what if the user pastes a Japanese character into a text field in an iso-8859-1 document? MSIE was one of the first browsers to encode such characters as numeric character references (&#123;). Since MSIE used UTF-16 internally, characters larger than 0xFFFF were encoded as &#NNNNN;&#NNNNN; where each NNNNN is a decimal number for a Unicode surrogate (0xD800..DFFF).

However, since each text value must be inserted into the query component of a URL, the & and # need to be %-encoded (%26%23...).

So the final question is why we have to %-encode the semicolon (;). Well, from the point of view of the server, the server cannot tell whether an incoming URL is from an HTML form or an <a href=...>. So the client should make sure it uses the same steps to encode forms and hrefs. However, early HTML specs recommended using ; to separate query parameters in hrefs:

https://www.w3.org/TR/html4/appendix/notes.html#ampersands-in-uris

So the safest thing for a browser to do was to also %-encode the semicolon in query parameter values. E.g. GET /path?query=%26%2312345%3B;abc=def HTTP/1.0 (note the escaped and unescaped semicolon %3B and ;).

A good stress test is a stateful encoding like iso-2022-jp with characters that require both %-encoding and numeric character references.

<head>
<meta "charset=iso-2022-jp">
</head>
<body>
<a href="/p?q=&#x611B;&#xE9;&r=s#f">

When clicked, the URL should be /p?q=%1B%24B0%26%1B%28B%26%23233%3B&r=s (note the %26 inside and outside the stateful encoding where %1B is ESC).
Anne is working on updating the spec/WPT to align with Chrome's behavior:

https://github.com/whatwg/url/pull/386
https://github.com/w3c/web-platform-tests/pull/10915

... thanks in part to investigation here. So this may end up getting closed out.

Status: WontFix (was: Started)
Indeed, thanks again Raphael!

Sign in to add a comment