New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 869291 link

Starred by 7 users

Issue metadata

Status: Assigned
Owner:
Long OOO (go/where-is-mgiuca)
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac , Fuchsia
Pri: 3
Type: Bug

Blocking:
issue 660384



Sign in to add a comment

Chrome doesn't parse non-special URLs properly

Project Member Reported by mgiuca@chromium.org, Jul 31

Issue description

Chrome Version: 69
OS: All

What steps will reproduce the problem?
(1) https://quuz.org/url/liveview.html#non-special://user:pass@host/foo/bar
(2) Compare "Browser's URL components" with "jsdom/whatwg-url's URL components"

This is the equivalent of typing the following into the JS console:
> url = new URL('non-special://user:pass@host/foo/bar')

And observing the various fields.

What is the expected result?
username: user
password: pass
hostname: host
pathname: /foo/bar

What happens instead?
username:
password:
hostname:
pathname: //user:pass@host/foo/bar

Explanation

This is causing Chrome to fail a number of URL web platform tests (http://w3c-test.org/url/url-constructor.html), including:

non-special://test:@test/x
non-special://:@test/x
non-special://f:999999/c
foo://
foo://///////
foo://///////bar.com/
foo:////://///

(In the above cases, the URL also normalizes incorrectly as well.)

In general, Chrome doesn't find an authority in any URL that isn't in a whitelisted set of schemes {http, https, file, ftp, gopher, ws, wss, filesystem}. It just treats everything after the ':' as the path. This appears to be by design, for dealing with URL schemes like "data:" and "javascript:". See this comment in url_util.cc:

  } else {
    // "Weird" URLs like data: and javascript:.
    ParsePathURL(spec, spec_len, trim_path_end, &parsed_input);
    success = CanonicalizePathURL(spec, spec_len, parsed_input, output,
                                  output_parsed);
  }

However, this ignores all the other "non-weird" URL schemes, like "ssh:", "svn:", "nntp:", etc, which aren't "special" but also fit the normal URL syntax of "scheme://username:password@host:port/path". Even though Chrome doesn't directly deal with these URL schemes, it has a responsibility to correctly parse them in the URL class.

So how to distinguish between "scheme://host/path" URLs and "scheme:data" URLs? TL;DR: Not based on the scheme, but based on whether there are two slashes after the scheme. If there are, we need to parse a "username:password@host:port" authority, as normal. If not, just take everything up to the '?' as the path (this very special case is the atrociously named "cannot-be-a-base-URL" in the URL standard [1]).

The URL standard classifies these URLs as "non-special" [2] (all the ones in Chrome's whitelist, except "filesystem", are "special"). The parser has a major fork at the "scheme state" [3]:
- If the URL is special, it goes to the "special authority slashes state".
- If the URL is not special, but the colon is followed by a "//", it goes to the "authority state". If it's followed by a single '/', it goes to the "path state".
- Otherwise (the URL is not special, and the colon is not followed by a '/'), it goes to the "cannot-be-a-base-URL path state".

So non-special URLs have totally different treatment based on whether the colon is followed by 0, 1 or 2+ slashes. The 0-slash and 1-slash cases are quite similar, but subtly different (notably, the path is normalized in the 1-slash case, but not in the 0-slash case). Only in the 2-or-more-slashes case does the authority get parsed.

Chrome's URL parser treats all non-special URLs as cannot-be-a-base-URL.

[1] https://url.spec.whatwg.org/#url-cannot-be-a-base-url-flag
[2] https://url.spec.whatwg.org/#is-not-special
[3] https://url.spec.whatwg.org/#scheme-state
 
> (1) https://quuz.org/url/liveview.html#non-special://user:pass@host/foo/bar

Just want to point out that quuz.org/url/liveview.html uses an old version of jsdom/whatwg-url, and thus might not reflect the latest standard. Instead you can use

https://jsdom.github.io/whatwg-url/#url=bm9uLXNwZWNpYWw6Ly91c2VyOnBhc3NAaG9zdC9mb28vYmFy&base=YWJvdXQ6Ymxhbms=


In this case the results are the same :)

> atrociously named "cannot-be-a-base-URL"

It used to be worse IMO... https://github.com/whatwg/url/issues/89
#1 Thanks.

I think I would like to a) document what cannot-be-a-base-URL means in a non-normative text somewhere (I literally had a note written down to figure out what cannot-be-a-base-URL meant, and only figured it out once I manually traced through the algorithm), and b) rename it to something like "non-hierarchical URL".

As I understand it, a cannot-be-a-base-URL URL is basically just a non-special scheme, a colon, and some arbitrary text (can also have a query and fragment). The URL has no authority, and the "path" isn't really a path, it's just a string. Slashes and dots are not special. So "non-hierarchical" makes sense to me.

As a bonus, a "cannot-be-a-base-URL" URL *can* be a base URL, if a fragment-only URL is resolved against it: https://jsdom.github.io/whatwg-url/#url=I2Zvbw==&base=YWJvdXQ6Ymxhbms=
Cc: rnimmagadda@chromium.org
 Issue 627877  has been merged into this issue.
Cc: ricea@chromium.org
Labels: Hotlist-Interop
From  https://crbug.com/627877#c12 :

@ricea:
"It's very risky to change any part of Chrome's URL parsing or canonicalisation, as it affects every part of the browser that processes URLs. Because URLs often represent security boundaries, such changes require extensive scrutiny.

"In general we only do them if they are very low risk and/or high reward."

That's pretty troubling. If we can't make major changes to URL parsing, we'll never achieve interop with other browsers or the spec. It seems that we should have a test suite of critical URL parsing cases for security, and then we should be free to change the URL parser if it doesn't affect those cases (or if we can justify changes to those cases).
Blocking: 660384
I believe this issue is what was reported in https://github.com/webcompat/web-bugs/issues/19792 and for which I just added some tests: https://github.com/web-platform-tests/wpt/pull/13516

mgiuca@, can you confirm? See my report on results and for Chrome in particular at https://github.com/web-platform-tests/wpt/pull/13516#issuecomment-429772713
FWIW, I agree with #6 that it's troubling if we feel we are so constrained by web compat and security that we can't converge with other URL parsers. Although I don't believe there is a huge and urgent problem here, people will keep running into these little differences, and need to debug subtle bugs as a result.

Given that we already have the GURL/KURL split, I think it's also an option to treat KURL as our implementation of https://url.spec.whatwg.org/ and thus not have to affect *every* part of the browser that processes URLs.
I have been triaging the URL web platform tests. This change could fix about 51 failures in constructor.html, and probably a similar number in a-element.html. Since we don't treat any of these URLs specially in Chrome anyway, it should be relatively safe from a security perspective.

There may be complicated interactions with URL schemes that were registered by third-party applications, since the canonicalised output changes.

There's still an open question of whether changing the parsing code in this way would break the omnibox. Can someone with relevant expertise be added to this issue?
#7 Having different parsing rules in KURL and GURL was a constant source of security and correctness bugs, which is why they were unified in the first place.
Cc: csharrison@chromium.org
Oh, I had missed that unification. csharrison@ added a great comment in KURL.h in https://chromium-review.googlesource.com/571941. csharrison@, might you also be the right person to answer #8 "whether changing the parsing code in this way would break the omnibox"?
Cc: pkasting@chromium.org
I don't know how this would affect the omnibox, maybe +pkasting would know.
Cc: -pkasting@chromium.org jdonnelly@chromium.org
I no longer work on omnibox; you want the current TL, jdonnelly.

It took me a while to figure out what "this" in comment 11 referred to.  Perhaps someone can summarize.  I believe the proposal is to take the current GURL parsing of non-standard schemes and fork it into two behaviors based on whether the ":" is followed by "//"; in the latter, we parse a bit more like how standard schemes are processed.

The effects on the omnibox are likely to be small, because this should only affect non-standard schemes, and people input fairly few of those into the omnibox.  I suggest looking at some known non-standard schemes that can have handlers registered on users' machines, and see if those are still parsed/fixed up/passed through correctly to open them with the external protocol handler if someone types them in.
 Issue 902253  has been merged into this issue.

Sign in to add a comment