
Issue 682423

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Jan 10
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Consider escaping all whitespace ('\n', '\r', '\t') from URLs instead of removing it

Project Member Reported by csharrison@chromium.org, Jan 18 2017

Issue description

It takes a significant amount of time to do an initial scan of the URL and remove whitespace. Instead, we could treat it like any other bad character and escape it.

Mike is investigating a related problem in issue 680970, and it looks like promisingly few sites have newlines in URLs. We should be getting UseCounter data about '\n', but we'll also need data for '\r' and '\t', because all will be removed if we stop this scanning.

This would likely require i2i and blink-dev process for deprecation/removal.
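
To make the two behaviors in the description concrete, here is a minimal Python sketch. It is an illustration only, not Chromium's URL parser; the function names `strip_whitespace` and `escape_whitespace` are hypothetical, and real escaping rules may differ per URL component.

```python
# The three characters named in the issue title.
URL_WHITESPACE = "\t\n\r"


def strip_whitespace(url: str) -> str:
    """Behavior described as current: drop the characters entirely."""
    return url.translate({ord(c): None for c in URL_WHITESPACE})


def escape_whitespace(url: str) -> str:
    """Behavior proposed here: percent-encode them like other bad characters."""
    return url.translate({ord(c): f"%{ord(c):02X}" for c in URL_WHITESPACE})


if __name__ == "__main__":
    sample = "http://example.com/a\nb\tc"  # hypothetical input
    print(strip_whitespace(sample))
    print(escape_whitespace(sample))
```

The point of the comparison is that the two functions yield different URLs for the same input, which is why usage data is needed before changing behavior.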
 

Comment 1 by mkwst@chromium.org, Jan 19 2017

> It takes a significant amount of time to do an initial scan of the URL and remove whitespace. Instead, we could treat it like any other bad character and escape it.

FWIW, if the usage is low enough to change the behavior, I'd prefer to treat it as a parse error instead.
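
As a rough sketch of the parse-error alternative (hypothetical names, not an actual Chromium or Blink API), a strict check could reject the URL outright instead of rewriting it:

```python
class UrlParseError(ValueError):
    """Hypothetical error type used only in this sketch."""


def parse_strict(url: str) -> str:
    """Return the URL unchanged, or raise if it contains tab, LF, or CR."""
    for index, ch in enumerate(url):
        if ch in "\t\n\r":
            raise UrlParseError(f"disallowed character {ch!r} at offset {index}")
    return url
```

Unlike stripping or escaping, this never produces a URL that differs from what the author wrote; it either accepts the input as-is or fails.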

> We should be getting UseCounter data about '\n', but we'll also need data for '\r' and '\t', because all will be removed if we stop this scanning.

We'll be getting UseCounter data about `\n`, `\r`, and `\t` once https://codereview.chromium.org/2643613002 lands.

I poked at HTTPArchive with `\r` and `\t` as well, and the numbers are still quite low: ~1826 pages[1].

> This would likely require i2i and blink-dev process for deprecation/removal.

I agree!


[1]: For my own future reference:

```
SELECT
  *
FROM (
  SELECT
    page,
    url,
    REGEXP_EXTRACT(LOWER(body), r'(<[a-z][^>]+\s+(?:src|href)\s*=\s*(?:"\s*(?:[^"\s]+\s*(?:\r|\n|\t)+\s*[^"\s]+)+\s*"|\'\s*(?:[^\'\s]+\s*(?:\r|\n|\t)+\s*[^\'\s]+)+\s*\')[^>]+>)') AS match
  FROM
    [httparchive:har.2017_01_01_chrome_requests_bodies] )
WHERE
  page = url
  AND match != "null"
  AND NOT REGEXP_MATCH(match, r'["\']\s*\+')
```
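
For a quick local check of what the query's pattern catches, the sketch below runs the same two regular expressions with Python's `re` module. This assumes Python's engine behaves like BigQuery's for this pattern; the helper name `find_match` and the sample HTML strings are made up for illustration.

```python
import re

# First pattern, copied from the query: a tag whose quoted src/href value
# contains at least one \r, \n, or \t between non-whitespace runs.
PATTERN = re.compile(
    r'''(<[a-z][^>]+\s+(?:src|href)\s*=\s*(?:"\s*(?:[^"\s]+\s*(?:\r|\n|\t)+\s*[^"\s]+)+\s*"|\'\s*(?:[^\'\s]+\s*(?:\r|\n|\t)+\s*[^\'\s]+)+\s*\')[^>]+>)'''
)
# Second pattern, copied from the query: a quote followed by "+", which
# suggests string concatenation in script rather than real markup.
CONCAT = re.compile(r'''["\']\s*\+''')


def find_match(body: str):
    """Mimic the extract-and-filter steps of the query for one body."""
    m = PATTERN.search(body.lower())
    if m is None or CONCAT.search(m.group(1)):
        return None
    return m.group(1)


if __name__ == "__main__":
    samples = [
        '<img src="http://example.com/a\nb.png" alt="x">',  # newline in URL
        '<img src="http://example.com/a.png" alt="x">',     # clean URL
        '<img src="a b.png" alt="x">',                      # plain space only
    ]
    for html in samples:
        print(repr(find_match(html)))
```

Only the first sample should be reported; a plain space in the attribute value does not count, since the pattern requires `\r`, `\n`, or `\t`.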

Status: Archived (was: Untriaged)
Archiving P3s older than 1 year with no owner or component.
