New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 648179 link

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-08-08
OS: Linux , Android , Windows , Chrome , Mac
Pri: 3
Type: Bug



Sign in to add a comment

Collect metrics on when createElement/setAttribute/createAttribute throw for invalid names

Project Member Reported by domenic@chromium.org, Sep 19 2016

Issue description

We would like to match the set of local names createElement/setAttribute/createAttribute can create to the set of names the HTML parser can create. To do this, we need compat data on how often these throw on invalid characters.

Previously:
- https://www.w3.org/Bugs/Public/show_bug.cgi?id=24271
- https://www.w3.org/Bugs/Public/show_bug.cgi?id=27228
- https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/YyqxpIPU-0E/k5gu7i9dBwAJ

Specs:
- createElement/createAttribute/setAttribute validation rules: https://www.w3.org/TR/xml/#NT-Name
- attribute name character consumption rules:
  - https://html.spec.whatwg.org/multipage/syntax.html#before-attribute-name-state
  - https://html.spec.whatwg.org/multipage/syntax.html#attribute-name-state
- element name character consumption rules:
  - https://html.spec.whatwg.org/multipage/syntax.html#tag-open-state
  - https://html.spec.whatwg.org/multipage/syntax.html#tag-name-state
 
Description: Show this description
Cc: domenic@chromium.org
Components: Blink>DOM
Labels: -Type-Feature -OS-All OS-Android OS-Chrome OS-Linux OS-Mac OS-Windows Type-Bug
Owner: dominicc@chromium.org
Status: Assigned (was: Untriaged)
I'm unclear what exactly I should be measuring here.

Ideally there would be a proposal of what to move to, and I can measure how much more or less different things will throw as a result of that.

I can add counts of how often those APIs throw *at all* but I'd like to sanity check what question we think that is answering first.
Cc: zcorpan@gmail.com annevank...@gmail.com
The proposal is to move createElement/createAttribute/setAttribute to use the same validation rules as the parser, instead of using the NT-Name production. Hopefully Anne or Simon can help confirm, but based on a quick reading of the above-linked parser sections, the rules are:

- Attributes:
  - cannot start with tab, LF, FF, space, /, >
  - cannot contain tab, LF, FF, space, /, >, =
  - NUL characters get converted to U+FFFD
  - ASCII upper alpha -> ASCII lower alpha (note: https://github.com/whatwg/html/issues/2375)
- Element names:
  - must start with ASCII alpha
  - cannot contain tab, LF, FF, space, /, >
  - NUL characters get converted to U+FFFD
  - ASCII upper alpha -> ASCII lower alpha
Neither can start with/contain CR (U+000D), per https://html.spec.whatwg.org/multipage/syntax.html#preprocessing-the-input-stream.
Ah right, so more precisely, we'd convert any CRs to LF
OK, so I guess we want to measure: 1. how often we would throw under the new regime where we don't today; 2. how often the name would differ because of these transformations.

Where does this leave XML documents? Our XML integration actually just sits there blithely translating SAX callbacks to document->createElement calls. Do we want that to continue using the XML rules, or do this new thing?
Your XML parser presumably validates the element names before allocating the elements? You can already create element names that cannot be represented by the XML parser, e.g., document.createElement("test:test").
I think no changes for XML documents makes sense to me:

- The XML parser is not specified anywhere, so trying to match it is silly
- Very few people use XML
- Nobody who is using XML has asked for this feature, whereas several library authors have asked for it in the HTML context

If it would help, I could try to work on a spec patch proposing exactly what would change. Maybe doing that first is easier?
Status: Started (was: Assigned)
I've been looking at Blink to see if there's something we could reuse here to do this check; it would amount to spinning up a core/html/parser/HTMLTokenizer.h and flipping it into the kTagOpenState, feeding the name, and seeing if we transitioned to and stayed in kTagNameState. But maybe it is simpler to re-encode it, which almost as Domenic says in comment 4:

Elements
- Not empty
- First character is 'a' <= (ch | 0x20) <= 'z'
- Each ch not whitespace: ' ', 0xA, 0x9, 0xC *or 0xD, per Comment 5*.
- Each ch not '/', '>'
(modulo the null handling, hmm.)

Attributes:
- Not empty
- Each ch is not whitespace, /, >, ", ', <, =

And the case handling per comment 4.
I guess the HTML parser is not more permissive than DOM; DOM will accept createElement(':foo') but <:foo> is not a valid per the HTML parser.

What should createElementNS, which processes a qualified name in a string, accept now? Do things like createElementNS('foo:bar:baz') make sense now, with a local name of bar:baz?

The custom element name restrictions are also inspired by XML and may need to be changed.
createAttributeNS also needs consistent handling.
OK, so it looks like the valid HTML attribute names are a superset of the DOM ones. So let's not measure attribute names. Turning errors into non-errors should be pretty safe.

For element names, I think NUL is already invalid in DOM so HTML's remapping to 0xfffd should be fine.

The spec will need to decide what to do with createElementNS, createAttributeNS. Do they keep the existing rules, or do they relax the rules to more closely match the non-namespaced ones? Relaxing the rules is probably nice for consistency. That leaves createProcessingInstruction; it probably doesn't matter what happens to that.

I've uploaded a patch to https://codereview.chromium.org/2838123002 ; it would be good to really closely check that the HTML name validity check is accurately implemented.
Project Member

Comment 14 by bugdroid1@chromium.org, Apr 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cb56b4ff751b3fef6ada2ae547a2c374caca8361

commit cb56b4ff751b3fef6ada2ae547a2c374caca8361
Author: dominicc <dominicc@chromium.org>
Date: Wed Apr 26 03:07:36 2017

Count element name validity per DOM versus HTML parsing.

DOM uses the XML "Name" production for tag name validity, whereas
the HTML parser does something different. This creates edge cases
where some elements can only be created by createElement, like
<:foo>, and some can only be created by the HTML parser, like
<foo">.

This adds a use counter to count how often pages use createElement,
etc. with tag names that are valid in one system but not the other
to determine whether it is practical for DOM to use the same rules
as HTML for tag names.

BUG=648179

Review-Url: https://codereview.chromium.org/2838123002
Cr-Commit-Position: refs/heads/master@{#467219}

[add] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/LayoutTests/fast/dom/name-validity-usecounter.html
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/bindings/core/v8/V0CustomElementConstructorBuilder.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/DOMImplementation.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/DOMImplementation.h
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/DOMImplementation.idl
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/Document.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/Document.h
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/Document.idl
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/TreeScopeTest.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/custom/CustomElementTest.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/custom/CustomElementTestHelpers.h
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/dom/custom/CustomElementUpgradeSorterTest.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/frame/UseCounter.h
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/third_party/WebKit/Source/core/paint/PaintLayerScrollableAreaTest.cpp
[modify] https://crrev.com/cb56b4ff751b3fef6ada2ae547a2c374caca8361/tools/metrics/histograms/histograms.xml

NextAction: 2017-08-08
Thanks so much for taking this up. I've worked out an initial draft of the spec change at https://github.com/whatwg/dom/pull/449, but I wasn't able to come up with any obvious answers for the -NS variants. Your thoughts would be appreciated. I'm also reaching out to the framework authors who have mentioned this problem in the past to see if they can chime in on that thread.
Awesome!
The NextAction date has arrived: 2017-08-08
It might make sense to wait a few more days to confirm stable has rolled out broadly with these use counters.

Here are the available data:

element creation attempts
-------------------------
DOM valid, HTML parser invalid: 0.047 % page loads
DOM invalid, HTML parser valid: 0.000003 % page loads

If the idea was to change DOM's rules to match HTML, I guess we probably should not do that.

You could countenance changing the HTML parser. More research may be needed--the HTML parser doesn't even try to create an element it considers invalid so that DOM valid, HTML parser invalid is under counting the impact of that sort of change.

I believe in practice it will look like pages like:

<body><:foo>hi</:foo>

today render like:

<:foo>hi

and would render like:

hi <-- in a :foo element.

What does the history of the design of the HTML parsing algorithm tell us about how its element name validity rules were arrived at?
Thanks so much for gathering the data! Let's discuss the potential path at https://github.com/whatwg/dom/pull/449, if that's OK.
Owner: ----
Status: Available (was: Started)
Bulk edit bugs owned by dominicc@
Project Member

Comment 22 by bugdroid1@chromium.org, May 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ac6b00ded10454b3bf58ac84c6eb4c8bc6de1c8d

commit ac6b00ded10454b3bf58ac84c6eb4c8bc6de1c8d
Author: Fredrik Söderquist <fs@opera.com>
Date: Wed May 09 07:39:00 2018

Fix [a-z] range comparison in IsValidElementNamePerHTMLParser

Bug: 648179
Change-Id: I623023d6e195447381bf8b08124d7475bb0fe93d
Reviewed-on: https://chromium-review.googlesource.com/1049629
Reviewed-by: Kent Tamura <tkent@chromium.org>
Commit-Queue: Fredrik Söderquist <fs@opera.com>
Cr-Commit-Position: refs/heads/master@{#557117}
[modify] https://crrev.com/ac6b00ded10454b3bf58ac84c6eb4c8bc6de1c8d/third_party/blink/renderer/core/dom/document.cc

Sign in to add a comment