Project: v8 Issues People Development process History Sign in
New issue
Advanced search Search tips
Issue 4743 Implement unicode regexp property class
Starred by 5 users Project Member Reported by yangguo@chromium.org, Feb 9 2016 Back to list
Status: Assigned
Owner:
Cc:
HW: ----
OS: ----
Priority: 2
Type: FeatureRequest


Sign in to add a comment
This is currently stage0, but that doesn't prohibit us from experimenting, right? Property classes have the syntax:
/\p{<property alias>}/u
or negated
/\P{<property alias>}/u
and only work in combination with the unicode flag.

It's not entirely clear what <property alias> can be. It should work for character categories:
http://www.fileformat.info/info/unicode/category/index.htm

 
Labels: -Type-Bug Type-FeatureRequest
Project Member Comment 4 by bugdroid1@chromium.org, Mar 7 2016
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89

commit 7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89
Author: yangguo <yangguo@chromium.org>
Date: Mon Mar 07 18:11:55 2016

Revert of [regexp] extend property classes by script category. (patchset #1 id:1 of https://codereview.chromium.org/1774513002/ )

Reason for revert:
wrong noi18n expectations

Original issue's description:
> [regexp] extend property classes by script category.
>
> R=littledan@chromium.org
> BUG=v8:4743
> LOG=N
>
> Committed: https://crrev.com/22f6735ccbe2e341d341e61b9c38ce308b8da655
> Cr-Commit-Position: refs/heads/master@{#34553}

TBR=littledan@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=v8:4743

Review URL: https://codereview.chromium.org/1772563003

Cr-Commit-Position: refs/heads/master@{#34554}

[modify] https://crrev.com/7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89/src/char-predicates-inl.h
[modify] https://crrev.com/7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89/src/char-predicates.h
[modify] https://crrev.com/7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89/src/regexp/regexp-parser.cc
[delete] https://crrev.com/22f6735ccbe2e341d341e61b9c38ce308b8da655/test/mjsunit/harmony/regexp-property-script-category.js
[rename] https://crrev.com/7dc13c2a8cc44c0ec58e31caaa17653e9d2e5a89/test/mjsunit/harmony/unicode-regexp-property-class.js

Comment 10 by math...@qiwi.be, May 20 2016
Cc: math...@qiwi.be
Comment 11 Deleted
Comment 12 by math...@qiwi.be, Jun 11 2016
There is now a formal proposal to add this functionality to the language: https://github.com/mathiasbynens/es-regexp-unicode-property-escapes
Comment 13 by phistuck@gmail.com, Jun 11 2016
Cool! Add it to the stage 0 list?
https://github.com/tc39/proposals/blob/master/stage-0-proposals.md
Project Member Comment 15 by bugdroid1@chromium.org, Jun 14 2016
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/3fe12ef8fab76dfeeab6b568067bd9be8d8a6b37

commit 3fe12ef8fab76dfeeab6b568067bd9be8d8a6b37
Author: yangguo <yangguo@chromium.org>
Date: Tue Jun 14 13:58:53 2016

Revert of [regexp] implement \p{Any}, \p{Ascii}, and \p{Assigned}. (patchset #3 id:40001 of https://codereview.chromium.org/2059113002/ )

Reason for revert:
compile failure

Original issue's description:
> [regexp] implement \p{Any}, \p{Ascii}, and \p{Assigned}.
>
> R=littledan@chromium.org, mathias@qiwi.be
> BUG=v8:4743
>
> Committed: https://crrev.com/92bfd13457c80f02be01551f4ea9a5badfe0e4c4
> Cr-Commit-Position: refs/heads/master@{#36969}

TBR=littledan@chromium.org,mathias@qiwi.be
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=v8:4743

Review-Url: https://codereview.chromium.org/2065083002
Cr-Commit-Position: refs/heads/master@{#36970}

[modify] https://crrev.com/3fe12ef8fab76dfeeab6b568067bd9be8d8a6b37/src/regexp/regexp-parser.cc
[modify] https://crrev.com/3fe12ef8fab76dfeeab6b568067bd9be8d8a6b37/src/regexp/regexp-parser.h
[modify] https://crrev.com/3fe12ef8fab76dfeeab6b568067bd9be8d8a6b37/test/mjsunit/harmony/regexp-property-general-category.js
[delete] https://crrev.com/92bfd13457c80f02be01551f4ea9a5badfe0e4c4/test/mjsunit/harmony/regexp-property-lu-ui.js
[delete] https://crrev.com/92bfd13457c80f02be01551f4ea9a5badfe0e4c4/test/mjsunit/harmony/regexp-property-special.js

Project Member Comment 17 by bugdroid1@chromium.org, Nov 18 2016
Labels: Priority-2
Cc: js...@chromium.org
The current spec [0] requires recognizing Other_ID_Start and Other_ID_Continue as binary property classes.

ICU 57 doesn't have a UProperty for Other_ID_Start and Other_ID_Continue. Instead, it seems to consider these part of ID_Start and ID_Continue respectively. For example 

I'm hesitant to just include the definitions of these two binary property classes directly in V8. Is there anything we can do in ICU?

[0] https://github.com/tc39/proposal-regexp-unicode-property-escapes
Some background:

> The `Other_ID_Start` property provides a small list of characters that
> qualified as `ID_Start` characters in some previous version of Unicode
> solely on the basis of their `General_Category` properties, but that no
> longer qualify in the current version. These are called “grandfathered
> characters”.

— http://unicode.org/reports/tr31/#Backward_Compatibility

As such, I’d rather not remove these potentially useful properties from the proposal, unless they prove to be painful to implement or require hardcoding.
Here are the code points matching these properties:

$ node
> require('unicode-9.0.0/Binary_Property/Other_ID_Start/code-points.js')
[0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C]
> require('unicode-9.0.0/Binary_Property/Other_ID_Continue/code-points.js')
[0xB7, 0x387, 0x1369, 0x136A, 0x136B, 0x136C, 0x136D, 0x136E, 0x136F, 0x1370, 0x1371, 0x19DA]

These lists are smaller than I’d expected. Still, deferring to ICU somehow would be preferable to hardcoding them.
Project Member Comment 22 by bugdroid1@chromium.org, Apr 7
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/749c00ef454d08807d659ebcbfb92169e2d1dd06

commit 749c00ef454d08807d659ebcbfb92169e2d1dd06
Author: yangguo <yangguo@chromium.org>
Date: Fri Apr 07 18:35:30 2017

[regexp] stage --harmony-regexp-property

R=adamk@chromium.org
BUG=v8:4743

Review-Url: https://codereview.chromium.org/2801263002
Cr-Commit-Position: refs/heads/master@{#44493}

[modify] https://crrev.com/749c00ef454d08807d659ebcbfb92169e2d1dd06/src/flag-definitions.h

Project Member Comment 23 by bugdroid1@chromium.org, Apr 10
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/582921454d593607cf1c49322f2b412fe7002a2b

commit 582921454d593607cf1c49322f2b412fe7002a2b
Author: yangguo <yangguo@chromium.org>
Date: Mon Apr 10 10:19:01 2017

[regexp] add more tests for binary property classes.

R=jgruber@chromium.org
BUG=v8:4743

Review-Url: https://codereview.chromium.org/2803693006
Cr-Commit-Position: refs/heads/master@{#44512}

[modify] https://crrev.com/582921454d593607cf1c49322f2b412fe7002a2b/test/mjsunit/harmony/regexp-property-binary.js

Project Member Comment 24 by bugdroid1@chromium.org, Apr 11
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/26e5d0129ca501e4dbadaa098e65222687ce4ab0

commit 26e5d0129ca501e4dbadaa098e65222687ce4ab0
Author: yangguo <yangguo@chromium.org>
Date: Tue Apr 11 07:10:33 2017

[regexp] implement \p{Other_ID_Start} and \p{Other_ID_Continue}.

Other_ID_Start and Other_ID_Continue are not supported by ICU, so for
now we implement these manually as special binary property classes.

R=jgruber@chromium.org
BUG=v8:4743

Review-Url: https://codereview.chromium.org/2808803002
Cr-Commit-Position: refs/heads/master@{#44549}

[modify] https://crrev.com/26e5d0129ca501e4dbadaa098e65222687ce4ab0/src/regexp/regexp-parser.cc
[modify] https://crrev.com/26e5d0129ca501e4dbadaa098e65222687ce4ab0/test/mjsunit/harmony/regexp-property-special.js

Per https://github.com/tc39/proposal-regexp-unicode-property-escapes/issues/27 deprecated binary properties and contributory binary properties (such as `Other_ID_Start` and `Other_ID_Continue`) have been removed from the proposal.

`Composition_Exclusion` and `Prepended_Concatenation_Mark` might get removed, too, once I figure out why ICU doesn’t support them.

Patch: https://codereview.chromium.org/2809143003/
Project Member Comment 26 by bugdroid1@chromium.org, Apr 11
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/f956279ed618f026cec357b4a36f0efcc77e2fd6

commit f956279ed618f026cec357b4a36f0efcc77e2fd6
Author: mathias <mathias@qiwi.be>
Date: Tue Apr 11 12:40:41 2017

[regexp] remove \p{Other_ID_Start} and \p{Other_ID_Continue}

The spec proposal has been updated to drop contributory binary
properties such as `Other_ID_Start` and `Other_ID_Continue`.

This patch reverts commit 26e5d0129ca501e4dbadaa098e65222687ce4ab0 and
adds tests to ensure these properties are not supported.

R=
BUG=v8:4743

Review-Url: https://codereview.chromium.org/2809143003
Cr-Commit-Position: refs/heads/master@{#44569}

[modify] https://crrev.com/f956279ed618f026cec357b4a36f0efcc77e2fd6/src/regexp/regexp-parser.cc
[modify] https://crrev.com/f956279ed618f026cec357b4a36f0efcc77e2fd6/test/mjsunit/harmony/regexp-property-special.js

Exhaustive tests for all supported properties (and their aliases), values (and their aliases) against all Unicode code points: https://github.com/tc39/test262/tree/master/test/built-ins/RegExp/property-escapes

V8’s implementation currently fails the following:

```
FAIL output/Emoji_Component.js
 Expected no error, got SyntaxError: Invalid regular expression: /^\p{Emoji_Component}+$/: Invalid property name
```

This failure was expected since ICU currently doesn’t support `Emoji_Component`. However, it will likely be included in ICU 60: https://ssl.icu-project.org/trac/ticket/13062

```
FAIL output/Emoji.js
  `\p{Emoji}` should match U+01F9E6 (`🧦`)

FAIL output/Emoji_Presentation.js
  `\p{Emoji_Presentation}` should match U+01F9E6 (`🧦`)

FAIL output/Emoji_Modifier_Base.js
  `\P{Emoji_Modifier_Base}` should match U+01F93C (`🤼`)
```

These can be fixed by updating to ICU 59.1, which uses Emoji 5.0 data.

Thanks Mathias!

V8 follows Chrome wrt ICU version upgrades, so I guess we will just wait.

What about Composition_Exclusion and Prepended_Concatenation_Mark? Does Test262 not test them?
https://github.com/tc39/proposal-regexp-unicode-property-escapes/issues/27 has some background on this, but TL;DR:

1. `Prepended_Concatenation_Mark` has been removed from the proposal because it’s an informative property rather than a normative one.

2. `Full_Composition_Exclusion` and `Composition_Exclusion` have been removed from the proposal because even UTS18 RL2.7 dropped them in revision 19. http://unicode.org/reports/tr18/#Modifications

I have an open PR to remove those three test files from Test262: https://github.com/tc39/test262/pull/980. The latest version of the tests and the script to generate them can always be found here: https://github.com/mathiasbynens/unicode-property-escapes-tests
Blockedon: chromium:711901
Regarding Full_Composition_Exclusion: would it be a spec violation to parse it?
Yes, it would — https://tc39.github.io/proposal-regexp-unicode-property-escapes/#sec-static-semantics-unicodematchproperty-p says:

> To ensure interoperability, implementations must not extend Unicode property support to the remaining enumeration properties.

(I plan to contribute a few manually-written tests to Test262 that assert those properties throw, among other things.)
Blockedon: chromium:699469
Project Member Comment 35 by bugdroid1@chromium.org, Apr 18
Blockedon: 896 2728
Note that any SyntaxErrors in Unicode RegExp property escapes should be early errors instead of runtime errors.
re: Other_ID_Start and Other_ID_Continue

I see little value in supporting them. 

If they're necessary for some hard-to-imagine reasons, Other_ID_Start can be derived by subtracting [:L:][:Nl:] from [:ID_Start:]]. 

Other_ID_Continue can be derived subtracting [[:ID_Start:][:Mn:][:Mc:][:Nd:][:Pc:] from [:ID_Continue:].  

( and adding [:Pattern_Syntax:] and [:Pattern_WhiteSpace:] ). 

http://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers


Sorry I wrote comment without reading all the comments. It's good that Other_ID_* were dropped from the spec. 

Anyway, ICU in Chromium's tree was updated to 59.1. Note that not all Emoji-related properties are supported in ICU 59.1 (the release cycles of Emoji, Unicode and ICU are kinda 'tangled' up. They are expected to be harmonized in the future). 
@jshin, is there a tracking bug for ICU 60 already? Perhaps we can reuse https://bugs.chromium.org/p/chromium/issues/detail?id=711901 for that.
ICU 60 will not be released until October. 
Anyway, Emoji_Component(?)-related properties support was added to ICU trunk. 
If we really want to support them now, we can merge those changes to our copy of ICU. 

Mathias, what if we just remove Emoji_Component for the initial version of RegExp Unicode property escapes, and add it in in a future revision? This could let us get the overall feature to users faster and reduce the burden on other engines who are trying to implement it.
Did any other implementers mark `Emoji_Component` as a blocker? Not everyone is using ICU, right?

So far the open bugs tracking the implementation have received little attention: https://github.com/tc39/proposal-regexp-unicode-property-escapes/issues/4#issue-159392455
> Not everyone is using ICU, right?

Chrome, Firefox and Safari do.  (I suspect Edge is moving in that direction as well). 
Sign in to add a comment