Unicode regular expression using alternation/OR returns null
Reported by
wuzzyv...@gmail.com,
Aug 25 2016
|
|||||||
Issue description
Chrome Version : 52.0.2743.116 (64-bit)
Other browsers tested:
Add OK or FAIL, along with the version, after other browsers where you
have tested this issue:
Firefox: OK
What steps will reproduce the problem?
(1) Open JS Console
(2) Run the following: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)
(3) Inspect result.
What is the expected result?
Expected (from Firefox): [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ]
What happens instead?
Chrome/Node returns null
Per discussion at http://stackoverflow.com/q/39152590/500207, the following works: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋{1}/ug)
Also per discussion on StackOverflow, Chrome does work for ASCII sub-strings, and for character sequences, i.e., [🍤🍦🍋], but I am testing specifically with extra-BMP (like emoji) and multi-character strings (precluding the use of character sequences like [abc]).
,
Aug 26 2016
,
Aug 26 2016
,
Aug 29 2016
,
Aug 30 2016
Note that alternating only two sequences works fine:
> '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug);
< ["🍤", "🍦", "🍦", "🍤"]
But as soon as a third one is added, something breaks:
> '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug);
< null
,
Apr 4 2017
This better illustrates what's going on: '🍤🍦🍋ππ🍋🍦🍤'.match(/🍤|🍦|π|🍋/ug) ["π", "π"] Matching SMP chars breaks. I guess something is going wrong when cases are somehow merged.
,
Apr 4 2017
,
Apr 11 2017
RegExpDisjunction::ToNode contains this snippet [0]:
if (alternatives->length() > 2) {
bool found_consecutive_atoms = SortConsecutiveAtoms(compiler);
if (found_consecutive_atoms) RationalizeConsecutiveAtoms(compiler);
FixSingleCharacterDisjunctions(compiler);
if (alternatives->length() == 1) {
return alternatives->at(0)->ToNode(compiler, on_success);
}
}
In --regexp-trace-parser syntax:
Parsing '/🍤|🍦|🍋/ug' gives us: (| '\ud83c\udf64' '\ud83c\udf66' '\ud83c\udf4b')
After RationalizeConsecutiveAtoms: (| (: '\ud83c' (| '\udf4b' '\udf64' '\udf66'))) // This still matches as expected.
After FixSingleCharacterDisjunctions: (| (: '\ud83c' (| [\udf4b \udf64 \udf66]))) // Fails to match.
[0] https://cs.chromium.org/chromium/src/v8/src/regexp/jsregexp.cc?q=RegExpDisjunction::ToNo+package:%5Echromium$&l=5403
,
Apr 11 2017
Looks like the issue is in RationalizeConsecutiveAtoms. RCA optimizes 'ab|ac|az' to 'a(?:b|c|d)', but doesn't consider leading surrogates and thus also changes '🍤|🍦|🍋' to '\ud83c(?:\udf4b|\udf64|\udf66)' (i.e. factoring out the leading surrogate). In unicode mode, the matcher then sees that '🍤' does not match '\ud83c' and advances over the entire surrogate pair. Cooking up a CL to fix this. yangguo@: It may be a separate bug that the form after RationalizeConsecutiveAtoms still matches correctly.
,
Apr 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/v8/v8.git/+/4635572471b28a38f04307fabc69d5e1802d1049 commit 4635572471b28a38f04307fabc69d5e1802d1049 Author: jgruber <jgruber@chromium.org> Date: Wed Apr 12 09:09:12 2017 [regexp] Consider surrogate pairs when optimizing disjunctions RationalizeConsecutiveAtoms optimizes ab|ac|az to a(?:b|c|d). Ensure that this optimization does not split surrogate pairs in unicode mode. BUG= chromium:641091 Review-Url: https://codereview.chromium.org/2813893002 Cr-Commit-Position: refs/heads/master@{#44599} [modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/jsregexp.cc [modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/regexp-ast.h [modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/regexp-parser.cc [add] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/test/mjsunit/regress/regress-641091.js
,
Apr 12 2017
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by wuzzyv...@gmail.com
, Aug 25 2016