New issue
Advanced search Search tips

Issue 641091 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Unicode regular expression using alternation/OR returns null

Reported by wuzzyv...@gmail.com, Aug 25 2016

Issue description

Chrome Version       : 52.0.2743.116 (64-bit)
Other browsers tested:
  Add OK or FAIL, along with the version, after other browsers where you
have tested this issue:
    Firefox: OK

What steps will reproduce the problem?
(1) Open JS Console
(2) Run the following: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)
(3) Inspect result.

What is the expected result? 

Expected (from Firefox): [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ]

What happens instead?

Chrome/Node returns null

Per discussion at http://stackoverflow.com/q/39152590/500207, the following works: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋{1}/ug)

Also per discussion on StackOverflow, Chrome does work for ASCII sub-strings, and for character sequences, i.e., [🍤🍦🍋], but I am testing specifically with extra-BMP (like emoji) and multi-character strings (precluding the use of character sequences like [abc]).
 

Comment 1 by wuzzyv...@gmail.com, Aug 25 2016

The regexp seems to work fine in Chrome without the `u` Unicode modifier, i.e., this produces correct result:

'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/g)
Labels: TE-NeedsTriageHelp
Components: Blink>JavaScript
Cc: yangguo@chromium.org jgruber@chromium.org
Components: -Blink>JavaScript Blink>JavaScript>Language
Labels: -TE-NeedsTriageHelp
Status: Available (was: Unconfirmed)

Comment 5 by math...@qiwi.be, Aug 30 2016

Note that alternating only two sequences works fine:

    > '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug);
    < ["🍤", "🍦", "🍦", "🍤"]

But as soon as a third one is added, something breaks:

    > '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug);
    < null
This better illustrates what's going on:

'🍤🍦🍋ππ🍋🍦🍤'.match(/🍤|🍦|π|🍋/ug)
["π", "π"]

Matching SMP chars breaks. I guess something is going wrong when cases are somehow merged.
Cc: -jgruber@chromium.org
Owner: jgruber@chromium.org
RegExpDisjunction::ToNode contains this snippet [0]:

  if (alternatives->length() > 2) {
    bool found_consecutive_atoms = SortConsecutiveAtoms(compiler);
    if (found_consecutive_atoms) RationalizeConsecutiveAtoms(compiler);
    FixSingleCharacterDisjunctions(compiler);
    if (alternatives->length() == 1) {
      return alternatives->at(0)->ToNode(compiler, on_success);
    }
  }

In --regexp-trace-parser syntax:

Parsing '/🍤|🍦|🍋/ug' gives us: (| '\ud83c\udf64' '\ud83c\udf66' '\ud83c\udf4b')
After RationalizeConsecutiveAtoms: (| (: '\ud83c' (| '\udf4b' '\udf64' '\udf66')))  // This still matches as expected.
After FixSingleCharacterDisjunctions: (| (: '\ud83c' (| [\udf4b \udf64 \udf66])))  // Fails to match.

[0] https://cs.chromium.org/chromium/src/v8/src/regexp/jsregexp.cc?q=RegExpDisjunction::ToNo+package:%5Echromium$&l=5403
Status: Started (was: Available)
Looks like the issue is in RationalizeConsecutiveAtoms. RCA optimizes 'ab|ac|az' to 'a(?:b|c|d)', but doesn't consider leading surrogates and thus also changes '🍤|🍦|🍋' to '\ud83c(?:\udf4b|\udf64|\udf66)' (i.e. factoring out the leading surrogate).

In unicode mode, the matcher then sees that '🍤' does not match '\ud83c' and advances over the entire surrogate pair.

Cooking up a CL to fix this.

yangguo@: It may be a separate bug that the form after RationalizeConsecutiveAtoms still matches correctly.
Project Member

Comment 10 by bugdroid1@chromium.org, Apr 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/4635572471b28a38f04307fabc69d5e1802d1049

commit 4635572471b28a38f04307fabc69d5e1802d1049
Author: jgruber <jgruber@chromium.org>
Date: Wed Apr 12 09:09:12 2017

[regexp] Consider surrogate pairs when optimizing disjunctions

RationalizeConsecutiveAtoms optimizes ab|ac|az to a(?:b|c|d).
Ensure that this optimization does not split surrogate pairs in unicode
mode.

BUG= chromium:641091 

Review-Url: https://codereview.chromium.org/2813893002
Cr-Commit-Position: refs/heads/master@{#44599}

[modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/jsregexp.cc
[modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/regexp-ast.h
[modify] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/src/regexp/regexp-parser.cc
[add] https://crrev.com/4635572471b28a38f04307fabc69d5e1802d1049/test/mjsunit/regress/regress-641091.js

Status: Fixed (was: Started)

Sign in to add a comment