2019-02-10

Matching balanced parentheses with recursive Perl regular expressions

This blog post explains how to use recursive Perl regular expressions (regexp) to match substrings with balanced parentheses. Recursive regular expressions is also available in Ruby (with a different syntax) and in the regex extension of Python (but not in the built-in re module), but it's explicitly not available in RE2.

Let's suppose the input file in.txt contains lines like:

a = EQ(x + 6, 42);
a = EQ((x + 6) * 2, 42);
if (x + 6 == 42) { ... }
if (EQ(x + 6, 42)) { ... }
if (EQ((x + 6) * 2, 42)) { ... }
, and let's suppose we want to get all instances of EQ and if, with their arguments.

A non-recursive regexp can get only the instances without nested parentheses:

$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^()]*)\))@g)
{ print "$1($2)\n" }'
EQ(x + 6, 42)
if(x + 6 == 42)
EQ(x + 6, 42)

With a recursive regexp we can get all matches:

$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*(\(((?:[^()]+|(?2))*)\)))@g)
{ print "$1$2\n" }'
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)
if(x + 6 == 42)
if(EQ(x + 6, 42))
if(EQ((x + 6) * 2, 42))

Please note that the EQ inside the if was not matched, because with the global flag (m@...@g) Perl doesn't consider overlapping or enclosed matches.

The recursive part of the regexp is the (?2): it's a recursive reuse of paren group 2. For more information about recursive regexps, see recursive subpattern in perlre(1). The (?>gt...) construct is a performance optimization to prevent backtracking.

It's also possible to match individual (comma-separated) arguments. For example, here is how to match both arguments of EQ separately, recursively:

$ <in.txt perl -0777 -wne '
while (
m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*),\s*((?2))\))@g)
{ print "$1($2, $5)\n" }'
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)

The non-recursive version returns 2 arguments without nested parentheses:

$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*),\s*([^(),]*))@g)
{ print "$1($2, $3)\n" }'
EQ(x + 6, 42)
EQ(x + 6, 42)

Here is how to match only a single argument (no comma) recursively:

$ <in.txt perl -0777 -wne '
while (
m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*)\))@g)
{ print "$1($2)\n" }'
if(x + 6 == 42)
if(EQ(x + 6, 42))
if(EQ((x + 6) * 2, 42))

The non-recursive version returns 1 argument without nested parentheses:

$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*)\))@g)
{ print "$1($2)\n" }'
if(x + 6 == 42)

No comments: