This blog post explains how to use recursive Perl regular expressions (regexp) to match substrings with balanced parentheses. Recursive regular expressions is also available in Ruby (with a different syntax) and in the regex extension of Python (but not in the built-in re module), but it's explicitly not available in RE2.
Let's suppose the input file in.txt contains lines like:
a = EQ(x + 6, 42);
a = EQ((x + 6) * 2, 42);
if (x + 6 == 42) { ... }
if (EQ(x + 6, 42)) { ... }
if (EQ((x + 6) * 2, 42)) { ... }
, and let's suppose we want to get all instances of EQ and if, with their arguments.
A non-recursive regexp can get only the instances without nested parentheses:
$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^()]*)\))@g)
{ print "$1($2)\n" }'
EQ(x + 6, 42)
if(x + 6 == 42)
EQ(x + 6, 42)
With a recursive regexp we can get all matches:
$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*(\(((?:[^()]+|(?2))*)\)))@g)
{ print "$1$2\n" }'
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)
if(x + 6 == 42)
if(EQ(x + 6, 42))
if(EQ((x + 6) * 2, 42))
Please note that the EQ inside the if was not matched, because with the global flag (m@...@g) Perl doesn't consider overlapping or enclosed matches.
The recursive part of the regexp is the (?2): it's a recursive reuse of paren group 2. For more information about recursive regexps, see recursive subpattern in perlre(1). The (?>gt...) construct is a performance optimization to prevent backtracking.
It's also possible to match individual (comma-separated) arguments. For example, here is how to match both arguments of EQ separately, recursively:
$ <in.txt perl -0777 -wne '
while (
m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*),\s*((?2))\))@g)
{ print "$1($2, $5)\n" }'
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)
EQ(x + 6, 42)
EQ((x + 6) * 2, 42)
The non-recursive version returns 2 arguments without nested parentheses:
$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*),\s*([^(),]*))@g)
{ print "$1($2, $3)\n" }'
EQ(x + 6, 42)
EQ(x + 6, 42)
Here is how to match only a single argument (no comma) recursively:
$ <in.txt perl -0777 -wne '
while (
m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*)\))@g)
{ print "$1($2)\n" }'
if(x + 6 == 42)
if(EQ(x + 6, 42))
if(EQ((x + 6) * 2, 42))
The non-recursive version returns 1 argument without nested parentheses:
$ <in.txt perl -0777 -wne '
while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*)\))@g)
{ print "$1($2)\n" }'
if(x + 6 == 42)
No comments:
Post a Comment