This blog post explains how to use recursive Perl regular expressions (regexp) to match substrings with balanced parentheses. Recursive regular expressions is also available in Ruby (with a different syntax) and in the regex extension of Python (but not in the built-in re module), but it's explicitly not available in RE2.
Let's suppose the input file in.txt contains lines like:
a = EQ(x + 6, 42); a = EQ((x + 6) * 2, 42); if (x + 6 == 42) { ... } if (EQ(x + 6, 42)) { ... } if (EQ((x + 6) * 2, 42)) { ... }, and let's suppose we want to get all instances of
EQ
and if
, with their arguments.
A non-recursive regexp can get only the instances without nested parentheses:
$ <in.txt perl -0777 -wne ' while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^()]*)\))@g) { print "$1($2)\n" }' EQ(x + 6, 42) if(x + 6 == 42) EQ(x + 6, 42)
With a recursive regexp we can get all matches:
$ <in.txt perl -0777 -wne ' while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*(\(((?:[^()]+|(?2))*)\)))@g) { print "$1$2\n" }' EQ(x + 6, 42) EQ((x + 6) * 2, 42) if(x + 6 == 42) if(EQ(x + 6, 42)) if(EQ((x + 6) * 2, 42))
Please note that the EQ
inside the if
was not matched, because with the global flag (m@...@g
) Perl doesn't consider overlapping or enclosed matches.
The recursive part of the regexp is the (?2)
: it's a recursive reuse of paren group 2. For more information about recursive regexps, see recursive subpattern in perlre(1). The (?>gt...)
construct is a performance optimization to prevent backtracking.
It's also possible to match individual (comma-separated) arguments. For example, here is how to match both arguments of EQ
separately, recursively:
$ <in.txt perl -0777 -wne ' while ( m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*),\s*((?2))\))@g) { print "$1($2, $5)\n" }' EQ(x + 6, 42) EQ((x + 6) * 2, 42) EQ(x + 6, 42) EQ((x + 6) * 2, 42)
The non-recursive version returns 2 arguments without nested parentheses:
$ <in.txt perl -0777 -wne ' while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*),\s*([^(),]*))@g) { print "$1($2, $3)\n" }' EQ(x + 6, 42) EQ(x + 6, 42)
Here is how to match only a single argument (no comma) recursively:
$ <in.txt perl -0777 -wne ' while ( m@(>>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(((?:[^(),]+|(\(((?:[^()]+|(?3))*)\)))*)\))@g) { print "$1($2)\n" }' if(x + 6 == 42) if(EQ(x + 6, 42)) if(EQ((x + 6) * 2, 42))
The non-recursive version returns 1 argument without nested parentheses:
$ <in.txt perl -0777 -wne ' while (m@(?>\b(if|while|EQ|NE|LT|LE|GT|GE)\s*\(([^(),]*)\))@g) { print "$1($2)\n" }' if(x + 6 == 42)
No comments:
Post a Comment