Regex Recursion
Nested parentheses. The thing that breaks every regex you've ever written.You know the feeling. You need to match (something (with nested (parens))) and your regex engine just laughs at you. Regular expressions can't count. They're finite automata. They don't do recursion.
Except Perl does.
That pattern matches arbitrarily deep nested parentheses. One line. No parsing libraries. No state machines. Just regex.my $nested = qr~\( ( (?: [^()]+ | \( (?1) \) )* ) \)~x;
Looks like someone sneezed on a keyboard, right? Let's crack it open.
Part 1: THE OUTER SHELL
Strip away the recursion magic and you've got a sandwich:Literal open paren. Stuff in the middle. Literal close paren.\( ...stuff... \)
That's the easy part. The stuff in the middle is where it gets weird.
Part 2: THE CAPTURING GROUP
Those outer parentheses (inside the literal \( \)) create Group 1. This is the group that recurses. Everything it matches gets captured into $1.\( ( (?: [^()]+ | \( (?1) \) )* ) \) ^ ^ | | +------- GROUP 1 ------------+
For input like (a(b(c)d)e), Group 1 captures: a(b(c)d)e
Notice it doesn't include the outermost parens. Those are matched by the literal \( and \) at the edges.
Part 3: THE CHOICE
Inside Group 1, there's a non-capturing group with two options:Translation: "Match either of these things, zero or more times."(?: [^()]+ | \( (?1) \) )*
Option A: [^()]+ One or more non-paren characters Option B: \( (?1) \) An open paren, RECURSE, then close paren
The regex engine tries Option A first. If it hits an open paren, it switches to Option B and dives deeper.
Part 4: THE RECURSION
Here's the magic:This means "run Group 1's pattern right here, right now."(?1)
Not a backreference. Not "match what Group 1 already matched." It's "run the entire Group 1 pattern again from scratch."
When the engine hits (?1), it pushes its current state onto a stack and starts Group 1 over at the current position. When that nested run finishes, it pops back and continues.
It's function calls in regex.
Your brain should hurt a little. That's normal..--. |o_o | |:_/ | // \ \ (| | ) /'\_ _/`\ \___)=(___/
Part 5: WALKING THROUGH IT
Input: (a(b(c)d)e)Here's what happens:Position: 0 1 2 3 4 5 6 7 8 9 10 Character: ( a ( b ( c ) d ) e )
Three levels deep. Handled automatically.STEP PATTERN POS CHAR ACTION ----- --------- --- ---- ------------------------- 1 \( 0 ( Match literal open paren 2 [^()]+ 1 a Match non-paren char 3 \( 2 ( Match nested open 4 (?1) - - RECURSE into Group 1 5 [^()]+ 3 b Match non-paren (in recursion) 6 \( 4 ( Match deeper open 7 (?1) - - RECURSE AGAIN 8 [^()]+ 5 c Match non-paren (deepest) 9 - - - No more parens, return up 10 \) 6 ) Match close (level 2) 11 [^()]+ 7 d Match non-paren (level 1) 12 - - - Return up 13 \) 8 ) Match close (level 1) 14 [^()]+ 9 e Match non-paren (level 0) 15 \) 10 ) Match final close
Part 6: THE STACK
Think of it like function calls:Each (?1) pushes state. Each completed match pops back.LEVEL 0 (outer): +------------------------------------------+ | \( matches position 0 | | | | GROUP 1 capturing: a(b(c)d)e | | +-- [^()]+ matches: a | | +-- \( matches: ( | | +-- (?1) RECURSE ----+ | | : : | | +-- \) matches: ) <--+ (returns here) | | +-- [^()]+ matches: e | | | | \) matches position 10 | +------------------------------------------+ LEVEL 1 (first recursion): +--------------------------------+ | GROUP 1 matching: b(c)d | | +-- [^()]+ matches: b | | +-- \( matches: ( | | +-- (?1) RECURSE --+ | | : : | | +-- \) matches: ) <+ return | | +-- [^()]+ matches: d | +--------------------------------+ LEVEL 2 (deepest): +--------------+ | GROUP 1: c | | +-- [^()]+:| | c | | (no parens, | | return up) | +--------------+
Part 7: THE FULL PATTERN ANNOTATED
The /x modifier lets us spread it out with comments. Much easier to read than the compressed version.my $nested = qr~ \( # literal opening paren ( # GROUP 1 - this is what recurses (?: # non-capturing group for the choice [^()]+ # Option A: non-paren characters | # OR \( (?1) \) # Option B: open, RECURSE, close )* # zero or more times ) \) # literal closing paren ~x;
Part 8: USING IT
Output:#!/usr/bin/env perl use v5.14; # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Pattern for nested parens # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ my $nested = qr~ \( # literal opening paren ( # GROUP 1 - the recursing part (?: # non-capturing choice group [^()]+ # one or more non-parens | # OR \( (?1) \) # paren, recurse, paren )* # zero or more of either ) \) # literal closing paren ~x; # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Test it # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ my $text = "func(a, calc(b, inner(c)), d)"; if ($text =~ m~$nested~) { say "Matched: $&"; # full match with outer parens say "Inside: $1"; # contents without outer parens }
Matched: (a, calc(b, inner(c)), d) Inside: a, calc(b, inner(c)), d
Part 9: VARIATIONS
Match curly braces instead:Match square brackets:my $braces = qr~\{ ( (?: [^{}]+ | \{ (?1) \} )* ) \}~x;
Match HTML-style tags (simplified):my $brackets = qr~\[ ( (?: [^\[\]]+ | \[ (?1) \] )* ) \]~x;
That last one uses (?0) to recurse the entire pattern, and \1 as a backreference to match the closing tag name.my $tags = qr~<(\w+)> ( (?: [^<>]+ | (?0) )* ) </\1>~x;
THE CATCH
This is Perl-specific. PCRE has it. Python's re module doesn't (but the regex module does). JavaScript? Forget it.
If you need portability, you're back to writing a parser.
But if you're in Perl? One line. Unlimited depth. No libraries.
That's the kind of thing that makes people fall in love with this language.
perl.gg.---. / \ \.@-@./ /`\_/`\ // _ \\ | \ )|_ /`\_`> <_/ \ \__/'---'\__/