perl.gg / regex

Embedded Code Execution Inside Regex

2026-04-04

You are in the middle of a regex match. The engine is chewing through a string character by character. And right there, between the \d+ and the \s*, you drop in a block of arbitrary Perl code that runs mid-match.

    "abc123def" =~ m~(\d+)(?{ say "Found digits: $1" })~;
    # prints: Found digits: 123

That (?{ }) construct is not a lookahead. It is not a comment. It is live Perl code executing inside the regex engine while the match is still in progress. You have access to the match state, you can modify variables, and you can make decisions based on what has been matched so far.

This is one of the most powerful and most dangerous features in Perl's regex engine. Let's use it.

Part 1: THE BASIC SYNTAX

The (?{ CODE }) construct embeds executable Perl code inside a regular expression. The code runs when the regex engine reaches that point in the pattern:

    use feature 'say';

    "Hello World" =~ m~Hello (?{ say "matched Hello, continuing..." })\w+~;
    # Output: matched Hello, continuing...

The code block fires after Hello has matched and before \w+ starts matching. It is a checkpoint. The engine got this far, so your code runs.

You can put multiple code blocks in a single regex:

    "2026-04-04" =~ m~
        (\d{4}) (?{ say "Year: $1" })
        -
        (\d{2}) (?{ say "Month: $2" })
        -
        (\d{2}) (?{ say "Day: $2" })
    ~x;

Wait, that last $2 is wrong. But that is the point. You are debugging the regex by watching what the engine captures at each step. Fix it and move on.
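
With the typo fixed, the trace might look like the sketch below. It collects the messages in an array before printing them, so the result can also be inspected after the match; @trace is a name chosen here for illustration and is not part of the original example.

```perl
use strict;
use warnings;
use feature 'say';

# @trace is a hypothetical helper array used only for this illustration.
my @trace;

# Same pattern as above, with the third block reading the third capture group.
"2026-04-04" =~ m~
    (\d{4}) (?{ push @trace, "Year: $1"  })
    -
    (\d{2}) (?{ push @trace, "Month: $2" })
    -
    (\d{2}) (?{ push @trace, "Day: $3"   })
~x;

say for @trace;
```

Keep variables out of /x comments inside the pattern itself: the pattern is interpolated before comments are stripped, so a stray $3 in a comment can trigger an "uninitialized value" warning.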

Part 2: ACCESSING MATCH STATE

Inside (?{ }), you have access to all the usual match variables:

    my $str = "foo=42, bar=99, baz=7";

    # in scalar or void context m//g finds one match per call, so loop to see all three pairs
    1 while $str =~ m~(\w+)=(\d+)(?{ say "Key: $1, Value: $2, Position: " . pos() })~g;

The variables available inside the code block:

    VARIABLE     WHAT IT CONTAINS
    --------     -----------------------------------
    $1, $2...    Capture group contents (so far)
    $&           The entire match (so far)
    $`           Everything before the match
    $'           Everything after the match
    pos()        Current position in the string

You can also set your own variables. They are visible outside the regex after the match completes:
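
    my $count = 0;
    my $str   = "aabbbcccc";

    # loop so that /g visits every run, not just the first one
    1 while $str =~ m~((.)\2+)(?{ $count++ })~g;
    say "Found $count runs of repeated characters";   # Found 3 runs of repeated characters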

Part 3: COUNTING MATCHES

One of the most practical uses: count how many times a subpattern matches within a larger pattern:

    my $vowel_count = 0;
    "Supercalifragilistic" =~ m~
        (?:
            [aeiou] (?{ $vowel_count++ })
          | [^aeiou]
        )*
    ~xi;
    say "Vowels: $vowel_count";   # Vowels: 8

The (?{ $vowel_count++ }) fires every time the [aeiou] alternative matches. The [^aeiou] branch has no code block, so consonants are silently consumed.

Compare this to splitting the string and counting. The regex approach does it in a single pass, inline, while matching.
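
For a sense of the alternatives, here is a minimal sketch that counts the same vowels three ways: with the embedded code block, with a global match in list context, and with tr///. The variable names are illustrative only. All three agree on this input.

```perl
use strict;
use warnings;
use feature 'say';

my $word = "Supercalifragilistic";

# 1. Embedded code block, as in the example above.
my $by_block = 0;
$word =~ m~ (?: [aeiou] (?{ $by_block++ }) | [^aeiou] )* ~xi;

# 2. Global match in list context, counted by assigning to an empty list.
my $by_list = () = $word =~ m~[aeiou]~gi;

# 3. tr/// returns the number of characters it matched.
my $by_tr = ($word =~ tr/aeiouAEIOU//);

say "block: $by_block, list: $by_list, tr: $by_tr";   # block: 8, list: 8, tr: 8
```

For a plain character count, tr/// is the simplest tool. The code block earns its place when the count has to happen inside a larger pattern that is doing other work at the same time.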

Part 4: THE BACKTRACKING TRAP

Here is where things get tricky. The regex engine backtracks. When it does, your code blocks fire again on each new attempt:
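
    my $runs = 0;
    "aaa!" =~ m~(a+)(?{ $runs++; say "Trying: '$1'" })\w\w~;
    say "Code block ran $runs times";

    Trying: 'aaa'
    Trying: 'aa'
    Trying: 'a'
    Code block ran 3 times

The engine first matches all three a characters with a+ and runs the code block. Then \w\w needs two more word characters, but only ! is left, so that attempt fails. The engine backtracks, gives up one a, and runs the code block again; with 'aa' captured, the second \w still lands on ! and fails. Only when a+ has shrunk to a single a can \w\w match the remaining "aa". The code block fires on every attempt, not just the final successful one.

Do not rely on the exact count, though. Perl's optimizer may skip attempts it can prove will fail. If the pattern requires a literal string that the target does not contain (say, a trailing b matched against "aaa"), the match is rejected up front and the code block never runs at all.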

This means side effects in (?{ }) can fire multiple times. If you are incrementing a counter, you might get a wrong count. If you are pushing to an array, you might get duplicate entries.

The fix is to track state carefully or use (*PRUNE) and (*COMMIT) to control backtracking:

    my @matches;
    "abc123def456" =~ m~
        (?:
            (\d+) (?{ push @matches, $1 }) (*COMMIT)
          | .
        )*
    ~x;
    say join(', ', @matches);   # 123, 456

Part 5: DYNAMIC SUB-PATTERNS WITH (??{ })

The double-question-mark form (??{ CODE }) is even wilder. It runs the code and uses the return value as a regex pattern to match at that point:

    # match a run of a's followed by exactly as many b's
    "aaaa"   =~ m~^(a+)(??{ "b" x length($1) })$~;          # fails (no b's at all)
    "aaabb"  =~ m~^(a+)(??{ "b" x length($1) })$~;          # fails (3 a's, need 3 b's)
    "aaabbb" =~ m~^(a+)(??{ "b" x length($1) })$~;          # matches!
    "aaabbb" =~ m~^(a+)(??{ "b{" . length($1) . "}" })$~;   # matches too: the returned string is compiled as a pattern

The (??{ }) block returns a string that becomes part of the regex on the fly. You are generating regex from inside regex.
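
Another way to see the idea is a length-prefixed field, where the number at the front decides how many characters must follow. The "length:payload" format and the names $framed and @results are invented here for illustration; this is a sketch, not a recommended wire format.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical format for illustration: "<length>:<payload>".
# The (??{ }) block builds ".{N}" from the digits captured in $1.
my $framed = qr~\A (\d+) : ( (??{ '.{' . $1 . '}' }) ) \z~x;

my @inputs  = ("3:abc", "5:hello", "3:abcd", "4:abc");
my @results = map { $_ =~ $framed ? "ok:$2" : "bad" } @inputs;

say "$inputs[$_] -> $results[$_]" for 0 .. $#inputs;
```

The first two inputs carry exactly as many characters as they announce and match; the last two announce a length that disagrees with the payload and fail.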

A classic use: matching balanced parentheses:

    my $balanced;
    $balanced = qr~
        \(
        (?:
            [^()]+                # non-parens
          | (??{ $balanced })     # recurse
        )*
        \)
    ~x;

    my $str = "f(x, g(y, h(z)), w)";
    if ($str =~ m~($balanced)~) {
        say "Matched: $1";   # (x, g(y, h(z)), w)
    }

The (??{ $balanced }) construct recursively applies the pattern to match nested parentheses of any depth. The regex references itself through the variable.
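
As a small usage sketch, the same $balanced pattern can pull every top-level parenthesized group out of a longer expression with a global match in list context. The sample expression and the @groups name are assumptions made for this example.

```perl
use strict;
use warnings;
use feature 'say';

# Same recursive pattern as above: declare first, then assign,
# so the code block can refer to the variable being defined.
my $balanced;
$balanced = qr~
    \(
    (?:
        [^()]+                # non-parens
      | (??{ $balanced })     # recurse
    )*
    \)
~x;

# Hypothetical input; each top-level group is returned as one element.
my $expr   = "f(a) + g(b(c)) - h()";
my @groups = $expr =~ m~($balanced)~g;

say for @groups;
```

Interpolating a qr// object that contains a code block, as m~($balanced)~g does here, is allowed without any extra pragma.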

Part 6: BUILDING PARSE TREES

Use code blocks to build data structures while matching:

    my @tokens;
    my $str = 'name="perl" version="5.40" type="awesome"';

    $str =~ m~
        (?:
            (\w+)          # key
            = "
            ([^"]*)
            "              # value
            (?{ push @tokens, { key => $1, value => $2 }; })
            \s*
        )+
    ~x;

    for my $t (@tokens) {
        say "$t->{key} => $t->{value}";
    }

    name => perl
    version => 5.40
    type => awesome

The regex matches the key-value pairs and the code block builds an array of hashes as it goes. By the time the match finishes, your data structure is ready.

Is this a good idea for production parsing? Usually not. A proper parser is more maintainable. But for quick-and-dirty data extraction from well-structured text, this is devastatingly effective.
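
For comparison, here is a minimal sketch of the same extraction without code blocks, using an ordinary m//g loop. It is a conventional alternative rather than the technique this article is about: the push happens only after each pair has fully matched, so backtracking cannot duplicate an entry.

```perl
use strict;
use warnings;
use feature 'say';

my $str = 'name="perl" version="5.40" type="awesome"';
my @tokens;

# Each successful iteration of the loop yields one complete key="value" pair.
while ($str =~ m~(\w+)="([^"]*)"~g) {
    push @tokens, { key => $1, value => $2 };
}

say "$_->{key} => $_->{value}" for @tokens;
```

The trade-off is that the loop silently skips text between pairs, whereas the single-pattern version describes the shape of the whole string.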

Part 7: A PRACTICAL TOKENIZER

Here is a real tokenizer for a simple expression language, built entirely with embedded code:

    my @tokens;
    my $input = "x = 42 + y * (3 - z)";

    $input =~ m~
        \A
        (?:
            \s+                # skip whitespace
          | ( [a-zA-Z_]\w* )   (?{ push @tokens, [IDENT => $1] })
          | ( \d+ )            (?{ push @tokens, [NUM   => $2] })
          | ( [+\-*/=] )       (?{ push @tokens, [OP    => $3] })
          | ( [()] )           (?{ push @tokens, [PAREN => $4] })
          | (.)                (?{ die "Unexpected char: '$5' at pos " . pos() })
        )*
        \z
    ~x;

    for my $tok (@tokens) {
        printf "%-8s %s\n", $tok->[0], $tok->[1];
    }

    IDENT    x
    OP       =
    NUM      42
    OP       +
    IDENT    y
    OP       *
    PAREN    (
    NUM      3
    OP       -
    IDENT    z
    PAREN    )

Each alternative in the regex matches one token type. The embedded code block classifies and stores it. The final (.) catch-all produces a useful error for unexpected characters.

Part 8: LOCAL VARIABLES AND SCOPE

Variables modified inside (?{ }) follow the same scoping rules as regular Perl code. But be careful with local:
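
    our $depth = 0;   # local needs a package variable; it cannot localize a "my" lexical
    "((()))" =~ m~
        (?:
            \( (?{ local $depth = $depth + 1; say " " x $depth . "down to $depth" })
          | \) (?{ say " " x $depth . "up from $depth"; local $depth = $depth - 1 })
          | [^()]
        )*
    ~x;

Using local inside (?{ }) means the value change is undone if the regex engine backtracks past that point. This is exactly what you want when tracking state that should mirror the match progress. Two details matter here: local only works on package variables, which is why $depth is declared with our, and every localized change is also unwound when the match ends, so copy the final value into another variable from inside the pattern if you need it afterwards.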

Without local, changes to variables persist even through backtracking, which can give you wrong results when the engine tries alternative paths.
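
The sketch below adapts the counting example from the perlre documentation to show both behaviours side by side. $cnt is localized on every step, $plain is a naive counter added here for contrast, and $res receives the final value from a code block at the end of the pattern. The names $plain and $str are assumptions of this example.

```perl
use strict;
use warnings;
use feature 'say';

our $cnt   = 0;   # localized counter: rolled back when the engine backtracks
our $plain = 0;   # naive counter: keeps every increment, even abandoned ones
my  $res;         # receives the final value before the localizations are unwound

my $str = 'a' x 8;
$str =~ m~
    (?{ $cnt = 0 })
    ( a (?{ local $cnt = $cnt + 1; $plain++ }) )*
    aaaa
    (?{ $res = $cnt })
~x;

say "res=$res cnt=$cnt plain=$plain";
```

The loop first swallows all eight characters, then has to hand four of them back so that the literal aaaa can match. $res ends up as 4 because the localized increments were undone during backtracking, and $cnt is back to 0 once the match has finished. $plain keeps counting the abandoned attempts, so it comes out larger than $res.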

Part 9: PERFORMANCE AND WARNINGS

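Perl is cautious about (?{ }). If a pattern that contains a code block is interpolated from a string at run time, the match dies with:

    Eval-group not allowed at runtime, use re 'eval' in regex m/.../

The restriction exists because an interpolated string may come from untrusted input, and a code block would let that input run arbitrary Perl. To opt in, add use re 'eval' in the scope of the match:

    use re 'eval';   # required whenever a code block arrives via an interpolated string
    my $pat = '(\d+)(?{ say $1 })';
    "abc123" =~ m~$pat~;

Code blocks written literally in the pattern, or inside a qr// object that is later interpolated, do not need the pragma. Perl 5.18 reworked the implementation so that literal code blocks are compiled along with the surrounding code and behave as proper closures, but code blocks in patterns assembled from strings still need use re 'eval'.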
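
A minimal sketch of both outcomes follows: the same string pattern is refused outside the pragma's scope and accepted inside it. The names @seen, $pat, and $error are illustrative, and the block refers to @main::seen by its full name so that it resolves regardless of how the run-time block is compiled.

```perl
use strict;
use warnings;
use feature 'say';

our @seen;                                       # filled by the code block
my $pat = '(\d+)(?{ push @main::seen, $1 })';    # pattern held in a plain string

# Outside the pragma's scope, compiling the interpolated pattern is fatal.
my $ok    = eval { "abc123" =~ m~$pat~; 1 };
my $error = $ok ? '' : $@;
say "without pragma: ", ($error ? "refused" : "allowed");

{
    use re 'eval';   # lexically scoped opt-in
    "abc123" =~ m~$pat~;
}
say "with pragma, seen: @seen";
```

Keeping use re 'eval' inside the smallest possible block limits where string-built patterns are allowed to execute code.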

Performance-wise, code blocks add overhead. The regex engine has to call back into the Perl interpreter at every code point. For simple patterns, this is negligible. For patterns in tight loops over millions of strings, measure before committing.

WHEN TO USE (?{ }):
* Debugging complex regexes
* Building data structures during matching
* Quick-and-dirty tokenizers
* Counting subpattern hits
* When the alternative is worse

WHEN NOT TO USE (?{ }):
* Production parsers (use a real parser)
* Performance-critical inner loops
* When a simpler approach works
* When your coworkers will revolt

Part 10: THE POWER AND THE RESPONSIBILITY

              REGEX
              /   \
             /     \
          PERL     PERL
          CODE     CODE
             \     /
              \   /
             RESULTS

    "We put code in your regex so you can compute while you match."

        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

The (?{ }) construct blurs the line between matching and computing. The regex engine becomes a state machine that can execute arbitrary code at every transition. That is extraordinarily powerful and exactly as dangerous as it sounds.

Use it for debugging. Use it for prototyping. Use it for the kind of quick text munging where you need just a little more logic than a regex normally provides. But think twice before putting it in production code that other people have to maintain.

The best use of (?{ }) is the one where you use it to understand your regex, fix the problem, and then remove it. The second best use is a tokenizer that would take 50 lines of procedural code but fits in 15 lines of regex with embedded code.

Either way, you are using Perl the way Perl wants to be used: as a language that trusts you with power and assumes you know what you are doing.
