<!-- category: regex -->
Embedded Code Execution Inside Regex
You are in the middle of a regex match. The engine is chewing through a string character by character. And right there, between the \d+ and the \s*, you drop in a block of arbitrary Perl
code that runs mid-match.

    use feature 'say';
    "abc123def" =~ m~(\d+)(?{ say "Found digits: $1" })~;
    # prints: Found digits: 123

That (?{ }) construct is not a lookahead. It is not a comment.
It is live Perl code executing inside the regex engine while the
match is still in progress. You have access to the match state,
you can modify variables, and you can make decisions based on
what has been matched so far.
This is one of the most powerful and most dangerous features in Perl's regex engine. Let's use it.
Part 1: THE BASIC SYNTAX
The (?{ CODE }) construct embeds executable Perl code inside a
regular expression. The code runs when the regex engine reaches
that point in the pattern:

    use feature 'say';
    "Hello World" =~ m~Hello (?{ say "matched Hello, continuing..." })\w+~;
    # Output: matched Hello, continuing...

The code block fires after Hello has matched and before \w+
starts matching. It is a checkpoint. The engine got this far, so
your code runs.
You can put multiple code blocks in a single regex:

    "2026-04-04" =~ m~
        (\d{4}) (?{ say "Year: $1" })
        -
        (\d{2}) (?{ say "Month: $2" })
        -
        (\d{2}) (?{ say "Day: $2" })
    ~x;

Wait, that last $2 is wrong. But that is the point. You are
debugging the regex by watching what the engine captures at each
step. Fix it and move on.
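The fix for the date pattern is a one-character change: the third block should read $3. A minimal sketch of the corrected version follows. The @seen array is a hypothetical helper, added here only so the result can be inspected after the match instead of printed from inside it.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper array: collects what each checkpoint saw.
my @seen;

"2026-04-04" =~ m~
    (\d{4}) (?{ push @seen, "Year: $1"  })
    -
    (\d{2}) (?{ push @seen, "Month: $2" })
    -
    (\d{2}) (?{ push @seen, "Day: $3"   })
~x;

say for @seen;
# Year: 2026
# Month: 04
# Day: 04
```

Every group here has a fixed width, so the engine has nothing to backtrack over and each block runs exactly once.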
Part 2: ACCESSING MATCH STATE
Inside (?{ }), you have access to all the usual match variables:

    my $str = "foo=42, bar=99, baz=7";
    # Loop so every pair is visited: a lone m//g in void context stops after the first match
    1 while $str =~ m~(\w+)=(\d+)(?{ say "Key: $1, Value: $2, Position: " . pos($str); })~g;

The variables available inside the code block:

    VARIABLE    WHAT IT CONTAINS
    --------    -----------------------------------
    $1, $2...   Capture group contents (so far)
    $&          The entire match (so far)
    $`          Everything before the match
    $'          Everything after the match
    pos()       Current position in the string

You can also set your own variables. They are visible outside the regex after the match completes:

    my $count = 0;
    my $text  = "aabbbcccc";
    # Same looping idiom: one pass of m//g per run of repeated characters
    1 while $text =~ m~((.)\2+)(?{ $count++ })~g;
    say "Found $count runs of repeated characters";   # Found 3 runs of repeated characters
Part 3: COUNTING MATCHES
One of the most practical uses: count how many times a subpattern matches within a larger pattern:

    my $vowel_count = 0;
    "Supercalifragilistic" =~ m~
        (?:
            [aeiou] (?{ $vowel_count++ })
          |
            [^aeiou]
        )*
    ~xi;
    say "Vowels: $vowel_count";   # Vowels: 8

The (?{ $vowel_count++ }) fires every time the [aeiou]
alternative matches. The [^aeiou] branch has no code block,
so consonants are silently consumed.
Compare this to splitting the string and counting. The regex approach does it in a single pass, inline, while matching.
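For comparison, here is a sketch of the non-regex-block ways to get the same number. The variable names are illustrative. The split version is the approach mentioned above. The tr version is an extra alternative, included because it is the usual idiom for counting characters.

```perl
use strict;
use warnings;
use feature 'say';

my $word = "Supercalifragilistic";

# Split into single characters, keep the vowels, count what is left.
my $by_split = grep { m~[aeiou]~i } split //, $word;

# tr/// returns how many characters it matched. With an empty
# replacement list and no /d flag, the string itself is left unchanged.
my $by_tr = ($word =~ tr/aeiouAEIOU//);

say "split: $by_split, tr: $by_tr";   # split: 8, tr: 8
```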
Part 4: THE BACKTRACKING TRAP
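Here is where things get tricky. The regex engine backtracks. When it does, your code blocks fire again on each new attempt:

    my $runs = 0;
    # \A pins the match to the start of the string, so the engine does not
    # retry from later offsets. [bB] is used instead of a literal b because
    # the optimizer can reject a string that lacks a required literal
    # before the code block is ever reached.
    "aaa" =~ m~\A(a+)(?{ $runs++; say "Trying: '$1'" })[bB]~;
    say "Code block ran $runs times";

    Trying: 'aaa'
    Trying: 'aa'
    Trying: 'a'
    Code block ran 3 times

The engine first tries matching all three a characters with
a+. Then it hits [bB], which fails. So it backtracks, gives
up one a, and tries again. The code block fires on every
attempt, whether or not that attempt goes on to succeed.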
This means side effects in (?{ }) can fire multiple times.
If you are incrementing a counter, you might get a wrong count.
If you are pushing to an array, you might get duplicate entries.
The fix is to track state carefully or use (*PRUNE) and
(*COMMIT) to control backtracking:

    my @matches;
    "abc123def456" =~ m~
        (?:
            (\d+) (?{ push @matches, $1 }) (*COMMIT)
          |
            .
        )*
    ~x;
    say join(', ', @matches);   # 123, 456

Part 5: DYNAMIC SUB-PATTERNS WITH (??{ })
The double-question-mark form (??{ CODE }) is even wilder.
It runs the code and uses the return value as a regex pattern
to match at that point:

    # match a run of a's followed by exactly as many b's,
    # where the count is the length of the first capture
    "aaaa"   =~ m~^(a+)(??{ "b" x length($1) })$~;          # fails (no b's at all)
    "aaabbb" =~ m~^(a+)(??{ "b" x length($1) })$~;          # matches (3 a's, then 3 b's)
    "aaabbb" =~ m~^(a+)(??{ "b{" . length($1) . "}" })$~;   # matches too, built as a quantifier

The (??{ }) block returns a string that becomes part of the
regex on the fly. You are generating regex from inside regex.
A classic use: matching balanced parentheses:

    my $balanced;
    $balanced = qr~
        \(
        (?:
            [^()]+               # non-parens
          |
            (??{ $balanced })    # recurse
        )*
        \)
    ~x;

    my $str = "f(x, g(y, h(z)), w)";
    if ($str =~ m~($balanced)~) {
        say "Matched: $1";   # (x, g(y, h(z)), w)
    }

The (??{ $balanced }) construct recursively applies the pattern
to match nested parentheses of any depth. The regex references
itself through the variable.
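Perl 5.10 and later also offer recursion into a numbered group, (?1), which covers the balanced-parentheses case without a code block. The sketch below uses that swapped-in technique, not the (??{ }) approach described above. The names $balanced_re and $expr are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $balanced_re = qr~
    (                   # group 1: one balanced unit
        \(
        (?:
            [^()]++     # run of non-parens, possessive to limit backtracking
          |
            (?1)        # recurse into group 1
        )*
        \)
    )
~x;

my $expr = "f(x, g(y, h(z)), w)";
my ($match) = $expr =~ $balanced_re;
say "Matched: $match";   # Matched: (x, g(y, h(z)), w)
```

One caveat: (?1) counts groups in the final compiled pattern, so interpolating this qr into a larger regex with earlier capture groups would change what it refers to.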
Part 6: BUILDING PARSE TREES
Use code blocks to build data structures while matching:

    my @tokens;
    my $str = 'name="perl" version="5.40" type="awesome"';
    $str =~ m~
        (?:
            (\w+)        # key
            = "
            ([^"]*) "    # value
            (?{ push @tokens, { key => $1, value => $2 }; })
            \s*
        )+
    ~x;
    for my $t (@tokens) {
        say "$t->{key} => $t->{value}";
    }

    name => perl
    version => 5.40
    type => awesome

The regex matches the key-value pairs and the code block builds an array of hashes as it goes. By the time the match finishes, your data structure is ready.
Is this a good idea for production parsing? Usually not. A proper parser is more maintainable. But for quick-and-dirty data extraction from well-structured text, this is devastatingly effective.
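For comparison, a sketch of the same extraction without embedded code: a plain while loop over a /g match, with the push moved into the loop body. Variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $attrs = 'name="perl" version="5.40" type="awesome"';
my @pairs;

# Each pass of the loop finds the next key="value" pair.
while ($attrs =~ m~(\w+)="([^"]*)"~g) {
    push @pairs, { key => $1, value => $2 };
}

say "$_->{key} => $_->{value}" for @pairs;
# name => perl
# version => 5.40
# type => awesome
```

The trade-off is that this loop skips over anything that does not look like a pair, while the single-regex version only pushes entries for pairs that sit back to back.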
Part 7: A PRACTICAL TOKENIZER
Here is a real tokenizer for a simple expression language, built entirely with embedded code:

    my @tokens;
    my $input = "x = 42 + y * (3 - z)";
    $input =~ m~
        \A
        (?:
            \s+                  # skip whitespace
          | ( [a-zA-Z_]\w* )     (?{ push @tokens, [IDENT => $1] })
          | ( \d+ )              (?{ push @tokens, [NUM   => $2] })
          | ( [+\-*/=] )         (?{ push @tokens, [OP    => $3] })
          | ( [()] )             (?{ push @tokens, [PAREN => $4] })
          | (.)                  (?{ die "Unexpected char: '$5' at pos " . pos() })
        )*
        \z
    ~x;
    for my $tok (@tokens) {
        printf "%-8s %s\n", $tok->[0], $tok->[1];
    }

    IDENT    x
    OP       =
    NUM      42
    OP       +
    IDENT    y
    OP       *
    PAREN    (
    NUM      3
    OP       -
    IDENT    z
    PAREN    )

Each alternative in the regex matches one token type. The embedded code block classifies and stores it. The final (.)
catch-all produces a useful error for unexpected characters.
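To see what the embedded-code version is saving you, here is a sketch of the same tokenizer written procedurally with \G and the /gc flags. This is the conventional alternative, not the technique this article is about. Token names match the ones above; everything else is illustrative.

```perl
use strict;
use warnings;

my $input = "x = 42 + y * (3 - z)";
my @tokens;

pos($input) = 0;   # start scanning at the beginning
while (pos($input) < length $input) {
    # \G anchors each attempt at the current position;
    # /c keeps that position when an attempt fails.
    if    ($input =~ m~\G\s+~gc)             { next }
    elsif ($input =~ m~\G([a-zA-Z_]\w*)~gc)  { push @tokens, [IDENT => $1] }
    elsif ($input =~ m~\G(\d+)~gc)           { push @tokens, [NUM   => $1] }
    elsif ($input =~ m~\G([+\-*/=])~gc)      { push @tokens, [OP    => $1] }
    elsif ($input =~ m~\G([()])~gc)          { push @tokens, [PAREN => $1] }
    else  { die "Unexpected char at pos " . pos($input) }
}

printf "%-8s %s\n", @$_ for @tokens;
```

Each branch has its own $1 here, so there is no group numbering to keep in sync, at the cost of spelling out the loop by hand.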
Part 8: LOCAL VARIABLES AND SCOPE
Variables modified inside (?{ }) follow the same scoping rules
as regular Perl code. But be careful with local:

    our $depth = 0;   # local needs a package variable; it cannot localize a my variable
    "((()))" =~ m~
        (?:
            \( (?{ local $depth = $depth + 1; say " " x $depth . "down to $depth" })
          |
            \) (?{ say " " x $depth . "up from $depth"; local $depth = $depth - 1 })
          |
            [^()]
        )*
    ~x;

Using local inside (?{ }) means the value change is undone
if the regex engine backtracks past that point. This is exactly
what you want when tracking state that should mirror the match
progress.
Without local, changes to variables persist even through
backtracking, which can give you wrong results when the engine
tries alternative paths.
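A small sketch of the difference, modeled on the counting idiom in the perlre documentation. The pattern greedily eats eight a characters, then has to give four back so the trailing aaaa can match. The names $cnt, $plain and $res are illustrative. $cnt is a package variable because local cannot be applied to a lexical.

```perl
use strict;
use warnings;
use feature 'say';

our $cnt  = 0;   # localized inside the pattern, so backtracking rewinds it
my $plain = 0;   # incremented without local, so backtracking leaves it alone
my $res;

my $text = 'a' x 8;
$text =~ m~
    (?{ $cnt = 0 })                                   # reset for this attempt
    (?: a (?{ local $cnt = $cnt + 1; $plain++ }) )*   # count every a consumed
    aaaa                                              # forces four a's to be given back
    (?{ $res = $cnt })                                # copy out the surviving count
~x;

say "with local: $res, without: $plain";   # with local: 4, without: 8
```

Copying the localized value into $res in a final block matters: the localization is unwound when the match ends, so $cnt itself does not keep the result.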
Part 9: PERFORMANCE AND WARNINGS
Perl is cautious about (?{ }). You may see this warning:

    Use of uninitialized value in regexp compilation

And in older Perls (before 5.18), you had to add use re 'eval'
to enable code blocks in runtime-compiled patterns:

    use re 'eval';   # needed whenever the pattern text is interpolated from a variable
    my $pat = '(\d+)(?{ say $1 })';
    "abc123" =~ m~$pat~;

In Perl 5.18 and later, code blocks in literal regexes work without the pragma. But code blocks in patterns assembled from strings still need it. Without it, Perl refuses the pattern with an "Eval-group not allowed at runtime, use re 'eval'" error.
Performance-wise, code blocks add overhead. The regex engine has to call back into the Perl interpreter at every code point. For simple patterns, this is negligible. For patterns in tight loops over millions of strings, measure before committing.
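One way to do that measuring is the core Benchmark module. The sketch below compares the embedded-code vowel counter against tr on the same input. The sub names and the repeated test word are illustrative, and it assumes Perl 5.18 or later so the lexical $n inside the code block behaves as a normal closure. No timings are shown because they depend on your machine and Perl build.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $word = "Supercalifragilistic" x 10;

# Counts vowels with an embedded code block.
sub count_with_block {
    my $n = 0;
    $word =~ m~(?: [aeiou] (?{ $n++ }) | [^aeiou] )*~xi;
    return $n;
}

# Counts vowels with tr, which never leaves the regex-free fast path.
sub count_with_tr {
    return ($word =~ tr/aeiouAEIOU//);
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'code_block' => \&count_with_block,
    'tr'         => \&count_with_tr,
});
```

Check that both subs return the same count before trusting the timings. A benchmark of two functions that disagree tells you nothing.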

    WHEN TO USE (?{ }):
        * Debugging complex regexes
        * Building data structures during matching
        * Quick-and-dirty tokenizers
        * Counting subpattern hits
        * When the alternative is worse

    WHEN NOT TO USE (?{ }):
        * Production parsers (use a real parser)
        * Performance-critical inner loops
        * When a simpler approach works
        * When your coworkers will revolt

Part 10: THE POWER AND THE RESPONSIBILITY

            REGEX
            /   \
           /     \
        PERL     PERL
        CODE     CODE
           \     /
            \   /
           RESULTS

    "We put code in your regex so you can compute while you match."

        .--.
       |o_o |
       |:_/ |
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/

The (?{ }) construct blurs the line between matching and
computing. The regex engine becomes a state machine that can
execute arbitrary code at every transition. That is extraordinarily
powerful and exactly as dangerous as it sounds.
Use it for debugging. Use it for prototyping. Use it for the kind of quick text munging where you need just a little more logic than a regex normally provides. But think twice before putting it in production code that other people have to maintain.
The best use of (?{ }) is the one where you use it to
understand your regex, fix the problem, and then remove it. The
second best use is a tokenizer that would take 50 lines of
procedural code but fits in 15 lines of regex with embedded code.
Either way, you are using Perl the way Perl wants to be used: as a language that trusts you with power and assumes you know what you are doing.
perl.gg