<!-- category: regex -->
The \G Anchor for Stateful Tokenizing
You need to break a string into tokens. Not split on a single delimiter. Real tokenizing. Numbers, operators, identifiers, whitespace, each one extracted in sequence, exactly where the last one left off.

Most people reach for a parsing library. Or they write a loop with substr and position tracking. Ugly, fragile, slow.
Perl has a regex anchor for this. It is called \G.
One regex. One loop. Full tokenizer.

```perl
my $expr = "count + 42 * total";
my @tokens;
while ($expr =~ m~\G\s*(\d+|\w+|[+*/-])~gc) {
    push @tokens, $1;
}
say join ", ", @tokens;   # count, +, 42, *, total
```

The \G anchor means "match right where the previous match ended." It turns a global regex into a stateful scanner that chews through the string from left to right, one token at a time.
Part 1: WHAT \G DOES
Every string in Perl has an internal position pointer, exposed through the pos() function. When you do a m//g match, Perl updates this pointer to the end of the match. The next m//g starts searching from that position.

\G anchors a match to exactly that position. Not "somewhere after it." Exactly at pos().
```perl
my $str = "aaa bbb ccc";

# first m//g match finds "aaa" at position 0
$str =~ m~\G(\w+)~g;
say $1;          # aaa
say pos($str);   # 3

# without \G, the next match could skip ahead
# WITH \G, it must match right at position 3
$str =~ m~\G(\w+)~g;   # fails! position 3 is a space, not \w
```

That failure is the point. \G does not let the regex engine go hunting forward for a match. It says "match HERE or not at all." This is what makes it useful for tokenizing. You never skip over characters you have not accounted for.
Part 2: THE GC FLAG
There is a subtlety with m//g that trips people up. Normally, when a m//g match fails, Perl resets pos() to undef. Your position is lost. The next match starts over from the beginning.

The c flag changes this. With m//gc, a failed match leaves pos() where it was. The position survives failure.

```perl
my $str = "abc 123 def";
pos($str) = 0;

# try to match digits at position 0
if ($str =~ m~\G(\d+)~gc) {
    say "digits: $1";
}
else {
    say "no digits at pos " . pos($str);   # pos is still 0!
}

# now try letters
if ($str =~ m~\G(\w+)~gc) {
    say "word: $1";           # word: abc
    say "pos: " . pos($str);  # pos: 3
}
```

Without the c flag, that first failed match would have reset pos, and you would never know you tried. With c, your position is sticky. Failed attempts do not reset the scanner.

This is essential for tokenizers. You try one pattern, and if it fails, you try another, all at the same position.
Part 3: BUILDING A SIMPLE TOKENIZER
Here is the pattern. A while loop with multiple \G matches inside, each trying a different token type:

```perl
use strict;
use warnings;
use feature 'say';

my $input = ' hello + 42 * world ';
my @tokens;

while ((pos($input) // 0) < length($input)) {
    if ($input =~ m~\G\s+~gc) {
        # skip whitespace
    }
    elsif ($input =~ m~\G(\d+)~gc) {
        push @tokens, { type => 'NUM', value => $1 };
    }
    elsif ($input =~ m~\G(\w+)~gc) {
        push @tokens, { type => 'IDENT', value => $1 };
    }
    elsif ($input =~ m~\G([+\-*/])~gc) {
        push @tokens, { type => 'OP', value => $1 };
    }
    else {
        die "Unexpected character at position " . pos($input)
          . ": '" . substr($input, pos($input), 1) . "'";
    }
}

for my $tok (@tokens) {
    say "$tok->{type}: $tok->{value}";
}
```

Output:

```
IDENT: hello
OP: +
NUM: 42
OP: *
IDENT: world
```

Each branch tries to match a different token type at the current position. If nothing matches, you have an unexpected character. The \G anchor prevents the regex from skipping ahead and hiding errors.
Part 4: ORDER MATTERS
In the tokenizer above, the \d+ branch comes before \w+. This matters. If \w+ came first, it would swallow 42 as an identifier, because \w matches digits.

Put the more specific patterns first. Digits before words. Keywords before general identifiers. Longest-match alternatives before shorter ones.

```
PATTERN ORDER     INPUT "42abc"    RESULT
--------------    -------------    ------
\d+ before \w+    42    = NUM      correct
                  abc   = IDENT
\w+ before \d+    42abc = IDENT    wrong (greedy \w ate everything)
```

This is the same principle as any lexer. The difference is you are building it with one regex feature instead of a table-driven state machine.
Part 5: LEXING A REAL EXPRESSION
Let's build a tokenizer for a simple math expression language with parentheses, floating-point numbers, and named functions:

```perl
use strict;
use warnings;
use feature 'say';

my $expr = 'sqrt(3.14) + max(x, y) * 2.0';
my @tokens;

while ((pos($expr) // 0) < length($expr)) {
    if ($expr =~ m~\G\s+~gc) {
        next;   # skip whitespace
    }
    elsif ($expr =~ m~\G(\d+\.\d+|\d+)~gc) {
        push @tokens, ['NUM', $1];
    }
    elsif ($expr =~ m~\G(sqrt|max|min|abs)\b~gc) {
        push @tokens, ['FUNC', $1];
    }
    elsif ($expr =~ m~\G([a-zA-Z_]\w*)~gc) {
        push @tokens, ['VAR', $1];
    }
    elsif ($expr =~ m~\G([+\-*/])~gc) {
        push @tokens, ['OP', $1];
    }
    elsif ($expr =~ m~\G(\()~gc) {
        push @tokens, ['LPAREN', $1];
    }
    elsif ($expr =~ m~\G(\))~gc) {
        push @tokens, ['RPAREN', $1];
    }
    elsif ($expr =~ m~\G(,)~gc) {
        push @tokens, ['COMMA', $1];
    }
    else {
        my $pos = pos($expr) // 0;
        die "Lexer error at position $pos: '" . substr($expr, $pos, 10) . "'";
    }
}

for my $t (@tokens) {
    printf "%-8s %s\n", $t->[0], $t->[1];
}
```

Output:

```
FUNC     sqrt
LPAREN   (
NUM      3.14
RPAREN   )
OP       +
FUNC     max
LPAREN   (
VAR      x
COMMA    ,
VAR      y
RPAREN   )
OP       *
NUM      2.0
```

Notice that sqrt and max are recognized as FUNC tokens because they match before the general [a-zA-Z_]\w* pattern. The keyword list acts like reserved words in a real language. The \b after the alternation keeps an identifier like sqrtx from lexing as FUNC sqrt followed by VAR x.
Part 6: COMPARISON TO SPLIT
People sometimes try to tokenize with split. It works for simple cases:

```perl
my @words = split /\s+/, "hello world foo";
```

But split is the wrong tool for anything more complex. Split destroys structure. It gives you pieces without telling you what kind of pieces they are.

```
FEATURE               split          \G tokenizer
-----------------     -------        ------------
Fixed delimiter       yes            not needed
Multiple token types  no             yes
Position tracking     no             yes (pos)
Error on bad input    no (silent)    yes (else branch)
Preserves structure   no             yes
Context-sensitive     no             yes
```

A \G tokenizer gives you labeled tokens with type information. That is the difference between chopping a string and parsing it.
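To see both on one string, here is a short sketch; the input "hello + 42*world" is invented for the comparison:

```perl
use strict;
use warnings;
use feature 'say';

my $input = "hello + 42*world";

# split on whitespace: "42*world" comes back as one unlabeled chunk
my @pieces = split /\s+/, $input;
say scalar @pieces;   # 3 -- the * and its operands are fused, silently

# a \G scanner over the same input separates and labels every token
my @tokens;
while ((pos($input) // 0) < length($input)) {
    next if $input =~ m~\G\s+~gc;
    if    ($input =~ m~\G(\d+)~gc)    { push @tokens, "NUM:$1" }
    elsif ($input =~ m~\G(\w+)~gc)    { push @tokens, "IDENT:$1" }
    elsif ($input =~ m~\G([+*/-])~gc) { push @tokens, "OP:$1" }
    else  { die "bad char at " . pos($input) }
}
say join " ", @tokens;   # IDENT:hello OP:+ NUM:42 OP:* IDENT:world
```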
Part 7: LOG FORMAT PARSING
Here is a practical example. Parse an Apache combined log line into structured fields:

```perl
use strict;
use warnings;
use feature 'say';

my $line = '192.168.1.1 - frank [10/Oct/2025:13:55:36 -0700]'
         . ' "GET /api/users HTTP/1.1" 200 2326'
         . ' "https://example.com/" "Mozilla/5.0"';

my %entry;
if ($line =~ m~\G(\S+)~gc)            { $entry{ip}       = $1 }
if ($line =~ m~\G\s+(\S+)~gc)         { $entry{ident}    = $1 }
if ($line =~ m~\G\s+(\S+)~gc)         { $entry{user}     = $1 }
if ($line =~ m~\G\s+\[([^\]]+)\]~gc)  { $entry{time}     = $1 }
if ($line =~ m~\G\s+"([^"]*)"~gc)     { $entry{request}  = $1 }
if ($line =~ m~\G\s+(\d+)~gc)         { $entry{status}   = $1 }
if ($line =~ m~\G\s+(\d+)~gc)         { $entry{size}     = $1 }
if ($line =~ m~\G\s+"([^"]*)"~gc)     { $entry{referrer} = $1 }
if ($line =~ m~\G\s+"([^"]*)"~gc)     { $entry{agent}    = $1 }

for my $k (sort keys %entry) {
    printf "%-10s %s\n", $k, $entry{$k};
}
```

Output:

```
agent      Mozilla/5.0
ident      -
ip         192.168.1.1
referrer   https://example.com/
request    GET /api/users HTTP/1.1
size       2326
status     200
time       10/Oct/2025:13:55:36 -0700
user       frank
```

Each field is extracted in sequence. If a field is malformed, the chain of matches stops right there. You know exactly which field failed and where.

Compare this to one massive regex with nine capture groups. Good luck debugging that when the log format changes.
Part 8: THE POS() FUNCTION
pos() is the companion function to \G. It reads and writes the position pointer:

```perl
my $str = "hello world";

# read the current position
say pos($str) // "undef";   # undef (no match yet)

# set position manually
pos($str) = 6;

# now \G anchors to position 6
if ($str =~ m~\G(\w+)~gc) {
    say $1;          # world
    say pos($str);   # 11
}

# reset position
pos($str) = undef;   # or pos($str) = 0
```

You can use pos() to implement backtracking in your tokenizer. Save the position before trying something speculative, and restore it if the attempt fails:

```perl
my $saved = pos($input);

# try an ambitious match
unless ($input =~ m~\G(complex_pattern)~gc) {
    pos($input) = $saved;   # rewind
    # try something else
}
```

This is manual backtracking. Most tokenizers do not need it, but it is there when you do.
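A concrete version of that rewind, with a made-up two-step rule ("max or min followed by ( is a function call"). The rewind matters here because the first match can succeed and advance pos before the second one fails:

```perl
use strict;
use warnings;
use feature 'say';

my $input = "max + 1";
pos($input) = 0;

# speculative: a known name followed by "(" means a call
my $saved = pos($input) // 0;
if ($input =~ m~\G(max|min)~gc && $input =~ m~\G\(~gc) {
    say "CALL: $1";
}
else {
    # the first match advanced pos past "max"; rewind
    pos($input) = $saved;
    $input =~ m~\G(\w+)~gc;
    say "VAR: $1";   # VAR: max
}
```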
Part 9: RESETTING POSITION
A few ways to reset the position pointer:

```perl
# set to undef (no position)
pos($str) = undef;

# set to 0 (beginning)
pos($str) = 0;

# a failed m//g (without c) resets it
$str =~ m~nomatch~g;   # pos is now undef

# a successful non-global match does NOT affect pos
$str =~ m~something~;  # pos unchanged
```

The distinction between undef and 0 matters. When pos is undef, the string has no position state. The first m//g starts from the beginning. When pos is 0, the string has an explicit position at the start. Same practical effect for most uses, but undef is the "clean" state.

```
STATE OF pos()        WHAT HAPPENS ON NEXT m//g
-------------------   --------------------------
undef                 start from beginning
0                     start from position 0
5                     start from position 5
past end of string    match fails immediately
```
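The rows of that table can be checked directly; a sketch using the made-up string "abc abc":

```perl
use strict;
use warnings;
use feature 'say';

my $str = "abc abc";

# undef pos: first m//g starts from the beginning
pos($str) = undef;
$str =~ m~abc~g;
say pos($str);   # 3

# explicit position: m//g resumes from there
pos($str) = 4;
$str =~ m~abc~g;
say pos($str);   # 7

# at the end of the string: the next m//g fails immediately
say $str =~ m~abc~g ? "matched" : "failed";   # failed
```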
Part 10: PUTTING IT ALL TOGETHER
The \G anchor is not a hack or an obscure trick. It is how Perl expects you to build scanners. The regex engine already tracks position state. \G just gives you access to it.

The recipe is always the same:

- Write a while loop
- Inside, use if/elsif with m~\G...~gc patterns
- Each branch handles one token type
- The else branch catches errors
- The loop ends when pos reaches the end of the string
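That recipe can be folded into a reusable helper. This is only a sketch; the lex sub and its name => pattern spec format are illustrative, not a standard API:

```perl
use strict;
use warnings;
use feature 'say';

# hypothetical helper: takes name => qr/.../ pairs tried in order,
# returns [name, text] pairs; whitespace is skipped
sub lex {
    my ($input, @spec) = @_;
    my @tokens;
    POSITION: while ((pos($input) // 0) < length($input)) {
        next POSITION if $input =~ m~\G\s+~gc;
        for (my $i = 0; $i < @spec; $i += 2) {
            my ($name, $re) = @spec[$i, $i + 1];
            if ($input =~ m~\G($re)~gc) {
                push @tokens, [$name, $1];
                next POSITION;
            }
        }
        die "Unexpected character at position " . pos($input);
    }
    return @tokens;
}

my @tokens = lex("x + 42",
    NUM   => qr/\d+/,
    IDENT => qr/\w+/,
    OP    => qr/[+*\/-]/,
);
say "@$_" for @tokens;   # IDENT x / OP + / NUM 42
```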
```
          INPUT STRING
┌──────────────────────────────┐
│  sqrt(3.14) + max(x, y) * 2  │
└──────────────────────────────┘
   ^
   |   pos() moves right ──────────>
   |
┌──┴──────┐
│ \G(...) │   "What's at this position?"
└──┬──────┘
   │
┌──┼──────┬──────────┬──────────┬────────┐
│  │      │          │          │        │
  NUM    FUNC       VAR        OP      ERROR
```

You could install a parsing framework. You could use Parse::Lex or Marpa or a PEG grammar. Sometimes you should. But for 90% of tokenizing tasks, a \G loop in 30 lines of vanilla Perl will do the job faster than any framework can be installed and configured.

```
 .--.
|o_o |   "Match where you left off.
|:_/ |    Not where you feel like."
//   \ \
(|     | )
/'\_   _/`\
\___)=(___/
```
The regex engine is already a state machine. \G just lets you
drive it one token at a time. Every match advances the cursor.
Every failure tells you something is wrong. The string is consumed
left to right, one bite at a time, with complete control over
what each bite looks like.
That is not regex abuse. That is regex done right.
perl.gg