perl.gg / regex

<!-- category: regex -->

The \G Anchor for Stateful Tokenizing

2026-04-08

You need to break a string into tokens. Not split on a single delimiter. Real tokenizing. Numbers, operators, identifiers, whitespace, each one extracted in sequence, exactly where the last one left off.

Most people reach for a parsing library. Or they write a loop with substr and position tracking. Ugly, fragile, slow.

Perl has a regex anchor for this. It is called \G.

    use feature 'say';

    my $expr = "count + 42 * total";
    my @tokens;
    while ($expr =~ m~\G\s*(\w+|\d+|[+*/-])~gc) {
        push @tokens, $1;
    }
    say join ", ", @tokens;   # count, +, 42, *, total
One regex. One loop. Full tokenizer. The \G anchor means "match right where the previous match ended." It turns a global regex into a stateful scanner that chews through the string from left to right, one token at a time.

Part 1: WHAT \G DOES

Every string in Perl has an internal position pointer, readable through pos(). When you run an m//g match, Perl updates this pointer to the end of the match. The next m//g starts searching from that position.

\G anchors a match to exactly that position. Not "somewhere after it." Exactly at pos().

    my $str = "aaa bbb ccc";

    # first m//g match finds "aaa" at position 0
    $str =~ m~\G(\w+)~g;
    say $1;          # aaa
    say pos($str);   # 3

    # without \G, the next match could skip ahead
    # WITH \G, it must match right at position 3
    $str =~ m~\G(\w+)~g;   # fails! position 3 is a space, not \w
That failure is the point. \G does not let the regex engine go hunting forward for a match. It says "match HERE or not at all." This is what makes it useful for tokenizing. You never skip over characters you have not accounted for.
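To see that concretely, here is a sketch running the same pattern from the same position, once without the anchor and once with it:

```perl
use strict;
use warnings;

my $str = "aaa bbb ccc";

# without \G the engine hunts forward from pos() and silently skips the space
pos($str) = 3;
print "no anchor: $1\n" if $str =~ m~(\w+)~g;    # no anchor: bbb

# with \G the match must start exactly at pos(), so it fails on the space
pos($str) = 3;
print "anchored match failed\n" unless $str =~ m~\G(\w+)~gc;
```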

Part 2: THE GC FLAG

There is a subtlety with m//g that trips people up. Normally, when an m//g match fails, Perl resets pos() to undef. Your position is lost. The next match starts over from the beginning.

The c flag changes this. With m//gc, a failed match leaves pos() where it was. The position survives failure.

    my $str = "abc 123 def";
    pos($str) = 0;

    # try to match digits at position 0
    if ($str =~ m~\G(\d+)~gc) {
        say "digits: $1";
    } else {
        say "no digits at pos " . pos($str);   # pos is still 0!
    }

    # now try letters
    if ($str =~ m~\G(\w+)~gc) {
        say "word: $1";            # word: abc
        say "pos: " . pos($str);   # pos: 3
    }
Without the c flag, that first failed match would have reset pos to undef, and you would never know you tried. With c, your position is sticky. Failed attempts do not reset the scanner.

This is essential for tokenizers. You try one pattern, and if it fails, you try another, all at the same position.
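The two flags can be put side by side; a minimal sketch of the difference on failure:

```perl
use strict;
use warnings;

my $str = "abc";

$str =~ m~\G(\w+)~g;      # pos is now 3
$str =~ m~\Gxyz~g;        # fails; without /c, pos resets to undef
print defined pos($str) ? "kept\n" : "lost\n";   # lost

$str =~ m~\G(\w+)~g;      # pos was undef, so this matches "abc" from the start
$str =~ m~\Gxyz~gc;       # fails; /c leaves pos at 3
print defined pos($str) ? "kept\n" : "lost\n";   # kept
```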

Part 3: BUILDING A SIMPLE TOKENIZER

Here is the pattern. A while loop with multiple \G matches inside, each trying a different token type:
    use strict;
    use warnings;
    use feature 'say';

    my $input = ' hello + 42 * world ';
    my @tokens;

    # pos() is undef before the first match, hence the // 0
    while ((pos($input) // 0) < length($input)) {
        if ($input =~ m~\G\s+~gc) {
            # skip whitespace
        }
        elsif ($input =~ m~\G(\d+)~gc) {
            push @tokens, { type => 'NUM', value => $1 };
        }
        elsif ($input =~ m~\G(\w+)~gc) {
            push @tokens, { type => 'IDENT', value => $1 };
        }
        elsif ($input =~ m~\G([+\-*/])~gc) {
            push @tokens, { type => 'OP', value => $1 };
        }
        else {
            die "Unexpected character at position " . pos($input)
              . ": '" . substr($input, pos($input), 1) . "'";
        }
    }

    for my $tok (@tokens) {
        say "$tok->{type}: $tok->{value}";
    }
Output:
    IDENT: hello
    OP: +
    NUM: 42
    OP: *
    IDENT: world
Each branch tries to match a different token type at the current position. If nothing matches, you have an unexpected character. The \G anchor prevents the regex from skipping ahead and hiding errors.
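Here is what that else branch buys you in practice: the same four branches pointed at input containing a character they have no rule for (a hypothetical stray %):

```perl
use strict;
use warnings;

my $input = 'a % b';
my @tokens;
eval {
    while ((pos($input) // 0) < length($input)) {
        if    ($input =~ m~\G\s+~gc)       { }                    # skip
        elsif ($input =~ m~\G(\d+)~gc)     { push @tokens, "NUM:$1" }
        elsif ($input =~ m~\G(\w+)~gc)     { push @tokens, "IDENT:$1" }
        elsif ($input =~ m~\G([+\-*/])~gc) { push @tokens, "OP:$1" }
        else {
            die "Unexpected character at position " . pos($input) . "\n";
        }
    }
};
print "@tokens\n";   # IDENT:a
print $@;            # Unexpected character at position 2
```

The scanner got exactly as far as the input was valid, then stopped with a precise position instead of silently skipping the junk.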

Part 4: ORDER MATTERS

In the tokenizer above, the \d+ branch comes before \w+. This matters. If \w+ came first, it would swallow 42 as an identifier because \w matches digits.
    PATTERN ORDER      INPUT "42abc"     RESULT
    --------------     -------------     ------
    \d+ before \w+     42    = NUM       correct
                       abc   = IDENT
    \w+ before \d+     42abc = IDENT     wrong (greedy \w ate everything)
Put the more specific patterns first. Digits before words. Keywords before general identifiers. Longest-match alternatives before shorter ones.

This is the same principle as any lexer. The difference is you are building it with one regex feature instead of a table-driven state machine.
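The ordering rule is easy to verify directly; a quick sketch on the string "42abc":

```perl
use strict;
use warnings;

my $s = "42abc";

# \w+ tried first: it swallows everything, digits included
pos($s) = 0;
print "word-first:  IDENT $1\n" if $s =~ m~\G(\w+)~gc;   # IDENT 42abc

# \d+ tried first: the number splits off cleanly
pos($s) = 0;
print "digit-first: NUM $1" if $s =~ m~\G(\d+)~gc;       # NUM 42
print ", IDENT $1\n"        if $s =~ m~\G(\w+)~gc;       # , IDENT abc
```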

Part 5: LEXING A REAL EXPRESSION

Let's build a tokenizer for a simple math expression language with parentheses, floating-point numbers, and named functions:
    use strict;
    use warnings;
    use feature 'say';

    my $expr = 'sqrt(3.14) + max(x, y) * 2.0';
    my @tokens;

    # parenthesize the // before comparing: bare `pos($expr) // 0 < length`
    # parses as `pos($expr) // (0 < length)` and loops forever
    while ((pos($expr) // 0) < length($expr)) {
        if ($expr =~ m~\G\s+~gc) {
            next;   # skip whitespace
        }
        elsif ($expr =~ m~\G(\d+\.\d+|\d+)~gc) {
            push @tokens, ['NUM', $1];
        }
        elsif ($expr =~ m~\G(sqrt|max|min|abs)\b~gc) {
            # \b so an identifier that merely starts with a keyword
            # falls through to the VAR branch
            push @tokens, ['FUNC', $1];
        }
        elsif ($expr =~ m~\G([a-zA-Z_]\w*)~gc) {
            push @tokens, ['VAR', $1];
        }
        elsif ($expr =~ m~\G([+\-*/])~gc) {
            push @tokens, ['OP', $1];
        }
        elsif ($expr =~ m~\G(\()~gc) {
            push @tokens, ['LPAREN', $1];
        }
        elsif ($expr =~ m~\G(\))~gc) {
            push @tokens, ['RPAREN', $1];
        }
        elsif ($expr =~ m~\G(,)~gc) {
            push @tokens, ['COMMA', $1];
        }
        else {
            my $pos = pos($expr) // 0;
            die "Lexer error at position $pos: '" . substr($expr, $pos, 10) . "'";
        }
    }

    for my $t (@tokens) {
        printf "%-8s %s\n", $t->[0], $t->[1];
    }
Output:
    FUNC     sqrt
    LPAREN   (
    NUM      3.14
    RPAREN   )
    OP       +
    FUNC     max
    LPAREN   (
    VAR      x
    COMMA    ,
    VAR      y
    RPAREN   )
    OP       *
    NUM      2.0
Notice that sqrt and max are recognized as FUNC tokens because they match before the general [a-zA-Z_]\w* pattern. The keyword list acts like reserved words in a real language.
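A detail worth a test: a keyword alternation without a trailing \b matches a mere prefix, so an identifier that starts with a keyword can be mis-lexed. A sketch (maxval is a hypothetical variable name):

```perl
use strict;
use warnings;

my $src = 'maxval';

# no \b: the keyword branch grabs "max" and leaves "val" behind
pos($src) = 0;
print "FUNC $1\n" if $src =~ m~\G(sqrt|max|min|abs)~gc;          # FUNC max

# with \b the keyword branch fails and the VAR branch takes the whole name
pos($src) = 0;
if    ($src =~ m~\G(sqrt|max|min|abs)\b~gc) { print "FUNC $1\n" }
elsif ($src =~ m~\G([a-zA-Z_]\w*)~gc)       { print "VAR $1\n" } # VAR maxval
```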

Part 6: COMPARISON TO SPLIT

People sometimes try to tokenize with split. It works for simple cases:
my @words = split /\s+/, "hello world foo";
But split is the wrong tool for anything more complex:
    FEATURE                split         \G tokenizer
    --------------------   -----------   -----------------
    Fixed delimiter        yes           not needed
    Multiple token types   no            yes
    Position tracking      no            yes (pos)
    Error on bad input     no (silent)   yes (else branch)
    Preserves structure    no            yes
    Context-sensitive      no            yes
Split destroys structure. It gives you pieces without telling you what kind of pieces they are. A \G tokenizer gives you labeled tokens with type information. That is the difference between chopping a string and parsing it.
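To be fair to split, capture groups in the pattern do keep the delimiters, but the result is still a flat, untyped list; a sketch:

```perl
use strict;
use warnings;

my $expr = "a+2*b";

# plain split: the operators vanish entirely
my @pieces = split /[+*]/, $expr;
print "@pieces\n";   # a 2 b

# capturing split: delimiters survive, but nothing says which piece is what,
# and malformed input produces no error, just odd pieces
my @all = split /([+*])/, $expr;
print "@all\n";      # a + 2 * b
```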

Part 7: LOG FORMAT PARSING

Here is a practical example. Parse an Apache combined log line into structured fields:
    use strict;
    use warnings;

    my $line = '192.168.1.1 - frank [10/Oct/2025:13:55:36 -0700]'
             . ' "GET /api/users HTTP/1.1" 200 2326'
             . ' "https://example.com/" "Mozilla/5.0"';

    my %entry;
    if ($line =~ m~\G(\S+)~gc)           { $entry{ip}       = $1 }
    if ($line =~ m~\G\s+(\S+)~gc)        { $entry{ident}    = $1 }
    if ($line =~ m~\G\s+(\S+)~gc)        { $entry{user}     = $1 }
    if ($line =~ m~\G\s+\[([^\]]+)\]~gc) { $entry{time}     = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{request}  = $1 }
    if ($line =~ m~\G\s+(\d+)~gc)        { $entry{status}   = $1 }
    if ($line =~ m~\G\s+(\d+)~gc)        { $entry{size}     = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{referrer} = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{agent}    = $1 }

    for my $k (sort keys %entry) {
        printf "%-10s %s\n", $k, $entry{$k};
    }
Output:

    agent      Mozilla/5.0
    ident      -
    ip         192.168.1.1
    referrer   https://example.com/
    request    GET /api/users HTTP/1.1
    size       2326
    status     200
    time       10/Oct/2025:13:55:36 -0700
    user       frank
Each field is extracted in sequence. If a field is malformed, the chain of matches stops right there. You know exactly which field failed and where.

Compare this to one massive regex with nine capture groups. Good luck debugging that when the log format changes.
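A hypothetical truncated line shows the failure mode: the chain of matches stops at the broken field, and pos() tells you exactly where.

```perl
use strict;
use warnings;

# the timestamp bracket never closes
my $bad = '192.168.1.1 - frank [10/Oct/2025:13:55:36';

my %entry;
$bad =~ m~\G(\S+)~gc    and $entry{ip}    = $1;
$bad =~ m~\G\s+(\S+)~gc and $entry{ident} = $1;
$bad =~ m~\G\s+(\S+)~gc and $entry{user}  = $1;
unless ($bad =~ m~\G\s+\[([^\]]+)\]~gc) {
    print "time field malformed at position ", pos($bad), "\n";   # position 19
}
```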

Part 8: THE POS() FUNCTION

pos() is the companion function to \G. It reads and writes the position pointer:
    use feature 'say';

    my $str = "hello world";

    # read the current position
    say pos($str) // "undef";   # undef (no match yet)

    # set position manually
    pos($str) = 6;

    # now \G anchors to position 6
    if ($str =~ m~\G(\w+)~gc) {
        say $1;          # world
        say pos($str);   # 11
    }

    # reset position
    pos($str) = undef;   # or pos($str) = 0
You can use pos() to implement backtracking in your tokenizer. Save the position before trying something speculative, and restore it if the attempt fails:
    my $saved = pos($input);

    # try an ambitious match
    unless ($input =~ m~\G(complex_pattern)~gc) {
        pos($input) = $saved;   # rewind
        # try something else
    }
This is manual backtracking. Note that a single failed m//gc never moves pos in the first place; the rewind earns its keep when a speculative sequence of several matches partially succeeds before a later one fails. Most tokenizers do not need it, but it is there when you do.
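A concrete case where the rewind matters: two chained matches where the first succeeds, moving pos, and the second fails (the function-call grammar here is hypothetical):

```perl
use strict;
use warnings;

my $input = "foo + 1";
my $saved = pos($input) // 0;

# speculate: an identifier followed by '(' would be a function call
if ($input =~ m~\G(\w+)\s*~gc and $input =~ m~\G\(~gc) {
    print "call: $1\n";
} else {
    pos($input) = $saved;   # the first match moved pos past "foo "; rewind it
    print "plain identifier: $1\n" if $input =~ m~\G(\w+)~gc;   # foo
}
```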

Part 9: RESETTING POSITION

A few ways to reset the position pointer:
    # set to undef (no position)
    pos($str) = undef;

    # set to 0 (beginning)
    pos($str) = 0;

    # a failed m//g (without c) resets it
    $str =~ m~nomatch~g;    # pos is now undef

    # a successful non-global match does NOT affect pos
    $str =~ m~something~;   # pos unchanged
The distinction between undef and 0 matters. When pos is undef, the string has no position state. The first m//g starts from the beginning. When pos is 0, the string has an explicit position at the start. Same practical effect for most uses, but undef is the "clean" state.
    STATE OF pos()         WHAT HAPPENS ON NEXT m//g
    -------------------    --------------------------
    undef                  start from beginning
    0                      start from position 0
    5                      start from position 5
    past end of string     match fails immediately
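A short sketch of the undef reset in action:

```perl
use strict;
use warnings;

my $str = "abc abc";
$str =~ m~abc~g;        # pos now 3
$str =~ m~abc~g;        # pos now 7: found the second "abc"
print pos($str), "\n";  # 7

pos($str) = undef;      # wipe the position state
$str =~ m~abc~g;        # scans from the beginning again
print pos($str), "\n";  # 3
```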

Part 10: PUTTING IT ALL TOGETHER

The \G anchor is not a hack or an obscure trick. It is how Perl expects you to build scanners. The regex engine already tracks position state. \G just gives you access to it.

The recipe is always the same:

  1. Write a while loop
  2. Inside, use if/elsif with m~\G...~gc patterns
  3. Each branch handles one token type
  4. The else branch catches errors
  5. The loop ends when pos reaches the end of the string
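The five steps fit in a small reusable sub; a sketch with placeholder token types:

```perl
use strict;
use warnings;

# skeleton of the recipe; the NUM/WORD patterns are placeholders
sub tokenize {
    my ($input) = @_;
    my @tokens;
    while ((pos($input) // 0) < length($input)) {
        if    ($input =~ m~\G\s+~gc)   { }                            # skip
        elsif ($input =~ m~\G(\d+)~gc) { push @tokens, [NUM  => $1] }
        elsif ($input =~ m~\G(\w+)~gc) { push @tokens, [WORD => $1] }
        else  { die "tokenize: stuck at position " . pos($input) . "\n" }
    }
    return @tokens;
}

my @t = tokenize("abc 123");
print join(" ", map { "$_->[0]=$_->[1]" } @t), "\n";   # WORD=abc NUM=123
```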

    INPUT STRING
    ┌──────────────────────────────┐
    │ sqrt(3.14) + max(x, y) * 2   │
    └──────────────────────────────┘
         ^
         |   pos() moves right ──────────>
         |
    ┌────┴────┐
    │ \G(...) │   "What's at this position?"
    └────┬────┘
         │
    ┌────┼──────────┬──────────┬────────┐
    │    │          │          │        │
   NUM  FUNC       VAR        OP      ERROR
        .--.
       |o_o |   "Match where you left off.
       |:_/ |    Not where you feel like."
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/
You could install a parsing framework. You could use Parse::Lex or Marpa or a PEG grammar. Sometimes you should. But for 90% of tokenizing tasks, a \G loop in 30 lines of vanilla Perl will do the job faster than any framework can be installed and configured.

The regex engine is already a state machine. \G just lets you drive it one token at a time. Every match advances the cursor. Every failure tells you something is wrong. The string is consumed left to right, one bite at a time, with complete control over what each bite looks like.

That is not regex abuse. That is regex done right.

perl.gg