perl.gg / regex

<!-- category: regex -->

The \G Anchor for Stateful Tokenizing

2026-04-08

You need to break a string into tokens. Not split on a single delimiter. Real tokenizing. Numbers, operators, identifiers, whitespace, each one extracted in sequence, exactly where the last one left off.

Most people reach for a parsing library. Or they write a loop with substr and position tracking. Ugly, fragile, slow.

Perl has a regex anchor for this. It is called \G.

    use feature 'say';

    my $expr = "count + 42 * total";
    my @tokens;
    while ($expr =~ m~\G\s*(\w+|\d+|[+*/-])~gc) {
        push @tokens, $1;
    }
    say join ", ", @tokens;   # count, +, 42, *, total
One regex. One loop. Full tokenizer. The \G anchor means "match right where the previous match ended." It turns a global regex into a stateful scanner that chews through the string from left to right, one token at a time.

Part 1: WHAT \G DOES

Every string in Perl has an internal position pointer, readable through pos(). When you run an m//g match, Perl updates this pointer to the end of the match. The next m//g starts searching from that position.

\G anchors a match to exactly that position. Not "somewhere after it." Exactly at pos().

    my $str = "aaa bbb ccc";

    # first m//g match finds "aaa" at position 0
    $str =~ m~\G(\w+)~g;
    say $1;          # aaa
    say pos($str);   # 3

    # without \G, the next match could skip ahead
    # WITH \G, it must match right at position 3
    $str =~ m~\G(\w+)~g;   # fails! position 3 is a space, not \w
That failure is the point. \G does not let the regex engine go hunting forward for a match. It says "match HERE or not at all." This is what makes it useful for tokenizing. You never skip over characters you have not accounted for.
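To see that concretely, here is a sketch running the same pattern from the same position, once without the anchor and once with it:

```perl
use strict;
use warnings;

my $str = "aaa bbb ccc";

# without \G the engine hunts forward from pos() and silently skips the space
pos($str) = 3;
print "no anchor: $1\n" if $str =~ m~(\w+)~g;    # no anchor: bbb

# with \G the match must start exactly at pos(), so it fails on the space
pos($str) = 3;
print "anchored match failed\n" unless $str =~ m~\G(\w+)~gc;
```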

Part 2: THE GC FLAG

There is a subtlety with m//g that trips people up. Normally, when an m//g match fails, Perl resets pos() to undef. Your position is lost. The next match starts over from the beginning.

The c flag changes this. With m//gc, a failed match leaves pos() where it was. The position survives failure.

    my $str = "abc 123 def";
    pos($str) = 0;

    # try to match digits at position 0
    if ($str =~ m~\G(\d+)~gc) {
        say "digits: $1";
    } else {
        say "no digits at pos " . pos($str);   # pos is still 0!
    }

    # now try letters
    if ($str =~ m~\G(\w+)~gc) {
        say "word: $1";            # word: abc
        say "pos: " . pos($str);   # pos: 3
    }
Without the c flag, that first failed match would have reset pos to undef, and you would never know you tried. With c, your position is sticky. Failed attempts do not reset the scanner.

This is essential for tokenizers. You try one pattern, and if it fails, you try another, all at the same position.
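The two flags can be put side by side; a minimal sketch of the difference on failure:

```perl
use strict;
use warnings;

my $str = "abc";

$str =~ m~\G(\w+)~g;      # pos is now 3
$str =~ m~\Gxyz~g;        # fails; without /c, pos resets to undef
print defined pos($str) ? "kept\n" : "lost\n";   # lost

$str =~ m~\G(\w+)~g;      # pos was undef, so this matches "abc" from the start
$str =~ m~\Gxyz~gc;       # fails; /c leaves pos at 3
print defined pos($str) ? "kept\n" : "lost\n";   # kept
```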

Part 3: BUILDING A SIMPLE TOKENIZER

Here is the pattern. A while loop with multiple \G matches inside, each trying a different token type:
    use strict;
    use warnings;
    use feature 'say';

    my $input = ' hello + 42 * world ';
    my @tokens;

    # pos() is undef before the first match, hence the // 0
    while ((pos($input) // 0) < length($input)) {
        if ($input =~ m~\G\s+~gc) {
            # skip whitespace
        }
        elsif ($input =~ m~\G(\d+)~gc) {
            push @tokens, { type => 'NUM', value => $1 };
        }
        elsif ($input =~ m~\G(\w+)~gc) {
            push @tokens, { type => 'IDENT', value => $1 };
        }
        elsif ($input =~ m~\G([+\-*/])~gc) {
            push @tokens, { type => 'OP', value => $1 };
        }
        else {
            die "Unexpected character at position " . pos($input)
              . ": '" . substr($input, pos($input), 1) . "'";
        }
    }

    for my $tok (@tokens) {
        say "$tok->{type}: $tok->{value}";
    }
Output:
    IDENT: hello
    OP: +
    NUM: 42
    OP: *
    IDENT: world
Each branch tries to match a different token type at the current position. If nothing matches, you have an unexpected character. The \G anchor prevents the regex from skipping ahead and hiding errors.
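Here is what that else branch buys you in practice: the same four branches pointed at input containing a character they have no rule for (a hypothetical stray %):

```perl
use strict;
use warnings;

my $input = 'a % b';
my @tokens;
eval {
    while ((pos($input) // 0) < length($input)) {
        if    ($input =~ m~\G\s+~gc)       { }                    # skip
        elsif ($input =~ m~\G(\d+)~gc)     { push @tokens, "NUM:$1" }
        elsif ($input =~ m~\G(\w+)~gc)     { push @tokens, "IDENT:$1" }
        elsif ($input =~ m~\G([+\-*/])~gc) { push @tokens, "OP:$1" }
        else {
            die "Unexpected character at position " . pos($input) . "\n";
        }
    }
};
print "@tokens\n";   # IDENT:a
print $@;            # Unexpected character at position 2
```

The scanner got exactly as far as the input was valid, then stopped with a precise position instead of silently skipping the junk.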

Part 4: ORDER MATTERS

In the tokenizer above, the \d+ branch comes before \w+. This matters. If \w+ came first, it would swallow 42 as an identifier because \w matches digits.
    PATTERN ORDER      INPUT "42abc"     RESULT
    --------------     -------------     ------
    \d+ before \w+     42    = NUM       correct
                       abc   = IDENT
    \w+ before \d+     42abc = IDENT     wrong (greedy \w ate everything)
Put the more specific patterns first. Digits before words. Keywords before general identifiers. Longest-match alternatives before shorter ones.

This is the same principle as any lexer. The difference is you are building it with one regex feature instead of a table-driven state machine.
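The ordering rule is easy to verify directly; a quick sketch on the string "42abc":

```perl
use strict;
use warnings;

my $s = "42abc";

# \w+ tried first: it swallows everything, digits included
pos($s) = 0;
print "word-first:  IDENT $1\n" if $s =~ m~\G(\w+)~gc;   # IDENT 42abc

# \d+ tried first: the number splits off cleanly
pos($s) = 0;
print "digit-first: NUM $1" if $s =~ m~\G(\d+)~gc;       # NUM 42
print ", IDENT $1\n"        if $s =~ m~\G(\w+)~gc;       # , IDENT abc
```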

Part 5: LEXING A REAL EXPRESSION

Let's build a tokenizer for a simple math expression language with parentheses, floating-point numbers, and named functions:
    use strict;
    use warnings;
    use feature 'say';

    my $expr = 'sqrt(3.14) + max(x, y) * 2.0';
    my @tokens;

    # parenthesize the // before comparing: bare `pos($expr) // 0 < length`
    # parses as `pos($expr) // (0 < length)` and loops forever
    while ((pos($expr) // 0) < length($expr)) {
        if ($expr =~ m~\G\s+~gc) {
            next;   # skip whitespace
        }
        elsif ($expr =~ m~\G(\d+\.\d+|\d+)~gc) {
            push @tokens, ['NUM', $1];
        }
        elsif ($expr =~ m~\G(sqrt|max|min|abs)\b~gc) {
            # \b so an identifier that merely starts with a keyword
            # falls through to the VAR branch
            push @tokens, ['FUNC', $1];
        }
        elsif ($expr =~ m~\G([a-zA-Z_]\w*)~gc) {
            push @tokens, ['VAR', $1];
        }
        elsif ($expr =~ m~\G([+\-*/])~gc) {
            push @tokens, ['OP', $1];
        }
        elsif ($expr =~ m~\G(\()~gc) {
            push @tokens, ['LPAREN', $1];
        }
        elsif ($expr =~ m~\G(\))~gc) {
            push @tokens, ['RPAREN', $1];
        }
        elsif ($expr =~ m~\G(,)~gc) {
            push @tokens, ['COMMA', $1];
        }
        else {
            my $pos = pos($expr) // 0;
            die "Lexer error at position $pos: '" . substr($expr, $pos, 10) . "'";
        }
    }

    for my $t (@tokens) {
        printf "%-8s %s\n", $t->[0], $t->[1];
    }
Output:
    FUNC     sqrt
    LPAREN   (
    NUM      3.14
    RPAREN   )
    OP       +
    FUNC     max
    LPAREN   (
    VAR      x
    COMMA    ,
    VAR      y
    RPAREN   )
    OP       *
    NUM      2.0
Notice that sqrt and max are recognized as FUNC tokens because they match before the general [a-zA-Z_]\w* pattern. The keyword list acts like reserved words in a real language.
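A detail worth a test: a keyword alternation without a trailing \b matches a mere prefix, so an identifier that starts with a keyword can be mis-lexed. A sketch (maxval is a hypothetical variable name):

```perl
use strict;
use warnings;

my $src = 'maxval';

# no \b: the keyword branch grabs "max" and leaves "val" behind
pos($src) = 0;
print "FUNC $1\n" if $src =~ m~\G(sqrt|max|min|abs)~gc;          # FUNC max

# with \b the keyword branch fails and the VAR branch takes the whole name
pos($src) = 0;
if    ($src =~ m~\G(sqrt|max|min|abs)\b~gc) { print "FUNC $1\n" }
elsif ($src =~ m~\G([a-zA-Z_]\w*)~gc)       { print "VAR $1\n" } # VAR maxval
```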

Part 6: COMPARISON TO SPLIT

People sometimes try to tokenize with split. It works for simple cases:
my @words = split /\s+/, "hello world foo";
But split is the wrong tool for anything more complex:
    FEATURE                split         \G tokenizer
    --------------------   -----------   -----------------
    Fixed delimiter        yes           not needed
    Multiple token types   no            yes
    Position tracking      no            yes (pos)
    Error on bad input     no (silent)   yes (else branch)
    Preserves structure    no            yes
    Context-sensitive      no            yes
Split destroys structure. It gives you pieces without telling you what kind of pieces they are. A \G tokenizer gives you labeled tokens with type information. That is the difference between chopping a string and parsing it.
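To be fair to split, capture groups in the pattern do keep the delimiters, but the result is still a flat, untyped list; a sketch:

```perl
use strict;
use warnings;

my $expr = "a+2*b";

# plain split: the operators vanish entirely
my @pieces = split /[+*]/, $expr;
print "@pieces\n";   # a 2 b

# capturing split: delimiters survive, but nothing says which piece is what,
# and malformed input produces no error, just odd pieces
my @all = split /([+*])/, $expr;
print "@all\n";      # a + 2 * b
```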

Part 7: LOG FORMAT PARSING

Here is a practical example. Parse an Apache combined log line into structured fields:
    use strict;
    use warnings;

    my $line = '192.168.1.1 - frank [10/Oct/2025:13:55:36 -0700]'
             . ' "GET /api/users HTTP/1.1" 200 2326'
             . ' "https://example.com/" "Mozilla/5.0"';

    my %entry;
    if ($line =~ m~\G(\S+)~gc)           { $entry{ip}       = $1 }
    if ($line =~ m~\G\s+(\S+)~gc)        { $entry{ident}    = $1 }
    if ($line =~ m~\G\s+(\S+)~gc)        { $entry{user}     = $1 }
    if ($line =~ m~\G\s+\[([^\]]+)\]~gc) { $entry{time}     = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{request}  = $1 }
    if ($line =~ m~\G\s+(\d+)~gc)        { $entry{status}   = $1 }
    if ($line =~ m~\G\s+(\d+)~gc)        { $entry{size}     = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{referrer} = $1 }
    if ($line =~ m~\G\s+"([^"]*)"~gc)    { $entry{agent}    = $1 }

    for my $k (sort keys %entry) {
        printf "%-10s %s\n", $k, $entry{$k};
    }
Output:

    agent      Mozilla/5.0
    ident      -
    ip         192.168.1.1
    referrer   https://example.com/
    request    GET /api/users HTTP/1.1
    size       2326
    status     200
    time       10/Oct/2025:13:55:36 -0700
    user       frank
Each field is extracted in sequence. If a field is malformed, the chain of matches stops right there. You know exactly which field failed and where.

Compare this to one massive regex with nine capture groups. Good luck debugging that when the log format changes.
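A hypothetical truncated line shows the failure mode: the chain of matches stops at the broken field, and pos() tells you exactly where.

```perl
use strict;
use warnings;

# the timestamp bracket never closes
my $bad = '192.168.1.1 - frank [10/Oct/2025:13:55:36';

my %entry;
$bad =~ m~\G(\S+)~gc    and $entry{ip}    = $1;
$bad =~ m~\G\s+(\S+)~gc and $entry{ident} = $1;
$bad =~ m~\G\s+(\S+)~gc and $entry{user}  = $1;
unless ($bad =~ m~\G\s+\[([^\]]+)\]~gc) {
    print "time field malformed at position ", pos($bad), "\n";   # position 19
}
```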

Part 8: THE POS() FUNCTION

pos() is the companion function to \G. It reads and writes the position pointer:
    use feature 'say';

    my $str = "hello world";

    # read the current position
    say pos($str) // "undef";   # undef (no match yet)

    # set position manually
    pos($str) = 6;

    # now \G anchors to position 6
    if ($str =~ m~\G(\w+)~gc) {
        say $1;          # world
        say pos($str);   # 11
    }

    # reset position
    pos($str) = undef;   # or pos($str) = 0
You can use pos() to implement backtracking in your tokenizer. Save the position before trying something speculative, and restore it if the attempt fails:
    my $saved = pos($input);

    # try an ambitious match
    unless ($input =~ m~\G(complex_pattern)~gc) {
        pos($input) = $saved;   # rewind
        # try something else
    }
This is manual backtracking. Note that a single failed m//gc never moves pos in the first place; the rewind earns its keep when a speculative sequence of several matches partially succeeds before a later one fails. Most tokenizers do not need it, but it is there when you do.
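A concrete case where the rewind matters: two chained matches where the first succeeds, moving pos, and the second fails (the function-call grammar here is hypothetical):

```perl
use strict;
use warnings;

my $input = "foo + 1";
my $saved = pos($input) // 0;

# speculate: an identifier followed by '(' would be a function call
if ($input =~ m~\G(\w+)\s*~gc and $input =~ m~\G\(~gc) {
    print "call: $1\n";
} else {
    pos($input) = $saved;   # the first match moved pos past "foo "; rewind it
    print "plain identifier: $1\n" if $input =~ m~\G(\w+)~gc;   # foo
}
```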

Part 9: RESETTING POSITION

A few ways to reset the position pointer:
    # set to undef (no position)
    pos($str) = undef;

    # set to 0 (beginning)
    pos($str) = 0;

    # a failed m//g (without c) resets it
    $str =~ m~nomatch~g;    # pos is now undef

    # a successful non-global match does NOT affect pos
    $str =~ m~something~;   # pos unchanged
The distinction between undef and 0 matters. When pos is undef, the string has no position state. The first m//g starts from the beginning. When pos is 0, the string has an explicit position at the start. Same practical effect for most uses, but undef is the "clean" state.
    STATE OF pos()         WHAT HAPPENS ON NEXT m//g
    -------------------    --------------------------
    undef                  start from beginning
    0                      start from position 0
    5                      start from position 5
    past end of string     match fails immediately
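A short sketch of the undef reset in action:

```perl
use strict;
use warnings;

my $str = "abc abc";
$str =~ m~abc~g;        # pos now 3
$str =~ m~abc~g;        # pos now 7: found the second "abc"
print pos($str), "\n";  # 7

pos($str) = undef;      # wipe the position state
$str =~ m~abc~g;        # scans from the beginning again
print pos($str), "\n";  # 3
```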

Part 10: PUTTING IT ALL TOGETHER

The \G anchor is not a hack or an obscure trick. It is how Perl expects you to build scanners. The regex engine already tracks position state. \G just gives you access to it.

The recipe is always the same:

  1. Write a while loop
  2. Inside, use if/elsif with m~\G...~gc patterns
  3. Each branch handles one token type
  4. The else branch catches errors
  5. The loop ends when pos reaches the end of the string
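The five steps fit in a small reusable sub; a sketch with placeholder token types:

```perl
use strict;
use warnings;

# skeleton of the recipe; the NUM/WORD patterns are placeholders
sub tokenize {
    my ($input) = @_;
    my @tokens;
    while ((pos($input) // 0) < length($input)) {
        if    ($input =~ m~\G\s+~gc)   { }                            # skip
        elsif ($input =~ m~\G(\d+)~gc) { push @tokens, [NUM  => $1] }
        elsif ($input =~ m~\G(\w+)~gc) { push @tokens, [WORD => $1] }
        else  { die "tokenize: stuck at position " . pos($input) . "\n" }
    }
    return @tokens;
}

my @t = tokenize("abc 123");
print join(" ", map { "$_->[0]=$_->[1]" } @t), "\n";   # WORD=abc NUM=123
```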

    INPUT STRING
    ┌──────────────────────────────┐
    │ sqrt(3.14) + max(x, y) * 2   │
    └──────────────────────────────┘
         ^
         |   pos() moves right ──────────>
         |
    ┌────┴────┐
    │ \G(...) │   "What's at this position?"
    └────┬────┘
         │
    ┌────┼──────────┬──────────┬────────┐
    │    │          │          │        │
   NUM  FUNC       VAR        OP      ERROR
        .--.
       |o_o |   "Match where you left off.
       |:_/ |    Not where you feel like."
      //   \ \
     (|     | )
    /'\_   _/`\
    \___)=(___/
You could install a parsing framework. You could use Parse::Lex or Marpa or a PEG grammar. Sometimes you should. But for 90% of tokenizing tasks, a \G loop in 30 lines of vanilla Perl will do the job faster than any framework can be installed and configured.

The regex engine is already a state machine. \G just lets you drive it one token at a time. Every match advances the cursor. Every failure tells you something is wrong. The string is consumed left to right, one bite at a time, with complete control over what each bite looks like.

That is not regex abuse. That is regex done right.

perl.gg