🐪 Mojo::DOM CSS Selectors for HTML Parsing

2026-03-31

Stop parsing HTML with regex. Seriously. Stop.

use Mojo::DOM;

my $dom = Mojo::DOM->new($html);
$dom->find('a[href]')->each(sub
{
    say $_->{href};
});

That grabs every link from an HTML document using a CSS selector. The same selector syntax you already know from jQuery, from your browser's DevTools, from every CSS file you've ever written.

Mojo::DOM ships with Mojolicious. No extra dependencies. No XPath nightmares. No building parse trees by hand. Just CSS selectors and method chains.

Part 1: CREATING A DOM OBJECT

Feed it HTML. Get back an object you can query:

use Mojo::DOM;

my $dom = Mojo::DOM->new('<div><p>Hello</p><p>World</p></div>');

say $dom->at('p')->text;         # Hello
say $dom->find('p')->size;       # 2

From a file:

use Mojo::File qw(path);
use Mojo::DOM;

my $html = path('page.html')->slurp;
my $dom = Mojo::DOM->new($html);

From a web request (if you're using the full Mojolicious stack):

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('https://example.com')->result->dom;

That last one fetches a page and gives you a DOM object in one chain. Beautiful.

Part 2: CSS SELECTOR BASICS

If you've written CSS, you already know this:

SELECTOR              MATCHES
--------              -------
div                   All <div> elements
.classname            Elements with class="classname"
#myid                 Element with id="myid"
a[href]               <a> tags that have an href attribute
input[type="text"]    <input> with type="text"
div > p               <p> directly inside <div>
div p                 <p> anywhere inside <div>
ul > li:first-child   First <li> in a <ul>
h2 + p                <p> immediately after <h2>

Mojo::DOM supports nearly all of the CSS3 selector spec: everything that makes sense for a parser running outside a browser. If it works in your browser's document.querySelectorAll(), it probably works here.

my $dom = Mojo::DOM->new($html);

# by tag
$dom->find('p');

# by class
$dom->find('.warning');

# by id
$dom->at('#main-content');

# by attribute
$dom->find('img[alt]');

# by attribute value
$dom->find('a[href="https://perl.org"]');

# descendant
$dom->find('div.content p');

# direct child
$dom->find('ul > li');
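To see the selector table in action, here is a small sketch that runs each selector against a sample page and counts the matches. The markup is invented for the example; swap in your own and compare the numbers with what DevTools reports.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup invented for this sketch
my $html = <<'HTML';
<div id="main-content" class="content">
  <h2>Links</h2>
  <p class="warning">Read this first.</p>
  <p>See <a href="https://perl.org">Perl</a> and <a name="top">top</a>.</p>
  <ul><li>one</li><li>two</li></ul>
  <img src="camel.png" alt="A camel">
  <img src="spacer.gif">
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Count how many elements each selector matches
my %count = map { $_ => $dom->find($_)->size } (
    'p', '.warning', '#main-content', 'a[href]', 'img[alt]',
    'div.content p', 'ul > li', 'ul > li:first-child', 'h2 + p',
);

printf "%-22s %d\n", $_, $count{$_} for sort keys %count;
```

Note that a[href] matches only one of the two anchors, and img[alt] only one of the two images. Attribute selectors test for presence, not for the tag alone.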

Part 3: FIND VS AT

Two main query methods. Learn the difference.

find() returns ALL matches as a Mojo::Collection:

my $links = $dom->find('a');
say $links->size;    # number of matches

at() returns the FIRST match as a single element (or undef):

my $title = $dom->at('title');
say $title->text;    # page title

Use find when you expect multiple results. Use at when you want one specific element.

# all paragraphs
$dom->;find('p')->;each(sub { say $_->;text });

# the one h1 on the page
my $heading = $dom->;at('h1');
say $heading->;text if $heading;

Always check for undef when using at. If no element matches, you get nothing back. Calling ->text on undef will blow up.
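Two ways to guard against a missing element, sketched on a made-up page that has a title but no h1. The first tests the result of at() before using it. The second leans on find(), which always hands back a collection, even an empty one.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample page invented for this sketch: a <title> but no <h1>
my $dom = Mojo::DOM->new(
    '<html><head><title>Demo</title></head><body><p>No heading here</p></body></html>'
);

# at() returns undef when nothing matches, so test before calling methods
my $h1      = $dom->at('h1');
my $heading = $h1 ? $h1->text : '(no h1)';

# find() never returns undef; first() on an empty collection gives undef
my $first_h1 = $dom->find('h1')->map('text')->first // '(no h1)';

my $title = $dom->at('title');
say $title->text if $title;
say $heading;
say $first_h1;
```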

Part 4: EXTRACTING TEXT AND ATTRIBUTES

Get the text content of an element:

my $dom = Mojo::DOM->new('<p>Hello <b>World</b></p>');

say $dom->at('p')->text;         # Hello
say $dom->at('p')->all_text;     # Hello World
say $dom->at('b')->text;         # World

text gives you the direct text of that element (not nested elements). all_text gives you everything, recursively.
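A slightly deeper nesting makes the difference easier to see. This is a sketch on invented markup; squeeze() is a hypothetical helper, not part of Mojo::DOM. It is there because, as far as I can tell, whitespace handling in text and all_text has changed between Mojolicious releases, so normalising it yourself is the safe move.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: collapse runs of whitespace and trim both ends
sub squeeze
{
    my $s = shift;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Sample markup invented for this sketch
my $dom = Mojo::DOM->new('<div>Total: <span>3 <b>new</b></span> items</div>');
my $div = $dom->at('div');

my $own  = squeeze($div->text);               # text nodes directly under <div>
my $all  = squeeze($div->all_text);           # <div> plus every descendant
my $span = squeeze($dom->at('span')->text);   # <span> only, without the <b>

say "own:  $own";
say "all:  $all";
say "span: $span";
```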

Get attributes with hash-style access:

my $dom = Mojo::DOM->new('<a href="https://perl.org" class="ext">Perl</a>');

my $link = $dom->at('a');
say $link->{href};     # https://perl.org
say $link->{class};    # ext
say $link->attr('href');  # same thing, method style

Both $element->{attr} and $element->attr('attr') work. The hash dereference is shorter and more Perlish.
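A few more attribute moves, sketched on the same link: reading an attribute that isn't there, listing every attribute, and setting a new one. The rel value is just an illustration.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

my $dom  = Mojo::DOM->new('<a href="https://perl.org" class="ext">Perl</a>');
my $link = $dom->at('a');

# A missing attribute reads as undef, so supply a default
my $target = $link->{target} // '(none)';
say "target: $target";

# attr() with no arguments returns a hash reference of all attributes
my $attrs = $link->attr;
say "$_=$attrs->{$_}" for sort keys %$attrs;

# attr() with a name and a value sets the attribute
$link->attr(rel => 'noopener');
say $link->to_string;
```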

Part 5: CHAINING WITH MAP AND GREP

Since find() returns a Mojo::Collection, you get chainable functional operations for free:

# extract all URLs from a page
my @urls = $dom->find('a[href]')
    ->map(attr => 'href')
    ->each;

# extract all image sources
my @images = $dom->find('img[src]')
    ->map(attr => 'src')
    ->grep(sub { m~\.jpg$~i })
    ->each;

The ->map(attr => 'href') shorthand calls ->attr('href') on each element. It's a Mojo::Collection trick for calling methods on every item.
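Here is the shorthand next to its longhand equivalent, plus a couple of other Mojo::Collection methods (uniq and join) that slot into the same chain. The links are invented sample data.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<a href="/a.jpg">A</a>
<a href="/b.png">B</a>
<a href="/a.jpg">A again</a>
<a>no href</a>
HTML

# Shorthand: method name followed by its arguments
my @short = $dom->find('a[href]')->map(attr => 'href')->each;

# Longhand: an explicit callback, with the current element in $_
my @long = $dom->find('a[href]')->map(sub { $_->attr('href') })->each;

# grep() also accepts a regex; uniq drops duplicates; join builds a string
my $jpgs = $dom->find('a[href]')
    ->map(attr => 'href')
    ->grep(qr/\.jpe?g$/i)
    ->uniq
    ->join(', ');

say "@short";
say "$jpgs";
```

Both @short and @long end up with the same three URLs. Which form you use is a matter of taste; the callback wins once you need more than a single method call.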

Filter and transform in one pipeline:

# find all absolute http(s) links with their text
$dom->find('a[href^="http"]')->each(sub
{
    printf "%-30s %s\n", $_->text, $_->{href};
});

The [href^="http"] selector means "href attribute starts with http." CSS attribute selectors are powerful. ^= is starts-with, $= is ends-with, *= is contains.
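A quick sketch of all three operators against a handful of made-up links:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<a href="https://perl.org/docs.pdf">docs</a>
<a href="http://example.com/tracker?id=1">tracker</a>
<a href="/local/report.pdf">report</a>
<a href="mailto:me@example.com">mail</a>
HTML

my %hits = (
    starts_with_http => $dom->find('a[href^="http"]')->size,    # prefix match
    ends_with_pdf    => $dom->find('a[href$=".pdf"]')->size,    # suffix match
    contains_track   => $dom->find('a[href*="track"]')->size,   # substring match
);

printf "%-18s %d\n", $_, $hits{$_} for sort keys %hits;
```

The prefix "http" catches both http:// and https:// URLs, but not relative paths or mailto: links.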

Part 6: REAL WEB SCRAPING

Let's scrape something real. Extract a table into a data structure:

use Mojo::DOM;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('https://example.com/data')->result->dom;

my @rows;
$dom->find('table.data-table tr')->each(sub
{
    my @cells = $_->find('td')->map('text')->each;
    push @rows, \@cells if @cells;
});

for my $row (@rows)
{
    say join "\t", @$row;
}

Each <tr> is found. Inside each row, we find all <td> elements, extract their text, and collect it into an array. Done.
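If the table has a header row, you can go one step further and key each row by column name. A sketch on an inline table (the data is invented, and it assumes the headers live in <th> cells):

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample table invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<table class="data-table">
  <tr><th>Name</th><th>Version</th></tr>
  <tr><td>Mojolicious</td><td>9</td></tr>
  <tr><td>Perl</td><td>5</td></tr>
</table>
HTML

# Column names come from the <th> cells
my @headers = $dom->find('table.data-table th')->map('text')->each;

my @records;
$dom->find('table.data-table tr')->each(sub
{
    my @cells = $_->find('td')->map('all_text')->each;
    return unless @cells;        # the header row has no <td>
    my %row;
    @row{@headers} = @cells;     # hash slice pairs each header with its cell
    push @records, \%row;
});

say "$_->{Name} => $_->{Version}" for @records;
```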

Extract a definition list:

my %glossary;
$dom->find('dl dt')->each(sub
{
    my $term = $_->text;
    my $def = $_->next->text;    # next sibling (the <dd>)
    $glossary{$term} = $def;
});

The ->next method gives you the next sibling element. There's also ->previous, ->parent, ->children, and ->ancestors for DOM traversal.
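The traversal methods are easiest to understand by poking at a small tree. A sketch on an invented glossary; like at(), next and previous return undef when there is no sibling, so the last lookup is guarded.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<div id="glossary">
  <dl>
    <dt>CPAN</dt>
    <dd>Comprehensive Perl Archive Network</dd>
    <dt>DOM</dt>
    <dd>Document Object Model</dd>
  </dl>
</div>
HTML

my $dd = $dom->find('dd')->first;

my $prev     = $dd->previous->text;                      # the <dt> just before it
my $parent   = $dd->parent->tag;                         # enclosing element
my $kids     = $dd->parent->children->size;              # all element children
my $dt_count = $dd->parent->children('dt')->size;        # children() takes an optional selector
my $path     = $dd->ancestors->map('tag')->join(' < ');  # nearest ancestor first

# next returns undef after the last sibling, so guard it
my $last  = $dom->find('dd')->last;
my $after = $last->next ? $last->next->tag : '(none)';

say "previous: $prev";
say "parent:   $parent ($kids children, $dt_count of them dt)";
say "path:     $path";
say "after:    $after";
```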

Part 7: HANDLING MALFORMED HTML

Real-world HTML is garbage. Missing closing tags, nested tables from 2003, inline styles from hell. Mojo::DOM handles it gracefully.

my $mess = '<p>Unclosed paragraph<p>Another one<div>Mixed up</p></div>';
my $dom = Mojo::DOM->new($mess);

$dom->find('p')->each(sub
{
    say $_->all_text;
});

It doesn't crash. It doesn't die. It makes a best effort to parse the mess and lets you query it. Is the parse tree exactly what the author intended? Who knows. The author didn't know either. But you can still extract data from it.

This is a huge advantage over strict XML parsers that refuse to touch anything that isn't well-formed.
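If you want to know what the parser actually made of the mess, ask it. This sketch serialises the repaired tree and prints every element indented by depth. No expected output is shown, since the exact repair may differ between Mojolicious versions.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

my $mess = '<p>Unclosed paragraph<p>Another one<div>Mixed up</p></div>';
my $dom  = Mojo::DOM->new($mess);

# Serialising the tree shows how the markup was repaired
say $dom->to_string;

# Every element, indented by how many ancestors it has
$dom->find('*')->each(sub
{
    my $depth = $_->ancestors->size;
    say '  ' x $depth, $_->tag;
});
```

Run this against any page that gives you surprising results. The selector is usually fine; the tree is just not shaped the way the source suggests.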

Part 8: COMPARISON TO HTML::TREEBUILDER

The old-school way to parse HTML in Perl was HTML::TreeBuilder from the HTML::Tree distribution:

# HTML::TreeBuilder (the old way)
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content($html);
my @links = $tree->look_down('_tag', 'a', sub { $_[0]->attr('href') });
for my $link (@links)
{
    say $link->attr('href');
}
$tree->delete;    # manual cleanup!

# Mojo::DOM (the new way)
use Mojo::DOM;
Mojo::DOM->new($html)->find('a[href]')->each(sub
{
    say $_->{href};
});

Mojo::DOM wins on nearly every front:

FEATURE              HTML::TreeBuilder    Mojo::DOM
-------              -----------------    ---------
Selector syntax      look_down()          CSS selectors
Cleanup required     Yes ($tree->delete)  No (automatic)
Chaining             No                   Yes
Collection methods   No                   map, grep, each
Dependencies         HTML::Tree dist      Mojolicious
Learning curve       Moderate             Low (if you know CSS)

HTML::TreeBuilder isn't bad. It served the community well for decades. But Mojo::DOM is what you should reach for in 2026.

Part 9: ADVANCED SELECTORS

Mojo::DOM supports some powerful selectors that most people never use:

# nth-child
$dom->find('tr:nth-child(odd)');         # odd rows
$dom->find('tr:nth-child(2n)');          # even rows
$dom->find('li:nth-child(3n+1)');       # every third, starting at 1

# not
$dom->find('p:not(.hidden)');            # paragraphs without .hidden

# has (contains matching descendant)
$dom->find('div:has(> img)');            # divs with direct img child

# empty
$dom->find('td:empty');                  # empty table cells

# multiple selectors (OR)
$dom->find('h1, h2, h3');               # all top-level headings

Combine them for surgical precision:

# absolute links in the main content area, excluding those marked .internal
$dom->find('#content a[href^="http"]:not(.internal)')->each(sub
{
    say "$_->{href}: " . $_->text;
});

That single selector does what would take 10 lines of manual tree walking.
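To check your reading of the pseudo-classes, try them on a table small enough to verify by eye. A sketch with invented data:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample table invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<table>
  <tr class="hidden"><td>a</td><td></td></tr>
  <tr><td>b</td><td>x</td></tr>
  <tr><td>c</td><td></td></tr>
  <tr><td>d</td><td>y</td></tr>
</table>
HTML

# First cell of rows 1 and 3
my @odd = $dom->find('tr:nth-child(odd) td:first-child')->map('text')->each;

# First cell of rows 2 and 4
my @even = $dom->find('tr:nth-child(2n) td:first-child')->map('text')->each;

# First cell of every row that lacks class="hidden"
my @shown = $dom->find('tr:not(.hidden) td:first-child')->map('text')->each;

# Cells with no content at all
my $empty = $dom->find('td:empty')->size;

say "odd:   @odd";
say "even:  @even";
say "shown: @shown";
say "empty: $empty";
```

nth-child counts element siblings only, so the whitespace between rows does not throw the numbering off.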

Part 10: PRACTICAL DATA EXTRACTION

Pull structured data from a page and convert to a Perl data structure:

use Mojo::DOM;
use Mojo::JSON qw(encode_json);

my $dom = Mojo::DOM->new($html);

my @products = $dom->find('div.product')->map(sub
{
    {
        name  => $_->at('h3')->text,
        price => $_->at('.price')->text,
        link  => $_->at('a.details')->{href},
        image => $_->at('img')->{src},
    }
})->each;

say encode_json(\@products);

Each div.product is found. Inside each one, we extract the name, price, link, and image using at() to grab specific child elements. The map returns a hashref for each product. The result is a clean Perl data structure ready for JSON export or database insertion.
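Real product listings are rarely that tidy: one card is missing a price, another has no image, and at() returns undef. Here is a more defensive sketch of the same extraction. text_of() and attr_of() are hypothetical helpers written for this example, and the markup is invented.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::JSON qw(encode_json);

# Sample markup invented for this sketch; the second product is incomplete
my $dom = Mojo::DOM->new(<<'HTML');
<div class="product"><h3>Camel Mug</h3><span class="price">$12.50</span>
  <a class="details" href="/mug">details</a><img src="/mug.jpg"></div>
<div class="product"><h3>Onion Sticker</h3></div>
HTML

# Hypothetical helpers: return undef instead of dying when nothing matches
sub text_of
{
    my ($el, $selector) = @_;
    my $hit = $el->at($selector);
    return $hit ? $hit->text : undef;
}

sub attr_of
{
    my ($el, $selector, $name) = @_;
    my $hit = $el->at($selector);
    return $hit ? $hit->{$name} : undef;
}

my @products = $dom->find('div.product')->map(sub
{
    return {
        name  => text_of($_, 'h3'),
        price => text_of($_, '.price'),
        link  => attr_of($_, 'a.details', 'href'),
        image => attr_of($_, 'img', 'src'),
    };
})->each;

say encode_json(\@products);
```

Missing fields come through as undef, which encode_json writes as null. Whether to keep or skip incomplete records is then a decision you make in one place.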

Want it as CSV instead?

$dom->find('div.product')->each(sub
{
    my $name  = $_->at('h3')->text;
    my $price = $_->at('.price')->text;
    $price =~ s~[^\d.]~~g;    # strip $ and commas
    say "$name,$price";
});

From HTML to CSV in 7 lines. No regex-based HTML parsing. No fragile string matching. Just CSS selectors pointing at exactly what you want.
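One caveat: joining fields with a bare comma breaks as soon as a product name contains one. A minimal sketch of quoting, using a hypothetical csv_field() helper and invented sample data. For anything serious, a dedicated module such as Text::CSV from CPAN is the safer choice.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup invented for this sketch
my $dom = Mojo::DOM->new(<<'HTML');
<div class="product"><h3>Mug, large</h3><span class="price">$1,299.00</span></div>
<div class="product"><h3>Sticker</h3><span class="price">$2</span></div>
HTML

# Hypothetical helper: wrap a field in quotes when it contains a comma,
# a double quote, or a line break, and double any embedded quotes
sub csv_field
{
    my ($value) = @_;
    $value = '' unless defined $value;
    return $value unless $value =~ /[",\r\n]/;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

my @lines = $dom->find('div.product')->map(sub
{
    my $name   = $_->at('h3');
    my $price  = $_->at('.price');
    my $amount = $price ? $price->text : '';
    $amount =~ s/[^\d.]//g;    # strip currency symbols and thousands separators
    return join ',', csv_field($name ? $name->text : ''), csv_field($amount);
})->each;

say for @lines;
```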

     $html
       |
   Mojo::DOM
       |
   find('selector')
       |
   +---+---+---+
   |   |   |   |
  [1] [2] [3] [4]   Mojo::Collection
       |
   map / grep / each
       |
     data

  jQuery for Perl. No browser required.

      .--.
     |o_o |
     |:_/ |
    //   \ \
   (|     | )
  /'\_   _/`\
  \___)=(___/

perl.gg