<!-- category: modern-perl -->
Mojo::DOM CSS Selectors for HTML Parsing
Stop parsing HTML with regex. Seriously. Stop.

    use Mojo::DOM;

    my $dom = Mojo::DOM->new($html);
    $dom->find('a[href]')->each(sub { say $_->{href}; });

That grabs every link from an HTML document using a CSS selector. The same selector syntax you already know from jQuery, from your browser's DevTools, from every CSS file you've ever written.
Mojo::DOM ships with Mojolicious. No extra dependencies. No XPath nightmares. No building parse trees by hand. Just CSS selectors and method chains.
Part 1: CREATING A DOM OBJECT
Feed it HTML. Get back an object you can query:

    use Mojo::DOM;

    my $dom = Mojo::DOM->new('<div><p>Hello</p><p>World</p></div>');
    say $dom->at('p')->text;     # Hello
    say $dom->find('p')->size;   # 2

From a file:

    use Mojo::File qw(path);
    use Mojo::DOM;

    my $html = path('page.html')->slurp;
    my $dom  = Mojo::DOM->new($html);

From a web request (if you're using the full Mojolicious stack):

    use Mojo::UserAgent;

    my $ua  = Mojo::UserAgent->new;
    my $dom = $ua->get('https://example.com')->result->dom;

That last one fetches a page and gives you a DOM object in one chain. Beautiful.
Part 2: CSS SELECTOR BASICS
If you've written CSS, you already know this:

    SELECTOR              MATCHES
    --------              -------
    div                   All <div> elements
    .classname            Elements with class="classname"
    #myid                 Element with id="myid"
    a[href]               <a> tags that have an href attribute
    input[type="text"]    <input> with type="text"
    div > p               <p> directly inside <div>
    div p                 <p> anywhere inside <div>
    ul > li:first-child   First <li> in a <ul>
    h2 + p                <p> immediately after <h2>

Mojo::DOM supports the full CSS3 selector spec (well, nearly all of it). If it works in your browser's document.querySelectorAll(), it probably works here.

    my $dom = Mojo::DOM->new($html);

    # by tag
    $dom->find('p');

    # by class
    $dom->find('.warning');

    # by id
    $dom->at('#main-content');

    # by attribute
    $dom->find('img[alt]');

    # by attribute value
    $dom->find('a[href="https://perl.org"]');

    # descendant
    $dom->find('div.content p');

    # direct child
    $dom->find('ul > li');
Part 3: FIND VS AT
Two main query methods. Learn the difference.

find() returns ALL matches as a Mojo::Collection:

    my $links = $dom->find('a');
    say $links->size;   # number of matches

at() returns the FIRST match as a single element (or undef):

    my $title = $dom->at('title');
    say $title->text;   # page title

Use find when you expect multiple results. Use at when you want one specific element.

    # all paragraphs
    $dom->find('p')->each(sub { say $_->text });

    # the one h1 on the page
    my $heading = $dom->at('h1');
    say $heading->text if $heading;

Always check for undef when using at. If no element matches, you get nothing back. Calling ->text on undef will blow up.
Part 4: EXTRACTING TEXT AND ATTRIBUTES
Get the text content of an element:

    my $dom = Mojo::DOM->new('<p>Hello <b>World</b></p>');

    say $dom->at('p')->text;       # Hello
    say $dom->at('p')->all_text;   # Hello World
    say $dom->at('b')->text;       # World

text gives you the direct text of that element (not nested elements). all_text gives you everything, recursively.
Get attributes with hash-style access:

    my $dom  = Mojo::DOM->new('<a href="https://perl.org" class="ext">Perl</a>');
    my $link = $dom->at('a');

    say $link->{href};          # https://perl.org
    say $link->{class};         # ext
    say $link->attr('href');    # same thing, method style

Both $element->{attr} and $element->attr('attr') work. The hash dereference is shorter and more Perlish.
Part 5: CHAINING WITH MAP AND GREP
Since find() returns a Mojo::Collection, you get chainable functional operations for free:

    # extract all URLs from a page
    my @urls = $dom->find('a[href]')
        ->map(attr => 'href')
        ->each;

    # extract all image sources
    my @images = $dom->find('img[src]')
        ->map(attr => 'src')
        ->grep(sub { m~\.jpg$~i })
        ->each;

The ->map(attr => 'href') shorthand calls ->attr('href') on each element. It's a Mojo::Collection trick for calling methods on every item.
Filter and transform in one pipeline:

    # find all external links with their text
    $dom->find('a[href^="http"]')->each(sub {
        printf "%-30s %s\n", $_->text, $_->{href};
    });

The [href^="http"] selector means "href attribute starts with http." CSS attribute selectors are powerful. ^= is starts-with, $= is ends-with, *= is contains.
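
Here is a small sketch of all three operators side by side. The sample markup and the variable names are made up for illustration; only find, map, and each come from Mojo::DOM and Mojo::Collection.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample markup, invented for this illustration.
my $dom = Mojo::DOM->new(<<'HTML');
<a href="https://perl.org/docs.pdf">Docs</a>
<a href="/local/report.pdf">Report</a>
<a href="https://example.com/search?q=perl">Search</a>
HTML

# ^= : attribute value starts with the given string
my @absolute = $dom->find('a[href^="https://"]')->map(attr => 'href')->each;

# $= : attribute value ends with the given string
my @pdfs = $dom->find('a[href$=".pdf"]')->map(attr => 'href')->each;

# *= : attribute value contains the given string anywhere
my @searches = $dom->find('a[href*="search"]')->map(attr => 'href')->each;

say "absolute: @absolute";
say "pdf:      @pdfs";
say "search:   @searches";
```

Keep the selector in single quotes. In a double-quoted Perl string, the $= would need escaping.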
Part 6: REAL WEB SCRAPING
Let's scrape something real. Extract a table into a data structure:

    use Mojo::DOM;
    use Mojo::UserAgent;

    my $ua  = Mojo::UserAgent->new;
    my $dom = $ua->get('https://example.com/data')->result->dom;

    my @rows;
    $dom->find('table.data-table tr')->each(sub {
        my @cells = $_->find('td')->map('text')->each;
        push @rows, \@cells if @cells;
    });

    for my $row (@rows) {
        say join "\t", @$row;
    }

Each <tr> is found. Inside each row, we find all <td> elements, extract their text, and collect it into an array. Done.
Extract a definition list:

    my %glossary;
    $dom->find('dl dt')->each(sub {
        my $term = $_->text;
        my $def  = $_->next->text;   # next sibling (the <dd>)
        $glossary{$term} = $def;
    });

The ->next method gives you the next sibling element. There's also ->previous, ->parent, ->children, and ->ancestors for DOM traversal.
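
A quick sketch of those traversal methods on a tiny fragment. The markup is hypothetical, invented just to have something to walk around in.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical fragment, made up for this example.
my $dom = Mojo::DOM->new(
    '<ul id="menu"><li>One</li><li class="current">Two</li><li>Three</li></ul>'
);

my $current = $dom->at('li.current');

say $current->previous->text;                      # sibling element before
say $current->next->text;                          # sibling element after
say $current->parent->{id};                        # id of the enclosing <ul>
say $current->parent->children('li')->size;        # number of <li> children
say $current->ancestors->map('tag')->join(' < ');  # tag names walking upward

# Like at(), next and previous return undef when there is nothing there.
my $after_last = $dom->at('li:last-child')->next;
say 'no sibling after the last <li>' unless defined $after_last;
```

The same undef rule from at() applies here: guard ->next and ->previous before chaining another method onto them.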
Part 7: HANDLING MALFORMED HTML
Real-world HTML is garbage. Missing closing tags, nested tables from 2003, inline styles from hell. Mojo::DOM handles it gracefully.

    my $mess = '<p>Unclosed paragraph<p>Another one<div>Mixed up</p></div>';
    my $dom  = Mojo::DOM->new($mess);

    $dom->find('p')->each(sub { say $_->all_text; });

It doesn't crash. It doesn't die. It makes a best effort to parse the mess and lets you query it. Is the parse tree exactly what the author intended? Who knows. The author didn't know either. But you can still extract data from it.
This is a huge advantage over strict XML parsers that refuse to touch anything that isn't well-formed.
Part 8: COMPARISON TO HTML::TREEBUILDER
The old-school way to parse HTML in Perl was HTML::TreeBuilder from the HTML::Tree distribution:

    # HTML::TreeBuilder (the old way)
    use HTML::TreeBuilder;

    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @links = $tree->look_down('_tag', 'a', sub { $_[0]->attr('href') });
    for my $link (@links) {
        say $link->attr('href');
    }
    $tree->delete;   # manual cleanup!

    # Mojo::DOM (the new way)
    use Mojo::DOM;

    Mojo::DOM->new($html)->find('a[href]')->each(sub { say $_->{href}; });

Mojo::DOM wins on every front:

    FEATURE              HTML::TreeBuilder     Mojo::DOM
    -------              -----------------     ---------
    Selector syntax      look_down()           CSS selectors
    Cleanup required     Yes ($tree->delete)   No (automatic)
    Chaining             No                    Yes
    Collection methods   No                    map, grep, each
    Dependencies         HTML::Tree dist       Mojolicious
    Learning curve       Moderate              Low (if you know CSS)
HTML::TreeBuilder isn't bad. It served the community well for decades. But Mojo::DOM is what you should reach for in 2026.
Part 9: ADVANCED SELECTORS
Mojo::DOM supports some powerful selectors that most people never use:

    # nth-child
    $dom->find('tr:nth-child(odd)');     # odd rows
    $dom->find('tr:nth-child(2n)');      # even rows
    $dom->find('li:nth-child(3n+1)');    # every third, starting at 1

    # not
    $dom->find('p:not(.hidden)');        # paragraphs without .hidden

    # has (contains matching descendant)
    $dom->find('div:has(> img)');        # divs with direct img child

    # empty
    $dom->find('td:empty');              # empty table cells

    # multiple selectors (OR)
    $dom->find('h1, h2, h3');            # all top-level headings

Combine them for surgical precision:

    # external links in the main content area, excluding nav
    $dom->find('#content a[href^="http"]:not(.internal)')->each(sub {
        say "$_->{href}: " . $_->text;
    });

That single selector does what would take 10 lines of manual tree walking.
Part 10: PRACTICAL DATA EXTRACTION
Pull structured data from a page and convert to a Perl data structure:

    use Mojo::DOM;
    use Mojo::JSON qw(encode_json);

    my $dom = Mojo::DOM->new($html);

    my @products = $dom->find('div.product')->map(sub {
        {
            name  => $_->at('h3')->text,
            price => $_->at('.price')->text,
            link  => $_->at('a.details')->{href},
            image => $_->at('img')->{src},
        }
    })->each;

    say encode_json(\@products);

Each div.product is found. Inside each one, we extract the name, price, link, and image using at() to grab specific child elements. The map returns a hashref for each product. The result is a clean Perl data structure ready for JSON export or database insertion.
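
One catch: at() returns undef when nothing matches, so a product missing its price or image would kill the whole run. Below is a defensive sketch. The helpers text_of and attr_of are hypothetical names invented here, not part of Mojo::DOM, and the markup is made up to include an incomplete product.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::JSON qw(encode_json);

# Hypothetical markup: the second product has no price and no image.
my $html = <<'HTML';
<div class="product"><h3>Widget</h3><span class="price">$9.99</span>
  <a class="details" href="/widget">Details</a><img src="/widget.png"></div>
<div class="product"><h3>Gadget</h3>
  <a class="details" href="/gadget">Details</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# Illustrative helpers (not Mojo::DOM methods): return undef instead of
# dying when the selector matches nothing inside the given element.
sub text_of {
    my ($el, $selector) = @_;
    my $hit = $el->at($selector);
    return $hit ? $hit->text : undef;
}

sub attr_of {
    my ($el, $selector, $name) = @_;
    my $hit = $el->at($selector);
    return $hit ? $hit->attr($name) : undef;
}

my @products = $dom->find('div.product')->map(sub {
    return {
        name  => text_of($_, 'h3'),
        price => text_of($_, '.price'),
        link  => attr_of($_, 'a.details', 'href'),
        image => attr_of($_, 'img', 'src'),
    };
})->each;

# Missing fields are encoded as JSON null rather than aborting the run.
print encode_json(\@products), "\n";
```

Whether a missing field should become null, get a default, or cause the record to be skipped depends on what consumes the data. This version just keeps the scrape alive.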
Want it as CSV instead?

    $dom->find('div.product')->each(sub {
        my $name  = $_->at('h3')->text;
        my $price = $_->at('.price')->text;
        $price =~ s~[^\d.]~~g;   # strip $ and commas
        say "$name,$price";
    });

From HTML to CSV in 6 lines. No regex-based HTML parsing. No fragile string matching. Just CSS selectors pointing at exactly what you want.
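
The plain "$name,$price" join breaks as soon as a product name contains a comma or a quote. A minimal sketch of quoting in the usual CSV style follows. The csv_field helper is a hypothetical name made up for this example and the markup is invented. For real work, the Text::CSV module from CPAN is probably the safer choice.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical markup with a comma and quotes inside product names.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="product"><h3>Cable, 2m</h3><span class="price">$1,299.00</span></div>
<div class="product"><h3>The "Pro" Stand</h3><span class="price">$45</span></div>
HTML

# Illustrative helper (an assumption, not a library function):
# double any embedded quotes, then wrap the field in quotes if needed.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return $value =~ /[",\r\n]/ ? qq{"$value"} : $value;
}

my @lines = $dom->find('div.product')->map(sub {
    my $name  = $_->at('h3')->text;
    my $price = $_->at('.price')->text;
    $price =~ s/[^\d.]//g;    # strip $ and commas
    return join ',', csv_field($name), csv_field($price);
})->each;

say for @lines;
```

With the sample markup this prints "Cable, 2m",1299.00 and "The ""Pro"" Stand",45, so a spreadsheet reads each row as exactly two fields.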

            $html
              |
          Mojo::DOM
              |
       find('selector')
              |
        +---+---+---+
        |   |   |   |
       [1] [2] [3] [4]    Mojo::Collection
              |
       map / grep / each
              |
            data

jQuery for Perl. No browser required.

perl.gg