perl.gg / one-liners

Reservoir Sampling

2025-12-26

Pick a random line from a file. Easy, right?
my @lines = <FILE>; print $lines[rand @lines];
Load the whole file, pick one. Works fine for small files.

Now do it on a 50GB log file.

You can't load 50GB into memory. You don't know how many lines there are until you've read them all. You can only read forward, once.

One line solves this:

perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'
Every line has an equal chance of being selected. Mathematically proven. Memory usage: one line, always.

Part 1: THE ALGORITHM

This is called reservoir sampling. Invented by computer scientists, compressed into Perl by hackers.

The logic:

Line 1: 1/1 chance (100%) - always keep it
Line 2: 1/2 chance (50%)  - maybe replace
Line 3: 1/3 chance (33%)  - maybe replace
...
Line N: 1/N chance        - maybe replace
When you're done, every line had exactly 1/N chance of being the final selection. The math works out perfectly.
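Here is the same loop written out long-hand, so you can see what the one-liner compresses. A readable sketch, not something you'd actually type:

# reservoir sampling with a reservoir of one line
my $pick;                             # current survivor
my $n = 0;                            # lines seen so far
while (my $line = <>) {
    $n++;
    $pick = $line if rand($n) < 1;    # keep this line with probability 1/$n
}
print $pick if defined $pick;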

Part 2: HOW IT WORKS

rand($.) < 1 && ($line = $_) while <>
Break it down:
PIECE            WHAT IT DOES
------------     ------------------------------------------
while <>         Read lines one at a time
$.               Current line number (1, 2, 3, ...)
rand($.)         Random number from 0 to $. (exclusive)
rand($.) < 1     True with probability 1/$.
&&               Short-circuit: only do right side if left is true
($line = $_)     Save current line (also returns true)
The parentheses around the assignment matter. && binds tighter than =, so without them Perl parses the statement as (rand($.) < 1 && $line) = $_ and refuses to compile ("Can't modify logical and (&&) in scalar assignment").
    .--.
   |o_o |
   |:_/ |
  //   \ \
 (|     | )
/'\_   _/`\
\___)=(___/

Part 3: THE PROBABILITY PROOF

Don't trust me? Let's prove it.

For line K to be the final selection:

1. It must be selected when read: probability 1/K
2. It must NOT be replaced by any later line
Probability of not being replaced by line K+1: K/(K+1)
Probability of not being replaced by line K+2: (K+1)/(K+2)
...
Probability of not being replaced by line N:   (N-1)/N

Multiply them all:

(1/K) × (K/(K+1)) × ((K+1)/(K+2)) × ... × ((N-1)/N)
Everything cancels (telescoping product):
= 1/N
Every line has exactly 1/N probability. QED.
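Prefer evidence to algebra? A quick empirical check (a throwaway sketch; the item count and number of trials are arbitrary): run the 1/$n rule over the same ten items many times and count the winners. Each item should win roughly 10% of the time.

perl -e '
  my @items = (1 .. 10);
  my %wins;
  for (1 .. 100_000) {
    my ($pick, $n) = (undef, 0);
    for my $item (@items) {
      $n++;
      $pick = $item if rand($n) < 1;   # the same 1/$n rule
    }
    $wins{$pick}++;
  }
  printf "item %2d: %d\n", $_, $wins{$_} for 1 .. 10;
'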

Part 4: VARIATIONS

Select K random lines (reservoir of size K):
perl -e '
  @r = ();
  while (<>) {
    if    (@r < 5)        { push @r, $_ }
    elsif (rand($.) < 5)  { $r[rand @r] = $_ }
  }
  print @r;
' hugefile.txt
Keep 5 random lines. Same principle, bigger reservoir: the first five lines fill it, every later line gets in with probability 5/$. and evicts a uniformly chosen occupant, so each line ends up kept with probability 5/N.

Weighted random selection (by line length):

perl -e 'while (<>) { $w += length; rand($w) < length && ($line = $_) } print $line'
Longer lines are more likely to be selected: each line's final probability is its length divided by the total length of everything read, by the same telescoping argument.
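The same pattern works for any non-negative weight, not just length. A minimal sketch, assuming a made-up input format where the first whitespace-separated field of each line is a numeric weight (weighted.txt is a hypothetical file):

perl -ane '
  # $F[0] is assumed to hold a non-negative numeric weight (hypothetical format)
  $total += $F[0];
  $line = $_ if rand($total) < $F[0];
  END { print $line if defined $line }
' weighted.txt

Each line wins with probability proportional to its weight.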

Part 5: PRACTICAL USES

Sample from logs without loading them:
perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /var/log/huge.log
Random test case from a data file:
perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' test_cases.txt
Quick sanity check on data:
for i in {1..10}; do
  perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' data.csv
done
Ten random samples. See if your data looks reasonable.

Part 6: MULTIPLE RANDOM LINES

Want more than one? Run it multiple times (each pass rereads the file and can pick the same line twice):
for i in {1..5}; do
  perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' file.txt
done
Or use the reservoir approach for guaranteed distinct lines:
perl -e '
  while (<>) {
    if    ($. <= 10)      { $r[$.-1] = $_ }
    elsif (rand($.) < 10) { $r[rand 10] = $_ }
  }
  print @r
' file.txt
Ten distinct random lines.

Part 7: SEEDING RANDOMNESS

Need reproducible results? Seed the RNG:
perl -e 'srand(42); rand($.) < 1 && ($l = $_) while <>; print $l' file.txt
Same seed = same "random" selection. Useful for testing.

Part 8: COMPARISON

Three approaches to random line selection:
METHOD             MEMORY   SPEED    WORKS ON STREAMS
----------------   ------   ------   ----------------
Load into array    O(N)     O(N)     No
Two-pass (count)   O(1)     O(2N)    No
Reservoir          O(1)     O(N)     Yes
Reservoir wins on memory and works on pipes:
some_command | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'
You can't seek on a pipe. Reservoir sampling doesn't need to.
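For contrast, the two-pass method from the table looks roughly like this (a sketch; it needs a real, seekable file, which is exactly what a pipe can't give you):

perl -e '
  my $file = shift;
  open my $fh, "<", $file or die "open $file: $!";
  my $count = 0;
  $count++ while <$fh>;              # pass 1: count the lines
  my $target = 1 + int rand $count;  # pick a line number
  seek $fh, 0, 0 or die "seek: $!";  # rewind - impossible on a pipe
  my $n = 0;                         # pass 2: fetch that line
  while (<$fh>) {
    if (++$n == $target) { print; last }
  }
' file.txt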

Part 9: GOTCHAS

Empty files:
perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /dev/null
Prints nothing. $l is never set. Add a check if you care:
perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l // "empty\n"'
Very short files:
echo "only line" | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'
Works fine. Line 1 has 100% chance of selection.

Part 10: THE ONE-LINER

Memorize this:
perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'
Or shorter:
perl -ne 'rand($.) < 1 && ($l = $_) }{ print $l'
Butterfly version. Twelve characters shorter.
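The }{ works because -n wraps your code in a read loop; the stray }{ closes that loop and opens a trailing block, so the program is roughly equivalent to:

while (<>) {
    rand($.) < 1 && ($l = $_);
}
{
    print $l;
}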

Or with -l for clean output:

perl -lne 'rand($.) < 1 && ($l = $_) }{ print $l'

Part 11: WHY THIS MATTERS

This isn't just a trick. It's a real algorithm with real applications:
- Sampling from data streams
- A/B testing selection
- Random auditing
- Load testing with random inputs
- Statistical sampling from logs
When you can't fit the data in memory and you can't make two passes, reservoir sampling is the answer.

And Perl does it in one line.

 rand($.) < 1
      |
      v
 +---------+
 |  keep?  |
 +---------+
   /     \
 yes      no
 /          \
[save]    [next]

Fair selection from infinite streams

Created By: Wildcard Wizard. Copyright 2026