Reservoir Sampling
Pick a random line from a file. Easy, right?Load the whole file, pick one. Works fine for small files.my @lines = <FILE>; print $lines[rand @lines];
Now do it on a 50GB log file.
You can't load 50GB into memory. You don't know how many lines there are until you've read them all. You can only read forward, once.
One line solves this:
Every line has an equal chance of being selected. Mathematically proven. Memory usage: one line, always.perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'
Part 1: THE ALGORITHM
This is called reservoir sampling. Invented by computer scientists, compressed into Perl by hackers.The logic:
When you're done, every line had exactly 1/N chance of being the final selection. The math works out perfectly.Line 1: 1/1 chance (100%) - always keep it Line 2: 1/2 chance (50%) - maybe replace Line 3: 1/3 chance (33%) - maybe replace Line N: 1/N chance - maybe replace
Part 2: HOW IT WORKS
Break it down:rand($.) < 1 && ($line = $_) while <>
The parentheses around the assignment matter. Without them, the precedence is wrong.PIECE WHAT IT DOES ------------ ------------------------------------------ while <> Read lines one at a time $. Current line number (1, 2, 3, ...) rand($.) Random number from 0 to $. (exclusive) rand($.) < 1 True with probability 1/$. && Short-circuit: only do right side if left is true ($line = $_) Save current line (also returns true)
.--. |o_o | |:_/ | // \ \ (| | ) /'\_ _/`\ \___)=(___/
Part 3: THE PROBABILITY PROOF
Don't trust me? Let's prove it.For line K to be the final selection:
Probability of not being replaced by line K+1: (K)/(K+1) Probability of not being replaced by line K+2: (K+1)/(K+2) ... Probability of not being replaced by line N: (N-1)/N1. It must be selected when read: probability 1/K 2. It must NOT be replaced by any later line
Multiply them all:
Everything cancels (telescoping product):(1/K) × (K/(K+1)) × ((K+1)/(K+2)) × ... × ((N-1)/N)
Every line has exactly 1/N probability. QED.= 1/N
Part 4: VARIATIONS
Select K random lines (reservoir of size K):Keep 5 random lines. Same principle, bigger reservoir.perl -e ' @r = (); while (<>) { if (@r < 5) { push @r, $_ } elsif (rand($.) < 5) { $r[rand @r] = $_ } } print @r; ' hugefile.txt
Weighted random selection (by line length):
Longer lines are more likely to be selected.perl -e ' $w += length; rand($w) < length && ($line = $_) while <>; print $line '
Part 5: PRACTICAL USES
Sample from logs without loading them:Random test case from a data file:perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /var/log/huge.log
Quick sanity check on data:perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' test_cases.txt
Ten random samples. See if your data looks reasonable.for i in {1..10}; do perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' data.csv done
Part 6: MULTIPLE RANDOM LINES
Want more than one? Run it multiple times:Or use the reservoir approach for guaranteed distinct lines:for i in {1..5}; do perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' file.txt done
Ten distinct random lines.perl -e ' while (<>) { if ($. <= 10) { $r[$.-1] = $_ } elsif (rand($.) < 10) { $r[rand 10] = $_ } } print @r ' file.txt
Part 7: SEEDING RANDOMNESS
Need reproducible results? Seed the RNG:Same seed = same "random" selection. Useful for testing.perl -e 'srand(42); rand($.) < 1 && ($l = $_) while <>; print $l' file.txt
Part 8: COMPARISON
Three approaches to random line selection:Reservoir wins on memory and works on pipes:METHOD MEMORY SPEED WORKS ON STREAMS ---------------- -------- -------- ---------------- Load into array O(N) O(N) No Two-pass (count) O(1) O(2N) No Reservoir O(1) O(N) Yes
You can't seek on a pipe. Reservoir sampling doesn't need to.some_command | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'
Part 9: GOTCHAS
Empty files:Prints nothing. $line is never set. Add a check if you care:perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /dev/null
Very short files:perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l // "empty\n"'
Works fine. Line 1 has 100% chance of selection.echo "only line" | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'
Part 10: THE ONE-LINER
Memorize this:Or shorter:perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'
Butterfly version. Eight characters shorter.perl -ne 'rand($.) < 1 && ($l = $_) }{ print $l'
Or with -l for clean output:
perl -lne 'rand($.) < 1 && ($l = $_) }{ print $l'
Part 11: WHY THIS MATTERS
This isn't just a trick. It's a real algorithm with real applications:When you can't fit the data in memory and you can't make two passes, reservoir sampling is the answer.- Sampling from data streams - A/B testing selection - Random auditing - Load testing with random inputs - Statistical sampling from logs
And Perl does it in one line.
rand($.) < 1 | v +---------+ | keep? | +---------+ / \ yes no / \ [save] [next] Fair selection from infinite streams
Created By: Wildcard Wizard. Copyright 2026