Reservoir Sampling

2025-12-26

Pick a random line from a file. Easy, right?

my @lines = <FILE>;
print $lines[rand @lines];

Load the whole file, pick one. Works fine for small files.

Now do it on a 50GB log file.

You can't load 50GB into memory. You don't know how many lines there are until you've read them all. You can only read forward, once.

One line solves this:

perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'

Every line has an equal chance of being selected. Mathematically proven. Memory usage: one line, always.

Part 1: THE ALGORITHM

This is called reservoir sampling. Invented by computer scientists, compressed into Perl by hackers.

The logic:

Line 1: 1/1 chance (100%) - always keep it
Line 2: 1/2 chance (50%) - maybe replace
Line 3: 1/3 chance (33%) - maybe replace
Line N: 1/N chance - maybe replace

When you're done, every line had exactly 1/N chance of being the final selection. The math works out perfectly.

Part 2: HOW IT WORKS

rand($.) < 1 && ($line = $_) while <>

Break it down:

PIECE           WHAT IT DOES
------------    ------------------------------------------
while <>        Read lines one at a time
$.              Current line number (1, 2, 3, ...)
rand($.)        Random number from 0 to $. (exclusive)
rand($.) < 1    True with probability 1/$.
&&              Short-circuit: only do right side if left is true
($line = $_)    Save current line (also returns true)

The parentheses around the assignment matter. Without them, the precedence is wrong.

      .--.
     |o_o |
     |:_/ |
    //   \ \
   (|     | )
  /'\_   _/`\
  \___)=(___/

Part 3: THE PROBABILITY PROOF

Don't trust me? Let's prove it.

For line K to be the final selection:

1. It must be selected when read: probability 1/K
2. It must NOT be replaced by any later line

Probability of not being replaced by line K+1: (K)/(K+1) Probability of not being replaced by line K+2: (K+1)/(K+2) ... Probability of not being replaced by line N: (N-1)/N

Multiply them all:

(1/K) × (K/(K+1)) × ((K+1)/(K+2)) × ... × ((N-1)/N)

Everything cancels (telescoping product):

= 1/N

Every line has exactly 1/N probability. QED.

Part 4: VARIATIONS

Select K random lines (reservoir of size K):

perl -e '
    @r = ();
    while (<>) {
        if (@r < 5) { push @r, $_ }
        elsif (rand($.) < 5) { $r[rand @r] = $_ }
    }
    print @r;
' hugefile.txt

Keep 5 random lines. Same principle, bigger reservoir.

Weighted random selection (by line length):

perl -e '
    $w += length; 
    rand($w) < length && ($line = $_) while <>; 
    print $line
'

Longer lines are more likely to be selected.

Part 5: PRACTICAL USES

Sample from logs without loading them:

perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /var/log/huge.log

Random test case from a data file:

perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' test_cases.txt

Quick sanity check on data:

for i in {1..10}; do
    perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' data.csv
done

Ten random samples. See if your data looks reasonable.

Part 6: MULTIPLE RANDOM LINES

Want more than one? Run it multiple times:

for i in {1..5}; do
    perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' file.txt
done

Or use the reservoir approach for guaranteed distinct lines:

perl -e '
    while (<>) {
        if ($. <= 10) { $r[$.-1] = $_ }
        elsif (rand($.) < 10) { $r[rand 10] = $_ }
    }
    print @r
' file.txt

Ten distinct random lines.

Part 7: SEEDING RANDOMNESS

Need reproducible results? Seed the RNG:

perl -e 'srand(42); rand($.) < 1 && ($l = $_) while <>; print $l' file.txt

Same seed = same "random" selection. Useful for testing.

Part 8: COMPARISON

Three approaches to random line selection:

METHOD              MEMORY      SPEED       WORKS ON STREAMS
----------------    --------    --------    ----------------
Load into array     O(N)        O(N)        No
Two-pass (count)    O(1)        O(2N)       No  
Reservoir           O(1)        O(N)        Yes

Reservoir wins on memory and works on pipes:

some_command | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'

You can't seek on a pipe. Reservoir sampling doesn't need to.

Part 9: GOTCHAS

Empty files:

perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l' /dev/null

Prints nothing. $line is never set. Add a check if you care:

perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l // "empty\n"'

Very short files:

echo "only line" | perl -e 'rand($.) < 1 && ($l = $_) while <>; print $l'

Works fine. Line 1 has 100% chance of selection.

Part 10: THE ONE-LINER

Memorize this:

perl -e 'rand($.) < 1 && ($line = $_) while <>; print $line'

Or shorter:

perl -ne 'rand($.) < 1 && ($l = $_) }{ print $l'

Butterfly version. Eight characters shorter.

Or with -l for clean output:

perl -lne 'rand($.) < 1 && ($l = $_) }{ print $l'

Part 11: WHY THIS MATTERS

This isn't just a trick. It's a real algorithm with real applications:

- Sampling from data streams
- A/B testing selection
- Random auditing
- Load testing with random inputs
- Statistical sampling from logs

When you can't fit the data in memory and you can't make two passes, reservoir sampling is the answer.

And Perl does it in one line.

                rand($.) < 1
                     |
                     v
                +---------+
                |  keep?  |
                +---------+
                 /       \
               yes       no
               /           \
          [save]        [next]

     Fair selection from infinite streams