perl.gg / one-liners

Duplicate Detector

2025-12-19

Find duplicates. One of the most common tasks in sysadmin life.
perl -ne 'print if $seen{$_}++'
That's it. Prints every line that appears more than once.

The trick is the post-increment. First time a line appears, $seen{$_} is 0 (false), so nothing prints. Second time, it's 1 (true), so it prints. Third time, still true, prints again.

Want only the second occurrence? We'll get there.

Part 1: HOW IT WORKS

print if $seen{$_}++
Break it down:
PIECE           WHAT IT DOES
------------    ------------------------------------------
$seen{$_}       Hash lookup - has this line been seen?
++              Post-increment - add 1 AFTER returning value
$seen{$_}++     Returns OLD value (0 first time), then increments
print if ...    Print if the condition is true (non-zero)
First encounter: $seen{$_} is undef (0 in numeric context). Returns 0, then becomes 1. Condition false, no print.

Second encounter: $seen{$_} is 1. Returns 1, then becomes 2. Condition true, print.

   .--.
  |o_o |
  |:_/ |
 //   \ \
(|     | )
/'\_   _/`\
\___)=(___/

Part 2: VARIATIONS

Print duplicates only once (not every repeat):
perl -ne 'print if $seen{$_}++ == 1'
The == 1 means "only on the second occurrence."

Print unique lines only (no duplicates at all):

perl -ne 'print unless $seen{$_}++'
Flip the logic. Print first occurrence, skip all repeats.

Print lines that appear exactly N times (one pass to count, then report in an END block):

perl -ne '$c{$_}++; END { print for grep { $c{$_} == 3 } keys %c }'

Part 3: FIRST VS LAST OCCURRENCE

First occurrence of each line:
perl -ne 'print unless $seen{$_}++'
Last occurrence of each line:
perl -ne '$last{$_} = $_; END { print values %last }'
This overwrites each time, so only the last survives.

But order is lost. Want last occurrence in order?

perl -ne '$last{$_} = $.; END { print sort { $last{$a} <=> $last{$b} } keys %last }'
Store line numbers, sort by them at the end.

Part 4: CASE INSENSITIVE

Ignore case when detecting duplicates:
perl -ne 'print if $seen{lc $_}++'
The lc lowercases the key. "Hello" and "HELLO" are now the same.

Normalize whitespace too:

perl -ne '$k = lc; $k =~ s/\s+/ /g; print if $seen{$k}++'

Part 5: FIELD-BASED DUPLICATES

Duplicate detection on a specific column:
perl -ane 'print if $seen{$F[0]}++'
The -a splits each line into @F. This checks for duplicate first fields only.

Duplicate IPs in a log:

perl -ane 'print if $seen{$F[0]}++' access.log
Duplicate usernames in /etc/passwd:
perl -F: -ane 'print if $seen{$F[0]}++' /etc/passwd

Part 6: COUNTING DUPLICATES

How many times does each line appear?
perl -ne '$c{$_}++; END { print "$c{$_}: $_" for keys %c }'
Output like:
3: this line appeared three times
1: this line appeared once
5: this line appeared five times
Sorted by count:
perl -ne '$c{$_}++; END { print "$c{$_}: $_" for sort { $c{$b} <=> $c{$a} } keys %c }'
Most frequent first.

Part 7: ADJACENT DUPLICATES

The uniq command only removes adjacent duplicates:
perl -ne 'print unless $_ eq $last; $last = $_'
Line must equal the previous line to be skipped.

This is faster (no hash) but only catches consecutive repeats:

aaa
aaa   <- removed
bbb
aaa   <- NOT removed, not adjacent

Part 8: REAL WORLD EXAMPLES

Find duplicate lines in a config file:
perl -ne 'print "$.: $_" if $seen{$_}++' config.ini
Includes line numbers so you can find them.

Duplicate entries in /etc/hosts:

perl -ane 'print if $seen{$F[1]}++' /etc/hosts
Checks hostname field (second column).

Duplicate SSH keys:

perl -ne 'print if $seen{(split)[1]}++' ~/.ssh/authorized_keys
Keys are in the second field. Finds if someone's key is listed twice.

Duplicate cron jobs:

crontab -l | perl -ne 'print if $seen{$_}++'

Part 9: MEMORY CONSIDERATIONS

The hash stores every unique line. For huge files with many unique lines, this eats memory.

For massive files, consider:

sort file.txt | uniq -d
External sort handles files larger than RAM. But loses original order.
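For comparison on a small input (the same pipeline scales to files that don't fit in memory):

```shell
# Sorting groups the repeats; uniq -d keeps one copy of each duplicated line.
printf 'b\na\nb\nc\na\nb\n' | sort | uniq -d
# a
# b
```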

Or process in chunks if you only care about recent duplicates:

tail -10000 huge.log | perl -ne 'print if $seen{$_}++'

Part 10: THE FAMILY

These patterns are related:
perl -ne 'print if $seen{$_}++'                            # All duplicates
perl -ne 'print if $seen{$_}++ == 1'                       # Each duplicate once
perl -ne 'print unless $seen{$_}++'                        # Unique lines only
perl -ne '$c{$_}++; END{print for grep{$c{$_}>1}keys%c}'   # Dupes, one each
The post-increment idiom is the heart of all of them.

Part 11: WHY POST-INCREMENT

Why $seen{$_}++ instead of ++$seen{$_}?

Pre-increment returns the NEW value:

++$seen{$_} # Returns 1 on first encounter (true!)
Post-increment returns the OLD value:
$seen{$_}++ # Returns 0 on first encounter (false!)
With pre-increment, everything prints. The ++ happens before the return. Post-increment is what makes the logic work.

Part 12: COMBINING WITH OTHER PATTERNS

Duplicates matching a pattern:
perl -ne 'print if /error/i && $seen{$_}++'
Only errors, and only repeated ones.

Duplicates across multiple files:

perl -ne 'print "$ARGV: $_" if $seen{$_}++' *.log
Shows which file the duplicate is in.

Duplicates within a time window (log files):

perl -ane '
    $t = $F[0];
    %seen = () if $t ne $last_t;
    print if $seen{$_}++;
    $last_t = $t
' timestamped.log
Resets the seen hash when the timestamp changes.
        $seen{$_}++
             |
        +----+----+
        |         |
      first     again
        |         |
      skip      print

  The post-increment trick

Created By: Wildcard Wizard. Copyright 2026