Text Processing [sort, uniq & tr]
Introduction
Raw command output is rarely clean. Logs contain duplicates, data streams carry unwanted characters, and sorted lists only become meaningful once you isolate what’s unique or discard what’s redundant. sort, uniq, and tr are the tools that tame this noise — they reorganise, deduplicate, and reshape text at the character level before it reaches your analysis pipeline.
sort — Line Ordering
sort rearranges lines alphabetically, numerically, or by custom keys. It’s the prerequisite for uniq, which only detects duplicates in adjacent lines.
Basic Usage
sort data.txt # Alphabetical (default)
sort -n data.txt # Numerical
sort -r data.txt # Reverse order
sort -u data.txt # Sort and deduplicate in one pass
Sorting by Specific Fields
When lines contain structured data — CSVs, logs, /etc/passwd — sort by a particular column:
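sort -t: -k3,3n /etc/passwd # Sort by UID (field 3 only, colon-delimited, numeric)
sort -t, -k2,2 data.csv # Sort by second column only (comma-delimited)
-t sets the field delimiter. -k selects the sort key: -k2 alone runs from field 2 to the end of the line, while -k2,2 restricts the key to field 2.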
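As a small illustration of -t and -k, the sketch below sorts a made-up colon-delimited sample by its third field. The record names and numbers are invented for the example. The key is written as -k3,3n so that only field 3 is compared, and numerically.

```shell
# Hypothetical sample records in name:x:uid form (invented for this example).
records=$(printf '%s\n' 'carol:x:1002' 'root:x:0' 'alice:x:1000' 'daemon:x:1')

# -t: splits on colons; -k3,3n compares field 3 only, as a number.
by_uid=$(printf '%s\n' "$records" | sort -t: -k3,3n)

printf '%s\n' "$by_uid"
# root:x:0
# daemon:x:1
# alice:x:1000
# carol:x:1002
```

Dropping the n modifier would compare the field as text, which places 1000 and 1002 ahead of any two-digit value.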
Key Flags
| Flag | Description |
|---|---|
| -n | Numeric sort |
| -r | Reverse order |
| -u | Output only the first of each set of equal lines |
| -h | Human-numeric sort: understands K, M, G suffixes |
| -t <char> | Set field delimiter |
| -k <field> | Sort by a key starting at the given field |
| -f | Fold lowercase to uppercase: case-insensitive sorting |
| -R | Random order by hashing keys, so identical lines stay grouped; use shuf for a true shuffle |
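The difference between the default, -n, and -h orderings is easiest to see on the same kind of input. The following is a minimal sketch with invented values; it assumes a sort that supports -h (GNU sort does) and pins LC_ALL=C so the text ordering is repeatable.

```shell
# Default sort compares text, so "100" lands before "9".
lexical=$(printf '%s\n' 10 9 100 | LC_ALL=C sort)

# -n compares by numeric value instead.
numeric=$(printf '%s\n' 10 9 100 | LC_ALL=C sort -n)

# -h (assumed available, as in GNU sort) ranks by size suffix first.
human=$(printf '%s\n' 2K 1G 10M | LC_ALL=C sort -h)

# Join each result onto one line for display.
printf 'default: %s\n' "$(printf '%s' "$lexical" | tr '\n' ' ')"   # default: 10 100 9
printf 'numeric: %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"   # numeric: 9 10 100
printf 'human:   %s\n' "$(printf '%s' "$human" | tr '\n' ' ')"     # human:   2K 10M 1G
```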
Locale awareness:
sort respects the system locale by default, which can cause unexpected ordering with special characters. Use LC_ALL=C sort for consistent byte-level sorting across systems.
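To make the byte-level behaviour concrete, here is a short sketch using an invented mixed-case word list. Only the LC_ALL=C result is shown, since the ordering under other locales depends on what is installed on a given system.

```shell
# Mixed-case sample input (invented for this example).
words=$(printf '%s\n' banana Apple cherry apple Banana)

# Under LC_ALL=C, comparison is by byte value, so every uppercase
# ASCII letter sorts before every lowercase one.
byte_order=$(printf '%s\n' "$words" | LC_ALL=C sort)

printf '%s\n' "$byte_order"
# Apple
# Banana
# apple
# banana
# cherry
```

Adding -f to the same command would fold case during comparison and bring Apple and apple next to each other.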
uniq — Uniqueness Isolation
uniq filters adjacent duplicate lines from its input. That adjacency requirement is critical — it only compares each line to the one immediately before it.
The Sorting Problem
# This does NOT catch all duplicates:
uniq data.txt
# This does:
sort data.txt | uniq
Without sorting first, identical lines separated by other content will both survive. sort brings duplicates together so uniq can see them.
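The adjacency rule can be checked directly by counting surviving lines with and without the sort step. This sketch uses an invented five-line input in which no duplicate is adjacent.

```shell
# Sample input where "apple" and "banana" repeat, but never on adjacent lines.
input=$(printf '%s\n' apple banana apple cherry banana)

# uniq alone finds no adjacent repeats, so all five lines survive.
unsorted_count=$(printf '%s\n' "$input" | uniq | wc -l)

# Sorting first groups the repeats, leaving three distinct lines.
sorted_count=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq | wc -l)

# $(( )) strips any padding that wc may add around the number.
echo "uniq alone: $((unsorted_count)) lines; sort | uniq: $((sorted_count)) lines"
# uniq alone: 5 lines; sort | uniq: 3 lines
```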
Filtering Modes
sort data.txt | uniq -c # Count occurrences of each line
sort data.txt | uniq -d # Show only lines that appear more than once
sort data.txt | uniq -u # Show only lines that appear exactly once
Count Mode Deep Dive
The -c flag transforms uniq into a frequency analyser:
sort data.txt | uniq -c
3 apple
1 banana
7 cherry
2 date
Pipe into sort -rn to rank by frequency, highest first:
sort data.txt | uniq -c | sort -rn
7 cherry
3 apple
2 date
1 banana
This sort | uniq -c | sort -rn pattern appears constantly in log analysis, traffic auditing, and forensic enumeration. Commit it to muscle memory.
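As a sketch of how the pattern might look in log analysis, the example below ranks clients in a handful of invented access-log lines. The addresses, paths, and log layout are assumptions made for illustration, not a real log format.

```shell
# Hypothetical access-log lines (addresses and paths invented for illustration).
log=$(printf '%s\n' \
  '10.0.0.5 GET /index.html' \
  '10.0.0.9 GET /login' \
  '10.0.0.5 GET /about' \
  '10.0.0.7 GET /index.html' \
  '10.0.0.5 POST /login' \
  '10.0.0.9 GET /index.html')

# Take the first field, count each distinct value, rank by count (highest first).
ranked=$(printf '%s\n' "$log" | cut -d' ' -f1 | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn)

printf '%s\n' "$ranked"

# The first ranked line holds the busiest client; its second
# whitespace-separated field is the address itself.
top_client=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $2}')
echo "busiest client: $top_client"   # busiest client: 10.0.0.5
```

The exact padding in front of each count varies between uniq implementations, which is why the address is pulled out by field rather than by character position.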
Skipping Fields and Characters
When lines have a prefix you want to ignore during comparison — timestamps, for example — skip over them:
sort -k2 data.txt | uniq -f1 # Skip first field when comparing
sort -k1.11 data.txt | uniq -s10 # Sort from character 11 onward, then skip first 10 characters when comparing
Key Flags
| Flag | Description |
|---|---|
| -c | Prefix each line with its occurrence count |
| -d | Print only lines that appear more than once |
| -u | Print only lines that appear exactly once |
| -i | Case-insensitive comparison |
| -f <n> | Skip the first n fields before comparing |
| -s <n> | Skip the first n characters before comparing |
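Field skipping is easier to follow with timestamps in front of repeated messages. The sketch below uses invented log lines; sorting on field 2 onward groups equal messages, and uniq -f1 then ignores the timestamp when comparing.

```shell
# Hypothetical log lines: a timestamp field followed by a message.
events=$(printf '%s\n' \
  '09:00:01 disk full' \
  '09:00:02 login ok' \
  '09:05:17 disk full' \
  '09:07:40 login ok' \
  '09:09:03 disk full')

# Sort on field 2 onward so equal messages become adjacent, then let uniq
# skip field 1 (the timestamp) and count each group. awk drops the count
# padding and the timestamp of the first line in each group.
summary=$(printf '%s\n' "$events" | LC_ALL=C sort -k2 | uniq -f1 -c | awk '{print $1, $3, $4}')

printf '%s\n' "$summary"
# 3 disk full
# 2 login ok
```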
tr — Character Translation
tr operates at the character level. It translates, deletes, or squeezes characters from stdin. It doesn’t read files directly — it always works on piped or redirected input.
Character Translation
Map one character set to another positionally:
echo "hello world" | tr 'a-z' 'A-Z'
# Output: HELLO WORLD
The first character in the source set maps to the first in the destination, the second to the second, and so on.
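A smaller mapping makes the positional rule easier to trace by hand. This is a minimal sketch with invented inputs; ranges such as a-y simply expand to the same kind of positional list.

```shell
# a->1, b->2, c->3; characters outside the source set pass through unchanged.
mapped=$(echo "cabbage" | tr 'abc' '123')
echo "$mapped"    # 31221ge

# Ranges pair up the same way: a->b, b->c, ... y->z (a shift by one).
shifted=$(echo "hello" | tr 'a-y' 'b-z')
echo "$shifted"   # ifmmp
```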
ROT13
The classic Caesar cipher rotating by 13 positions. Apply it twice to return to the original — ROT13 is its own inverse:
echo "uryyb jbeyq" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
# Output: hello world
Deleting Characters
Strip specific characters from a stream:
echo "hello world" | tr -d ' '
# Output: helloworld
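echo "phone: +1 (555) 123-4567" | tr -d '()+-'
# Output: phone: 1 555 1234567
For simple character removal, this is usually lighter and cleaner than invoking a regex engine. The hyphen goes last in the set because, between two characters (as in ')-+'), tr reads it as a range.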
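One detail worth checking whenever a set contains a hyphen is where the hyphen sits. The sketch below compares two spellings of what looks like the same set, using an invented phone number.

```shell
number='+1 (555) 123-4567'

# ')-+' is read as the range ')' through '+', that is ')', '*' and '+',
# so the hyphen itself is not in the set and survives.
range_result=$(printf '%s\n' "$number" | tr -d '()-+')

# Placing the hyphen last makes it a literal member of the set.
literal_result=$(printf '%s\n' "$number" | tr -d '()+-')

echo "$range_result"     # 1 555 123-4567
echo "$literal_result"   # 1 555 1234567
```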
Squeezing Repeated Characters
Collapse consecutive identical characters into a single instance:
echo "aaa bbb ccc" | tr -s 'a-z '
# Output: a b c
Normalise whitespace by collapsing multiple spaces into one:
echo "too    many    spaces" | tr -s ' '
# Output: too many spaces
Character Classes
tr supports POSIX character classes for broader matching:
echo "Hello, World! 123" | tr -d '[:digit:]'
# Output: Hello, World!
echo "Hello, World! 123" | tr '[:lower:]' '[:upper:]'
# Output: HELLO, WORLD! 123
| Class | Matches |
|---|---|
| [:alnum:] | Letters and digits |
| [:alpha:] | Letters only |
| [:digit:] | Digits 0–9 |
| [:lower:] | Lowercase letters |
| [:upper:] | Uppercase letters |
| [:space:] | Whitespace characters |
| [:punct:] | Punctuation |
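Several classes can be combined in one pipeline to normalise free-form text. The following is a rough sketch on an invented input string: it lowercases, deletes punctuation and digits in a single set, then squeezes the whitespace left behind.

```shell
# Invented sample string.
raw='Order #42: Shipped!'

# Lowercase, delete punctuation and digits, then squeeze repeated whitespace.
normalised=$(printf '%s\n' "$raw" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:][:digit:]' \
  | tr -s '[:space:]')

echo "$normalised"   # order shipped
```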
Complement Mode
The -c flag complements the set — it operates on everything not specified:
# Delete everything that is NOT printable or a newline:
tr -cd '[:print:]\n' < corrupted.txt > clean.txt
# Keep only digits (this also strips newlines, so the output is one unbroken run):
tr -cd '[:digit:]' < messy.txt > numbers_only.txt
Line Ending Conversion
Convert Windows line endings (CRLF) to Unix (LF):
tr -d '\r' < dos_file.txt > unix_file.txt
Key Flags
| Flag | Description |
|---|---|
| -d | Delete characters in the specified set |
| -s | Squeeze: collapse repeated characters |
| -c | Complement: operate on characters NOT in the set |
| -t | Truncate the source set to the length of the destination set (GNU tr) |
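The -c and -d flags are most often seen together, and the effect of deleting carriage returns can be confirmed by counting bytes. Both snippets below are sketches on invented input rather than real files.

```shell
# -c inverts the set and -d deletes, so -cd keeps ONLY the listed characters.
# The newline is kept here as well, so each line stays separate.
digits=$(printf 'tel: 555-0100\nfax: 555-0199\n' | tr -cd '[:digit:]\n')
printf '%s\n' "$digits"
# 5550100
# 5550199

# Deleting carriage returns turns CRLF line endings into plain LF:
# two lines lose one byte each.
crlf_bytes=$(printf 'one\r\ntwo\r\n' | wc -c)
lf_bytes=$(printf 'one\r\ntwo\r\n' | tr -d '\r' | wc -c)
echo "bytes before: $((crlf_bytes)), after: $((lf_bytes))"   # bytes before: 10, after: 8
```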
Chaining These Tools
These three commands are rarely used in isolation. Their power comes from composition — and from combining with the search and extraction tools covered in the previous note.
Frequency Analysis Pipeline
The single most useful pattern for log analysis:
sort data.txt | uniq -c | sort -rn
Clean and Deduplicate a Wordlist
tr '[:upper:]' '[:lower:]' < wordlist.txt | sort -u > clean_wordlist.txt
Normalise and Count File Extensions
find . -type f -name '*.*' | sed 's/.*\.//' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
Strip Junk, Sort, Isolate Uniques
tr -d '[:punct:]' < raw.txt | tr -s ' ' '\n' | sort | uniq -u
This removes punctuation, splits words onto individual lines, sorts them, and prints only words that appear exactly once.
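Here is the same idea run on a two-line invented text so the result can be checked by eye. The lowercase step is an addition made for this sketch, so that "The" and "the" count as one word.

```shell
# Hypothetical input text (invented for this example).
text=$(printf '%s\n' 'The cat, the dog.' 'A cat!')

# Lowercase (extra step, see above), strip punctuation, put one word
# per line, sort, and keep only the words that occur exactly once.
once=$(printf '%s\n' "$text" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:]' \
  | tr -s ' ' '\n' \
  | LC_ALL=C sort \
  | uniq -u)

printf '%s\n' "$once"
# a
# dog
```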
Character Frequency Count
fold -w1 data.txt | sort | uniq -c | sort -rn
fold -w1 splits input into one character per line, then sort | uniq -c counts each.
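A single short word is enough to see the character count at work. This sketch feeds an invented word through the same pipeline and uses awk only to trim the padding that uniq -c places before each count.

```shell
# Split a word into one character per line, then count and rank.
freq=$(printf 'banana\n' \
  | fold -w1 \
  | LC_ALL=C sort \
  | uniq -c \
  | LC_ALL=C sort -rn \
  | awk '{print $1, $2}')

printf '%s\n' "$freq"
# 3 a
# 2 n
# 1 b
```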
Companion commands: The pipeline examples above use find, sed, and awk as building blocks. Their flags and usage are covered in the previous note; here they serve as glue between the sorting, deduplication, and translation stages.
Key Distinction
| Tool | Operates on | What it does |
|---|---|---|
| sort | Lines | Reorders lines by key |
| uniq | Adjacent lines | Filters or counts duplicates |
| tr | Characters | Translates, deletes, or squeezes characters |
sort and uniq work on lines. tr works on characters. Don’t confuse them — tr -d removes characters within lines; uniq removes entire duplicate lines.