Text Processing [sort, uniq & tr]

Introduction

Raw command output is rarely clean. Logs contain duplicates, data streams carry unwanted characters, and sorted lists only become meaningful once you isolate what’s unique or discard what’s redundant. sort, uniq, and tr are the tools that tame this noise — they reorganise, deduplicate, and reshape text at the character level before it reaches your analysis pipeline.

sort — Line Ordering

sort rearranges lines alphabetically, numerically, or by custom keys. It’s the prerequisite for uniq, which only detects duplicates in adjacent lines.

Basic Usage

sort data.txt                 # Alphabetical (default)
sort -n data.txt              # Numerical
sort -r data.txt              # Reverse order
sort -u data.txt              # Sort and deduplicate in one pass

Sorting by Specific Fields

When lines contain structured data — CSVs, logs, /etc/passwd — sort by a particular column:

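sort -t: -k3,3n /etc/passwd         # Sort numerically by UID (field 3, colon-delimited)
sort -t, -k2,2 data.csv             # Sort by second column (comma-delimited)

-t sets the field delimiter. -k selects the sort key. A bare -k3 runs from field 3 to the end of the line, so write -k3,3 to restrict the key to a single field.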
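As a minimal sketch of a single-field numeric key, the records below are made-up passwd-style lines and by_uid is a hypothetical helper name, not part of any standard tool:

```shell
# Hypothetical passwd-style records (name:password:UID:GID), invented for this sketch.
records='alice:x:1000:1000
root:x:0:0
bob:x:99:99'

# -k3,3 limits the key to field 3 only; the trailing n makes that key numeric.
by_uid() { sort -t: -k3,3n; }

printf '%s\n' "$records" | by_uid
# root:x:0:0
# bob:x:99:99
# alice:x:1000:1000
```

Without the n modifier the UIDs would be compared as text, and 1000 would sort ahead of 99.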

Key Flags

Flag	Description
`-n`	Numeric sort
`-r`	Reverse order
`-u`	Deduplicate: output only the first of each run of equal lines
`-h`	Human-numeric sort — understands K, M, G suffixes
`-t <char>`	Set field delimiter
`-k <field>`	Sort by a specific field
`-f`	Fold — case-insensitive sorting
`-R`	Random order by hashing keys (identical lines stay grouped; use shuf for a true shuffle)
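To illustrate the difference between -n and -h, here is a small sketch using made-up size strings of the kind du -h prints (the sizes variable is only for illustration, and -h assumes GNU or a compatible sort):

```shell
# Made-up human-readable sizes.
sizes='2K
1G
512M'

printf '%s\n' "$sizes" | sort -n   # ignores the suffix and compares 1, 2, 512
# 1G
# 2K
# 512M

printf '%s\n' "$sizes" | sort -h   # takes the K/M/G suffix into account
# 2K
# 512M
# 1G
```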

Locale awareness: sort respects the system locale by default, which can cause unexpected ordering with special characters. Use LC_ALL=C sort for consistent byte-level sorting across systems.
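A quick way to see byte-level ordering is to sort mixed-case words. This is only a sketch: bytewise_sort is a hypothetical wrapper name, and the words are made up.

```shell
# Force the C locale so comparison is by byte value, regardless of system settings.
bytewise_sort() { LC_ALL=C sort "$@"; }

printf '%s\n' apple Banana cherry | bytewise_sort
# Banana
# apple
# cherry
```

Uppercase ASCII letters have lower byte values than lowercase ones, so Banana comes first here. Many UTF-8 locales would list apple first instead.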

uniq — Uniqueness Isolation

uniq filters adjacent duplicate lines from its input. That adjacency requirement is critical — it only compares each line to the one immediately before it.

The Sorting Problem

# This does NOT catch all duplicates:
uniq data.txt

# This does:
sort data.txt | uniq

Without sorting first, identical lines separated by other content will both survive. sort brings duplicates together so uniq can see them.
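The adjacency rule is easy to confirm on a three-line sample (the sample variable and its contents are made up for this sketch):

```shell
# Two copies of "apple" separated by another line.
sample='apple
banana
apple'

printf '%s\n' "$sample" | uniq          # duplicates are not adjacent, nothing is removed
# apple
# banana
# apple

printf '%s\n' "$sample" | sort | uniq   # sorting makes them adjacent first
# apple
# banana
```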

Filtering Modes

sort data.txt | uniq -c       # Count occurrences of each line
sort data.txt | uniq -d       # Show one copy of each line that appears more than once
sort data.txt | uniq -u       # Show only lines that appear exactly once
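A short sketch of -d and -u side by side, on an invented list (the fruits variable is an assumption for illustration):

```shell
# cherry x3, apple x2, banana x1
fruits='cherry
apple
cherry
banana
apple
cherry'

printf '%s\n' "$fruits" | sort | uniq -d   # one copy of each repeated line
# apple
# cherry

printf '%s\n' "$fruits" | sort | uniq -u   # lines that occur exactly once
# banana
```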

Count Mode Deep Dive

The -c flag transforms uniq into a frequency analyser:

sort data.txt | uniq -c

      3 apple
      1 banana
      7 cherry
      2 date

Pipe into sort -rn to rank by frequency — highest first:

sort data.txt | uniq -c | sort -rn

      7 cherry
      3 apple
      2 date
      1 banana

This sort | uniq -c | sort -rn pattern appears constantly in log analysis, traffic auditing, and forensic enumeration. Commit it to muscle memory.

Skipping Fields and Characters

When lines have a prefix you want to ignore during comparison — timestamps, for example — skip over them:

sort -k2 data.txt | uniq -f1       # Skip first field when comparing
sort data.txt | uniq -s10          # Skip first 10 characters when comparing
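As a sketch of field skipping, assume log-like lines whose first field is a timestamp. The events data and the helper name dedupe_ignoring_first_field are hypothetical.

```shell
# Made-up events: the first field is a timestamp, the rest is the message.
events='10:03 logout
10:01 login ok
10:02 login ok'

# Sort on everything from field 2 onward, then compare lines while ignoring field 1.
dedupe_ignoring_first_field() { sort -k2 | uniq -f1; }

printf '%s\n' "$events" | dedupe_ignoring_first_field
# 10:01 login ok
# 10:03 logout
```

uniq keeps the first line of each group, so the earliest timestamp survives for each distinct message.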

Key Flags

Flag	Description
`-c`	Prefix each line with its occurrence count
`-d`	Print only lines that appear more than once
`-u`	Print only lines that appear exactly once
`-i`	Case-insensitive comparison
`-f <n>`	Skip the first `n` fields before comparing
`-s <n>`	Skip the first `n` characters before comparing

tr — Character Translation

tr operates at the character level. It translates, deletes, or squeezes characters from stdin. It doesn’t read files directly — it always works on piped or redirected input.

Character Translation

Map one character set to another positionally:

echo "hello world" | tr 'a-z' 'A-Z'
# Output: HELLO WORLD

The two sets are positional — the first character in the source maps to the first in the destination, and so on.
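Two tiny illustrations of positional mapping, using made-up input strings:

```shell
# a→x, b→y, c→z; characters outside the source set pass through unchanged.
echo "a-b-c" | tr 'abc' 'xyz'
# x-y-z

# A one-character set on each side swaps a single delimiter.
echo "2024/01/15" | tr '/' '-'
# 2024-01-15
```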

ROT13

ROT13 is the classic Caesar cipher that rotates each letter by 13 positions. Because the alphabet has 26 letters, applying it twice returns the original text, so ROT13 is its own inverse:

echo "uryyb jbeyq" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
# Output: hello world
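The self-inverse property can be checked by wrapping the same tr call in a function (rot13 is a name chosen for this sketch) and applying it twice:

```shell
# The same translation as above, wrapped so it can be chained.
rot13() { tr 'A-Za-z' 'N-ZA-Mn-za-m'; }

echo "hello world" | rot13
# uryyb jbeyq

echo "hello world" | rot13 | rot13
# hello world
```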

Deleting Characters

Strip specific characters from a stream:

echo "hello world" | tr -d ' '
# Output: helloworld

# Put the hyphen last: between two characters it would denote a range.
echo "phone: +1 (555) 123-4567" | tr -d '()+-'
# Output: phone: 1 555 1234567

For simple character removal, this is usually lighter and easier to read than reaching for a regex tool such as sed.
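One pitfall when listing characters to delete: a hyphen between two characters is read as a range, not as a literal hyphen. A small sketch with a made-up input string:

```shell
# ')-+' is the range from ")" to "+", which covers ) * + and leaves the hyphen alone.
echo "1-2*3" | tr -d ')-+'
# 1-23

# With the hyphen at the end of the set it is taken literally.
echo "1-2*3" | tr -d ')+-'
# 12*3
```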

Squeezing Repeated Characters

Collapse consecutive identical characters into a single instance:

echo "aaa   bbb   ccc" | tr -s 'a-z '
# Output: a b c

Normalise whitespace by collapsing multiple spaces into one:

echo "too    many     spaces" | tr -s ' '
# Output: too many spaces

Character Classes

tr supports POSIX character classes for broader matching:

echo "Hello, World! 123" | tr -d '[:digit:]'
# Output: Hello, World! 

echo "Hello, World! 123" | tr '[:lower:]' '[:upper:]'
# Output: HELLO, WORLD! 123

Class	Matches
`[:alnum:]`	Letters and digits
`[:alpha:]`	Letters only
`[:digit:]`	Digits 0–9
`[:lower:]`	Lowercase letters
`[:upper:]`	Uppercase letters
`[:space:]`	Whitespace characters
`[:punct:]`	Punctuation

Complement Mode

The -c flag complements the set — it operates on everything not specified:

# Delete everything that is NOT printable or a newline:
tr -cd '[:print:]\n' < corrupted.txt > clean.txt

# Keep only digits (newlines are deleted too, so the output is one unterminated line):
tr -cd '[:digit:]' < messy.txt > numbers_only.txt
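A sketch of complement-and-delete on a made-up string. Adding \n to the kept set preserves line breaks. The digits_only helper name is an assumption for illustration.

```shell
# Delete everything except digits and newlines.
digits_only() { tr -cd '[:digit:]\n'; }

printf 'Order #A-1042\n' | digits_only
# 1042
```

Leaving \n out of the set would still yield 1042, but without the trailing newline.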

Line Ending Conversion

Convert Windows line endings (CRLF) to Unix (LF):

tr -d '\r' < dos_file.txt > unix_file.txt
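One way to confirm the carriage returns are gone is to compare byte counts before and after. This is a sketch on generated input; crlf_to_lf is a hypothetical helper name.

```shell
# Strip every carriage return from the stream.
crlf_to_lf() { tr -d '\r'; }

printf 'one\r\ntwo\r\n' | wc -c                # 10 bytes: two 3-letter words plus two CRLF pairs
printf 'one\r\ntwo\r\n' | crlf_to_lf | wc -c   # 8 bytes: the two \r characters are gone
```

Note that this removes every \r in the input, not only those that precede a newline.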

Key Flags

Flag	Description
`-d`	Delete characters in the specified set
`-s`	Squeeze — collapse repeated characters
`-c`	Complement — operate on characters NOT in the set
`-t`	Truncate the source set to the length of the destination set
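The effect of -t shows up only when the destination set is shorter than the source set. The sketch below assumes GNU tr, which otherwise pads the destination by repeating its last character. Other implementations may behave differently or lack -t.

```shell
# Without -t, GNU tr extends 'xy' to 'xyyyy', so c, d and e all become y.
echo "abcde" | tr 'abcde' 'xy'
# xyyyy

# With -t, the source set is cut down to 'ab', so c, d and e pass through unchanged.
echo "abcde" | tr -t 'abcde' 'xy'
# xycde
```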

Chaining These Tools

These three commands are rarely used in isolation. Their power comes from composition — and from combining with the search and extraction tools covered in the previous note.

Frequency Analysis Pipeline

The single most useful pattern for log analysis:

sort data.txt | uniq -c | sort -rn

Clean and Deduplicate a Wordlist

tr '[:upper:]' '[:lower:]' < wordlist.txt | sort -u > clean_wordlist.txt

Normalise and Count File Extensions

# -name '*.*' skips files that have no extension
find . -type f -name '*.*' | sed 's/.*\.//' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

Strip Junk, Sort, Isolate Uniques

tr -d '[:punct:]' < raw.txt | tr -s ' ' '\n' | sort | uniq -u

This removes punctuation, splits words onto individual lines, sorts them, and prints only words that appear exactly once.
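The same pipeline can be tried on a single made-up sentence. In this sketch unique_words is a hypothetical wrapper that reads stdin instead of raw.txt:

```shell
# Same stages as above: drop punctuation, one word per line, sort, keep singletons.
unique_words() { tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -u; }

printf 'the cat saw the dog. the dog ran!\n' | unique_words
# cat
# ran
# saw
```

Here "the" and "dog" occur more than once, so only the three remaining words are printed.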

Character Frequency Count

fold -w1 data.txt | sort | uniq -c | sort -rn

fold -w1 splits input into one character per line, then sort | uniq -c counts each.
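A sketch of the character count on one made-up word. char_freq is a hypothetical wrapper reading stdin, and the exact padding in front of each count depends on the uniq implementation.

```shell
# One character per line, then the usual frequency pipeline.
char_freq() { fold -w1 | sort | uniq -c | sort -rn; }

printf 'banana\n' | char_freq
#       3 a
#       2 n
#       1 b
```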

Companion commands: The pipeline examples above use find and sed as building blocks. Their flags and usage are covered in the previous note; here they serve as glue between the sorting, deduplication, and translation stages.

Key Distinction

Tool	Operates on	What it does
`sort`	Lines	Reorders lines by key
`uniq`	Adjacent lines	Filters or counts duplicates
`tr`	Characters	Translates, deletes, or squeezes characters

sort and uniq work on lines. tr works on characters. Don’t confuse them — tr -d removes characters within lines; uniq removes entire duplicate lines.
