
Data Refinement: sort, uniq, wc, cut, and tr
Master the auxiliary tools for data manipulation. Learn to slice columns with cut, translate characters with tr, and perform frequency analysis with sort and uniq. Discover how to count lines, words, and bytes with wc.
Sorting and Filtering: Mastering the Data Stream
In the previous lesson, we learned about the "Holy Trinity" of search and transformation. But often, the output of those tools is messy or repetitive. To turn raw text into a professional report or a clean dataset, you need the "Refinement Tools": sort, uniq, wc, cut, and tr.
These commands are the scalpel and sandpaper of the Linux command line. They allow you to polish your data until exactly what you need remains.
1. cut: Slicing the Columns
If you have a file with many columns (like a CSV or a log) and you only want one specific piece of info, cut is the fastest tool.
# Get the 1st column of the passwd file (usernames)
# -d is the delimiter (colon), -f is the field number
cut -d ':' -f 1 /etc/passwd
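cut can also grab several fields at once. A quick sketch, assuming a hypothetical comma-separated file called sales.csv:
# Get the 2nd and 3rd columns of a CSV (sales.csv is a placeholder)
cut -d ',' -f 2,3 sales.csv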
2. sort: Bringing Order to Chaos
Output in Linux is often unsorted. sort puts lines in alphabetical or numerical order.
- sort -n: Numerical sort (stops 10 from coming before 2).
- sort -r: Reverse sort.
- sort -k 2: Sort by the second column.
# Sort a list of numbers correctly
cat numbers.txt | sort -n
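These flags combine naturally. A small sketch, assuming a hypothetical scores.txt with a name in column 1 and a numeric score in column 2:
# Sort by the 2nd column numerically, highest first
sort -k 2 -n -r scores.txt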
3. uniq: Removing and Counting Duplicates
uniq removes adjacent duplicate lines.
Pro Tip: uniq only works if the duplicates are touching. This is why we almost always use sort before uniq.
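You can watch this rule in action with a tiny sketch:
# Without sort, the non-adjacent duplicate 'a' survives
printf 'a\nb\na\n' | uniq           # prints: a, b, a
printf 'a\nb\na\n' | sort | uniq    # prints: a, b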
# The Frequency Combo: Sort -> Count Uniques -> Sort by Count
cat access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr
- uniq -c: Counts how many times each line appeared.
- uniq -d: Only shows the duplicate lines.
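For example, a minimal duplicate check on a mailing list (emails.txt is a placeholder):
# Show only the lines that appear more than once
sort emails.txt | uniq -d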
4. wc: The Word Counter
wc (Word Count) is used to measure the size of your data.
- wc -l: Count lines (most common).
- wc -w: Count words.
- wc -c: Count bytes.
# How many users are on this system?
cat /etc/passwd | wc -l
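Because wc counts whatever flows through the pipe, it works on command output too, not just files:
# How many entries are in /etc?
ls /etc | wc -l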
5. tr: The Translator
tr is used for deleting or replacing characters. It works on a character-by-character basis.
# Convert a file to all UPPERCASE
cat data.txt | tr 'a-z' 'A-Z'
# Delete all newlines to make a single long string
cat names.txt | tr -d '\n'
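tr -s ("squeeze") collapses runs of a repeated character into one, which is handy for cleaning up ragged whitespace. A quick sketch (messy.txt is a placeholder):
# Collapse runs of spaces into single spaces
cat messy.txt | tr -s ' '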
6. Practical: Building a Log Frequency Report
Let's combine everything we've learned into a single, high-powered pipeline. Imagine you want to find the top 5 most frequent errors in your system log.
# 1. Read log
# 2. Filter for 'Error'
# 3. Cut to get the error message column
# 4. Sort alphabetically (to group them)
# 5. Count occurrences
# 6. Sort numerically (highest first)
# 7. Take the top 5
grep -i "error" /var/log/syslog | cut -d ':' -f 4 | sort | uniq -c | sort -nr | head -n 5
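Field positions vary between distributions and log formats, so it pays to preview a single line before trusting a field number:
# Confirm which field actually holds the message on your system
head -n 1 /var/log/syslog
head -n 1 /var/log/syslog | cut -d ':' -f 4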
7. Example: A Data Quality Auditor (Python)
If you are uploading data to a database, you need to ensure there are no duplicates. Here is a Python script that mimics sort | uniq -d to find duplicates in a large file, reading it line by line rather than loading the whole thing into memory at once.
import os

def find_duplicates_efficiently(file_path):
    """
    Finds and reports duplicate lines in a file,
    reading it one line at a time.
    """
    seen = set()
    duplicates = set()
    try:
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if line in seen:
                    duplicates.add(line)
                else:
                    seen.add(line)
        return list(duplicates)
    except FileNotFoundError:
        return []

if __name__ == "__main__":
    # Example usage: create a small sample file if one doesn't exist
    target_file = "emails.txt"
    if not os.path.exists(target_file):
        with open(target_file, "w") as f:
            f.write("user@example.com\nadmin@host.com\nuser@example.com\n")

    dupes = find_duplicates_efficiently(target_file)
    if dupes:
        print(f"Found {len(dupes)} duplicate entries:")
        for d in dupes:
            print(f"  [DUP] {d}")
    else:
        print("No duplicates found. Data is clean.")
8. Summary
Mastering data refinement allows you to turn "Noise" into "Actionable Intelligence."
- Use cut to extract specific columns.
- Use tr for character cleanup and case switching.
- Always sort before you uniq.
- Use wc -l to verify the scale of your processing.
In the next lesson, we will learn how to pack all this data up for shipping using File Compression and Archiving with tar, gzip, and zip.
Quiz Questions
- How do you find the line count of a specific directory's file list?
- Why must you run sort before uniq?
- Which cut flag defines the character used to separate columns?
Continue to Lesson 6: File Compression and Archiving with tar, gzip, and zip.