
The Fortune Teller: SMART Monitoring and Health
Predict the future of your hardware. Master the 'SMART' self-test system built into every modern hard drive. Learn to use 'smartctl' to identify failing disks before they crash and use 'hdparm' to benchmark your storage performance.
Disk Health: Predicting the Crash
In the previous lesson, we learned how to survive a disk failure using RAID. But wouldn't it be better to know the failure is coming before it happens?
Modern drives, both HDDs and SSDs, contain firmware that monitors the drive's own health. This system is called S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology). It tracks things like "Reallocated Sectors" (how many times the disk had to move data because a spot went bad) and "Power-On Hours."
In this lesson, we will learn to read these "Vital Signs" and benchmark our disk performance.
1. smartctl: The Doctor's Scope
The tool we use in Linux is smartctl, which ships in the smartmontools package.
The Quick Check:
# See the basic health status
sudo smartctl -H /dev/sda
# Expected Output: "SMART overall-health self-assessment test result: PASSED"
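If you call smartctl from a script, you do not have to search the text for "PASSED". The smartctl man page documents the exit status as a bitmask, and the sketch below decodes it. The function name and the message wording are illustrative, not part of smartmontools.

```python
# Bit meanings follow the EXIT STATUS section of the smartctl man page.
# The message text is paraphrased for this sketch.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device could not be opened",
    2: "a SMART command failed or returned bad data",
    3: "SMART status check returned DISK FAILING",
    4: "a pre-fail attribute is at or below its threshold",
    5: "an attribute was at or below its threshold in the past",
    6: "the device error log contains errors",
    7: "the self-test log contains errors",
}


def decode_exit_status(status):
    """Return the list of conditions flagged in a smartctl exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    # 0 means nothing was flagged; 8 (bit 3) means the drive reports itself as failing.
    for status in (0, 8, 72):
        print(status, decode_exit_status(status))
```

In a real script the status would typically come from the `returncode` of `subprocess.run(["smartctl", "-H", "/dev/sda"])`.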
The Deep Dive (Attributes):
sudo smartctl -A /dev/sda
What to look for:
- ID 5 (Reallocated_Sector_Ct): If this is 0, great. If it is increasing, your disk is physically dying.
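- ID 197 (Current_Pending_Sector): Sectors the drive could not read reliably and is waiting to remap. Any non-zero value means data is stuck in a bad spot. Back up immediately.
- ID 241 (Total_LBAs_Written): On SSDs, this counts how much data has been written over the drive's life. Compare it with the manufacturer's rated endurance to estimate how much "Life" you have used, because SSDs wear out after too many writes.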
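Because the trend matters more than a single reading, it helps to compare two snapshots of the attribute table. Here is a minimal sketch of that idea. The two sample tables are hand-written in the usual `smartctl -A` column layout, not captured from a real drive, and the function names are made up for this example.

```python
# Hand-written samples in the usual `smartctl -A` column layout (not from a real drive).
HEADER = "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
LAST_WEEK = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0\n"
)
TODAY = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       16\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0\n"
)


def parse_raw_values(text):
    """Map attribute ID -> integer raw value for rows whose raw value is a plain number."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values


def growing_attributes(before, after, watched=(5, 197)):
    """Return {id: (old, new)} for watched attributes whose raw value increased."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return {i: (old[i], new[i]) for i in watched
            if i in old and i in new and new[i] > old[i]}


if __name__ == "__main__":
    print(growing_attributes(LAST_WEEK, TODAY))
```

Splitting on whitespace is a simplification: some drives print extra text after the raw number, and this sketch only looks at the first token of that column.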
2. Running a Self-Test
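You can ask the disk to perform a more thorough internal exam. The test runs inside the drive in the background, so the command returns immediately and you read the results once the test has finished.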
# Start a short 2-minute test
sudo smartctl -t short /dev/sda
# See the results of the test
sudo smartctl -l selftest /dev/sda
3. Benchmarking: How fast is your floor?
Sometimes a disk isn't failing, but it's just slow. You need to verify if you are getting the speed you paid for.
I. The Quick Check (hdparm)
# Test cached reads (-T) and buffered disk reads (-t)
sudo hdparm -Tt /dev/sda
II. The Pro Tool (fio)
fio is the industry standard for benchmarking. It simulates real-world app behavior (like a database writing small chunks of data).
# Test random write performance (creates a 1 GiB test file in the current directory)
sudo fio --name=random-write --ioengine=libaio --rw=randwrite --bs=4k --size=1g --numjobs=1 --runtime=60 --time_based --end_fsync=1
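fio can also emit machine-readable results with `--output-format=json`, which is easier to track over time than the text report. The sketch below pulls out the write IOPS and bandwidth. The JSON is a trimmed, made-up sample (real output has many more fields), and it assumes the `bw` field is in KiB/s, which is how recent fio versions document it.

```python
import json

# Trimmed, made-up sample in the shape of `fio --output-format=json` output.
# Assumption: "bw" is in KiB/s. 12000 IOPS at 4 KiB blocks gives 48000 KiB/s.
SAMPLE_JSON = """
{
  "jobs": [
    {
      "jobname": "random-write",
      "write": {"iops": 12000.0, "bw": 48000}
    }
  ]
}
"""


def summarize_writes(json_text):
    """Return a list of (job name, write IOPS, write bandwidth in MiB/s)."""
    report = json.loads(json_text)
    summary = []
    for job in report["jobs"]:
        write = job["write"]
        summary.append((job["jobname"], write["iops"], write["bw"] / 1024))
    return summary


if __name__ == "__main__":
    for name, iops, mib_s in summarize_writes(SAMPLE_JSON):
        print(f"{name}: {iops:.0f} IOPS, {mib_s:.1f} MiB/s")
```

For small-block random I/O, IOPS is usually the number to compare between drives; the bandwidth figure is just IOPS multiplied by the block size.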
4. Practical: Setting up the SMART Daemon
You should never have to manually run these tests. You should have a "Background Doctor" watching your disks.
- Install smartmontools.
- Edit /etc/default/smartmontools and set start_smartd=yes.
- In /etc/smartd.conf, you can tell Linux to email you if a disk fails its self-test.
# Example smartd entry to email the admin
/dev/sda -a -m sysadmin@company.com
5. Identifying SSD Wear
SSDs (Solid State Drives) do not last forever. They have a "Write Endurance" limit.
# NVMe drives report 'Percentage Used'; many SATA SSDs expose 'Wear_Leveling_Count' instead
sudo smartctl -a /dev/nvme0n1 | grep -i -E 'Percentage Used|Wear_Leveling'
If your "Percentage Used" is at 99%, it is time to buy a new drive today.
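The same check can be scripted. This is a minimal sketch that reads the "Percentage Used" line from NVMe-style smartctl output and turns it into a recommendation. The sample text is hand-written, and the 80% warning threshold is an arbitrary example, not a vendor figure.

```python
import re

# Hand-written sample in the style of the NVMe health section of `smartctl -a`.
SAMPLE_NVME = """\
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Percentage Used:                    87%
"""


def percentage_used(text):
    """Return the 'Percentage Used' value as an int, or None if the line is absent."""
    match = re.search(r"^Percentage Used:\s+(\d+)%", text, re.MULTILINE)
    return int(match.group(1)) if match else None


def wear_verdict(pct, warn_at=80):
    """Map the percentage to a short recommendation. warn_at is an example threshold."""
    if pct is None:
        return "unknown"
    if pct >= 100:
        # The NVMe counter is allowed to run past 100 once rated endurance is used up.
        return "past rated endurance"
    if pct >= warn_at:
        return "plan replacement"
    return "ok"


if __name__ == "__main__":
    pct = percentage_used(SAMPLE_NVME)
    print(pct, wear_verdict(pct))
```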
6. Example: A Disk Failure Predictor (Python)
Here is a Python script that parses the smartctl output and flags a warning if either of the two critical attributes (IDs 5 and 197) is non-zero.
import subprocess
import re


def predict_failure(device="/dev/sda"):
    """
    Checks for critical sector reallocation counts.
    """
    print(f"--- Predicting Failure for {device} ---")
    res = subprocess.run(["sudo", "smartctl", "-A", device], capture_output=True, text=True)

    # We look for ID 5 (Reallocated Sectors) or ID 197 (Pending)
    critical_matches = re.findall(r"(Reallocated_Sector_Ct|Current_Pending_Sector).+\s(\d+)$",
                                  res.stdout, re.MULTILINE)

    risks = 0
    for attr, value in critical_matches:
        val = int(value)
        if val > 0:
            print(f"[!!!] DANGER: {attr} is {val}! Disk is physically failing.")
            risks += 1

    if risks == 0:
        print("[OK] No critical surface errors found.")


if __name__ == "__main__":
    predict_failure("/dev/sda")
7. Professional Tip: Check 'dmesg' for I/O Errors
If your disk is failing, the kernel will start screaming in the logs. If you see messages like "I/O error, dev sda, sector...", do not wait for SMART to confirm it: your disk is actively dying. Unmount it (or remount it read-only) immediately and copy your data off while you still can.
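A quick way to spot these messages is to count them per device. The sketch below does that with a regular expression. The log lines are hand-written examples modeled on common kernel messages, and the exact wording varies between kernel versions, so treat the pattern as a starting point.

```python
import re
from collections import Counter

# Hand-written example lines modeled on common kernel messages (not a real log).
SAMPLE_LOG = """\
[ 1234.567890] blk_update_request: I/O error, dev sda, sector 123456789
[ 1234.600000] Buffer I/O error on dev sda1, logical block 15432098, async page read
[ 1240.000100] I/O error, dev sda, sector 123456797 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 1300.000000] EXT4-fs (sdb1): mounted filesystem with ordered data mode
"""

# Matches the block-layer form quoted above: "I/O error, dev <name>, sector <n>".
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_text):
    """Count block-layer I/O error lines per device name."""
    return Counter(match.group(1) for match in IO_ERROR.finditer(log_text))


if __name__ == "__main__":
    for device, count in count_io_errors(SAMPLE_LOG).items():
        print(f"{device}: {count} I/O error(s)")
```

To run it against a live system, you could pass in the text from `dmesg` or `journalctl -k` instead of the sample string.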
8. Summary
Disk health is about awareness and action.
- smartctl -H is your quick health check.
- Attributes 5 and 197 are the primary signs of physical death.
- smartd provides 24/7 automated monitoring.
- SSD Percentage Used tells you how close you are to the write limit.
- hdparm and fio verify that your hardware performs as expected.
This concludes Module 12: Disk Management and Filesystems. You are now a master of the physical and logical layers of Linux storage.
In the next module, we will explore the final pillar of a robust server: System Logging and Monitoring.
Quiz Questions
- Why does an SSD have a limited lifespan compared to a traditional spinning HDD?
- What is a "Reallocated Sector" and why is it a sign of impending failure?
- What is the difference between a "Short" and a "Long" SMART self-test?
End of Module 12. Proceed to Module 13: System Logging and Monitoring.