Disk Health: Predicting the Crash

In the previous lesson, we learned how to survive a disk failure using RAID. But wouldn't it be better to know the failure is coming before it happens?

Modern drives (both HDDs and SSDs) contain firmware that monitors the drive's own health. This system is called S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology). It tracks things like "Reallocated Sectors" (how many sectors the disk had to remap to spare space because a spot went bad) and "Power-On Hours."

In this lesson, we will learn to read these "Vital Signs" and benchmark our disk performance.

1. smartctl: The Doctor's Scope

The tool we use in Linux is smartctl, which is part of the smartmontools package.

The Quick Check:

# See the basic health status
sudo smartctl -H /dev/sda
# Expected Output: "SMART overall-health self-assessment test result: PASSED"
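Besides the printed result, smartctl also reports problems through its exit status, which is a bitmask. The sketch below decodes that bitmask using the bit meanings listed in the smartctl man page. The function name decode_exit_status is made up for this lesson, not part of smartmontools.

```python
# Meaning of each bit in smartctl's exit status (from the smartctl man page)
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command failed or returned bad data",
    3: "SMART status check returned DISK FAILING",
    4: "a prefail attribute is at or below its threshold",
    5: "an attribute was at or below its threshold in the past",
    6: "the device error log contains errors",
    7: "the self-test log contains errors",
}

def decode_exit_status(status):
    """Return one message for every bit that is set in the exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]

# Example: a status of 8 has only bit 3 set
for message in decode_exit_status(8):
    print(message)
```

In a real script you would pass in the returncode of subprocess.run(["sudo", "smartctl", "-H", "/dev/sda"]). An empty list means smartctl had nothing to complain about.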

The Deep Dive (Attributes):

sudo smartctl -A /dev/sda

What to look for:

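ID 5 (Reallocated_Sector_Ct): If the raw value is 0, great. If it is increasing, the disk keeps finding new bad sectors and is physically dying.
ID 197 (Current_Pending_Sector): Sectors that could not be read and are waiting to be remapped. Data is stuck in a bad spot. Back up immediately.
ID 241 (Total_LBAs_Written): On SSDs, this counts the total data written over the drive's life. Compare it with the manufacturer's endurance rating (TBW) to estimate how much "Life" you have used. SSDs wear out after too many writes. Attribute IDs and units vary by vendor.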
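To turn the raw value of ID 241 into something readable, you can convert it to terabytes and compare it with the drive's rated endurance. This is a rough sketch that assumes the attribute counts 512-byte LBAs (some vendors use larger units, so check your drive's documentation). The 150 TBW rating and the raw value are made-up example numbers.

```python
def lbas_to_terabytes(lbas_written, lba_size=512):
    """Convert an LBA count to decimal terabytes (1 TB = 10**12 bytes)."""
    return lbas_written * lba_size / 1e12

def endurance_used_pct(lbas_written, rated_tbw, lba_size=512):
    """Percentage of the rated endurance (in TB written) already consumed."""
    return 100 * lbas_to_terabytes(lbas_written, lba_size) / rated_tbw

# Hypothetical raw value read from 'smartctl -A'
raw_241 = 58_593_750_000
written_tb = lbas_to_terabytes(raw_241)
used = endurance_used_pct(raw_241, rated_tbw=150)
print(f"Written: {written_tb:.1f} TB, endurance used: {used:.0f}%")
```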

2. Running a Self-Test

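You can ask the disk to perform a more thorough internal exam. The test runs inside the drive in the background, so the command returns immediately and you check the results once the test has finished.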

# Start a short test (usually about 2 minutes; smartctl prints the estimated finish time)
sudo smartctl -t short /dev/sda

# See the results once the test has finished
sudo smartctl -l selftest /dev/sda
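If you want to check the self-test log from a script, you can parse the table that smartctl prints. The sketch below assumes the usual ATA log layout (number, test type, status, remaining, lifetime hours, LBA of first error). The sample log is illustrative and not taken from a real drive, and the function name is made up for this lesson.

```python
import re

# One row of the ATA self-test log: fields are separated by runs of spaces
ROW = re.compile(r"#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$")

def parse_selftest_log(text):
    """Return one dict per self-test entry found in the smartctl output."""
    results = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            num, test, status, remaining, hours, lba = m.groups()
            results.append({
                "num": int(num),
                "test": test,
                "status": status,
                "passed": status == "Completed without error",
                "hours": int(hours),
                "first_error_lba": None if lba == "-" else int(lba),
            })
    return results

# Illustrative sample of 'smartctl -l selftest' output
SAMPLE = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1234         -
# 2  Extended offline    Completed: read failure       90%      1200         12345678
"""

for entry in parse_selftest_log(SAMPLE):
    flag = "OK " if entry["passed"] else "BAD"
    print(flag, entry["test"], "-", entry["status"])
```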

3. Benchmarking: How fast is your floor?

Sometimes a disk isn't failing; it's just slow. You need to verify whether you are getting the speed you paid for.

I. The Quick Check (hdparm)

# Test cached reads (-T) and buffered disk reads (-t)
sudo hdparm -Tt /dev/sda

II. The Pro Tool (fio)

fio is a widely used tool for serious benchmarking. It simulates real-world app behavior (like a database writing small chunks of data).

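# Test random write performance (creates a 1 GiB test file in the current directory)
# --direct=1 bypasses the page cache so the disk itself is measured, not RAM
sudo fio --name=random-write --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --size=1g --numjobs=1 --runtime=60 --time_based --end_fsync=1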
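fio can also print its report as JSON if you add --output-format=json, which is easier to read from a script than the text report. The sketch below pulls IOPS and bandwidth out of such a report. It assumes the fio 3.x layout, where each job has a "write" section with "iops" and "bw" (in KiB/s). The sample report is trimmed and its numbers are made up.

```python
import json

def summarize_fio(json_text, direction="write"):
    """Return job name, IOPS and MiB/s for each job in a fio JSON report."""
    report = json.loads(json_text)
    summary = []
    for job in report["jobs"]:
        stats = job[direction]
        summary.append({
            "job": job["jobname"],
            "iops": stats["iops"],
            "mib_per_s": stats["bw"] / 1024,  # fio reports 'bw' in KiB/s
        })
    return summary

# Trimmed, made-up sample of 'fio ... --output-format=json'
SAMPLE_REPORT = """
{"jobs": [{"jobname": "random-write",
           "write": {"iops": 12000.0, "bw": 48000}}]}
"""

for row in summarize_fio(SAMPLE_REPORT):
    print(f"{row['job']}: {row['iops']:.0f} IOPS, {row['mib_per_s']:.1f} MiB/s")
```

Saving these numbers after each run gives you a baseline, so a slow disk shows up as a drop against its own history rather than as a guess.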

4. Practical: Setting up the SMART Daemon

You should never have to manually run these tests. You should have a "Background Doctor" watching your disks.

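Install smartmontools.
Enable the daemon. On older Debian/Ubuntu releases, edit /etc/default/smartmontools and set start_smartd=yes. On systemd-based systems, run sudo systemctl enable --now smartd instead (the unit is called smartmontools on some Debian/Ubuntu releases).
In /etc/smartd.conf, you can tell smartd to email you when it detects a problem, such as a failed health check or a failed self-test. This requires a working mail setup on the server.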

# Example smartd entry to email the admin
/dev/sda -a -m sysadmin@company.com

5. Identifying SSD Wear

SSDs (Solid State Drives) do not last forever. They have a "Write Endurance" limit.

# NVMe drives report 'Percentage Used'; some SATA SSDs report 'Wear_Leveling_Count' instead
sudo smartctl -a /dev/nvme0n1 | grep -Ei 'Percentage Used|Wear_Leveling_Count'

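"Percentage Used" is the drive's own estimate of how much of its rated write endurance has been consumed. If yours is at 99%, it is time to buy a new drive today. Ideally, plan the replacement well before you get there.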
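For scripting, smartctl can print JSON with the -j flag, which avoids fragile grep patterns. The sketch below reads "percentage_used" from the "nvme_smart_health_information_log" section of that JSON. The function name, the 80% warning threshold and the sample data are assumptions made for this lesson.

```python
import json

def nvme_wear(json_text, warn_at=80):
    """Return (percentage_used, level) from 'smartctl -j -a' output.

    Returns (None, 'unknown') if the output has no NVMe health log.
    """
    data = json.loads(json_text)
    log = data.get("nvme_smart_health_information_log")
    if log is None or "percentage_used" not in log:
        return None, "unknown"
    used = log["percentage_used"]
    if used >= 100:
        level = "replace now"
    elif used >= warn_at:
        level = "plan replacement"
    else:
        level = "ok"
    return used, level

# Trimmed, made-up sample of 'sudo smartctl -j -a /dev/nvme0n1'
SAMPLE_JSON = """
{"device": {"name": "/dev/nvme0n1", "type": "nvme"},
 "nvme_smart_health_information_log": {"percentage_used": 87,
                                       "available_spare": 100}}
"""

used, level = nvme_wear(SAMPLE_JSON)
print(f"Percentage Used: {used}% -> {level}")
```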

6. Example: A Disk Failure Predictor (Python)

Here is a Python script that parses the smartctl output and flags a warning if either of the two critical sector attributes (ID 5 and ID 197) is non-zero.

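import re
import subprocess

def predict_failure(device="/dev/sda"):
    """
    Flags non-zero raw values for ID 5 (Reallocated_Sector_Ct) and
    ID 197 (Current_Pending_Sector). Returns the number of risks found,
    or None if the attributes could not be read.
    """
    print(f"--- Predicting Failure for {device} ---")

    # smartctl needs root. Its exit status is a bitmask, so a non-zero
    # status is not treated as a hard error; we inspect the output instead.
    res = subprocess.run(["sudo", "smartctl", "-A", device],
                         capture_output=True, text=True)

    # The raw value is the last column of each attribute row
    critical_matches = re.findall(
        r"(Reallocated_Sector_Ct|Current_Pending_Sector)[^\n]*[ \t](\d+)[ \t]*$",
        res.stdout, re.MULTILINE)

    if not critical_matches:
        # NVMe drives do not report these ATA attributes, and a failed
        # smartctl call prints no attribute table at all
        print("[??] Attributes 5 and 197 not found; check the smartctl output.")
        return None

    risks = 0
    for attr, value in critical_matches:
        val = int(value)
        if val > 0:
            print(f"[!!!] DANGER: {attr} is {val}! Disk is physically failing.")
            risks += 1

    if risks == 0:
        print("[OK] No critical surface errors found.")
    return risks

if __name__ == "__main__":
    predict_failure("/dev/sda")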

7. Professional Tip: Check 'dmesg' for I/O Errors

If your disk is failing, the kernel will start screaming in the logs. If dmesg shows messages like "I/O error, dev sda, sector...", reads or writes are already failing and the disk may be actively dying, whatever the SMART summary says. Stop writing to it and back up whatever you can still read immediately.
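A quick way to spot these messages is to count them per device. The sketch below scans kernel log text for the "I/O error, dev ..., sector ..." pattern. The sample lines are illustrative, and the exact wording of kernel messages differs between kernel versions, so treat the pattern as an assumption.

```python
import re
from collections import Counter

# Matches both older ("blk_update_request: I/O error, ...") and newer prefixes
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def count_io_errors(log_text):
    """Return a Counter mapping device name to number of I/O error lines."""
    return Counter(m.group(1) for m in IO_ERROR.finditer(log_text))

# Illustrative kernel log lines, not from a real machine
SAMPLE_LOG = """\
[ 1234.567890] blk_update_request: I/O error, dev sda, sector 123456
[ 1234.600000] I/O error, dev sda, sector 123464 op 0x0:(READ) flags 0x0
[ 1300.000000] EXT4-fs (sdb1): mounted filesystem
"""

for dev, count in count_io_errors(SAMPLE_LOG).items():
    print(f"{dev}: {count} I/O error(s) in the kernel log")
```

On a live system you would feed it the output of sudo dmesg, for example via subprocess.run(["sudo", "dmesg"], capture_output=True, text=True).stdout.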

8. Summary

Disk health is about awareness and action.

smartctl -H is your quick health check.
Attributes 5 and 197 are the primary signs of physical death.
smartd provides 24/7 automated monitoring.
SSD Percentage Used tells you how close you are to the write limit.
hdparm and fio verify that your hardware performs as expected.

This concludes Module 12: Disk Management and Filesystems. You are now a master of the physical and logical layers of Linux storage.

In the next module, we will explore the final pillar of a robust server: System Logging and Monitoring.

Quiz Questions

Why does an SSD have a limited write endurance, unlike a traditional spinning HDD?
What is a "Reallocated Sector" and why is it a sign of impending failure?
What is the difference between a "Short" and a "Long" SMART self-test?

End of Module 12. Proceed to Module 13: System Logging and Monitoring.
