Hardware Health Monitoring¶
Last Updated: 2025-12-08 Status: Production
Overview¶
This guide covers hardware health monitoring for Tower Fleet infrastructure, focusing on:
- ZFS pool health - RAIDZ2 array status and integrity
- SMART drive monitoring - Predictive disk failure detection
- Automated alerts - Prometheus integration for proactive monitoring
- Maintenance schedules - Regular scrubs and health checks
Quick Health Check¶
ZFS Pool Status¶
# Check all pool health
zpool status
# Detailed status with error counts
zpool status -v
# Check specific pool
zpool status vault
zpool status rpool
What to look for:
- State: Should be ONLINE (not DEGRADED or FAULTED)
- READ/WRITE/CKSUM errors: All should be 0
- Scan status: Last scrub should be recent (< 1 month) with 0 errors
Example healthy output:
pool: vault
state: ONLINE
scan: scrub repaired 0B in 08:23:15 with 0 errors on Sun Dec 1 00:33:17 2025
config:
NAME STATE READ WRITE CKSUM
vault ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST4000VX016-3CV104_ZW61ZP88 ONLINE 0 0 0
ata-ST4000VX016-3CV104_ZW626RNN ONLINE 0 0 0
...
errors: No known data errors
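To script the checks above, zpool status -x condenses them into a single pass/fail summary. A minimal sketch (wire the warning into whatever alerting you prefer):
# Prints "all pools are healthy" when everything is fine; otherwise shows only the affected pools
if ! zpool status -x | grep -q "all pools are healthy"; then
    echo "WARNING: one or more pools need attention"
    zpool status -x
fi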
SMART Drive Health¶
# Quick health check for all drives
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
echo "=== /dev/$dev ==="
smartctl -H /dev/$dev 2>/dev/null | grep -i "overall-health"
done
# Detailed check with key metrics
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
echo "=== /dev/$dev ==="
smartctl -i /dev/$dev 2>/dev/null | grep -E "(Model|Serial)"
smartctl -a /dev/$dev 2>/dev/null | grep -E "(Power_On_Hours|Reallocated_Sector|Current_Pending|Offline_Uncorrectable)" | head -4
echo
done
Critical SMART attributes to monitor:
| Attribute ID | Name | Meaning | Threshold |
|---|---|---|---|
| 5 | Reallocated_Sector_Ct | Bad sectors remapped | > 50 = warning, > 500 = replace |
| 9 | Power_On_Hours | Drive age | > 43,800 (5 years) = watch closely |
| 197 | Current_Pending_Sector | Sectors waiting to be remapped | > 0 = warning |
| 198 | Offline_Uncorrectable | Sectors that can't be read | > 0 = replace soon |
| 199 | UDMA_CRC_Error_Count | Cable/connection errors | > 100 = check cables |
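As a rough sketch, the warning thresholds from the table can be checked in a loop like this (the device range and thresholds are assumptions; adjust them to your drive bays):
# Flag drives exceeding the warning thresholds above
for dev in /dev/sd[a-h]; do
    [ -e "$dev" ] || continue
    realloc=$(smartctl -A "$dev" 2>/dev/null | awk '/Reallocated_Sector_Ct/ {print $10}')
    pending=$(smartctl -A "$dev" 2>/dev/null | awk '/Current_Pending_Sector/ {print $10}')
    [ "${realloc:-0}" -gt 50 ] && echo "$dev: $realloc reallocated sectors (warning threshold: 50)"
    [ "${pending:-0}" -gt 0 ] && echo "$dev: $pending pending sectors"
done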
ZFS Pool Management¶
Understanding Pool States¶
ONLINE ✅
- All devices working normally
- No action required
DEGRADED ⚠️
- One or more devices faulted
- Pool still functional but at reduced redundancy
- RAIDZ2 can lose 2 drives; if 2 are faulted, the pool is at risk
- Action: Replace faulted drive(s) immediately
FAULTED ❌
- Too many devices failed; data loss possible
- Pool may be read-only or offline
- Action: Emergency recovery procedure required
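For a one-line health summary of every pool (handy in scripts and cron mail):
# HEALTH column reports ONLINE, DEGRADED, or FAULTED per pool
zpool list -H -o name,health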
Drive Mapping¶
Map ZFS device IDs to physical /dev/sd* devices:
# List all drive mappings
ls -la /dev/disk/by-id/ | grep -E "ata-ST|ata-WDC" | grep -v "part"
# Example output:
# ata-ST4000VN008-2DR166_ZDH8P5G1 -> ../../sdg
# ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y -> ../../sde
This helps correlate ZFS errors with physical drive locations.
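lsblk can show the same correlation from the other direction, pairing each /dev/sd* device with its model and serial (column availability may vary slightly by lsblk version):
# Disk-level view: device name, model, serial, size
lsblk -d -o NAME,MODEL,SERIAL,SIZE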
ZFS Scrub Operations¶
What is a scrub?
- Reads every block in the pool to detect silent corruption
- Repairs errors using parity/mirrors
- Should run monthly on production pools
Manual scrub:
# Start scrub
zpool scrub vault
# Check scrub progress
zpool status vault
# Cancel scrub (if needed)
zpool scrub -s vault
Automated monthly scrub:
# Add to crontab
crontab -e
# Run scrub on first Sunday of month at 2am
0 2 1-7 * * [ "$(date +\%u)" -eq 7 ] && zpool scrub vault
Scrub duration: Expect ~8-12 hours for 20TB+ pools.
Replacing Failed Drives¶
⚠️ IMPORTANT: If you have Kubernetes workloads with Longhorn storage running on VMs, follow the full procedure below to prevent data loss.
Prerequisites¶
# 1. Document current state
zpool status vault > /root/vault-status-before-replacement.txt
/root/tower-fleet/scripts/check-hardware-health.sh > /root/hardware-before-replacement.txt
# 2. Identify the failed drive
zpool status vault
# Note the device ID (e.g., ata-ST4000VN008-2DR166_ZDH8PN1H)
# 3. Verify drive mapping
ls -la /dev/disk/by-id/ | grep <SERIAL_NUMBER>
Step 1: Graceful Shutdown (K8s Cluster)¶
If you have K8s VMs running Longhorn storage:
# Drain all K8s nodes (prevents pod disruption during reboot)
kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Shutdown K8s VMs gracefully
qm shutdown 201 && qm shutdown 202 && qm shutdown 203
# Wait for VMs to stop (verify status)
qm status 201 # Should show: status: stopped
# Shutdown Proxmox host
shutdown -h now
If there is no K8s cluster, just shut down the Proxmox host with shutdown -h now.
Step 2: Physical Drive Replacement¶
- Power off server (or hot-swap if controller supports it)
- Remove failed drive from bay
- Record the NEW drive's serial number (printed on label)
- Install new drive in same bay
- Power on server
Step 3: Post-Boot Drive Detection¶
# Wait for boot to complete (2-3 minutes)
# Verify new drive is detected
ls /dev/sd*
lsblk
# Find the new drive's device ID
ls -la /dev/disk/by-id/ | grep ata
# Look for the new serial number you recorded
Step 4: Replace Drive in ZFS Pool¶
# Replace using full device ID (preferred method)
zpool replace vault ata-OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID
# Example:
# zpool replace vault ata-ST4000VN008-2DR166_ZDH8PN1H /dev/disk/by-id/ata-ST4000VX016-3CV104_ZW62ABCD
# If by-id path isn't available immediately:
zpool replace vault ata-OLD_DEVICE_ID /dev/sdc
# Monitor resilver progress
watch zpool status vault
Resilver duration: Expect ~24 hours per 4TB drive.
Step 5: Recover Kubernetes Cluster¶
CRITICAL: After reboot, Longhorn volumes may be in "faulted" state. See Post-Reboot Recovery Guide for details.
# 1. Uncordon K8s nodes (allow pods to schedule)
kubectl uncordon k3s-master
kubectl uncordon k3s-worker-1
kubectl uncordon k3s-worker-2
# 2. Wait 2-3 minutes for Longhorn to stabilize
# 3. Check for faulted volumes
kubectl get volumes -n longhorn-system | grep faulted
# 4. If any faulted volumes, run recovery script
/root/tower-fleet/scripts/recover-longhorn-volumes.sh
# 5. Verify all pods are running
kubectl get pods -A | grep -v Running
Step 6: Verify Services¶
# Check critical workloads
kubectl get pods -n docker-registry
kubectl get pods -n authentik
kubectl get pods -n home-portal
kubectl get pods -n supabase
# Test external access
curl -I https://portal.bogocat.com
nc -zv 10.89.97.201 30500 # docker registry (if applicable)
# Verify Longhorn health
kubectl get volumes -n longhorn-system --no-headers | grep -v healthy || echo "All volumes healthy"
Step 7: Monitor Resilver to Completion¶
# Monitor resilver progress (runs in background)
watch zpool status vault
# Check periodically
zpool status vault | grep -A 3 "scan:"
# When complete (after ~24 hours):
zpool status vault
# Should show: scan: resilvered XGB in HH:MM:SS with 0 errors
Step 8: Verify Replacement Success¶
# Check pool state
zpool status vault
# State should be ONLINE (if all drives replaced) or DEGRADED (if more need replacing)
# Verify new drive SMART health
smartctl -H /dev/sdc
smartctl -a /dev/sdc | grep -E "Reallocated|Pending|Uncorrectable"
# Run full hardware health check
/root/tower-fleet/scripts/check-hardware-health.sh
Multiple Drive Replacements¶
If replacing multiple failed drives:
- Replace one drive at a time
- Wait for resilver to complete 100% before replacing the next drive (see the wait-loop sketch after the example timeline below)
- RAIDZ2 can tolerate 2 simultaneous failures, but replacing sequentially is safer
- Repeat Steps 1-8 for each drive
Example timeline:
- Day 1: Replace drive 1, start resilver
- Day 2: Resilver completes; replace drive 2, start resilver
- Day 3: All replacements complete, verify pool ONLINE
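The wait-for-resilver step above can be automated with a simple polling loop. A sketch only; the pool name and the 10-minute interval are assumptions:
# Block until the current resilver finishes before starting the next replacement
while zpool status vault | grep -q "resilver in progress"; do
    sleep 600
done
echo "Resilver complete:"
zpool status vault | grep -A 2 "scan:"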
Quick Reference Card¶
DRIVE REPLACEMENT CHECKLIST
Before shutdown:
[ ] kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data
[ ] qm shutdown 201 && qm shutdown 202 && qm shutdown 203
[ ] shutdown -h now
After boot:
[ ] kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
[ ] /root/tower-fleet/scripts/recover-longhorn-volumes.sh
[ ] kubectl get pods -A | grep -v Running
Replace in ZFS:
[ ] zpool replace vault OLD_DEVICE_ID /dev/NEW_DEVICE
Monitor:
[ ] watch zpool status vault
[ ] /root/tower-fleet/scripts/check-hardware-health.sh
SMART Monitoring¶
Full Drive Health Report¶
# Complete SMART report for a drive
smartctl -a /dev/sda
# Key sections:
# - Model and Serial Number
# - Overall health assessment (PASSED/FAILED)
# - SMART Attributes (reallocated sectors, pending sectors, etc.)
# - Error log (recent read/write errors)
# - Self-test results
Running SMART Self-Tests¶
Short test (2 minutes):
# Start short test
smartctl -t short /dev/sda
# Wait 2 minutes, then check results
smartctl -l selftest /dev/sda
Long test (8+ hours):
# Start extended test
smartctl -t long /dev/sda
# Check progress
smartctl -a /dev/sda | grep "Self-test execution status"
# Check results (after completion)
smartctl -l selftest /dev/sda
Automated weekly tests:
# Add to crontab (run Sunday at 3am when load is low)
0 3 * * 0 for dev in sda sdb sdc sdd sde sdf sdg sdh; do smartctl -t long /dev/$dev; done
SMART Daemon (smartd)¶
Enable smartd for automatic monitoring:
# Install smartmontools (if not present)
apt update && apt install smartmontools
# Edit config
nano /etc/smartd.conf
# Add monitoring for all drives
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../7/03) -m root
# Restart service
systemctl restart smartd
systemctl enable smartd
Config explanation:
- -a: Monitor all attributes
- -o on: Enable automatic offline testing
- -S on: Enable attribute autosave
- -n standby,q: Don't wake sleeping drives
- -s (S/../.././02|L/../../7/03): Short test daily at 2am, long test Sunday at 3am
- -m root: Email root on failures
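To sanity-check the configuration without waiting for a scheduled run, smartd can be started in one-shot modes:
# Run in the foreground, register devices, perform a single check cycle, and exit
smartd -d -q onecheck
# Print the computed self-test schedule for each device, then exit
smartd -q showtests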
Prometheus Integration¶
Node Exporter SMART Metrics¶
Install smartmon textfile collector:
# Create script directory
mkdir -p /usr/local/bin
# Create the smartmon.sh collector script
cat > /usr/local/bin/smartmon.sh << 'EOF'
#!/bin/bash
# Prometheus node_exporter textfile collector for SMART metrics
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/smart.prom"
TEMP_FILE="${OUTPUT_FILE}.$$"
mkdir -p "$(dirname "$OUTPUT_FILE")"
# Write the HELP/TYPE header once (the Prometheus text format allows only one per metric);
# this also creates/truncates the temp file
echo "# HELP smart_health SMART overall health (1=PASSED, 0=FAILED)" > "$TEMP_FILE"
echo "# TYPE smart_health gauge" >> "$TEMP_FILE"
# Iterate over all drives
for device in /dev/sd[a-z]; do
[ -e "$device" ] || continue
# Get device info
model=$(smartctl -i "$device" | awk -F': +' '/Device Model:/ {print $2}' | tr ' ' '_')
serial=$(smartctl -i "$device" | awk -F': +' '/Serial Number:/ {print $2}')
# Get SMART health
health=$(smartctl -H "$device" | awk '/SMART overall-health/ {print $NF}')
health_value=0
[ "$health" = "PASSED" ] && health_value=1
echo "# HELP smart_health SMART overall health (1=PASSED, 0=FAILED)" >> "$TEMP_FILE"
echo "# TYPE smart_health gauge" >> "$TEMP_FILE"
echo "smart_health{device=\"$device\",model=\"$model\",serial=\"$serial\"} $health_value" >> "$TEMP_FILE"
# Get key SMART attributes
smartctl -A "$device" | awk -v device="$device" -v model="$model" -v serial="$serial" '
/Reallocated_Sector_Ct/ {print "smart_reallocated_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
/Current_Pending_Sector/ {print "smart_pending_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
/Offline_Uncorrectable/ {print "smart_uncorrectable_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
/Power_On_Hours/ {print "smart_power_on_hours{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
/Temperature_Celsius/ {print "smart_temperature_celsius{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
' >> "$TEMP_FILE"
done
# Move to final location atomically
mv "$TEMP_FILE" "$OUTPUT_FILE"
EOF
chmod +x /usr/local/bin/smartmon.sh
# Create node_exporter textfile directory
mkdir -p /var/lib/node_exporter/textfile_collector
# Test script
/usr/local/bin/smartmon.sh
cat /var/lib/node_exporter/textfile_collector/smart.prom
Add to crontab (run every 5 minutes):
*/5 * * * * /usr/local/bin/smartmon.sh
Update node_exporter systemd service:
# Edit service file
systemctl edit node-exporter --full
# Add to ExecStart:
ExecStart=/usr/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
# Restart
systemctl restart node-exporter
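To confirm the metrics are actually being exported (assuming node_exporter listens on its default port 9100):
# Should print smart_health, smart_reallocated_sectors, etc. for each drive
curl -s localhost:9100/metrics | grep '^smart_'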
Prometheus Alert Rules¶
Create SMART alert rules:
# Create alert file
cat > /root/tower-fleet/k8s/prometheus/smart-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-smart-alerts
namespace: monitoring
data:
smart-alerts.yaml: |
groups:
- name: smart_health
interval: 5m
rules:
- alert: SmartHealthFailed
expr: smart_health == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Drive SMART health check failed"
description: "Drive {{ $labels.device }} ({{ $labels.model }}, S/N: {{ $labels.serial }}) has failed SMART health check."
- alert: SmartReallocatedSectorsHigh
expr: smart_reallocated_sectors > 50
for: 30m
labels:
severity: warning
annotations:
summary: "High number of reallocated sectors"
description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Consider replacing."
- alert: SmartReallocatedSectorsCritical
expr: smart_reallocated_sectors > 500
for: 5m
labels:
severity: critical
annotations:
summary: "Critical number of reallocated sectors"
description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Replace immediately."
- alert: SmartPendingSectors
expr: smart_pending_sectors > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Drive has pending sectors"
description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} pending sectors waiting to be remapped."
- alert: SmartUncorrectableSectors
expr: smart_uncorrectable_sectors > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Drive has uncorrectable sectors"
description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} uncorrectable sectors. Data loss possible. Replace immediately."
- alert: SmartDriveAgeWarning
expr: smart_power_on_hours > 43800 # 5 years
for: 1d
labels:
severity: info
annotations:
summary: "Drive approaching end of life"
description: "Drive {{ $labels.device }} ({{ $labels.model }}) has been powered on for {{ $value }} hours (>5 years). Monitor closely."
- alert: SmartTemperatureHigh
expr: smart_temperature_celsius > 50
for: 30m
labels:
severity: warning
annotations:
summary: "Drive temperature high"
description: "Drive {{ $labels.device }} temperature is {{ $value }}°C (threshold: 50°C). Check cooling."
EOF
# Apply to cluster
kubectl apply -f /root/tower-fleet/k8s/prometheus/smart-alerts.yaml
Add ZFS pool alert rules:
cat > /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-zfs-alerts
namespace: monitoring
data:
zfs-alerts.yaml: |
groups:
- name: zfs_health
interval: 5m
rules:
- alert: ZFSPoolDegraded
expr: node_zfs_pool_state{state="DEGRADED"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "ZFS pool degraded"
description: "ZFS pool {{ $labels.pool }} is in DEGRADED state. One or more devices have failed."
- alert: ZFSPoolFaulted
expr: node_zfs_pool_state{state="FAULTED"} > 0
for: 1m
labels:
severity: critical
annotations:
summary: "ZFS pool faulted"
description: "ZFS pool {{ $labels.pool }} is FAULTED. Data loss possible. Immediate action required."
- alert: ZFSPoolErrors
expr: increase(node_zfs_pool_errors_total[1h]) > 0
labels:
severity: warning
annotations:
summary: "ZFS pool has errors"
description: "ZFS pool {{ $labels.pool }} has {{ $value }} errors in the last hour. Type: {{ $labels.type }}."
EOF
kubectl apply -f /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml
Grafana Dashboard¶
Import ZFS + SMART dashboard:
- Access Grafana: http://10.89.97.211 (admin/prom-operator)
- Go to Dashboards → Import
- Create new dashboard with panels:
Panel 1: Drive SMART Health
Panel 2: Reallocated Sectors by Drive
Panel 3: Pending + Uncorrectable Sectors
Panel 4: Drive Temperature
Panel 5: Drive Age (Power-On Hours), displayed in years
Maintenance Schedule¶
Daily¶
- Automated: smartd runs short SMART self-tests (via smartd.conf)
- Automated: Prometheus scrapes SMART metrics every 5 minutes
Weekly¶
- Automated: Long SMART self-tests on all drives (Sunday 3am)
- Manual: Review Grafana hardware health dashboard
Monthly¶
- Automated: ZFS scrub (first Sunday, 2am)
- Manual: Review ZFS scrub results
- Manual: Check drive temperatures and reallocated sectors
Quarterly¶
- Manual: Review drive ages and plan replacements for drives >4 years old
- Manual: Test alert rules (trigger test alert, verify Discord notification)
Annually¶
- Manual: Backup and restore test (disaster recovery)
- Manual: Replace drives >5 years old (proactive replacement)
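For reference, the automated entries above correspond to crontab lines like these (a consolidated sketch of the jobs shown earlier in this guide; adjust the pool name and device list, and drop the self-test line if smartd's -s schedule already covers it):
# ZFS scrub - first Sunday of the month at 2am
0 2 1-7 * * [ "$(date +\%u)" -eq 7 ] && zpool scrub vault
# SMART metrics for Prometheus - every 5 minutes
*/5 * * * * /usr/local/bin/smartmon.sh
# Long SMART self-tests - Sunday at 3am
0 3 * * 0 for dev in sda sdb sdc sdd sde sdf sdg sdh; do smartctl -t long /dev/$dev; done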
Troubleshooting¶
ZFS Pool Won't Import¶
# Try importing with force
zpool import -f vault
# If that fails, import read-only to recover data
zpool import -o readonly=on vault
# Check pool configuration
zpool import
Drive Not Detected After Replacement¶
# Rescan SCSI bus
echo "- - -" > /sys/class/scsi_host/host0/scan
echo "- - -" > /sys/class/scsi_host/host1/scan
# Check kernel messages
dmesg | tail -50
# List all drives
lsblk
ls /dev/sd*
SMART Data Not Available¶
# Check if SMART is supported
smartctl -i /dev/sda
# Enable SMART
smartctl -s on /dev/sda
# Some USB/RAID controllers hide SMART data
# Use controller-specific tools if needed
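For USB-to-SATA enclosures in particular, telling smartctl the bridge type often makes SMART data visible (sat is the most common type; see the -d option in the smartctl man page):
# Query through the SCSI-to-ATA translation layer used by most USB bridges
smartctl -a -d sat /dev/sda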
High Drive Temperature¶
# Check current temperature
smartctl -a /dev/sda | grep Temperature
# Verify cooling fans are working
# Check system airflow
# Consider adding case fans
# Target: Keep drives < 45°C
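A quick temperature sweep across all drives (the device range is an assumption; adjust to match your bays):
# Print the current temperature for each drive
for dev in /dev/sd[a-h]; do
    [ -e "$dev" ] || continue
    temp=$(smartctl -A "$dev" 2>/dev/null | awk '/Temperature_Celsius/ {print $10}')
    echo "$dev: ${temp:-n/a}°C"
done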
Historical Events & Lessons Learned¶
Drive Replacement: 2025-12-08¶
Event: First production drive replacement with active K8s workloads
Failed Drives:
- Drive 1: /dev/sdc - ata-ST4000VN008-2DR166_ZDH8PN1H (FAULTED)
- Could not read SMART data
- Status: UNAVAIL in ZFS pool
- Drive 2: /dev/sde - ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y (DEGRADED)
- 52,773 power-on hours (6+ years old)
- 1 pending sector, 3 uncorrectable sectors
- Status: DEGRADED with too many errors
Replacement Drives:
- 1x Seagate IronWolf 4TB (ST4000VN006-3CW104_ZW63WH4W)
- 1x WD Red Plus 4TB (pending)
Timeline:
2025-12-08 15:30 - Drive replacement initiated
1. Documented pool status: zpool status vault > /root/vault-status-before-replacement.txt
2. Drained K8s nodes:
kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s
3. Shut down K8s VMs: qm shutdown 201 && qm shutdown 202 && qm shutdown 203
4. Shutdown Proxmox host: shutdown -h now
2025-12-08 18:24 - First drive replaced (physical)
- Removed failed /dev/sdc (Seagate ST4000VN008 ZDH8PN1H)
- Installed new Seagate IronWolf 4TB (ST4000VN006 ZW63WH4W)
- Recorded new serial number from drive label
2025-12-08 18:32 - Post-boot recovery
1. Verified new drive detected
2. Replaced drive in ZFS pool
3. Uncordoned K8s nodes
4. Checked Longhorn volumes
(The exact commands are listed under "Commands Used" below.)
Issues Encountered:
- ImagePullBackOff on app pods:
  - Symptom: home-portal, money-tracker, trip-planner stuck pulling images
  - Root cause: Pods scheduled before docker-registry service was fully ready
  - Resolution: Deleted stuck pods; they recreated successfully
  - Time to fix: 5 minutes
  - Lesson: Wait 2-3 minutes after uncordoning before checking pod status
- Supabase storage CrashLoopBackOff:
  - Symptom: storage pods failing with migration hash mismatch
  - Root cause: Pre-existing issue, unrelated to drive replacement
  - Impact: None on main applications
  - Status: Separate issue to be addressed later
What Worked Well:
- ✅ Graceful K8s drain - No pod disruptions or data loss
- ✅ Longhorn auto-recovery - No manual intervention needed (unlike previous reboot)
- ✅ ZFS resilver - Started immediately, ~24 hour ETA
- ✅ Documentation - Quick reference card was accurate and helpful
- ✅ Drive detection - New drive appeared immediately at boot
Key Learnings:
- Timing matters: Wait 2-3 minutes after uncordoning for services to stabilize
- Longhorn improved: No faulted volumes this time (previous reboot required manual salvage)
- Pod restarts safe: Simply deleting stuck pods resolved ImagePullBackOff
- Documentation critical: Having step-by-step guide prevented mistakes
- One drive at a time: Sequential replacement is the right approach
Next Steps:
- [ ] Monitor resilver to completion (~24 hours)
- [ ] Verify pool returns to ONLINE state
- [ ] Replace second drive (/dev/sde) using same procedure
- [ ] Update hardware inventory with new drive serials
Commands Used:
# Before shutdown
kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s
qm shutdown 201 && qm shutdown 202 && qm shutdown 203
shutdown -h now
# After boot
ls -la /dev/disk/by-id/ | grep ata
zpool replace vault OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
kubectl get volumes -n longhorn-system | grep faulted
kubectl get pods -A | grep -v Running
# Fix stuck pods
kubectl delete pod <pod-name> -n <namespace>
# Monitor
zpool status vault | grep scan
/root/tower-fleet/scripts/check-hardware-health.sh
Result: ✅ Successful - All critical services running, resilver in progress
Related Documentation¶
- Storage Infrastructure - ZFS and storage architecture
- Alert Investigation Guide - Alert investigation workflows
- Observability Standards - Monitoring best practices
- Disaster Recovery - Backup and recovery procedures
External Resources¶
- ZFS Documentation
- smartmontools Guide
- Backblaze Drive Stats - Drive failure rate data