Hardware Health Monitoring¶

Last Updated: 2025-12-08 Status: Production

Overview¶

This guide covers hardware health monitoring for Tower Fleet infrastructure, focusing on: - ZFS pool health - RAIDZ2 array status and integrity - SMART drive monitoring - Predictive disk failure detection - Automated alerts - Prometheus integration for proactive monitoring - Maintenance schedules - Regular scrubs and health checks

Quick Health Check¶

ZFS Pool Status¶

# Check all pool health
zpool status

# Detailed status with error counts
zpool status -v

# Check specific pool
zpool status vault
zpool status rpool

What to look for: - State: Should be ONLINE (not DEGRADED or FAULTED) - READ/WRITE/CKSUM errors: All should be 0 - Scan status: Last scrub should be recent (< 1 month) with 0 errors

Example healthy output:

pool: vault
 state: ONLINE
  scan: scrub repaired 0B in 08:23:15 with 0 errors on Sun Dec  1 00:33:17 2025
config:
    NAME                                          STATE     READ WRITE CKSUM
    vault                                         ONLINE       0     0     0
      raidz2-0                                    ONLINE       0     0     0
        ata-ST4000VX016-3CV104_ZW61ZP88           ONLINE       0     0     0
        ata-ST4000VX016-3CV104_ZW626RNN           ONLINE       0     0     0
        ...
errors: No known data errors

SMART Drive Health¶

# Quick health check for all drives
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
  echo "=== /dev/$dev ==="
  smartctl -H /dev/$dev 2>/dev/null | grep -i "overall-health"
done

# Detailed check with key metrics
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
  echo "=== /dev/$dev ==="
  smartctl -i /dev/$dev 2>/dev/null | grep -E "(Model|Serial)"
  smartctl -a /dev/$dev 2>/dev/null | grep -E "(Power_On_Hours|Reallocated_Sector|Current_Pending|Offline_Uncorrectable)" | head -4
  echo
done

Critical SMART attributes to monitor:

Attribute ID	Name	Meaning	Threshold
5	Reallocated_Sector_Ct	Bad sectors remapped	> 50 = warning, > 500 = replace
9	Power_On_Hours	Drive age	> 43,800 (5 years) = watch closely
197	Current_Pending_Sector	Sectors waiting to be remapped	> 0 = warning
198	Offline_Uncorrectable	Sectors that can't be read	> 0 = replace soon
199	UDMA_CRC_Error_Count	Cable/connection errors	> 100 = check cables

ZFS Pool Management¶

Understanding Pool States¶

ONLINE ✅ - All devices working normally - No action required

DEGRADED ⚠️ - One or more devices faulted - Pool still functional but at reduced redundancy - RAIDZ2: Can lose 2 drives - if 2 are faulted, pool is at risk - Action: Replace faulted drive(s) immediately

FAULTED ❌ - Too many devices failed, data loss possible - Pool may be read-only or offline - Action: Emergency recovery procedure required

Drive Mapping¶

Map ZFS device IDs to physical /dev/sd* devices:

# List all drive mappings
ls -la /dev/disk/by-id/ | grep -E "ata-ST|ata-WDC" | grep -v "part"

# Example output:
# ata-ST4000VN008-2DR166_ZDH8P5G1 -> ../../sdg
# ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y -> ../../sde

This helps correlate ZFS errors with physical drive locations.

ZFS Scrub Operations¶

What is a scrub? - Reads every block in the pool to detect silent corruption - Repairs errors using parity/mirrors - Should run monthly on production pools

Manual scrub:

# Start scrub
zpool scrub vault

# Check scrub progress
zpool status vault

# Cancel scrub (if needed)
zpool scrub -s vault

Automated monthly scrub:

# Add to crontab
crontab -e

# Run scrub on first Sunday of month at 2am
0 2 1-7 * * [ "$(date +\%u)" -eq 7 ] && zpool scrub vault

Scrub duration: Expect ~8-12 hours for 20TB+ pools.

Replacing Failed Drives¶

⚠️ IMPORTANT: If you have Kubernetes workloads with Longhorn storage running on VMs, follow the full procedure below to prevent data loss.

Prerequisites¶

# 1. Document current state
zpool status vault > /root/vault-status-before-replacement.txt
/root/tower-fleet/scripts/check-hardware-health.sh > /root/hardware-before-replacement.txt

# 2. Identify the failed drive
zpool status vault
# Note the device ID (e.g., ata-ST4000VN008-2DR166_ZDH8PN1H)

# 3. Verify drive mapping
ls -la /dev/disk/by-id/ | grep <SERIAL_NUMBER>

Step 1: Graceful Shutdown (K8s Cluster)¶

If you have K8s VMs running Longhorn storage:

# Drain all K8s nodes (prevents pod disruption during reboot)
kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

# Shutdown K8s VMs gracefully
qm shutdown 201 && qm shutdown 202 && qm shutdown 203

# Wait for VMs to stop (verify status)
qm status 201  # Should show: status: stopped

# Shutdown Proxmox host
shutdown -h now

If no K8s cluster, just shutdown:

shutdown -h now

Step 2: Physical Drive Replacement¶

Power off server (or hot-swap if controller supports it)
Remove failed drive from bay
Record the NEW drive's serial number (printed on label)
Install new drive in same bay
Power on server

Step 3: Post-Boot Drive Detection¶

# Wait for boot to complete (2-3 minutes)

# Verify new drive is detected
ls /dev/sd*
lsblk

# Find the new drive's device ID
ls -la /dev/disk/by-id/ | grep ata
# Look for the new serial number you recorded

Step 4: Replace Drive in ZFS Pool¶

# Replace using full device ID (preferred method)
zpool replace vault ata-OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID

# Example:
# zpool replace vault ata-ST4000VN008-2DR166_ZDH8PN1H /dev/disk/by-id/ata-ST4000VX016-3CV104_ZW62ABCD

# If by-id path isn't available immediately:
zpool replace vault ata-OLD_DEVICE_ID /dev/sdc

# Monitor resilver progress
watch zpool status vault

Resilver duration: Expect ~24 hours per 4TB drive.

Step 5: Recover Kubernetes Cluster¶

CRITICAL: After reboot, Longhorn volumes may be in "faulted" state. See Post-Reboot Recovery Guide for details.

# 1. Uncordon K8s nodes (allow pods to schedule)
kubectl uncordon k3s-master
kubectl uncordon k3s-worker-1
kubectl uncordon k3s-worker-2

# 2. Wait 2-3 minutes for Longhorn to stabilize

# 3. Check for faulted volumes
kubectl get volumes -n longhorn-system | grep faulted

# 4. If any faulted volumes, run recovery script
/root/tower-fleet/scripts/recover-longhorn-volumes.sh

# 5. Verify all pods are running
kubectl get pods -A | grep -v Running

Step 6: Verify Services¶

# Check critical workloads
kubectl get pods -n docker-registry
kubectl get pods -n authentik
kubectl get pods -n home-portal
kubectl get pods -n supabase

# Test external access
curl -I https://portal.bogocat.com
nc -zv 10.89.97.201 30500  # docker registry (if applicable)

# Verify Longhorn health
kubectl get volumes -n longhorn-system | grep -v healthy || echo "All volumes healthy"

Step 7: Monitor Resilver to Completion¶

# Monitor resilver progress (runs in background)
watch zpool status vault

# Check periodically
zpool status vault | grep -A 3 "scan:"

# When complete (after ~24 hours):
zpool status vault
# Should show: scan: resilvered XGB in HH:MM:SS with 0 errors

Step 8: Verify Replacement Success¶

# Check pool state
zpool status vault
# State should be ONLINE (if all drives replaced) or DEGRADED (if more need replacing)

# Verify new drive SMART health
smartctl -H /dev/sdc
smartctl -a /dev/sdc | grep -E "Reallocated|Pending|Uncorrectable"

# Run full hardware health check
/root/tower-fleet/scripts/check-hardware-health.sh

Multiple Drive Replacements¶

If replacing multiple failed drives:

Replace one drive at a time
Wait for resilver to complete 100% before replacing next drive
RAIDZ2 can tolerate 2 simultaneous failures, but replacing sequentially is safer
Repeat Steps 1-8 for each drive

Example timeline: - Day 1: Replace drive 1, start resilver - Day 2: Resilver completes, replace drive 2, start resilver - Day 3: All replacements complete, verify pool ONLINE

Quick Reference Card¶

DRIVE REPLACEMENT CHECKLIST

Before shutdown:
[ ] kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data
[ ] qm shutdown 201 && qm shutdown 202 && qm shutdown 203
[ ] shutdown -h now

After boot:
[ ] kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
[ ] /root/tower-fleet/scripts/recover-longhorn-volumes.sh
[ ] kubectl get pods -A | grep -v Running

Replace in ZFS:
[ ] zpool replace vault OLD_DEVICE_ID /dev/NEW_DEVICE

Monitor:
[ ] watch zpool status vault
[ ] /root/tower-fleet/scripts/check-hardware-health.sh

SMART Monitoring¶

Full Drive Health Report¶

# Complete SMART report for a drive
smartctl -a /dev/sda

# Key sections:
# - Model and Serial Number
# - Overall health assessment (PASSED/FAILED)
# - SMART Attributes (reallocated sectors, pending sectors, etc.)
# - Error log (recent read/write errors)
# - Self-test results

Running SMART Self-Tests¶

Short test (2 minutes):

# Start short test
smartctl -t short /dev/sda

# Wait 2 minutes, then check results
smartctl -l selftest /dev/sda

Long test (8+ hours):

# Start extended test
smartctl -t long /dev/sda

# Check progress
smartctl -a /dev/sda | grep "Self-test execution status"

# Check results (after completion)
smartctl -l selftest /dev/sda

Automated weekly tests:

# Add to crontab (run Sunday at 3am when load is low)
0 3 * * 0 for dev in sda sdb sdd sde sdf sdg sdh; do smartctl -t long /dev/$dev; done

SMART Daemon (smartd)¶

Enable smartd for automatic monitoring:

# Install smartmontools (if not present)
apt update && apt install smartmontools

# Edit config
nano /etc/smartd.conf

# Add monitoring for all drives
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../7/03) -m root

# Restart service
systemctl restart smartd
systemctl enable smartd

Config explanation: - -a: Monitor all attributes - -o on: Enable automatic offline testing - -S on: Enable attribute autosave - -n standby,q: Don't wake sleeping drives - -s (S/../.././02|L/../../7/03): Short test daily at 2am, long test Sunday at 3am - -m root: Email root on failures

Prometheus Integration¶

Node Exporter SMART Metrics¶

Install smartmon textfile collector:

# Create script directory
mkdir -p /usr/local/bin

# Download smartmon.sh script
cat > /usr/local/bin/smartmon.sh << 'EOF'
#!/bin/bash
# Prometheus node_exporter textfile collector for SMART metrics

OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/smart.prom"
TEMP_FILE="${OUTPUT_FILE}.$$"

mkdir -p "$(dirname "$OUTPUT_FILE")"

# Clear temp file
> "$TEMP_FILE"

# Iterate over all drives
for device in /dev/sd[a-z]; do
  [ -e "$device" ] || continue

  # Get device info
  model=$(smartctl -i "$device" | awk -F': +' '/Device Model:/ {print $2}' | tr ' ' '_')
  serial=$(smartctl -i "$device" | awk -F': +' '/Serial Number:/ {print $2}')

  # Get SMART health
  health=$(smartctl -H "$device" | awk '/SMART overall-health/ {print $NF}')
  health_value=0
  [ "$health" = "PASSED" ] && health_value=1

  echo "# HELP smart_health SMART overall health (1=PASSED, 0=FAILED)" >> "$TEMP_FILE"
  echo "# TYPE smart_health gauge" >> "$TEMP_FILE"
  echo "smart_health{device=\"$device\",model=\"$model\",serial=\"$serial\"} $health_value" >> "$TEMP_FILE"

  # Get key SMART attributes
  smartctl -A "$device" | awk -v device="$device" -v model="$model" -v serial="$serial" '
  /Reallocated_Sector_Ct/ {print "smart_reallocated_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Current_Pending_Sector/ {print "smart_pending_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Offline_Uncorrectable/ {print "smart_uncorrectable_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Power_On_Hours/ {print "smart_power_on_hours{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Temperature_Celsius/ {print "smart_temperature_celsius{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  ' >> "$TEMP_FILE"
done

# Move to final location atomically
mv "$TEMP_FILE" "$OUTPUT_FILE"
EOF

chmod +x /usr/local/bin/smartmon.sh

# Create node_exporter textfile directory
mkdir -p /var/lib/node_exporter/textfile_collector

# Test script
/usr/local/bin/smartmon.sh
cat /var/lib/node_exporter/textfile_collector/smart.prom

Add to crontab (run every 5 minutes):

crontab -e

# Add:
*/5 * * * * /usr/local/bin/smartmon.sh

Update node_exporter systemd service:

# Edit service file
systemctl edit node-exporter --full

# Add to ExecStart:
ExecStart=/usr/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

# Restart
systemctl restart node-exporter

Prometheus Alert Rules¶

Create SMART alert rules:

# Create alert file
cat > /root/tower-fleet/k8s/prometheus/smart-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-smart-alerts
  namespace: monitoring
data:
  smart-alerts.yaml: |
    groups:
    - name: smart_health
      interval: 5m
      rules:
      - alert: SmartHealthFailed
        expr: smart_health == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Drive SMART health check failed"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}, S/N: {{ $labels.serial }}) has failed SMART health check."

      - alert: SmartReallocatedSectorsHigh
        expr: smart_reallocated_sectors > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "High number of reallocated sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Consider replacing."

      - alert: SmartReallocatedSectorsCritical
        expr: smart_reallocated_sectors > 500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical number of reallocated sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Replace immediately."

      - alert: SmartPendingSectors
        expr: smart_pending_sectors > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Drive has pending sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} pending sectors waiting to be remapped."

      - alert: SmartUncorrectableSectors
        expr: smart_uncorrectable_sectors > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Drive has uncorrectable sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} uncorrectable sectors. Data loss possible. Replace immediately."

      - alert: SmartDriveAgeWarning
        expr: smart_power_on_hours > 43800  # 5 years
        for: 1d
        labels:
          severity: info
        annotations:
          summary: "Drive approaching end of life"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has been powered on for {{ $value }} hours (>5 years). Monitor closely."

      - alert: SmartTemperatureHigh
        expr: smart_temperature_celsius > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Drive temperature high"
          description: "Drive {{ $labels.device }} temperature is {{ $value }}°C (threshold: 50°C). Check cooling."
EOF

# Apply to cluster
kubectl apply -f /root/tower-fleet/k8s/prometheus/smart-alerts.yaml

Add ZFS pool alert rules:

cat > /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-zfs-alerts
  namespace: monitoring
data:
  zfs-alerts.yaml: |
    groups:
    - name: zfs_health
      interval: 5m
      rules:
      - alert: ZFSPoolDegraded
        expr: node_zfs_pool_state{state="DEGRADED"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool degraded"
          description: "ZFS pool {{ $labels.pool }} is in DEGRADED state. One or more devices have failed."

      - alert: ZFSPoolFaulted
        expr: node_zfs_pool_state{state="FAULTED"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool faulted"
          description: "ZFS pool {{ $labels.pool }} is FAULTED. Data loss possible. Immediate action required."

      - alert: ZFSPoolErrors
        expr: increase(node_zfs_pool_errors_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool has errors"
          description: "ZFS pool {{ $labels.pool }} has {{ $value }} errors in the last hour. Type: {{ $labels.type }}."
EOF

kubectl apply -f /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml

Grafana Dashboard¶

Import ZFS + SMART dashboard:

Access Grafana: http://10.89.97.211 (admin/prom-operator)
Go to Dashboards → Import
Create new dashboard with panels:

Panel 1: Drive SMART Health

smart_health{job="node-exporter"}

Panel 2: Reallocated Sectors by Drive

smart_reallocated_sectors{job="node-exporter"}

Panel 3: Pending + Uncorrectable Sectors

smart_pending_sectors{job="node-exporter"} or smart_uncorrectable_sectors{job="node-exporter"}

Panel 4: Drive Temperature

smart_temperature_celsius{job="node-exporter"}

Panel 5: Drive Age (Power-On Hours)

smart_power_on_hours{job="node-exporter"} / 8760

(Displays age in years)

Maintenance Schedule¶

Daily¶

Automated: smartd runs short SMART self-tests (via smartd.conf)
Automated: Prometheus scrapes SMART metrics every 5 minutes

Weekly¶

Automated: Long SMART self-tests on all drives (Sunday 3am)
Manual: Review Grafana hardware health dashboard

Monthly¶

Automated: ZFS scrub (first Sunday, 2am)
Manual: Review ZFS scrub results
Manual: Check drive temperatures and reallocated sectors

Quarterly¶

Manual: Review drive ages and plan replacements for drives >4 years old
Manual: Test alert rules (trigger test alert, verify Discord notification)

Annually¶

Manual: Backup and restore test (disaster recovery)
Manual: Replace drives >5 years old (proactive replacement)

Troubleshooting¶

ZFS Pool Won't Import¶

# Try importing with force
zpool import -f vault

# If that fails, import read-only to recover data
zpool import -o readonly=on vault

# Check pool configuration
zpool import

Drive Not Detected After Replacement¶

# Rescan SCSI bus
echo "- - -" > /sys/class/scsi_host/host0/scan
echo "- - -" > /sys/class/scsi_host/host1/scan

# Check kernel messages
dmesg | tail -50

# List all drives
lsblk
ls /dev/sd*

SMART Data Not Available¶

# Check if SMART is supported
smartctl -i /dev/sda

# Enable SMART
smartctl -s on /dev/sda

# Some USB/RAID controllers hide SMART data
# Use controller-specific tools if needed

High Drive Temperature¶

# Check current temperature
smartctl -a /dev/sda | grep Temperature

# Verify cooling fans are working
# Check system airflow
# Consider adding case fans
# Target: Keep drives < 45°C

Historical Events & Lessons Learned¶

Drive Replacement: 2025-12-08¶

Event: First production drive replacement with active K8s workloads

Failed Drives: - Drive 1: /dev/sdc - ata-ST4000VN008-2DR166_ZDH8PN1H (FAULTED) - Could not read SMART data - Status: UNAVAIL in ZFS pool - Drive 2: /dev/sde - ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y (DEGRADED) - 52,773 power-on hours (6+ years old) - 1 pending sector, 3 uncorrectable sectors - Status: DEGRADED with too many errors

Replacement Drives: - 1x Seagate IronWolf 4TB (ST4000VN006-3CW104_ZW63WH4W) - 1x WD Red Plus 4TB (pending)

Timeline:

2025-12-08 15:30 - Drive replacement initiated 1. Documented pool status: zpool status vault > /root/vault-status-before-replacement.txt 2. Drained K8s nodes:

kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

3. Shutdown K8s VMs: qm shutdown 201 && qm shutdown 202 && qm shutdown 203 4. Shutdown Proxmox host: shutdown -h now

2025-12-08 18:24 - First drive replaced (physical) - Removed failed /dev/sdc (Seagate ST4000VN008 ZDH8PN1H) - Installed new Seagate IronWolf 4TB (ST4000VN006 ZW63WH4W) - Recorded new serial number from drive label

2025-12-08 18:32 - Post-boot recovery 1. Verified new drive detected:

ls -la /dev/disk/by-id/ | grep ata-ST4000VN006-3CW104_ZW63WH4W
# → /dev/sdc

Replaced drive in ZFS pool:

zpool replace vault 2202904836573237535 /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW63WH4W

Uncordoned K8s nodes:

kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

Checked Longhorn volumes:

kubectl get volumes -n longhorn-system | grep faulted
# → No faulted volumes! ✅

Issues Encountered:

ImagePullBackOff on app pods:
Symptom: home-portal, money-tracker, trip-planner stuck pulling images
Root cause: Pods scheduled before docker-registry service was fully ready
Resolution: Deleted stuck pods, they recreated successfully
Time to fix: 5 minutes
Lesson: Wait 2-3 minutes after uncordoning before checking pod status
Supabase storage CrashLoopBackOff:
Symptom: storage pods failing with migration hash mismatch
Root cause: Pre-existing issue, unrelated to drive replacement
Impact: None on main applications
Status: Separate issue to be addressed later

What Worked Well:

✅ Graceful K8s drain - No pod disruptions or data loss ✅ Longhorn auto-recovery - No manual intervention needed (unlike previous reboot) ✅ ZFS resilver - Started immediately, ~24 hour ETA ✅ Documentation - Quick reference card was accurate and helpful ✅ Drive detection - New drive appeared immediately at boot

Key Learnings:

Timing matters: Wait 2-3 minutes after uncordoning for services to stabilize
Longhorn improved: No faulted volumes this time (previous reboot required manual salvage)
Pod restarts safe: Simply deleting stuck pods resolved ImagePullBackOff
Documentation critical: Having step-by-step guide prevented mistakes
One drive at a time: Sequential replacement is the right approach

Next Steps:

[ ] Monitor resilver to completion (~24 hours)
[ ] Verify pool returns to ONLINE state
[ ] Replace second drive (/dev/sde) using same procedure
[ ] Update hardware inventory with new drive serials

Commands Used:

# Before shutdown
kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s
qm shutdown 201 && qm shutdown 202 && qm shutdown 203
shutdown -h now

# After boot
ls -la /dev/disk/by-id/ | grep ata
zpool replace vault OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
kubectl get volumes -n longhorn-system | grep faulted
kubectl get pods -A | grep -v Running

# Fix stuck pods
kubectl delete pod <pod-name> -n <namespace>

# Monitor
zpool status vault | grep scan
/root/tower-fleet/scripts/check-hardware-health.sh

Result: ✅ Successful - All critical services running, resilver in progress

Storage Infrastructure - ZFS and storage architecture
Alert Investigation Guide - Alert investigation workflows
Observability Standards - Monitoring best practices
Disaster Recovery - Backup and recovery procedures

External Resources¶

ZFS Documentation
smartmontools Guide
Backblaze Drive Stats - Drive failure rate data