
Hardware Health Monitoring

Last Updated: 2025-12-08
Status: Production

Overview

This guide covers hardware health monitoring for Tower Fleet infrastructure, focusing on:

  • ZFS pool health - RAIDZ2 array status and integrity
  • SMART drive monitoring - Predictive disk failure detection
  • Automated alerts - Prometheus integration for proactive monitoring
  • Maintenance schedules - Regular scrubs and health checks


Quick Health Check

ZFS Pool Status

# Check all pool health
zpool status

# Detailed status with error counts
zpool status -v

# Check specific pool
zpool status vault
zpool status rpool

What to look for:

  • State: Should be ONLINE (not DEGRADED or FAULTED)
  • READ/WRITE/CKSUM errors: All should be 0
  • Scan status: Last scrub should be recent (< 1 month) with 0 errors

Example healthy output:

pool: vault
 state: ONLINE
  scan: scrub repaired 0B in 08:23:15 with 0 errors on Sun Dec  1 00:33:17 2025
config:
    NAME                                          STATE     READ WRITE CKSUM
    vault                                         ONLINE       0     0     0
      raidz2-0                                    ONLINE       0     0     0
        ata-ST4000VX016-3CV104_ZW61ZP88           ONLINE       0     0     0
        ata-ST4000VX016-3CV104_ZW626RNN           ONLINE       0     0     0
        ...
errors: No known data errors
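
For a quick scripted version of these checks, the sketch below leans only on standard zpool output; the awk column positions assume the five-column NAME/STATE/READ/WRITE/CKSUM layout shown in the example above.

# One-line summary: prints "all pools are healthy" when nothing needs attention
zpool status -x

# Flag any config line with a nonzero READ/WRITE/CKSUM counter (sketch, not exhaustive)
zpool status -v | awk 'NF == 5 && ($3 + $4 + $5) > 0 {print "Errors on", $1, "-", $3, $4, $5}'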

SMART Drive Health

# Quick health check for all drives
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
  echo "=== /dev/$dev ==="
  smartctl -H /dev/$dev 2>/dev/null | grep -i "overall-health"
done

# Detailed check with key metrics
for dev in sda sdb sdc sdd sde sdf sdg sdh; do
  echo "=== /dev/$dev ==="
  smartctl -i /dev/$dev 2>/dev/null | grep -E "(Model|Serial)"
  smartctl -a /dev/$dev 2>/dev/null | grep -E "(Power_On_Hours|Reallocated_Sector|Current_Pending|Offline_Uncorrectable)" | head -4
  echo
done

Critical SMART attributes to monitor:

| Attribute ID | Name | Meaning | Threshold |
|---|---|---|---|
| 5 | Reallocated_Sector_Ct | Bad sectors remapped | > 50 = warning, > 500 = replace |
| 9 | Power_On_Hours | Drive age | > 43,800 (5 years) = watch closely |
| 197 | Current_Pending_Sector | Sectors waiting to be remapped | > 0 = warning |
| 198 | Offline_Uncorrectable | Sectors that can't be read | > 0 = replace soon |
| 199 | UDMA_CRC_Error_Count | Cable/connection errors | > 100 = check cables |
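
A hedged sketch that applies these thresholds across all drives; it assumes the standard smartctl -A attribute table where the raw value is column 10 (some drives append extra annotations to raw values, so treat the output as a hint rather than a verdict).

# Flag attributes exceeding the thresholds in the table above
for dev in /dev/sd[a-h]; do
  [ -e "$dev" ] || continue
  smartctl -A "$dev" 2>/dev/null | awk -v d="$dev" '
    /Reallocated_Sector_Ct/  && $10+0 > 50    {print d": "$2" = "$10" (warning)"}
    /Power_On_Hours/         && $10+0 > 43800 {print d": "$2" = "$10" (>5 years, watch closely)"}
    /Current_Pending_Sector/ && $10+0 > 0     {print d": "$2" = "$10" (warning)"}
    /Offline_Uncorrectable/  && $10+0 > 0     {print d": "$2" = "$10" (replace soon)"}
    /UDMA_CRC_Error_Count/   && $10+0 > 100   {print d": "$2" = "$10" (check cables)"}'
done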

ZFS Pool Management

Understanding Pool States

ONLINE ✅
  • All devices working normally
  • No action required

DEGRADED ⚠️
  • One or more devices faulted
  • Pool still functional but at reduced redundancy
  • RAIDZ2: Can lose 2 drives - if 2 are faulted, pool is at risk
  • Action: Replace faulted drive(s) immediately

FAULTED ❌
  • Too many devices failed, data loss possible
  • Pool may be read-only or offline
  • Action: Emergency recovery procedure required
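
The same mapping can be scripted. A minimal sketch using the pool health property from zpool list:

# Print the recommended action for each pool based on its current health
zpool list -H -o name,health | while read -r pool health; do
  case "$health" in
    ONLINE)   echo "$pool: ONLINE - no action required" ;;
    DEGRADED) echo "$pool: DEGRADED - replace faulted drive(s) immediately" ;;
    *)        echo "$pool: $health - emergency recovery procedure required" ;;
  esac
done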

Drive Mapping

Map ZFS device IDs to physical /dev/sd* devices:

# List all drive mappings
ls -la /dev/disk/by-id/ | grep -E "ata-ST|ata-WDC" | grep -v "part"

# Example output:
# ata-ST4000VN008-2DR166_ZDH8P5G1 -> ../../sdg
# ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y -> ../../sde

This helps correlate ZFS errors with physical drive locations.
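
To go the other direction (from a ZFS device ID back to the current kernel device node), readlink resolves the by-id symlink; the serial below is just the example from the listing above.

# Resolve a by-id device to its current /dev/sdX node
readlink -f /dev/disk/by-id/ata-ST4000VN008-2DR166_ZDH8P5G1
# → /dev/sdg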

ZFS Scrub Operations

What is a scrub?

  • Reads every block in the pool to detect silent corruption
  • Repairs errors using parity/mirrors
  • Should run monthly on production pools

Manual scrub:

# Start scrub
zpool scrub vault

# Check scrub progress
zpool status vault

# Cancel scrub (if needed)
zpool scrub -s vault

Automated monthly scrub:

# Add to crontab
crontab -e

# Run scrub on first Sunday of month at 2am
0 2 1-7 * * [ "$(date +\%u)" -eq 7 ] && zpool scrub vault

Scrub duration: Expect ~8-12 hours for 20TB+ pools.
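
To catch a scrub schedule that has silently stopped working, a check like the sketch below compares the date in the scan line against today. It assumes GNU date and the "scrub repaired ... on <date>" format shown in the example output earlier.

# Warn if the last completed scrub on vault is older than 35 days
last=$(zpool status vault | sed -n 's/.*scrub repaired.* on //p')
if [ -n "$last" ]; then
  age_days=$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 86400 ))
  [ "$age_days" -gt 35 ] && echo "WARNING: last scrub on vault finished $age_days days ago"
else
  echo "WARNING: no completed scrub recorded for vault"
fi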

Replacing Failed Drives

⚠️ IMPORTANT: If you have Kubernetes workloads with Longhorn storage running on VMs, follow the full procedure below to prevent data loss.

Prerequisites

# 1. Document current state
zpool status vault > /root/vault-status-before-replacement.txt
/root/tower-fleet/scripts/check-hardware-health.sh > /root/hardware-before-replacement.txt

# 2. Identify the failed drive
zpool status vault
# Note the device ID (e.g., ata-ST4000VN008-2DR166_ZDH8PN1H)

# 3. Verify drive mapping
ls -la /dev/disk/by-id/ | grep <SERIAL_NUMBER>

Step 1: Graceful Shutdown (K8s Cluster)

If you have K8s VMs running Longhorn storage:

# Drain all K8s nodes (prevents pod disruption during reboot)
kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=120s
kubectl drain k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

# Shutdown K8s VMs gracefully
qm shutdown 201 && qm shutdown 202 && qm shutdown 203

# Wait for VMs to stop (verify status)
qm status 201  # Should show: status: stopped

# Shutdown Proxmox host
shutdown -h now

If no K8s cluster, just shutdown:

shutdown -h now

Step 2: Physical Drive Replacement

  1. Power off server (or hot-swap if controller supports it)
  2. Remove failed drive from bay
  3. Record the NEW drive's serial number (printed on label)
  4. Install new drive in same bay
  5. Power on server

Step 3: Post-Boot Drive Detection

# Wait for boot to complete (2-3 minutes)

# Verify new drive is detected
ls /dev/sd*
lsblk

# Find the new drive's device ID
ls -la /dev/disk/by-id/ | grep ata
# Look for the new serial number you recorded

Step 4: Replace Drive in ZFS Pool

# Replace using full device ID (preferred method)
zpool replace vault ata-OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID

# Example:
# zpool replace vault ata-ST4000VN008-2DR166_ZDH8PN1H /dev/disk/by-id/ata-ST4000VX016-3CV104_ZW62ABCD

# If by-id path isn't available immediately:
zpool replace vault ata-OLD_DEVICE_ID /dev/sdc

# Monitor resilver progress
watch zpool status vault

Resilver duration: Expect ~24 hours per 4TB drive.

Step 5: Recover Kubernetes Cluster

CRITICAL: After reboot, Longhorn volumes may be in "faulted" state. See Post-Reboot Recovery Guide for details.

# 1. Uncordon K8s nodes (allow pods to schedule)
kubectl uncordon k3s-master
kubectl uncordon k3s-worker-1
kubectl uncordon k3s-worker-2

# 2. Wait 2-3 minutes for Longhorn to stabilize

# 3. Check for faulted volumes
kubectl get volumes -n longhorn-system | grep faulted

# 4. If any faulted volumes, run recovery script
/root/tower-fleet/scripts/recover-longhorn-volumes.sh

# 5. Verify all pods are running
kubectl get pods -A | grep -v Running

Step 6: Verify Services

# Check critical workloads
kubectl get pods -n docker-registry
kubectl get pods -n authentik
kubectl get pods -n home-portal
kubectl get pods -n supabase

# Test external access
curl -I https://portal.bogocat.com
nc -zv 10.89.97.201 30500  # docker registry (if applicable)

# Verify Longhorn health
kubectl get volumes -n longhorn-system | grep -v healthy || echo "All volumes healthy"

Step 7: Monitor Resilver to Completion

# Monitor resilver progress (runs in background)
watch zpool status vault

# Check periodically
zpool status vault | grep -A 3 "scan:"

# When complete (after ~24 hours):
zpool status vault
# Should show: scan: resilvered XGB in HH:MM:SS with 0 errors

Step 8: Verify Replacement Success

# Check pool state
zpool status vault
# State should be ONLINE (if all drives replaced) or DEGRADED (if more need replacing)

# Verify new drive SMART health
smartctl -H /dev/sdc
smartctl -a /dev/sdc | grep -E "Reallocated|Pending|Uncorrectable"

# Run full hardware health check
/root/tower-fleet/scripts/check-hardware-health.sh

Multiple Drive Replacements

If replacing multiple failed drives:

  1. Replace one drive at a time
  2. Wait for resilver to complete 100% before replacing the next drive (see the polling sketch after the example timeline below)
  3. RAIDZ2 can tolerate 2 simultaneous failures, but replacing sequentially is safer
  4. Repeat Steps 1-8 for each drive

Example timeline:

  • Day 1: Replace drive 1, start resilver
  • Day 2: Resilver completes, replace drive 2, start resilver
  • Day 3: All replacements complete, verify pool ONLINE
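
As referenced in step 2 above, waiting for the resilver can be automated by polling zpool status for the "resilver in progress" text; a minimal sketch:

# Block until the current resilver finishes, then show the scan result
while zpool status vault | grep -q "resilver in progress"; do
  sleep 600   # re-check every 10 minutes
done
zpool status vault | grep -A 3 "scan:"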

Quick Reference Card

DRIVE REPLACEMENT CHECKLIST

Before shutdown:
[ ] kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data
[ ] qm shutdown 201 && qm shutdown 202 && qm shutdown 203
[ ] shutdown -h now

After boot:
[ ] kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
[ ] /root/tower-fleet/scripts/recover-longhorn-volumes.sh
[ ] kubectl get pods -A | grep -v Running

Replace in ZFS:
[ ] zpool replace vault OLD_DEVICE_ID /dev/NEW_DEVICE

Monitor:
[ ] watch zpool status vault
[ ] /root/tower-fleet/scripts/check-hardware-health.sh

SMART Monitoring

Full Drive Health Report

# Complete SMART report for a drive
smartctl -a /dev/sda

# Key sections:
# - Model and Serial Number
# - Overall health assessment (PASSED/FAILED)
# - SMART Attributes (reallocated sectors, pending sectors, etc.)
# - Error log (recent read/write errors)
# - Self-test results

Running SMART Self-Tests

Short test (2 minutes):

# Start short test
smartctl -t short /dev/sda

# Wait 2 minutes, then check results
smartctl -l selftest /dev/sda

Long test (8+ hours):

# Start extended test
smartctl -t long /dev/sda

# Check progress
smartctl -a /dev/sda | grep "Self-test execution status"

# Check results (after completion)
smartctl -l selftest /dev/sda

Automated weekly tests:

# Add to crontab (run Sunday at 3am when load is low)
0 3 * * 0 for dev in sda sdb sdc sdd sde sdf sdg sdh; do smartctl -t long /dev/$dev; done

SMART Daemon (smartd)

Enable smartd for automatic monitoring:

# Install smartmontools (if not present)
apt update && apt install smartmontools

# Edit config
nano /etc/smartd.conf

# Add monitoring for all drives
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../7/03) -m root

# Restart service
systemctl restart smartd
systemctl enable smartd

Config explanation:

  • -a: Monitor all attributes
  • -o on: Enable automatic offline testing
  • -S on: Enable attribute autosave
  • -n standby,q: Don't wake sleeping drives
  • -s (S/../.././02|L/../../7/03): Short test daily at 2am, long test Sunday at 3am
  • -m root: Email root on failures
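
To confirm smartd actually registered the drives and the test schedule after the restart, checking the service status and recent journal entries is usually enough (exact log wording varies between smartmontools versions):

# Verify smartd is running and review what it registered at startup
systemctl status smartd --no-pager
journalctl -u smartd -n 30 --no-pager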


Prometheus Integration

Node Exporter SMART Metrics

Install smartmon textfile collector:

# Create script directory
mkdir -p /usr/local/bin

# Download smartmon.sh script
cat > /usr/local/bin/smartmon.sh << 'EOF'
#!/bin/bash
# Prometheus node_exporter textfile collector for SMART metrics

OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/smart.prom"
TEMP_FILE="${OUTPUT_FILE}.$$"

mkdir -p "$(dirname "$OUTPUT_FILE")"

# Clear temp file and write metric headers once
# (the Prometheus text format rejects duplicate HELP/TYPE lines for the same metric)
> "$TEMP_FILE"
echo "# HELP smart_health SMART overall health (1=PASSED, 0=FAILED)" >> "$TEMP_FILE"
echo "# TYPE smart_health gauge" >> "$TEMP_FILE"

# Iterate over all drives
for device in /dev/sd[a-z]; do
  [ -e "$device" ] || continue

  # Get device info
  model=$(smartctl -i "$device" | awk -F': +' '/Device Model:/ {print $2}' | tr ' ' '_')
  serial=$(smartctl -i "$device" | awk -F': +' '/Serial Number:/ {print $2}')

  # Get SMART health
  health=$(smartctl -H "$device" | awk '/SMART overall-health/ {print $NF}')
  health_value=0
  [ "$health" = "PASSED" ] && health_value=1

  echo "# HELP smart_health SMART overall health (1=PASSED, 0=FAILED)" >> "$TEMP_FILE"
  echo "# TYPE smart_health gauge" >> "$TEMP_FILE"
  echo "smart_health{device=\"$device\",model=\"$model\",serial=\"$serial\"} $health_value" >> "$TEMP_FILE"

  # Get key SMART attributes
  smartctl -A "$device" | awk -v device="$device" -v model="$model" -v serial="$serial" '
  /Reallocated_Sector_Ct/ {print "smart_reallocated_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Current_Pending_Sector/ {print "smart_pending_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Offline_Uncorrectable/ {print "smart_uncorrectable_sectors{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Power_On_Hours/ {print "smart_power_on_hours{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  /Temperature_Celsius/ {print "smart_temperature_celsius{device=\""device"\",model=\""model"\",serial=\""serial"\"} "$10}
  ' >> "$TEMP_FILE"
done

# Move to final location atomically
mv "$TEMP_FILE" "$OUTPUT_FILE"
EOF

chmod +x /usr/local/bin/smartmon.sh

# Create node_exporter textfile directory
mkdir -p /var/lib/node_exporter/textfile_collector

# Test script
/usr/local/bin/smartmon.sh
cat /var/lib/node_exporter/textfile_collector/smart.prom

Add to crontab (run every 5 minutes):

crontab -e

# Add:
*/5 * * * * /usr/local/bin/smartmon.sh

Update node_exporter systemd service:

# Edit service file
systemctl edit node-exporter --full

# Add to ExecStart:
ExecStart=/usr/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

# Restart
systemctl restart node-exporter
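
Quick verification that the textfile collector is picking up the script's output, assuming node_exporter listens on its default port 9100:

# The smart_* series should appear alongside the built-in node_* metrics
curl -s http://localhost:9100/metrics | grep '^smart_'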

Prometheus Alert Rules

Create SMART alert rules:

# Create alert file
cat > /root/tower-fleet/k8s/prometheus/smart-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-smart-alerts
  namespace: monitoring
data:
  smart-alerts.yaml: |
    groups:
    - name: smart_health
      interval: 5m
      rules:
      - alert: SmartHealthFailed
        expr: smart_health == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Drive SMART health check failed"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}, S/N: {{ $labels.serial }}) has failed SMART health check."

      - alert: SmartReallocatedSectorsHigh
        expr: smart_reallocated_sectors > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "High number of reallocated sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Consider replacing."

      - alert: SmartReallocatedSectorsCritical
        expr: smart_reallocated_sectors > 500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical number of reallocated sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} reallocated sectors. Replace immediately."

      - alert: SmartPendingSectors
        expr: smart_pending_sectors > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Drive has pending sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} pending sectors waiting to be remapped."

      - alert: SmartUncorrectableSectors
        expr: smart_uncorrectable_sectors > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Drive has uncorrectable sectors"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has {{ $value }} uncorrectable sectors. Data loss possible. Replace immediately."

      - alert: SmartDriveAgeWarning
        expr: smart_power_on_hours > 43800  # 5 years
        for: 1d
        labels:
          severity: info
        annotations:
          summary: "Drive approaching end of life"
          description: "Drive {{ $labels.device }} ({{ $labels.model }}) has been powered on for {{ $value }} hours (>5 years). Monitor closely."

      - alert: SmartTemperatureHigh
        expr: smart_temperature_celsius > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Drive temperature high"
          description: "Drive {{ $labels.device }} temperature is {{ $value }}°C (threshold: 50°C). Check cooling."
EOF

# Apply to cluster
kubectl apply -f /root/tower-fleet/k8s/prometheus/smart-alerts.yaml

Add ZFS pool alert rules:

cat > /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-zfs-alerts
  namespace: monitoring
data:
  zfs-alerts.yaml: |
    groups:
    - name: zfs_health
      interval: 5m
      rules:
      - alert: ZFSPoolDegraded
        expr: node_zfs_pool_state{state="DEGRADED"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool degraded"
          description: "ZFS pool {{ $labels.pool }} is in DEGRADED state. One or more devices have failed."

      - alert: ZFSPoolFaulted
        expr: node_zfs_pool_state{state="FAULTED"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool faulted"
          description: "ZFS pool {{ $labels.pool }} is FAULTED. Data loss possible. Immediate action required."

      - alert: ZFSPoolErrors
        expr: increase(node_zfs_pool_errors_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool has errors"
          description: "ZFS pool {{ $labels.pool }} has {{ $value }} errors in the last hour. Type: {{ $labels.type }}."
EOF

kubectl apply -f /root/tower-fleet/k8s/prometheus/zfs-alerts.yaml
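
To confirm both ConfigMaps landed in the cluster (whether Prometheus actually loads them depends on how rule files are mounted and selected in your Prometheus deployment):

# Verify both alert ConfigMaps exist in the monitoring namespace
kubectl get configmap prometheus-smart-alerts prometheus-zfs-alerts -n monitoring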

Grafana Dashboard

Import ZFS + SMART dashboard:

  1. Access Grafana: http://10.89.97.211 (admin/prom-operator)
  2. Go to Dashboards → Import
  3. Create new dashboard with panels:

Panel 1: Drive SMART Health

smart_health{job="node-exporter"}

Panel 2: Reallocated Sectors by Drive

smart_reallocated_sectors{job="node-exporter"}

Panel 3: Pending + Uncorrectable Sectors

smart_pending_sectors{job="node-exporter"} or smart_uncorrectable_sectors{job="node-exporter"}

Panel 4: Drive Temperature

smart_temperature_celsius{job="node-exporter"}

Panel 5: Drive Age (Power-On Hours)

smart_power_on_hours{job="node-exporter"} / 8760
(Displays age in years)


Maintenance Schedule

Daily

  • Automated: smartd runs short SMART self-tests (via smartd.conf)
  • Automated: Prometheus scrapes SMART metrics every 5 minutes

Weekly

  • Automated: Long SMART self-tests on all drives (Sunday 3am)
  • Manual: Review Grafana hardware health dashboard

Monthly

  • Automated: ZFS scrub (first Sunday, 2am)
  • Manual: Review ZFS scrub results
  • Manual: Check drive temperatures and reallocated sectors

Quarterly

  • Manual: Review drive ages and plan replacements for drives >4 years old
  • Manual: Test alert rules (trigger test alert, verify Discord notification)

Annually

  • Manual: Backup and restore test (disaster recovery)
  • Manual: Replace drives >5 years old (proactive replacement)

Troubleshooting

ZFS Pool Won't Import

# Try importing with force
zpool import -f vault

# If that fails, import read-only to recover data
zpool import -o readonly=on vault

# Check pool configuration
zpool import

Drive Not Detected After Replacement

# Rescan SCSI bus
echo "- - -" > /sys/class/scsi_host/host0/scan
echo "- - -" > /sys/class/scsi_host/host1/scan

# Check kernel messages
dmesg | tail -50

# List all drives
lsblk
ls /dev/sd*

SMART Data Not Available

# Check if SMART is supported
smartctl -i /dev/sda

# Enable SMART
smartctl -s on /dev/sda

# Some USB/RAID controllers hide SMART data
# Use controller-specific tools if needed

High Drive Temperature

# Check current temperature
smartctl -a /dev/sda | grep Temperature

# Verify cooling fans are working
# Check system airflow
# Consider adding case fans
# Target: Keep drives < 45°C

Historical Events & Lessons Learned

Drive Replacement: 2025-12-08

Event: First production drive replacement with active K8s workloads

Failed Drives:

  • Drive 1: /dev/sdc - ata-ST4000VN008-2DR166_ZDH8PN1H (FAULTED)
    • Could not read SMART data
    • Status: UNAVAIL in ZFS pool
  • Drive 2: /dev/sde - ata-WDC_WD40EZRZ-22GXCB0_WD-WCC7K2NCDZ8Y (DEGRADED)
    • 52,773 power-on hours (6+ years old)
    • 1 pending sector, 3 uncorrectable sectors
    • Status: DEGRADED with too many errors

Replacement Drives:

  • 1x Seagate IronWolf 4TB (ST4000VN006-3CW104_ZW63WH4W)
  • 1x WD Red Plus 4TB (pending)

Timeline:

2025-12-08 15:30 - Drive replacement initiated

  1. Documented pool status: zpool status vault > /root/vault-status-before-replacement.txt
  2. Drained K8s nodes:

     kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

  3. Shutdown K8s VMs: qm shutdown 201 && qm shutdown 202 && qm shutdown 203
  4. Shutdown Proxmox host: shutdown -h now

2025-12-08 18:24 - First drive replaced (physical)

  • Removed failed /dev/sdc (Seagate ST4000VN008 ZDH8PN1H)
  • Installed new Seagate IronWolf 4TB (ST4000VN006 ZW63WH4W)
  • Recorded new serial number from drive label

2025-12-08 18:32 - Post-boot recovery

  1. Verified new drive detected:

     ls -la /dev/disk/by-id/ | grep ata-ST4000VN006-3CW104_ZW63WH4W
     # → /dev/sdc

  2. Replaced drive in ZFS pool:

     zpool replace vault 2202904836573237535 /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW63WH4W

  3. Uncordoned K8s nodes:

     kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

  4. Checked Longhorn volumes:

     kubectl get volumes -n longhorn-system | grep faulted
     # → No faulted volumes! ✅

Issues Encountered:

  1. ImagePullBackOff on app pods:
     • Symptom: home-portal, money-tracker, trip-planner stuck pulling images
     • Root cause: Pods scheduled before the docker-registry service was fully ready
     • Resolution: Deleted stuck pods; they recreated successfully
     • Time to fix: 5 minutes
     • Lesson: Wait 2-3 minutes after uncordoning before checking pod status

  2. Supabase storage CrashLoopBackOff:
     • Symptom: storage pods failing with a migration hash mismatch
     • Root cause: Pre-existing issue, unrelated to the drive replacement
     • Impact: None on main applications
     • Status: Separate issue to be addressed later

What Worked Well:

  • ✅ Graceful K8s drain - No pod disruptions or data loss
  • ✅ Longhorn auto-recovery - No manual intervention needed (unlike previous reboot)
  • ✅ ZFS resilver - Started immediately, ~24 hour ETA
  • ✅ Documentation - Quick reference card was accurate and helpful
  • ✅ Drive detection - New drive appeared immediately at boot

Key Learnings:

  1. Timing matters: Wait 2-3 minutes after uncordoning for services to stabilize
  2. Longhorn improved: No faulted volumes this time (previous reboot required manual salvage)
  3. Pod restarts safe: Simply deleting stuck pods resolved ImagePullBackOff
  4. Documentation critical: Having step-by-step guide prevented mistakes
  5. One drive at a time: Sequential replacement is the right approach

Next Steps:

  • [ ] Monitor resilver to completion (~24 hours)
  • [ ] Verify pool returns to ONLINE state
  • [ ] Replace second drive (/dev/sde) using same procedure
  • [ ] Update hardware inventory with new drive serials

Commands Used:

# Before shutdown
kubectl drain k3s-master k3s-worker-1 k3s-worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s
qm shutdown 201 && qm shutdown 202 && qm shutdown 203
shutdown -h now

# After boot
ls -la /dev/disk/by-id/ | grep ata
zpool replace vault OLD_DEVICE_ID /dev/disk/by-id/ata-NEW_DEVICE_ID
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
kubectl get volumes -n longhorn-system | grep faulted
kubectl get pods -A | grep -v Running

# Fix stuck pods
kubectl delete pod <pod-name> -n <namespace>

# Monitor
zpool status vault | grep scan
/root/tower-fleet/scripts/check-hardware-health.sh

Result: Successful - All critical services running, resilver in progress



External Resources