
Post-Reboot Recovery Guide

Complete guide for recovering the K8s cluster after Proxmox host reboots.

Last Updated: 2025-12-15
Affected Systems: K8s cluster (VMs 201-203), Supabase PostgreSQL, GPU passthrough VMs


Quick Recovery Checklist

Run these commands immediately after host reboot:

# 1. Uncordon K8s nodes (often stuck in SchedulingDisabled)
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

# 2. Check for Longhorn faulted volumes
kubectl get volumes -n longhorn-system | grep faulted
# If any faulted: /root/tower-fleet/scripts/recover-longhorn-volumes.sh

# 3. Check PostgreSQL connections (if pods crashing)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;"
# If near 100: kubectl rollout restart statefulset -n supabase postgres

# 4. Check GPU passthrough VM
qm status 100
# If won't start: See GPU Passthrough Recovery section below

Pre-Reboot Preparation

Before rebooting, follow the shutdown procedure to prevent issues:

  • Host Shutdown Procedure

This prevents connection pool exhaustion and GPU passthrough failures.
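
For quick reference, a minimal sketch of the two pre-reboot steps that matter most here (the authentik deployment names match those used in the recovery sections below; adjust the list of apps to scale down for your workloads):

# Scale down the heaviest PostgreSQL consumers before rebooting
kubectl scale deployment -n authentik authentik-server authentik-worker --replicas=0

# Gracefully shut down the GPU passthrough VM (never force-stop it)
qm shutdown 100 --timeout 120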


Overview

This guide documents recovery procedures learned from multiple reboots:

  • Dec 8, 2025: Proxmox upgrade - Longhorn volume salvage issues
  • Dec 12, 2025: Drive replacement - K8s nodes cordoned, PostgreSQL connection storm, GPU passthrough failure

What Happened

Context:

  • Proxmox host rebooted at 18:29 UTC on 2025-12-08
  • Part of Proxmox 8→9 upgrade preparation (updated to 8.4.14)
  • First reboot with production K8s workloads running

Symptoms:

  • K8s cluster came back online (all 3 nodes Ready)
  • Longhorn volumes stuck in "faulted" state
  • Pods using Longhorn PVCs stuck in ContainerCreating
  • Services unavailable: docker-registry, authentik-redis, home-portal

Impact:

  • portal.bogocat.com - 503 error
  • Docker registry (10.89.97.201:30500) - connection refused
  • home-portal - ImagePullBackOff
  • authentik - Redis unavailable


Root Cause Analysis

Technical Details

Primary Issue: Longhorn auto-salvage failure after unexpected volume detachment

What happened:

  1. Proxmox host rebooted at 18:29 UTC
  2. K8s VMs (201-203) rebooted cleanly
  3. Longhorn detected volumes detached unexpectedly
  4. Auto-salvage triggered but brought up 0 replicas
  5. Volumes remained in "faulted" state indefinitely

Key Evidence:

Longhorn manager logs:
- "Engine of volume pvc-45decf57 dead unexpectedly, setting v.Status.Robustness to faulted"
- "All replicas are failed, auto-salvaging volume"
- "Bringing up 0 replicas for auto-salvage"  ← THE PROBLEM

Replica Status:

{
  "failedAt": "2025-12-08T18:29:19Z",  // Marked failed at reboot
  "lastHealthyAt": "2025-11-27T14:25:18Z",  // Last known good state
  "currentState": "stopped",
  "salvageRequested": false  // ← Not set, so salvage skipped
}
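
The same fields can be dumped for every replica of the affected volume (field names as shown in the JSON above; grep just keeps the output readable):

kubectl get replicas -n longhorn-system \
  -l longhornvolume=pvc-45decf57-e6bc-4139-92d5-9410b8da79ac -o yaml | \
  grep -E "failedAt|lastHealthyAt|currentState|salvageRequested"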

Why 0 Replicas?

Longhorn's auto-salvage logic (v1.10.0) has a critical flaw:

  • The engine has salvageRequested: true
  • But the individual replicas have salvageRequested: false
  • Auto-salvage counts replicas with salvageRequested: true → finds 0
  • Without replicas to salvage, the volume stays faulted
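
You can see the mismatch directly on an affected volume (a sketch; it assumes the engine objects carry the same longhornvolume label as the replicas, which is the Longhorn default):

VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
echo "engine:"
kubectl get engines -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'
echo "replicas:"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'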

Underlying Causes:

  1. Disk Pressure on k3s-master (see the sketch after this list for how to pull these numbers):

    ScheduledTotal = 296GB (Size + StorageScheduled) >
    ProvisionedLimit = 291GB (100% of StorageMax - StorageReserved)

    • Disk at 51% usage, but the Longhorn reservation prevents scheduling
    • May prevent replica recovery

  2. Disk Monitor Mismatch:

    "Failed to sync with disk monitor due to mismatching disks"
    node=k3s-master

    • Non-fatal but indicates Longhorn node state desync

  3. Replica Data Intact:

    • Physical files exist on all nodes
    • /var/lib/longhorn/replicas/pvc-*/volume-head-001.img present
    • Last modified during normal operations, not corrupted
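
To pull the numbers behind that comparison, a sketch using the same nodes.longhorn.io resource referenced later in Troubleshooting (requires jq; values are reported per disk):

kubectl get nodes.longhorn.io -n longhorn-system k3s-master -o json | jq '{
  reserved:  [.spec.disks[].storageReserved],
  maximum:   [.status.diskStatus[].storageMaximum],
  scheduled: [.status.diskStatus[].storageScheduled]
}'
# Per the formula above: ProvisionedLimit = (maximum - reserved) x over-provisioning % (100% here)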

Recovery Procedure

Step 1: Verify Cluster Health

# Check K8s nodes
kubectl get nodes
# All should be Ready

# Check Longhorn system
kubectl get pods -n longhorn-system
# All should be Running (may have recent restarts from reboot)

# Identify faulted volumes
kubectl get volumes -n longhorn-system | grep faulted

Step 2: Manual Salvage (Replicas)

For each faulted volume, manually trigger replica salvage:

# Get list of replicas for the volume
VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME

# Set salvageRequested on each replica
for REPLICA in $(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME -o name); do
  kubectl patch -n longhorn-system $REPLICA --type='json' \
    -p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]'
done

# Verify salvage requested
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'

Step 3: Trigger Volume Reattachment

# Delete stuck pods to force volume reattachment
kubectl get pods -A | grep ContainerCreating
kubectl delete pod <pod-name> -n <namespace>

# Monitor volume status
watch "kubectl get volumes -n longhorn-system | grep $VOLUME_NAME"
# Should transition: faulted → detached → attached → healthy
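
As an alternative to watching manually, a hedged one-liner that blocks until the volume reports healthy (requires kubectl 1.23+ for jsonpath-based waits):

kubectl wait volume/$VOLUME_NAME -n longhorn-system \
  --for=jsonpath='{.status.robustness}'=healthy --timeout=300s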

Step 4: Verify Services

# Check pods are running
kubectl get pods -n docker-registry
kubectl get pods -n authentik -l app.kubernetes.io/component=master
kubectl get pods -n home-portal

# Test registry
nc -zv 10.89.97.201 30500

# Test portal
curl -I https://portal.bogocat.com

K8s Node Recovery (SchedulingDisabled)

Symptoms

After reboot, nodes show SchedulingDisabled or Ready,SchedulingDisabled:

kubectl get nodes
# NAME           STATUS                     ROLES                  AGE   VERSION
# k3s-master     Ready,SchedulingDisabled   control-plane,master   32d   v1.33.5+k3s1
# k3s-worker-1   Ready,SchedulingDisabled   <none>                 32d   v1.33.5+k3s1

All pods are stuck in Pending because no node accepts scheduling.

Cause

K8s nodes were cordoned (drained) before shutdown, or k3s marks them unschedulable during unexpected shutdown.
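
Either way, the cordon is recorded as the node's spec.unschedulable flag, which you can confirm directly:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'
# "true" means the node is cordoned; uncordon clears the flag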

Recovery

# Uncordon all nodes
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

# Verify
kubectl get nodes
# All should show just "Ready" without SchedulingDisabled

PostgreSQL Connection Pool Recovery

Symptoms

Pods crash with connection errors:

FATAL: remaining connection slots are reserved for non-replication superuser connections

Or pods in CrashLoopBackOff with database connection failures.

Cause

After reboot, all pods reconnect to PostgreSQL simultaneously:

  • Supabase PostgreSQL has max_connections=100
  • Authentik alone uses 50-95 connections
  • Add other apps → exceeds the limit
  • New pods can't connect → crash → retry → connection storm

Diagnosis

# Check current connection count
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename ORDER BY count DESC;"

# Example problematic output:
#  usename   | count
# -----------+-------
#  authentik |    95  ← Too many!
#  postgres  |     7
#            |     4

Recovery

# Option 1: Restart PostgreSQL (clears all connections)
kubectl rollout restart statefulset -n supabase postgres
sleep 30

# Then restart crashing pods to clear backoff
kubectl delete pods -n authentik -l app.kubernetes.io/name=authentik
kubectl delete pods -n supabase -l app=kong

# Option 2: If specific app is hogging connections, restart just that app
kubectl rollout restart deployment -n authentik authentik-server
kubectl rollout restart deployment -n authentik authentik-worker
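
# Option 3 (not in the original runbook): if a full PostgreSQL restart is too
# disruptive, terminate only the offending user's idle sessions (authentik here,
# per the diagnosis above); confirm the app tolerates dropped connections first
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE usename = 'authentik' AND state = 'idle' AND pid <> pg_backend_pid();"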

Prevention

Short-term: Use the shutdown procedure to scale down apps before reboot.

Implemented: PgBouncer connection pooler deployed (2025-12-15). See PgBouncer Migration.


GPU Passthrough Recovery (VM 100 - arr-stack)

Symptoms

Arr-stack VM won't start or network unreachable after host reboot:

qm start 100
# error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
# failed to reset PCI device '0000:0b:00.0'

Or VM starts but has no network connectivity:

ping 10.89.97.50
# Destination Host Unreachable

Cause

NVIDIA GPUs (Quadro M2000 in this case) don't support proper PCI Function Level Reset (FLR); a quick check is shown after this list:

  1. VM shutdown doesn't fully release the GPU
  2. Force-stop makes it worse
  3. The GPU gets stuck in a "dirty" state
  4. It can't be reassigned until the host is power cycled
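
You can confirm the missing reset capability from the Proxmox host ("FLReset-" in the device capabilities means the card cannot perform a function-level reset; the PCI address matches the error above):

lspci -vvs 0000:0b:00.0 | grep -i "FLReset"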

Recovery Options

Option 1: Boot without GPU (immediate, temporary)

# Remove GPU passthrough
qm stop 100 --timeout 10 2>/dev/null
qm set 100 -delete hostpci0

# Start VM
qm start 100

# Verify arr-stack works (Tdarr won't have hardware transcoding)
ssh root@10.89.97.50 "docker ps"

Option 2: Try GPU reset (may work)

# Stop VM first
qm stop 100 --timeout 10 2>/dev/null

# Try to reset GPU
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/reset 2>/dev/null

# Try starting VM
qm start 100

Option 3: Full host power cycle (guaranteed)

Only option if the GPU is stuck. You must:

  1. Shut down the host completely: poweroff
  2. Wait 30+ seconds (let capacitors discharge)
  3. Power on

Re-enabling GPU After Fix

After host power cycle or successful reset:

# Re-add GPU passthrough
qm set 100 -hostpci0 0b:00,pcie=1

# Start VM
qm start 100

# Verify Tdarr can use GPU
ssh root@10.89.97.50 "docker logs tdarr_node 2>&1 | grep -i nvenc"

Prevention

Always use graceful VM shutdown before host reboot:

qm shutdown 100  # NOT qm stop 100

See Host Shutdown Procedure for complete pre-reboot checklist.


Automated Recovery Script

Location: /root/tower-fleet/scripts/recover-longhorn-volumes.sh

#!/bin/bash
# Automated Longhorn volume recovery after reboot

set -e

echo "=== Longhorn Post-Reboot Recovery ==="
echo "Started: $(date)"

# Find all faulted volumes
FAULTED_VOLUMES=$(kubectl get volumes -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')

if [ -z "$FAULTED_VOLUMES" ]; then
  echo "✓ No faulted volumes found"
  exit 0
fi

echo "Found faulted volumes:"
echo "$FAULTED_VOLUMES"
echo ""

# Process each faulted volume
for VOLUME in $FAULTED_VOLUMES; do
  echo "Processing volume: $VOLUME"

  # Get replicas for this volume
  REPLICAS=$(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o name)

  if [ -z "$REPLICAS" ]; then
    echo "  ⚠️  No replicas found"
    continue
  fi

  echo "  Found $(echo "$REPLICAS" | wc -l) replicas"

  # Enable salvage on each replica
  for REPLICA in $REPLICAS; do
    echo "  Setting salvageRequested on $REPLICA"
    kubectl patch -n longhorn-system $REPLICA --type='json' \
      -p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]' \
      2>&1 | grep -v "no change" || true
  done

  echo "  ✓ Salvage requested for all replicas"
  echo ""
done

echo "=== Waiting for volumes to recover (30s) ==="
sleep 30

# Check results
STILL_FAULTED=$(kubectl get volumes -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')

if [ -z "$STILL_FAULTED" ]; then
  echo "✓ All volumes recovered successfully"
else
  echo "⚠️  Still faulted volumes:"
  echo "$STILL_FAULTED"
  echo ""
  echo "Manual intervention may be required"
  exit 1
fi

echo "Completed: $(date)"

Usage:

chmod +x /root/tower-fleet/scripts/recover-longhorn-volumes.sh
/root/tower-fleet/scripts/recover-longhorn-volumes.sh


Post-Reboot Checklist

Use this checklist after every Proxmox host reboot:

Immediate (Within 2 minutes)

  • [ ] Uncordon K8s nodes (often stuck in SchedulingDisabled)

    kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
    kubectl get nodes  # All should be Ready (no SchedulingDisabled)
    

  • [ ] Check Longhorn volume health

    kubectl get volumes -n longhorn-system | grep faulted
    # If any faulted:
    /root/tower-fleet/scripts/recover-longhorn-volumes.sh
    

  • [ ] Check for PostgreSQL connection pool exhaustion

    kubectl get pods -n authentik | grep -E "CrashLoop|Error"
    kubectl get pods -n supabase | grep -E "CrashLoop|Error"
    # If crashing pods with connection errors:
    kubectl rollout restart statefulset -n supabase postgres
    sleep 30
    kubectl delete pods -n authentik -l app.kubernetes.io/name=authentik
    

VMs (Within 5 minutes)

  • [ ] Verify arr-stack VM (GPU passthrough)

    qm status 100  # Should be running
    ping -c 2 10.89.97.50  # Should respond
    # If not responding, see GPU Passthrough Recovery section
    

  • [ ] Verify game-servers VM

    qm status 360
    ping -c 2 10.89.97.60
    

Services (Within 15 minutes)

  • [ ] Verify critical pods are running

    kubectl get pods -n docker-registry
    kubectl get pods -n authentik
    kubectl get pods -n home-portal
    kubectl get pods -n supabase
    # All should be Running with no restarts accumulating
    

  • [ ] Test external access

    curl -I https://portal.bogocat.com
    nc -zv 10.89.97.201 30500  # docker registry
    

  • [ ] Check Ingress controller

    kubectl get pods -n ingress-nginx
    kubectl get ingress -A
    

  • [ ] Verify arr-stack services

    ssh root@10.89.97.50 "docker ps --format 'table {{.Names}}\t{{.Status}}' | head -15"
    

Follow-up (Within 1 hour)

  • [ ] Review Longhorn manager logs

    kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100 | grep -i error
    

  • [ ] Check disk space on nodes

    for NODE in k3s-master k3s-worker-1 k3s-worker-2; do
      echo "=== $NODE ==="
      ssh root@$NODE "df -h /var/lib/longhorn"
    done
    

  • [ ] Verify Longhorn UI accessible

    kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
    # Access: http://localhost:8080
    


Prevention & Long-term Fixes

Immediate Actions

  1. Add recovery script to systemd (auto-run after reboot)

    cat > /etc/systemd/system/longhorn-recovery.service <<'EOF'
    [Unit]
    Description=Longhorn Volume Recovery After Reboot
    After=k3s.service
    Requires=k3s.service
    
    [Service]
    Type=oneshot
    ExecStartPre=/bin/sleep 60
    ExecStart=/root/tower-fleet/scripts/recover-longhorn-volumes.sh
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    systemctl enable longhorn-recovery.service
    

  2. Add to disaster recovery docs

    • Update /root/tower-fleet/docs/reference/disaster-recovery.md
    • Add a "Post-Reboot Recovery" section

Medium-term (Before Next Reboot)

  1. Upgrade Longhorn to a newer release (check if the auto-salvage issue is fixed)

    • Current: v1.10.0
    • Check release notes for auto-salvage improvements

  2. Increase disk space on k3s-master

    • Currently at DiskPressure (296GB scheduled > 291GB limit)
    • Expand VM disk or adjust Longhorn storage reservation

  3. Test graceful shutdown/restart

    # Drain nodes before reboot
    kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data
    # Reboot
    # Uncordon after boot
    kubectl uncordon k3s-master

  4. Document in Proxmox upgrade plan

    • Add to /root/tower-fleet/docs/operations/proxmox-upgrade-plan.md
    • Include this guide in the "Post-Upgrade Verification" section

Long-term (Future Architecture)

  1. Longhorn volume backup automation
  2. Implement recurring snapshots
  3. Offsite backup of critical volumes (registry, authentik)

  4. High-availability improvements

  5. Consider 5-node cluster (tolerate 2 node failures)
  6. Separate Longhorn storage nodes from workload nodes

  7. Monitoring & alerting

  8. Prometheus alert for Longhorn volume faulted state
  9. Alert if volumes don't recover within 5 minutes of reboot
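
A starting point for that alert is Longhorn's built-in metrics; a sketch, assuming Prometheus scrapes the standard longhorn-backend service (port 9500, Longhorn's metrics endpoint) and that the documented robustness gauge values apply (3 = faulted):

# Spot-check the metric the alert rule would use
kubectl port-forward -n longhorn-system svc/longhorn-backend 9500:9500 &
curl -s http://localhost:9500/metrics | grep "^longhorn_volume_robustness"
# Alert when longhorn_volume_robustness == 3 persists for more than 5 minutes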

Troubleshooting

Volume Still Faulted After Salvage

Symptom: Replica salvage requested but volume stays faulted

Diagnosis:

VOLUME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl describe volume $VOLUME -n longhorn-system | grep -A 20 "Events:"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o yaml

Solutions:

  1. Check if replicas actually started:

kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentState}{"\n"}{end}'

  2. If still "stopped", check instance-manager logs:

    kubectl logs -n longhorn-system -l longhorn.io/component=instance-manager --tail=100
    

  3. Nuclear option - delete and recreate replicas:

    # CAUTION: Only if data is backed up
    kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o name | \
      xargs kubectl delete -n longhorn-system
    # Longhorn will recreate replicas automatically
    

Disk Pressure Preventing Recovery

Symptom: "DiskPressure" condition on Longhorn nodes

Check:

kubectl get nodes.longhorn.io -n longhorn-system -o json | \
  jq '.items[] | {name: .metadata.name, diskStatus: .status.diskStatus}'

Fix: Expand node disk or adjust Longhorn reservation:

# Option 1: Expand VM disk (on Proxmox host)
qm resize 201 scsi0 +100G
# Then extend filesystem on VM

# Option 2: Reduce Longhorn storage reservation
kubectl edit settings.longhorn.io -n longhorn-system storage-reserved-percentage-for-default-disk
# Reduce from 30% to 20%

Pod Stuck in ContainerCreating After Salvage

Symptom: Volume recovered but pod won't start

Check volume attachment:

kubectl get volumeattachments | grep $VOLUME
kubectl describe volumeattachment <attachment-id>

Fix:

# Delete pod and let it recreate
kubectl delete pod <pod-name> -n <namespace>

# If still stuck, force delete volume attachment
kubectl delete volumeattachment <attachment-id>


References


Historical Context

  • First Reboot: 2025-12-08 18:29 UTC
  • Cause: Proxmox 8.3.0 → 8.4.14 upgrade (prep for 8→9)
  • Recovery Time: ~1 hour (manual diagnosis + fix)
  • Data Loss: None (all replica data intact)
  • Lessons Learned: Longhorn auto-salvage unreliable, manual recovery needed

Change Log

  • 2025-12-15: PgBouncer deployed - connection pool exhaustion issue resolved. Updated prevention section.
  • 2025-12-12: Added K8s node uncordon, PostgreSQL connection pool recovery, GPU passthrough recovery sections. Added pre-reboot preparation link. Updated checklist with all recovery items.
  • 2025-12-08: Initial version after first K8s reboot incident (Longhorn volume salvage)