
Post-Reboot Recovery Guide

Complete guide for recovering the K8s cluster after Proxmox host reboots.

Last Updated: 2025-12-15
Affected Systems: K8s cluster (VMs 201-203), Supabase PostgreSQL, GPU passthrough VMs


Quick Recovery Checklist

Run these commands immediately after host reboot:

# 1. Uncordon K8s nodes (often stuck in SchedulingDisabled)
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

# 2. Check for Longhorn faulted volumes
kubectl get volumes -n longhorn-system | grep faulted
# If any faulted: /root/tower-fleet/scripts/recover-longhorn-volumes.sh

# 3. Check PostgreSQL connections (if pods crashing)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;"
# If near 100: kubectl rollout restart statefulset -n supabase postgres

# 4. Check GPU passthrough VM
qm status 100
# If won't start: See GPU Passthrough Recovery section below

Pre-Reboot Preparation

Before rebooting, follow the shutdown procedure to prevent issues:

  • Host Shutdown Procedure

This prevents connection pool exhaustion and GPU passthrough failures.
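
For quick reference, a minimal sketch of the two pre-reboot steps that matter most here (the authentik deployment names match those used in the recovery sections below; adjust the list of apps to scale down for your workloads):

# Scale down the heaviest PostgreSQL consumers before rebooting
kubectl scale deployment -n authentik authentik-server authentik-worker --replicas=0

# Gracefully shut down the GPU passthrough VM (never force-stop it)
qm shutdown 100 --timeout 120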


Overview

This guide documents recovery procedures learned from multiple reboots:

  • Dec 8, 2025: Proxmox upgrade - Longhorn volume salvage issues
  • Dec 12, 2025: Drive replacement - K8s nodes cordoned, PostgreSQL connection storm, GPU passthrough failure

What Happened

Context:

  • Proxmox host rebooted at 18:29 UTC on 2025-12-08
  • Part of Proxmox 8→9 upgrade preparation (updated to 8.4.14)
  • First reboot with production K8s workloads running

Symptoms:

  • K8s cluster came back online (all 3 nodes Ready)
  • Longhorn volumes stuck in "faulted" state
  • Pods using Longhorn PVCs stuck in ContainerCreating
  • Services unavailable: docker-registry, authentik-redis, home-portal

Impact:

  • portal.bogocat.com - 503 error
  • Docker registry (10.89.97.201:30500) - connection refused
  • home-portal - ImagePullBackOff
  • authentik - Redis unavailable


Root Cause Analysis

Technical Details

Primary Issue: Longhorn auto-salvage failure after unexpected volume detachment

What happened:

  1. Proxmox host rebooted at 18:29 UTC
  2. K8s VMs (201-203) rebooted cleanly
  3. Longhorn detected volumes detached unexpectedly
  4. Auto-salvage triggered but brought up 0 replicas
  5. Volumes remained in "faulted" state indefinitely

Key Evidence:

Longhorn manager logs:
- "Engine of volume pvc-45decf57 dead unexpectedly, setting v.Status.Robustness to faulted"
- "All replicas are failed, auto-salvaging volume"
- "Bringing up 0 replicas for auto-salvage"  ← THE PROBLEM

Replica Status:

{
  "failedAt": "2025-12-08T18:29:19Z",  // Marked failed at reboot
  "lastHealthyAt": "2025-11-27T14:25:18Z",  // Last known good state
  "currentState": "stopped",
  "salvageRequested": false  // ← Not set, so salvage skipped
}
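
The same fields can be dumped for every replica of the affected volume (field names as shown in the JSON above; grep just keeps the output readable):

kubectl get replicas -n longhorn-system \
  -l longhornvolume=pvc-45decf57-e6bc-4139-92d5-9410b8da79ac -o yaml | \
  grep -E "failedAt|lastHealthyAt|currentState|salvageRequested"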

Why 0 Replicas?

Longhorn's auto-salvage logic (v1.10.0) has a critical flaw:

  • The engine has salvageRequested: true
  • But the individual replicas have salvageRequested: false
  • Auto-salvage counts replicas with salvageRequested: true → finds 0
  • Without replicas to salvage, the volume stays faulted
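
You can see the mismatch directly on an affected volume (a sketch; it assumes the engine objects carry the same longhornvolume label as the replicas, which is the Longhorn default):

VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
echo "engine:"
kubectl get engines -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'
echo "replicas:"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'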

Underlying Causes:

  1. Disk Pressure on k3s-master (see the sketch after this list for how to pull these numbers):

    ScheduledTotal = 296GB (Size + StorageScheduled) >
    ProvisionedLimit = 291GB (100% of StorageMax - StorageReserved)

    • Disk at 51% usage, but the Longhorn reservation prevents scheduling
    • May prevent replica recovery

  2. Disk Monitor Mismatch:

    "Failed to sync with disk monitor due to mismatching disks"
    node=k3s-master

    • Non-fatal but indicates Longhorn node state desync

  3. Replica Data Intact:

    • Physical files exist on all nodes
    • /var/lib/longhorn/replicas/pvc-*/volume-head-001.img present
    • Last modified during normal operations, not corrupted
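
To pull the numbers behind that comparison, a sketch using the same nodes.longhorn.io resource referenced later in Troubleshooting (requires jq; values are reported per disk):

kubectl get nodes.longhorn.io -n longhorn-system k3s-master -o json | jq '{
  reserved:  [.spec.disks[].storageReserved],
  maximum:   [.status.diskStatus[].storageMaximum],
  scheduled: [.status.diskStatus[].storageScheduled]
}'
# Per the formula above: ProvisionedLimit = (maximum - reserved) x over-provisioning % (100% here)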

Recovery Procedure

Step 1: Verify Cluster Health

# Check K8s nodes
kubectl get nodes
# All should be Ready

# Check Longhorn system
kubectl get pods -n longhorn-system
# All should be Running (may have recent restarts from reboot)

# Identify faulted volumes
kubectl get volumes -n longhorn-system | grep faulted

Step 2: Manual Salvage (Replicas)

For each faulted volume, manually trigger replica salvage:

# Get list of replicas for the volume
VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME

# Set salvageRequested on each replica
for REPLICA in $(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME -o name); do
  kubectl patch -n longhorn-system $REPLICA --type='json' \
    -p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]'
done

# Verify salvage requested
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'

Step 3: Trigger Volume Reattachment

# Delete stuck pods to force volume reattachment
kubectl get pods -A | grep ContainerCreating
kubectl delete pod <pod-name> -n <namespace>

# Monitor volume status
watch "kubectl get volumes -n longhorn-system | grep $VOLUME_NAME"
# Should transition: faulted → detached → attached → healthy
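
As an alternative to watching manually, a hedged one-liner that blocks until the volume reports healthy (requires kubectl 1.23+ for jsonpath-based waits):

kubectl wait volume/$VOLUME_NAME -n longhorn-system \
  --for=jsonpath='{.status.robustness}'=healthy --timeout=300s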

Step 4: Verify Services

# Check pods are running
kubectl get pods -n docker-registry
kubectl get pods -n authentik -l app.kubernetes.io/component=master
kubectl get pods -n home-portal

# Test registry
nc -zv 10.89.97.201 30500

# Test portal
curl -I https://portal.bogocat.com

K8s Node Recovery (SchedulingDisabled)

Symptoms

After reboot, nodes show SchedulingDisabled or Ready,SchedulingDisabled:

kubectl get nodes
# NAME           STATUS                     ROLES                  AGE   VERSION
# k3s-master     Ready,SchedulingDisabled   control-plane,master   32d   v1.33.5+k3s1
# k3s-worker-1   Ready,SchedulingDisabled   <none>                 32d   v1.33.5+k3s1

All pods are stuck in Pending because no node accepts scheduling.

Cause

K8s nodes were cordoned (drained) before shutdown, or k3s marks them unschedulable during unexpected shutdown.
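
Either way, the cordon is recorded as the node's spec.unschedulable flag, which you can confirm directly:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.unschedulable}{"\n"}{end}'
# "true" means the node is cordoned; uncordon clears the flag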

Recovery

# Uncordon all nodes
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2

# Verify
kubectl get nodes
# All should show just "Ready" without SchedulingDisabled

PostgreSQL Connection Pool Recovery

Symptoms

Pods crash with connection errors:

FATAL: remaining connection slots are reserved for non-replication superuser connections

Or pods in CrashLoopBackOff with database connection failures.

Cause

After reboot, all pods reconnect to PostgreSQL simultaneously:

  • Supabase PostgreSQL has max_connections=100
  • Authentik alone uses 50-95 connections
  • Add other apps → exceeds the limit
  • New pods can't connect → crash → retry → connection storm

Diagnosis

# Check current connection count
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename ORDER BY count DESC;"

# Example problematic output:
#  usename   | count
# -----------+-------
#  authentik |    95  ← Too many!
#  postgres  |     7
#            |     4

Recovery

# Option 1: Restart PostgreSQL (clears all connections)
kubectl rollout restart statefulset -n supabase postgres
sleep 30

# Then restart crashing pods to clear backoff
kubectl delete pods -n authentik -l app.kubernetes.io/name=authentik
kubectl delete pods -n supabase -l app=kong

# Option 2: If specific app is hogging connections, restart just that app
kubectl rollout restart deployment -n authentik authentik-server
kubectl rollout restart deployment -n authentik authentik-worker
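
# Option 3 (not in the original runbook): if a full PostgreSQL restart is too
# disruptive, terminate only the offending user's idle sessions (authentik here,
# per the diagnosis above); confirm the app tolerates dropped connections first
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE usename = 'authentik' AND state = 'idle' AND pid <> pg_backend_pid();"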

Prevention

Short-term: Use the shutdown procedure to scale down apps before reboot.

Implemented: PgBouncer connection pooler deployed (2025-12-15). See PgBouncer Migration.


GPU Passthrough Recovery (VM 100 - arr-stack)

Symptoms

Arr-stack VM won't start or network unreachable after host reboot:

qm start 100
# error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
# failed to reset PCI device '0000:0b:00.0'

Or VM starts but has no network connectivity:

ping 10.89.97.50
# Destination Host Unreachable

Cause

NVIDIA GPUs (Quadro M2000 in this case) don't support proper PCI Function Level Reset (FLR); a quick check is shown after this list:

  1. VM shutdown doesn't fully release the GPU
  2. Force-stop makes it worse
  3. The GPU gets stuck in a "dirty" state
  4. It can't be reassigned until the host is power cycled
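
You can confirm the missing reset capability from the Proxmox host ("FLReset-" in the device capabilities means the card cannot perform a function-level reset; the PCI address matches the error above):

lspci -vvs 0000:0b:00.0 | grep -i "FLReset"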

Recovery Options

Option 1: Boot without GPU (immediate, temporary)

# Remove GPU passthrough
qm stop 100 --timeout 10 2>/dev/null
qm set 100 -delete hostpci0

# Start VM
qm start 100

# Verify arr-stack works (Tdarr won't have hardware transcoding)
ssh root@10.89.97.50 "docker ps"

Option 2: Try GPU reset (may work)

# Stop VM first
qm stop 100 --timeout 10 2>/dev/null

# Try to reset GPU
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/reset 2>/dev/null

# Try starting VM
qm start 100

Option 3: Full host power cycle (guaranteed)

Only option if the GPU is stuck. You must:

  1. Shut down the host completely: poweroff
  2. Wait 30+ seconds (let capacitors discharge)
  3. Power on

Re-enabling GPU After Fix

After host power cycle or successful reset:

# Re-add GPU passthrough
qm set 100 -hostpci0 0b:00,pcie=1

# Start VM
qm start 100

# Verify Tdarr can use GPU
ssh root@10.89.97.50 "docker logs tdarr_node 2>&1 | grep -i nvenc"

Prevention

Always use graceful VM shutdown before host reboot:

qm shutdown 100  # NOT qm stop 100

See Host Shutdown Procedure for complete pre-reboot checklist.


Automated Recovery Script

Location: /root/tower-fleet/scripts/recover-longhorn-volumes.sh

#!/bin/bash
# Automated Longhorn volume recovery after reboot

set -e

echo "=== Longhorn Post-Reboot Recovery ==="
echo "Started: $(date)"

# Find all faulted volumes
FAULTED_VOLUMES=$(kubectl get volumes -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')

if [ -z "$FAULTED_VOLUMES" ]; then
  echo "✓ No faulted volumes found"
  exit 0
fi

echo "Found faulted volumes:"
echo "$FAULTED_VOLUMES"
echo ""

# Process each faulted volume
for VOLUME in $FAULTED_VOLUMES; do
  echo "Processing volume: $VOLUME"

  # Get replicas for this volume
  REPLICAS=$(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o name)

  if [ -z "$REPLICAS" ]; then
    echo "  ⚠️  No replicas found"
    continue
  fi

  echo "  Found $(echo "$REPLICAS" | wc -l) replicas"

  # Enable salvage on each replica
  for REPLICA in $REPLICAS; do
    echo "  Setting salvageRequested on $REPLICA"
    kubectl patch -n longhorn-system $REPLICA --type='json' \
      -p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]' \
      2>&1 | grep -v "no change" || true
  done

  echo "  ✓ Salvage requested for all replicas"
  echo ""
done

echo "=== Waiting for volumes to recover (30s) ==="
sleep 30

# Check results
STILL_FAULTED=$(kubectl get volumes -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')

if [ -z "$STILL_FAULTED" ]; then
  echo "✓ All volumes recovered successfully"
else
  echo "⚠️  Still faulted volumes:"
  echo "$STILL_FAULTED"
  echo ""
  echo "Manual intervention may be required"
  exit 1
fi

echo "Completed: $(date)"

Usage:

chmod +x /root/tower-fleet/scripts/recover-longhorn-volumes.sh
/root/tower-fleet/scripts/recover-longhorn-volumes.sh


Post-Reboot Checklist

Use this checklist after every Proxmox host reboot:

Immediate (Within 2 minutes)

  • [ ] Uncordon K8s nodes (often stuck in SchedulingDisabled)

    kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
    kubectl get nodes  # All should be Ready (no SchedulingDisabled)
    

  • [ ] Check Longhorn volume health

    kubectl get volumes -n longhorn-system | grep faulted
    # If any faulted:
    /root/tower-fleet/scripts/recover-longhorn-volumes.sh
    

  • [ ] Check for PostgreSQL connection pool exhaustion

    kubectl get pods -n authentik | grep -E "CrashLoop|Error"
    kubectl get pods -n supabase | grep -E "CrashLoop|Error"
    # If crashing pods with connection errors:
    kubectl rollout restart statefulset -n supabase postgres
    sleep 30
    kubectl delete pods -n authentik -l app.kubernetes.io/name=authentik
    

VMs (Within 5 minutes)

  • [ ] Verify arr-stack VM (GPU passthrough)

    qm status 100  # Should be running
    ping -c 2 10.89.97.50  # Should respond
    # If not responding, see GPU Passthrough Recovery section
    

  • [ ] Verify game-servers VM

    qm status 360
    ping -c 2 10.89.97.60
    

Services (Within 15 minutes)

  • [ ] Verify critical pods are running

    kubectl get pods -n docker-registry
    kubectl get pods -n authentik
    kubectl get pods -n home-portal
    kubectl get pods -n supabase
    # All should be Running with no restarts accumulating
    

  • [ ] Test external access

    curl -I https://portal.bogocat.com
    nc -zv 10.89.97.201 30500  # docker registry
    

  • [ ] Check Ingress controller

    kubectl get pods -n ingress-nginx
    kubectl get ingress -A
    

  • [ ] Verify arr-stack services

    ssh root@10.89.97.50 "docker ps --format 'table {{.Names}}\t{{.Status}}' | head -15"
    

Follow-up (Within 1 hour)

  • [ ] Review Longhorn manager logs

    kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100 | grep -i error
    

  • [ ] Check disk space on nodes

    for NODE in k3s-master k3s-worker-1 k3s-worker-2; do
      echo "=== $NODE ==="
      ssh root@$NODE "df -h /var/lib/longhorn"
    done
    

  • [ ] Verify Longhorn UI accessible

    kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
    # Access: http://localhost:8080
    


Prevention & Long-term Fixes

Immediate Actions

  1. Add recovery script to systemd (auto-run after reboot)

    cat > /etc/systemd/system/longhorn-recovery.service <<'EOF'
    [Unit]
    Description=Longhorn Volume Recovery After Reboot
    After=k3s.service
    Requires=k3s.service
    
    [Service]
    Type=oneshot
    ExecStartPre=/bin/sleep 60
    ExecStart=/root/tower-fleet/scripts/recover-longhorn-volumes.sh
    RemainAfterExit=yes
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    systemctl enable longhorn-recovery.service
    

  2. Add to disaster recovery docs

    • Update /root/tower-fleet/docs/reference/disaster-recovery.md
    • Add a "Post-Reboot Recovery" section

Medium-term (Before Next Reboot)

  1. Upgrade Longhorn to a newer release (check if the auto-salvage issue is fixed)

    • Current: v1.10.0
    • Check release notes for auto-salvage improvements

  2. Increase disk space on k3s-master

    • Currently at DiskPressure (296GB scheduled > 291GB limit)
    • Expand VM disk or adjust Longhorn storage reservation

  3. Test graceful shutdown/restart

    # Drain nodes before reboot
    kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data
    # Reboot
    # Uncordon after boot
    kubectl uncordon k3s-master

  4. Document in Proxmox upgrade plan

    • Add to /root/tower-fleet/docs/operations/proxmox-upgrade-plan.md
    • Include this guide in the "Post-Upgrade Verification" section

Long-term (Future Architecture)

  1. Longhorn volume backup automation
  2. Implement recurring snapshots
  3. Offsite backup of critical volumes (registry, authentik)

  4. High-availability improvements

  5. Consider 5-node cluster (tolerate 2 node failures)
  6. Separate Longhorn storage nodes from workload nodes

  7. Monitoring & alerting

  8. Prometheus alert for Longhorn volume faulted state
  9. Alert if volumes don't recover within 5 minutes of reboot
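
A starting point for that alert is Longhorn's built-in metrics; a sketch, assuming Prometheus scrapes the standard longhorn-backend service (port 9500, Longhorn's metrics endpoint) and that the documented robustness gauge values apply (3 = faulted):

# Spot-check the metric the alert rule would use
kubectl port-forward -n longhorn-system svc/longhorn-backend 9500:9500 &
curl -s http://localhost:9500/metrics | grep "^longhorn_volume_robustness"
# Alert when longhorn_volume_robustness == 3 persists for more than 5 minutes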

Troubleshooting

Volume Still Faulted After Salvage

Symptom: Replica salvage requested but volume stays faulted

Diagnosis:

VOLUME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl describe volume $VOLUME -n longhorn-system | grep -A 20 "Events:"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o yaml

Solutions:

  1. Check if replicas actually started:

kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentState}{"\n"}{end}'

  2. If still "stopped", check instance-manager logs:

    kubectl logs -n longhorn-system -l longhorn.io/component=instance-manager --tail=100
    

  3. Nuclear option - delete and recreate replicas:

    # CAUTION: Only if data is backed up
    kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o name | \
      xargs kubectl delete -n longhorn-system
    # Longhorn will recreate replicas automatically
    

Disk Pressure Preventing Recovery

Symptom: "DiskPressure" condition on Longhorn nodes

Check:

kubectl get nodes.longhorn.io -n longhorn-system -o json | \
  jq '.items[] | {name: .metadata.name, diskStatus: .status.diskStatus}'

Fix: Expand node disk or adjust Longhorn reservation:

# Option 1: Expand VM disk (on Proxmox host)
qm resize 201 scsi0 +100G
# Then extend filesystem on VM

# Option 2: Reduce Longhorn storage reservation
kubectl edit settings.longhorn.io -n longhorn-system storage-reserved-percentage-for-default-disk
# Reduce from 30% to 20%

Pod Stuck in ContainerCreating After Salvage

Symptom: Volume recovered but pod won't start

Check volume attachment:

kubectl get volumeattachments | grep $VOLUME
kubectl describe volumeattachment <attachment-id>

Fix:

# Delete pod and let it recreate
kubectl delete pod <pod-name> -n <namespace>

# If still stuck, force delete volume attachment
kubectl delete volumeattachment <attachment-id>


References


Historical Context

  • First Reboot: 2025-12-08 18:29 UTC
  • Cause: Proxmox 8.3.0 → 8.4.14 upgrade (prep for 8→9)
  • Recovery Time: ~1 hour (manual diagnosis + fix)
  • Data Loss: None (all replica data intact)
  • Lessons Learned: Longhorn auto-salvage unreliable, manual recovery needed

Change Log

  • 2025-12-15: PgBouncer deployed - connection pool exhaustion issue resolved. Updated prevention section.
  • 2025-12-12: Added K8s node uncordon, PostgreSQL connection pool recovery, GPU passthrough recovery sections. Added pre-reboot preparation link. Updated checklist with all recovery items.
  • 2025-12-08: Initial version after first K8s reboot incident (Longhorn volume salvage)