Post-Reboot Recovery Guide¶
Complete guide for recovering the K8s cluster after Proxmox host reboots.
Last Updated: 2025-12-15
Affected Systems: K8s cluster (VMs 201-203), Supabase PostgreSQL, GPU passthrough VMs
Quick Recovery Checklist¶
Run these commands immediately after host reboot:
# 1. Uncordon K8s nodes (often stuck in SchedulingDisabled)
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
# 2. Check for Longhorn faulted volumes
kubectl get volumes -n longhorn-system | grep faulted
# If any faulted: /root/tower-fleet/scripts/recover-longhorn-volumes.sh
# 3. Check PostgreSQL connections (if pods crashing)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
"SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;"
# If near 100: kubectl rollout restart statefulset -n supabase postgres
# 4. Check GPU passthrough VM
qm status 100
# If won't start: See GPU Passthrough Recovery section below
Pre-Reboot Preparation¶
Before rebooting, follow the shutdown procedure to prevent issues:
- Host Shutdown Procedure
This prevents connection pool exhaustion and GPU passthrough failures.
Overview¶
This guide documents recovery procedures learned from multiple reboots:
- Dec 8, 2025: Proxmox upgrade - Longhorn volume salvage issues
- Dec 12, 2025: Drive replacement - K8s nodes cordoned, PostgreSQL connection storm, GPU passthrough failure
What Happened¶
Context:
- Proxmox host rebooted at 18:29 UTC on 2025-12-08
- Part of Proxmox 8→9 upgrade preparation (updated to 8.4.14)
- First reboot with production K8s workloads running
Symptoms:
- K8s cluster came back online (all 3 nodes Ready)
- Longhorn volumes stuck in "faulted" state
- Pods using Longhorn PVCs stuck in ContainerCreating
- Services unavailable: docker-registry, authentik-redis, home-portal
Impact:
- portal.bogocat.com - 503 error
- Docker registry (10.89.97.201:30500) - connection refused
- home-portal - ImagePullBackOff
- authentik - Redis unavailable
Root Cause Analysis¶
Technical Details¶
Primary Issue: Longhorn auto-salvage failure after unexpected volume detachment
What happened:
1. Proxmox host rebooted at 18:29 UTC
2. K8s VMs (201-203) rebooted cleanly
3. Longhorn detected volumes detached unexpectedly
4. Auto-salvage triggered but brought up 0 replicas
5. Volumes remained in "faulted" state indefinitely
Key Evidence:
Longhorn manager logs:
- "Engine of volume pvc-45decf57 dead unexpectedly, setting v.Status.Robustness to faulted"
- "All replicas are failed, auto-salvaging volume"
- "Bringing up 0 replicas for auto-salvage" ← THE PROBLEM
Replica Status:
{
"failedAt": "2025-12-08T18:29:19Z", // Marked failed at reboot
"lastHealthyAt": "2025-11-27T14:25:18Z", // Last known good state
"currentState": "stopped",
"salvageRequested": false // ← Not set, so salvage skipped
}
Why 0 Replicas?
Longhorn's auto-salvage logic (v1.10.0) has a critical flaw:
- Engine has salvageRequested: true
- But individual replicas have salvageRequested: false
- Auto-salvage counts replicas with salvageRequested: true → finds 0
- Without replicas to salvage, volume stays faulted
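To see this mismatch on a live volume, the salvage flags on the engine and its replicas can be compared directly. A minimal sketch, reusing the example volume name from this guide and assuming the engine carries the same longhornvolume label that replicas do:
# Compare spec.salvageRequested on the engine vs. each replica
VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl get engines -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'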
Underlying Causes:
- Disk Pressure on k3s-master:
  - Disk at 51% usage but Longhorn reservation prevents scheduling
  - May prevent replica recovery
- Disk Monitor Mismatch:
  - Non-fatal but indicates Longhorn node state desync
- Replica Data Intact:
  - Physical files exist on all nodes: /var/lib/longhorn/replicas/pvc-*/volume-head-001.img present
  - Last modified during normal operations, not corrupted
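Before attempting salvage, it can be worth confirming the replica data really is intact. A minimal sketch, run on each K8s node, using the replica path referenced above:
# Replica image files should exist with pre-reboot timestamps
ls -lh /var/lib/longhorn/replicas/pvc-*/volume-head-*.img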
Recovery Procedure¶
Step 1: Verify Cluster Health¶
# Check K8s nodes
kubectl get nodes
# All should be Ready
# Check Longhorn system
kubectl get pods -n longhorn-system
# All should be Running (may have recent restarts from reboot)
# Identify faulted volumes
kubectl get volumes -n longhorn-system | grep faulted
Step 2: Manual Salvage (Replicas)¶
For each faulted volume, manually trigger replica salvage:
# Get list of replicas for the volume
VOLUME_NAME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME
# Set salvageRequested on each replica
for REPLICA in $(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME -o name); do
kubectl patch -n longhorn-system $REPLICA --type='json' \
-p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]'
done
# Verify salvage requested
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME_NAME \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.salvageRequested}{"\n"}{end}'
Step 3: Trigger Volume Reattachment¶
# Delete stuck pods to force volume reattachment
kubectl get pods -A | grep ContainerCreating
kubectl delete pod <pod-name> -n <namespace>
# Monitor volume status
watch kubectl get volumes -n longhorn-system | grep $VOLUME_NAME
# Should transition: faulted → detached → attached → healthy
Step 4: Verify Services¶
# Check pods are running
kubectl get pods -n docker-registry
kubectl get pods -n authentik -l app.kubernetes.io/name=authentik
kubectl get pods -n home-portal
# Test registry
nc -zv 10.89.97.201 30500
# Test portal
curl -I https://portal.bogocat.com
K8s Node Recovery (SchedulingDisabled)¶
Symptoms¶
After reboot, nodes show SchedulingDisabled or Ready,SchedulingDisabled:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# k3s-master Ready,SchedulingDisabled control-plane,master 32d v1.33.5+k3s1
# k3s-worker-1 Ready,SchedulingDisabled <none> 32d v1.33.5+k3s1
All pods are stuck in Pending because no node accepts scheduling.
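A quick way to confirm the scope (a minimal check):
# List pods stuck in Pending across all namespaces
kubectl get pods -A --field-selector=status.phase=Pending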
Cause¶
K8s nodes were cordoned (drained) before shutdown, or k3s marked them unschedulable during the unexpected shutdown; the cordon persists across reboots until removed manually.
Recovery¶
# Uncordon all nodes
kubectl uncordon k3s-master k3s-worker-1 k3s-worker-2
# Verify
kubectl get nodes
# All should show just "Ready" without SchedulingDisabled
PostgreSQL Connection Pool Recovery¶
Symptoms¶
Pods crash with database connection errors, or sit in CrashLoopBackOff after repeated connection failures.
Cause¶
After reboot, all pods reconnect to PostgreSQL simultaneously:
- Supabase PostgreSQL has max_connections=100
- Authentik alone uses 50-95 connections
- Add other apps → exceeds limit
- New pods can't connect → crash → retry → connection storm
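To confirm the configured limit, the same pod and credentials used in the diagnosis below can be queried (a minimal sketch):
# Show the server-side connection limit (expected: 100)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c "SHOW max_connections;"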
Diagnosis¶
# Check current connection count
kubectl exec -n supabase postgres-0 -- psql -U postgres -c \
"SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename ORDER BY count DESC;"
# Example problematic output:
# usename | count
# -----------+-------
# authentik | 95 ← Too many!
# postgres | 7
# | 4
Recovery¶
# Option 1: Restart PostgreSQL (clears all connections)
kubectl rollout restart statefulset -n supabase postgres
sleep 30
# Then restart crashing pods to clear backoff
kubectl delete pods -n authentik -l app.kubernetes.io/name=authentik
kubectl delete pods -n supabase -l app=kong
# Option 2: If specific app is hogging connections, restart just that app
kubectl rollout restart deployment -n authentik authentik-server
kubectl rollout restart deployment -n authentik authentik-worker
Prevention¶
Short-term: Use the shutdown procedure to scale down apps before reboot.
Implemented: PgBouncer connection pooler deployed (2025-12-15). See PgBouncer Migration.
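As a short-term measure before a planned reboot, the heaviest database consumers can be scaled down first. A sketch, assuming the authentik deployment names used elsewhere in this guide and a normal replica count of 1:
# Scale down the biggest connection consumers before shutting down the host
kubectl scale deployment -n authentik authentik-server authentik-worker --replicas=0
# Scale back up after the reboot (adjust to your normal replica count)
kubectl scale deployment -n authentik authentik-server authentik-worker --replicas=1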
GPU Passthrough Recovery (VM 100 - arr-stack)¶
Symptoms¶
Arr-stack VM won't start or network unreachable after host reboot:
qm start 100
# error writing '1' to '/sys/bus/pci/devices/0000:0b:00.0/reset': Inappropriate ioctl for device
# failed to reset PCI device '0000:0b:00.0'
Or the VM starts but has no network connectivity.
Cause¶
NVIDIA GPUs (the Quadro M2000 in this case) don't support proper PCI Function Level Reset (FLR):
1. VM shutdown doesn't fully release the GPU
2. Force-stop makes it worse
3. The GPU gets stuck in a "dirty" state
4. It can't be reassigned until the host is power cycled
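Whether the card advertises FLR at all can be checked from the Proxmox host. A minimal sketch using the PCI address from this guide:
# The DevCap line shows "FLReset+" if Function Level Reset is supported, "FLReset-" if not
lspci -vv -s 0b:00.0 | grep -i flreset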
Recovery Options¶
Option 1: Boot without GPU (immediate, temporary)
# Remove GPU passthrough
qm stop 100 --timeout 10 2>/dev/null
qm set 100 -delete hostpci0
# Start VM
qm start 100
# Verify arr-stack works (Tdarr won't have hardware transcoding)
ssh root@10.89.97.50 "docker ps"
Option 2: Try GPU reset (may work)
# Stop VM first
qm stop 100 --timeout 10 2>/dev/null
# Try to reset GPU
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/reset 2>/dev/null
# Try starting VM
qm start 100
Option 3: Full host power cycle (guaranteed)
Only option if GPU is stuck. Must:
1. Shutdown host completely: poweroff
2. Wait 30+ seconds (capacitors discharge)
3. Power on
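After power-on, and before re-adding the passthrough, it can help to confirm the GPU is bound to vfio-pci rather than a host driver. A sketch, assuming the passthrough setup binds the card to vfio-pci:
# "Kernel driver in use" should show vfio-pci for a passthrough-ready GPU
lspci -k -s 0b:00.0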
Re-enabling GPU After Fix¶
After host power cycle or successful reset:
# Re-add GPU passthrough
qm set 100 -hostpci0 0b:00,pcie=1
# Start VM
qm start 100
# Verify Tdarr can use GPU
ssh root@10.89.97.50 "docker logs tdarr_node 2>&1 | grep -i nvenc"
Prevention¶
Always use graceful VM shutdown before host reboot:
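For example (a minimal sketch; the timeout value is an assumption, adjust as needed):
# Ask the guest to shut down cleanly instead of force-stopping it
qm shutdown 100 --timeout 120
# Confirm it actually stopped before rebooting the host
qm status 100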
See Host Shutdown Procedure for complete pre-reboot checklist.
Automated Recovery Script¶
Location: /root/tower-fleet/scripts/recover-longhorn-volumes.sh
#!/bin/bash
# Automated Longhorn volume recovery after reboot
set -e
echo "=== Longhorn Post-Reboot Recovery ==="
echo "Started: $(date)"
# Find all faulted volumes
FAULTED_VOLUMES=$(kubectl get volumes -n longhorn-system -o json | \
jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')
if [ -z "$FAULTED_VOLUMES" ]; then
echo "✓ No faulted volumes found"
exit 0
fi
echo "Found faulted volumes:"
echo "$FAULTED_VOLUMES"
echo ""
# Process each faulted volume
for VOLUME in $FAULTED_VOLUMES; do
echo "Processing volume: $VOLUME"
# Get replicas for this volume
REPLICAS=$(kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o name)
if [ -z "$REPLICAS" ]; then
echo " ⚠️ No replicas found"
continue
fi
echo " Found $(echo "$REPLICAS" | wc -l) replicas"
# Enable salvage on each replica
for REPLICA in $REPLICAS; do
echo " Setting salvageRequested on $REPLICA"
kubectl patch -n longhorn-system $REPLICA --type='json' \
-p='[{"op": "replace", "path": "/spec/salvageRequested", "value": true}]' \
2>&1 | grep -v "no change" || true
done
echo " ✓ Salvage requested for all replicas"
echo ""
done
echo "=== Waiting for volumes to recover (30s) ==="
sleep 30
# Check results
STILL_FAULTED=$(kubectl get volumes -n longhorn-system -o json | \
jq -r '.items[] | select(.status.robustness=="faulted") | .metadata.name')
if [ -z "$STILL_FAULTED" ]; then
echo "✓ All volumes recovered successfully"
else
echo "⚠️ Still faulted volumes:"
echo "$STILL_FAULTED"
echo ""
echo "Manual intervention may be required"
exit 1
fi
echo "Completed: $(date)"
Usage:
chmod +x /root/tower-fleet/scripts/recover-longhorn-volumes.sh
/root/tower-fleet/scripts/recover-longhorn-volumes.sh
Post-Reboot Checklist¶
Use this checklist after every Proxmox host reboot:
Immediate (Within 2 minutes)¶
- [ ] Uncordon K8s nodes (often stuck in SchedulingDisabled)
- [ ] Check Longhorn volume health
- [ ] Check for PostgreSQL connection pool exhaustion
VMs (Within 5 minutes)¶
- [ ] Verify arr-stack VM (GPU passthrough)
- [ ] Verify game-servers VM
Services (Within 15 minutes)¶
- [ ] Verify critical pods are running
- [ ] Test external access
- [ ] Check Ingress controller
- [ ] Verify arr-stack services
Follow-up (Within 1 hour)¶
- [ ] Review Longhorn manager logs
- [ ] Check disk space on nodes
- [ ] Verify Longhorn UI accessible
Prevention & Long-term Fixes¶
Immediate Actions¶
- Add recovery script to systemd (auto-run after reboot):

cat > /etc/systemd/system/longhorn-recovery.service <<'EOF'
[Unit]
Description=Longhorn Volume Recovery After Reboot
After=k3s.service
Requires=k3s.service

[Service]
Type=oneshot
ExecStartPre=/bin/sleep 60
ExecStart=/root/tower-fleet/scripts/recover-longhorn-volumes.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

systemctl enable longhorn-recovery.service

- Add to disaster recovery docs:
  - Update /root/tower-fleet/docs/reference/disaster-recovery.md
  - Add a "Post-Reboot Recovery" section
Medium-term (Before Next Reboot)¶
- Upgrade Longhorn (check whether a newer release fixes auto-salvage)
  - Current: v1.10.0
  - Check release notes for auto-salvage improvements
- Increase disk space on k3s-master
  - Currently at DiskPressure (296GB scheduled > 291GB limit)
  - Expand VM disk or adjust Longhorn storage reservation
- Test graceful shutdown/restart
- Document in Proxmox upgrade plan
  - Add to /root/tower-fleet/docs/operations/proxmox-upgrade-plan.md
  - Include this guide in the "Post-Upgrade Verification" section
Long-term (Future Architecture)¶
- Longhorn volume backup automation
  - Implement recurring snapshots
  - Offsite backup of critical volumes (registry, authentik)
- High-availability improvements
  - Consider a 5-node cluster (tolerate 2 node failures)
  - Separate Longhorn storage nodes from workload nodes
- Monitoring & alerting
  - Prometheus alert for Longhorn volume faulted state
  - Alert if volumes don't recover within 5 minutes of reboot
Troubleshooting¶
Volume Still Faulted After Salvage¶
Symptom: Replica salvage requested but volume stays faulted
Diagnosis:
VOLUME="pvc-45decf57-e6bc-4139-92d5-9410b8da79ac"
kubectl describe volume $VOLUME -n longhorn-system | grep -A 20 "Events:"
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME -o yaml
Solutions:
1. Check if replicas actually started:
kubectl get replicas -n longhorn-system -l longhornvolume=$VOLUME \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentState}{"\n"}{end}'
2. If replicas are still "stopped", check the instance-manager logs (see the sketch after this list).
3. Nuclear option: delete and recreate the replicas.
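A sketch for step 2 (pod names vary per node; pick the instance-manager on the node hosting the stuck replica):
# Find the instance-manager pods, then tail the relevant one
kubectl get pods -n longhorn-system -o wide | grep instance-manager
kubectl logs -n longhorn-system <instance-manager-pod-name> --tail=100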
Disk Pressure Preventing Recovery¶
Symptom: "DiskPressure" condition on Longhorn nodes
Check:
kubectl get nodes.longhorn.io -n longhorn-system -o json | \
jq '.items[] | {name: .metadata.name, diskStatus: .status.diskStatus}'
Fix: Expand node disk or adjust Longhorn reservation:
# Option 1: Expand VM disk (on Proxmox host)
qm resize 201 scsi0 +100G
# Then extend filesystem on VM
# Option 2: Reduce Longhorn storage reservation
kubectl edit settings.longhorn.io -n longhorn-system storage-reserved-percentage-for-default-disk
# Reduce from 30% to 20%
Pod Stuck in ContainerCreating After Salvage¶
Symptom: Volume recovered but pod won't start
Check volume attachment:
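One way to check (a minimal sketch; filter on the PVC's volume name):
# Look for a stale VolumeAttachment still pointing at the old node
kubectl get volumeattachments | grep <pvc-name>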
Fix:
# Delete pod and let it recreate
kubectl delete pod <pod-name> -n <namespace>
# If still stuck, force delete volume attachment
kubectl delete volumeattachment <attachment-id>
References¶
Related Documentation¶
External Resources¶
- Longhorn Auto-Salvage Documentation
- Longhorn GitHub Issue #1234 (auto-salvage 0 replicas bug)
- K8s VolumeAttachment Debugging
Historical Context¶
- First Reboot: 2025-12-08 18:29 UTC
- Cause: Proxmox 8.3.0 → 8.4.14 upgrade (prep for 8→9)
- Recovery Time: ~1 hour (manual diagnosis + fix)
- Data Loss: None (all replica data intact)
- Lessons Learned: Longhorn auto-salvage unreliable, manual recovery needed
Change Log¶
- 2025-12-15: PgBouncer deployed - connection pool exhaustion issue resolved. Updated prevention section.
- 2025-12-12: Added K8s node uncordon, PostgreSQL connection pool recovery, GPU passthrough recovery sections. Added pre-reboot preparation link. Updated checklist with all recovery items.
- 2025-12-08: Initial version after first K8s reboot incident (Longhorn volume salvage)