# Proxmox 8 → 9 Upgrade Plan

Comprehensive evaluation and action plan for upgrading Proxmox VE from 8.4 to 9.1.

- Status: Ready for upgrade (prerequisites complete)
- Last Updated: 2025-12-15
- Current Version: Proxmox VE 8.4.0 (kernel 6.8.12-17-pve)
- Target Version: Proxmox VE 9.1 (kernel 6.17.2)
## Executive Summary

### Should We Upgrade?

🟢 YES - Ready to proceed

Key Drivers:
- ZFS 2.3.4 adds RAIDZ expansion (future capacity growth)
- Kernel 6.17.2 brings improved hardware support and performance
- New container-management features (OCI images)

Blockers (ALL RESOLVED):
- ~~Vault pool DEGRADED~~ ✅ ONLINE - disks replaced Dec 2025
- ~~Not on latest 8.x~~ ✅ Running pve-manager 8.4.14
- Database backups - scripts exist; run them manually once more before the upgrade
Timeline: Ready for next maintenance window (2-hour window recommended)
## Current System Status

### Proxmox Host

- Version: 8.4.0 (pve-manager 8.4.14)
- Kernel: 6.8.12-17-pve
- Bootloader: UEFI systemd-boot (dual ESP)
- Root: ZFS (rpool)
- Storage: 2 ZFS pools (rpool, vault)
### Infrastructure
- 23 LXC containers - Mix of services, media, development
- 5 VMs - arr-stack (100), k3s cluster (201-203), pc (300)
- K8s cluster - 3-node k3s with production workloads
- Critical services - Supabase (k8s), Authentik (k8s), Immich (k8s)
### Storage Health

- rpool: ONLINE, healthy
- vault: ONLINE (repaired Dec 2025)
  - ✅ All 8 disks healthy (RAIDZ2)
  - ⚠️ 1 legacy data error (non-critical, file-level corruption from before repair)
## Feature Analysis

### 🔴 Critical Features for Our Stack

#### ZFS 2.3.4 - RAIDZ Expansion Support

Limitation on PVE 8:
- ZFS 2.2.7 (PVE 8) cannot add disks to an existing RAIDZ vdev
- Growing vault's capacity today would require destroying and rebuilding the pool (days of downtime for 29TB)
- (The two failing disks from Nov 2025 were handled with an ordinary `zpool replace` and resilver, which works on PVE 8; see Vault Pool Recovery Plan)

With Proxmox 9:
- ZFS 2.3.4 adds RAIDZ expansion: new disks can be attached directly to an existing RAIDZ vdev
- Capacity can be grown one drive at a time, with no pool rebuild (future-proofing)

Example:

```bash
# PVE 8 (ZFS 2.2.7): growing a RAIDZ vdev means rebuilding the entire pool
# PVE 9 (ZFS 2.3.4): attach an additional disk to the existing RAIDZ vdev
# ('raidz2-0' is the usual vdev name - confirm with `zpool status vault`)
zpool attach vault raidz2-0 /dev/disk/by-id/<new-disk-id>
```

This feature alone justifies the upgrade now that the vault pool is healthy again.
### 🟡 Medium Value Features

#### OCI Image Support for LXC
What: Deploy LXC containers directly from Docker Hub/OCI registries
Use cases for our infrastructure:
- WikiJS (LXC 117) - Available as linuxserver/wikijs
- Kiwix (LXC 125) - Available as ghcr.io/kiwix/kiwix-serve
- Excalidraw (LXC 410) - Available as excalidraw/excalidraw
Benefits:
- Faster deployment (no manual setup)
- Automatic updates via registry
- Consistent configuration

Example:

```bash
# Deploy WikiJS from an OCI image
# (exact OCI template reference syntax may differ - check `pct create` on PVE 9)
pct create 117 local:oci/linuxserver/wikijs \
    --hostname wikijs \
    --memory 2048 \
    --cores 2
```
#### Enhanced SDN Monitoring
What: Improved GUI for network monitoring, EVPN tracking, fabric views
Relevance: Limited - we use k3s for most networking, but useful for Proxmox network troubleshooting
### 🟢 Low Value Features (Not Applicable)
- Intel TDX - Security feature for confidential computing (not needed)
- Enhanced Nested Virtualization - Not running hypervisors in VMs
- vTPM Snapshots - Not using virtual TPM
- Datacenter Bulk Actions - Single-node setup, minimal benefit
## Upgrade Process

### Prerequisites

#### 1. Must Complete Before Upgrade

- [x] Fix vault pool - 2 failing disks replaced Dec 2025 (see Vault Pool Recovery Plan)
- [x] Update to latest PVE 8.x - ✅ Running pve-manager 8.4.14
- [ ] Create ZFS snapshot - `zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade`
- [ ] Backup critical VMs - k3s cluster VMs (201-203), arr-stack (100)
- [ ] Test database restore - manual restore dry run against a recent backup (see the sketch after this list)
- [ ] Document network interfaces - record MAC/IP mappings (may change post-upgrade)
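For the restore dry run, one possible approach is to decrypt a recent automated dump and load it into a throwaway PostgreSQL instance instead of the live database. A minimal sketch, assuming the encrypted dumps live under /vault/backups/databases/ and that Docker and psql are available somewhere convenient (e.g. a dev LXC); the filename is a placeholder:

```bash
# Hypothetical restore dry run - adjust the filename to a real dump under /vault/backups/
BACKUP=/vault/backups/databases/supabase-20251214.sql.gpg   # placeholder name

# Decrypt using the passphrase file referenced in the backup scripts section
gpg --batch --pinentry-mode loopback \
    --passphrase-file /root/.database-backup-passphrase \
    --decrypt "$BACKUP" > /tmp/restore-test.sql

# Load into a disposable PostgreSQL container, never the production database
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15
sleep 10
psql "postgresql://postgres:test@localhost:55432/postgres" -f /tmp/restore-test.sql

# Spot-check a few tables, then clean up
docker rm -f pg-restore-test && rm /tmp/restore-test.sql
```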
#### 2. Recommended (Strongly Advised)

- [ ] Backup dev LXCs - money-tracker (150), home-portal (160), rms (170), trip-planner (190)
- [ ] Export K8s state - `kubectl get all -A -o yaml > cluster-backup.yaml`
- [ ] Run pve8to9 checklist - `pve8to9 --full` to identify issues
- [ ] Schedule maintenance window - 2-hour window, weekend preferred
- [ ] Prepare rollback plan - Document ZFS snapshot rollback procedure
### Step-by-Step Upgrade

#### Phase 1: Pre-Upgrade (Week Before)

```bash
# 1. Update to latest Proxmox 8.x
apt update && apt dist-upgrade
# Reboot if the kernel was updated
reboot

# 2. Verify system health
pveversion -v
zpool status rpool vault
pct list
qm list

# 3. Run pre-upgrade checks
pve8to9 --full > /tmp/pve8to9-check.txt
cat /tmp/pve8to9-check.txt

# 4. Create ZFS snapshot
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade-$(date +%Y%m%d)
zfs list -t snapshot | grep pre-pve9
```
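Because this host boots via systemd-boot on two ESPs (see Current System Status), it is also worth confirming both ESPs are in sync before an upgrade that replaces the kernel. A quick check using the standard Proxmox boot tooling:

```bash
# Verify both ESPs are configured and carry current kernels/boot entries
proxmox-boot-tool status

# If anything looks stale, refresh boot entries on all configured ESPs
proxmox-boot-tool refresh
```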
#### Phase 2: Backups (Day Before)

```bash
# 1. Database backups (automated daily - run the scripts once more manually before the upgrade)
# Follow /root/tower-fleet/docs/reference/disaster-recovery.md
# - Supabase PostgreSQL
# - Immich PostgreSQL

# 2. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 3. Backup critical LXCs (optional)
vzdump 150 160 170 190 --storage local --compress zstd

# 4. Document network state
ip addr show > /tmp/network-pre-upgrade.txt
cat /etc/network/interfaces > /tmp/interfaces-pre-upgrade.txt
```
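Before the maintenance window it also helps to confirm that the freshest backups exist and actually decrypt. A small sketch; the directory layout and file extensions are assumptions based on the backup scripts section:

```bash
# List the newest backups (paths assumed - adjust to the real layout under /vault/backups/)
ls -lht /vault/backups/databases/ | head

# Test-decrypt the most recent dump without writing plaintext to disk
LATEST=$(ls -1t /vault/backups/databases/*.gpg | head -1)
gpg --batch --pinentry-mode loopback \
    --passphrase-file /root/.database-backup-passphrase \
    --decrypt "$LATEST" | head -c 200
```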
#### Phase 3: Repository Updates

```bash
# 1. Update Debian sources (bookworm → trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list

# 2. Update Proxmox VE repository
# Edit /etc/apt/sources.list.d/pve-enterprise.list (if using enterprise)
# OR /etc/apt/sources.list.d/pve-no-subscription.list
# Change: deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
# To:     deb http://download.proxmox.com/debian/pve trixie pve-no-subscription

# 3. Refresh package index
apt update
```
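The classic one-line entries above are enough for the upgrade itself. Debian trixie and PVE 9 also support the newer deb822 format, so if you prefer to switch at the same time, this is a sketch of what a no-subscription entry could look like (file name and keyring path are assumptions; check them against the official upgrade guide):

```bash
# Optional: deb822-style repository definition instead of the one-line .list entry
cat > /etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

# Disable the old one-line entry so the repository is not defined twice, then refresh
# (e.g. comment out /etc/apt/sources.list.d/pve-no-subscription.list)
apt update
```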
#### Phase 4: Perform Upgrade

```bash
# 1. Disable kernel audit logging (prevents log flooding)
systemctl stop systemd-journald-audit.socket
systemctl disable systemd-journald-audit.socket

# 2. Start upgrade (interactive - answer prompts carefully)
apt dist-upgrade

# Important prompts to watch for:
# - /etc/ssh/sshd_config - Keep current version (K)
# - /etc/lvm/lvm.conf - Keep current version (K)
# - Bootloader configuration - Review changes carefully
#   (this host boots via systemd-boot, so GRUB prompts may not appear)

# 3. Reboot to new kernel
reboot
```
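Before rebooting, it can save time later to list the configuration files where dpkg kept the local version during the prompts above, so they can be reconciled after the upgrade; for example:

```bash
# Configs where dpkg saved the packaged default (.dpkg-dist) or the old local copy (.dpkg-old)
find /etc -name '*.dpkg-dist' -o -name '*.dpkg-old' 2>/dev/null

# Review a specific one before merging, e.g.:
# diff -u /etc/ssh/sshd_config /etc/ssh/sshd_config.dpkg-dist
```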
#### Phase 5: Post-Upgrade Verification

```bash
# 1. Verify version
pveversion -v
# Should show: pve-manager 9.1.x, kernel 6.17.x

# 2. Check system health
zpool status
pct list
qm list
systemctl status pve-cluster pvedaemon pveproxy

# 3. Verify network (critical!)
ip addr show
# Compare with /tmp/network-pre-upgrade.txt
# If interfaces were renamed, update /etc/network/interfaces

# 4. Test services
ping 8.8.8.8
curl https://otterwiki.bogocat.com   # docs site (OtterWiki on K8s)
ssh root@10.89.97.201                # k3s master

# 5. Start critical containers/VMs
pct start 150 160 170 190   # dev LXCs
kubectl get nodes           # k3s cluster
kubectl get pods -A         # all workloads

# 6. Remove the systemd-boot meta-package (prevents future boot issues)
apt remove systemd-boot

# 7. Re-run upgrade checklist (should be clean)
pve8to9 --full
```
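If step 3 shows renamed network interfaces, one way to make the names stable again is a systemd .link file keyed on the MAC address recorded in /tmp/network-pre-upgrade.txt. A sketch; the MAC address and the chosen name are placeholders:

```bash
# Pin a predictable NIC name by MAC address (values are placeholders)
cat > /etc/systemd/network/10-lan0.link <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0
EOF

# Reference the pinned name in /etc/network/interfaces,
# include the .link file in the initramfs, then reboot
update-initramfs -u -k all
```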
## Estimated Timeline
| Phase | Duration | Downtime |
|---|---|---|
| Pre-upgrade checks | 30 min | None |
| Repository updates | 5 min | None |
| Package upgrade (apt dist-upgrade) | 30-45 min | Yes |
| Reboot + boot time | 3-5 min | Yes |
| Post-upgrade verification | 20-30 min | Partial |
| Total | 90-120 min | 60-90 min |
## Rollback Strategy

### ⚠️ No Official Downgrade Path

Proxmox does not support downgrading from 9 → 8. If the upgrade fails:

### Option 1: ZFS Snapshot Rollback (Recommended)

Prerequisites:
- ZFS snapshot created before the upgrade
- Proxmox 8 ISO available to boot into rescue mode
Steps:

```bash
# 1. Boot from the Proxmox 8 ISO (rescue mode)

# 2. Import the pool
zpool import -f rpool

# 3. Roll back to the snapshot
zfs rollback rpool/ROOT/pve-1@pre-pve9-upgrade

# 4. Reboot (should boot into Proxmox 8)
reboot

# Note: this destroys ALL changes made after the snapshot
```

- Recovery time: 15-30 minutes
- Data loss: any changes since the snapshot (VM changes, config changes)
### Option 2: Kernel Downgrade Only (Partial Rollback)

If the upgrade succeeds but the new kernel fails:
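A minimal sketch, assuming the previous 6.8 kernel is still installed alongside the new one: pin it as the boot default with the Proxmox boot tooling, reboot, and remove the pin once a fixed 6.17 kernel ships.

```bash
# Boot the known-good PVE 8 kernel while staying on Proxmox 9 userspace
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.8.12-17-pve
reboot

# Later, remove the pin to return to the newest installed kernel
# proxmox-boot-tool kernel unpin
```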
- Recovery time: 5 minutes
- Limitations: still on Proxmox 9 userspace; only the kernel is reverted
### Option 3: Fresh Install + Restore (Nuclear Option)

If rollback fails:

1. Reinstall Proxmox 8 from ISO (wipes host)
2. Import ZFS pools
3. Restore VMs/LXCs from backups
4. Rebuild k3s cluster (30-45 min via scripts)
5. Restore databases from backups

- Recovery time: 4-6 hours
- Requirements: valid backups of all critical data
## Disaster Recovery Readiness Assessment

### Current State: 🟢 Medium-High Confidence
| Component | Backup Status | Restore Tested | RTO | RPO | Confidence |
|---|---|---|---|---|---|
| OPNsense | ✅ Daily + GitHub | ✅ Yes | 30-60 min | 1 day | 🟢 High |
| K8s Manifests | ✅ Git (real-time) | ✅ Yes | 10 min | Real-time | 🟢 High |
| Supabase DBs | ✅ Daily + GitHub | ⚠️ Scripts tested | 15-30 min | 1 day | 🟢 High |
| Immich DB | ✅ Daily + GitHub | ⚠️ Scripts tested | 5-10 min | 1 day | 🟢 High |
| K8s State | 🔶 Manual export | ⚠️ Untested | 45 min | Manual | 🟡 Medium |
| Proxmox Config | ❌ None | ❌ No | 4-6 hours | N/A | 🔴 Low |
| LXC Filesystems | ❌ None | ❌ No | 2-3 hours | N/A | 🔴 Low |
| Vault Pool (29TB) | ❌ None | ❌ No | Days | N/A | 🔴 Critical |
### Backup Scripts Status

| Script | Location | Cron | Offsite (GitHub) |
|---|---|---|---|
| `backup-opnsense.sh` | `/root/scripts/` | ✅ Daily 2AM | ✅ Weekly (Sunday) |
| `backup-supabase.sh` | `/root/scripts/` | ✅ Daily 3AM | ✅ Weekly (Sunday) |
| `backup-immich-db.sh` | `/root/scripts/` | ✅ Daily 3:30AM | ✅ Weekly (Sunday) |
Features:
- GPG encryption (AES256) with passphrase in /root/.database-backup-passphrase
- 30-day local retention (/vault/backups/)
- Weekly offsite to GitHub (tower-fleet repo, Sundays)
- Backup sizes: OPNsense ~26KB, Supabase ~8MB, Immich ~21MB
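For reference, the 30-day local retention presumably comes down to something like the following inside the scripts; shown only to make the policy concrete (the actual logic lives in /root/scripts/backup-*.sh and may differ):

```bash
# Illustrative retention sweep - delete encrypted dumps older than 30 days
find /vault/backups/ -name '*.gpg' -mtime +30 -print -delete
```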
### Critical Gaps
- ~~Database backups not automated~~ ✅ Resolved Dec 2025
- ~~No offsite backups~~ ✅ Weekly GitHub push enabled Dec 2025
- LXC containers not backed up - Would need to manually recreate
- Vault pool has no backup - 29TB of media/data at risk
## Recommendations

### Immediate (Before Upgrade)

```bash
# 1. Manual database backups (run the automated scripts once more by hand)
/root/scripts/backup-supabase.sh
/root/scripts/backup-immich-db.sh
# Store in /vault/backups/databases/

# 2. Create ZFS snapshot (lightweight, instant)
zfs snapshot rpool/ROOT/pve-1@pre-upgrade-$(date +%Y%m%d)

# 3. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 4. Document Proxmox config
pct list > /vault/backups/proxmox/lxc-list.txt
qm list > /vault/backups/proxmox/vm-list.txt
ip addr show > /vault/backups/proxmox/network-config.txt
```
### Short-term (Post-Upgrade)
- [ ] Implement database backup automation (when scripts ready)
- [ ] Set up Proxmox Backup Server (PBS) or vzdump for LXC backups (see the interim sketch after this list)
- [ ] Test restore procedures quarterly
- [ ] Document Proxmox host rebuild procedure
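As an interim for the PBS/vzdump item above, a scheduled vzdump of everything is only a few lines; a sketch with placeholder schedule and storage target (verify free capacity on `local` before enabling):

```bash
# Weekly vzdump of all guests until PBS is in place (schedule/storage are placeholders)
cat > /etc/cron.d/vzdump-weekly <<'EOF'
# m h dom mon dow user command
0 1 * * 0 root vzdump --all 1 --mode snapshot --compress zstd --storage local --quiet 1
EOF
```

The same job could instead be defined as a datacenter backup job in the Proxmox GUI; cron is just the quickest sketch.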
### Long-term (Nice to Have)
- [ ] GitOps with Flux/ArgoCD (auto-recovery of K8s workloads)
- [ ] Offsite backup for vault pool (29TB - expensive)
- [ ] LXC container templates (quick recreation)
## Vault Pool Recovery Plan

### Current Status: ✅ ONLINE - RESOLVED (Dec 2025)

- vault pool: ONLINE (all 8 disks healthy)
- Configuration: RAIDZ2 (can tolerate 2 disk failures)
- Status: 1 legacy data error (file-level, from before repair)

All disks:
- ata-ST4000VX016-3CV104_ZW61ZP88 ONLINE
- ata-ST4000VX016-3CV104_ZW626RNN ONLINE
- ata-ST4000VN006-3CW104_ZW63WH4W ONLINE
- ata-WDC_WD43PURZ-85BWPY0 ONLINE
- ata-WDC_WD40EFPX-68C6CN0 (x2) ONLINE
- ata-ST4000VN008-2DR166_ZGY7RQQQ ONLINE
- ata-ST4000VX016-3CV104_ZW61ZKPG ONLINE
### Recovery History

- Problem (Nov 2025): 2 disks failing with read/checksum errors
- Solution (Dec 2025): replaced both failing disks; resilver completed
- Remaining: 1 file-level data error (non-critical, from before the repair)
### Future Disk Replacement Procedure (Reference)

⚠️ Replace ONE disk at a time - never both simultaneously

```bash
# 1. Identify the failing disk
zpool status vault

# 2. Order a replacement drive (CMR, not SMR)
# Recommended: WD Red Plus 4TB (WD40EFPX) or Seagate IronWolf 4TB

# 3. Replace the disk
zpool replace vault <old-disk-id> /dev/disk/by-id/<new-disk-id>

# 4. Monitor the resilver (8-12 hours per disk)
watch zpool status vault

# 5. Verify completion and scrub
zpool scrub vault
```
## LXC Consolidation Opportunities

### Current State: 23 LXC Containers

Breakdown:
- 4 dev containers (money-tracker, home-portal, rms, trip-planner)
- 6 media services (Plex, Jellyfin, Tautulli, etc.)
- 4 web apps (WikiJS, Excalidraw, Kiwix, docs)
- 3 storage/sync (NAS, Syncthing, NextCloudPi)
- 2 tools (Calibre-web, Vaultwarden)
- 4 stopped/legacy (RMS old stack)

Problem: maintaining 23 separate containers carries overhead:
- Individual updates for each
- Separate backups needed
- Resource fragmentation
- Complexity in monitoring
### Consolidation Strategy

#### Phase 1: Quick Wins (2-4 hours)
1. Consolidate Dev Containers → Single Dev LXC
Current:

```text
LXC 150 (money-tracker) - 4GB RAM, 4 cores
LXC 160 (home-portal)   - 4GB RAM, 4 cores
LXC 170 (rms)           - 4GB RAM, 4 cores
LXC 190 (trip-planner)  - 4GB RAM, 4 cores
Total: 16GB RAM, 16 cores
```

Proposed:

```text
LXC 200 (dev-workstation) - 8GB RAM, 8 cores
/root/
├── money-tracker/
├── home-portal/
├── rms/
└── trip-planner/
Total: 8GB RAM, 8 cores (50% reduction)
```
Benefits:
- Single filesystem to backup
- Shared dev tooling (Node.js, Docker, Supabase CLI)
- Easier to manage tmux sessions
- Still isolated from the Proxmox host
Implementation:
```bash
# 1. Create the new dev LXC
/lxc:create-nextjs 200 dev-workstation

# 2. Clone all projects into the new container
pct enter 200
cd /root
git clone git@github.com:jakecelentano/money-tracker.git
git clone git@github.com:jakecelentano/home-portal.git
git clone git@github.com:jakecelentano/openrms.git rms
git clone git@github.com:jakecelentano/trip-planner.git

# 3. Set up each project
cd money-tracker && npm install && npx supabase start
cd ../home-portal && npm install && npx supabase start
# etc.

# 4. Test dev servers
tmux new -s money-tracker -d 'cd /root/money-tracker && npm run dev -- -H 0.0.0.0'
tmux new -s home-portal -d 'cd /root/home-portal && npm run dev -- -H 0.0.0.0'
# etc.

# 5. Once verified, stop the old LXCs
pct stop 150 160 170 190

# 6. After 1 week of stability, destroy the old LXCs
pct destroy 150 160 170 190
```
- Estimated time: 2 hours
- Risk: low (keep the old LXCs until the new one is proven stable)
2. Migrate Simple Services to K8s
Candidates:
- docs (LXC 411) - COMPLETED: migrated to OtterWiki on K8s (Dec 2025)
- Excalidraw (LXC 410) - stateless drawing app
- WikiJS (LXC 117) - DEPRECATED: replaced by OtterWiki

Benefits:
- No LXC to maintain
- K8s handles restarts/HA
- Easier to backup (manifests in Git)
Implementation (docs - COMPLETED):
```bash
# Docs migration completed Dec 2025:
# - OtterWiki deployed to K8s (otterwiki namespace)
# - git-sync sidecar auto-pulls from the tower-fleet repo
# - Authentik SSO integration
# - URL: https://otterwiki.bogocat.com
# - LXC 411 can be stopped

# Estimated: 1 hour per service (for the remaining candidates)
```
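Once comfortable that the K8s deployment is serving traffic, the old docs container can be retired; a small verification-then-retire sketch:

```bash
# Confirm the OtterWiki replacement is healthy before touching LXC 411
kubectl get pods -n otterwiki
curl -sI https://otterwiki.bogocat.com | head -1

# Stop the old docs LXC; destroy it only after a comfortable soak period
pct stop 411
# pct destroy 411   # later, once confident nothing still depends on it
```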
3. Remove Stopped/Legacy LXCs
```bash
# These are already stopped and unused
pct destroy 221   # rms-backend (old)
pct destroy 222   # rms-database (old)
pct destroy 223   # rms-loki (old)
pct destroy 126   # homer (stopped)
```
Immediate savings: 4 containers removed (5 minutes)
#### Phase 2: Complex Migrations (Future)

Candidates:
- Jellyfin → K8s (needs GPU passthrough, complex)
- Plex → K8s (similar complexity)
- Kiwix → K8s (needs large storage for wiki dumps)
Timeline: Phase 3+, after Phase 1 proven stable
### Expected Results
| Metric | Current | After Phase 1 | After Phase 2 | Savings |
|---|---|---|---|---|
| LXC Count | 23 | 16 | 12 | 48% reduction |
| RAM Allocated | ~50GB | ~42GB | ~38GB | 24% reduction |
| Backup Complexity | High | Medium | Low | Significant |
| Maintenance Time | 3 hrs/month | 2 hrs/month | 1.5 hrs/month | 50% reduction |
## Action Plan

### Timeline Overview (Updated Dec 2025)

```text
✅ COMPLETED: Vault pool recovery
✅ COMPLETED: Update to Proxmox 8.4.x
✅ COMPLETED: pve8to9 checklist (0 failures)

REMAINING (Before Upgrade):
├─ Install amd64-microcode package
├─ Migrate /etc/sysctl.conf to /etc/sysctl.d/
├─ Create ZFS snapshot
├─ Run manual database backups
└─ Export K8s state

UPGRADE (2-hour maintenance window):
├─ Update repositories (bookworm → trixie)
├─ apt dist-upgrade
├─ Reboot
└─ Post-upgrade verification
```
### Prerequisites Checklist

Before scheduling the upgrade:

- [x] Vault pool is ONLINE ✅ Completed Dec 2025
- [x] Updated to latest PVE 8.x ✅ Running pve-manager 8.4.14
- [x] pve8to9 checklist passed ✅ 0 failures, 5 warnings
- [ ] Install amd64-microcode (`apt install amd64-microcode`) - see the sketch after this checklist
- [ ] Migrate sysctl.conf (`mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf`)
- [ ] ZFS snapshot created (`rpool/ROOT/pve-1@pre-pve9-upgrade`)
- [ ] Database backups completed (run `backup-supabase.sh`, `backup-immich-db.sh`)
- [ ] K8s state exported (`kubectl get all -A -o yaml`)
- [ ] Maintenance window scheduled (2-hour window, weekend)
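The unchecked items above collapse into a handful of commands; a consolidated sketch to run during the pre-upgrade week, all taken from earlier sections of this plan:

```bash
# Remaining pre-upgrade prep, consolidated from the checklist above
apt install amd64-microcode                          # pve8to9 warning: CPU microcode
mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf     # pve8to9 warning: sysctl location
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade
/root/scripts/backup-supabase.sh && /root/scripts/backup-immich-db.sh
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml
```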
### Low-Hanging Fruit (Execute Now)

These can be done immediately:

- Create ZFS snapshot (30 seconds)
- Check vault pool corruption details (5 minutes)
- Remove stopped legacy LXCs (5 minutes)
- Export K8s state (2 minutes)
- ~~Update to Proxmox 8.4.x (30 minutes)~~ ✅ already completed
### Post-Upgrade Next Steps
After successful upgrade to PVE 9:
- Leverage ZFS 2.3.4 for vault pool expansion (add capacity)
- Migrate simple services to K8s (docs, Excalidraw)
- Consolidate dev LXCs to single dev-workstation
- Implement PBS (Proxmox Backup Server) for automated LXC backups
- Test disaster recovery procedures quarterly
## References

### Documentation Links
- Proxmox 8 → 9 Official Guide
- Proxmox VE 9.1 Release Notes
- Disaster Recovery Procedures
- Proxmox/LXC Operations
### Internal Documentation

- Backup Scripts: `/root/scripts/backup-*.sh`
- Migration Scripts: `/root/scripts/migrate-app.sh`
- LXC Templates: `/root/tower-fleet/lxc-templates/`
- K8s Manifests: `/root/tower-fleet/manifests/`
## Change Log

- 2025-12-02: Initial plan created based on upgrade evaluation
- 2025-12-15: Updated status - vault pool ONLINE, PVE 8.4.0, pve8to9 passed (0 failures)
  - Vault pool recovery completed (2 disks replaced)
  - Added backup scripts status table
  - Marked prerequisites as ready for upgrade
  - Enabled automated database backups (Supabase, Immich, OPNsense) with GitHub offsite
  - B2 offsite backup scripts prepared (pending credentials):
    - `/root/scripts/backup-to-b2.sh` - modular backup runner
    - `/root/scripts/b2-jobs/` - job configs for Immich, Plex, Jellyfin, arr-stack
  - Plan: `/root/.claude/plans/iterative-pondering-gray.md`