Proxmox 8 → 9 Upgrade Plan

Comprehensive evaluation and action plan for upgrading Proxmox VE from 8.4 to 9.1.

Status: Ready for upgrade - all blockers resolved
Last Updated: 2025-12-15
Current Version: Proxmox VE 8.4.0 (kernel 6.8.12-17-pve)
Target Version: Proxmox VE 9.1 (kernel 6.17.2)


Executive Summary

Should We Upgrade?

🟢 YES - Ready to proceed

Key Drivers:

- ZFS 2.3.4 adds RAIDZ expansion (future capacity growth without rebuilding the pool)
- Kernel 6.17.2 brings improved hardware support and performance
- OCI image support simplifies LXC container deployment

Blockers (ALL RESOLVED):

- ~~Vault pool DEGRADED~~ ✅ ONLINE - disks replaced Dec 2025
- ~~Not on latest 8.x~~ ✅ On PVE 8.4.0 (pve-manager 8.4.14)
- ~~Database backups~~ ✅ Scripts automated; run them manually once more right before the upgrade

Timeline: Ready for next maintenance window (2-hour window recommended)


Current System Status

Proxmox Host

Version: 8.4.0 (pve-manager 8.4.14)
Kernel: 6.8.12-17-pve
Bootloader: UEFI systemd-boot (dual ESP)
Root: ZFS (rpool)
Storage: 2 ZFS pools (rpool, vault)

Infrastructure

  • 23 LXC containers - Mix of services, media, development
  • 5 VMs - arr-stack (100), k3s cluster (201-203), pc (300)
  • K8s cluster - 3-node k3s with production workloads
  • Critical services - Supabase (k8s), Authentik (k8s), Immich (k8s)

Storage Health

  • rpool: ONLINE, healthy
  • vault: ONLINE (repaired Dec 2025)
  • ✅ All 8 disks healthy (RAIDZ2)
  • ⚠️ 1 legacy data error (non-critical, file-level corruption from before repair)

Feature Analysis

🔴 Critical Features for Our Stack

ZFS 2.3.4 - RAIDZ Expansion Support

Current limitation (PVE 8, ZFS 2.2.7):

- Existing RAIDZ vdevs cannot be grown by adding disks
- Expanding the vault pool's capacity would mean destroying and rebuilding it (days of downtime for ~29TB)
- (The earlier blocker - two failing vault disks - was resolved Dec 2025; see the Vault Pool Recovery Plan)

With Proxmox 9 (ZFS 2.3.4):

- RAIDZ expansion: attach additional disks directly to an existing RAIDZ vdev
- Grow capacity incrementally instead of rebuilding the pool (future-proofing)
- Note: replacing a failed disk (zpool replace) already works on PVE 8; expansion is the genuinely new capability

Example:

# PVE 8 (ZFS 2.2.7): growing a RAIDZ vdev means rebuilding the whole pool
# PVE 9 (ZFS 2.3.4): attach a new disk to the existing RAIDZ2 vdev in place
#                    (vdev name as reported by zpool status, e.g. raidz2-0)
zpool attach vault raidz2-0 /dev/disk/by-id/<new-disk-id>

This feature alone justifies the upgrade now that the vault pool is healthy again.
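
One practical caveat: a pool created under PVE 8 will not have the new 2.3 feature flags enabled after the upgrade, so RAIDZ expansion needs a one-time zpool upgrade first. A minimal sketch (enabling the flags is irreversible; PVE 8 can no longer import the pool afterwards):

# List pools whose feature flags are not yet current, then enable them on vault (one-way)
zpool upgrade
zpool upgrade vault

# While an expansion is running, progress is reported in the pool status
watch zpool status vault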


🟡 Medium Value Features

OCI Image Support for LXC

What: Deploy LXC containers directly from Docker Hub/OCI registries

Use cases for our infrastructure:

- WikiJS (LXC 117) - available as linuxserver/wikijs
- Kiwix (LXC 125) - available as ghcr.io/kiwix/kiwix-serve
- Excalidraw (LXC 410) - available as excalidraw/excalidraw

Benefits:

- Faster deployment (no manual setup)
- Automatic updates via registry
- Consistent configuration

Example:

# Deploy WikiJS from OCI image
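# NOTE: syntax is illustrative; check the PVE 9 docs for the exact OCI template naming on your storage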
pct create 117 local:oci/linuxserver/wikijs \
  --hostname wikijs \
  --memory 2048 \
  --cores 2

Enhanced SDN Monitoring

What: Improved GUI for network monitoring, EVPN tracking, fabric views

Relevance: Limited - we use k3s for most networking, but useful for Proxmox network troubleshooting


🟢 Low Value Features (Not Applicable)

  • Intel TDX - Security feature for confidential computing (not needed)
  • Enhanced Nested Virtualization - Not running hypervisors in VMs
  • vTPM Snapshots - Not using virtual TPM
  • Datacenter Bulk Actions - Single-node setup, minimal benefit

Upgrade Process

Prerequisites

1. Must Complete Before Upgrade

  • [x] Fix vault pool - ✅ Both failing disks replaced (Dec 2025); pool ONLINE
  • [x] Update to latest PVE 8.x - ✅ On 8.4.0 (pve-manager 8.4.14)
  • [ ] Create ZFS snapshot - zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade
  • [ ] Backup critical VMs - K3s cluster VMs (201-203), arr-stack (100)
  • [ ] Test database restore - Manual backup/restore dry run (automation pending)
  • [ ] Document network interfaces - Record MAC/IP mappings (may change post-upgrade)
  • [ ] Backup dev LXCs - money-tracker (150), home-portal (160), rms (170), trip-planner (190)
  • [ ] Export K8s state - kubectl get all -A -o yaml > cluster-backup.yaml
  • [ ] Test pve8to9 checklist - Run pve8to9 --full to identify issues
  • [ ] Schedule maintenance window - 2-hour window, weekend preferred
  • [ ] Prepare rollback plan - Document ZFS snapshot rollback procedure

Step-by-Step Upgrade

Phase 1: Pre-Upgrade (Week Before)

# 1. Update to latest Proxmox 8.x
apt update && apt dist-upgrade
# Reboot if kernel updated
reboot

# 2. Verify system health
pveversion -v
zpool status rpool vault
pct list
qm list

# 3. Run pre-upgrade checks
pve8to9 --full > /tmp/pve8to9-check.txt
cat /tmp/pve8to9-check.txt

# 4. Create ZFS snapshot
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade-$(date +%Y%m%d)
zfs list -t snapshot | grep pre-pve9

Phase 2: Backups (Day Before)

# 1. Manual database backups (automation pending)
# Follow /root/tower-fleet/docs/reference/disaster-recovery.md
# - Supabase PostgreSQL
# - Immich PostgreSQL

# 2. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 3. Backup critical LXCs (optional)
vzdump 150 160 170 190 --storage local --compress zstd

# 4. Document network state
ip addr show > /tmp/network-pre-upgrade.txt
cat /etc/network/interfaces > /tmp/interfaces-pre-upgrade.txt

Phase 3: Repository Updates

# 1. Update Debian sources (bookworm → trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list

# 2. Update Proxmox VE repository
# Edit /etc/apt/sources.list.d/pve-enterprise.list (if using enterprise)
# OR /etc/apt/sources.list.d/pve-no-subscription.list
# Change: deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
# To: deb http://download.proxmox.com/debian/pve trixie pve-no-subscription

# 3. Refresh package index
apt update
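
If additional repos live under /etc/apt/sources.list.d/ (Ceph, no-subscription list, etc.), the same substitution applies there; a small sketch, assuming the classic .list format rather than the newer deb822 .sources files:

# Switch any remaining bookworm references in drop-in repo files, then verify
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list.d/*.list
grep -r 'bookworm' /etc/apt/sources.list /etc/apt/sources.list.d/ || echo "all sources now on trixie"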

Phase 4: Perform Upgrade

# 1. Disable kernel audit logging (prevents log flooding)
systemctl stop systemd-journald-audit.socket
systemctl disable systemd-journald-audit.socket

# 2. Start upgrade (interactive - answer prompts carefully)
apt dist-upgrade

# Important prompts to watch for:
# - /etc/ssh/sshd_config - Keep current version (K)
# - /etc/lvm/lvm.conf - Keep current version (K)
# - GRUB configuration - Review changes carefully
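
# 2b. (Optional) This host boots via UEFI systemd-boot on a dual ESP; confirm both
#     ESPs picked up the new kernel before rebooting (assumes the ESPs are managed
#     by proxmox-boot-tool, the default for a ZFS-root UEFI install)
proxmox-boot-tool status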

# 3. Reboot to new kernel
reboot

Phase 5: Post-Upgrade Verification

# 1. Verify version
pveversion -v
# Should show: pve-manager 9.1.x, kernel 6.17.x

# 2. Check system health
zpool status
pct list
qm list
systemctl status pve-cluster pvedaemon pveproxy

# 3. Verify network (critical!)
ip addr show
# Compare with /tmp/network-pre-upgrade.txt
# If interfaces renamed, update /etc/network/interfaces

# 4. Test services
ping 8.8.8.8
curl https://otterwiki.bogocat.com  # docs site (OtterWiki on K8s)
ssh root@10.89.97.201  # k3s master

# 5. Start critical containers/VMs
pct start 150 160 170 190  # dev LXCs
kubectl get nodes  # k3s cluster
kubectl get pods -A  # all workloads

# 6. Remove systemd-boot meta-package (prevents future boot issues)
apt remove systemd-boot

# 7. Re-run upgrade checklist (should be clean)
pve8to9 --full
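
If step 3 reveals a renamed NIC (a known risk when jumping kernel and systemd versions), either update the bridge-ports entry in /etc/network/interfaces or pin the old name by MAC address. A sketch with placeholder values (the MAC and interface name are examples, not taken from this host):

# Pin a NIC to its previous name with a systemd .link file
cat > /etc/systemd/network/10-persistent-net.link <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=enp3s0
EOF
update-initramfs -u   # rebuild initramfs so the rename applies early in boot
reboot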

Estimated Timeline

| Phase | Duration | Downtime |
| --- | --- | --- |
| Pre-upgrade checks | 30 min | None |
| Repository updates | 5 min | None |
| Package upgrade (apt dist-upgrade) | 30-45 min | Yes |
| Reboot + boot time | 3-5 min | Yes |
| Post-upgrade verification | 20-30 min | Partial |
| Total | 90-120 min | 60-90 min |

Rollback Strategy

⚠️ No Official Downgrade Path

Proxmox does not support downgrading from 9 → 8. If the upgrade fails:

Option 1: Full ZFS Snapshot Rollback

Prerequisites:

- ZFS snapshot created before the upgrade
- Proxmox 8 ISO on hand to boot into rescue mode

Steps:

# 1. Boot from Proxmox 8 ISO (rescue mode)

# 2. Import pool
zpool import -f rpool

# 3. Rollback to snapshot
zfs rollback rpool/ROOT/pve-1@pre-pve9-upgrade

# 4. Reboot (should boot into Proxmox 8)
reboot

# Note: This destroys ALL changes made after snapshot

Recovery time: 15-30 minutes
Data loss: Any changes since snapshot (VM changes, config changes)

Option 2: Kernel Downgrade Only (Partial Rollback)

If upgrade succeeds but new kernel fails:

# Pin old kernel
proxmox-boot-tool kernel pin 6.8.12-17-pve
proxmox-boot-tool refresh
reboot

Recovery time: 5 minutes
Limitations: Still on Proxmox 9 userspace, only kernel reverted

Option 3: Fresh Install + Restore (Nuclear Option)

If the rollback fails:

1. Reinstall Proxmox 8 from ISO (wipes host)
2. Import ZFS pools
3. Restore VMs/LXCs from backups (see the restore sketch below)
4. Rebuild k3s cluster (30-45 min via scripts)
5. Restore databases from backups

Recovery time: 4-6 hours
Requirements: Valid backups of all critical data
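
For step 3, each container comes back with a single pct restore call; a sketch with a placeholder archive path and target storage (local-zfs is an assumption - use whatever storage the container should land on):

# Restore one container from its vzdump archive (path and timestamp are placeholders)
pct restore 150 /var/lib/vz/dump/vzdump-lxc-150-<timestamp>.tar.zst --storage local-zfs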


Disaster Recovery Readiness Assessment

Current State: 🟢 Medium-High Confidence

| Component | Backup Status | Restore Tested | RTO | RPO | Confidence |
| --- | --- | --- | --- | --- | --- |
| OPNsense | ✅ Daily + GitHub | ✅ Yes | 30-60 min | 1 day | 🟢 High |
| K8s Manifests | ✅ Git (real-time) | ✅ Yes | 10 min | Real-time | 🟢 High |
| Supabase DBs | ✅ Daily + GitHub | ⚠️ Scripts tested | 15-30 min | 1 day | 🟢 High |
| Immich DB | ✅ Daily + GitHub | ⚠️ Scripts tested | 5-10 min | 1 day | 🟢 High |
| K8s State | 🔶 Manual export | ⚠️ Untested | 45 min | Manual | 🟡 Medium |
| Proxmox Config | ❌ None | ❌ No | 4-6 hours | N/A | 🔴 Low |
| LXC Filesystems | ❌ None | ❌ No | 2-3 hours | N/A | 🔴 Low |
| Vault Pool (29TB) | ❌ None | ❌ No | Days | N/A | 🔴 Critical |

Backup Scripts Status

| Script | Location | Cron | Offsite (GitHub) |
| --- | --- | --- | --- |
| backup-opnsense.sh | /root/scripts/ | ✅ Daily 2AM | ✅ Weekly (Sunday) |
| backup-supabase.sh | /root/scripts/ | ✅ Daily 3AM | ✅ Weekly (Sunday) |
| backup-immich-db.sh | /root/scripts/ | ✅ Daily 3:30AM | ✅ Weekly (Sunday) |

Features:

- GPG encryption (AES256) with passphrase in /root/.database-backup-passphrase
- 30-day local retention (/vault/backups/)
- Weekly offsite to GitHub (tower-fleet repo, Sundays)
- Backup sizes: OPNsense ~26KB, Supabase ~8MB, Immich ~21MB
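
Because the dumps are GPG-encrypted, any restore starts with a decrypt step; a minimal sketch (the dump file name is a placeholder, and GnuPG 2.1+ needs --pinentry-mode loopback when reading a passphrase file):

# Decrypt and unpack one encrypted database dump before restoring it
gpg --batch --pinentry-mode loopback \
    --passphrase-file /root/.database-backup-passphrase \
    --output supabase.sql.gz \
    --decrypt /vault/backups/databases/supabase-<date>.sql.gz.gpg
gunzip supabase.sql.gz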

Critical Gaps

  1. ~~Database backups not automated~~ ✅ Resolved Dec 2025
  2. ~~No offsite backups~~ ✅ Weekly GitHub push enabled Dec 2025
  3. LXC containers not backed up - would need to be recreated manually
  4. Vault pool has no backup - 29TB of media/data at risk

Recommendations

Immediate (Before Upgrade)

# 1. Manual database backups (until automation ready)
/root/scripts/backup-supabase.sh
/root/scripts/backup-immich-db.sh
# Store in /vault/backups/databases/

# 2. Create ZFS snapshot (lightweight, instant)
zfs snapshot rpool/ROOT/pve-1@pre-upgrade-$(date +%Y%m%d)

# 3. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 4. Document Proxmox config
pct list > /vault/backups/proxmox/lxc-list.txt
qm list > /vault/backups/proxmox/vm-list.txt
ip addr show > /vault/backups/proxmox/network-config.txt

Short-term (Post-Upgrade)

  • [ ] Implement database backup automation (when scripts ready)
  • [ ] Set up Proxmox Backup Server (PBS) or vzdump for LXC backups (see the stopgap sketch below)
  • [ ] Test restore procedures quarterly
  • [ ] Document Proxmox host rebuild procedure
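
Until PBS is set up, the vzdump item above can be covered by a simple scheduled job; a stopgap sketch (storage name, schedule, and retention are assumptions, and PVE's built-in Datacenter → Backup jobs are the cleaner way to configure this):

# Nightly snapshot-mode backup of all guests, keeping the last 3 archives
cat > /etc/cron.d/vzdump-nightly <<'EOF'
30 1 * * * root vzdump --all --mode snapshot --compress zstd --storage local --prune-backups keep-last=3 --quiet 1
EOF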

Long-term (Nice to Have)

  • [ ] GitOps with Flux/ArgoCD (auto-recovery of K8s workloads)
  • [ ] Offsite backup for vault pool (29TB - expensive)
  • [ ] LXC container templates (quick recreation)

Vault Pool Recovery Plan

Current Status: ✅ ONLINE - RESOLVED (Dec 2025)

vault pool: ONLINE (all 8 disks healthy)
Configuration: RAIDZ2 (can tolerate 2 disk failures)
Status: 1 legacy data error (file-level, from before repair)

All disks:
- ata-ST4000VX016-3CV104_ZW61ZP88   ONLINE
- ata-ST4000VX016-3CV104_ZW626RNN   ONLINE
- ata-ST4000VN006-3CW104_ZW63WH4W   ONLINE
- ata-WDC_WD43PURZ-85BWPY0          ONLINE
- ata-WDC_WD40EFPX-68C6CN0 (x2)     ONLINE
- ata-ST4000VN008-2DR166_ZGY7RQQQ   ONLINE
- ata-ST4000VX016-3CV104_ZW61ZKPG   ONLINE

Recovery History

Problem (Nov 2025): 2 disks failing with read/checksum errors
Solution (Dec 2025): Replaced both failing disks, resilver completed
Remaining: 1 file-level data error (non-critical, from before repair)

Future Disk Replacement Procedure (Reference)

⚠️ Replace ONE disk at a time - never both simultaneously

# 1. Identify failing disk
zpool status vault

# 2. Order replacement drive (CMR, not SMR)
# Recommended: WD Red Plus 4TB (WD40EFPX) or Seagate IronWolf 4TB

# 3. Replace disk
zpool replace vault <old-disk-id> /dev/disk/by-id/<new-disk-id>

# 4. Monitor resilver (8-12 hours per disk)
watch zpool status vault

# 5. Verify completion and scrub
zpool scrub vault
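zpool status -v vault   # confirm the scrub completes with 0 errors before trusting the new disk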

LXC Consolidation Opportunities

Current State: 23 LXC Containers

Breakdown:

- 4 dev containers (money-tracker, home-portal, rms, trip-planner)
- 6 media services (Plex, Jellyfin, Tautulli, etc.)
- 4 web apps (WikiJS, Excalidraw, Kiwix, docs)
- 3 storage/sync (NAS, Syncthing, NextCloudPi)
- 2 tools (Calibre-web, Vaultwarden)
- 4 stopped/legacy (RMS old stack)

Problem: Overhead of maintaining 23 separate containers:

- Individual updates for each
- Separate backups needed
- Resource fragmentation
- Complexity in monitoring

Consolidation Strategy

Phase 1: Quick Wins (2-4 hours)

1. Consolidate Dev Containers → Single Dev LXC

Current:

LXC 150 (money-tracker)  - 4GB RAM, 4 cores
LXC 160 (home-portal)    - 4GB RAM, 4 cores
LXC 170 (rms)            - 4GB RAM, 4 cores
LXC 190 (trip-planner)   - 4GB RAM, 4 cores
Total: 16GB RAM, 16 cores

Proposed:

LXC 200 (dev-workstation) - 8GB RAM, 8 cores
  /root/
    ├── money-tracker/
    ├── home-portal/
    ├── rms/
    └── trip-planner/

Total: 8GB RAM, 8 cores (50% reduction)

Benefits:

- Single filesystem to back up
- Shared dev tooling (Node.js, Docker, Supabase CLI)
- Easier to manage tmux sessions
- Still isolated from the Proxmox host

Implementation:

# 1. Create new dev LXC
/lxc:create-nextjs 200 dev-workstation

# 2. Clone all projects into new container
pct enter 200
cd /root
git clone git@github.com:jakecelentano/money-tracker.git
git clone git@github.com:jakecelentano/home-portal.git
git clone git@github.com:jakecelentano/openrms.git rms
git clone git@github.com:jakecelentano/trip-planner.git

# 3. Setup each project
cd money-tracker && npm install && npx supabase start
cd ../home-portal && npm install && npx supabase start
# etc.

# 4. Test dev servers
tmux new -s money-tracker -d 'cd /root/money-tracker && npm run dev -- -H 0.0.0.0'
tmux new -s home-portal -d 'cd /root/home-portal && npm run dev -- -H 0.0.0.0'
# etc.

# 5. Once verified, stop old LXCs
pct stop 150 160 170 190

# 6. After 1 week of stability, destroy old LXCs
pct destroy 150 160 170 190

Estimated time: 2 hours
Risk: Low (keep old LXCs until proven stable)

2. Migrate Simple Services to K8s

Candidates:

- docs (LXC 411) - COMPLETED: migrated to OtterWiki on K8s (Dec 2025)
- Excalidraw (LXC 410) - stateless drawing app (see the k3s sketch below)
- WikiJS (LXC 117) - DEPRECATED: replaced by OtterWiki

Benefits: - No LXC to maintain - K8s handles restarts/HA - Easier to backup (manifests in Git)

Implementation (docs - COMPLETED):

# Docs migration completed Dec 2025:
# - OtterWiki deployed to K8s (otterwiki namespace)
# - git-sync sidecar auto-pulls from tower-fleet repo
# - Authentik SSO integration
# - URL: https://otterwiki.bogocat.com
# - LXC 411 can be stopped

# Estimated: 1 hour per service
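
Excalidraw is the simplest remaining candidate since it is stateless; a minimal sketch using plain kubectl (the namespace, image tag, and service wiring are assumptions; the real deployment would live as manifests in the tower-fleet repo, like OtterWiki does):

# Stateless Excalidraw on k3s; no volumes required
kubectl create namespace excalidraw
kubectl -n excalidraw create deployment excalidraw --image=excalidraw/excalidraw --port=80
kubectl -n excalidraw expose deployment excalidraw --port=80
# Ingress + Authentik SSO wiring omitted; follow the OtterWiki pattern above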

3. Remove Stopped/Legacy LXCs

# These are already stopped and unused
pct destroy 221  # rms-backend (old)
pct destroy 222  # rms-database (old)
pct destroy 223  # rms-loki (old)
pct destroy 126  # homer (stopped)

Immediate savings: 4 containers removed (5 minutes)

Phase 2: Complex Migrations (Future)

Candidates:

- Jellyfin → K8s (needs GPU passthrough, complex)
- Plex → K8s (similar complexity)
- Kiwix → K8s (needs large storage for wiki dumps)

Timeline: Phase 3+, after Phase 1 proven stable

Expected Results

| Metric | Current | After Phase 1 | After Phase 2 | Savings |
| --- | --- | --- | --- | --- |
| LXC Count | 23 | 16 | 12 | 48% reduction |
| RAM Allocated | ~50GB | ~42GB | ~38GB | 24% reduction |
| Backup Complexity | High | Medium | Low | Significant |
| Maintenance Time | 3 hrs/month | 2 hrs/month | 1.5 hrs/month | 50% reduction |

Action Plan

Timeline Overview (Updated Dec 2025)

✅ COMPLETED: Vault pool recovery
✅ COMPLETED: Update to Proxmox 8.4.x
✅ COMPLETED: pve8to9 checklist (0 failures)

REMAINING (Before Upgrade):
  ├─ Install amd64-microcode package
  ├─ Migrate /etc/sysctl.conf to /etc/sysctl.d/
  ├─ Create ZFS snapshot
  ├─ Run manual database backups
  └─ Export K8s state

UPGRADE (2-hour maintenance window):
  ├─ Update repositories (bookworm → trixie)
  ├─ apt dist-upgrade
  ├─ Reboot
  └─ Post-upgrade verification

Prerequisites Checklist

Before scheduling upgrade:

  • [x] Vault pool is ONLINE ✅ Completed Dec 2025
  • [x] Updated to latest PVE 8.x ✅ On 8.4.0 (pve-manager 8.4.14)
  • [x] pve8to9 checklist passed ✅ 0 failures, 5 warnings
  • [ ] Install amd64-microcode (apt install amd64-microcode)
  • [ ] Migrate sysctl.conf (mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf)
  • [ ] ZFS snapshot created (rpool/ROOT/pve-1@pre-pve9-upgrade)
  • [ ] Database backups completed (run backup-supabase.sh, backup-immich-db.sh)
  • [ ] K8s state exported (kubectl get all -A -o yaml)
  • [ ] Maintenance window scheduled (2-hour window, weekend)
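
The unchecked items above can be knocked out in one short session; a sketch that simply strings together the commands already referenced in the checklist:

# Remaining pre-upgrade tasks (same commands as the checklist items above)
apt install amd64-microcode
mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade-$(date +%Y%m%d)
/root/scripts/backup-supabase.sh
/root/scripts/backup-immich-db.sh
mkdir -p /vault/backups/k8s
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml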

Low-Hanging Fruit (Execute Now)

These can be done immediately:

  1. Create ZFS snapshot (30 seconds)

    zfs snapshot rpool/ROOT/pve-1@before-upgrade-planning
    

  2. Check vault pool corruption details (5 minutes)

    zpool status -v vault > /tmp/vault-status-$(date +%Y%m%d).txt
    cat /tmp/vault-status-*.txt
    

  3. Remove stopped legacy LXCs (5 minutes)

    pct destroy 221 222 223 126
    

  4. Export K8s state (2 minutes)

    mkdir -p /vault/backups/k8s
    kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml
    

  5. Update to Proxmox 8.4.x (30 minutes) - ✅ completed Dec 2025; re-run before the upgrade to catch any late 8.x point releases

    apt update && apt dist-upgrade
    # Reboot if kernel updated
    

Post-Upgrade Next Steps

After successful upgrade to PVE 9:

  1. Leverage ZFS 2.3.4 for vault pool expansion (add capacity)
  2. Migrate simple services to K8s (docs, Excalidraw)
  3. Consolidate dev LXCs to single dev-workstation
  4. Implement PBS (Proxmox Backup Server) for automated LXC backups
  5. Test disaster recovery procedures quarterly

References

Internal Documentation

  • Backup Scripts: /root/scripts/backup-*.sh
  • Migration Scripts: /root/scripts/migrate-app.sh
  • LXC Templates: /root/tower-fleet/lxc-templates/
  • K8s Manifests: /root/tower-fleet/manifests/


Change Log

  • 2025-12-02: Initial plan created based on upgrade evaluation
  • 2025-12-15: Updated status - vault pool ONLINE, PVE 8.4.0, pve8to9 passed (0 failures)
  • Vault pool recovery completed (2 disks replaced)
  • Added backup scripts status table
  • Marked prerequisites as ready for upgrade
  • Enabled automated database backups (Supabase, Immich, OPNsense) with GitHub offsite
  • B2 offsite backup scripts prepared (pending credentials):
    • /root/scripts/backup-to-b2.sh - modular backup runner
    • /root/scripts/b2-jobs/ - job configs for Immich, Plex, Jellyfin, arr-stack
    • Plan: /root/.claude/plans/iterative-pondering-gray.md