# Proxmox 8 → 9 Upgrade Plan

Comprehensive evaluation and action plan for upgrading Proxmox VE from 8.4 to 9.1.

- Status: Ready for upgrade (prerequisites complete)
- Last Updated: 2025-12-15
- Current Version: Proxmox VE 8.4.0 (kernel 6.8.12-17-pve)
- Target Version: Proxmox VE 9.1 (kernel 6.17.2)
## Executive Summary

### Should We Upgrade?

🟢 YES - Ready to proceed

Key Drivers:
- ZFS 2.3.4 adds RAIDZ expansion (future capacity growth)
- Kernel 6.17.2 brings improved hardware support and performance
- New container-management features (OCI images)

Blockers (ALL RESOLVED):
- ~~Vault pool DEGRADED~~ ✅ ONLINE - disks replaced Dec 2025
- ~~Not on latest 8.x~~ ✅ Running pve-manager 8.4.14
- Database backups - scripts exist; run them manually once more before the upgrade
Timeline: Ready for next maintenance window (2-hour window recommended)
## Current System Status

### Proxmox Host

- Version: 8.4.0 (pve-manager 8.4.14)
- Kernel: 6.8.12-17-pve
- Bootloader: UEFI systemd-boot (dual ESP)
- Root: ZFS (rpool)
- Storage: 2 ZFS pools (rpool, vault)
### Infrastructure
- 23 LXC containers - Mix of services, media, development
- 5 VMs - arr-stack (100), k3s cluster (201-203), pc (300)
- K8s cluster - 3-node k3s with production workloads
- Critical services - Supabase (k8s), Authentik (k8s), Immich (k8s)
### Storage Health

- rpool: ONLINE, healthy
- vault: ONLINE (repaired Dec 2025)
  - ✅ All 8 disks healthy (RAIDZ2)
  - ⚠️ 1 legacy data error (non-critical, file-level corruption from before repair)
## Feature Analysis

### 🔴 Critical Features for Our Stack

#### ZFS 2.3.4 - RAIDZ Expansion Support

Limitation on PVE 8:
- ZFS 2.2.7 (PVE 8) cannot add disks to an existing RAIDZ vdev
- Growing vault's capacity today would require destroying and rebuilding the pool (days of downtime for 29TB)
- (The two failing disks from Nov 2025 were handled with an ordinary `zpool replace` and resilver, which works on PVE 8; see Vault Pool Recovery Plan)

With Proxmox 9:
- ZFS 2.3.4 adds RAIDZ expansion: new disks can be attached directly to an existing RAIDZ vdev
- Capacity can be grown one drive at a time, with no pool rebuild (future-proofing)

Example:

```bash
# PVE 8 (ZFS 2.2.7): growing a RAIDZ vdev means rebuilding the entire pool
# PVE 9 (ZFS 2.3.4): attach an additional disk to the existing RAIDZ vdev
# ('raidz2-0' is the usual vdev name - confirm with `zpool status vault`)
zpool attach vault raidz2-0 /dev/disk/by-id/<new-disk-id>
```

This feature alone justifies the upgrade now that the vault pool is healthy again.
### 🟡 Medium Value Features

#### OCI Image Support for LXC
What: Deploy LXC containers directly from Docker Hub/OCI registries
Use cases for our infrastructure:
- WikiJS (LXC 117) - Available as linuxserver/wikijs
- Kiwix (LXC 125) - Available as ghcr.io/kiwix/kiwix-serve
- Excalidraw (LXC 410) - Available as excalidraw/excalidraw
Benefits:
- Faster deployment (no manual setup)
- Automatic updates via registry
- Consistent configuration

Example:

```bash
# Deploy WikiJS from an OCI image
# (exact OCI template reference syntax may differ - check `pct create` on PVE 9)
pct create 117 local:oci/linuxserver/wikijs \
    --hostname wikijs \
    --memory 2048 \
    --cores 2
```
#### Enhanced SDN Monitoring
What: Improved GUI for network monitoring, EVPN tracking, fabric views
Relevance: Limited - we use k3s for most networking, but useful for Proxmox network troubleshooting
### 🟢 Low Value Features (Not Applicable)
- Intel TDX - Security feature for confidential computing (not needed)
- Enhanced Nested Virtualization - Not running hypervisors in VMs
- vTPM Snapshots - Not using virtual TPM
- Datacenter Bulk Actions - Single-node setup, minimal benefit
## Upgrade Process

### Prerequisites

#### 1. Must Complete Before Upgrade

- [x] Fix vault pool - 2 failing disks replaced Dec 2025 (see Vault Pool Recovery Plan)
- [x] Update to latest PVE 8.x - ✅ Running pve-manager 8.4.14
- [ ] Create ZFS snapshot - `zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade`
- [ ] Backup critical VMs - k3s cluster VMs (201-203), arr-stack (100)
- [ ] Test database restore - manual restore dry run against a recent backup (see the sketch after this list)
- [ ] Document network interfaces - record MAC/IP mappings (may change post-upgrade)
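For the restore dry run, one possible approach is to decrypt a recent automated dump and load it into a throwaway PostgreSQL instance instead of the live database. A minimal sketch, assuming the encrypted dumps live under /vault/backups/databases/ and that Docker and psql are available somewhere convenient (e.g. a dev LXC); the filename is a placeholder:

```bash
# Hypothetical restore dry run - adjust the filename to a real dump under /vault/backups/
BACKUP=/vault/backups/databases/supabase-20251214.sql.gpg   # placeholder name

# Decrypt using the passphrase file referenced in the backup scripts section
gpg --batch --pinentry-mode loopback \
    --passphrase-file /root/.database-backup-passphrase \
    --decrypt "$BACKUP" > /tmp/restore-test.sql

# Load into a disposable PostgreSQL container, never the production database
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15
sleep 10
psql "postgresql://postgres:test@localhost:55432/postgres" -f /tmp/restore-test.sql

# Spot-check a few tables, then clean up
docker rm -f pg-restore-test && rm /tmp/restore-test.sql
```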
#### 2. Recommended (Strongly Advised)

- [ ] Backup dev LXCs - money-tracker (150), home-portal (160), rms (170), trip-planner (190)
- [ ] Export K8s state - `kubectl get all -A -o yaml > cluster-backup.yaml`
- [ ] Run pve8to9 checklist - `pve8to9 --full` to identify issues
- [ ] Schedule maintenance window - 2-hour window, weekend preferred
- [ ] Prepare rollback plan - Document ZFS snapshot rollback procedure
### Step-by-Step Upgrade

#### Phase 1: Pre-Upgrade (Week Before)

```bash
# 1. Update to latest Proxmox 8.x
apt update && apt dist-upgrade
# Reboot if the kernel was updated
reboot

# 2. Verify system health
pveversion -v
zpool status rpool vault
pct list
qm list

# 3. Run pre-upgrade checks
pve8to9 --full > /tmp/pve8to9-check.txt
cat /tmp/pve8to9-check.txt

# 4. Create ZFS snapshot
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade-$(date +%Y%m%d)
zfs list -t snapshot | grep pre-pve9
```
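Because this host boots via systemd-boot on two ESPs (see Current System Status), it is also worth confirming both ESPs are in sync before an upgrade that replaces the kernel. A quick check using the standard Proxmox boot tooling:

```bash
# Verify both ESPs are configured and carry current kernels/boot entries
proxmox-boot-tool status

# If anything looks stale, refresh boot entries on all configured ESPs
proxmox-boot-tool refresh
```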
#### Phase 2: Backups (Day Before)

```bash
# 1. Database backups (automated daily - run the scripts once more manually before the upgrade)
# Follow /root/tower-fleet/docs/reference/disaster-recovery.md
# - Supabase PostgreSQL
# - Immich PostgreSQL

# 2. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 3. Backup critical LXCs (optional)
vzdump 150 160 170 190 --storage local --compress zstd

# 4. Document network state
ip addr show > /tmp/network-pre-upgrade.txt
cat /etc/network/interfaces > /tmp/interfaces-pre-upgrade.txt
```
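Before the maintenance window it also helps to confirm that the freshest backups exist and actually decrypt. A small sketch; the directory layout and file extensions are assumptions based on the backup scripts section:

```bash
# List the newest backups (paths assumed - adjust to the real layout under /vault/backups/)
ls -lht /vault/backups/databases/ | head

# Test-decrypt the most recent dump without writing plaintext to disk
LATEST=$(ls -1t /vault/backups/databases/*.gpg | head -1)
gpg --batch --pinentry-mode loopback \
    --passphrase-file /root/.database-backup-passphrase \
    --decrypt "$LATEST" | head -c 200
```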
#### Phase 3: Repository Updates

```bash
# 1. Update Debian sources (bookworm → trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list

# 2. Update Proxmox VE repository
# Edit /etc/apt/sources.list.d/pve-enterprise.list (if using enterprise)
# OR /etc/apt/sources.list.d/pve-no-subscription.list
# Change: deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
# To:     deb http://download.proxmox.com/debian/pve trixie pve-no-subscription

# 3. Refresh package index
apt update
```
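The classic one-line entries above are enough for the upgrade itself. Debian trixie and PVE 9 also support the newer deb822 format, so if you prefer to switch at the same time, this is a sketch of what a no-subscription entry could look like (file name and keyring path are assumptions; check them against the official upgrade guide):

```bash
# Optional: deb822-style repository definition instead of the one-line .list entry
cat > /etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF

# Disable the old one-line entry so the repository is not defined twice, then refresh
# (e.g. comment out /etc/apt/sources.list.d/pve-no-subscription.list)
apt update
```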
#### Phase 4: Perform Upgrade

```bash
# 1. Disable kernel audit logging (prevents log flooding)
systemctl stop systemd-journald-audit.socket
systemctl disable systemd-journald-audit.socket

# 2. Start upgrade (interactive - answer prompts carefully)
apt dist-upgrade

# Important prompts to watch for:
# - /etc/ssh/sshd_config - Keep current version (K)
# - /etc/lvm/lvm.conf - Keep current version (K)
# - Bootloader configuration - Review changes carefully
#   (this host boots via systemd-boot, so GRUB prompts may not appear)

# 3. Reboot to new kernel
reboot
```
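Before rebooting, it can save time later to list the configuration files where dpkg kept the local version during the prompts above, so they can be reconciled after the upgrade; for example:

```bash
# Configs where dpkg saved the packaged default (.dpkg-dist) or the old local copy (.dpkg-old)
find /etc -name '*.dpkg-dist' -o -name '*.dpkg-old' 2>/dev/null

# Review a specific one before merging, e.g.:
# diff -u /etc/ssh/sshd_config /etc/ssh/sshd_config.dpkg-dist
```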
#### Phase 5: Post-Upgrade Verification

```bash
# 1. Verify version
pveversion -v
# Should show: pve-manager 9.1.x, kernel 6.17.x

# 2. Check system health
zpool status
pct list
qm list
systemctl status pve-cluster pvedaemon pveproxy

# 3. Verify network (critical!)
ip addr show
# Compare with /tmp/network-pre-upgrade.txt
# If interfaces were renamed, update /etc/network/interfaces

# 4. Test services
ping 8.8.8.8
curl https://otterwiki.bogocat.com   # docs site (OtterWiki on K8s)
ssh root@10.89.97.201                # k3s master

# 5. Start critical containers/VMs
pct start 150 160 170 190   # dev LXCs
kubectl get nodes           # k3s cluster
kubectl get pods -A         # all workloads

# 6. Remove the systemd-boot meta-package (prevents future boot issues)
apt remove systemd-boot

# 7. Re-run upgrade checklist (should be clean)
pve8to9 --full
```
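If step 3 shows renamed network interfaces, one way to make the names stable again is a systemd .link file keyed on the MAC address recorded in /tmp/network-pre-upgrade.txt. A sketch; the MAC address and the chosen name are placeholders:

```bash
# Pin a predictable NIC name by MAC address (values are placeholders)
cat > /etc/systemd/network/10-lan0.link <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0
EOF

# Reference the pinned name in /etc/network/interfaces,
# include the .link file in the initramfs, then reboot
update-initramfs -u -k all
```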
## Estimated Timeline
| Phase | Duration | Downtime |
|---|---|---|
| Pre-upgrade checks | 30 min | None |
| Repository updates | 5 min | None |
| Package upgrade (apt dist-upgrade) | 30-45 min | Yes |
| Reboot + boot time | 3-5 min | Yes |
| Post-upgrade verification | 20-30 min | Partial |
| Total | 90-120 min | 60-90 min |
## Rollback Strategy

### ⚠️ No Official Downgrade Path

Proxmox does not support downgrading from 9 → 8. If the upgrade fails:

### Option 1: ZFS Snapshot Rollback (Recommended)

Prerequisites:
- ZFS snapshot created before the upgrade
- Proxmox 8 ISO available to boot into rescue mode
Steps:

```bash
# 1. Boot from the Proxmox 8 ISO (rescue mode)

# 2. Import the pool
zpool import -f rpool

# 3. Roll back to the snapshot
zfs rollback rpool/ROOT/pve-1@pre-pve9-upgrade

# 4. Reboot (should boot into Proxmox 8)
reboot

# Note: this destroys ALL changes made after the snapshot
```

- Recovery time: 15-30 minutes
- Data loss: any changes since the snapshot (VM changes, config changes)
### Option 2: Kernel Downgrade Only (Partial Rollback)

If the upgrade succeeds but the new kernel fails:
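A minimal sketch, assuming the previous 6.8 kernel is still installed alongside the new one: pin it as the boot default with the Proxmox boot tooling, reboot, and remove the pin once a fixed 6.17 kernel ships.

```bash
# Boot the known-good PVE 8 kernel while staying on Proxmox 9 userspace
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.8.12-17-pve
reboot

# Later, remove the pin to return to the newest installed kernel
# proxmox-boot-tool kernel unpin
```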
- Recovery time: 5 minutes
- Limitations: still on Proxmox 9 userspace; only the kernel is reverted
### Option 3: Fresh Install + Restore (Nuclear Option)

If rollback fails:

1. Reinstall Proxmox 8 from ISO (wipes host)
2. Import ZFS pools
3. Restore VMs/LXCs from backups
4. Rebuild k3s cluster (30-45 min via scripts)
5. Restore databases from backups

- Recovery time: 4-6 hours
- Requirements: valid backups of all critical data
## Disaster Recovery Readiness Assessment

### Current State: 🟢 Medium-High Confidence
| Component | Backup Status | Restore Tested | RTO | RPO | Confidence |
|---|---|---|---|---|---|
| OPNsense | ✅ Daily + GitHub | ✅ Yes | 30-60 min | 1 day | 🟢 High |
| K8s Manifests | ✅ Git (real-time) | ✅ Yes | 10 min | Real-time | 🟢 High |
| Supabase DBs | ✅ Daily + GitHub | ⚠️ Scripts tested | 15-30 min | 1 day | 🟢 High |
| Immich DB | ✅ Daily + GitHub | ⚠️ Scripts tested | 5-10 min | 1 day | 🟢 High |
| K8s State | 🔶 Manual export | ⚠️ Untested | 45 min | Manual | 🟡 Medium |
| Proxmox Config | ❌ None | ❌ No | 4-6 hours | N/A | 🔴 Low |
| LXC Filesystems | ❌ None | ❌ No | 2-3 hours | N/A | 🔴 Low |
| Vault Pool (29TB) | ❌ None | ❌ No | Days | N/A | 🔴 Critical |
### Backup Scripts Status

| Script | Location | Cron | Offsite (GitHub) |
|---|---|---|---|
| `backup-opnsense.sh` | `/root/scripts/` | ✅ Daily 2AM | ✅ Weekly (Sunday) |
| `backup-supabase.sh` | `/root/scripts/` | ✅ Daily 3AM | ✅ Weekly (Sunday) |
| `backup-immich-db.sh` | `/root/scripts/` | ✅ Daily 3:30AM | ✅ Weekly (Sunday) |
Features:
- GPG encryption (AES256) with passphrase in /root/.database-backup-passphrase
- 30-day local retention (/vault/backups/)
- Weekly offsite to GitHub (tower-fleet repo, Sundays)
- Backup sizes: OPNsense ~26KB, Supabase ~8MB, Immich ~21MB
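For reference, the 30-day local retention presumably comes down to something like the following inside the scripts; shown only to make the policy concrete (the actual logic lives in /root/scripts/backup-*.sh and may differ):

```bash
# Illustrative retention sweep - delete encrypted dumps older than 30 days
find /vault/backups/ -name '*.gpg' -mtime +30 -print -delete
```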
### Critical Gaps
- ~~Database backups not automated~~ ✅ Resolved Dec 2025
- ~~No offsite backups~~ ✅ Weekly GitHub push enabled Dec 2025
- LXC containers not backed up - Would need to manually recreate
- Vault pool has no backup - 29TB of media/data at risk
## Recommendations

### Immediate (Before Upgrade)

```bash
# 1. Manual database backups (run the automated scripts once more by hand)
/root/scripts/backup-supabase.sh
/root/scripts/backup-immich-db.sh
# Store in /vault/backups/databases/

# 2. Create ZFS snapshot (lightweight, instant)
zfs snapshot rpool/ROOT/pve-1@pre-upgrade-$(date +%Y%m%d)

# 3. Export K8s state
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml

# 4. Document Proxmox config
pct list > /vault/backups/proxmox/lxc-list.txt
qm list > /vault/backups/proxmox/vm-list.txt
ip addr show > /vault/backups/proxmox/network-config.txt
```
### Short-term (Post-Upgrade)
- [ ] Implement database backup automation (when scripts ready)
- [ ] Set up Proxmox Backup Server (PBS) or vzdump for LXC backups (see the interim sketch after this list)
- [ ] Test restore procedures quarterly
- [ ] Document Proxmox host rebuild procedure
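As an interim for the PBS/vzdump item above, a scheduled vzdump of everything is only a few lines; a sketch with placeholder schedule and storage target (verify free capacity on `local` before enabling):

```bash
# Weekly vzdump of all guests until PBS is in place (schedule/storage are placeholders)
cat > /etc/cron.d/vzdump-weekly <<'EOF'
# m h dom mon dow user command
0 1 * * 0 root vzdump --all 1 --mode snapshot --compress zstd --storage local --quiet 1
EOF
```

The same job could instead be defined as a datacenter backup job in the Proxmox GUI; cron is just the quickest sketch.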
### Long-term (Nice to Have)
- [ ] GitOps with Flux/ArgoCD (auto-recovery of K8s workloads)
- [ ] Offsite backup for vault pool (29TB - expensive)
- [ ] LXC container templates (quick recreation)
## Vault Pool Recovery Plan

### Current Status: ✅ ONLINE - RESOLVED (Dec 2025)

- vault pool: ONLINE (all 8 disks healthy)
- Configuration: RAIDZ2 (can tolerate 2 disk failures)
- Status: 1 legacy data error (file-level, from before repair)

All disks:
- ata-ST4000VX016-3CV104_ZW61ZP88 ONLINE
- ata-ST4000VX016-3CV104_ZW626RNN ONLINE
- ata-ST4000VN006-3CW104_ZW63WH4W ONLINE
- ata-WDC_WD43PURZ-85BWPY0 ONLINE
- ata-WDC_WD40EFPX-68C6CN0 (x2) ONLINE
- ata-ST4000VN008-2DR166_ZGY7RQQQ ONLINE
- ata-ST4000VX016-3CV104_ZW61ZKPG ONLINE
### Recovery History

- Problem (Nov 2025): 2 disks failing with read/checksum errors
- Solution (Dec 2025): replaced both failing disks; resilver completed
- Remaining: 1 file-level data error (non-critical, from before the repair)
### Future Disk Replacement Procedure (Reference)

⚠️ Replace ONE disk at a time - never both simultaneously

```bash
# 1. Identify the failing disk
zpool status vault

# 2. Order a replacement drive (CMR, not SMR)
# Recommended: WD Red Plus 4TB (WD40EFPX) or Seagate IronWolf 4TB

# 3. Replace the disk
zpool replace vault <old-disk-id> /dev/disk/by-id/<new-disk-id>

# 4. Monitor the resilver (8-12 hours per disk)
watch zpool status vault

# 5. Verify completion and scrub
zpool scrub vault
```
## LXC Consolidation Opportunities

### Current State: 23 LXC Containers

Breakdown:
- 4 dev containers (money-tracker, home-portal, rms, trip-planner)
- 6 media services (Plex, Jellyfin, Tautulli, etc.)
- 4 web apps (WikiJS, Excalidraw, Kiwix, docs)
- 3 storage/sync (NAS, Syncthing, NextCloudPi)
- 2 tools (Calibre-web, Vaultwarden)
- 4 stopped/legacy (RMS old stack)

Problem: maintaining 23 separate containers carries overhead:
- Individual updates for each
- Separate backups needed
- Resource fragmentation
- Complexity in monitoring
### Consolidation Strategy

#### Phase 1: Quick Wins (2-4 hours)
1. Consolidate Dev Containers → Single Dev LXC
Current:

```text
LXC 150 (money-tracker) - 4GB RAM, 4 cores
LXC 160 (home-portal)   - 4GB RAM, 4 cores
LXC 170 (rms)           - 4GB RAM, 4 cores
LXC 190 (trip-planner)  - 4GB RAM, 4 cores
Total: 16GB RAM, 16 cores
```

Proposed:

```text
LXC 200 (dev-workstation) - 8GB RAM, 8 cores
/root/
├── money-tracker/
├── home-portal/
├── rms/
└── trip-planner/
Total: 8GB RAM, 8 cores (50% reduction)
```
Benefits:
- Single filesystem to backup
- Shared dev tooling (Node.js, Docker, Supabase CLI)
- Easier to manage tmux sessions
- Still isolated from the Proxmox host
Implementation:
```bash
# 1. Create the new dev LXC
/lxc:create-nextjs 200 dev-workstation

# 2. Clone all projects into the new container
pct enter 200
cd /root
git clone git@github.com:jakecelentano/money-tracker.git
git clone git@github.com:jakecelentano/home-portal.git
git clone git@github.com:jakecelentano/openrms.git rms
git clone git@github.com:jakecelentano/trip-planner.git

# 3. Set up each project
cd money-tracker && npm install && npx supabase start
cd ../home-portal && npm install && npx supabase start
# etc.

# 4. Test dev servers
tmux new -s money-tracker -d 'cd /root/money-tracker && npm run dev -- -H 0.0.0.0'
tmux new -s home-portal -d 'cd /root/home-portal && npm run dev -- -H 0.0.0.0'
# etc.

# 5. Once verified, stop the old LXCs
pct stop 150 160 170 190

# 6. After 1 week of stability, destroy the old LXCs
pct destroy 150 160 170 190
```
- Estimated time: 2 hours
- Risk: low (keep the old LXCs until the new one is proven stable)
2. Migrate Simple Services to K8s
Candidates:
- docs (LXC 411) - COMPLETED: migrated to OtterWiki on K8s (Dec 2025)
- Excalidraw (LXC 410) - stateless drawing app
- WikiJS (LXC 117) - DEPRECATED: replaced by OtterWiki

Benefits:
- No LXC to maintain
- K8s handles restarts/HA
- Easier to backup (manifests in Git)
Implementation (docs - COMPLETED):
```bash
# Docs migration completed Dec 2025:
# - OtterWiki deployed to K8s (otterwiki namespace)
# - git-sync sidecar auto-pulls from the tower-fleet repo
# - Authentik SSO integration
# - URL: https://otterwiki.bogocat.com
# - LXC 411 can be stopped

# Estimated: 1 hour per service (for the remaining candidates)
```
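Once comfortable that the K8s deployment is serving traffic, the old docs container can be retired; a small verification-then-retire sketch:

```bash
# Confirm the OtterWiki replacement is healthy before touching LXC 411
kubectl get pods -n otterwiki
curl -sI https://otterwiki.bogocat.com | head -1

# Stop the old docs LXC; destroy it only after a comfortable soak period
pct stop 411
# pct destroy 411   # later, once confident nothing still depends on it
```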
3. Remove Stopped/Legacy LXCs
```bash
# These are already stopped and unused
pct destroy 221   # rms-backend (old)
pct destroy 222   # rms-database (old)
pct destroy 223   # rms-loki (old)
pct destroy 126   # homer (stopped)
```
Immediate savings: 4 containers removed (5 minutes)
#### Phase 2: Complex Migrations (Future)

Candidates:
- Jellyfin → K8s (needs GPU passthrough, complex)
- Plex → K8s (similar complexity)
- Kiwix → K8s (needs large storage for wiki dumps)
Timeline: Phase 3+, after Phase 1 proven stable
### Expected Results
| Metric | Current | After Phase 1 | After Phase 2 | Savings |
|---|---|---|---|---|
| LXC Count | 23 | 16 | 12 | 48% reduction |
| RAM Allocated | ~50GB | ~42GB | ~38GB | 24% reduction |
| Backup Complexity | High | Medium | Low | Significant |
| Maintenance Time | 3 hrs/month | 2 hrs/month | 1.5 hrs/month | 50% reduction |
## Action Plan

### Timeline Overview (Updated Dec 2025)

```text
✅ COMPLETED: Vault pool recovery
✅ COMPLETED: Update to Proxmox 8.4.x
✅ COMPLETED: pve8to9 checklist (0 failures)

REMAINING (Before Upgrade):
├─ Install amd64-microcode package
├─ Migrate /etc/sysctl.conf to /etc/sysctl.d/
├─ Create ZFS snapshot
├─ Run manual database backups
└─ Export K8s state

UPGRADE (2-hour maintenance window):
├─ Update repositories (bookworm → trixie)
├─ apt dist-upgrade
├─ Reboot
└─ Post-upgrade verification
```
### Prerequisites Checklist

Before scheduling the upgrade:

- [x] Vault pool is ONLINE ✅ Completed Dec 2025
- [x] Updated to latest PVE 8.x ✅ Running pve-manager 8.4.14
- [x] pve8to9 checklist passed ✅ 0 failures, 5 warnings
- [ ] Install amd64-microcode (`apt install amd64-microcode`) - see the sketch after this checklist
- [ ] Migrate sysctl.conf (`mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf`)
- [ ] ZFS snapshot created (`rpool/ROOT/pve-1@pre-pve9-upgrade`)
- [ ] Database backups completed (run `backup-supabase.sh`, `backup-immich-db.sh`)
- [ ] K8s state exported (`kubectl get all -A -o yaml`)
- [ ] Maintenance window scheduled (2-hour window, weekend)
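The unchecked items above collapse into a handful of commands; a consolidated sketch to run during the pre-upgrade week, all taken from earlier sections of this plan:

```bash
# Remaining pre-upgrade prep, consolidated from the checklist above
apt install amd64-microcode                          # pve8to9 warning: CPU microcode
mv /etc/sysctl.conf /etc/sysctl.d/99-legacy.conf     # pve8to9 warning: sysctl location
zfs snapshot rpool/ROOT/pve-1@pre-pve9-upgrade
/root/scripts/backup-supabase.sh && /root/scripts/backup-immich-db.sh
kubectl get all -A -o yaml > /vault/backups/k8s/cluster-backup-$(date +%Y%m%d).yaml
```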
### Low-Hanging Fruit (Execute Now)

These can be done immediately:

- Create ZFS snapshot (30 seconds)
- Check vault pool corruption details (5 minutes)
- Remove stopped legacy LXCs (5 minutes)
- Export K8s state (2 minutes)
- ~~Update to Proxmox 8.4.x (30 minutes)~~ ✅ already completed
### Post-Upgrade Next Steps
After successful upgrade to PVE 9:
- Leverage ZFS 2.3.4 for vault pool expansion (add capacity)
- Migrate simple services to K8s (docs, Excalidraw)
- Consolidate dev LXCs to single dev-workstation
- Implement PBS (Proxmox Backup Server) for automated LXC backups
- Test disaster recovery procedures quarterly
## References

### Documentation Links
- Proxmox 8 → 9 Official Guide
- Proxmox VE 9.1 Release Notes
- Disaster Recovery Procedures
- Proxmox/LXC Operations
### Internal Documentation

- Backup Scripts: `/root/scripts/backup-*.sh`
- Migration Scripts: `/root/scripts/migrate-app.sh`
- LXC Templates: `/root/tower-fleet/lxc-templates/`
- K8s Manifests: `/root/tower-fleet/manifests/`
## Change Log

- 2025-12-02: Initial plan created based on upgrade evaluation
- 2025-12-15: Updated status - vault pool ONLINE, PVE 8.4.0, pve8to9 passed (0 failures)
  - Vault pool recovery completed (2 disks replaced)
  - Added backup scripts status table
  - Marked prerequisites as ready for upgrade
  - Enabled automated database backups (Supabase, Immich, OPNsense) with GitHub offsite
  - B2 offsite backup scripts prepared (pending credentials):
    - `/root/scripts/backup-to-b2.sh` - modular backup runner
    - `/root/scripts/b2-jobs/` - job configs for Immich, Plex, Jellyfin, arr-stack
  - Plan: `/root/.claude/plans/iterative-pondering-gray.md`