
arr-stack Kubernetes Migration Assessment

Status: Assessment Phase
Priority: Low (Backlog)
Created: 2025-12-02
Owner: Infrastructure Team


Executive Summary

This document assesses the feasibility, benefits, risks, and implementation approach for migrating the arr-stack media automation system from Docker Compose on VM 100 to Kubernetes.

Current State: arr-stack runs on VM 100 (10.89.97.50) using Docker Compose with 11 services, stable and operational.

Key Finding: Migration is technically feasible but introduces significant complexity with moderate benefits. The VPN networking requirement (Gluetun) is the primary technical challenge.

Recommendation: Defer migration until one of these conditions is met:

  1. VM 100 experiences reliability issues
  2. Need arises for advanced K8s features (autoscaling, canary deployments)
  3. Unified K8s management becomes a critical operational requirement
  4. A solution for VPN sidecar networking is proven in a homelab context


Table of Contents

  1. Current Architecture
  2. Migration Drivers
  3. Technical Challenges
  4. Proposed Kubernetes Architecture
  5. Implementation Options
  6. Resource Requirements
  7. Migration Plan
  8. Risk Assessment
  9. Cost-Benefit Analysis
  10. Recommendation

Current Architecture

Infrastructure

VM: VM 100 (10.89.97.50)
OS: Debian 12
Orchestration: Docker Compose
Location: /opt/arr-stack/docker-compose.yml
Storage: /mnt mounted from NAS (LXC 101)

Service Topology

┌─────────────────────────────────────────────────────────┐
│                       VM 100                             │
│                                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Gluetun (VPN Container - Mullvad WireGuard)     │  │
│  │  • Routes all download traffic through VPN      │  │
│  │  • Exposes ports for SABnzbd (8080) & Deluge    │  │
│  └──────────────────────────────────────────────────┘  │
│         ▲                                ▲              │
│         │ network_mode: service         │              │
│  ┌──────┴──────┐                 ┌──────┴──────┐      │
│  │  SABnzbd    │                 │   Deluge    │      │
│  │  (Usenet)   │                 │  (Torrent)  │      │
│  └─────────────┘                 └─────────────┘      │
│                                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐ │
│  │ Sonarr  │  │ Radarr  │  │ Lidarr  │  │ Bazarr  │ │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘ │
│                                                         │
│  ┌──────────┐  ┌───────────┐  ┌───────────┐          │
│  │ Prowlarr │  │ Overseerr │  │Jellyseerr │          │
│  └──────────┘  └───────────┘  └───────────┘          │
│                                                         │
│  ┌──────────┐                                          │
│  │Watchtower│  (Auto-updates at 3:00 AM daily)        │
│  └──────────┘                                          │
└─────────────────────────────────────────────────────────┘
         /mnt (NFS from LXC 101)
         ├── media/
         │   ├── tv/
         │   ├── movies/
         │   ├── music/
         │   └── torrents/
         └── downloads/

Key Characteristics

VPN Dependency:

  • Gluetun container provides the VPN tunnel
  • SABnzbd and Deluge use network_mode: "service:gluetun"
  • All download traffic routes through Mullvad WireGuard

Storage:

  • Config data: /opt/arr-stack/configs/ (local to VM)
  • Media data: /mnt (NFS mount from NAS)
  • Total media size: ~2TB
  • Config size: ~500MB

Networking:

  • All services exposed via direct port mapping
  • Forward auth via K8s Ingress (already implemented)
  • Services communicate via Docker bridge network

Updates:

  • Watchtower auto-updates containers daily at 3:00 AM
  • LinuxServer.io images (well-maintained)


Migration Drivers

Benefits of Migrating to Kubernetes

1. Unified Management

  • Single orchestration platform (K8s) for all services
  • Consistent deployment patterns across homelab
  • Centralized configuration management

2. Improved Observability

  • Native Prometheus metrics scraping
  • Grafana dashboards for arr-stack services
  • Centralized logging via Loki (already deployed)
  • Better visibility into resource usage

3. Enhanced Reliability

  • Automatic pod restarts on failure
  • Health checks and readiness probes
  • Resource limits and requests enforced
  • Better isolation between services

4. Advanced Features

  • Horizontal pod autoscaling (if needed)
  • Rolling updates with zero downtime
  • Canary deployments for testing
  • Network policies for security

5. Disaster Recovery

  • Declarative manifests in git (GitOps)
  • Easier backup/restore via Velero
  • Consistent with other homelab apps

Why Current Setup Works Well

Stability: Docker Compose is battle-tested, no issues in production

Simplicity: Single docker-compose.yml, easy to understand and modify

VPN Integration: network_mode: service works perfectly for routing traffic

Resource Efficiency: No K8s overhead, direct access to host resources

Update Strategy: Watchtower handles updates automatically

Low Maintenance: "Set and forget" - hasn't required intervention


Technical Challenges

1. VPN Networking (Primary Challenge)

Problem: Kubernetes doesn't support network_mode: "service:container" directly.

Current Approach:

sabnzbd:
  network_mode: "service:gluetun"  # Routes all traffic through Gluetun

K8s Options:

Option A: Sidecar Container Pattern

  • Deploy Gluetun as sidecar in each download client pod
  • Use shared network namespace (pod-level networking)
  • Pros: Clean, K8s-native approach
  • Cons: Duplicate VPN connections, more resource usage

Option B: Shared Network Namespace

  • Deploy Gluetun in separate pod with hostNetwork
  • Route download client traffic through Gluetun pod IP
  • Pros: Single VPN connection
  • Cons: Complex networking, requires CNI plugin support

Option C: VPN Gateway Service

  • Create dedicated VPN gateway pod
  • Use K8s Service with specific routing rules
  • Pros: Centralized VPN management
  • Cons: Requires advanced networking configuration

Option D: Keep Gluetun on VM, Connect via Network Policy

  • Leave VPN container on VM 100
  • Connect K8s pods to VM via external service
  • Pros: Minimal changes to working VPN setup
  • Cons: Defeats purpose of full migration, hybrid complexity

Recommendation: Start with Option A (Sidecar) for simplicity and K8s-native approach.

2. Storage Migration

Challenge: Large media library and config data need persistent storage.

Current: - Config: Local to VM at /opt/arr-stack/configs/ (~500MB) - Media: NFS mount from LXC 101 at /mnt (~2TB)

K8s Options:

Config Storage

Option 1: Longhorn PersistentVolumes (current K8s storage class)

  • Pros: Integrated, replicated, backed up
  • Cons: Overhead for small config files

Option 2: ConfigMaps for read-only configs

  • Pros: K8s-native, version controlled
  • Cons: Only for non-sensitive, read-only data

Option 3: hostPath on a K8s worker node

  • Pros: Fast, local storage
  • Cons: Node-specific, not portable
Recommendation: Use Longhorn PVCs for config data (proper K8s pattern).

Media Storage

Option 1: NFS StorageClass pointing to LXC 101

  • Pros: No data migration needed, shared across nodes
  • Cons: Requires NFS StorageClass setup (not yet configured)

Option 2: Direct NFS PersistentVolume

  • Pros: Explicit control over the mount
  • Cons: Manual PV creation per app

Option 3: Keep media on the VM, access via external service

  • Pros: No migration needed
  • Cons: Hybrid architecture, defeats the purpose

Recommendation: Create NFS StorageClass for /vault/subvol-101-disk-0/media.

3. Port Management

Current: Direct port exposure via VM (8080, 8112, 8989, etc.)

K8s: Services need LoadBalancer IPs or Ingress routing (already have Ingress for auth).

Solution: Use existing Ingress configuration (forward auth already implemented).
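For illustration, a minimal Ingress for one service, sketched under the assumption of ingress-nginx with Authentik's forward-auth outpost (the auth URLs and hostname below are assumptions; in practice the existing Ingress annotations would be reused as-is):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sonarr
  namespace: arr-stack
  annotations:
    # Forward auth via Authentik (URLs are illustrative assumptions)
    nginx.ingress.kubernetes.io/auth-url: "http://authentik.auth.svc.cluster.local:9000/outpost.goauthentik.io/auth/nginx"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.internal/outpost.goauthentik.io/start?rd=$escaped_request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: sonarr.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sonarr
                port:
                  number: 8989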

4. State and Data Consistency

Concerns:

  • Database files in configs/ (SQLite for most arr services)
  • Download state in SABnzbd/Deluge
  • Queue state in Sonarr/Radarr

Mitigation:

  1. Full backup before migration
  2. Quiesce services (pause downloads, let in-progress items complete)
  3. Migrate config data to K8s PVCs
  4. Test with read-only mounts first
  5. Validate data integrity post-migration (see the checksum sketch below)
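One way to validate integrity, sketched under the assumption that a helper pod (config-migration, as used in the Migration Plan below) has the target PVC mounted at /config:

# Checksum source configs on the VM and the migrated copy in the PVC, then diff
ssh root@10.89.97.50 'cd /opt/arr-stack/configs/prowlarr && find . -type f -exec md5sum {} +' | sort -k2 > /tmp/vm-checksums.txt
kubectl exec -n arr-stack config-migration -- sh -c 'cd /config && find . -type f -exec md5sum {} +' | sort -k2 > /tmp/k8s-checksums.txt
diff /tmp/vm-checksums.txt /tmp/k8s-checksums.txt && echo "configs match"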

5. Auto-Updates (Watchtower Replacement)

Current: Watchtower updates containers daily at 3:00 AM.

K8s Options:

  • Manual image updates with kubectl set image
  • ArgoCD Image Updater (if using GitOps)
  • Renovate bot for manifest updates
  • Custom CronJob to check for image updates

Recommendation: Manual updates or ArgoCD image updater (if deploying ArgoCD).
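As a sketch of the CronJob route: because the manifests pin :latest tags, a nightly rollout restart re-pulls images (assuming imagePullPolicy: Always), roughly reproducing Watchtower's behavior. All names below (ServiceAccount, CronJob) are assumptions:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: arr-stack-updater
  namespace: arr-stack
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-restarter
  namespace: arr-stack
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: arr-stack-updater
  namespace: arr-stack
subjects:
  - kind: ServiceAccount
    name: arr-stack-updater
    namespace: arr-stack
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployment-restarter
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: arr-stack-updater
  namespace: arr-stack
spec:
  schedule: "0 3 * * *"  # same 3:00 AM window Watchtower used
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: arr-stack-updater
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              # Restarting all Deployments re-pulls :latest images
              command: ["kubectl", "rollout", "restart", "deployment", "-n", "arr-stack"]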


Proposed Kubernetes Architecture

Namespace Structure

apiVersion: v1
kind: Namespace
metadata:
  name: arr-stack
  labels:
    name: arr-stack
    monitoring: enabled

Pod Architecture (Sidecar Pattern)

SABnzbd Pod (with Gluetun Sidecar)

apiVersion: v1
kind: Pod
metadata:
  name: sabnzbd
  namespace: arr-stack
spec:
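  # Containers in a pod share the network namespace by default; that is what
  # replaces network_mode: "service:gluetun". shareProcessNamespace additionally
  # shares the PID namespace and is optional for VPN routing.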
  shareProcessNamespace: true
  containers:
    # VPN Sidecar
    - name: gluetun
      image: qmcgaw/gluetun:latest
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      env:
        - name: VPN_SERVICE_PROVIDER
          value: "mullvad"
        - name: VPN_TYPE
          value: "wireguard"
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: vpn-credentials
              key: wireguard-key
        - name: SERVER_CITIES
          value: "Boston MA"
      # Health check for VPN connectivity
      livenessProbe:
        exec:
          command: ["sh", "-c", "wget -q --spider https://api.ipify.org"]
        initialDelaySeconds: 30
        periodSeconds: 60

    # SABnzbd Application
    - name: sabnzbd
      image: lscr.io/linuxserver/sabnzbd:latest
      env:
        - name: PUID
          value: "1000"
        - name: PGID
          value: "1000"
        - name: TZ
          value: "America/New_York"
      volumeMounts:
        - name: config
          mountPath: /config
        - name: media
          mountPath: /data
      ports:
        - containerPort: 8080
          name: http

  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: sabnzbd-config
    - name: media
      persistentVolumeClaim:
        claimName: arr-stack-media  # Shared NFS volume

Key Points:

  • Containers in a pod share the same network namespace by default; this is what replaces network_mode: "service:gluetun"
  • Gluetun provides the VPN tunnel; SABnzbd routes through it
  • shareProcessNamespace: true additionally shares the PID namespace (not required for network sharing)
  • The liveness probe validates VPN connectivity
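To confirm the app container's traffic actually exits via the tunnel (the same check as Appendix B):

kubectl exec -n arr-stack sabnzbd -c sabnzbd -- wget -qO- https://api.ipify.org
# Should return a Mullvad exit IP, not the home IP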

Storage Architecture

Config Storage (Longhorn PVC)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sonarr-config
  namespace: arr-stack
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi

Media Storage (NFS PVC)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: arr-stack-media
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.89.97.237  # LXC 101 NAS
    path: /vault/subvol-101-disk-0/media
  mountOptions:
    - nfsvers=4.1
    - hard
    - timeo=600
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: arr-stack-media
  namespace: arr-stack
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""  # Static PV binding
  volumeName: arr-stack-media
  resources:
    requests:
      storage: 5Ti

Service Deployment Pattern

Each arr service (Sonarr, Radarr, etc.) follows this pattern:

  1. Deployment: Manages replica set and rolling updates
  2. PVC: Persistent config storage (Longhorn)
  3. Service: ClusterIP for internal communication
  4. Ingress: Forward auth via Authentik (already configured)

Example (Sonarr):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sonarr
  namespace: arr-stack
spec:
  replicas: 1
  strategy:
    type: Recreate  # SQLite databases can't handle concurrent access
  selector:
    matchLabels:
      app: sonarr
  template:
    metadata:
      labels:
        app: sonarr
    spec:
      containers:
        - name: sonarr
          image: lscr.io/linuxserver/sonarr:latest
          env:
            - name: PUID
              value: "1000"
            - name: PGID
              value: "1000"
            - name: TZ
              value: "America/New_York"
          volumeMounts:
            - name: config
              mountPath: /config
            - name: media
              mountPath: /data
          ports:
            - containerPort: 8989
              name: http
          livenessProbe:
            httpGet:
              path: /ping
              port: 8989
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ping
              port: 8989
            initialDelaySeconds: 15
            periodSeconds: 10
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: sonarr-config
        - name: media
          persistentVolumeClaim:
            claimName: arr-stack-media
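A minimal sketch of the matching ClusterIP Service (item 3 of the pattern above); the name and selector follow the Deployment labels:

apiVersion: v1
kind: Service
metadata:
  name: sonarr
  namespace: arr-stack
spec:
  type: ClusterIP
  selector:
    app: sonarr
  ports:
    - name: http
      port: 8989
      targetPort: http  # refers to the named container port

Other services can then reach it at sonarr.arr-stack.svc.cluster.local:8989, the same DNS pattern used for the download clients in Phase 3.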

Implementation Options

Option 1: Full Migration (All Services)

Scope: Migrate all 11 arr-stack services to K8s at once.

Pros:

  • Clean cutover, no hybrid state
  • Unified management from day one
  • Simpler to reason about

Cons:

  • Higher risk (all eggs in one basket)
  • Longer downtime window
  • Harder to roll back if issues arise

Downtime: 2-4 hours (backup, quiesce, migrate, validate)

Option 2: Phased Migration (Service by Service)

Phase 1: Management services (Prowlarr, Overseerr, Jellyseerr)

  • No VPN dependency
  • Lower risk
  • Validates K8s patterns

Phase 2: Content management (Sonarr, Radarr, Lidarr, Bazarr)

  • Core functionality
  • Tests storage integration
  • Validates service communication

Phase 3: Download clients (SABnzbd, Deluge + Gluetun)

  • VPN complexity
  • Most critical for privacy
  • Highest-risk components

Pros:

  • Lower risk per phase
  • Easier rollback
  • Learn and adapt between phases

Cons:

  • Hybrid architecture during transition
  • Longer total migration time
  • Service intercommunication across VM/K8s

Downtime per phase: 30-60 minutes

Option 3: Parallel Deployment (Keep VM as Fallback)

Approach:

  1. Deploy arr-stack to K8s alongside the VM deployment
  2. Run both in parallel with separate configs
  3. Test the K8s version thoroughly
  4. Cut over when confident
  5. Keep the VM as a hot standby for 1-2 weeks

Pros:

  • Zero-downtime migration
  • Easy rollback (just switch back)
  • Full validation before cutover

Cons:

  • Duplicate downloads during testing
  • More complex (managing two instances)
  • Requires duplicate storage for configs

Downtime: None (cutover via DNS/Ingress)


Resource Requirements

Compute Resources

| Service | Current (VM) | K8s Requests | K8s Limits | Notes |
|---------|--------------|--------------|------------|-------|
| Gluetun (x2) | - | 100m / 128Mi | 200m / 256Mi | Per sidecar |
| SABnzbd | - | 250m / 512Mi | 1000m / 1Gi | CPU-intensive (extraction) |
| Deluge | - | 250m / 512Mi | 500m / 1Gi | Torrent handling |
| Sonarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Radarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Lidarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Prowlarr | - | 250m / 256Mi | 500m / 512Mi | Lightweight |
| Bazarr | - | 250m / 256Mi | 500m / 512Mi | Subtitle processing |
| Overseerr | - | 250m / 256Mi | 500m / 512Mi | Request management |
| Jellyseerr | - | 250m / 256Mi | 500m / 512Mi | Request management |
| Total | ~2 cores, 4GB | ~2.5 cores, 4.5GB | ~7 cores, 9GB | With VPN sidecars |

Cluster Capacity:

  • Current: 3 worker nodes, ~12 cores, ~24GB RAM total
  • arr-stack would use: ~21% of CPU requests, ~19% of memory requests
  • Verdict: Sufficient capacity exists
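A quick way to spot-check headroom before committing (kubectl top requires metrics-server; <worker-node> is a placeholder):

kubectl top nodes
kubectl describe node <worker-node> | grep -A 5 'Allocated resources'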

Storage Resources

| Type | Size | K8s Storage | Notes |
|------|------|-------------|-------|
| Config (Sonarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Radarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Lidarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Prowlarr) | 10-20MB | Longhorn PVC (1Gi) | Lightweight |
| Config (Bazarr) | 50-100MB | Longhorn PVC (5Gi) | Subtitle cache |
| Config (SABnzbd) | 50MB | Longhorn PVC (5Gi) | Queue/history |
| Config (Deluge) | 50MB | Longhorn PVC (5Gi) | Torrent state |
| Config (Overseerr) | 50MB | Longhorn PVC (5Gi) | Request database |
| Config (Jellyseerr) | 50MB | Longhorn PVC (5Gi) | Request database |
| Config (Gluetun) | 1MB | ConfigMap | VPN config |
| Media | ~2TB | NFS PV | Shared, no migration |
| Total Longhorn | ~500MB | ~41Gi provisioned | Overprovisioned for growth |

Storage Impact:

  • Longhorn: +41Gi provisioned (~10Gi actual usage)
  • NFS: No impact (existing mount)
  • Verdict: Minimal impact on Longhorn capacity

Network Resources

MetalLB IPs: Already using K8s Ingress (no additional IPs needed)

Bandwidth: Same as current (downloads via VPN, LAN access)

DNS: Already configured (*.internal domain)


Migration Plan

Prerequisites

  1. NFS StorageClass Setup

    # Create NFS StorageClass for media storage
    kubectl apply -f /root/tower-fleet/manifests/storage/nfs-storageclass.yaml
    

  2. Secrets Creation

    # VPN credentials
    kubectl create secret generic vpn-credentials \
      --from-literal=wireguard-key='<WIREGUARD_KEY>' \
      -n arr-stack
    
    # Optional: Migrate existing configs to ConfigMaps/Secrets
    

  3. Backup Current State

    ssh root@10.89.97.50
    cd /opt/arr-stack
    tar -czf /root/arr-stack-backup-$(date +%Y%m%d).tar.gz configs/
    docker compose down
    

Phase 1: Non-VPN Services (Low Risk)

Services: Prowlarr, Overseerr, Jellyseerr

Steps:

  1. Create namespace and storage:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/namespace.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/pvcs.yaml
    

  2. Migrate config data:

    # Copy configs to K8s PVCs via temporary pod
    kubectl run -n arr-stack config-migration --image=busybox --restart=Never \
      --overrides='<json with volume mounts>' -- sleep 3600
    kubectl cp /opt/arr-stack/configs/prowlarr arr-stack/config-migration:/config
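    # For reference, a hypothetical completed --overrides invocation (standard
    # kubectl pod-spec merge JSON; "prowlarr-config" is the PVC name assumed
    # from pvcs.yaml, and the container name matches the pod name kubectl run assigns):
    kubectl run -n arr-stack config-migration --image=busybox --restart=Never \
      --overrides='
      {
        "apiVersion": "v1",
        "spec": {
          "containers": [{
            "name": "config-migration",
            "image": "busybox",
            "command": ["sleep", "3600"],
            "volumeMounts": [{"name": "config", "mountPath": "/config"}]
          }],
          "volumes": [{
            "name": "config",
            "persistentVolumeClaim": {"claimName": "prowlarr-config"}
          }]
        }
      }'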
    

  3. Deploy services:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/prowlarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/overseerr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/jellyseerr.yaml
    

  4. Validate:

  • Services start successfully
  • Configs loaded correctly
  • Web UI accessible via Ingress
  • Authentication works (Authentik forward auth)

  5. Monitor for 24-48 hours

Rollback: Restart services on VM 100, update Ingress endpoints back to VM.

Phase 2: Content Management (Medium Risk)

Services: Sonarr, Radarr, Lidarr, Bazarr

Steps:

  1. Quiesce services:

  • Pause all monitoring/searching in Sonarr/Radarr/Lidarr
  • Let current downloads complete
  • Wait for idle state

  2. Migrate config data (same process as Phase 1)

  3. Deploy services:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/sonarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/radarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/lidarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/bazarr.yaml
    

  4. Validate:

  • All series/movie/music libraries intact
  • API keys still valid (check connections to Prowlarr and download clients)
  • Queue processing resumes
  • File imports work correctly

  5. Monitor for 24-48 hours

Rollback: Restart on VM, copy back any changed config data.

Phase 3: Download Clients (High Risk)

Services: Gluetun, SABnzbd, Deluge

Steps:

  1. Pause all downloads:

  • Pause the SABnzbd queue
  • Pause all torrents in Deluge
  • Wait for idle state

  2. Validate VPN connectivity on K8s:

    # Deploy test pod with Gluetun sidecar
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/gluetun-test.yaml
    
    # Validate VPN IP (should be Mullvad exit node)
    kubectl exec -n arr-stack gluetun-test -- curl https://api.ipify.org
    

  3. Deploy download clients:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/sabnzbd.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/deluge.yaml
    

  4. Validate:

  • VPN connection active (check the exit IP via Gluetun logs)
  • SABnzbd/Deluge accessible via K8s Service
  • Download history preserved
  • Test a download through the VPN
  • Verify the download completes and imports to Sonarr/Radarr

  5. Update Sonarr/Radarr/Lidarr download client endpoints:

  • Change from http://10.89.97.50:8080 to http://sabnzbd.arr-stack.svc.cluster.local:8080

  6. Resume operations and monitor closely for 48 hours (a service DNS check is sketched below)
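Before resuming, the new endpoint can be verified from inside the cluster with a throwaway pod (the image choice is illustrative):

    kubectl run -n arr-stack curl-test --rm -it --image=curlimages/curl --restart=Never -- \
      curl -sI http://sabnzbd.arr-stack.svc.cluster.local:8080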

Rollback: Critical - the original stack (including Gluetun's working config) remains intact on VM 100, so download clients can be restarted there immediately if needed.

Phase 4: Decommission VM 100

After 1-2 weeks of stable operation:

  1. Final backup of VM 100:

    ssh root@10.89.97.50
    tar -czf /root/arr-stack-vm-final-backup.tar.gz /opt/arr-stack
    

  2. Stop Docker Compose on VM:

    cd /opt/arr-stack
    docker compose down
    

  3. Optionally: Repurpose VM 100 or shut down to save resources

Do NOT delete VM until 30+ days of stable K8s operation.

Timeline Estimate

| Phase | Duration | Downtime | Dependencies |
|-------|----------|----------|--------------|
| Prerequisites | 2-4 hours | None | NFS StorageClass, secrets |
| Phase 1 (Non-VPN) | 4-6 hours | 30 min/service | Prerequisites complete |
| Phase 2 (Content Mgmt) | 4-6 hours | 1 hour | Phase 1 stable for 48h |
| Phase 3 (Download Clients) | 6-8 hours | 2 hours | Phase 2 stable for 48h |
| Phase 4 (Decommission) | 1 hour | None | Phase 3 stable for 2 weeks |
| Total | 17-25 hours | ~4 hours total | 3-4 weeks elapsed |

Risk Assessment

High Risks

1. VPN Connectivity Issues

Risk: Gluetun sidecar fails to establish VPN connection in K8s.

Impact: Download clients exposed to ISP (privacy leak), downloads fail.

Probability: Medium (new networking pattern, untested in homelab)

Mitigation:

  • Test the VPN sidecar pattern extensively before migration
  • Add liveness probes to validate VPN connectivity
  • Fail closed: block downloads if the VPN is down (see the NetworkPolicy sketch below)
  • Keep VM 100 as a hot standby during initial rollout
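Gluetun's built-in firewall already fails closed; as defense in depth, a NetworkPolicy can restrict download-client pods to DNS plus WireGuard egress. A sketch, assuming the CNI enforces NetworkPolicy and the pods carry a vpn: required label (both assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vpn-only-egress
  namespace: arr-stack
spec:
  podSelector:
    matchLabels:
      vpn: required
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups (needed to resolve the VPN endpoint)
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow WireGuard handshake/transport (Mullvad's default port)
    - ports:
        - protocol: UDP
          port: 51820

All other egress from labeled pods is denied, so a misconfigured client cannot bypass the tunnel.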

Rollback: Immediately switch back to VM 100, investigate K8s networking.

2. Data Corruption During Migration

Risk: Config database corruption during PVC migration.

Impact: Loss of series/movie metadata, download history, custom settings.

Probability: Low (well-tested migration process)

Mitigation:

  • Full backup before migration
  • Quiesce all services before copying data
  • Validate data integrity post-migration (checksum comparison)
  • Keep the VM backup for 30+ days

Rollback: Restore from backup, restart on VM 100.

3. Storage Performance Degradation

Risk: NFS storage slower than local disk on VM.

Impact: Slower imports, higher CPU usage, potential timeout issues.

Probability: Medium (network-attached storage inherently slower)

Mitigation:

  • Benchmark NFS performance before migration
  • Use NFSv4.1 with optimized mount options
  • Monitor import times and adjust if needed
  • Consider a caching layer if performance issues persist
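A rough sequential-throughput benchmark, run from a K8s worker node (the mount point and file size are illustrative):

mkdir -p /mnt/nfs-test
mount -t nfs4 10.89.97.237:/vault/subvol-101-disk-0/media /mnt/nfs-test
dd if=/dev/zero of=/mnt/nfs-test/bench.tmp bs=1M count=1024 conv=fsync  # sequential write
echo 3 > /proc/sys/vm/drop_caches                                       # drop page cache before the read pass
dd if=/mnt/nfs-test/bench.tmp of=/dev/null bs=1M                        # sequential read
rm /mnt/nfs-test/bench.tmp && umount /mnt/nfs-test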

Rollback: Move back to VM 100 local storage.

Medium Risks

4. Service Intercommunication Issues

Risk: Sonarr/Radarr can't reach SABnzbd/Deluge after migration.

Impact: Downloads fail to trigger, imports fail.

Probability: Low (K8s DNS is reliable)

Mitigation:

  • Test service discovery before full migration
  • Use K8s Service DNS names consistently
  • Add readiness probes to ensure services are reachable

Rollback: Update service endpoints back to VM IPs.

5. Resource Contention

Risk: Arr-stack pods compete for CPU/memory with other K8s apps.

Impact: Performance degradation, OOM kills.

Probability: Low (sufficient cluster capacity)

Mitigation:

  • Set appropriate resource requests/limits
  • Monitor cluster resource usage
  • Scale the cluster if needed (add a worker node)

Rollback: Reduce replica count or restart on VM.

Low Risks

6. Ingress Forward Auth Issues

Risk: Authentik forward auth breaks after migration.

Impact: Can't access arr-stack web UIs.

Probability: Very Low (forward auth already working, Ingress config unchanged)

Mitigation:

  • No changes to Ingress manifests needed
  • Test forward auth after each phase

Rollback: Update Ingress endpoints back to VM IPs.

7. Auto-Update Disruption

Risk: Loss of Watchtower auto-updates.

Impact: Manual update process required.

Probability: High (Watchtower not migrating)

Mitigation:

  • Document the manual update process
  • Consider ArgoCD Image Updater for automation
  • Set up alerts for outdated images

No rollback needed: Manual updates are acceptable.


Cost-Benefit Analysis

Costs (Effort & Risk)

| Category | Effort | Risk | Notes |
|----------|--------|------|-------|
| Planning & Documentation | 8 hours | Low | This document |
| NFS StorageClass Setup | 2 hours | Low | One-time, reusable |
| Manifest Creation | 16 hours | Medium | 11 services + shared resources |
| Testing VPN Pattern | 8 hours | High | Critical path, unknown unknowns |
| Phase 1 Migration | 6 hours | Low | Non-VPN services |
| Phase 2 Migration | 6 hours | Medium | Content management |
| Phase 3 Migration | 8 hours | High | VPN-dependent downloads |
| Validation & Monitoring | 20 hours | Medium | Over 3-4 weeks |
| Documentation Updates | 4 hours | Low | Update arr-stack.md |
| Total Effort | ~78 hours | Medium-High | ~2 weeks full-time |

Benefits (Qualitative)

| Benefit | Value | Timeframe | Notes |
|---------|-------|-----------|-------|
| Unified Management | Medium | Immediate | Single platform for all services |
| Improved Observability | Medium | Immediate | Prometheus/Grafana integration |
| Better Reliability | Low-Medium | 1-3 months | Health checks, auto-restarts |
| GitOps-Ready | Low | Long-term | If implementing ArgoCD |
| Disaster Recovery | Low-Medium | Long-term | Velero backups, declarative config |
| Resource Efficiency | Negative | Immediate | K8s overhead vs Docker Compose |
| Operational Simplicity | Negative | Short-term | More complex during learning curve |

Quantitative Analysis

Cost: ~78 hours @ $0/hour = $0 (homelab, personal time)

Benefit: Moderate improvement in observability and management

Break-Even: Unclear - benefits are mostly qualitative

Opportunity Cost: The same 78 hours could be spent on:

  • New homelab applications (AI subtitle generator, etc.)
  • Improving existing apps (trip-planner budget tracking)
  • Infrastructure improvements (SSL/TLS, backups, monitoring)


Recommendation

Primary Recommendation: DEFER MIGRATION

Rationale:

  1. Current State is Stable: arr-stack on VM 100 has been running reliably with zero issues. "If it ain't broke, don't fix it."

  2. High Effort, Moderate Benefits: 78 hours of migration effort for benefits that are mostly qualitative. The improvement in observability and management doesn't justify the risk and effort.

  3. VPN Networking Uncertainty: The sidecar pattern for Gluetun is untested in this homelab. This introduces unknown risks for a privacy-critical component.

  4. Better Alternatives Exist: The same 78 hours could deliver higher-value improvements:

  • SSL/TLS for all services (security improvement)
  • Backup strategy with Velero (disaster recovery)
  • New applications (AI subtitle generator, knowledge base)
  • Enhanced monitoring and alerting

  5. Hybrid Architecture Complexity: During phased migration, managing services across VM and K8s adds operational complexity.

  6. Resource Efficiency Loss: Docker Compose is more efficient than K8s for this use case (no orchestration overhead).

Conditions for Reconsidering Migration

Revisit this decision if any of these conditions arise:

  1. VM 100 Reliability Issues: Hardware failures, persistent Docker problems, or stability concerns

  2. Need for K8s-Specific Features: Requirements for autoscaling, canary deployments, or advanced traffic management

  3. Unified Management Becomes Critical: If managing hybrid VM/K8s becomes painful operationally

  4. Proven VPN Pattern: If Gluetun sidecar pattern is successfully implemented and validated in another project

  5. ArgoCD Deployment: If implementing GitOps for other apps, arr-stack could benefit from inclusion

Alternative: Incremental Improvements to Current Setup

Instead of full migration, consider these low-effort improvements:

  1. Add Prometheus Exporters (see the node-exporter sketch below):

  • Deploy node exporter on VM 100
  • Scrape arr service metrics via existing APIs
  • Create Grafana dashboards

  2. Backup Automation:

  • Schedule daily config backups to /vault
  • Test the restoration procedure

  3. Monitoring Integration:

  • Send Docker logs to Loki (already deployed)
  • Set up alerts for service failures

  4. GitOps for Config:

  • Version control docker-compose.yml in the tower-fleet repo
  • Document update procedures
Effort: ~8-10 hours total
Benefit: Improved observability with minimal risk
Cost: $0 (no downtime, no migration risk)


Appendices

Appendix A: NFS StorageClass Manifest

# /root/tower-fleet/manifests/storage/nfs-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-media
provisioner: kubernetes.io/no-provisioner  # Static provisioning
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: arr-stack-media
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.89.97.237
    path: /vault/subvol-101-disk-0/media
  mountOptions:
    - nfsvers=4.1
    - hard
    - timeo=600
    - retrans=2
    - noresvport

Appendix B: Gluetun Test Pod

# /root/tower-fleet/manifests/arr-stack/gluetun-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gluetun-test
  namespace: arr-stack
spec:
  shareProcessNamespace: true
  containers:
    - name: gluetun
      image: qmcgaw/gluetun:latest
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      env:
        - name: VPN_SERVICE_PROVIDER
          value: "mullvad"
        - name: VPN_TYPE
          value: "wireguard"
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: vpn-credentials
              key: wireguard-key
        - name: SERVER_CITIES
          value: "Boston MA"
    - name: test
      image: busybox
      command: ["sleep", "3600"]

Test VPN connectivity:

kubectl exec -n arr-stack gluetun-test -c test -- wget -qO- https://api.ipify.org
# Should return Mullvad exit node IP, not home IP


Document Status: Draft
Next Review: When reconsidering migration (see conditions above)
Maintained By: Infrastructure Team