
arr-stack Kubernetes Migration Assessment

Status: Assessment Phase
Priority: Low (Backlog)
Created: 2025-12-02
Owner: Infrastructure Team


Executive Summary

This document assesses the feasibility, benefits, risks, and implementation approach for migrating the arr-stack media automation system from Docker Compose on VM 100 to Kubernetes.

Current State: arr-stack runs on VM 100 (10.89.97.50) using Docker Compose with 11 services, stable and operational.

Key Finding: Migration is technically feasible but introduces significant complexity with moderate benefits. The VPN networking requirement (Gluetun) is the primary technical challenge.

Recommendation: Defer migration until one of these conditions is met:

  1. VM 100 experiences reliability issues
  2. Need arises for advanced K8s features (autoscaling, canary deployments)
  3. Unified K8s management becomes a critical operational requirement
  4. A solution for VPN sidecar networking is proven in a homelab context


Table of Contents

  1. Current Architecture
  2. Migration Drivers
  3. Technical Challenges
  4. Proposed Kubernetes Architecture
  5. Implementation Options
  6. Resource Requirements
  7. Migration Plan
  8. Risk Assessment
  9. Cost-Benefit Analysis
  10. Recommendation

Current Architecture

Infrastructure

VM: VM 100 (10.89.97.50)
OS: Debian 12
Orchestration: Docker Compose
Location: /opt/arr-stack/docker-compose.yml
Storage: /mnt mounted from NAS (LXC 101)

Service Topology

┌─────────────────────────────────────────────────────────┐
│                       VM 100                             │
│                                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Gluetun (VPN Container - Mullvad WireGuard)     │  │
│  │  • Routes all download traffic through VPN      │  │
│  │  • Exposes ports for SABnzbd (8080) & Deluge    │  │
│  └──────────────────────────────────────────────────┘  │
│         ▲                                ▲              │
│         │ network_mode: service         │              │
│  ┌──────┴──────┐                 ┌──────┴──────┐      │
│  │  SABnzbd    │                 │   Deluge    │      │
│  │  (Usenet)   │                 │  (Torrent)  │      │
│  └─────────────┘                 └─────────────┘      │
│                                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐ │
│  │ Sonarr  │  │ Radarr  │  │ Lidarr  │  │ Bazarr  │ │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘ │
│                                                         │
│  ┌──────────┐  ┌───────────┐  ┌───────────┐          │
│  │ Prowlarr │  │ Overseerr │  │Jellyseerr │          │
│  └──────────┘  └───────────┘  └───────────┘          │
│                                                         │
│  ┌──────────┐                                          │
│  │Watchtower│  (Auto-updates at 3:00 AM daily)        │
│  └──────────┘                                          │
└─────────────────────────────────────────────────────────┘
         /mnt (NFS from LXC 101)
         ├── media/
         │   ├── tv/
         │   ├── movies/
         │   ├── music/
         │   └── torrents/
         └── downloads/

Key Characteristics

VPN Dependency:

  • Gluetun container provides the VPN tunnel
  • SABnzbd and Deluge use network_mode: "service:gluetun"
  • All download traffic routes through Mullvad WireGuard

Storage:

  • Config data: /opt/arr-stack/configs/ (local to VM)
  • Media data: /mnt (NFS mount from NAS)
  • Total media size: ~2TB
  • Config size: ~500MB

Networking:

  • All services exposed via direct port mapping
  • Forward auth via K8s Ingress (already implemented)
  • Services communicate via Docker bridge network

Updates:

  • Watchtower auto-updates containers daily at 3:00 AM
  • LinuxServer.io images (well-maintained)


Migration Drivers

Benefits of Migrating to Kubernetes

1. Unified Management

  • Single orchestration platform (K8s) for all services
  • Consistent deployment patterns across homelab
  • Centralized configuration management

2. Improved Observability

  • Native Prometheus metrics scraping
  • Grafana dashboards for arr-stack services
  • Centralized logging via Loki (already deployed)
  • Better visibility into resource usage

3. Enhanced Reliability

  • Automatic pod restarts on failure
  • Health checks and readiness probes
  • Resource limits and requests enforced
  • Better isolation between services

4. Advanced Features

  • Horizontal pod autoscaling (if needed)
  • Rolling updates with zero downtime
  • Canary deployments for testing
  • Network policies for security

5. Disaster Recovery

  • Declarative manifests in git (GitOps)
  • Easier backup/restore via Velero
  • Consistent with other homelab apps

Why Current Setup Works Well

Stability: Docker Compose is battle-tested, no issues in production

Simplicity: Single docker-compose.yml, easy to understand and modify

VPN Integration: network_mode: service works perfectly for routing traffic

Resource Efficiency: No K8s overhead, direct access to host resources

Update Strategy: Watchtower handles updates automatically

Low Maintenance: "Set and forget" - hasn't required intervention


Technical Challenges

1. VPN Networking (Primary Challenge)

Problem: Kubernetes doesn't support network_mode: "service:container" directly.

Current Approach:

sabnzbd:
  network_mode: "service:gluetun"  # Routes all traffic through Gluetun

K8s Options:

Option A: Sidecar Container Pattern

  • Deploy Gluetun as sidecar in each download client pod
  • Use shared network namespace (pod-level networking)
  • Pros: Clean, K8s-native approach
  • Cons: Duplicate VPN connections, more resource usage

Option B: Shared Network Namespace

  • Deploy Gluetun in separate pod with hostNetwork
  • Route download client traffic through Gluetun pod IP
  • Pros: Single VPN connection
  • Cons: Complex networking, requires CNI plugin support

Option C: VPN Gateway Service

  • Create dedicated VPN gateway pod
  • Use K8s Service with specific routing rules
  • Pros: Centralized VPN management
  • Cons: Requires advanced networking configuration

Option D: Keep Gluetun on VM, Connect via Network Policy

  • Leave VPN container on VM 100
  • Connect K8s pods to VM via external service
  • Pros: Minimal changes to working VPN setup
  • Cons: Defeats purpose of full migration, hybrid complexity

Recommendation: Start with Option A (Sidecar) for simplicity and K8s-native approach.

2. Storage Migration

Challenge: Large media library and config data need persistent storage.

Current: - Config: Local to VM at /opt/arr-stack/configs/ (~500MB) - Media: NFS mount from LXC 101 at /mnt (~2TB)

K8s Options:

Config Storage

Option 1: Longhorn PersistentVolumes (current K8s storage class)

  • Pros: Integrated, replicated, backed up
  • Cons: Overhead for small config files

Option 2: ConfigMaps for read-only configs

  • Pros: K8s-native, version controlled
  • Cons: Only for non-sensitive, read-only data

Option 3: hostPath on a K8s worker node

  • Pros: Fast, local storage
  • Cons: Node-specific, not portable
Recommendation: Use Longhorn PVCs for config data (proper K8s pattern).

Media Storage

Option 1: NFS StorageClass pointing to LXC 101

  • Pros: No data migration needed, shared across nodes
  • Cons: Requires NFS StorageClass setup (not yet configured)

Option 2: Direct NFS PersistentVolume

  • Pros: Explicit control over the mount
  • Cons: Manual PV creation per app

Option 3: Keep media on the VM, access via external service

  • Pros: No migration needed
  • Cons: Hybrid architecture, defeats the purpose

Recommendation: Create NFS StorageClass for /vault/subvol-101-disk-0/media.

3. Port Management

Current: Direct port exposure via VM (8080, 8112, 8989, etc.)

K8s: Services need LoadBalancer IPs or Ingress routing (already have Ingress for auth).

Solution: Use existing Ingress configuration (forward auth already implemented).
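For illustration, a minimal Ingress for one service, sketched under the assumption of ingress-nginx with Authentik's forward-auth outpost (the auth URLs and hostname below are assumptions; in practice the existing Ingress annotations would be reused as-is):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sonarr
  namespace: arr-stack
  annotations:
    # Forward auth via Authentik (URLs are illustrative assumptions)
    nginx.ingress.kubernetes.io/auth-url: "http://authentik.auth.svc.cluster.local:9000/outpost.goauthentik.io/auth/nginx"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.internal/outpost.goauthentik.io/start?rd=$escaped_request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: sonarr.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sonarr
                port:
                  number: 8989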

4. State and Data Consistency

Concerns:

  • Database files in configs/ (SQLite for most arr services)
  • Download state in SABnzbd/Deluge
  • Queue state in Sonarr/Radarr

Mitigation:

  1. Full backup before migration
  2. Quiesce services (pause downloads, let in-progress items complete)
  3. Migrate config data to K8s PVCs
  4. Test with read-only mounts first
  5. Validate data integrity post-migration (see the checksum sketch below)
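One way to validate integrity, sketched under the assumption that a helper pod (config-migration, as used in the Migration Plan below) has the target PVC mounted at /config:

# Checksum source configs on the VM and the migrated copy in the PVC, then diff
ssh root@10.89.97.50 'cd /opt/arr-stack/configs/prowlarr && find . -type f -exec md5sum {} +' | sort -k2 > /tmp/vm-checksums.txt
kubectl exec -n arr-stack config-migration -- sh -c 'cd /config && find . -type f -exec md5sum {} +' | sort -k2 > /tmp/k8s-checksums.txt
diff /tmp/vm-checksums.txt /tmp/k8s-checksums.txt && echo "configs match"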

5. Auto-Updates (Watchtower Replacement)

Current: Watchtower updates containers daily at 3:00 AM.

K8s Options:

  • Manual image updates with kubectl set image
  • ArgoCD Image Updater (if using GitOps)
  • Renovate bot for manifest updates
  • Custom CronJob to check for image updates

Recommendation: Manual updates or ArgoCD image updater (if deploying ArgoCD).
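As a sketch of the CronJob route: because the manifests pin :latest tags, a nightly rollout restart re-pulls images (assuming imagePullPolicy: Always), roughly reproducing Watchtower's behavior. All names below (ServiceAccount, CronJob) are assumptions:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: arr-stack-updater
  namespace: arr-stack
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-restarter
  namespace: arr-stack
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: arr-stack-updater
  namespace: arr-stack
subjects:
  - kind: ServiceAccount
    name: arr-stack-updater
    namespace: arr-stack
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployment-restarter
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: arr-stack-updater
  namespace: arr-stack
spec:
  schedule: "0 3 * * *"  # same 3:00 AM window Watchtower used
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: arr-stack-updater
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              # Restarting all Deployments re-pulls :latest images
              command: ["kubectl", "rollout", "restart", "deployment", "-n", "arr-stack"]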


Proposed Kubernetes Architecture

Namespace Structure

apiVersion: v1
kind: Namespace
metadata:
  name: arr-stack
  labels:
    name: arr-stack
    monitoring: enabled

Pod Architecture (Sidecar Pattern)

SABnzbd Pod (with Gluetun Sidecar)

apiVersion: v1
kind: Pod
metadata:
  name: sabnzbd
  namespace: arr-stack
spec:
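  # Containers in a pod share the network namespace by default; that is what
  # replaces network_mode: "service:gluetun". shareProcessNamespace additionally
  # shares the PID namespace and is optional for VPN routing.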
  shareProcessNamespace: true
  containers:
    # VPN Sidecar
    - name: gluetun
      image: qmcgaw/gluetun:latest
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      env:
        - name: VPN_SERVICE_PROVIDER
          value: "mullvad"
        - name: VPN_TYPE
          value: "wireguard"
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: vpn-credentials
              key: wireguard-key
        - name: SERVER_CITIES
          value: "Boston MA"
      # Health check for VPN connectivity
      livenessProbe:
        exec:
          command: ["sh", "-c", "wget -q --spider https://api.ipify.org"]
        initialDelaySeconds: 30
        periodSeconds: 60

    # SABnzbd Application
    - name: sabnzbd
      image: lscr.io/linuxserver/sabnzbd:latest
      env:
        - name: PUID
          value: "1000"
        - name: PGID
          value: "1000"
        - name: TZ
          value: "America/New_York"
      volumeMounts:
        - name: config
          mountPath: /config
        - name: media
          mountPath: /data
      ports:
        - containerPort: 8080
          name: http

  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: sabnzbd-config
    - name: media
      persistentVolumeClaim:
        claimName: arr-stack-media  # Shared NFS volume

Key Points:

  • Containers in a pod share the same network namespace by default; this is what replaces network_mode: "service:gluetun"
  • Gluetun provides the VPN tunnel; SABnzbd routes through it
  • shareProcessNamespace: true additionally shares the PID namespace (not required for network sharing)
  • The liveness probe validates VPN connectivity
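To confirm the app container's traffic actually exits via the tunnel (the same check as Appendix B):

kubectl exec -n arr-stack sabnzbd -c sabnzbd -- wget -qO- https://api.ipify.org
# Should return a Mullvad exit IP, not the home IP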

Storage Architecture

Config Storage (Longhorn PVC)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sonarr-config
  namespace: arr-stack
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi

Media Storage (NFS PVC)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: arr-stack-media
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.89.97.237  # LXC 101 NAS
    path: /vault/subvol-101-disk-0/media
  mountOptions:
    - nfsvers=4.1
    - hard
    - timeo=600
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: arr-stack-media
  namespace: arr-stack
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""  # Static PV binding
  volumeName: arr-stack-media
  resources:
    requests:
      storage: 5Ti

Service Deployment Pattern

Each arr service (Sonarr, Radarr, etc.) follows this pattern:

  1. Deployment: Manages replica set and rolling updates
  2. PVC: Persistent config storage (Longhorn)
  3. Service: ClusterIP for internal communication
  4. Ingress: Forward auth via Authentik (already configured)

Example (Sonarr):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sonarr
  namespace: arr-stack
spec:
  replicas: 1
  strategy:
    type: Recreate  # SQLite databases can't handle concurrent access
  selector:
    matchLabels:
      app: sonarr
  template:
    metadata:
      labels:
        app: sonarr
    spec:
      containers:
        - name: sonarr
          image: lscr.io/linuxserver/sonarr:latest
          env:
            - name: PUID
              value: "1000"
            - name: PGID
              value: "1000"
            - name: TZ
              value: "America/New_York"
          volumeMounts:
            - name: config
              mountPath: /config
            - name: media
              mountPath: /data
          ports:
            - containerPort: 8989
              name: http
          livenessProbe:
            httpGet:
              path: /ping
              port: 8989
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ping
              port: 8989
            initialDelaySeconds: 15
            periodSeconds: 10
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: sonarr-config
        - name: media
          persistentVolumeClaim:
            claimName: arr-stack-media
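A minimal sketch of the matching ClusterIP Service (item 3 of the pattern above); the name and selector follow the Deployment labels:

apiVersion: v1
kind: Service
metadata:
  name: sonarr
  namespace: arr-stack
spec:
  type: ClusterIP
  selector:
    app: sonarr
  ports:
    - name: http
      port: 8989
      targetPort: http  # refers to the named container port

Other services can then reach it at sonarr.arr-stack.svc.cluster.local:8989, the same DNS pattern used for the download clients in Phase 3.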

Implementation Options

Option 1: Full Migration (All Services)

Scope: Migrate all 11 arr-stack services to K8s at once.

Pros:

  • Clean cutover, no hybrid state
  • Unified management from day one
  • Simpler to reason about

Cons:

  • Higher risk (all eggs in one basket)
  • Longer downtime window
  • Harder to roll back if issues arise

Downtime: 2-4 hours (backup, quiesce, migrate, validate)

Option 2: Phased Migration (Service by Service)

Phase 1: Management services (Prowlarr, Overseerr, Jellyseerr)

  • No VPN dependency
  • Lower risk
  • Validates K8s patterns

Phase 2: Content management (Sonarr, Radarr, Lidarr, Bazarr)

  • Core functionality
  • Tests storage integration
  • Validates service communication

Phase 3: Download clients (SABnzbd, Deluge + Gluetun)

  • VPN complexity
  • Most critical for privacy
  • Highest-risk components

Pros:

  • Lower risk per phase
  • Easier rollback
  • Learn and adapt between phases

Cons:

  • Hybrid architecture during transition
  • Longer total migration time
  • Service intercommunication across VM/K8s

Downtime per phase: 30-60 minutes

Option 3: Parallel Deployment (Keep VM as Fallback)

Approach:

  1. Deploy arr-stack to K8s alongside the VM deployment
  2. Run both in parallel with separate configs
  3. Test the K8s version thoroughly
  4. Cut over when confident
  5. Keep the VM as a hot standby for 1-2 weeks

Pros:

  • Zero-downtime migration
  • Easy rollback (just switch back)
  • Full validation before cutover

Cons:

  • Duplicate downloads during testing
  • More complex (managing two instances)
  • Requires duplicate storage for configs

Downtime: None (cutover via DNS/Ingress)


Resource Requirements

Compute Resources

| Service | Current (VM) | K8s Requests | K8s Limits | Notes |
|---------|--------------|--------------|------------|-------|
| Gluetun (x2) | - | 100m / 128Mi | 200m / 256Mi | Per sidecar |
| SABnzbd | - | 250m / 512Mi | 1000m / 1Gi | CPU-intensive (extraction) |
| Deluge | - | 250m / 512Mi | 500m / 1Gi | Torrent handling |
| Sonarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Radarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Lidarr | - | 250m / 512Mi | 1000m / 1Gi | API-heavy |
| Prowlarr | - | 250m / 256Mi | 500m / 512Mi | Lightweight |
| Bazarr | - | 250m / 256Mi | 500m / 512Mi | Subtitle processing |
| Overseerr | - | 250m / 256Mi | 500m / 512Mi | Request management |
| Jellyseerr | - | 250m / 256Mi | 500m / 512Mi | Request management |
| Total | ~2 cores, 4GB | ~2.5 cores, 4.5GB | ~7 cores, 9GB | With VPN sidecars |

Cluster Capacity:

  • Current: 3 worker nodes, ~12 cores, ~24GB RAM total
  • arr-stack would use: ~21% of CPU requests, ~19% of memory requests
  • Verdict: Sufficient capacity exists
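A quick way to spot-check headroom before committing (kubectl top requires metrics-server; <worker-node> is a placeholder):

kubectl top nodes
kubectl describe node <worker-node> | grep -A 5 'Allocated resources'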

Storage Resources

| Type | Size | K8s Storage | Notes |
|------|------|-------------|-------|
| Config (Sonarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Radarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Lidarr) | 50-100MB | Longhorn PVC (5Gi) | SQLite database |
| Config (Prowlarr) | 10-20MB | Longhorn PVC (1Gi) | Lightweight |
| Config (Bazarr) | 50-100MB | Longhorn PVC (5Gi) | Subtitle cache |
| Config (SABnzbd) | 50MB | Longhorn PVC (5Gi) | Queue/history |
| Config (Deluge) | 50MB | Longhorn PVC (5Gi) | Torrent state |
| Config (Overseerr) | 50MB | Longhorn PVC (5Gi) | Request database |
| Config (Jellyseerr) | 50MB | Longhorn PVC (5Gi) | Request database |
| Config (Gluetun) | 1MB | ConfigMap | VPN config |
| Media | ~2TB | NFS PV | Shared, no migration |
| Total Longhorn | ~500MB | ~41Gi provisioned | Overprovisioned for growth |

Storage Impact:

  • Longhorn: +41Gi provisioned (~10Gi actual usage)
  • NFS: No impact (existing mount)
  • Verdict: Minimal impact on Longhorn capacity

Network Resources

MetalLB IPs: Already using K8s Ingress (no additional IPs needed)

Bandwidth: Same as current (downloads via VPN, LAN access)

DNS: Already configured (*.internal domain)


Migration Plan

Prerequisites

  1. NFS StorageClass Setup

    # Create NFS StorageClass for media storage
    kubectl apply -f /root/tower-fleet/manifests/storage/nfs-storageclass.yaml
    

  2. Secrets Creation

    # VPN credentials
    kubectl create secret generic vpn-credentials \
      --from-literal=wireguard-key='<WIREGUARD_KEY>' \
      -n arr-stack
    
    # Optional: Migrate existing configs to ConfigMaps/Secrets
    

  3. Backup Current State

    ssh root@10.89.97.50
    cd /opt/arr-stack
    tar -czf /root/arr-stack-backup-$(date +%Y%m%d).tar.gz configs/
    docker compose down
    

Phase 1: Non-VPN Services (Low Risk)

Services: Prowlarr, Overseerr, Jellyseerr

Steps:

  1. Create namespace and storage:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/namespace.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/pvcs.yaml
    

  2. Migrate config data:

    # Copy configs to K8s PVCs via temporary pod
    kubectl run -n arr-stack config-migration --image=busybox --restart=Never \
      --overrides='<json with volume mounts>' -- sleep 3600
    kubectl cp /opt/arr-stack/configs/prowlarr arr-stack/config-migration:/config
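    # For reference, a hypothetical completed --overrides invocation (standard
    # kubectl pod-spec merge JSON; "prowlarr-config" is the PVC name assumed
    # from pvcs.yaml, and the container name matches the pod name kubectl run assigns):
    kubectl run -n arr-stack config-migration --image=busybox --restart=Never \
      --overrides='
      {
        "apiVersion": "v1",
        "spec": {
          "containers": [{
            "name": "config-migration",
            "image": "busybox",
            "command": ["sleep", "3600"],
            "volumeMounts": [{"name": "config", "mountPath": "/config"}]
          }],
          "volumes": [{
            "name": "config",
            "persistentVolumeClaim": {"claimName": "prowlarr-config"}
          }]
        }
      }'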
    

  3. Deploy services:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/prowlarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/overseerr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/jellyseerr.yaml
    

  4. Validate:

  • Services start successfully
  • Configs loaded correctly
  • Web UI accessible via Ingress
  • Authentication works (Authentik forward auth)

  5. Monitor for 24-48 hours

Rollback: Restart services on VM 100, update Ingress endpoints back to VM.

Phase 2: Content Management (Medium Risk)

Services: Sonarr, Radarr, Lidarr, Bazarr

Steps:

  1. Quiesce services:

  • Pause all monitoring/searching in Sonarr/Radarr/Lidarr
  • Let current downloads complete
  • Wait for idle state

  2. Migrate config data (same process as Phase 1)

  3. Deploy services:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/sonarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/radarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/lidarr.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/bazarr.yaml
    

  4. Validate:

  • All series/movie/music libraries intact
  • API keys still valid (check connections to Prowlarr and download clients)
  • Queue processing resumes
  • File imports work correctly

  5. Monitor for 24-48 hours

Rollback: Restart on VM, copy back any changed config data.

Phase 3: Download Clients (High Risk)

Services: Gluetun, SABnzbd, Deluge

Steps:

  1. Pause all downloads:

  • Pause the SABnzbd queue
  • Pause all torrents in Deluge
  • Wait for idle state

  2. Validate VPN connectivity on K8s:

    # Deploy test pod with Gluetun sidecar
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/gluetun-test.yaml
    
    # Validate VPN IP (should be Mullvad exit node)
    kubectl exec -n arr-stack gluetun-test -- curl https://api.ipify.org
    

  3. Deploy download clients:

    kubectl apply -f /root/tower-fleet/manifests/arr-stack/sabnzbd.yaml
    kubectl apply -f /root/tower-fleet/manifests/arr-stack/deluge.yaml
    

  4. Validate:

  • VPN connection active (check the exit IP via Gluetun logs)
  • SABnzbd/Deluge accessible via K8s Service
  • Download history preserved
  • Test a download through the VPN
  • Verify the download completes and imports to Sonarr/Radarr

  5. Update Sonarr/Radarr/Lidarr download client endpoints:

  • Change from http://10.89.97.50:8080 to http://sabnzbd.arr-stack.svc.cluster.local:8080

  6. Resume operations and monitor closely for 48 hours (a service DNS check is sketched below)
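Before resuming, the new endpoint can be verified from inside the cluster with a throwaway pod (the image choice is illustrative):

    kubectl run -n arr-stack curl-test --rm -it --image=curlimages/curl --restart=Never -- \
      curl -sI http://sabnzbd.arr-stack.svc.cluster.local:8080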

Rollback: Critical - the original stack (including Gluetun's working config) remains intact on VM 100, so download clients can be restarted there immediately if needed.

Phase 4: Decommission VM 100

After 1-2 weeks of stable operation:

  1. Final backup of VM 100:

    ssh root@10.89.97.50
    tar -czf /root/arr-stack-vm-final-backup.tar.gz /opt/arr-stack
    

  2. Stop Docker Compose on VM:

    cd /opt/arr-stack
    docker compose down
    

  3. Optionally: Repurpose VM 100 or shut down to save resources

Do NOT delete VM until 30+ days of stable K8s operation.

Timeline Estimate

| Phase | Duration | Downtime | Dependencies |
|-------|----------|----------|--------------|
| Prerequisites | 2-4 hours | None | NFS StorageClass, secrets |
| Phase 1 (Non-VPN) | 4-6 hours | 30 min/service | Prerequisites complete |
| Phase 2 (Content Mgmt) | 4-6 hours | 1 hour | Phase 1 stable for 48h |
| Phase 3 (Download Clients) | 6-8 hours | 2 hours | Phase 2 stable for 48h |
| Phase 4 (Decommission) | 1 hour | None | Phase 3 stable for 2 weeks |
| Total | 17-25 hours | ~4 hours total | 3-4 weeks elapsed |

Risk Assessment

High Risks

1. VPN Connectivity Issues

Risk: Gluetun sidecar fails to establish VPN connection in K8s.

Impact: Download clients exposed to ISP (privacy leak), downloads fail.

Probability: Medium (new networking pattern, untested in homelab)

Mitigation:

  • Test the VPN sidecar pattern extensively before migration
  • Add liveness probes to validate VPN connectivity
  • Fail closed: block downloads if the VPN is down (see the NetworkPolicy sketch below)
  • Keep VM 100 as a hot standby during initial rollout
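Gluetun's built-in firewall already fails closed; as defense in depth, a NetworkPolicy can restrict download-client pods to DNS plus WireGuard egress. A sketch, assuming the CNI enforces NetworkPolicy and the pods carry a vpn: required label (both assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vpn-only-egress
  namespace: arr-stack
spec:
  podSelector:
    matchLabels:
      vpn: required
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups (needed to resolve the VPN endpoint)
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow WireGuard handshake/transport (Mullvad's default port)
    - ports:
        - protocol: UDP
          port: 51820

All other egress from labeled pods is denied, so a misconfigured client cannot bypass the tunnel.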

Rollback: Immediately switch back to VM 100, investigate K8s networking.

2. Data Corruption During Migration

Risk: Config database corruption during PVC migration.

Impact: Loss of series/movie metadata, download history, custom settings.

Probability: Low (well-tested migration process)

Mitigation:

  • Full backup before migration
  • Quiesce all services before copying data
  • Validate data integrity post-migration (checksum comparison)
  • Keep the VM backup for 30+ days

Rollback: Restore from backup, restart on VM 100.

3. Storage Performance Degradation

Risk: NFS storage slower than local disk on VM.

Impact: Slower imports, higher CPU usage, potential timeout issues.

Probability: Medium (network-attached storage inherently slower)

Mitigation:

  • Benchmark NFS performance before migration
  • Use NFSv4.1 with optimized mount options
  • Monitor import times and adjust if needed
  • Consider a caching layer if performance issues persist
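A rough sequential-throughput benchmark, run from a K8s worker node (the mount point and file size are illustrative):

mkdir -p /mnt/nfs-test
mount -t nfs4 10.89.97.237:/vault/subvol-101-disk-0/media /mnt/nfs-test
dd if=/dev/zero of=/mnt/nfs-test/bench.tmp bs=1M count=1024 conv=fsync  # sequential write
echo 3 > /proc/sys/vm/drop_caches                                       # drop page cache before the read pass
dd if=/mnt/nfs-test/bench.tmp of=/dev/null bs=1M                        # sequential read
rm /mnt/nfs-test/bench.tmp && umount /mnt/nfs-test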

Rollback: Move back to VM 100 local storage.

Medium Risks

4. Service Intercommunication Issues

Risk: Sonarr/Radarr can't reach SABnzbd/Deluge after migration.

Impact: Downloads fail to trigger, imports fail.

Probability: Low (K8s DNS is reliable)

Mitigation:

  • Test service discovery before full migration
  • Use K8s Service DNS names consistently
  • Add readiness probes to ensure services are reachable

Rollback: Update service endpoints back to VM IPs.

5. Resource Contention

Risk: Arr-stack pods compete for CPU/memory with other K8s apps.

Impact: Performance degradation, OOM kills.

Probability: Low (sufficient cluster capacity)

Mitigation:

  • Set appropriate resource requests/limits
  • Monitor cluster resource usage
  • Scale the cluster if needed (add a worker node)

Rollback: Reduce replica count or restart on VM.

Low Risks

6. Ingress Forward Auth Issues

Risk: Authentik forward auth breaks after migration.

Impact: Can't access arr-stack web UIs.

Probability: Very Low (forward auth already working, Ingress config unchanged)

Mitigation:

  • No changes to Ingress manifests needed
  • Test forward auth after each phase

Rollback: Update Ingress endpoints back to VM IPs.

7. Auto-Update Disruption

Risk: Loss of Watchtower auto-updates.

Impact: Manual update process required.

Probability: High (Watchtower not migrating)

Mitigation:

  • Document the manual update process
  • Consider ArgoCD Image Updater for automation
  • Set up alerts for outdated images

No rollback needed: Manual updates are acceptable.


Cost-Benefit Analysis

Costs (Effort & Risk)

| Category | Effort | Risk | Notes |
|----------|--------|------|-------|
| Planning & Documentation | 8 hours | Low | This document |
| NFS StorageClass Setup | 2 hours | Low | One-time, reusable |
| Manifest Creation | 16 hours | Medium | 11 services + shared resources |
| Testing VPN Pattern | 8 hours | High | Critical path, unknown unknowns |
| Phase 1 Migration | 6 hours | Low | Non-VPN services |
| Phase 2 Migration | 6 hours | Medium | Content management |
| Phase 3 Migration | 8 hours | High | VPN-dependent downloads |
| Validation & Monitoring | 20 hours | Medium | Over 3-4 weeks |
| Documentation Updates | 4 hours | Low | Update arr-stack.md |
| Total Effort | ~78 hours | Medium-High | ~2 weeks full-time |

Benefits (Qualitative)

| Benefit | Value | Timeframe | Notes |
|---------|-------|-----------|-------|
| Unified Management | Medium | Immediate | Single platform for all services |
| Improved Observability | Medium | Immediate | Prometheus/Grafana integration |
| Better Reliability | Low-Medium | 1-3 months | Health checks, auto-restarts |
| GitOps-Ready | Low | Long-term | If implementing ArgoCD |
| Disaster Recovery | Low-Medium | Long-term | Velero backups, declarative config |
| Resource Efficiency | Negative | Immediate | K8s overhead vs Docker Compose |
| Operational Simplicity | Negative | Short-term | More complex during learning curve |

Quantitative Analysis

Cost: ~78 hours @ $0/hour = $0 (homelab, personal time)

Benefit: Moderate improvement in observability and management

Break-Even: Unclear - benefits are mostly qualitative

Opportunity Cost: The same 78 hours could be spent on:

  • New homelab applications (AI subtitle generator, etc.)
  • Improving existing apps (trip-planner budget tracking)
  • Infrastructure improvements (SSL/TLS, backups, monitoring)


Recommendation

Primary Recommendation: DEFER MIGRATION

Rationale:

  1. Current State is Stable: arr-stack on VM 100 has been running reliably with zero issues. "If it ain't broke, don't fix it."

  2. High Effort, Moderate Benefits: 78 hours of migration effort for benefits that are mostly qualitative. The improvement in observability and management doesn't justify the risk and effort.

  3. VPN Networking Uncertainty: The sidecar pattern for Gluetun is untested in this homelab. This introduces unknown risks for a privacy-critical component.

  4. Better Alternatives Exist: The same 78 hours could deliver higher-value improvements:

  • SSL/TLS for all services (security improvement)
  • Backup strategy with Velero (disaster recovery)
  • New applications (AI subtitle generator, knowledge base)
  • Enhanced monitoring and alerting

  5. Hybrid Architecture Complexity: During phased migration, managing services across VM and K8s adds operational complexity.

  6. Resource Efficiency Loss: Docker Compose is more efficient than K8s for this use case (no orchestration overhead).

Conditions for Reconsidering Migration

Revisit this decision if any of these conditions arise:

  1. VM 100 Reliability Issues: Hardware failures, persistent Docker problems, or stability concerns

  2. Need for K8s-Specific Features: Requirements for autoscaling, canary deployments, or advanced traffic management

  3. Unified Management Becomes Critical: If managing hybrid VM/K8s becomes painful operationally

  4. Proven VPN Pattern: If Gluetun sidecar pattern is successfully implemented and validated in another project

  5. ArgoCD Deployment: If implementing GitOps for other apps, arr-stack could benefit from inclusion

Alternative: Incremental Improvements to Current Setup

Instead of full migration, consider these low-effort improvements:

  1. Add Prometheus Exporters (see the node-exporter sketch below):

  • Deploy node exporter on VM 100
  • Scrape arr service metrics via existing APIs
  • Create Grafana dashboards

  2. Backup Automation:

  • Schedule daily config backups to /vault
  • Test the restoration procedure

  3. Monitoring Integration:

  • Send Docker logs to Loki (already deployed)
  • Set up alerts for service failures

  4. GitOps for Config:

  • Version control docker-compose.yml in the tower-fleet repo
  • Document update procedures
Effort: ~8-10 hours total
Benefit: Improved observability with minimal risk
Cost: $0 (no downtime, no migration risk)


Appendices

Appendix A: NFS StorageClass Manifest

# /root/tower-fleet/manifests/storage/nfs-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-media
provisioner: kubernetes.io/no-provisioner  # Static provisioning
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: arr-stack-media
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.89.97.237
    path: /vault/subvol-101-disk-0/media
  mountOptions:
    - nfsvers=4.1
    - hard
    - timeo=600
    - retrans=2
    - noresvport

Appendix B: Gluetun Test Pod

# /root/tower-fleet/manifests/arr-stack/gluetun-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gluetun-test
  namespace: arr-stack
spec:
  shareProcessNamespace: true
  containers:
    - name: gluetun
      image: qmcgaw/gluetun:latest
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
      env:
        - name: VPN_SERVICE_PROVIDER
          value: "mullvad"
        - name: VPN_TYPE
          value: "wireguard"
        - name: WIREGUARD_PRIVATE_KEY
          valueFrom:
            secretKeyRef:
              name: vpn-credentials
              key: wireguard-key
        - name: SERVER_CITIES
          value: "Boston MA"
    - name: test
      image: busybox
      command: ["sleep", "3600"]

Test VPN connectivity:

kubectl exec -n arr-stack gluetun-test -c test -- wget -qO- https://api.ipify.org
# Should return Mullvad exit node IP, not home IP


Document Status: Draft
Next Review: When reconsidering migration (see conditions above)
Maintained By: Infrastructure Team