Tower Fleet Roadmap

This document centralizes all planned improvements, enhancements, and backlog items for the Tower Fleet homelab infrastructure. Items are organized by category and priority.

Last Updated: 2025-12-29


Legend

  • 🔴 Critical: Security issues, broken functionality, blocking problems
  • 🟠 High: Significant improvements, important features
  • 🟡 Medium: Quality of life improvements, optimizations
  • ⚪ Backlog: Nice-to-have features, long-term goals
  • ✅ Completed: Recently completed items (kept for reference)
  • 🔄 In Progress: Currently being worked on

Infrastructure

🟠 High Priority

  • [x] PgBouncer Connection Pooler for Supabase ✓ Completed 2025-12-15
  • Problem: After host reboots, all pods reconnect to PostgreSQL simultaneously, exhausting max_connections (100)
  • Solution: Deploy PgBouncer as connection pooler in front of Supabase PostgreSQL
  • Result: Authentik connections reduced from 21-95 to 5. Post-reboot recovery now smooth.
  • Fix Applied: Added max_prepared_statements=100 for Authentik compatibility (PgBouncer 1.21+ feature)
  • Docs: PgBouncer Migration
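
The arithmetic behind the problem above can be sketched in Python, as a rough illustration rather than the deployed configuration. Without a pooler, every client opens its own server connection, so a simultaneous reconnect can exceed max_connections; with a pooler, each app is capped at its pool size. Only max_connections (100) and the Authentik figures (peak 95, pooled 5) come from this roadmap; the other app names and numbers are hypothetical.

```python
MAX_CONNECTIONS = 100  # PostgreSQL max_connections, from the roadmap entry


def total_server_connections(demand, pool_caps=None):
    """Sum server-side connections; pool_caps limits each app when a pooler is used."""
    if pool_caps is None:
        return sum(demand.values())
    return sum(min(n, pool_caps.get(app, n)) for app, n in demand.items())


# Peak client demand right after a reboot. Only the Authentik peak (95) is
# from the roadmap; "app-a" and "app-b" are hypothetical placeholders.
demand = {"authentik": 95, "app-a": 10, "app-b": 10}
# Authentik's pooled figure (5) is from the roadmap; the others are assumed.
pool_caps = {"authentik": 5, "app-a": 5, "app-b": 5}

direct = total_server_connections(demand)
pooled = total_server_connections(demand, pool_caps)
print(direct, direct > MAX_CONNECTIONS)  # 115 True
print(pooled, pooled > MAX_CONNECTIONS)  # 15 False
```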

  • [ ] Ingress Monitoring & Alerting

  • Add Prometheus ServiceMonitor for NGINX Ingress
  • Create Grafana dashboard for Ingress metrics
  • Set up alerts for: 5xx errors, high latency, backend failures
  • Related: Observability Standards

  • [x] Backup Strategy (Velero) ✓ Completed 2025-12-16, Updated 2025-12-29

  • Deployed Velero v1.17.1 with CSI snapshot support and Data Mover
  • MinIO S3 gateway on NFS (/vault/subvol-101-disk-0/k8s-backups/minio)
  • Daily backups at 3 AM (7-day retention); weekly backups on Sunday at 4 AM (28-day retention)
  • Configured prepareQueueLength for controlled data uploads (v1.17 feature)
  • Docs: Disaster Recovery Section 3
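
As a rough sketch of how the schedule above maps onto Velero Schedule resources, the Python below builds the two manifests as plain dicts. The cron expressions and retention periods follow the roadmap entry; the resource names and the velero namespace are assumptions, and the real manifests may set additional fields (included namespaces, snapshot options) not shown here.

```python
import json


def velero_schedule(name, cron, retention_days):
    """Build a Velero Schedule manifest as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Name and namespace are hypothetical; adjust to the actual install.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,  # standard five-field cron expression
            # Velero expresses retention as a TTL duration string.
            "template": {"ttl": f"{retention_days * 24}h0m0s"},
        },
    }


daily = velero_schedule("daily-backup", "0 3 * * *", 7)     # every day at 03:00
weekly = velero_schedule("weekly-backup", "0 4 * * 0", 28)  # Sundays at 04:00

print(json.dumps([daily, weekly], indent=2))
```

The cron times are evaluated in the timezone of the Velero server, so "3 AM" here only matches local time if the two agree.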

🟡 Medium Priority

  • [x] Longhorn Backups ✓ Completed 2025-12-16 (via Velero CSI)
  • Velero CSI snapshots replace standalone Longhorn backup config
  • VolumeSnapshotClass longhorn-snapshot with velero.io/csi-volumesnapshot-class label
  • PVC snapshots automatically included in Velero scheduled backups
  • Related: Disaster Recovery Section 3

  • [ ] Add Worker Nodes

  • Evaluate cluster resource usage
  • Add 4th worker node if needed for HA
  • Document node addition procedure
  • Related: Cluster Setup

  • [ ] Pod Autoscaling (HPA)

  • Install metrics-server
  • Configure HorizontalPodAutoscalers for apps
  • Set up CPU/memory-based scaling policies
  • Test scaling behavior under load
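
For planning the scaling policies above, it may help to see the replica calculation the HorizontalPodAutoscaler documents: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The Python below is a simplified sketch of that formula only; the min/max bounds and sample numbers are hypothetical, and the real controller also handles stabilization windows and missing metrics.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA replica calculation (bounds here are hypothetical)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas at 90% average CPU against a 60% target scale up.
print(desired_replicas(2, 90, 60))  # 3
# 4 replicas at 30% against a 60% target scale down.
print(desired_replicas(4, 30, 60))  # 2
# 63% against 60% is inside the 10% tolerance, so nothing changes.
print(desired_replicas(3, 63, 60))  # 3
```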

  • [ ] GitOps with ArgoCD

  • Choose GitOps tool (ArgoCD vs Flux): ArgoCD selected
  • Install ArgoCD and expose UI at argocd.bogocat.com
  • Migrate existing apps to ArgoCD Applications
  • Configure automated sync with self-heal
  • Add Authentik SSO integration
  • Design Doc: ArgoCD Implementation Plan
  • Related: Production Deployment

⚪ Backlog

  • [ ] Supabase Stack Simplification
  • Evaluate migration from PostgREST to direct PostgreSQL + Drizzle ORM
  • Potential savings: ~1.4 GB RAM across prod/sandbox
  • Migration effort: ~60-80 hours across all apps
  • Immediate: Remove unused GoTrue, Studio, postgres-meta (~270 MB)
  • New apps: Use Drizzle instead of expanding PostgREST dependency
  • Evaluation: Supabase Stack Evaluation

  • [ ] Velero Backup Failure Alerts

  • ServiceMonitor for Velero metrics (:8085/metrics) and Grafana dashboard (ID: 16829 - Velero Overview) completed 2025-12-23 (see Operations, Recently Completed)
  • Remaining: set up alerts for backup failures
  • Related: Disaster Recovery

  • [ ] Service Mesh (Linkerd/Istio)

  • Research Linkerd vs Istio for homelab use case
  • Deploy service mesh for advanced traffic management
  • Implement mutual TLS between services
  • Add distributed tracing
  • Note: Advanced feature, evaluate ROI for homelab

✅ Recently Completed

  • [x] SSL/TLS for Ingress (Let's Encrypt) (2025-12-06)
  • Deployed cert-manager to cluster
  • Configured letsencrypt-prod and letsencrypt-staging ClusterIssuers
  • Created wildcard certificate wildcard-bogocat-tls for *.bogocat.com
  • All apps now use HTTPS via VPS reverse proxy

  • [x] Authentik SSO (2025-12-01 - 2025-12-10)

  • Deployed Authentik to K8s (authentik namespace)
  • Integrated with all Next.js apps via NextAuth (home-portal, money-tracker, trip-planner, tcg)
  • Configured forward auth outposts for arr-stack apps (bazarr, deluge, jellyseerr, overseerr, prowlarr, radarr, sabnzbd, sonarr, lidarr, ersatztv)
  • LDAP integration for Jellyfin
  • Related: Authentik SSO docs, Authentik Auth Pattern

  • [x] VPS Reverse Proxy (Caddy) (2025-12-01)

  • Configured Caddy on VPS for external access
  • All *.bogocat.com domains route through VPS to homelab
  • Automatic HTTPS with Let's Encrypt
  • Config: /root/tower-fleet/manifests/vps/Caddyfile.production
  • Related: VPS Reverse Proxy docs

  • [x] DNS Hostname Standardization (2025-12-01)

  • Migrated from .internal to *.bogocat.com for all production apps
  • OPNsense DNS overrides for internal direct access
  • Standardized domain pattern established

  • [x] NGINX Ingress Controller (2025-11-19)

  • Deployed ingress-nginx with MetalLB LoadBalancer
  • Migrated home-portal, money-tracker, trip-planner to Ingress
  • Configured .internal domain routing via OPNsense DNS
  • Freed 3 MetalLB IPs (10.89.97.211-213)

  • [x] arr-stack Kubernetes Migration (2025-12-08)

  • Migrated arr-stack services to K8s with Authentik forward auth
  • Services: Sonarr, Radarr, Lidarr, Bazarr, Prowlarr, SABnzbd, Deluge, Overseerr, Jellyseerr, ErsatzTV
  • Jellyfin remains on VM 100 (GPU passthrough requirement)
  • Note: VPN-dependent services (Deluge) may still use VM 100 Gluetun

Networking

🟡 Medium Priority

  • [ ] VLANs for Network Segmentation
  • Segment: Management, Services, IoT, Guest
  • Configure VLAN tagging in OPNsense
  • Update Proxmox network bridges
  • Implement firewall rules between VLANs
  • Related: Network Infrastructure

  • [ ] WireGuard/Tailscale VPN for Remote Access

  • Evaluate WireGuard vs Tailscale for remote access
  • Configure client access to homelab
  • Set up split-tunnel routing
  • Document connection procedures

  • [ ] Network Monitoring (Prometheus Exporters)

  • Deploy SNMP exporter for OPNsense
  • Add node exporters to key infrastructure
  • Create Grafana dashboard for network metrics
  • Set up bandwidth alerts

⚪ Backlog

  • [ ] Bandwidth QoS Policies
  • Configure traffic shaping in OPNsense
  • Prioritize: Interactive (SSH, web) > Streaming > Downloads
  • Test QoS under load

Applications

🔄 In Progress

  • [ ] RMS: Next.js Migration
  • ✅ Migrated to Next.js 16 with Supabase SSR
  • ✅ Using shared K8s Supabase with rms schema
  • ✅ Authentication middleware implemented
  • 🔄 Feature development (recipe generation, pantry management)
  • [ ] Deploy to K8s with Ingress at rms.bogocat.com
  • Related: RMS docs

🟡 Medium Priority

  • [ ] Money Tracker: Advanced Analytics
  • Add spending trends visualization
  • Create budget vs actual comparisons
  • Implement forecasting based on historical data
  • Export reports (PDF, CSV)

  • [ ] Home Portal: Real-time Status

  • Add service health checks (ping/API)
  • Show online/offline status for each service
  • Display resource usage (from Prometheus)
  • Add uptime tracking

  • [ ] Trip Planner: Budget Tracking

  • Add budget database schema
  • Create budget tracking UI
  • Integrate with itinerary planning
  • Related: Trip Planner docs

  • [ ] SubtitleAI: Phase 2 Enhancements

  • LLM enhancement service for cleanup/translation (OpenAI GPT-4o)
  • Additional output formats (VTT, ASS)
  • Bazarr integration for media library batch processing
  • Whisper model selection (base/small/medium)
  • Worker auto-scaling
  • Related: SubtitleAI docs

⚪ Backlog (New Application Ideas)

  • [ ] BrainLearn - Learning Platform
  • Spaced repetition learning system
  • AI-powered content generation
  • Progress tracking
  • Location: /root/projects/brainlearn (scaffolded)

  • [ ] ReplyFlow - Communication Manager

  • Email/message queue management
  • AI-assisted responses
  • Location: /root/projects/replyflow (scaffolded)

✅ Recently Completed

  • [x] SubtitleAI (2025-11-20)
  • Deployed to K8s at subtitles.bogocat.com
  • Features: Whisper transcription, multi-language (12 languages), SRT output
  • Job queue with Celery + Redis
  • Prometheus metrics (45 exported)
  • Related: SubtitleAI docs

  • [x] TCG (Trading Card Game) (2025-12-10)

  • Deployed to K8s at tcg.bogocat.com
  • Authentik SSO integration
  • Supabase backend with tcg schema

  • [x] OtterWiki (2025-12-01)

  • Deployed to K8s at otterwiki.bogocat.com
  • Syncs from /root/tower-fleet/docs/
  • Authentik forward auth protection
  • Serves as primary documentation site

  • [x] Immich Photo Management (2025-12-05)

  • Deployed to K8s at photos.bogocat.com
  • Photo/video backup and organization
  • ML-powered search and face recognition

  • [x] Pelican Game Server Manager (2025-12-08)

  • Deployed to K8s at pelican.bogocat.com
  • Game server provisioning and management
  • Authentik forward auth protection
  • Related: Pelican docs

  • [x] ROMM Game Library (2025-12-08)

  • Deployed to K8s at romm.bogocat.com
  • ROM management and organization
  • Authentik forward auth protection

  • [x] ErsatzTV (2025-12-10)

  • Deployed to K8s at ersatztv.bogocat.com
  • IPTV stream generation from media library
  • Authentik forward auth protection

Operations

🟠 High Priority

  • [ ] Automated Backup Verification
  • Schedule quarterly disaster recovery tests
  • Automate backup integrity checks
  • Document recovery procedures for each app
  • Related: Disaster Recovery

  • [x] Verify Backups Intent ✓ Completed 2025-12-29

  • /intents:verify-backups slash command
  • Check Velero backup status (success/failed/partial)
  • Verify BSL connectivity (MinIO, B2)
  • Report backup freshness, size, coverage
  • Related: Disaster Recovery
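
The status and freshness checks listed above could look roughly like the Python sketch below. It is not the implementation of the /intents:verify-backups command; the input shape (a list of dicts with "phase" and "completed") and the 26-hour freshness window are assumptions. The phase names Completed, PartiallyFailed, and Failed are the ones Velero reports on Backup resources.

```python
from datetime import datetime, timedelta, timezone

# Map Velero backup phases onto the roadmap's success/partial/failed buckets.
PHASES = {"Completed": "success", "PartiallyFailed": "partial", "Failed": "failed"}


def summarize(backups, now, max_age=timedelta(hours=26)):
    """Count backups per bucket and report whether the newest success is fresh.

    backups: list of dicts with 'phase' (str) and 'completed' (aware datetime).
    max_age: hypothetical window; a daily schedule plus some slack.
    """
    counts = {"success": 0, "partial": 0, "failed": 0, "other": 0}
    for backup in backups:
        counts[PHASES.get(backup["phase"], "other")] += 1
    done = [b["completed"] for b in backups if b["phase"] == "Completed"]
    fresh = bool(done) and (now - max(done)) <= max_age
    return counts, fresh


now = datetime(2025, 12, 29, 12, 0, tzinfo=timezone.utc)
sample = [  # hypothetical data for illustration
    {"phase": "Completed", "completed": now - timedelta(hours=9)},
    {"phase": "PartiallyFailed", "completed": now - timedelta(hours=33)},
    {"phase": "Failed", "completed": now - timedelta(hours=57)},
]
counts, fresh = summarize(sample, now)
print(counts, fresh)
```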

🟡 Medium Priority

  • [ ] Centralized Secret Management
  • Evaluate: Sealed Secrets (current) vs Vault vs External Secrets Operator
  • Document secret rotation procedures
  • Implement automated secret rotation for critical services

  • [ ] Log Retention Policies

  • Configure log retention in Loki (currently 30 days default)
  • Archive important logs to /vault/logs/
  • Implement log cleanup automation
  • Related: Loki Operations

  • [ ] Alert Tuning

  • Review and reduce alert noise
  • Fine-tune thresholds based on actual usage
  • Add runbooks for each alert
  • Related: Alerting Guide

⚪ Backlog

  • [ ] Infrastructure as Code (OpenTofu/Terraform)
  • Codify Proxmox infrastructure (VMs, LXCs, networking)
  • Version control infrastructure changes
  • Enable reproducible deployments

  • [ ] Intent System: Learning Loop

  • Add outcome verification to intents (success/failure checks)
  • Capture execution context (before/after state, commands, logs)
  • Anomaly detection rules per intent
  • Auto-draft incidents from anomalies
  • Learning feedback: suggest policy/intent updates from patterns
  • Design Doc: Intent Learning Loop

✅ Recently Completed

  • [x] Velero v1.17 Upgrade & Data Mover Tuning (2025-12-29)
  • Upgraded from v1.16.0 to v1.17.1
  • Configured prepareQueueLength=6 to prevent Longhorn overload
  • Backup success rate improved from 45% (10/22 PVCs) to 77% (17/22 PVCs)
  • Excluded NFS volumes (immich-library, romm-library-nfs) from CSI snapshots
  • Related: Disaster Recovery, Velero Manifests
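
The success-rate figures above follow directly from the PVC counts; a short Python check of the arithmetic (function name is illustrative):

```python
def success_rate(succeeded, total):
    """Percentage of PVCs backed up successfully, rounded to a whole percent."""
    return round(100 * succeeded / total)


before = success_rate(10, 22)  # v1.16.0
after = success_rate(17, 22)   # v1.17.1 with prepareQueueLength=6
print(before, after)  # 45 77
```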

  • [x] Incident Management Process (2025-12-23)

  • Created incident severity levels (P0-P3)
  • Incident report template and filing process
  • Incident archive at docs/incidents/
  • Related: Incident Management

  • [x] Velero Grafana Dashboard (2025-12-23)

  • Added ServiceMonitor for Velero metrics
  • Imported Grafana dashboard (ID 16829)
  • Related: Velero Manifests

Observability

🟠 High Priority

  • [ ] Distributed Tracing (Phase 2)
  • Tools: Jaeger, Zipkin, or Grafana Tempo
  • Add trace IDs to application logs
  • Instrument HTTP clients with tracing
  • Add tracing headers to service-to-service calls
  • Create trace visualization dashboards
  • Related: Observability Standards

🟡 Medium Priority

  • [ ] Application Performance Monitoring (APM)
  • Add performance profiling to Next.js apps
  • Track slow API routes
  • Monitor database query performance
  • Set up alerts for performance regressions

  • [ ] Cost Tracking

  • Track infrastructure costs (electricity, hardware depreciation)
  • Monitor Kubernetes resource usage per app
  • Identify optimization opportunities

⚪ Backlog

  • [ ] Synthetic Monitoring
  • Set up periodic health checks for all services
  • Monitor from external perspective (uptime)
  • Alert on service degradation before users notice

Documentation

🟡 Medium Priority

  • [ ] Beginner Guide to Contributions
  • Step-by-step guide for tackling roadmap items using Claude Code
  • Explain how to read and understand project context (CLAUDE.md, PROJECTS.md)
  • Document common workflows: picking an item, understanding scope, implementing, testing
  • Include examples of good first contributions
  • Cover the code review process and getting changes merged

  • [ ] Video Tutorials

  • Record walkthrough of common operations
  • Create onboarding video for new contributors
  • Document complex procedures visually

  • [ ] Architecture Diagrams

  • Create visual network topology
  • Add Kubernetes architecture diagram
  • Document data flow between services
  • Use Mermaid or draw.io

⚪ Backlog

  • [ ] OtterWiki Bidirectional Sync
  • Enable editing docs directly in OtterWiki UI
  • Changes push back to tower-fleet/docs
  • Options: Git hooks, sidecar watcher, or GitHub webhooks
  • Handle merge conflicts gracefully
  • Design doc: otterwiki-bidirectional-sync.md
  • Current: Edit in repo, wiki is read-only display (works fine)

  • [ ] API Documentation

  • Document internal APIs for each app
  • Add OpenAPI/Swagger specs
  • Generate API reference docs

  • [ ] Runbook Library

  • Expand troubleshooting guides
  • Add step-by-step incident response procedures
  • Document escalation paths

Storage

🟡 Medium Priority

  • [ ] NFS StorageClass for Large Files
  • Create NFS-backed StorageClass in Kubernetes
  • Use for media files, large datasets
  • Mount /vault NFS shares in cluster
  • Related: ADR-002

  • [ ] Volume Expansion Planning

  • Monitor Longhorn capacity usage
  • Plan expansion when >70% utilized
  • Document volume resize procedures
  • Related: Expanding Storage
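
A minimal Python sketch of the 70% rule above, assuming used and total capacity are already available as byte counts (for example from Longhorn's metrics); the function name and sample values are hypothetical.

```python
EXPANSION_THRESHOLD = 0.70  # plan expansion above 70% utilization, per the roadmap


def needs_expansion(used_bytes, capacity_bytes, threshold=EXPANSION_THRESHOLD):
    """Return True when utilization is strictly above the threshold."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes > threshold


GIB = 1024 ** 3
print(needs_expansion(750 * GIB, 1000 * GIB))  # True  (75% used)
print(needs_expansion(700 * GIB, 1000 * GIB))  # False (exactly 70%)
```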

⚪ Backlog

  • [ ] S3-Compatible Object Storage
  • Deploy MinIO for object storage
  • Use for: backups, media assets, logs
  • Integrate with applications (uploads, exports)

Testing & Quality

🟡 Medium Priority

  • [ ] Implement Testing Framework
  • Add unit tests to all Next.js apps (Vitest)
  • Integration tests for API routes
  • E2E tests for critical flows (Playwright)
  • Related: Testing Standards

  • [ ] Pre-deployment Validation

  • Automated linting in CI/CD
  • Build verification before deploy
  • Database migration dry-run tests

✅ Recently Completed

  • [x] Automated PR Review (2025-12-01)
  • GitHub Actions workflow with Claude API integration
  • Runs on all PRs to main branch
  • Reviews for type safety, security, React patterns, performance
  • Related: Automated PR Review docs

⚪ Backlog

  • [ ] Load Testing
  • Establish baseline performance metrics
  • Test application behavior under load (k6, Locust)
  • Verify autoscaling configuration

Security

🟠 High Priority

  • [ ] Security Audit
  • Review exposed services and ports
  • Audit Supabase RLS policies
  • Check for hardcoded secrets
  • Validate authentication flows

  • [ ] Automated Security Scanning

  • Add container vulnerability scanning (Trivy)
  • Scan dependencies for CVEs (Dependabot)
  • Set up alerts for critical vulnerabilities

🟡 Medium Priority

  • [ ] Network Security Hardening
  • Implement firewall rules per VLAN
  • Review and minimize exposed services
  • Add intrusion detection (Suricata/Snort)

✅ Recently Completed

  • [x] SSO Integration (Authentik) (2025-12-01)
  • Deployed Authentik as central identity provider
  • OAuth/OIDC for all Next.js apps
  • Forward auth for third-party apps
  • LDAP for legacy app support (Jellyfin)
  • Note: MFA support available but not enforced

How to Use This Roadmap

Updating Items

When you complete an item:

  1. Move it to "✅ Recently Completed" section with date
  2. Update related documentation
  3. Commit changes with descriptive message

When adding new items:

  1. Choose appropriate category and priority
  2. Link to related documentation
  3. Include brief implementation notes

Priority Guidelines

  • 🔴 Critical: Address immediately (security, broken functionality)
  • 🟠 High: Work on next after critical items cleared
  • 🟡 Medium: Plan for upcoming quarter
  • ⚪ Backlog: Review quarterly, prioritize as needed

Maintained By: Tower Fleet Infrastructure Team
Review Cadence: Monthly
Last Review: 2025-12-29