Tower Fleet Roadmap
This document centralizes all planned improvements, enhancements, and backlog items for the Tower Fleet homelab infrastructure. Items are organized by category and priority.
Last Updated: 2025-12-29
Legend
- 🔴 Critical: Security issues, broken functionality, blocking problems
- 🟠 High: Significant improvements, important features
- 🟡 Medium: Quality of life improvements, optimizations
- ⚪ Backlog: Nice-to-have features, long-term goals
- ✅ Completed: Recently completed items (kept for reference)
- 🔄 In Progress: Currently being worked on
Infrastructure
🟠 High Priority
- [x] PgBouncer Connection Pooler for Supabase ✓ Completed 2025-12-15
- Problem: After host reboots, all pods reconnect to PostgreSQL simultaneously, exhausting max_connections (100)
- Solution: Deploy PgBouncer as connection pooler in front of Supabase PostgreSQL
- Result: Authentik connections reduced from 21-95 to 5. Post-reboot recovery now smooth.
- Fix Applied: Added `max_prepared_statements=100` for Authentik compatibility (PgBouncer 1.21+ feature)
- Docs: PgBouncer Migration
- [ ] Ingress Monitoring & Alerting
- Add Prometheus ServiceMonitor for NGINX Ingress
- Create Grafana dashboard for Ingress metrics
- Set up alerts for: 5xx errors, high latency, backend failures
- Related: Observability Standards
- [x] Backup Strategy (Velero) ✓ Completed 2025-12-16, Updated 2025-12-29
- Deployed Velero v1.17.1 with CSI snapshot support and Data Mover
- MinIO S3 gateway on NFS (`/vault/subvol-101-disk-0/k8s-backups/minio`)
- Daily backups at 3 AM (7-day retention), weekly backups on Sunday at 4 AM (28-day retention)
- Configured `prepareQueueLength` for controlled data uploads (v1.17 feature)
- Docs: Disaster Recovery Section 3
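The backup schedule in the Velero item above can be written down as cron expressions and retention TTLs. The following is a minimal Python sketch, not the deployed configuration: the schedule names `daily` and `weekly` are hypothetical, the cron strings assume standard five-field cron syntax (where `0` in the last field means Sunday), and the source does not say which time zone "3 AM" refers to.

```python
from datetime import timedelta


def ttl_string(days: int) -> str:
    """Express a retention period in days as an hours-based duration such as '168h0m0s'."""
    hours = int(timedelta(days=days).total_seconds() // 3600)
    return f"{hours}h0m0s"


# Hypothetical schedule names; times and retention periods come from the roadmap item.
SCHEDULES = {
    "daily": {"cron": "0 3 * * *", "ttl": ttl_string(7)},    # every day at 03:00, kept 7 days
    "weekly": {"cron": "0 4 * * 0", "ttl": ttl_string(28)},  # Sundays at 04:00, kept 28 days
}

if __name__ == "__main__":
    for name, spec in SCHEDULES.items():
        print(f"{name}: schedule={spec['cron']!r} ttl={spec['ttl']}")
```

Velero Schedule resources take a cron expression and a TTL in this duration form, so values like these could be compared against whichever manifests define the real schedules.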
🟡 Medium Priority
- [x] Longhorn Backups ✓ Completed 2025-12-16 (via Velero CSI)
- Velero CSI snapshots replace standalone Longhorn backup config
- VolumeSnapshotClass `longhorn-snapshot` with `velero.io/csi-volumesnapshot-class` label
- PVC snapshots automatically included in Velero scheduled backups
- Related: Disaster Recovery Section 3
- [ ] Add Worker Nodes
- Evaluate cluster resource usage
- Add 4th worker node if needed for HA
- Document node addition procedure
- Related: Cluster Setup
- [ ] Pod Autoscaling (HPA)
- Install metrics-server
- Configure HorizontalPodAutoscalers for apps
- Set up CPU/memory-based scaling policies
- Test scaling behavior under load
- [ ] GitOps with ArgoCD
- ~~Choose GitOps tool (ArgoCD vs Flux)~~ ArgoCD selected
- Install ArgoCD and expose UI at argocd.bogocat.com
- Migrate existing apps to ArgoCD Applications
- Configure automated sync with self-heal
- Add Authentik SSO integration
- Design Doc: ArgoCD Implementation Plan
- Related: Production Deployment
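For the Pod Autoscaling (HPA) item above, it helps to know how the autoscaler turns a metric reading into a replica count. The sketch below follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. It is a simplification: it ignores stabilization windows, unready pods, and multiple metrics, and the `min_replicas` and `max_replicas` defaults are placeholder assumptions rather than values from this cluster.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # assumed placeholder bound
    max_replicas: int = 10,   # assumed placeholder bound
    tolerance: float = 0.1,   # Kubernetes default tolerance
) -> int:
    """Simplified HPA calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 3 pods averaging 90% CPU against a 60% target
    print(desired_replicas(3, 90, 60))
```

Running a few readings through this before load testing gives a rough expectation to compare against what the real autoscaler does.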
⚪ Backlog
- [ ] Supabase Stack Simplification
- Evaluate migration from PostgREST to direct PostgreSQL + Drizzle ORM
- Potential savings: ~1.4 GB RAM across prod/sandbox
- Migration effort: ~60-80 hours across all apps
- Immediate: Remove unused GoTrue, Studio, postgres-meta (~270 MB)
- New apps: Use Drizzle instead of expanding PostgREST dependency
- Evaluation: Supabase Stack Evaluation
- [ ] Velero Grafana Dashboard
- ✅ Add Prometheus ServiceMonitor for Velero metrics (`:8085/metrics`) (completed 2025-12-23, see Operations)
- ✅ Import Grafana dashboard (ID: 16829 - Velero Overview) (completed 2025-12-23, see Operations)
- [ ] Set up alerts for backup failures
- Related: Disaster Recovery
- [ ] Service Mesh (Linkerd/Istio)
- Research Linkerd vs Istio for homelab use case
- Deploy service mesh for advanced traffic management
- Implement mutual TLS between services
- Add distributed tracing
- Note: Advanced feature, evaluate ROI for homelab
✅ Recently Completed
- [x] SSL/TLS for Ingress (Let's Encrypt) (2025-12-06)
- Deployed cert-manager to cluster
- Configured `letsencrypt-prod` and `letsencrypt-staging` ClusterIssuers
- Created wildcard certificate `wildcard-bogocat-tls` for `*.bogocat.com`
- All apps now use HTTPS via VPS reverse proxy
- [x] Authentik SSO (2025-12-01 - 2025-12-10)
- Deployed Authentik to K8s (`authentik` namespace)
- Integrated with all Next.js apps via NextAuth (home-portal, money-tracker, trip-planner, tcg)
- Configured forward auth outposts for arr-stack apps (bazarr, deluge, jellyseerr, overseerr, prowlarr, radarr, sabnzbd, sonarr, lidarr, ersatztv)
- LDAP integration for Jellyfin
- Related: Authentik SSO docs, Authentik Auth Pattern
- [x] VPS Reverse Proxy (Caddy) (2025-12-01)
- Configured Caddy on VPS for external access
- All `*.bogocat.com` domains route through VPS to homelab
- Automatic HTTPS with Let's Encrypt
- Config: `/root/tower-fleet/manifests/vps/Caddyfile.production`
- Related: VPS Reverse Proxy docs
- [x] DNS Hostname Standardization (2025-12-01)
- Migrated from `.internal` to `*.bogocat.com` for all production apps
- OPNsense DNS overrides for internal direct access
- Standardized domain pattern established
- [x] NGINX Ingress Controller (2025-11-19)
- Deployed ingress-nginx with MetalLB LoadBalancer
- Migrated home-portal, money-tracker, trip-planner to Ingress
- Configured `.internal` domain routing via OPNsense DNS
- Freed 3 MetalLB IPs (10.89.97.211-213)
- [x] arr-stack Kubernetes Migration (2025-12-08)
- Migrated arr-stack services to K8s with Authentik forward auth
- Services: Sonarr, Radarr, Lidarr, Bazarr, Prowlarr, SABnzbd, Deluge, Overseerr, Jellyseerr, ErsatzTV
- Jellyfin remains on VM 100 (GPU passthrough requirement)
- Note: VPN-dependent services (Deluge) may still use VM 100 Gluetun
Networking
🟡 Medium Priority
- [ ] VLANs for Network Segmentation
- Segment: Management, Services, IoT, Guest
- Configure VLAN tagging in OPNsense
- Update Proxmox network bridges
- Implement firewall rules between VLANs
- Related: Network Infrastructure
- [ ] WireGuard/Tailscale VPN for Remote Access
- Evaluate WireGuard vs Tailscale for remote access
- Configure client access to homelab
- Set up split-tunnel routing
- Document connection procedures
- [ ] Network Monitoring (Prometheus Exporters)
- Deploy SNMP exporter for OPNsense
- Add node exporters to key infrastructure
- Create Grafana dashboard for network metrics
- Set up bandwidth alerts
⚪ Backlog
- [ ] Bandwidth QoS Policies
- Configure traffic shaping in OPNsense
- Prioritize: Interactive (SSH, web) > Streaming > Downloads
- Test QoS under load
Applications
🔄 In Progress
- [ ] RMS: Next.js Migration
- ✅ Migrated to Next.js 16 with Supabase SSR
- ✅ Using shared K8s Supabase with `rms` schema
- ✅ Authentication middleware implemented
- 🔄 Feature development (recipe generation, pantry management)
- [ ] Deploy to K8s with Ingress at `rms.bogocat.com`
- Related: RMS docs
🟡 Medium Priority
- [ ] Money Tracker: Advanced Analytics
- Add spending trends visualization
- Create budget vs actual comparisons
- Implement forecasting based on historical data
- Export reports (PDF, CSV)
- [ ] Home Portal: Real-time Status
- Add service health checks (ping/API)
- Show online/offline status for each service
- Display resource usage (from Prometheus)
- Add uptime tracking
- [ ] Trip Planner: Budget Tracking
- Add budget database schema
- Create budget tracking UI
- Integrate with itinerary planning
- Related: Trip Planner docs
- [ ] SubtitleAI: Phase 2 Enhancements
- LLM enhancement service for cleanup/translation (OpenAI GPT-4o)
- Additional output formats (VTT, ASS)
- Bazarr integration for media library batch processing
- Whisper model selection (base/small/medium)
- Worker auto-scaling
- Related: SubtitleAI docs
⚪ Backlog (New Application Ideas)
- [ ] BrainLearn - Learning Platform
- Spaced repetition learning system
- AI-powered content generation
- Progress tracking
- Location: `/root/projects/brainlearn` (scaffolded)
- [ ] ReplyFlow - Communication Manager
- Email/message queue management
- AI-assisted responses
- Location: `/root/projects/replyflow` (scaffolded)
✅ Recently Completed
- [x] SubtitleAI (2025-11-20)
- Deployed to K8s at `subtitles.bogocat.com`
- Features: Whisper transcription, multi-language (12 languages), SRT output
- Job queue with Celery + Redis
- Prometheus metrics (45 exported)
- Related: SubtitleAI docs
- [x] TCG (Trading Card Game) (2025-12-10)
- Deployed to K8s at `tcg.bogocat.com`
- Authentik SSO integration
- Supabase backend with `tcg` schema
- [x] OtterWiki (2025-12-01)
- Deployed to K8s at `otterwiki.bogocat.com`
- Syncs from `/root/tower-fleet/docs/`
- Authentik forward auth protection
- Serves as primary documentation site
- [x] Immich Photo Management (2025-12-05)
- Deployed to K8s at `photos.bogocat.com`
- Photo/video backup and organization
- ML-powered search and face recognition
- [x] Pelican Game Server Manager (2025-12-08)
- Deployed to K8s at `pelican.bogocat.com`
- Game server provisioning and management
- Authentik forward auth protection
- Related: Pelican docs
- [x] ROMM Game Library (2025-12-08)
- Deployed to K8s at `romm.bogocat.com`
- ROM management and organization
- Authentik forward auth protection
- [x] ErsatzTV (2025-12-10)
- Deployed to K8s at `ersatztv.bogocat.com`
- IPTV stream generation from media library
- Authentik forward auth protection
Operations
🟠 High Priority
- [ ] Automated Backup Verification
- Schedule quarterly disaster recovery tests
- Automate backup integrity checks
- Document recovery procedures for each app
- Related: Disaster Recovery
- [x] Verify Backups Intent ✓ Completed 2025-12-29
- `/intents:verify-backups` slash command
- Check Velero backup status (success/failed/partial)
- Verify BSL connectivity (MinIO, B2)
- Report backup freshness, size, coverage
- Related: Disaster Recovery
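The status and freshness checks listed under the Verify Backups Intent above could be summarized with logic along these lines. This is a hedged sketch, not the actual implementation of the slash command: the input shape (a list of dicts with `name`, `phase`, and `completed`) is hypothetical, the 26-hour freshness window is an assumption chosen to fit a daily schedule, and the phase names are the ones Velero reports in a Backup's status as far as I know.

```python
from datetime import datetime, timedelta, timezone

# Phases treated as outright failures; anything unrecognized is counted as "other".
FAILED_PHASES = {"Failed", "FailedValidation"}


def summarize_backups(backups, now, max_age=timedelta(hours=26)):
    """Count backups by outcome and report whether the newest success is recent enough."""
    counts = {"success": 0, "partial": 0, "failed": 0, "other": 0}
    newest_success = None
    for backup in backups:
        phase = backup["phase"]
        if phase == "Completed":
            counts["success"] += 1
            finished = backup["completed"]
            if newest_success is None or finished > newest_success:
                newest_success = finished
        elif phase == "PartiallyFailed":
            counts["partial"] += 1
        elif phase in FAILED_PHASES:
            counts["failed"] += 1
        else:
            counts["other"] += 1  # e.g. still in progress
    fresh = newest_success is not None and now - newest_success <= max_age
    return {"counts": counts, "newest_success": newest_success, "fresh": fresh}


if __name__ == "__main__":
    now = datetime(2025, 12, 29, 12, 0, tzinfo=timezone.utc)
    demo = [{"name": "demo-daily", "phase": "Completed",
             "completed": datetime(2025, 12, 29, 3, 10, tzinfo=timezone.utc)}]
    print(summarize_backups(demo, now))
```

Keeping the summary as a pure function over already-parsed data means it can be tested without a cluster; fetching the backup list and checking storage connectivity would sit in separate steps.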
🟡 Medium Priority
- [ ] Centralized Secret Management
- Evaluate: Sealed Secrets (current) vs Vault vs External Secrets Operator
- Document secret rotation procedures
- Implement automated secret rotation for critical services
- [ ] Log Retention Policies
- Configure log retention in Loki (currently 30 days default)
- Archive important logs to `/vault/logs/`
- Implement log cleanup automation
- Related: Loki Operations
- [ ] Alert Tuning
- Review and reduce alert noise
- Fine-tune thresholds based on actual usage
- Add runbooks for each alert
- Related: Alerting Guide
⚪ Backlog
- [ ] Infrastructure as Code (OpenTofu/Terraform)
- Codify Proxmox infrastructure (VMs, LXCs, networking)
- Version control infrastructure changes
- Enable reproducible deployments
- [ ] Intent System: Learning Loop
- Add outcome verification to intents (success/failure checks)
- Capture execution context (before/after state, commands, logs)
- Anomaly detection rules per intent
- Auto-draft incidents from anomalies
- Learning feedback: suggest policy/intent updates from patterns
- Design Doc: Intent Learning Loop
✅ Recently Completed
- [x] Velero v1.17 Upgrade & Data Mover Tuning (2025-12-29)
- Upgraded from v1.16.0 to v1.17.1
- Configured `prepareQueueLength=6` to prevent Longhorn overload
- Backup success rate improved from 45% (10/22 PVCs) to 77% (17/22 PVCs)
- Excluded NFS volumes (immich-library, romm-library-nfs) from CSI snapshots
- Related: Disaster Recovery, Velero Manifests
- [x] Incident Management Process (2025-12-23)
- Created incident severity levels (P0-P3)
- Incident report template and filing process
- Incident archive at `docs/incidents/`
- Related: Incident Management
- [x] Velero Grafana Dashboard (2025-12-23)
- Added ServiceMonitor for Velero metrics
- Imported Grafana dashboard (ID 16829)
- Related: Velero Manifests
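The success-rate figures quoted for the Velero v1.17 upgrade above follow directly from the PVC counts. A small sketch of the arithmetic, using only the numbers given in that item (10 of 22 before, 17 of 22 after); the function name is illustrative:

```python
def success_rate(succeeded: int, total: int) -> float:
    """Return the share of successful PVC backups as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * succeeded / total


if __name__ == "__main__":
    before = success_rate(10, 22)
    after = success_rate(17, 22)
    # prints "45% -> 77% (+31.8 points), 5 PVCs still failing"
    print(f"{before:.0f}% -> {after:.0f}% (+{after - before:.1f} points), {22 - 17} PVCs still failing")
```

Tracking the count of failing PVCs alongside the percentage makes it easier to see how much work remains after each tuning change.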
Observability
🟠 High Priority
- [ ] Distributed Tracing (Phase 2)
- Tools: Jaeger, Zipkin, or Grafana Tempo
- Add trace IDs to application logs
- Instrument HTTP clients with tracing
- Add tracing headers to service-to-service calls
- Create trace visualization dashboards
- Related: Observability Standards
🟡 Medium Priority
- [ ] Application Performance Monitoring (APM)
- Add performance profiling to Next.js apps
- Track slow API routes
- Monitor database query performance
- Set up alerts for performance regressions
- [ ] Cost Tracking
- Track infrastructure costs (electricity, hardware depreciation)
- Monitor Kubernetes resource usage per app
- Identify optimization opportunities
⚪ Backlog
- [ ] Synthetic Monitoring
- Set up periodic health checks for all services
- Monitor from external perspective (uptime)
- Alert on service degradation before users notice
Documentation
🟡 Medium Priority
- [ ] Beginner Guide to Contributions
- Step-by-step guide for tackling roadmap items using Claude Code
- Explain how to read and understand project context (CLAUDE.md, PROJECTS.md)
- Document common workflows: picking an item, understanding scope, implementing, testing
- Include examples of good first contributions
- Cover the code review process and getting changes merged
- [ ] Video Tutorials
- Record walkthrough of common operations
- Create onboarding video for new contributors
- Document complex procedures visually
- [ ] Architecture Diagrams
- Create visual network topology
- Add Kubernetes architecture diagram
- Document data flow between services
- Use Mermaid or draw.io
⚪ Backlog
- [ ] OtterWiki Bidirectional Sync
- Enable editing docs directly in OtterWiki UI
- Changes push back to tower-fleet/docs
- Options: Git hooks, sidecar watcher, or GitHub webhooks
- Handle merge conflicts gracefully
- Design doc: otterwiki-bidirectional-sync.md
- Current: Edit in repo, wiki is read-only display (works fine)
- [ ] API Documentation
- Document internal APIs for each app
- Add OpenAPI/Swagger specs
- Generate API reference docs
- [ ] Runbook Library
- Expand troubleshooting guides
- Add step-by-step incident response procedures
- Document escalation paths
Storage
🟡 Medium Priority
- [ ] NFS StorageClass for Large Files
- Create NFS-backed StorageClass in Kubernetes
- Use for media files, large datasets
- Mount `/vault` NFS shares in cluster
- Related: ADR-002
- [ ] Volume Expansion Planning
- Monitor Longhorn capacity usage
- Plan expansion when >70% utilized
- Document volume resize procedures
- Related: Expanding Storage
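The ">70% utilized" rule in the Volume Expansion Planning item above can be expressed as a simple check. This is an illustrative sketch only: the volume names and sizes in the example are made up, and in practice the numbers would come from wherever Longhorn capacity is monitored.

```python
def volumes_needing_expansion(usage: dict[str, tuple[int, int]], threshold_pct: int = 70) -> list[str]:
    """Return names of volumes whose used/capacity ratio exceeds threshold_pct.

    usage maps a volume name to (used_bytes, capacity_bytes).
    Integer arithmetic avoids floating-point edge cases at exactly 70%.
    """
    flagged = []
    for name, (used, capacity) in usage.items():
        if capacity <= 0:
            raise ValueError(f"{name}: capacity must be positive")
        if used * 100 > capacity * threshold_pct:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical volumes for illustration
    example = {
        "pvc-example-a": (8 * GIB, 10 * GIB),   # 80% used
        "pvc-example-b": (7 * GIB, 10 * GIB),   # exactly 70%, not over
        "pvc-example-c": (2 * GIB, 20 * GIB),   # 10% used
    }
    print(volumes_needing_expansion(example))
```

A volume sitting exactly at the threshold is not flagged, matching the "greater than 70%" wording of the item.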
⚪ Backlog
- [ ] S3-Compatible Object Storage
- Deploy MinIO for object storage
- Use for: backups, media assets, logs
- Integrate with applications (uploads, exports)
Testing & Quality
🟡 Medium Priority
- [ ] Implement Testing Framework
- Add unit tests to all Next.js apps (Vitest)
- Integration tests for API routes
- E2E tests for critical flows (Playwright)
- Related: Testing Standards
- [ ] Pre-deployment Validation
- Automated linting in CI/CD
- Build verification before deploy
- Database migration dry-run tests
✅ Recently Completed
- [x] Automated PR Review (2025-12-01)
- GitHub Actions workflow with Claude API integration
- Runs on all PRs to main branch
- Reviews for type safety, security, React patterns, performance
- Related: Automated PR Review docs
⚪ Backlog
- [ ] Load Testing
- Establish baseline performance metrics
- Test application behavior under load (k6, Locust)
- Verify autoscaling configuration
Security
🟠 High Priority
- [ ] Security Audit
- Review exposed services and ports
- Audit Supabase RLS policies
- Check for hardcoded secrets
- Validate authentication flows
- [ ] Automated Security Scanning
- Add container vulnerability scanning (Trivy)
- Scan dependencies for CVEs (Dependabot)
- Set up alerts for critical vulnerabilities
🟡 Medium Priority
- [ ] Network Security Hardening
- Implement firewall rules per VLAN
- Review and minimize exposed services
- Add intrusion detection (Suricata/Snort)
✅ Recently Completed
- [x] SSO Integration (Authentik) (2025-12-01)
- Deployed Authentik as central identity provider
- OAuth/OIDC for all Next.js apps
- Forward auth for third-party apps
- LDAP for legacy app support (Jellyfin)
- Note: MFA support available but not enforced
How to Use This Roadmap
Updating Items
When you complete an item:
1. Move it to the "✅ Recently Completed" section with the completion date
2. Update related documentation
3. Commit changes with a descriptive message
When adding new items:
1. Choose the appropriate category and priority
2. Link to related documentation
3. Include brief implementation notes
Priority Guidelines
- 🔴 Critical: Address immediately (security, broken functionality)
- 🟠 High: Work on next after critical items cleared
- 🟡 Medium: Plan for upcoming quarter
- ⚪ Backlog: Review quarterly, prioritize as needed
Related Files
- Documentation Index - All documentation links
Maintained By: Tower Fleet Infrastructure Team
Review Cadence: Monthly
Last Review: 2025-12-29