SubtitleAI Production Deployment Plan¶
Overview¶
Deploy SubtitleAI to the k3s cluster, following the same patterns as home-portal and money-tracker, with additional components for background processing.
Architecture Components¶
1. Web Application (Next.js)¶
- Container: Next.js 16 app
- Deployment: Standard k8s Deployment
- Resources: 256Mi RAM, 100m CPU (like money-tracker)
- Ports: 3000 (HTTP)
- Features: UI for upload, job management, download
2. Worker Service (Python/Celery)¶
- Container: Custom Python image with Whisper
- Deployment: k8s Deployment (scalable)
- Resources: 2Gi RAM, 1000m CPU (transcription is CPU-intensive)
- Dependencies:
- ffmpeg
- OpenAI Whisper (~3.5GB models)
- PyTorch
- Supabase Python SDK
- Volume Mount: /vault for future media library scanning
3. Poller Service (Python)¶
- Container: Lightweight Python poller
- Deployment: k8s Deployment (single replica)
- Resources: 128Mi RAM, 50m CPU
- Function: Polls database every 5s for pending jobs
4. Redis (Message Queue)¶
- Deployment: StatefulSet or use existing Redis if available
- Resources: 256Mi RAM, 100m CPU
- Persistence: Optional (jobs stored in Supabase)
Deployment Workflow¶
Phase 1: Prerequisites ✓¶
- [x] App developed in LXC 180 (dev environment)
- [x] Worker tested in LXC 181
- [x] Supabase schema configured (subtitleai)
- [x] Storage buckets created (subtitleai-uploads, subtitleai-outputs)
Phase 2: Containerization¶
Location: LXC 180 (has Docker)
2.1 Create Dockerfiles¶
- Next.js App: Standard multi-stage build (like money-tracker)
- Worker: Python 3.11 + Whisper + dependencies
- Poller: Lightweight Python image (shares base with worker)
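A minimal sketch of what Dockerfile.web could look like, assuming the app uses Next.js standalone output (output: "standalone" in next.config); treat it as a starting point rather than the actual file:
# Dockerfile.web (sketch; assumes Next.js standalone output)
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

FROM node:20-alpine AS build
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:20-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Standalone output bundles server.js plus a pruned node_modules
COPY --from=build /app/.next/standalone ./
COPY --from=build /app/.next/static ./.next/static
COPY --from=build /app/public ./public
EXPOSE 3000
CMD ["node", "server.js"]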
2.2 Build Images Locally¶
# In LXC 180
cd /root/subtitleai
# Build Next.js app
docker build -t subtitleai-web:v1.0.0 -f Dockerfile.web .
# Build worker (includes poller)
docker build -t subtitleai-worker:v1.0.0 -f Dockerfile.worker ./worker
2.3 Push to Private Registry¶
REGISTRY="10.89.97.201:30500"
# Tag and push web app
docker tag subtitleai-web:v1.0.0 ${REGISTRY}/subtitleai-web:v1.0.0
docker push ${REGISTRY}/subtitleai-web:v1.0.0
# Tag and push worker
docker tag subtitleai-worker:v1.0.0 ${REGISTRY}/subtitleai-worker:v1.0.0
docker push ${REGISTRY}/subtitleai-worker:v1.0.0
Phase 3: Kubernetes Manifests¶
Location: /root/tower-fleet/manifests/apps/subtitleai/
3.1 Create Manifest Files¶
- namespace.yaml - subtitleai namespace
- web-deployment.yaml - Next.js app deployment
- web-service.yaml - LoadBalancer service (port 80→3000)
- web-ingress.yaml - Ingress (subtitles.internal)
- worker-deployment.yaml - Celery worker deployment
- poller-deployment.yaml - Database poller deployment
- redis-statefulset.yaml - Redis for Celery
- redis-service.yaml - ClusterIP service for Redis
- configmap.yaml - Shared config (Redis URL, poll interval)
- secret.yaml - Supabase credentials (from sealed-secrets)
3.2 Apply Manifests¶
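A minimal apply sequence, assuming the manifest directory above (namespace and secrets go first so the deployments can reference them):
# From /root/tower-fleet
kubectl apply -f manifests/apps/subtitleai/namespace.yaml
kubectl apply -f manifests/apps/subtitleai/secret.yaml
kubectl apply -f manifests/apps/subtitleai/
kubectl get pods -n subtitleai -w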
Phase 4: Deployment Script¶
Location: /root/tower-fleet/scripts/deploy-subtitleai.sh
Pattern: Follow deploy-home-portal.sh structure
Key Steps:
1. Pull latest code from git (LXC 180)
2. Update .env.production with Supabase keys from k8s secrets
3. Build Docker images (web + worker)
4. Tag images with auto-incremented semver
5. Push to private registry
6. Update k8s deployments with new image tags
7. Wait for rollout completion
8. Run health checks
Versioning Strategy:
- Web app: v1.0.0 → v1.0.1 (auto-increment)
- Worker: Same version as web app (keep in sync)
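A condensed sketch of the script, following the key steps and versioning strategy above; the deployment and container names (subtitleai-web, subtitleai-worker, subtitleai-poller) are assumptions to be matched against the manifests, and the real script should auto-increment the version like deploy-home-portal.sh:
#!/usr/bin/env bash
set -euo pipefail

REGISTRY="10.89.97.201:30500"
VERSION="${1:?usage: deploy-subtitleai.sh <version>}"   # auto-incremented in the real script

# 1-2. Pull latest code (env refresh from k8s secrets omitted in this sketch)
cd /root/subtitleai && git pull

# 3-5. Build, tag, push
docker build -t ${REGISTRY}/subtitleai-web:${VERSION} -f Dockerfile.web .
docker build -t ${REGISTRY}/subtitleai-worker:${VERSION} -f Dockerfile.worker ./worker
docker push ${REGISTRY}/subtitleai-web:${VERSION}
docker push ${REGISTRY}/subtitleai-worker:${VERSION}

# 6-7. Roll out new images and wait (names assumed)
kubectl -n subtitleai set image deployment/subtitleai-web web=${REGISTRY}/subtitleai-web:${VERSION}
kubectl -n subtitleai set image deployment/subtitleai-worker worker=${REGISTRY}/subtitleai-worker:${VERSION}
kubectl -n subtitleai set image deployment/subtitleai-poller poller=${REGISTRY}/subtitleai-worker:${VERSION}
kubectl -n subtitleai rollout status deployment/subtitleai-web --timeout=300s
kubectl -n subtitleai rollout status deployment/subtitleai-worker --timeout=300s

# 8. Basic health check
curl -fsS http://subtitles.internal/ > /dev/null && echo "web health check OK"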
Phase 5: Initial Deployment¶
5.1 Environment Variables¶
# Web App (.env.production)
NEXT_PUBLIC_SUPABASE_URL=http://10.89.97.214:8000
NEXT_PUBLIC_SUPABASE_ANON_KEY=<from k8s secrets>
# Worker (ConfigMap/Secret)
SUPABASE_URL=http://10.89.97.214:8000
SUPABASE_SERVICE_KEY=<from k8s secrets>
REDIS_URL=redis://subtitleai-redis:6379/0
POLL_INTERVAL=5
5.2 Resource Allocation¶
# Web App
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }
# Worker (transcription intensive)
requests: { cpu: 1000m, memory: 2Gi }
limits: { cpu: 2000m, memory: 4Gi }
# Poller (lightweight)
requests: { cpu: 50m, memory: 128Mi }
limits: { cpu: 100m, memory: 256Mi }
# Redis
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 200m, memory: 512Mi }
5.3 Ingress Configuration¶
host: subtitles.internal
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "2048m"    # 2GB file uploads
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"   # 10 min for long uploads
Phase 6: Testing¶
6.1 Smoke Tests¶
- [ ] Web UI loads at http://subtitles.internal
- [ ] Can log in with Supabase auth
- [ ] Upload small video file
- [ ] Worker picks up job
- [ ] Transcription completes
- [ ] Download SRT file
6.2 Load Testing¶
- [ ] Multiple concurrent uploads
- [ ] Worker scales (test with kubectl scale; see the example below)
- [ ] Large file uploads (2GB max)
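The scaling test can be as simple as the following (deployment name assumed from the manifests above):
kubectl -n subtitleai scale deployment/subtitleai-worker --replicas=3
kubectl -n subtitleai get pods -w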
Phase 7: Monitoring & Observability¶
7.1 Add ServiceMonitor (Prometheus)¶
# Similar to home-portal-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: subtitleai-web
  namespace: subtitleai
spec:
  selector:
    matchLabels:
      app: subtitleai-web
  endpoints:
    - port: http
      path: /api/metrics
7.2 Logging¶
- Use Loki for log aggregation (already in cluster)
- Query: {namespace="subtitleai"}
- Worker logs: Celery task output
- Poller logs: Job polling activity
7.3 Alerts (Optional)¶
- Worker pod crashes
- High memory usage (>3.5Gi worker)
- Failed jobs rate
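A hedged sketch of matching PrometheusRule entries, assuming the standard cAdvisor and kube-state-metrics metric names exposed by the cluster's Prometheus stack; the pod name pattern and thresholds are assumptions:
# subtitleai-alerts.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: subtitleai-alerts
  namespace: subtitleai
spec:
  groups:
    - name: subtitleai
      rules:
        - alert: SubtitleAIWorkerRestarting
          expr: increase(kube_pod_container_status_restarts_total{namespace="subtitleai"}[15m]) > 0
          for: 5m
          labels:
            severity: warning
        - alert: SubtitleAIWorkerHighMemory
          expr: container_memory_working_set_bytes{namespace="subtitleai", pod=~"subtitleai-worker.*", container!=""} > 3.5 * 1024 * 1024 * 1024
          for: 10m
          labels:
            severity: warning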
Volume Mounts (Future)¶
Media Library Access¶
For scanning existing media files, mount /vault into worker pods:
# worker-deployment.yaml
volumes:
  - name: media
    hostPath:
      path: /vault/subvol-101-disk-0/media
      type: Directory
volumeMounts:
  - name: media
    mountPath: /media
    readOnly: true  # Read-only for safety
Note: Requires nodeSelector or nodeAffinity to ensure pods run on nodes with /vault access.
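For example, with a hand-applied node label (the label key here is hypothetical):
# Label the node that exposes /vault, then pin the worker to it
# kubectl label node <storage-node> storage/vault=present
nodeSelector:
  storage/vault: present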
Bazarr Integration (Future)¶
Option 1: API Integration¶
- Bazarr at http://10.89.97.50:6767
- Use the Bazarr API to query subtitle status
- No volume mounts needed
Option 2: File System Scanning¶
- Mount the same /vault path as Bazarr
- Scan for .srt, .ass, and .vtt files (see the sketch below)
- Build our own subtitle inventory
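A first scan pass could be as simple as the following, assuming the /media mount path from the volume section above:
# Inventory existing subtitle files under the mounted media path
find /media -type f \( -iname "*.srt" -o -iname "*.ass" -o -iname "*.vtt" \)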
Migration Plan (Dev → Prod)¶
Current State¶
- Dev: LXC 180 (Next.js on port 3000)
- Worker: LXC 181 (systemd services)
Transition Steps¶
- Deploy to k3s (new namespace)
- Test thoroughly in k3s
- Update DNS/ingress to point to k3s
- Decommission LXC dev environments (or keep for development)
Rollback Plan¶
- Keep LXC 180/181 running during initial k3s deployment
- If k8s deployment fails, traffic still on LXC
- Use kubectl rollout undo for k8s rollbacks (example below)
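Rollback sketch (deployment names assumed to match the manifests):
kubectl -n subtitleai rollout undo deployment/subtitleai-web
kubectl -n subtitleai rollout undo deployment/subtitleai-worker
kubectl -n subtitleai rollout status deployment/subtitleai-web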
Checklist¶
Pre-Deployment¶
- [ ] Review and finalize this plan
- [ ] Create Dockerfiles (web + worker)
- [ ] Create k8s manifests (namespace, deployments, services, ingress)
- [ ] Create deployment script
- [ ] Test build locally in LXC 180
Deployment¶
- [ ] Apply namespace and secrets
- [ ] Deploy Redis
- [ ] Deploy worker and poller
- [ ] Deploy web app
- [ ] Configure ingress
- [ ] Run smoke tests
Post-Deployment¶
- [ ] Add monitoring/alerts
- [ ] Document in tower-fleet repo
- [ ] Update /root/PROJECTS.md
- [ ] Add to home-portal dashboard (optional)
Lessons Learned (Post-Deployment)¶
Issue 1: TypeScript Build Errors Not Caught in Development¶
Problem: Next.js 16 production build (npm run build) failed with TypeScript errors that weren't caught during development (npm run dev).
Root Cause:
- Dev mode uses lenient TypeScript checking and runtime type coercion
- The production build enforces strict TypeScript compilation
- The errors existed all along but only surfaced during the Docker build
Specific Errors Fixed:
1. Async params in Next.js 16: Dynamic route params are now Promise<{ id: string }> instead of sync objects
   - Fixed in: /api/jobs/[id]/route.ts, /api/jobs/[id]/retry/route.ts, /api/subtitles/[id]/download/route.ts
   - Solution: Change the params type and await it: const { id } = await params
2. Supabase foreign key joins return arrays: Joins like jobs!inner(user_id) return arrays, not single objects
   - Fixed in: /app/jobs/page.tsx, /app/jobs/[id]/page.tsx, /api/subtitles/[id]/download/route.ts
   - Solution: Use the !inner hint and handle array access: const video = Array.isArray(job.videos) ? job.videos[0] : job.videos
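For reference, the async-params pattern looks roughly like this (illustrative handler, not the actual file contents):
// app/api/jobs/[id]/route.ts (sketch of the Next.js async-params fix)
import { NextResponse } from "next/server";

export async function GET(
  _req: Request,
  { params }: { params: Promise<{ id: string }> }  // params is now a Promise
) {
  const { id } = await params;                      // await before using it
  return NextResponse.json({ id });
}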
Prevention: Run npm run build locally before deploying to catch TypeScript errors early.
Issue 2: Missing Environment Variables in Poller¶
Problem: Poller pod crashed on startup with "Connection refused" to Redis.
Root Cause: REDIS_URL environment variable not configured in poller deployment manifest.
Fix: Added to manifests/apps/subtitleai/poller-deployment.yaml:
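The missing entries looked roughly like this; the ConfigMap name and the decision to source REDIS_URL from it are assumptions based on configmap.yaml above:
# poller-deployment.yaml (env section, sketch)
env:
  - name: REDIS_URL
    valueFrom:
      configMapKeyRef:
        name: subtitleai-config   # assumed ConfigMap name
        key: REDIS_URL
  - name: POLL_INTERVAL
    valueFrom:
      configMapKeyRef:
        name: subtitleai-config
        key: POLL_INTERVAL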
Prevention: Cross-reference all os.getenv() calls in code with deployment manifest env vars.
Issue 3: Docker Registry HTTP vs HTTPS¶
Problem: docker push failed with "server gave HTTP response to HTTPS client".
Root Cause: Private registry at 10.89.97.201:30500 uses HTTP, not HTTPS.
Fix: Configured Docker daemon in LXC 180:
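The change amounts to marking the registry as insecure in the Docker daemon config (merge with any existing daemon.json settings rather than overwriting):
# In LXC 180
cat > /etc/docker/daemon.json <<'EOF'
{
  "insecure-registries": ["10.89.97.201:30500"]
}
EOF
systemctl restart docker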
Note: This is already configured in build environments, but new containers need this config.
Issue 4: Silent Poller (Not Actually an Issue)¶
Observation: Poller logs only showed startup messages, no polling activity.
Explanation: Poller is designed to only log when it finds pending jobs. Silent operation is normal when no jobs are pending.
Verification: Manually tested database query - confirmed poller polls every 5s but only logs on job discovery.
Development Workflow¶
Local Development (LXC 180)¶
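Roughly, using the paths and scripts referenced elsewhere in this doc:
cd /root/subtitleai
npm run dev     # dev server on port 3000
npm run build   # run before deploying to catch strict TypeScript errors (see Lessons Learned)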
Production Deployment (K8s Cluster)¶
cd /root/tower-fleet
./scripts/deploy-subtitleai.sh # Build, push, deploy
kubectl get pods -n subtitleai # Verify deployment
LXC 181 Status¶
- Previous: Ran worker and poller via systemd for testing
- Current: Decommissioned - poller service stopped/disabled
- Reason: All services now running in k8s production
Next Steps¶
- Create Dockerfiles - ✅ DONE (web + worker)
- Build and test locally - ✅ DONE (LXC 180)
- Create k8s manifests - ✅ DONE (11 manifest files)
- Create deployment script - ✅ DONE (deploy-subtitleai.sh)
- Deploy to k3s - ✅ DONE (v1.0.0 live at 10.89.97.213)
- Test end-to-end - ✅ DONE (upload → transcription → download working)