SubtitleAI Production Deployment Plan

Overview

Deploy SubtitleAI to k3s cluster following the same patterns as home-portal and money-tracker, with additional components for background processing.


Architecture Components

1. Web Application (Next.js)

  • Container: Next.js 16 app
  • Deployment: Standard k8s Deployment
  • Resources: 256Mi RAM, 100m CPU (like money-tracker)
  • Ports: 3000 (HTTP)
  • Features: UI for upload, job management, download

2. Worker Service (Python/Celery)

  • Container: Custom Python image with Whisper
  • Deployment: k8s Deployment (scalable)
  • Resources: 2Gi RAM, 1000m CPU (transcription is CPU-intensive)
  • Dependencies:
    • ffmpeg
    • OpenAI Whisper (~3.5GB models)
    • PyTorch
    • Supabase Python SDK
  • Volume Mount: /vault for future media library scanning

3. Poller Service (Python)

  • Container: Lightweight Python poller
  • Deployment: k8s Deployment (single replica)
  • Resources: 128Mi RAM, 50m CPU
  • Function: Polls database every 5s for pending jobs

4. Redis (Message Queue)

  • Deployment: StatefulSet, or reuse an existing Redis instance if one is available
  • Resources: 256Mi RAM, 100m CPU
  • Persistence: Optional (jobs stored in Supabase)

Deployment Workflow

Phase 1: Prerequisites ✓

  • [x] App developed in LXC 180 (dev environment)
  • [x] Worker tested in LXC 181
  • [x] Supabase schema configured (subtitleai)
  • [x] Storage buckets created (subtitleai-uploads, subtitleai-outputs)

Phase 2: Containerization

Location: LXC 180 (has Docker)

2.1 Create Dockerfiles

  • Next.js App: Standard multi-stage build (like money-tracker)
  • Worker: Python 3.11 + Whisper + dependencies
  • Poller: Lightweight Python image (shares base with worker)

2.2 Build Images Locally

# In LXC 180
cd /root/subtitleai

# Build Next.js app
docker build -t subtitleai-web:v1.0.0 -f Dockerfile.web .

# Build worker (includes poller)
docker build -t subtitleai-worker:v1.0.0 -f Dockerfile.worker ./worker

2.3 Push to Private Registry

REGISTRY="10.89.97.201:30500"

# Tag and push web app
docker tag subtitleai-web:v1.0.0 ${REGISTRY}/subtitleai-web:v1.0.0
docker push ${REGISTRY}/subtitleai-web:v1.0.0

# Tag and push worker
docker tag subtitleai-worker:v1.0.0 ${REGISTRY}/subtitleai-worker:v1.0.0
docker push ${REGISTRY}/subtitleai-worker:v1.0.0

Phase 3: Kubernetes Manifests

Location: /root/tower-fleet/manifests/apps/subtitleai/

3.1 Create Manifest Files

  • namespace.yaml - subtitleai namespace
  • web-deployment.yaml - Next.js app deployment
  • web-service.yaml - LoadBalancer service (port 80→3000)
  • web-ingress.yaml - Ingress (subtitles.internal)
  • worker-deployment.yaml - Celery worker deployment
  • poller-deployment.yaml - Database poller deployment
  • redis-statefulset.yaml - Redis for Celery
  • redis-service.yaml - ClusterIP service for Redis
  • configmap.yaml - Shared config (Redis URL, poll interval) - see the sketch after this list
  • secret.yaml - Supabase credentials (from sealed-secrets)
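
As an illustration, the shared config in configmap.yaml could be as small as the sketch below; the values mirror the worker environment in Phase 5.1, and the subtitleai-config name matches what the poller deployment references in the Lessons Learned section.

apiVersion: v1
kind: ConfigMap
metadata:
  name: subtitleai-config
  namespace: subtitleai
data:
  REDIS_URL: "redis://subtitleai-redis:6379/0"   # in-cluster Redis service (redis-service.yaml)
  POLL_INTERVAL: "5"                             # seconds between database polls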

3.2 Apply Manifests

kubectl apply -f /root/tower-fleet/manifests/apps/subtitleai/

Phase 4: Deployment Script

Location: /root/tower-fleet/scripts/deploy-subtitleai.sh

Pattern: Follow deploy-home-portal.sh structure

Key Steps:
  1. Pull latest code from git (LXC 180)
  2. Update .env.production with Supabase keys from k8s secrets
  3. Build Docker images (web + worker)
  4. Tag images with auto-incremented semver
  5. Push to private registry
  6. Update k8s deployments with new image tags
  7. Wait for rollout completion
  8. Run health checks

Versioning Strategy:
  • Web app: v1.0.0 → v1.0.1 (auto-increment)
  • Worker: Same version as web app (keep in sync)
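
A minimal sketch of what deploy-subtitleai.sh might contain is shown below; the container names (web, worker) and the hard-coded version are assumptions for illustration, not the final script.

#!/usr/bin/env bash
set -euo pipefail

REGISTRY="10.89.97.201:30500"
VERSION="v1.0.1"   # placeholder - the real script auto-increments semver

# Steps 1-5: build, tag, and push both images (Dockerfiles from Phase 2)
docker build -t ${REGISTRY}/subtitleai-web:${VERSION} -f Dockerfile.web .
docker build -t ${REGISTRY}/subtitleai-worker:${VERSION} -f Dockerfile.worker ./worker
docker push ${REGISTRY}/subtitleai-web:${VERSION}
docker push ${REGISTRY}/subtitleai-worker:${VERSION}

# Steps 6-7: roll the deployments to the new tags and wait for completion
kubectl set image deployment/subtitleai-web web=${REGISTRY}/subtitleai-web:${VERSION} -n subtitleai
kubectl set image deployment/subtitleai-worker worker=${REGISTRY}/subtitleai-worker:${VERSION} -n subtitleai
kubectl rollout status deployment/subtitleai-web -n subtitleai
kubectl rollout status deployment/subtitleai-worker -n subtitleai

# Step 8: basic health check against the ingress host
curl -fsS http://subtitles.internal/ > /dev/null && echo "Web app healthy"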

Phase 5: Initial Deployment

5.1 Environment Variables

# Web App (.env.production)
NEXT_PUBLIC_SUPABASE_URL=http://10.89.97.214:8000
NEXT_PUBLIC_SUPABASE_ANON_KEY=<from k8s secrets>

# Worker (ConfigMap/Secret)
SUPABASE_URL=http://10.89.97.214:8000
SUPABASE_SERVICE_KEY=<from k8s secrets>
REDIS_URL=redis://subtitleai-redis:6379/0
POLL_INTERVAL=5

5.2 Resource Allocation

# Web App
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }

# Worker (transcription intensive)
requests: { cpu: 1000m, memory: 2Gi }
limits: { cpu: 2000m, memory: 4Gi }

# Poller (lightweight)
requests: { cpu: 50m, memory: 128Mi }
limits: { cpu: 100m, memory: 256Mi }

# Redis
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 200m, memory: 512Mi }

5.3 Ingress Configuration

host: subtitles.internal
annotations:
  nginx.ingress.kubernetes.io/proxy-body-size: "2048m"  # 2GB file uploads
  nginx.ingress.kubernetes.io/proxy-read-timeout: "600"  # 10min for long uploads
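
Expanded into a full web-ingress.yaml, the configuration might look roughly like the sketch below; the service name, backend port, and ingress class are assumptions consistent with the Phase 3 manifest list.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: subtitleai-web
  namespace: subtitleai
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "2048m"   # 2GB file uploads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"  # 10min for long uploads
spec:
  ingressClassName: nginx              # assumption: nginx ingress controller
  rules:
    - host: subtitles.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: subtitleai-web   # web-service.yaml (port 80 → 3000)
                port:
                  number: 80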

Phase 6: Testing

6.1 Smoke Tests

  • [ ] Web UI loads at http://subtitles.internal
  • [ ] Can login with Supabase auth
  • [ ] Upload small video file
  • [ ] Worker picks up job
  • [ ] Transcription completes
  • [ ] Download SRT file

6.2 Load Testing

  • [ ] Multiple concurrent uploads
  • [ ] Worker scales (test with kubectl scale - see the command sketch after this list)
  • [ ] Large file uploads (2GB max)
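
Scaling the workers for the load test is a single command; the deployment name is an assumption matching worker-deployment.yaml.

# Scale the Celery workers up for the load test, then back down afterwards
kubectl scale deployment subtitleai-worker -n subtitleai --replicas=3
kubectl scale deployment subtitleai-worker -n subtitleai --replicas=1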

Phase 7: Monitoring & Observability

7.1 Add ServiceMonitor (Prometheus)

# Similar to home-portal-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: subtitleai-web
  namespace: subtitleai
spec:
  selector:
    matchLabels:
      app: subtitleai-web
  endpoints:
  - port: http
    path: /api/metrics

7.2 Logging

  • Use Loki for log aggregation (already in cluster)
  • Query: {namespace="subtitleai"}
  • Worker logs: Celery task output
  • Poller logs: Job polling activity
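
Plain kubectl is also enough to tail the same logs directly; the deployment names are assumptions matching the Phase 3 manifests.

# Tail worker (Celery task output) and poller (job polling activity) logs
kubectl logs -n subtitleai deploy/subtitleai-worker -f --tail=100
kubectl logs -n subtitleai deploy/subtitleai-poller -f --tail=100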

7.3 Alerts (Optional)

  • Worker pod crashes
  • High memory usage (>3.5Gi worker)
  • Failed jobs rate
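
If alerts are added, a PrometheusRule for the memory case could be sketched as follows; the container label and threshold details are assumptions.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: subtitleai-alerts
  namespace: subtitleai
spec:
  groups:
    - name: subtitleai
      rules:
        - alert: SubtitleAIWorkerHighMemory
          # container="worker" is an assumed label from the worker deployment
          expr: container_memory_working_set_bytes{namespace="subtitleai", container="worker"} > 3.5 * 1024 * 1024 * 1024
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SubtitleAI worker memory above 3.5Gi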

Volume Mounts (Future)

Media Library Access

For scanning existing media files, mount /vault into worker pods:

# worker-deployment.yaml
volumes:
  - name: media
    hostPath:
      path: /vault/subvol-101-disk-0/media
      type: Directory
volumeMounts:
  - name: media
    mountPath: /media
    readOnly: true  # Read-only for safety

Note: Requires nodeSelector or nodeAffinity to ensure pods run on nodes with /vault access.
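
A hypothetical nodeSelector for that constraint could look like this; the vault=true node label does not exist yet and would have to be applied first with kubectl label node.

# worker-deployment.yaml (pod spec) - assumes nodes with /vault are labelled vault=true
nodeSelector:
  vault: "true"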


Bazarr Integration (Future)

Option 1: API Integration

  • Bazarr at http://10.89.97.50:6767
  • Use Bazarr API to query subtitle status
  • No volume mounts needed

Option 2: File System Scanning

  • Mount same /vault path as Bazarr
  • Scan for .srt, .ass, .vtt files (see the sketch after this list)
  • Build own subtitle inventory
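
A first pass at building that inventory could be a simple find over the read-only mount from the previous section.

# List existing subtitle files under the mounted media path
find /media -type f \( -iname '*.srt' -o -iname '*.ass' -o -iname '*.vtt' \)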

Migration Plan (Dev → Prod)

Current State

  • Dev: LXC 180 (Next.js on port 3000)
  • Worker: LXC 181 (systemd services)

Transition Steps

  1. Deploy to k3s (new namespace)
  2. Test thoroughly in k3s
  3. Update DNS/ingress to point to k3s
  4. Decommission LXC dev environments (or keep for development)

Rollback Plan

  • Keep LXC 180/181 running during initial k3s deployment
  • If k8s deployment fails, traffic still on LXC
  • Use kubectl rollout undo for k8s rollbacks
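
On the k8s side, a rollback is one command per deployment; the names are assumptions matching the Phase 3 manifests.

# Revert each deployment to its previous revision
kubectl rollout undo deployment/subtitleai-web -n subtitleai
kubectl rollout undo deployment/subtitleai-worker -n subtitleai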

Checklist

Pre-Deployment

  • [ ] Review and finalize this plan
  • [ ] Create Dockerfiles (web + worker)
  • [ ] Create k8s manifests (namespace, deployments, services, ingress)
  • [ ] Create deployment script
  • [ ] Test build locally in LXC 180

Deployment

  • [ ] Apply namespace and secrets
  • [ ] Deploy Redis
  • [ ] Deploy worker and poller
  • [ ] Deploy web app
  • [ ] Configure ingress
  • [ ] Run smoke tests

Post-Deployment

  • [ ] Add monitoring/alerts
  • [ ] Document in tower-fleet repo
  • [ ] Update /root/PROJECTS.md
  • [ ] Add to home-portal dashboard (optional)

Lessons Learned (Post-Deployment)

Issue 1: TypeScript Build Errors Not Caught in Development

Problem: Next.js 16 production build (npm run build) failed with TypeScript errors that weren't caught during development (npm run dev).

Root Cause:
  • Dev mode uses lenient TypeScript checking and runtime type coercion
  • Production build enforces strict TypeScript compilation
  • Errors existed all along but only surfaced during Docker build

Specific Errors Fixed:
  1. Async params in Next.js 16: Dynamic route params are now Promise<{ id: string }> instead of sync objects
     • Fixed in: /api/jobs/[id]/route.ts, /api/jobs/[id]/retry/route.ts, /api/subtitles/[id]/download/route.ts
     • Solution: Change params type and await them: const { id } = await params
  2. Supabase foreign key joins return arrays: Joins like jobs!inner(user_id) return arrays, not single objects
     • Fixed in: /app/jobs/page.tsx, /app/jobs/[id]/page.tsx, /api/subtitles/[id]/download/route.ts
     • Solution: Use !inner hint and handle array access: const video = Array.isArray(job.videos) ? job.videos[0] : job.videos

Prevention: Run npm run build locally before deploying to catch TypeScript errors early.

Issue 2: Missing Environment Variables in Poller

Problem: Poller pod crashed on startup with "Connection refused" to Redis.

Root Cause: REDIS_URL environment variable not configured in poller deployment manifest.

Fix: Added to manifests/apps/subtitleai/poller-deployment.yaml:

env:
  - name: REDIS_URL
    valueFrom:
      configMapKeyRef:
        name: subtitleai-config
        key: REDIS_URL

Prevention: Cross-reference all os.getenv() calls in code with deployment manifest env vars.
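
A quick way to do that cross-check from the repo roots (paths are assumptions based on this plan):

# Every os.getenv() in the Python services should have a matching env entry in the manifests
grep -rn "os.getenv" /root/subtitleai/worker/
grep -rn -A3 "env:" /root/tower-fleet/manifests/apps/subtitleai/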

Issue 3: Docker Registry HTTP vs HTTPS

Problem: docker push failed with "server gave HTTP response to HTTPS client".

Root Cause: Private registry at 10.89.97.201:30500 uses HTTP, not HTTPS.

Fix: Configured Docker daemon in LXC 180:

// /etc/docker/daemon.json
{
  "insecure-registries": ["10.89.97.201:30500"]
}

Note: This is already configured in build environments, but new containers need this config.

Issue 4: Silent Poller (Not Actually an Issue)

Observation: Poller logs only showed startup messages, no polling activity.

Explanation: The poller is designed to log only when it finds pending jobs; silent operation is normal when no jobs are pending.

Verification: Manually tested the database query and confirmed the poller polls every 5s but only logs when it discovers a job.


Development Workflow

Local Development (LXC 180)

cd /root/projects/subtitleai
npm run dev                  # Web app on :3000
npx supabase status          # Local Supabase

Production Deployment (K8s Cluster)

cd /root/tower-fleet
./scripts/deploy-subtitleai.sh   # Build, push, deploy
kubectl get pods -n subtitleai   # Verify deployment

LXC 181 Status

  • Previous: Ran worker and poller via systemd for testing
  • Current: Decommissioned - poller service stopped/disabled
  • Reason: All services now running in k8s production

Next Steps

  1. Create Dockerfiles - ✅ DONE (web + worker)
  2. Build and test locally - ✅ DONE (LXC 180)
  3. Create k8s manifests - ✅ DONE (11 manifest files)
  4. Create deployment script - ✅ DONE (deploy-subtitleai.sh)
  5. Deploy to k3s - ✅ DONE (v1.0.0 live at 10.89.97.213)
  6. Test end-to-end - ✅ DONE (upload → transcription → download working)