ArgoCD Implementation Plan v2

Status: Ready for Implementation
Author: Claude (iteration on v1)
Date: 2025-01-10
Previous: v1

Executive Summary

This is an updated ArgoCD implementation plan that addresses gaps in v1 and incorporates lessons learned from the 2025-01-10 post-reboot incident where multiple services remained scaled to 0 due to incomplete recovery scripts.

Key changes from v1:

  • Complete inventory of all 21 applications
  • App-of-Apps pattern for manageable scale
  • ApplicationSets for similar Next.js apps
  • Bootstrap procedure for ArgoCD itself
  • Explicit solution for post-reboot recovery
  • Gradual rollout strategy (parallel operation first)
  • Prune safeguards to prevent accidental deletion


Problem Statement (Updated)

Current State Issues

  1. Manifest drift - Production diverges from git (unchanged from v1)
  2. Incomplete recovery - Post-reboot script only covers 2 of 21 apps
  3. Stale configuration - ConfigMaps reference old IPs (e.g., Supabase 10.89.97.214 vs 10.89.97.221)
  4. No single source of truth - Mix of kubectl, deploy scripts, and ad-hoc changes
  5. Visibility gap - No dashboard showing health of all services

What Happened on 2025-01-10

Reboot occurred
  → post-reboot-recovery.sh ran
  → Only scaled up: authentik, home-portal
  → Left at 0 replicas: money-tracker, trip-planner, subtitleai (4 pods)
  → Production supabase kong/rest/gotrue/storage stayed at 0
  → home-portal couldn't reach Supabase (stale IP in ConfigMap)
  → User saw "No services yet"

With ArgoCD: Self-heal would have restored all services to their declared state within minutes.


Complete Application Inventory

User Applications (12)

| App | Namespace | Manifest Path | Priority | Notes |
|-----|-----------|---------------|----------|-------|
| home-portal | home-portal | manifests/apps/home-portal | Critical | Main dashboard |
| money-tracker | money-tracker | manifests/apps/money-tracker | High | Finance data |
| trip-planner | trip-planner | manifests/apps/trip-planner | Medium | AI travel |
| palimpsest-api | palimpsest-api | manifests/apps/palimpsest-api | Medium | RPG backend |
| palimpsest-web | palimpsest-web | manifests/apps/palimpsest-web | Medium | RPG frontend |
| tcg | tcg | manifests/apps/tcg | Low | Card game |
| notes-app | notes-app | manifests/apps/notes-app | Low | Notes |
| music-control | music-control | manifests/apps/music-control | Medium | HEOS control |
| subtitleai | subtitleai | manifests/apps/subtitleai | Low | Subtitle processing |
| vault-platform | vault-platform | manifests/apps/vault-platform | Low | Vault app |
| anythingllm | anythingllm | manifests/apps/anythingllm | Low | Local LLM |
| otterwiki | otterwiki | manifests/apps/otterwiki | Medium | Documentation |

Media/Content Applications (3)

| App | Namespace | Manifest Path | Priority |
|-----|-----------|---------------|----------|
| immich | immich | manifests/apps/immich | High |
| romm | romm | manifests/apps/romm | Low |
| pelican | pelican | manifests/apps/pelican | Low |

Infrastructure Services (6)

| Service | Namespace | Manifest Path | Managed by ArgoCD? |
|---------|-----------|---------------|--------------------|
| supabase-sandbox | supabase-sandbox | manifests/supabase-sandbox | Yes |
| supabase (prod) | supabase | manifests/supabase | Deprecate |
| authentik | authentik | manifests/authentik | Yes (careful) |
| ingress-nginx | ingress-nginx | manifests/core | Yes |
| cert-manager | cert-manager | manifests/cert-manager | Yes |
| monitoring | monitoring | manifests/monitoring | Yes |
| docker-registry | docker-registry | manifests/infrastructure | Yes |
| minio | minio | manifests/apps/minio | Yes |

Decision: Deprecate Production Supabase

The supabase namespace (production) has been partially scaled down for a while. All apps should use supabase-sandbox. Action: Migrate any remaining references to the sandbox, then delete the production namespace.


Architecture: App-of-Apps Pattern

Instead of managing 21+ Application CRDs individually, use a hierarchical approach:

argocd/
├── bootstrap/
│   └── root-app.yaml           # The ONE Application you apply manually
├── projects/
│   ├── apps.yaml               # Project for user applications
│   └── infrastructure.yaml     # Project for infra services
├── apps/
│   ├── _app-of-apps.yaml       # Parent that manages all apps below
│   ├── home-portal.yaml
│   ├── money-tracker.yaml
│   ├── trip-planner.yaml
│   └── ...
└── infrastructure/
    ├── _app-of-apps.yaml       # Parent for infra
    ├── supabase-sandbox.yaml
    ├── authentik.yaml
    └── ...

Bootstrap procedure:

# Only command needed to set up everything
kubectl apply -f argocd/bootstrap/root-app.yaml

The root app syncs the argocd/ directory, which creates all other Applications.
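
A minimal sketch of what argocd/bootstrap/root-app.yaml could look like (the repoURL matches the tower-fleet repo used elsewhere in this plan; the project and path layout are assumptions to adjust):

# argocd/bootstrap/root-app.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:jakecelentano/tower-fleet.git
    targetRevision: main
    path: argocd              # the directory containing all child Applications
    directory:
      recurse: true           # pick up apps/ and infrastructure/ subdirectories
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false            # same safety default as the child apps
      selfHeal: true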


ApplicationSets for Similar Apps

Many apps follow the same pattern (Next.js + Supabase). Use ApplicationSet to reduce duplication:

# argocd/applicationsets/nextjs-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nextjs-apps
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - name: home-portal
        priority: critical
      - name: money-tracker
        priority: high
      - name: trip-planner
        priority: medium
      - name: tcg
        priority: low
      - name: notes-app
        priority: low
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: apps
      source:
        repoURL: git@github.com:jakecelentano/tower-fleet.git
        targetRevision: main
        path: 'manifests/apps/{{name}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
      syncPolicy:
        automated:
          prune: false  # SAFETY: Don't auto-delete
          selfHeal: true
        syncOptions:
        - CreateNamespace=true
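
One note on the ApplicationSet above: the priority element is defined in the generator but not consumed by the template as written. An optional addition (an assumption, not part of the plan) would surface it as a label so it is visible and filterable in the ArgoCD UI:

  # Optional excerpt - extends the template.metadata block above
  template:
    metadata:
      name: '{{name}}'
      labels:
        app-priority: '{{priority}}'   # hypothetical label name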

Implementation Phases (Revised)

Phase 0: Prerequisites (Before Starting)

  • [ ] Resolve Supabase confusion: Migrate all apps to sandbox, document decision
  • [ ] Update all ConfigMaps with correct Supabase URL (10.89.97.221)
  • [ ] Ensure all manifests in git match current production state
  • [ ] Audit manifests for hardcoded IPs that should be service names
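
For the last two items, a sketch of what the fix could look like (the ConfigMap name and key are illustrative, and the service DNS name assumes Kong is exposed as a ClusterIP Service called supabase-kong on port 8000 in supabase-sandbox):

# Illustrative ConfigMap - prefer in-cluster service DNS over hardcoded IPs
apiVersion: v1
kind: ConfigMap
metadata:
  name: home-portal-config          # illustrative name
  namespace: home-portal
data:
  # Before (breaks whenever the IP changes):
  #   SUPABASE_URL: "http://10.89.97.221:8000"
  # After (stable across reboots and re-deploys):
  SUPABASE_URL: "http://supabase-kong.supabase-sandbox.svc.cluster.local:8000"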

Phase 1: Install ArgoCD

1.1 Install with HA manifests (recommended for self-heal reliability):

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

1.2 Wait for ArgoCD to be ready:

kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s

1.3 Expose via Ingress:

# manifests/infrastructure/argocd/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: nginx
  rules:
  - host: argocd.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: argocd-server
            port:
              number: 443
  tls:
  - hosts:
    - argocd.internal
    secretName: wildcard-bogocat-tls
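
One caveat: the ssl-passthrough annotation is only honored when the ingress-nginx controller itself runs with --enable-ssl-passthrough (off by default), and with passthrough in effect TLS is terminated by argocd-server rather than by the wildcard certificate. A sketch of where that flag lives, assuming a standard controller Deployment:

# Excerpt from the ingress-nginx controller Deployment (layout depends on how
# ingress-nginx is installed in manifests/core)
spec:
  template:
    spec:
      containers:
      - name: controller
        args:
        - /nginx-ingress-controller
        - --enable-ssl-passthrough   # required for the annotation above to work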

1.4 Get admin password and login:

# Get password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Install CLI
curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
chmod +x argocd && mv argocd /usr/local/bin/

# Login
argocd login argocd.internal --username admin --password <password> --insecure

Phase 2: Connect Repository

# Generate deploy key
ssh-keygen -t ed25519 -C "argocd@tower-fleet" -f ~/.ssh/argocd-deploy-key -N ""

# Add to GitHub (Settings → Deploy Keys, read-only)
cat ~/.ssh/argocd-deploy-key.pub

# Add to ArgoCD
argocd repo add git@github.com:jakecelentano/tower-fleet.git \
  --ssh-private-key-path ~/.ssh/argocd-deploy-key \
  --name tower-fleet
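
If the repository connection should live in git as well (instead of existing only as a one-off CLI step), ArgoCD also accepts a declarative repository Secret; a sketch is below - the private key itself must stay out of plain-text git, e.g. sealed per the secrets decision later in this plan:

# Declarative alternative to `argocd repo add` (sketch)
apiVersion: v1
kind: Secret
metadata:
  name: tower-fleet-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository   # label ArgoCD watches for
stringData:
  name: tower-fleet
  type: git
  url: git@github.com:jakecelentano/tower-fleet.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    <placeholder - contents of ~/.ssh/argocd-deploy-key, never committed in plain text>
    -----END OPENSSH PRIVATE KEY-----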

Phase 3: Create Project Structure

kubectl apply -f argocd/projects/
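
A sketch of what argocd/projects/apps.yaml could contain (the destination and resource restrictions are assumptions to tighten as needed):

# argocd/projects/apps.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: apps
  namespace: argocd
spec:
  description: User applications
  sourceRepos:
  - git@github.com:jakecelentano/tower-fleet.git
  destinations:
  - server: https://kubernetes.default.svc
    namespace: '*'              # could be narrowed to the specific app namespaces
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace             # permit Namespace objects at cluster scope (adjust to taste)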

Phase 4: Gradual Rollout (Parallel Operation)

Critical: Don't migrate everything at once. Run ArgoCD in parallel with the existing approach.

Week 1: Low-risk apps (manual sync)
  • otterwiki
  • tcg
  • notes-app
  • romm

syncPolicy: {}  # Empty = manual sync only
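
For example, a Week 1 Application for otterwiki might look like this (a sketch that mirrors the ApplicationSet template above, minus the automation):

# argocd/apps/otterwiki.yaml (sketch - manual sync only during Week 1)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: otterwiki
  namespace: argocd
spec:
  project: apps
  source:
    repoURL: git@github.com:jakecelentano/tower-fleet.git
    targetRevision: main
    path: manifests/apps/otterwiki
  destination:
    server: https://kubernetes.default.svc
    namespace: otterwiki
  syncPolicy: {}                # nothing happens until someone clicks Sync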

Week 2: Medium-risk apps (manual sync)
  • pelican
  • music-control
  • vault-platform
  • palimpsest-api
  • palimpsest-web

Week 3: Enable auto-sync on Week 1-2 apps
  • Convert to selfHeal: true, prune: false

Week 4: High-risk apps
  • subtitleai
  • trip-planner
  • anythingllm
  • immich

Week 5: Critical apps
  • money-tracker
  • home-portal

Week 6: Infrastructure
  • supabase-sandbox
  • authentik (very careful)
  • monitoring

Phase 5: Configure Sync Policies

Safe defaults for all apps:

syncPolicy:
  automated:
    prune: false      # NEVER auto-delete resources
    selfHeal: true    # DO auto-revert manual changes
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
  syncOptions:
  - CreateNamespace=true
  - PrunePropagationPolicy=foreground
  - ServerSideApply=true

Why prune: false? If someone accidentally deletes a manifest from git, ArgoCD won't delete the running resource. You'll see "OutOfSync" in the dashboard and can investigate.

Phase 6: Authentik SSO

(Same as v1, unchanged)
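
For reference, the change lands in argocd-cm as an oidc.config block; a hedged sketch, assuming an Authentik OIDC provider has been created for ArgoCD (the issuer URL and client ID below are placeholders for that hypothetical provider):

# Sketch of the argocd-cm addition for Authentik SSO (values are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  url: https://argocd.internal
  oidc.config: |
    name: Authentik
    issuer: https://authentik.internal/application/o/argocd/   # placeholder issuer
    clientID: argocd                                           # placeholder client ID
    clientSecret: $oidc.authentik.clientSecret                 # resolved from argocd-secret
    requestedScopes: ["openid", "profile", "email", "groups"]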

Phase 7: Monitoring & Alerts

Add to existing Prometheus stack:

# manifests/monitoring/argocd-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  namespaceSelector:
    matchNames:
    - argocd
  endpoints:
  - port: metrics

Alert rules:

groups:
- name: argocd
  rules:
  - alert: ArgoAppOutOfSync
    expr: argocd_app_info{sync_status!="Synced"} == 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "ArgoCD app {{ $labels.name }} is out of sync"

  - alert: ArgoAppDegraded
    expr: argocd_app_info{health_status="Degraded"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ArgoCD app {{ $labels.name }} is degraded"
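
To get these alerts into Discord (per the notification decision below), a route can be added to the existing Alertmanager configuration; a sketch, assuming the current Discord integration is a receiver named discord:

# Sketch of an Alertmanager route addition (receiver name 'discord' is an
# assumption about the existing integration)
route:
  routes:
  - match:
      alertname: ArgoAppOutOfSync
    receiver: discord
    repeat_interval: 4h
  - match:
      alertname: ArgoAppDegraded
    receiver: discord
    repeat_interval: 1h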


Bootstrap & Disaster Recovery

If ArgoCD Itself Doesn't Start After Reboot

ArgoCD is just another K8s deployment. If the cluster comes up, ArgoCD comes up with it. But if you need to recover:

# Re-apply ArgoCD installation
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Re-apply the root app (all other apps will sync automatically)
kubectl apply -f /root/tower-fleet/argocd/bootstrap/root-app.yaml

Backup ArgoCD Configuration

ArgoCD stores config in K8s resources. Velero already backs these up. But for extra safety:

# Export all ArgoCD Applications
kubectl get applications -n argocd -o yaml > /root/tower-fleet/backups/argocd-apps-$(date +%Y%m%d).yaml

# Export ArgoCD config
kubectl get cm,secret -n argocd -o yaml > /root/tower-fleet/backups/argocd-config-$(date +%Y%m%d).yaml

Post-Reboot: How ArgoCD Solves It

Current flow:

Reboot → Pods restart with current replica count → Some at 0 → Manual recovery needed

With ArgoCD (selfHeal: true):

Reboot → ArgoCD starts → Compares cluster to git → Sees replicas=0 but git says replicas=1 → Auto-scales up

Timeline:
  • T+0: Cluster comes up
  • T+1-2 min: ArgoCD pods ready
  • T+2-3 min: ArgoCD reconciliation loop runs
  • T+3-5 min: All apps scaled to declared state

No manual intervention required.
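
The T+2-3 min estimate lines up with ArgoCD's default reconciliation interval, which is controlled by timeout.reconciliation in argocd-cm (default 180s). A sketch of that knob, in case faster post-reboot convergence is ever worth the extra Git polling:

# argocd-cm excerpt (sketch) - default value shown
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 180s   # e.g. 60s would tighten the post-reboot window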


Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Accidental manifest deletion deletes resources | prune: false by default |
| ArgoCD becomes single point of failure | HA installation, Velero backups |
| Learning curve slows operations | Gradual rollout, keep kubectl as fallback |
| Webhook exposes attack surface | Internal hostname only, Authentik SSO |
| Secrets in git | Continue using K8s Secrets (not in git), or migrate to SealedSecrets |

Migration Checklist

For each application:

  • [ ] Verify manifests in tower-fleet/manifests/apps/<app>/ are complete
  • [ ] Verify manifests match current production (kubectl diff)
  • [ ] Create ArgoCD Application definition
  • [ ] Apply Application (manual sync mode first)
  • [ ] Verify app appears in ArgoCD UI
  • [ ] Test manual sync
  • [ ] Enable auto-sync with selfHeal
  • [ ] Verify selfHeal works (scale to 0, watch it recover)
  • [ ] Remove/deprecate old deploy script

Success Criteria

  • [ ] ArgoCD accessible at argocd.internal with Authentik SSO
  • [ ] All 21 applications managed by ArgoCD
  • [ ] Dashboard shows health status of all apps
  • [ ] Self-heal tested: manual kubectl scale --replicas=0 auto-recovers
  • [ ] Post-reboot tested: all apps come back automatically
  • [ ] Alerting on sync failures integrated with monitoring stack
  • [ ] Team can deploy by pushing to git (no kubectl needed)

Decisions (Resolved 2025-01-10)

  1. Supabase consolidation: ✅ Done - kept prod (supabase), deprecated sandbox

  2. Secrets strategy: Use SealedSecrets - already have the controller in kube-system plus 4 sealed secrets in use (home-portal, tcg, trip-planner, vault-platform); see the sketch after this list

  3. External access: Internal only (argocd.internal) - external would require:
     • Adding to VPS Caddyfile reverse proxy
     • Authentik forward auth for protection
     • Accepting an increased attack surface (ArgoCD has cluster-admin)

  4. Notification channels: Discord - extend existing Alertmanager Discord integration for sync notifications

  5. Sync policy: Auto-sync with safeguards
     • selfHeal: true - revert manual kubectl changes
     • prune: false - never auto-delete resources (safety)
     • No manual gates initially, add later if needed
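
For decision 2, this is the shape of what gets committed to git for each app (the name, namespace, key, and ciphertext below are illustrative; the real ciphertext comes from running kubeseal against the in-cluster controller):

# Illustrative SealedSecret as stored in git (sketch)
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: home-portal-secrets        # illustrative name
  namespace: home-portal
spec:
  encryptedData:
    SUPABASE_ANON_KEY: AgB4k...    # placeholder ciphertext produced by kubeseal
  template:
    metadata:
      name: home-portal-secrets
      namespace: home-portal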

Appendix: Files to Create

tower-fleet/
├── argocd/
│   ├── bootstrap/
│   │   └── root-app.yaml
│   ├── projects/
│   │   ├── apps.yaml
│   │   └── infrastructure.yaml
│   ├── apps/
│   │   ├── _app-of-apps.yaml
│   │   ├── home-portal.yaml
│   │   ├── money-tracker.yaml
│   │   └── ... (19 more)
│   ├── infrastructure/
│   │   ├── _app-of-apps.yaml
│   │   ├── supabase-sandbox.yaml
│   │   └── ...
│   └── applicationsets/
│       └── nextjs-apps.yaml
└── manifests/
    └── infrastructure/
        └── argocd/
            ├── ingress.yaml
            └── servicemonitor.yaml