# ArgoCD Implementation Plan v2
**Status:** Ready for Implementation · **Author:** Claude (iteration on v1) · **Date:** 2025-01-10 · **Previous:** v1
## Executive Summary
This is an updated ArgoCD implementation plan that addresses gaps in v1 and incorporates lessons learned from the 2025-01-10 post-reboot incident where multiple services remained scaled to 0 due to incomplete recovery scripts.
Key changes from v1:

- Complete inventory of all 21 applications
- App-of-Apps pattern for manageable scale
- ApplicationSets for similar Next.js apps
- Bootstrap procedure for ArgoCD itself
- Explicit solution for post-reboot recovery
- Gradual rollout strategy (parallel operation first)
- Prune safeguards to prevent accidental deletion
## Problem Statement (Updated)

### Current State Issues
- **Manifest drift** - Production diverges from git (unchanged from v1)
- **Incomplete recovery** - Post-reboot script only covers 2 of 21 apps
- **Stale configuration** - ConfigMaps reference old IPs (e.g., Supabase `10.89.97.214` vs `10.89.97.221`)
- **No single source of truth** - Mix of kubectl, deploy scripts, and ad-hoc changes
- **Visibility gap** - No dashboard showing health of all services
### What Happened on 2025-01-10
```
Reboot occurred
  → post-reboot-recovery.sh ran
  → Only scaled up: authentik, home-portal
  → Left at 0 replicas: money-tracker, trip-planner, subtitleai (4 pods)
  → Production supabase kong/rest/gotrue/storage stayed at 0
  → home-portal couldn't reach Supabase (stale IP in ConfigMap)
  → User saw "No services yet"
```
With ArgoCD: Self-heal would have restored all services to their declared state within minutes.
## Complete Application Inventory

### User Applications (12)
| App | Namespace | Manifest Path | Priority | Notes |
|---|---|---|---|---|
| home-portal | home-portal | manifests/apps/home-portal | Critical | Main dashboard |
| money-tracker | money-tracker | manifests/apps/money-tracker | High | Finance data |
| trip-planner | trip-planner | manifests/apps/trip-planner | Medium | AI travel |
| palimpsest-api | palimpsest-api | manifests/apps/palimpsest-api | Medium | RPG backend |
| palimpsest-web | palimpsest-web | manifests/apps/palimpsest-web | Medium | RPG frontend |
| tcg | tcg | manifests/apps/tcg | Low | Card game |
| notes-app | notes-app | manifests/apps/notes-app | Low | Notes |
| music-control | music-control | manifests/apps/music-control | Medium | HEOS control |
| subtitleai | subtitleai | manifests/apps/subtitleai | Low | Subtitle processing |
| vault-platform | vault-platform | manifests/apps/vault-platform | Low | Vault app |
| anythingllm | anythingllm | manifests/apps/anythingllm | Low | Local LLM |
| otterwiki | otterwiki | manifests/apps/otterwiki | Medium | Documentation |
### Media/Content Applications (3)
| App | Namespace | Manifest Path | Priority |
|---|---|---|---|
| immich | immich | manifests/apps/immich | High |
| romm | romm | manifests/apps/romm | Low |
| pelican | pelican | manifests/apps/pelican | Low |
### Infrastructure Services (8)
| Service | Namespace | Manifest Path | Managed by ArgoCD? |
|---|---|---|---|
| supabase-sandbox | supabase-sandbox | manifests/supabase-sandbox | Yes |
| supabase (prod) | supabase | manifests/supabase | Deprecate |
| authentik | authentik | manifests/authentik | Yes (careful) |
| ingress-nginx | ingress-nginx | manifests/core | Yes |
| cert-manager | cert-manager | manifests/cert-manager | Yes |
| monitoring | monitoring | manifests/monitoring | Yes |
| docker-registry | docker-registry | manifests/infrastructure | Yes |
| minio | minio | manifests/apps/minio | Yes |
### Decision: Deprecate Production Supabase
The `supabase` namespace (production) has been partially scaled down for a while. All apps should use `supabase-sandbox`. **Action:** migrate any remaining references to sandbox, then delete the production namespace (see the sweep below).
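Before deleting the production namespace, a quick sweep for stragglers (a sketch; the old IP comes from the issues list above, and the grep targets are assumptions about how references appear in this repo):

```bash
# Find manifests still pointing at the production Supabase namespace or its old IP
grep -rn --include='*.yaml' -e 'supabase\.svc' -e '10\.89\.97\.214' manifests/ \
  | grep -v supabase-sandbox

# Find live ConfigMaps that still carry the stale IP
kubectl get configmaps -A -o yaml | grep -n '10.89.97.214'
```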
## Architecture: App-of-Apps Pattern
Instead of managing 21+ Application CRDs individually, use a hierarchical approach:
```
argocd/
├── bootstrap/
│   └── root-app.yaml          # The ONE Application you apply manually
│
├── projects/
│   ├── apps.yaml              # Project for user applications
│   └── infrastructure.yaml    # Project for infra services
│
├── apps/
│   ├── _app-of-apps.yaml      # Parent that manages all apps below
│   ├── home-portal.yaml
│   ├── money-tracker.yaml
│   ├── trip-planner.yaml
│   └── ...
│
└── infrastructure/
    ├── _app-of-apps.yaml      # Parent for infra
    ├── supabase-sandbox.yaml
    ├── authentik.yaml
    └── ...
```
**Bootstrap procedure:** apply the root app once with `kubectl`; it syncs the `argocd/` directory, which in turn creates every other Application.
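A minimal sketch of `bootstrap/root-app.yaml` under this layout (the `default` project and the `directory.recurse` setting are assumptions, not settled choices):

```yaml
# argocd/bootstrap/root-app.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default              # assumption: root app lives in the default project
  source:
    repoURL: git@github.com:jakecelentano/tower-fleet.git
    targetRevision: main
    path: argocd                # sync the whole argocd/ tree
    directory:
      recurse: true             # pick up projects/, apps/, infrastructure/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false              # same safety default as everywhere else
      selfHeal: true
```

Apply it once with `kubectl apply -f argocd/bootstrap/root-app.yaml`; from then on, git drives everything.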
## ApplicationSets for Similar Apps
Many apps follow the same pattern (Next.js + Supabase). Use ApplicationSet to reduce duplication:
```yaml
# argocd/applicationsets/nextjs-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nextjs-apps
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - name: home-portal
            priority: critical   # priority is informational; the template below doesn't consume it
          - name: money-tracker
            priority: high
          - name: trip-planner
            priority: medium
          - name: tcg
            priority: low
          - name: notes-app
            priority: low
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: apps
      source:
        repoURL: git@github.com:jakecelentano/tower-fleet.git
        targetRevision: main
        path: 'manifests/apps/{{name}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
      syncPolicy:
        automated:
          prune: false           # SAFETY: Don't auto-delete
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```
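To bring the generated apps online, apply the ApplicationSet and confirm one Application per list element appears (standard commands, names from the generator above):

```bash
kubectl apply -f argocd/applicationsets/nextjs-apps.yaml
kubectl get applications -n argocd   # expect one Application per list element
```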
## Implementation Phases (Revised)

### Phase 0: Prerequisites (Before Starting)
- [ ] Resolve Supabase confusion: migrate all apps to sandbox, document the decision
- [ ] Update all ConfigMaps with the correct Supabase URL (`10.89.97.221`)
- [ ] Ensure all manifests in git match current production state
- [ ] Audit manifests for hardcoded IPs that should be service names
### Phase 1: Install ArgoCD
1.1 Install with HA manifests (recommended for self-heal reliability):
```bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
```
1.2 Wait for ArgoCD to be ready:
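One way to gate on readiness (a sketch, assuming the install's core components run as Deployments; the HA Redis StatefulSet can be checked separately with `kubectl rollout status`):

```bash
kubectl -n argocd wait --for=condition=Available deployment --all --timeout=300s
```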
1.3 Expose via Ingress:
```yaml
# manifests/infrastructure/argocd/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: nginx
  rules:
    - host: argocd.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 443
  tls:
    - hosts:
        - argocd.internal
      secretName: wildcard-bogocat-tls
```
1.4 Get admin password and login:
```bash
# Get password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Install CLI
curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
chmod +x argocd && mv argocd /usr/local/bin/

# Login
argocd login argocd.internal --username admin --password <password> --insecure
```
### Phase 2: Connect Repository
```bash
# Generate deploy key
ssh-keygen -t ed25519 -C "argocd@tower-fleet" -f ~/.ssh/argocd-deploy-key -N ""

# Add to GitHub (Settings → Deploy Keys, read-only)
cat ~/.ssh/argocd-deploy-key.pub

# Add to ArgoCD
argocd repo add git@github.com:jakecelentano/tower-fleet.git \
  --ssh-private-key-path ~/.ssh/argocd-deploy-key \
  --name tower-fleet
```
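Worth a quick verification before moving on (standard CLI):

```bash
argocd repo list   # tower-fleet should report a successful connection
```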
### Phase 3: Create Project Structure
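A sketch of `argocd/projects/apps.yaml` under the stated structure (`infrastructure.yaml` would follow the same shape with broader cluster-resource permissions; the wildcard namespace destination is an assumption to tighten later):

```yaml
# argocd/projects/apps.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: apps
  namespace: argocd
spec:
  description: User-facing applications
  sourceRepos:
    - git@github.com:jakecelentano/tower-fleet.git   # only our fleet repo
  destinations:
    - server: https://kubernetes.default.svc
      namespace: '*'            # assumption: each app deploys to its own namespace
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace           # needed for the CreateNamespace=true sync option
```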
### Phase 4: Gradual Rollout (Parallel Operation)
**Critical:** don't migrate everything at once. Run ArgoCD in parallel with the existing approach.
**Week 1: Low-risk apps (manual sync)**
- otterwiki
- tcg
- notes-app
- romm

**Week 2: Medium-risk apps (manual sync)**
- pelican
- music-control
- vault-platform
- palimpsest-api
- palimpsest-web

**Week 3: Enable auto-sync on Week 1-2 apps**
- Convert to `selfHeal: true`, `prune: false` (see the CLI sketch after this list)

**Week 4: High-risk apps**
- subtitleai
- trip-planner
- anythingllm
- immich

**Week 5: Critical apps**
- money-tracker
- home-portal

**Week 6: Infrastructure**
- supabase-sandbox
- authentik (very careful)
- monitoring
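For Week 3, one way to flip a Week 1-2 app from manual to auto-sync without touching YAML (flags from `argocd app set`; `otterwiki` is just the example):

```bash
# Enable self-heal without pruning for an app migrated in Week 1-2
argocd app set otterwiki --sync-policy automated --self-heal
# prune stays off unless --auto-prune is passed explicitly
```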
### Phase 5: Configure Sync Policies
Safe defaults for all apps:
```yaml
syncPolicy:
  automated:
    prune: false       # NEVER auto-delete resources
    selfHeal: true     # DO auto-revert manual changes
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
  syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - ServerSideApply=true
```
**Why `prune: false`?** If someone accidentally deletes a manifest from git, ArgoCD won't delete the running resource. You'll see "OutOfSync" in the dashboard and can investigate (see below).
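Investigating an OutOfSync app from the CLI looks like this (standard `argocd` commands; `home-portal` is a placeholder):

```bash
argocd app list | grep -v Synced   # find anything out of sync
argocd app diff home-portal        # show exactly what differs from git
```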
### Phase 6: Authentik SSO
(Same as v1, unchanged)
### Phase 7: Monitoring & Alerts
Add to existing Prometheus stack:
```yaml
# manifests/monitoring/argocd-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  namespaceSelector:
    matchNames:
      - argocd
  endpoints:
    - port: metrics
```
Alert rules:
```yaml
groups:
  - name: argocd
    rules:
      - alert: ArgoAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD app {{ $labels.name }} is out of sync"

      - alert: ArgoAppDegraded
        expr: argocd_app_info{health_status="Degraded"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ArgoCD app {{ $labels.name }} is degraded"
```
## Bootstrap & Disaster Recovery

### If ArgoCD Itself Doesn't Start After Reboot
ArgoCD is just another K8s deployment: if the cluster comes up, ArgoCD comes up. But if you need to recover:
```bash
# Re-apply ArgoCD installation
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Re-apply the root app (all other apps will sync automatically)
kubectl apply -f /root/tower-fleet/argocd/bootstrap/root-app.yaml
```
### Backup ArgoCD Configuration
ArgoCD stores config in K8s resources. Velero already backs these up. But for extra safety:
```bash
# Export all ArgoCD Applications
kubectl get applications -n argocd -o yaml > /root/tower-fleet/backups/argocd-apps-$(date +%Y%m%d).yaml

# Export ArgoCD config
kubectl get cm,secret -n argocd -o yaml > /root/tower-fleet/backups/argocd-config-$(date +%Y%m%d).yaml
```
## Post-Reboot: How ArgoCD Solves It
Current flow: the manual `post-reboot-recovery.sh` script, which the incident above showed covers only 2 of 21 apps.
With ArgoCD (`selfHeal: true`):

```
Reboot → ArgoCD starts → Compares cluster to git
       → Sees replicas=0 but git says replicas=1 → Auto-scales up
```
Timeline:

- T+0: Cluster comes up
- T+1-2 min: ArgoCD pods ready
- T+2-3 min: ArgoCD reconciliation loop runs
- T+3-5 min: All apps scaled to declared state
No manual intervention required.
## Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Accidental manifest deletion deletes resources | prune: false by default |
| ArgoCD becomes single point of failure | HA installation, Velero backups |
| Learning curve slows operations | Gradual rollout, keep kubectl as fallback |
| Webhook exposes attack surface | Internal hostname only, Authentik SSO |
| Secrets in git | Continue using K8s Secrets (not in git), or migrate to SealedSecrets |
## Migration Checklist
For each application:
- [ ] Verify manifests in `tower-fleet/manifests/apps/<app>/` are complete
- [ ] Verify manifests match current production (`kubectl diff`)
- [ ] Create ArgoCD Application definition
- [ ] Apply Application (manual sync mode first)
- [ ] Verify app appears in ArgoCD UI
- [ ] Test manual sync
- [ ] Enable auto-sync with selfHeal
- [ ] Verify selfHeal works (scale to 0, watch it recover — see the sketch after this list)
- [ ] Remove/deprecate old deploy script
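A concrete version of the selfHeal verification step above (the deployment name is per-app; `home-portal` is illustrative):

```bash
# Manually scale an ArgoCD-managed app to zero...
kubectl -n home-portal scale deployment home-portal --replicas=0

# ...then watch ArgoCD revert it to the replica count declared in git
kubectl -n home-portal get deployment home-portal -w
```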
## Success Criteria
- [ ] ArgoCD accessible at argocd.internal with Authentik SSO
- [ ] All 21 applications managed by ArgoCD
- [ ] Dashboard shows health status of all apps
- [ ] Self-heal tested: manual `kubectl scale --replicas=0` auto-recovers
- [ ] Post-reboot tested: all apps come back automatically
- [ ] Alerting on sync failures integrated with monitoring stack
- [ ] Team can deploy by pushing to git (no kubectl needed)
## Decisions (Resolved 2025-01-10)
1. **Supabase consolidation:** ✅ Done - kept sandbox (`supabase-sandbox`), deprecated production (`supabase`)
2. **Secrets strategy:** Use SealedSecrets - the controller already runs in `kube-system`, with 4 sealed secrets in use (home-portal, tcg, trip-planner, vault-platform)
3. **External access:** Internal only (`argocd.internal`) - going external would require:
   - Adding ArgoCD to the VPS Caddyfile reverse proxy
   - Authentik forward auth for protection
   - Accepting an increased attack surface (ArgoCD has cluster-admin)
4. **Notification channels:** Discord - extend the existing Alertmanager Discord integration for sync notifications
5. **Sync policy:** Auto-sync with safeguards
   - `selfHeal: true` - revert manual kubectl changes
   - `prune: false` - never auto-delete resources (safety)
   - No manual gates initially; add later if needed
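Given decision 2, the sealing flow for any new app secret would look roughly like this (standard `kubectl`/`kubeseal` usage; the secret name, key, and output path are placeholders):

```bash
# Seal a secret so only the encrypted form lives in git
kubectl create secret generic my-app-secrets \
  --from-literal=SUPABASE_ANON_KEY='<value>' --dry-run=client -o yaml \
  | kubeseal --format yaml > manifests/apps/my-app/sealed-secret.yaml
```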
## Appendix: Files to Create
```
tower-fleet/
├── argocd/
│   ├── bootstrap/
│   │   └── root-app.yaml
│   ├── projects/
│   │   ├── apps.yaml
│   │   └── infrastructure.yaml
│   ├── apps/
│   │   ├── _app-of-apps.yaml
│   │   ├── home-portal.yaml
│   │   ├── money-tracker.yaml
│   │   └── ... (19 more)
│   ├── infrastructure/
│   │   ├── _app-of-apps.yaml
│   │   ├── supabase-sandbox.yaml
│   │   └── ...
│   └── applicationsets/
│       └── nextjs-apps.yaml
│
└── manifests/
    └── infrastructure/
        └── argocd/
            ├── ingress.yaml
            └── servicemonitor.yaml
```