
ADR-002: Storage Strategy

Date: 2025-11-09
Status: Accepted
Deciders: User, Claude Code


Context

Need to decide where to store persistent data for:
  • Observability data: Prometheus metrics, Loki logs, Grafana dashboards (14-day retention, ~10-15GB)
  • Application data: PostgreSQL databases, uploaded files, user data (~40-60GB)

Options:
  • Option A: Pure k8s with Longhorn (distributed storage on VM disks)
  • Option B: Hybrid (Longhorn for apps, NFS to /vault/logs for observability)

Resources:
  • Each VM: 80GB disk
  • NAS: LXC 101 with /vault (22TB, 75% used = 5.3TB free)


Decision

Chosen: Option A - Pure k8s with Longhorn (2-replica)

Storage Allocation

VM Disks (80GB each):
  • k3s-master: 80GB
  • k3s-worker-1: 80GB
  • k3s-worker-2: 80GB
  • Total: 240GB raw → ~120GB usable (with 2-replica Longhorn)

Data Retention:
  • Observability: 14 days (~10-15GB)
  • Application databases: ~40-60GB
  • Total estimated usage: ~70GB (well within 120GB capacity)
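
For the 2-replica default, a minimal StorageClass sketch (parameter names follow current Longhorn docs; confirm against the version installed in Phase 3, or set the same values through the Helm chart instead):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # make Longhorn the cluster default
provisioner: driver.longhorn.io
allowVolumeExpansion: true        # allows online PVC expansion later
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"           # one replica on each of two nodes
  staleReplicaTimeout: "30"       # minutes before a dead replica is considered stale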


Rationale

Why Pure k8s (Option A)?

  1. Cloud-native learning experience
     • This is how production k8s works (EBS on AWS, Persistent Disks on GKE)
     • Learn native k8s storage concepts (PV, PVC, StorageClass)
     • No external dependencies - cluster is self-contained

  2. High Availability
     • Longhorn replicates data across nodes (2-replica)
     • If worker-1 dies, data is still on worker-2
     • Automatic failover and rebalancing

  3. Sufficient capacity
     • 120GB usable storage is plenty for 5 apps with 14-day observability retention
     • Estimated usage: ~70GB (~58% utilization)
     • Can extend retention or add more data

  4. Simpler mental model
     • Everything k8s-related lives in k8s
     • No mixing of k8s and external NFS mounts
     • Easier to understand for learning

  5. Flexibility to change later
     • Can add NFS StorageClass in 5 minutes if we hit limits
     • Can increase VM disk size in Proxmox easily
     • Can add more worker nodes for more storage

Alternatives Considered

Option B: Hybrid Storage

Configuration:
  • Longhorn (on VM disks): application databases, critical data
  • NFS to /vault/logs/k8s/: observability data (unlimited space)

Pros:
  ✅ Unlimited storage for logs (5.3TB available on /vault)
  ✅ Easy to extend retention (14 days → 90 days → 1 year)
  ✅ Observability data survives k8s cluster rebuild

Cons:
  ❌ External dependency (if LXC 101 dies, can't write new logs)
  ❌ Less "production-like" (most k8s uses native storage)
  ❌ More complexity (two storage systems to manage)
  ❌ NFS performance slower than local disks

Why rejected:
  • Primary goal is learning - use k8s native storage
  • 14-day retention is sufficient for homelab (not production)
  • Can always add NFS later if needed (5-minute config change)
  • Simpler is better for starting out


Option C: All NFS (No Longhorn)

Configuration:
  • All persistent volumes on NFS to /vault/k8s-volumes/

Pros:
  ✅ Unlimited storage
  ✅ Simpler (no Longhorn to install/manage)

Cons:
  ❌ Single point of failure (LXC 101 dies = all data gone)
  ❌ Worse performance (network latency vs local disk)
  ❌ Doesn't teach cloud-native storage
  ❌ Not how production k8s works

Why rejected:
  • Defeats the purpose of learning k8s-native storage
  • No HA (NFS is a single point of failure)
  • Performance concerns for databases


Consequences

Positive

  1. Learn production-grade storage
     • Longhorn is similar to Rook/Ceph used in production
     • Understand replication, replica placement, volume snapshots
     • Practice with the PV/PVC lifecycle

  2. High Availability
     • Data replicated across 2 nodes
     • Survives single node failure
     • Automatic recovery and rebalancing

  3. Self-contained cluster
     • No external dependencies
     • Can move cluster to different hardware easily
     • GitOps + Longhorn backups = full disaster recovery

  4. Good performance
     • Local SSD storage (fast)
     • No network latency for database operations

Negative

  1. Limited storage capacity
     • 120GB usable vs unlimited on /vault
     • Requires monitoring disk usage
     • Mitigation: 14-day retention (vs 30-day originally planned)

  2. Tighter data retention
     • 14 days of observability data vs 30+ days possible with NFS
     • Sufficient for homelab (can always extend later)
     • Mitigation: export important metrics/logs to /vault if needed

  3. Need to monitor disk usage
     • Can't let storage fill up (PVC creation fails)
     • Mitigation: Grafana alerts on disk usage (Phase 5)

Neutral

  1. Can pivot to NFS easily
     • If we hit 80% disk usage, add an NFS StorageClass
     • Migration: create PVC with NFS, copy data, switch
     • 5-minute operation

  2. Can increase VM disk size
     • Proxmox allows expanding VM disks online
     • qm resize 201 scsi0 +20G → adds 20GB
     • Longhorn uses the new space once the partition/filesystem inside the VM is grown
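
If an individual volume (rather than the VM disk) runs out of space, Longhorn also supports online expansion through the PVC, assuming the StorageClass sets allowVolumeExpansion: true. A sketch with placeholder names:

# Bump the storage request on an existing PVC; name, namespace, and sizes are placeholders
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
  namespace: money-tracker
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi   # was 10Gi; raising the request triggers expansion
# Equivalent one-liner:
#   kubectl patch pvc data-postgres-0 -n money-tracker \
#     -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'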

Capacity Planning

Current Estimate (5 apps, 14-day retention)

Component                    Storage   Notes
Prometheus (metrics)         3-5GB     14 days, 5 apps
Loki (logs)                  5-7GB     14 days, 5 apps (depends on verbosity)
Tempo (traces)               3-5GB     14 days, sampled traces
PostgreSQL (money-tracker)   5-10GB    Transactions, budgets, user data
PostgreSQL (home-portal)     2-5GB     Bookmarks, configs
PostgreSQL (RMS)             10-20GB   Recipes, images, nutrition data
Other apps                   10-15GB   Future applications, uploads
Total                        48-72GB   Well within 120GB capacity
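
The 14-day retention that drives these estimates has to be configured in the observability stack itself. A rough sketch of the relevant values, assuming kube-prometheus-stack and the Loki Helm chart are used (exact key paths vary by chart version and should be verified):

# kube-prometheus-stack values (assumed chart and key paths)
prometheus:
  prometheusSpec:
    retention: 14d              # time-based retention
    retentionSize: 5GB          # size cap so the PVC cannot fill up
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi

# Loki values (assumed chart and key paths)
loki:
  limits_config:
    retention_period: 336h      # 14 days
  compactor:
    retention_enabled: true     # compactor must enforce deletes for retention to apply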

Headroom

  • Usable: 120GB
  • Estimated usage: ~60GB (midpoint of the 48-72GB estimate, ~50% utilization)
  • Headroom: ~60GB (~100% buffer)

Scaling Options (if needed)

  1. Reduce retention: 14 days → 7 days (saves ~5-10GB)
  2. Add NFS StorageClass: offload old logs to /vault (see the sketch below)
  3. Increase VM disks: qm resize +20G per VM (+60GB raw, ~+30GB usable with 2 replicas)
  4. Add worker node: VM 204 with 80GB (+40GB usable)
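
For option 2, a minimal sketch of what the NFS StorageClass could look like, assuming the csi-driver-nfs provisioner gets installed and LXC 101 exports the path over NFS (server address is a placeholder):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-vault
provisioner: nfs.csi.k8s.io       # requires the csi-driver-nfs addon
parameters:
  server: 192.168.1.101           # placeholder - LXC 101 / NAS address
  share: /vault/logs/k8s
reclaimPolicy: Retain             # keep data on /vault even if the PVC is deleted
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1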

Monitoring & Alerts

To implement in Phase 5:

# Grafana alert: Disk usage >80%
- alert: HighDiskUsage
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} is >80% full"

Manual checks:

# Check PVC usage
kubectl get pvc -A

# Longhorn UI (after Phase 3)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# http://localhost:8080


Validation

After 3 months, evaluate:
  - [ ] Is 120GB sufficient? (target: <70% utilization)
  - [ ] Is 14-day retention adequate for debugging?
  - [ ] Has Longhorn been stable? (no data loss)
  - [ ] Do we need to add an NFS StorageClass?

Success criteria:
  • Disk usage stays below 80% over 3 months
  • No data loss events
  • Longhorn replication working (survives a node failure)



References


Implementation Notes

Phase 3 (Core Infrastructure):
  • Install Longhorn via Helm
  • Set default StorageClass to Longhorn
  • Configure 2-replica for HA
  • Set up recurring snapshots (daily)
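
For the daily snapshots, one option is Longhorn's RecurringJob resource. A minimal sketch, assuming Longhorn 1.3+ CRDs (schedule and retention count are placeholders to tune):

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  task: snapshot         # snapshot (local), not backup (external target)
  cron: "0 3 * * *"      # every day at 03:00
  retain: 7              # keep the last 7 snapshots per volume
  concurrency: 2         # volumes processed in parallel
  groups:
    - default            # applies to volumes in the default group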

Future Enhancements:

See Tower Fleet Roadmap - Storage for planned storage improvements:
  • Longhorn backups to /vault/k8s-backups/ (disaster recovery)
  • NFS StorageClass (if needed for large files)
  • Volume expansion planning and procedures
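
For the backup item above, Longhorn's backup target can point at an NFS export. A sketch of the Helm values, assuming /vault/k8s-backups/ is exported from the NAS (address is a placeholder; key names should be verified against the chart version):

# Longhorn Helm values (assumed keys)
defaultSettings:
  backupTarget: "nfs://192.168.1.101:/vault/k8s-backups"  # placeholder NAS address
  backupTargetCredentialSecret: ""                        # not needed for NFS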