
ADR-002: Storage Strategy

Date: 2025-11-09
Status: Accepted
Deciders: User, Claude Code


Context

Need to decide where to store persistent data for:
  • Observability data: Prometheus metrics, Loki logs, Grafana dashboards (14-day retention, ~10-15GB)
  • Application data: PostgreSQL databases, uploaded files, user data (~40-60GB)

Options:
  • Option A: Pure k8s with Longhorn (distributed storage on VM disks)
  • Option B: Hybrid (Longhorn for apps, NFS to /vault/logs for observability)

Resources:
  • Each VM: 80GB disk
  • NAS: LXC 101 with /vault (22TB, 75% used = 5.3TB free)


Decision

Chosen: Option A - Pure k8s with Longhorn (2-replica)

Storage Allocation

VM Disks (80GB each):
  • k3s-master: 80GB
  • k3s-worker-1: 80GB
  • k3s-worker-2: 80GB
  • Total: 240GB raw → ~120GB usable (with 2-replica Longhorn)

Data Retention:
  • Observability: 14 days (~10-15GB)
  • Application databases: ~40-60GB
  • Total estimated usage: ~70GB (well within 120GB capacity)
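
For the 2-replica default, a minimal StorageClass sketch (parameter names follow current Longhorn docs; confirm against the version installed in Phase 3, or set the same values through the Helm chart instead):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # make Longhorn the cluster default
provisioner: driver.longhorn.io
allowVolumeExpansion: true        # allows online PVC expansion later
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"           # one replica on each of two nodes
  staleReplicaTimeout: "30"       # minutes before a dead replica is considered stale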


Rationale

Why Pure k8s (Option A)?

  1. Cloud-native learning experience
     • This is how production k8s works (EBS on AWS, Persistent Disks on GKE)
     • Learn native k8s storage concepts (PV, PVC, StorageClass)
     • No external dependencies - cluster is self-contained

  2. High Availability
     • Longhorn replicates data across nodes (2-replica)
     • If worker-1 dies, data is still on worker-2
     • Automatic failover and rebalancing

  3. Sufficient capacity
     • 120GB usable storage is plenty for 5 apps with 14-day observability retention
     • Estimated usage: ~70GB (~58% utilization)
     • Can extend retention or add more data

  4. Simpler mental model
     • Everything k8s-related lives in k8s
     • No mixing of k8s and external NFS mounts
     • Easier to understand for learning

  5. Flexibility to change later
     • Can add NFS StorageClass in 5 minutes if we hit limits
     • Can increase VM disk size in Proxmox easily
     • Can add more worker nodes for more storage

Alternatives Considered

Option B: Hybrid Storage

Configuration:
  • Longhorn (on VM disks): application databases, critical data
  • NFS to /vault/logs/k8s/: observability data (unlimited space)

Pros:
  ✅ Unlimited storage for logs (5.3TB available on /vault)
  ✅ Easy to extend retention (14 days → 90 days → 1 year)
  ✅ Observability data survives k8s cluster rebuild

Cons:
  ❌ External dependency (if LXC 101 dies, can't write new logs)
  ❌ Less "production-like" (most k8s uses native storage)
  ❌ More complexity (two storage systems to manage)
  ❌ NFS performance slower than local disks

Why rejected:
  • Primary goal is learning - use k8s native storage
  • 14-day retention is sufficient for homelab (not production)
  • Can always add NFS later if needed (5-minute config change)
  • Simpler is better for starting out


Option C: All NFS (No Longhorn)

Configuration:
  • All persistent volumes on NFS to /vault/k8s-volumes/

Pros:
  ✅ Unlimited storage
  ✅ Simpler (no Longhorn to install/manage)

Cons:
  ❌ Single point of failure (LXC 101 dies = all data gone)
  ❌ Worse performance (network latency vs local disk)
  ❌ Doesn't teach cloud-native storage
  ❌ Not how production k8s works

Why rejected:
  • Defeats the purpose of learning k8s-native storage
  • No HA (NFS is a single point of failure)
  • Performance concerns for databases


Consequences

Positive

  1. Learn production-grade storage
     • Longhorn is similar to Rook/Ceph used in production
     • Understand replication, replica placement, volume snapshots
     • Practice with the PV/PVC lifecycle

  2. High Availability
     • Data replicated across 2 nodes
     • Survives single node failure
     • Automatic recovery and rebalancing

  3. Self-contained cluster
     • No external dependencies
     • Can move cluster to different hardware easily
     • GitOps + Longhorn backups = full disaster recovery

  4. Good performance
     • Local SSD storage (fast)
     • No network latency for database operations

Negative

  1. Limited storage capacity
     • 120GB usable vs unlimited on /vault
     • Requires monitoring disk usage
     • Mitigation: 14-day retention (vs 30-day originally planned)

  2. Tighter data retention
     • 14 days of observability data vs 30+ days possible with NFS
     • Sufficient for homelab (can always extend later)
     • Mitigation: export important metrics/logs to /vault if needed

  3. Need to monitor disk usage
     • Can't let storage fill up (PVC creation fails)
     • Mitigation: Grafana alerts on disk usage (Phase 5)

Neutral

  1. Can pivot to NFS easily
     • If we hit 80% disk usage, add an NFS StorageClass
     • Migration: create PVC with NFS, copy data, switch
     • 5-minute operation

  2. Can increase VM disk size
     • Proxmox allows expanding VM disks online
     • qm resize 201 scsi0 +20G → adds 20GB
     • Longhorn uses the new space once the partition/filesystem inside the VM is grown
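
If an individual volume (rather than the VM disk) runs out of space, Longhorn also supports online expansion through the PVC, assuming the StorageClass sets allowVolumeExpansion: true. A sketch with placeholder names:

# Bump the storage request on an existing PVC; name, namespace, and sizes are placeholders
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
  namespace: money-tracker
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi   # was 10Gi; raising the request triggers expansion
# Equivalent one-liner:
#   kubectl patch pvc data-postgres-0 -n money-tracker \
#     -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'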

Capacity Planning

Current Estimate (5 apps, 14-day retention)

Component                    Storage   Notes
Prometheus (metrics)         3-5GB     14 days, 5 apps
Loki (logs)                  5-7GB     14 days, 5 apps (depends on verbosity)
Tempo (traces)               3-5GB     14 days, sampled traces
PostgreSQL (money-tracker)   5-10GB    Transactions, budgets, user data
PostgreSQL (home-portal)     2-5GB     Bookmarks, configs
PostgreSQL (RMS)             10-20GB   Recipes, images, nutrition data
Other apps                   10-15GB   Future applications, uploads
Total                        48-72GB   Well within 120GB capacity
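
The 14-day retention that drives these estimates has to be configured in the observability stack itself. A rough sketch of the relevant values, assuming kube-prometheus-stack and the Loki Helm chart are used (exact key paths vary by chart version and should be verified):

# kube-prometheus-stack values (assumed chart and key paths)
prometheus:
  prometheusSpec:
    retention: 14d              # time-based retention
    retentionSize: 5GB          # size cap so the PVC cannot fill up
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi

# Loki values (assumed chart and key paths)
loki:
  limits_config:
    retention_period: 336h      # 14 days
  compactor:
    retention_enabled: true     # compactor must enforce deletes for retention to apply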

Headroom

  • Usable: 120GB
  • Estimated usage: ~60GB (midpoint of the 48-72GB estimate, ~50% utilization)
  • Headroom: ~60GB (~100% buffer)

Scaling Options (if needed)

  1. Reduce retention: 14 days → 7 days (saves ~5-10GB)
  2. Add NFS StorageClass: offload old logs to /vault (see the sketch below)
  3. Increase VM disks: qm resize +20G per VM (+60GB raw, ~+30GB usable with 2 replicas)
  4. Add worker node: VM 204 with 80GB (+40GB usable)
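
For option 2, a minimal sketch of what the NFS StorageClass could look like, assuming the csi-driver-nfs provisioner gets installed and LXC 101 exports the path over NFS (server address is a placeholder):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-vault
provisioner: nfs.csi.k8s.io       # requires the csi-driver-nfs addon
parameters:
  server: 192.168.1.101           # placeholder - LXC 101 / NAS address
  share: /vault/logs/k8s
reclaimPolicy: Retain             # keep data on /vault even if the PVC is deleted
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1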

Monitoring & Alerts

To implement in Phase 5:

# Grafana alert: Disk usage >80%
- alert: HighDiskUsage
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} is >80% full"

Manual checks:

# Check PVC usage
kubectl get pvc -A

# Longhorn UI (after Phase 3)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# http://localhost:8080


Validation

After 3 months, evaluate:
  - [ ] Is 120GB sufficient? (target: <70% utilization)
  - [ ] Is 14-day retention adequate for debugging?
  - [ ] Has Longhorn been stable? (no data loss)
  - [ ] Do we need to add an NFS StorageClass?

Success criteria:
  • Disk usage stays below 80% over 3 months
  • No data loss events
  • Longhorn replication working (survives a node failure)



References


Implementation Notes

Phase 3 (Core Infrastructure):
  • Install Longhorn via Helm
  • Set default StorageClass to Longhorn
  • Configure 2-replica for HA
  • Set up recurring snapshots (daily)
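
For the daily snapshots, one option is Longhorn's RecurringJob resource. A minimal sketch, assuming Longhorn 1.3+ CRDs (schedule and retention count are placeholders to tune):

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  task: snapshot         # snapshot (local), not backup (external target)
  cron: "0 3 * * *"      # every day at 03:00
  retain: 7              # keep the last 7 snapshots per volume
  concurrency: 2         # volumes processed in parallel
  groups:
    - default            # applies to volumes in the default group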

Future Enhancements:

See Tower Fleet Roadmap - Storage for planned storage improvements:
  • Longhorn backups to /vault/k8s-backups/ (disaster recovery)
  • NFS StorageClass (if needed for large files)
  • Volume expansion planning and procedures
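
For the backup item above, Longhorn's backup target can point at an NFS export. A sketch of the Helm values, assuming /vault/k8s-backups/ is exported from the NAS (address is a placeholder; key names should be verified against the chart version):

# Longhorn Helm values (assumed keys)
defaultSettings:
  backupTarget: "nfs://192.168.1.101:/vault/k8s-backups"  # placeholder NAS address
  backupTargetCredentialSecret: ""                        # not needed for NFS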