# ADR-002: Storage Strategy
**Date:** 2025-11-09
**Status:** Accepted
**Deciders:** User, Claude Code
## Context
Need to decide where to store persistent data for:

- **Observability data**: Prometheus metrics, Loki logs, Grafana dashboards (14-day retention, ~10-15GB)
- **Application data**: PostgreSQL databases, uploaded files, user data (~40-60GB)
Options:
- Option A: Pure k8s with Longhorn (distributed storage on VM disks)
- Option B: Hybrid (Longhorn for apps, NFS to /vault/logs for observability)
Resources:
- Each VM: 80GB disk
- NAS: LXC 101 with /vault (22TB, 75% used, ~5.3TB free)
## Decision

**Chosen: Option A - Pure k8s with Longhorn (2-replica)**
### Storage Allocation
VM disks (80GB each):

- k3s-master: 80GB
- k3s-worker-1: 80GB
- k3s-worker-2: 80GB
- **Total: 240GB raw → ~120GB usable** (with 2-replica Longhorn)
Data retention:

- Observability: 14 days (~10-15GB)
- Application databases: ~40-60GB
- **Total estimated usage: ~70GB** (well within the 120GB capacity)
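The usable figure is simply raw capacity divided by the replica count (ignoring Longhorn's own metadata and filesystem overhead):

$$
\text{usable} \approx \frac{\text{nodes} \times \text{disk}}{\text{replicas}} = \frac{3 \times 80\,\text{GB}}{2} = 120\,\text{GB}
$$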
## Rationale

### Why Pure k8s (Option A)?
- **Cloud-native learning experience**
    - This is how production k8s works (EBS on AWS, Persistent Disks on GKE)
    - Learn native k8s storage concepts: PV, PVC, StorageClass (a minimal example follows this list)
    - No external dependencies - the cluster is self-contained
- **High availability**
    - Longhorn replicates data across nodes (2-replica)
    - If worker-1 dies, the data is still on worker-2
    - Automatic failover and rebalancing
- **Sufficient capacity**
    - 120GB of usable storage is plenty for 5 apps with 14-day observability retention
    - Estimated usage: ~70GB (~58% utilization)
    - Can extend retention or add more data
- **Simpler mental model**
    - Everything k8s-related lives in k8s
    - No mixing of k8s and external NFS mounts
    - Easier to understand while learning
- **Flexibility to change later**
    - Can add an NFS StorageClass in ~5 minutes if we hit limits
    - Can increase VM disk size in Proxmox easily
    - Can add more worker nodes for more storage
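As a concrete illustration of those concepts, a 2-replica Longhorn StorageClass and a PVC that consumes it might look like the sketch below. The names `longhorn-2r` and `money-tracker-db` are illustrative, not part of this ADR:

```yaml
# Sketch: 2-replica Longhorn StorageClass (names are hypothetical).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2r
provisioner: driver.longhorn.io   # Longhorn's CSI provisioner
parameters:
  numberOfReplicas: "2"           # matches the 2-replica decision above
  staleReplicaTimeout: "30"
allowVolumeExpansion: true        # lets PVCs grow later (see Neutral section)
---
# Sketch: a PVC for one of the app databases, bound to the class above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: money-tracker-db
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-2r
  resources:
    requests:
      storage: 10Gi
```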
## Alternatives Considered

### Option B: Hybrid Storage
Configuration:
- Longhorn (on VM disks): Application databases, critical data
- NFS to /vault/logs/k8s/: Observability data (unlimited space)
Pros:

- ✅ Unlimited storage for logs (5.3TB available on /vault)
- ✅ Easy to extend retention (14 days → 90 days → 1 year)
- ✅ Observability data survives a k8s cluster rebuild
Cons:

- ❌ External dependency (if LXC 101 dies, new logs can't be written)
- ❌ Less "production-like" (most k8s setups use native storage)
- ❌ More complexity (two storage systems to manage)
- ❌ NFS performance is slower than local disks
Why rejected:

- Primary goal is learning - use k8s-native storage
- 14-day retention is sufficient for a homelab (not production)
- Can always add NFS later if needed (a 5-minute config change; see the sketch below)
- Simpler is better for starting out
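For reference, that "add NFS later" escape hatch could be as small as a static PersistentVolume pointing at the export. The server address, path, and names below are illustrative, not the real LXC 101 configuration:

```yaml
# Sketch: static NFS PersistentVolume for /vault (values are hypothetical).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vault-logs
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.101   # hypothetical address for LXC 101
    path: /vault/logs/k8s
---
# Sketch: a PVC that binds explicitly to the static PV above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-archive
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""      # empty string disables dynamic provisioning
  volumeName: vault-logs
  resources:
    requests:
      storage: 500Gi
```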
### Option C: All NFS (No Longhorn)
Configuration:
- All persistent volumes on NFS to /vault/k8s-volumes/
Pros:

- ✅ Unlimited storage
- ✅ Simpler (no Longhorn to install/manage)
Cons:

- ❌ Single point of failure (LXC 101 down = all data unavailable)
- ❌ Worse performance (network latency vs local disk)
- ❌ Doesn't teach cloud-native storage
- ❌ Not how production k8s works
Why rejected:

- Defeats the purpose of learning k8s-native storage
- No HA (NFS is a single point of failure)
- Performance concerns for databases
## Consequences

### Positive
- **Learn production-grade storage**
    - Longhorn is similar to Rook/Ceph used in production
    - Understand replication, replica placement, volume snapshots
    - Practice the PV/PVC lifecycle
- **High availability**
    - Data replicated across 2 nodes
    - Survives a single node failure
    - Automatic recovery and rebalancing
- **Self-contained cluster**
    - No external dependencies
    - Can move the cluster to different hardware easily
    - GitOps + Longhorn backups = full disaster recovery
- **Good performance**
    - Local SSD storage (fast)
    - No network latency for database operations
### Negative
- **Limited storage capacity**
    - 120GB usable vs effectively unlimited on /vault
    - Requires monitoring disk usage
    - Mitigation: 14-day retention (vs the 30 days originally planned)
- **Tighter data retention**
    - 14 days of observability data vs 30+ days possible with NFS
    - Sufficient for a homelab (can always extend later)
    - Mitigation: export important metrics/logs to /vault if needed
- **Need to monitor disk usage**
    - Can't let storage fill up (PVC creation fails)
    - Mitigation: Grafana alerts on disk usage (Phase 5)
### Neutral
- **Can pivot to NFS easily**
    - If we hit 80% disk usage, add an NFS StorageClass
    - Migration: create a PVC on NFS, copy the data, switch the workload
    - Roughly a 5-minute operation
- **Can increase VM disk size**
    - Proxmox allows expanding VM disks online
    - `qm resize 201 scsi0 +20G` adds 20GB
    - Longhorn automatically uses the new space (growing an individual volume is a separate step; see the sketch below)
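Note that `qm resize` only grows the node's storage pool; growing an individual volume is a separate PVC edit, assuming the StorageClass sets `allowVolumeExpansion: true`. A sketch, reusing the hypothetical names from the earlier example:

```yaml
# Sketch: expand an existing volume by raising the PVC request.
# Kubernetes and Longhorn handle the resize; shrinking is not supported.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: money-tracker-db
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-2r
  resources:
    requests:
      storage: 20Gi   # was 10Gi; re-apply the manifest to trigger expansion
```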
## Capacity Planning

### Current Estimate (5 apps, 14-day retention)
| Component | Storage | Notes |
|---|---|---|
| Prometheus (metrics) | 3-5GB | 14 days, 5 apps |
| Loki (logs) | 5-7GB | 14 days, 5 apps (depends on verbosity) |
| Tempo (traces) | 3-5GB | 14 days, sampled traces |
| PostgreSQL (money-tracker) | 5-10GB | Transactions, budgets, user data |
| PostgreSQL (home-portal) | 2-5GB | Bookmarks, configs |
| PostgreSQL (RMS) | 10-20GB | Recipes, images, nutrition data |
| Other apps | 10-15GB | Future applications, uploads |
| **Total** | **48-72GB** | Well within 120GB capacity |
### Headroom
- Usable: 120GB
- Estimated: 60GB (50% utilization)
- Headroom: 60GB (~100% buffer)
### Scaling Options (if needed)
- **Reduce retention**: 14 days → 7 days (saves ~5-10GB)
- **Add NFS StorageClass**: offload old logs to /vault
- **Increase VM disks**: `qm resize` adds +20GB per VM (+60GB total)
- **Add worker node**: VM 204 with 80GB disk (+40GB usable)
## Monitoring & Alerts
To implement in Phase 5:
```yaml
# Grafana alert: Disk usage >80%
- alert: HighDiskUsage
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} is >80% full"
```
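If Prometheus ends up deployed via the Prometheus Operator (e.g., kube-prometheus-stack — not something this ADR specifies), the same rule would be wrapped in a PrometheusRule resource, roughly like this; the resource name and namespace are hypothetical:

```yaml
# Sketch: the alert above as a Prometheus Operator PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-alerts
  namespace: monitoring   # hypothetical namespace
spec:
  groups:
    - name: storage
      rules:
        - alert: HighDiskUsage
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is >80% full"
```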
Manual checks:
```bash
# Check PVC usage
kubectl get pvc -A

# Longhorn UI (after Phase 3)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# http://localhost:8080
```
## Validation
After 3 months, evaluate:

- [ ] Is 120GB sufficient? (target: <70% utilization)
- [ ] Is 14-day retention adequate for debugging?
- [ ] Has Longhorn been stable? (no data loss)
- [ ] Do we need to add an NFS StorageClass?
Success criteria:

- Disk usage stays below 80% over 3 months
- No data-loss events
- Longhorn replication working (survives a node failure)
## Related Decisions
- ADR-001: Multi-Node k3s - Multi-node enables Longhorn replication
- ADR-005: Observability - 14-day retention decision
- Future: "ADR-006: Backup Strategy" - Longhorn snapshots to /vault
## Implementation Notes
Phase 3 (Core Infrastructure):

- Install Longhorn via Helm
- Set the default StorageClass to Longhorn
- Configure 2 replicas for HA
- Set up recurring snapshots (daily; a sketch follows this list)
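The daily snapshot schedule could be declared with Longhorn's RecurringJob CRD (available in Longhorn v1.3+); the job name, schedule, and retention below are illustrative:

```yaml
# Sketch: daily Longhorn snapshot of all volumes in the default group.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"   # every day at 03:00
  task: snapshot       # "backup" would instead target external storage
  retain: 7            # keep a week of snapshots per volume
  concurrency: 2       # snapshot at most 2 volumes at a time
  groups:
    - default          # volumes in the default group get this job
```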
Future Enhancements:
See Tower Fleet Roadmap - Storage for planned storage improvements:
- Longhorn backups to /vault/k8s-backups/ (disaster recovery)
- NFS StorageClass (if needed for large files)
- Volume expansion planning and procedures