# ADR-001: Multi-Node vs Single-Node k3s

Date: 2025-11-09
Status: Accepted
Deciders: User, Claude Code

## Context

Building a k3s cluster on Proxmox for homelab learning and application hosting. Need to decide between:

- Single-node: 1 VM running both the control plane and workloads
- Multi-node: 1 master + 2 workers (3 VMs total)

Goals:

- Primary: learning Kubernetes and cloud-native practices for career development
- Secondary: operational efficiency for homelab applications

Resources available:

- Proxmox host: AMD Ryzen 9 5950X (32 threads, 135GB RAM)
- Currently ~107GB RAM in use, 28GB free
- Multi-node would use 12 cores and 24GB RAM
## Decision

Chosen: Multi-node k3s cluster (1 master + 2 workers)

### Configuration
- VM 201: k3s-master (control plane) - 4 cores, 8GB RAM, 80GB disk
- VM 202: k3s-worker-1 - 4 cores, 8GB RAM, 80GB disk
- VM 203: k3s-worker-2 - 4 cores, 8GB RAM, 80GB disk
Total: 12 cores, 24GB RAM, 240GB disk
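
For reference, provisioning the three VMs from the Proxmox host shell could look roughly like the sketch below. The storage pool (`local-lvm`) and bridge (`vmbr0`) names are assumptions, and OS installation/cloud-init is omitted:

```bash
# Hypothetical provisioning loop on the Proxmox host (adjust storage/bridge).
for id in 201 202 203; do
  if [ "$id" -eq 201 ]; then name=k3s-master; else name=k3s-worker-$((id - 201)); fi
  qm create "$id" --name "$name" --cores 4 --memory 8192 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci --scsi0 local-lvm:80
done
```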
## Rationale

### Why Multi-Node?

- Realistic production experience
    - Multi-node is how k8s runs in production (AWS EKS, GKE, AKS)
    - Learn pod scheduling, node affinity, taints/tolerations
    - Understand failure scenarios (what happens when a node dies?)
- High Availability (HA)
    - Can tolerate a worker node failure
    - Workloads automatically reschedule to healthy nodes
    - Valuable skill: designing for HA is critical in production
- Resource isolation
    - Control plane stays separate from workloads
    - Master isn't affected if a workload consumes all of a worker's resources
    - Mirrors production best practices
- Server resources permit it
    - The Ryzen 9 5950X has 32 threads; 12 cores is only 37.5% of them
    - 24GB RAM is ~18% of the 135GB total
    - Plenty of headroom for growth
- Learning value justifies the cost
    - Experience with multi-node operations such as draining and cordoning nodes (see the sketch after this list)
    - Practice with distributed storage (Longhorn replication)
    - Better resume talking points ("managed a 3-node k8s cluster")
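
The drain/cordon workflow mentioned above is worth making concrete. A minimal sketch, using the node names from the Configuration section:

```bash
# Routine maintenance on one worker (node names per the Configuration section).
kubectl cordon k3s-worker-1        # mark unschedulable; running pods stay put
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance, e.g. reboot the VM from Proxmox...
kubectl uncordon k3s-worker-1      # return the node to the scheduling pool
```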
## Alternatives Considered

### Option A: Single-Node k3s

Configuration:

- 1 VM: 6 cores, 12GB RAM, 100GB disk

Pros:

- ✅ Simpler setup (~2 hours vs 3-4 hours)
- ✅ Lower resource usage (12GB vs 24GB RAM)
- ✅ Easier to maintain (one node to manage)

Cons:

- ❌ No HA; the single node is a single point of failure
- ❌ Less production-like experience
- ❌ Can't learn multi-node operations
- ❌ Workers can be added later, but the migration is work

Why rejected:

- Primary goal is learning, and single-node doesn't teach multi-node operations
- Server has plenty of resources, so capacity isn't a constraint
- Would need to rebuild later for HA experience anyway
### Option B: HA Control Plane (3 masters + 2 workers)

Configuration:

- 3 master VMs (etcd quorum)
- 2 worker VMs
- Total: 5 VMs, 40GB RAM

Pros:

- ✅ True production-grade HA
- ✅ Survives master failure
- ✅ Industry best practice

Cons:

- ❌ Overkill for a homelab
- ❌ More complexity (etcd clustering, leader election)
- ❌ Higher resource usage (40GB RAM)
- ❌ Longer setup time

Why rejected:

- Homelab doesn't need master HA (the VM can be restarted quickly)
- Single-master k3s teaches 90% of the concepts
- Can upgrade to an HA control plane later if desired (see the sketch below)
- Better to start simpler and iterate
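
For completeness, that upgrade path exists: k3s ships an embedded-etcd mode for HA control planes. A rough sketch, where `<master-ip>` and `<node-token>` are placeholders; converting an existing single-server cluster should be checked against the k3s docs first:

```bash
# First server initializes the embedded etcd cluster.
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Additional servers join it; the token is read from
# /var/lib/rancher/k3s/server/node-token on the first server.
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<master-ip>:6443 --token <node-token>
```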
## Consequences

### Positive

- Hands-on HA experience
    - Can kill a worker node and watch pods reschedule
    - Learn how to drain/cordon nodes for maintenance
    - Practice with pod disruption budgets (a sketch follows this list)
- Distributed storage practice
    - Longhorn replicates volumes across nodes
    - Learn about replica placement and disk management
- Realistic pod scheduling
    - See how k8s spreads pods across nodes
    - Learn node selectors, affinity, and anti-affinity
- Resume-worthy
    - "Managed a multi-node k8s cluster" > "ran single-node k3s"
    - Real production concepts
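
To make two of those concepts concrete: a PodDisruptionBudget that keeps at least one replica up during drains, paired with pod anti-affinity that spreads replicas across nodes. This is a hedged sketch; `demo-app` is a hypothetical workload, not anything defined elsewhere in this homelab:

```bash
# Hypothetical 'demo-app': PDB so a drain never evicts the last replica,
# anti-affinity so the two replicas land on different nodes.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: demo-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: demo-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:alpine
EOF
```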
### Negative

- Higher resource usage
    - 24GB RAM vs 12GB (but the server has capacity)
    - 240GB disk vs 100GB
    - An acceptable trade-off for the learning value
- Slightly more complex
    - More VMs to manage
    - More points of failure to monitor
    - Mitigated by automation and monitoring (Phase 5)
- Longer initial setup
    - 3 VMs vs 1 (30 min vs 15 min)
    - A one-time cost; worth it
### Neutral

- Can scale down later
    - If resource-constrained, workers can be removed
    - Flexible architecture
- Can scale up later
    - Easy to add more workers; they just join the cluster (see the sketch below)
    - Or add an HA control plane if desired
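
Joining a worker really is a one-liner in k3s. A sketch with placeholder values; the token comes from `/var/lib/rancher/k3s/server/node-token` on the master:

```bash
# Run on the new worker VM; installs k3s in agent mode and registers
# against the existing master.
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<node-token> sh -
```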
## Validation

After 1 month, evaluate:

- [ ] Did multi-node teach valuable skills?
- [ ] Are resources acceptable? (CPU/RAM usage)
- [ ] Has the cluster been stable?
- [ ] Would single-node have been sufficient?

Success criteria:

- All 3 nodes stay healthy (>95% uptime)
- Can demonstrate HA: kill a worker and pods reschedule (see the sketch below)
- Resource usage <50% of available capacity
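
One way to run that HA demonstration, assuming the VM IDs from the Configuration section; the timings reflect Kubernetes defaults and may differ on a tuned cluster:

```bash
# From a machine with kubectl access:
kubectl get pods -o wide            # note which pods run on k3s-worker-1

# On the Proxmox host, hard-stop that worker (VM 202 = k3s-worker-1):
qm stop 202

# Watch failover: the node goes NotReady after ~40s; pods are evicted and
# rescheduled after the default 5-minute not-ready toleration.
kubectl get nodes --watch
kubectl get pods -o wide --watch
```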
## Related Decisions
- ADR-002: Storage Strategy - Longhorn replication depends on multi-node
- ADR-004: Networking - MetalLB configuration for multi-node
- ADR-005: Observability - Monitor all 3 nodes