# ADR-001: Multi-Node vs Single-Node k3s

Date: 2025-11-09
Status: Accepted
Deciders: User, Claude Code

## Context

Building a k3s cluster on Proxmox for homelab learning and application hosting. Need to decide between:

- Single-node: 1 VM running both the control plane and workloads
- Multi-node: 1 master + 2 workers (3 VMs total)

Goals:

- Primary: learning Kubernetes and cloud-native practices for career development
- Secondary: operational efficiency for homelab applications

Resources available:

- Proxmox host: AMD Ryzen 9 5950X (32 threads, 135GB RAM)
- Currently ~107GB RAM in use, 28GB free
- Multi-node would use 12 cores and 24GB RAM
## Decision

Chosen: Multi-node k3s cluster (1 master + 2 workers)

### Configuration
- VM 201: k3s-master (control plane) - 4 cores, 8GB RAM, 80GB disk
- VM 202: k3s-worker-1 - 4 cores, 8GB RAM, 80GB disk
- VM 203: k3s-worker-2 - 4 cores, 8GB RAM, 80GB disk
Total: 12 cores, 24GB RAM, 240GB disk
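
For reference, provisioning the three VMs from the Proxmox host shell could look roughly like the sketch below. The storage pool (`local-lvm`) and bridge (`vmbr0`) names are assumptions, and OS installation/cloud-init is omitted:

```bash
# Hypothetical provisioning loop on the Proxmox host (adjust storage/bridge).
for id in 201 202 203; do
  if [ "$id" -eq 201 ]; then name=k3s-master; else name=k3s-worker-$((id - 201)); fi
  qm create "$id" --name "$name" --cores 4 --memory 8192 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci --scsi0 local-lvm:80
done
```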
## Rationale

### Why Multi-Node?

- Realistic production experience
    - Multi-node is how k8s runs in production (AWS EKS, GKE, AKS)
    - Learn pod scheduling, node affinity, taints/tolerations
    - Understand failure scenarios (what happens when a node dies?)
- High Availability (HA)
    - Can tolerate a worker node failure
    - Workloads automatically reschedule to healthy nodes
    - Valuable skill: designing for HA is critical in production
- Resource isolation
    - Control plane stays separate from workloads
    - Master isn't affected if a workload consumes all of a worker's resources
    - Mirrors production best practices
- Server resources permit it
    - The Ryzen 9 5950X has 32 threads; 12 cores is only 37.5% of them
    - 24GB RAM is ~18% of the 135GB total
    - Plenty of headroom for growth
- Learning value justifies the cost
    - Experience with multi-node operations such as draining and cordoning nodes (see the sketch after this list)
    - Practice with distributed storage (Longhorn replication)
    - Better resume talking points ("managed a 3-node k8s cluster")
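
The drain/cordon workflow mentioned above is worth making concrete. A minimal sketch, using the node names from the Configuration section:

```bash
# Routine maintenance on one worker (node names per the Configuration section).
kubectl cordon k3s-worker-1        # mark unschedulable; running pods stay put
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance, e.g. reboot the VM from Proxmox...
kubectl uncordon k3s-worker-1      # return the node to the scheduling pool
```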
## Alternatives Considered

### Option A: Single-Node k3s

Configuration:

- 1 VM: 6 cores, 12GB RAM, 100GB disk

Pros:

- ✅ Simpler setup (~2 hours vs 3-4 hours)
- ✅ Lower resource usage (12GB vs 24GB RAM)
- ✅ Easier to maintain (one node to manage)

Cons:

- ❌ No HA; the single node is a single point of failure
- ❌ Less production-like experience
- ❌ Can't learn multi-node operations
- ❌ Workers can be added later, but the migration is work

Why rejected:

- Primary goal is learning, and single-node doesn't teach multi-node operations
- Server has plenty of resources, so capacity isn't a constraint
- Would need to rebuild later for HA experience anyway
### Option B: HA Control Plane (3 masters + 2 workers)

Configuration:

- 3 master VMs (etcd quorum)
- 2 worker VMs
- Total: 5 VMs, 40GB RAM

Pros:

- ✅ True production-grade HA
- ✅ Survives master failure
- ✅ Industry best practice

Cons:

- ❌ Overkill for a homelab
- ❌ More complexity (etcd clustering, leader election)
- ❌ Higher resource usage (40GB RAM)
- ❌ Longer setup time

Why rejected:

- Homelab doesn't need master HA (the VM can be restarted quickly)
- Single-master k3s teaches 90% of the concepts
- Can upgrade to an HA control plane later if desired (see the sketch below)
- Better to start simpler and iterate
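
For completeness, that upgrade path exists: k3s ships an embedded-etcd mode for HA control planes. A rough sketch, where `<master-ip>` and `<node-token>` are placeholders; converting an existing single-server cluster should be checked against the k3s docs first:

```bash
# First server initializes the embedded etcd cluster.
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Additional servers join it; the token is read from
# /var/lib/rancher/k3s/server/node-token on the first server.
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://<master-ip>:6443 --token <node-token>
```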
## Consequences

### Positive

- Hands-on HA experience
    - Can kill a worker node and watch pods reschedule
    - Learn how to drain/cordon nodes for maintenance
    - Practice with pod disruption budgets (a sketch follows this list)
- Distributed storage practice
    - Longhorn replicates volumes across nodes
    - Learn about replica placement and disk management
- Realistic pod scheduling
    - See how k8s spreads pods across nodes
    - Learn node selectors, affinity, and anti-affinity
- Resume-worthy
    - "Managed a multi-node k8s cluster" > "ran single-node k3s"
    - Real production concepts
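
To make two of those concepts concrete: a PodDisruptionBudget that keeps at least one replica up during drains, paired with pod anti-affinity that spreads replicas across nodes. This is a hedged sketch; `demo-app` is a hypothetical workload, not anything defined elsewhere in this homelab:

```bash
# Hypothetical 'demo-app': PDB so a drain never evicts the last replica,
# anti-affinity so the two replicas land on different nodes.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: demo-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: demo-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:alpine
EOF
```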
### Negative

- Higher resource usage
    - 24GB RAM vs 12GB (but the server has capacity)
    - 240GB disk vs 100GB
    - An acceptable trade-off for the learning value
- Slightly more complex
    - More VMs to manage
    - More points of failure to monitor
    - Mitigated by automation and monitoring (Phase 5)
- Longer initial setup
    - 3 VMs vs 1 (30 min vs 15 min)
    - A one-time cost; worth it
### Neutral

- Can scale down later
    - If resource-constrained, workers can be removed
    - Flexible architecture
- Can scale up later
    - Easy to add more workers; they just join the cluster (see the sketch below)
    - Or add an HA control plane if desired
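
Joining a worker really is a one-liner in k3s. A sketch with placeholder values; the token comes from `/var/lib/rancher/k3s/server/node-token` on the master:

```bash
# Run on the new worker VM; installs k3s in agent mode and registers
# against the existing master.
curl -sfL https://get.k3s.io | \
  K3S_URL=https://<master-ip>:6443 K3S_TOKEN=<node-token> sh -
```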
## Validation

After 1 month, evaluate:

- [ ] Did multi-node teach valuable skills?
- [ ] Are resources acceptable? (CPU/RAM usage)
- [ ] Has the cluster been stable?
- [ ] Would single-node have been sufficient?

Success criteria:

- All 3 nodes stay healthy (>95% uptime)
- Can demonstrate HA: kill a worker and pods reschedule (see the sketch below)
- Resource usage <50% of available capacity
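
One way to run that HA demonstration, assuming the VM IDs from the Configuration section; the timings reflect Kubernetes defaults and may differ on a tuned cluster:

```bash
# From a machine with kubectl access:
kubectl get pods -o wide            # note which pods run on k3s-worker-1

# On the Proxmox host, hard-stop that worker (VM 202 = k3s-worker-1):
qm stop 202

# Watch failover: the node goes NotReady after ~40s; pods are evicted and
# rescheduled after the default 5-minute not-ready toleration.
kubectl get nodes --watch
kubectl get pods -o wide --watch
```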
## Related Decisions
- ADR-002: Storage Strategy - Longhorn replication depends on multi-node
- ADR-004: Networking - MetalLB configuration for multi-node
- ADR-005: Observability - Monitor all 3 nodes