
ADR-001: Multi-Node vs Single-Node k3s

Date: 2025-11-09
Status: Accepted
Deciders: User, Claude Code


Context

Building a k3s cluster on Proxmox for homelab learning and application hosting. Need to decide between:

  • Single-node: 1 VM running both control plane and workloads
  • Multi-node: 1 master + 2 workers (3 VMs total)

Goals:

  • Primary: learning Kubernetes and cloud-native practices for career development
  • Secondary: operational efficiency for homelab applications

Resources available:

  • Proxmox host: AMD Ryzen 9 5950X (32 threads, 135GB RAM)
  • Currently ~107GB RAM in use, 28GB free
  • Multi-node would use: 12 cores, 24GB RAM


Decision

Chosen: Multi-node k3s cluster (1 master + 2 workers)

Configuration

  • VM 201: k3s-master (control plane) - 4 cores, 8GB RAM, 80GB disk
  • VM 202: k3s-worker-1 - 4 cores, 8GB RAM, 80GB disk
  • VM 203: k3s-worker-2 - 4 cores, 8GB RAM, 80GB disk

Total: 12 cores, 24GB RAM, 240GB disk
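
For concreteness, here is a minimal sketch of how this topology could be bootstrapped with the k3s install script; the master IP (10.0.0.201) is a placeholder, and the token path is k3s's default:

```bash
# On VM 201 (k3s-master): install the k3s server (control plane)
curl -sfL https://get.k3s.io | sh -

# Still on the master: read the join token the workers will need
sudo cat /var/lib/rancher/k3s/server/node-token

# On VMs 202 and 203: join the cluster as agents (workers)
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.201:6443 K3S_TOKEN=<token> sh -
```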


Rationale

Why Multi-Node?

  1. Realistic production experience
     • Multi-node is how k8s runs in production (AWS EKS, GKE, AKS)
     • Learn pod scheduling, node affinity, taints/tolerations
     • Understand failure scenarios (what happens when a node dies?)

  2. High availability (HA)
     • Can tolerate a worker node failure
     • Workloads automatically reschedule to healthy nodes
     • Valuable skill: designing for HA is critical in production

  3. Resource isolation
     • Control plane stays separate from workloads
     • Master isn't affected if a workload consumes all of its node's resources
     • Mirrors production best practices

  4. Server resources permit it
     • Ryzen 9 5950X has 32 threads; 12 cores is only 37%
     • 24GB RAM is 18% of the 135GB total
     • Plenty of headroom for growth

  5. Learning value justifies the cost
     • Experience with multi-node operations such as draining and cordoning nodes (see the sketch after this list)
     • Practice with distributed storage (Longhorn replication)
     • Better resume talking points ("managed a 3-node k8s cluster")
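
The drain/cordon workflow from point 5 looks like this in practice; a minimal sketch, with node names matching the VMs defined above:

```bash
# Mark a worker unschedulable so no new pods land on it
kubectl cordon k3s-worker-1

# Evict its pods for maintenance (DaemonSet pods are skipped; emptyDir data is discarded)
kubectl drain k3s-worker-1 --ignore-daemonsets --delete-emptydir-data

# Return the node to service once maintenance is done
kubectl uncordon k3s-worker-1
```

Note that drain honors PodDisruptionBudgets, which come up again under Consequences below.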

Alternatives Considered

Option A: Single-Node k3s

Configuration:

  • 1 VM: 6 cores, 12GB RAM, 100GB disk

Pros:

  • ✅ Simpler setup (~2 hours vs 3-4 hours)
  • ✅ Lower resource usage (12GB vs 24GB RAM)
  • ✅ Easier to maintain (one node to manage)

Cons:

  • ❌ No HA; the single node is a single point of failure
  • ❌ Less production-like experience
  • ❌ Can't learn multi-node operations
  • ❌ Workers can be added later, but the migration is extra work

Why rejected:

  • Primary goal is learning; single-node doesn't teach multi-node operations
  • Server has plenty of resources, so capacity isn't a constraint
  • Would need to rebuild later for HA experience anyway


Option B: HA Control Plane (3 masters + 2 workers)

Configuration:

  • 3 master VMs (etcd quorum)
  • 2 worker VMs
  • Total: 5 VMs, 40GB RAM

Pros:

  • ✅ True production-grade HA
  • ✅ Survives a master failure
  • ✅ Industry best practice

Cons:

  • ❌ Overkill for a homelab
  • ❌ More complexity (etcd clustering, leader election)
  • ❌ Higher resource usage (40GB RAM)
  • ❌ Longer setup time

Why rejected:

  • Homelab doesn't need master HA (the VM can be restarted quickly)
  • Single-master k3s teaches 90% of the concepts
  • Can upgrade to an HA control plane later if desired
  • Better to start simpler and iterate
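
Should the HA control plane ever be revisited, k3s supports clustering its embedded etcd; a rough sketch, with the server IP and token as placeholders:

```bash
# First server: initialize the embedded etcd cluster
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Remaining servers (an odd total, e.g. 3, preserves etcd quorum): join the first
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.0.0.201:6443 --token <token>
```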


Consequences

Positive

  1. Hands-on HA experience
     • Can kill a worker node and watch pods reschedule
     • Learn how to drain/cordon nodes for maintenance
     • Practice with pod disruption budgets (see the sketch after this list)

  2. Distributed storage practice
     • Longhorn replicates volumes across nodes
     • Learn about replica placement and disk management

  3. Realistic pod scheduling
     • See how k8s spreads pods across nodes
     • Learn node selectors, affinity, and anti-affinity

  4. Resume-worthy
     • "Managed a multi-node k8s cluster" > "ran single-node k3s"
     • Covers real production concepts
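
Tying points 1 and 3 together, here is a hedged sketch (the app name and image are hypothetical) of a 2-replica Deployment spread across the workers via pod anti-affinity, plus a PodDisruptionBudget so a drain never evicts both replicas at once:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      affinity:
        podAntiAffinity:
          # Require each replica to land on a different node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: demo
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
---
# Keep at least one replica running during voluntary disruptions (e.g. kubectl drain)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: demo
EOF
```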

Negative

  1. Higher resource usage
     • 24GB RAM vs 12GB (but the server has capacity)
     • 240GB disk vs 100GB
     • An acceptable trade-off for the learning value

  2. Slightly more complex
     • More VMs to manage
     • More points of failure to monitor
     • Mitigated by automation and monitoring (Phase 5)

  3. Longer initial setup
     • 3 VMs vs 1 (30 min vs 15 min)
     • A one-time cost; worth it

Neutral

  1. Can scale down later
     • If resource-constrained, workers can be removed
     • Flexible architecture

  2. Can scale up later
     • Easy to add more workers: they join with the same agent command as in the bootstrap sketch above
     • Or add an HA control plane later, as sketched under Option B

Validation

After 1 month, evaluate:

  • [ ] Did multi-node teach valuable skills?
  • [ ] Are resources acceptable? (CPU/RAM usage)
  • [ ] Has the cluster been stable?
  • [ ] Would single-node have been sufficient?

Success criteria:

  • All 3 nodes stay healthy (>95% uptime)
  • Can demonstrate HA (kill a worker; pods reschedule)
  • Resource usage stays <50% of available capacity
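
The HA demonstration in the success criteria can be run from the Proxmox host using the VM IDs from the configuration above; a sketch:

```bash
# Hard-stop a worker VM to simulate node failure
qm stop 202

# Watch the node go NotReady and its pods reschedule onto the surviving worker
kubectl get nodes -w
kubectl get pods -A -o wide -w

# Bring the worker back afterwards
qm start 202
```

Note that with default tolerations, pods on the dead node are only evicted and rescheduled roughly five minutes after it is marked NotReady, so the demo is not instant.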


