
Understanding Kubernetes

Last Updated: 2025-11-09

This guide explains what Kubernetes is, how it works, and specifically how it operates in our Tower Fleet homelab environment.


Table of Contents

  1. What is Kubernetes?
  2. k3s vs k8s
  3. Cluster Architecture Deep Dive
  4. Scaling the Cluster
  5. Networking Explained
  6. Authentication & Security
  7. How Requests Flow
  8. Storage Architecture
  9. What Makes Our Setup Easy
  10. What Changes for Multi-Site
  11. Practical Examples

What is Kubernetes?

Kubernetes (k8s) is a container orchestration platform that automates deploying, scaling, and managing containerized applications across a cluster of machines.

Core Concepts

Container Orchestration means:

  • Automatically scheduling containers onto available machines
  • Restarting failed containers
  • Scaling up/down based on demand
  • Load balancing traffic across containers
  • Managing storage, networking, and secrets

Key Building Blocks:

  • Pod: Smallest deployable unit - one or more containers that share network/storage
  • Deployment: Defines desired state (e.g., "run 3 replicas of nginx")
  • Service: Stable network endpoint to access pods (even as they move/restart)
  • Node: A physical or virtual machine in the cluster
  • Cluster: A set of nodes working together
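
These building blocks fit together in a short manifest. The following is a sketch of a hypothetical nginx app (all names illustrative), declaring a Deployment and a Service:

```yaml
# app.yaml (illustrative): a Deployment declares desired state,
# a Service gives the pods a stable endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                # desired state: three pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx               # routes to any pod with this label
  ports:
    - port: 80
```

Applying it with kubectl apply -f app.yaml hands the desired state to the cluster; k8s does the rest.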

Why Container Orchestration Matters

Without k8s:

  • Manually SSH to servers to deploy apps
  • Write custom scripts for health checks and restarts
  • Hard to scale (need to manually distribute load)
  • No standardization (every app deployed differently)

With k8s:

  • Declare desired state in YAML (kubectl apply -f app.yaml)
  • k8s ensures current state matches desired state
  • Self-healing (crashed pods restart automatically)
  • Standardized (all apps deployed the same way)
  • Portable (same YAML works on any k8s cluster)


k3s vs k8s

What is k3s?

k3s is a lightweight, certified Kubernetes distribution created by Rancher (now SUSE). It's a full k8s implementation, optimized for:

  • Edge computing
  • IoT devices
  • CI/CD environments
  • Homelabs (our use case!)

Key Differences

Feature             Kubernetes (k8s)                k3s
------------------  ------------------------------  --------------------------------
Binary size         ~1GB+ (multiple binaries)       ~70MB (single binary)
Memory footprint    ~1-2GB per node                 ~512MB per node
Installation        Complex (kubeadm, many steps)   One command
Default storage     None (must install)             local-path provisioner included
Default ingress     None (must install)             Traefik included
Database            Requires etcd cluster           Embedded SQLite (or etcd for HA)
Cloud integrations  Many built-in                   Removed (reduces bloat)
ARM support         Limited                         Excellent (Raspberry Pi ready)
Certification       N/A (reference impl)            100% certified Kubernetes

What k3s Removed

  • Legacy/alpha features
  • Cloud provider integrations (AWS, GCP, Azure)
  • In-tree storage drivers (uses CSI instead)
  • Non-essential admission controllers

Important: k3s is fully certified Kubernetes. Any app that runs on k8s will run on k3s. It's not a fork or subset - it's the same API.

Why We Chose k3s for Homelab

  1. Resource efficient: Leaves more RAM/CPU for actual workloads
  2. Easy installation: Single command vs hours of kubeadm troubleshooting
  3. Batteries included: Storage and ingress work out of the box
  4. Perfect for learning: Same k8s APIs you'll use professionally
  5. Great documentation: Rancher maintains excellent guides

Cluster Architecture Deep Dive

A Kubernetes cluster has two types of nodes:

Control Plane (Master) Node

The "brain" of the cluster. Runs these critical components:

1. kube-apiserver (The API Server)

  • What it does: Central management point - every operation goes through here
  • Who talks to it: kubectl, web dashboards, other control plane components, node agents
  • Port: 6443 (HTTPS)
  • Example: When you run kubectl get pods, you're talking to the API server

2. etcd (Distributed Database)

  • What it does: Stores all cluster state (what pods exist, what their status is, etc.)
  • Type: Distributed key-value store with strong consistency
  • Critical: If etcd data is lost, cluster state is lost
  • In k3s: Can use embedded SQLite (single-master) or embedded etcd (multi-master)
  • Port: 2379-2380

3. kube-scheduler (The Scheduler)

  • What it does: Decides which worker node should run each new pod
  • Factors it considers:
      • Available CPU/memory on nodes
      • Node affinity rules (e.g., "must run on SSD nodes")
      • Resource requests from pods
      • Taints and tolerations
  • Example: You create a deployment → scheduler assigns pods to workers

4. kube-controller-manager (The Control Loop)

  • What it does: Runs controllers that maintain desired state
  • Controllers include:
      • Deployment controller: Ensures correct number of replicas
      • ReplicaSet controller: Creates/deletes pods to match desired count
      • Node controller: Detects when nodes go down
      • Service controller: Creates load balancers, updates endpoints
  • Pattern: Continuously loops asking "is current state = desired state?"
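
The control-loop pattern can be sketched in a few lines of shell. This is a toy reconciler, not real controller code; desired and current state here are just counters:

```shell
# Toy reconcile loop: drive current state toward desired state.
desired=3   # e.g., a Deployment asks for 3 replicas
current=0   # e.g., 0 pods exist right now

reconcile() {
  # A real controller would create or delete one pod via the API here.
  current=$((current + 1))
}

while [ "$current" -lt "$desired" ]; do
  reconcile
done
echo "current=$current desired=$desired"   # current=3 desired=3
```

Real controllers run this loop forever, re-checking state on every change event rather than exiting once they converge.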

5. cloud-controller-manager (Cloud Integrations)

  • What it does: Talks to cloud provider APIs (AWS, GCP, Azure)
  • In k3s: Not included (we don't need it for bare-metal)

Worker Nodes

Run your actual application workloads. Components:

1. kubelet (The Node Agent)

  • What it does: Agent on every node (including master in our case)
  • Responsibilities:
      • Watches API server for pods assigned to this node
      • Starts/stops containers via container runtime
      • Reports pod status back to API server
      • Performs health checks on pods
  • Port: 10250

2. kube-proxy (Network Proxy)

  • What it does: Manages network rules for pod-to-pod and pod-to-service communication
  • How it works: Updates iptables rules to route traffic to correct pods
  • Example: Service has IP 10.43.0.10 → kube-proxy routes to backend pods

3. Container Runtime (Runs Containers)

  • What it does: Actually starts/stops containers
  • In k3s: Uses containerd (industry standard)
  • Interface: CRI (Container Runtime Interface) - standardized API

Our Cluster Layout

┌─────────────────────────────────────────────────────────────┐
│ k3s-master (10.89.97.201)                                   │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Control Plane                                           │ │
│ │ - kube-apiserver (port 6443)                           │ │
│ │ - embedded SQLite datastore                             │ │
│ │ - kube-scheduler                                       │ │
│ │ - kube-controller-manager                              │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components (master also runs workloads)         │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ k3s-worker-1 (10.89.97.202)                                 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components Only                                  │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ k3s-worker-2 (10.89.97.203)                                 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components Only                                  │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Note: In our setup, the master node also runs workloads (no taints), so it has both control plane AND worker components.


Scaling the Cluster

Adding Worker Nodes

Easy! Workers just need to contact the master and provide a token.

# On the NEW worker node (e.g., 10.89.97.204)

# Get the token from master first
ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token'
# Example output: K10abc123def456...::server:xyz789

# Install k3s as agent (worker)
curl -sfL https://get.k3s.io | \
  K3S_URL=https://10.89.97.201:6443 \
  K3S_TOKEN=K10abc123def456...::server:xyz789 \
  sh -

# Verify on master
kubectl get nodes

That's it! The new node:

  1. Connects to master:6443
  2. Authenticates with the token
  3. kubelet registers with the API server
  4. Scheduler starts assigning pods to it

Requirements:

  • Network access to the master on port 6443
  • Shared token (stored in /var/lib/rancher/k3s/server/node-token)

Adding Master Nodes (HA Control Plane)

More complex - requires etcd quorum (odd number: 3, 5, 7 masters).

# First master (already done)
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Second master
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# Third master
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

Why odd numbers? etcd needs quorum (majority) to agree on state:

  • 3 masters = tolerates 1 failure (2/3 quorum)
  • 5 masters = tolerates 2 failures (3/5 quorum)
  • 2 masters = tolerates 0 failures (not HA!)

When do you need HA masters?

  • Production environments (can't afford downtime)
  • Critical infrastructure
  • Not needed for homelab (we can tolerate restarts)
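
The quorum arithmetic can be checked with two small shell functions (a sketch, not part of any k3s tooling):

```shell
# quorum: majority of N members; tolerated: how many failures N can survive.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 2 3 5; do
  echo "$n masters: quorum=$(quorum $n), tolerates $(tolerated $n) failure(s)"
done
```

Note that going from 2 to 3 masters adds fault tolerance, but going from 3 to 4 does not — which is why the counts are always odd.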

Removing Nodes

# Drain node (move pods elsewhere)
kubectl drain k3s-worker-2 --ignore-daemonsets --delete-emptydir-data

# Delete from cluster
kubectl delete node k3s-worker-2

# On the node itself, uninstall k3s
ssh root@10.89.97.203 '/usr/local/bin/k3s-agent-uninstall.sh'

Networking Explained

Kubernetes networking can be confusing because there are three separate networks:

1. Node Network (Physical/VM Network)

Our case: 10.89.97.0/24

  • 10.89.97.201 - k3s-master
  • 10.89.97.202 - k3s-worker-1
  • 10.89.97.203 - k3s-worker-2

Purpose: Nodes communicate with each other
Requirement: All nodes must be able to reach each other

2. Pod Network (Overlay Network)

Our case: 10.42.0.0/16 (k3s default)

  • 10.42.0.x - Pods on master
  • 10.42.1.x - Pods on worker-1
  • 10.42.2.x - Pods on worker-2

Purpose: Every pod gets a unique IP

How it works:

  • CNI plugin (Flannel in k3s) creates an overlay network
  • Pods on different nodes can talk directly via these IPs
  • Flannel uses VXLAN tunnels to route traffic between nodes

Example:

Pod A on worker-1 (10.42.1.5) → Pod B on worker-2 (10.42.2.8)

1. Pod A sends packet to 10.42.2.8
2. Flannel on worker-1 sees destination is on worker-2
3. Encapsulates packet in VXLAN tunnel to 10.89.97.203
4. worker-2 receives, de-encapsulates, delivers to Pod B

3. Service Network (Virtual IPs)

Our case: 10.43.0.0/16 (k3s default)

Purpose: Stable IPs for accessing groups of pods

  • Pods come and go (deployments, restarts, scaling)
  • Services provide a stable IP that load balances to healthy pods

Types:

  • ClusterIP: Only accessible within the cluster (default)
  • NodePort: Accessible on every node's IP at a specific port
  • LoadBalancer: Gets an external IP (needs MetalLB in our case)

How it works:

  • Service gets a virtual IP (e.g., 10.43.0.10)
  • kube-proxy on each node creates iptables rules
  • Traffic to 10.43.0.10 is load-balanced to backend pods
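
A minimal ClusterIP Service manifest, as a sketch (the api name and ports are illustrative):

```yaml
# Stable virtual IP (allocated from 10.43.0.0/16) in front of matching pods.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP        # default; NodePort and LoadBalancer are the other types
  selector:
    app: api             # traffic goes to ready pods carrying this label
  ports:
    - port: 80           # the Service port clients call
      targetPort: 8080   # the container port on the backend pods
```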

Same Network (Our Case)

Why it's easy:

  • All nodes on 10.89.97.0/24 can directly talk
  • No NAT, no firewall rules needed between nodes
  • Flannel just needs to know node IPs (which it gets from the k8s API)

Required ports between nodes:

  • 6443: API server (workers → master)
  • 10250: kubelet API (master → workers for logs, exec)
  • 8472: VXLAN for Flannel (pod traffic between nodes)

Different Networks/Machines

Scenario: Master on home network (192.168.1.0/24), workers on VPS (different ISP)

Challenges:

  1. Connectivity: Nodes can't directly reach each other
  2. Latency: High latency breaks etcd (needs <10ms for HA masters)
  3. Firewalls: Need to open/forward ports

Solutions:

Option 1: VPN Mesh (Recommended)

  • Use Tailscale or WireGuard
  • Creates an overlay network (e.g., 100.64.0.0/10)
  • All nodes get IPs on the same VPN subnet
  • From the k3s perspective, it looks like the same network!

# Install Tailscale on all nodes
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

# Nodes get IPs like 100.64.1.1, 100.64.1.2, etc.
# Use these IPs for k3s installation

# On master
curl -sfL https://get.k3s.io | sh -s - server --node-ip 100.64.1.1

# On worker
curl -sfL https://get.k3s.io | \
  K3S_URL=https://100.64.1.1:6443 \
  K3S_TOKEN=xxx \
  sh -s - --node-ip 100.64.1.2

Option 2: Public IPs + Firewall

  • Use public IPs for node communication
  • Lock down with firewall rules (only allow specific node IPs)
  • Not recommended for multi-master (latency + security)

Option 3: Separate Clusters

  • Run an independent cluster per site
  • Use tools like Kubefed to federate clusters
  • More complex but better for geo-distributed

DNS in the Cluster

CoreDNS runs as a pod and provides DNS for the cluster:

  • Services get DNS names: <service-name>.<namespace>.svc.cluster.local
  • Example: nginx.default.svc.cluster.local → resolves to service IP
  • Pods can use short names within same namespace: curl http://nginx
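
The naming scheme is mechanical, so a one-line helper can spell it out (a sketch; assumes the default cluster.local domain):

```shell
# Build the in-cluster DNS name for a Service: <name>.<namespace>.svc.<domain>
svc_fqdn() { printf '%s.%s.svc.cluster.local\n' "$1" "$2"; }

svc_fqdn nginx default        # nginx.default.svc.cluster.local
svc_fqdn grafana monitoring   # grafana.monitoring.svc.cluster.local
```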

Authentication & Security

Node-to-Master Authentication

How workers authenticate to master:

  1. During join: Worker provides K3S_TOKEN from /var/lib/rancher/k3s/server/node-token
  2. After join: Worker uses client certificate issued by k3s CA
  3. Ongoing: All communication is mTLS (mutual TLS)

The token is secret! Anyone with the token can join the cluster.

To rotate the token: newer k3s releases ship a k3s token rotate command; after rotation, existing nodes keep their client certificates and stay joined.

kubectl Authentication

How kubectl authenticates:

Reads ~/.kube/config (kubeconfig):

clusters:
  - name: default
    cluster:
      server: https://10.89.97.201:6443
      certificate-authority-data: <base64-encoded-ca-cert>

users:
  - name: default
    user:
      client-certificate-data: <base64-encoded-client-cert>
      client-key-data: <base64-encoded-client-key>

contexts:
  - name: default
    context:
      cluster: default
      user: default

Flow:

  1. kubectl presents its client certificate to the API server
  2. API server verifies the cert is signed by the cluster CA
  3. The certificate's CN (Common Name) is the username and its O (Organization) the group; RBAC maps these to permissions
  4. The default k3s kubeconfig uses an admin-level certificate

Pod-to-API Authentication

Service Accounts:

  • Every pod gets a ServiceAccount (default: the default SA in its namespace)
  • The ServiceAccount has a token auto-mounted at /var/run/secrets/kubernetes.io/serviceaccount/token
  • Apps can use this token to talk to the k8s API

RBAC (Role-Based Access Control):

  • Define Roles (what permissions: list pods, create deployments, etc.)
  • Bind Roles to ServiceAccounts
  • Pods run as a ServiceAccount → get those permissions

Example: Prometheus needs to list pods/nodes to scrape metrics → create ServiceAccount with read-only RBAC
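
That Prometheus example translates into three small manifests (illustrative names): a ServiceAccount, a read-only ClusterRole, and the binding between them:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# Read-only access to the objects a scraper needs to discover targets.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-read
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-read
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

Any pod running as that ServiceAccount can list pods and nodes, and nothing else.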

Encryption

All cluster traffic is encrypted:

  • API server ↔ etcd: TLS
  • kubectl ↔ API server: TLS
  • kubelet ↔ API server: mTLS
  • Pod network: Not encrypted by default (use a service mesh like Istio for pod-to-pod encryption)

Cross-Network Security

Same security model regardless of network:

  • All auth is based on certificates/tokens, not network location
  • Firewalls should still restrict access to port 6443 (only allow known IPs)
  • Use a VPN for multi-site (adds network-level security)

Best practices:

  • Don't expose the API server to the public internet
  • Rotate tokens periodically
  • Use least-privilege RBAC
  • Enable audit logging (k3s: --kube-apiserver-arg=audit-log-path=/var/log/k3s-audit.log)


How Requests Flow

Example 1: kubectl get pods

┌──────────────┐
│ Your Laptop  │
│ $ kubectl    │
│   get pods   │
└──────┬───────┘
       │ 1. Read ~/.kube/config
       │    - API server: https://10.89.97.201:6443
       │    - Client cert for authentication
┌──────────────────────────────────────┐
│ k3s-master (10.89.97.201)            │
│                                      │
│  2. TLS connection to port 6443      │
│     ┌─────────────────────────────┐  │
│     │ kube-apiserver              │  │
│     │                             │  │
│     │ 3. Verify client cert       │  │
│     │ 4. Check RBAC permissions   │  │
│     └──────────┬──────────────────┘  │
│                │                     │
│                ▼                     │
│     ┌─────────────────────────────┐  │
│     │ etcd                        │  │
│     │                             │  │
│     │ 5. Query for pod objects    │  │
│     │    in all namespaces        │  │
│     └──────────┬──────────────────┘  │
│                │                     │
└────────────────┼─────────────────────┘
          6. Return pod list
             (JSON format)
         7. kubectl formats
            as table and prints

Timeline: Typically 10-50ms

Example 2: Create a Deployment

┌──────────────────────────────────────────┐
│ $ kubectl create deployment nginx        │
│     --image=nginx --replicas=3           │
└──────┬───────────────────────────────────┘
       │ 1. kubectl sends Deployment object
       │    to API server
┌────────────────────────────────────────────────────┐
│ kube-apiserver                                     │
│  2. Validate Deployment spec                       │
│  3. Store in etcd                                  │
│  4. Return success to kubectl                      │
└────────────────────────────────────────────────────┘
       │ (Controllers are watching for new objects)
┌────────────────────────────────────────────────────┐
│ kube-controller-manager                            │
│                                                    │
│  Deployment Controller sees new Deployment:        │
│  5. Create a ReplicaSet (desired: 3 replicas)      │
│     → stores in etcd via API                       │
│                                                    │
│  ReplicaSet Controller sees new ReplicaSet:        │
│  6. Create 3 Pod objects                           │
│     → stores in etcd via API                       │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ kube-scheduler                                     │
│                                                    │
│  7. Watches for unscheduled pods                   │
│  8. For each pod:                                  │
│     - Score all nodes (available resources)        │
│     - Pick best node                               │
│     - Update pod.spec.nodeName in etcd             │
│                                                    │
│  Result:                                           │
│  - Pod 1 → k3s-worker-1                            │
│  - Pod 2 → k3s-worker-2                            │
│  - Pod 3 → k3s-master                              │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ kubelet (on each node)                             │
│                                                    │
│  9. Watches API for pods assigned to this node     │
│ 10. Sees new pod assigned                          │
│ 11. Tells containerd to pull image & start         │
│ 12. Reports pod status back to API server          │
└────────────────────────────────────────────────────┘
    Pod running!

Timeline: Typically 5-30 seconds (depends on image size)

Example 3: Pod-to-Pod Communication (Different Nodes)

Scenario: nginx pod on worker-1 calls API pod on worker-2

┌─────────────────────────────────────────────────────┐
│ k3s-worker-1 (10.89.97.202)                         │
│                                                     │
│  ┌──────────────────────────────┐                   │
│  │ nginx pod (10.42.1.5)        │                   │
│  │                              │                   │
│  │ 1. curl http://api.default.  │                   │
│  │         svc.cluster.local    │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ CoreDNS (on some node)       │                   │
│  │                              │                   │
│  │ 2. DNS query                 │                   │
│  │    Returns: 10.43.0.10       │                   │
│  │    (Service IP)              │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  3. Send packet to 10.43.0.10                       │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ kube-proxy (iptables rules)  │                   │
│  │                              │                   │
│  │ 4. Intercept packet to       │                   │
│  │    10.43.0.10                │                   │
│  │ 5. Load balance to backend:  │                   │
│  │    → 10.42.2.8 (on worker-2) │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
└─────────────────┼───────────────────────────────────┘
   6. Destination 10.42.2.8 is on different node
┌─────────────────┼───────────────────────────────────┐
│                 │                                   │
│  ┌──────────────▼───────────────┐                   │
│  │ Flannel CNI (worker-1)       │                   │
│  │                              │                   │
│  │ 7. Encapsulate in VXLAN:     │                   │
│  │    Outer: 10.89.97.202 →     │                   │
│  │           10.89.97.203       │                   │
│  │    Inner: 10.42.1.5 →        │                   │
│  │           10.42.2.8          │                   │
│  └──────────────┬───────────────┘                   │
└─────────────────┼───────────────────────────────────┘
                  │ 8. Send over node network
                  │    (10.89.97.202 → 10.89.97.203)
┌─────────────────────────────────────────────────────┐
│ k3s-worker-2 (10.89.97.203)                         │
│                                                     │
│  ┌──────────────────────────────┐                   │
│  │ Flannel CNI (worker-2)       │                   │
│  │                              │                   │
│  │ 9. Receive VXLAN packet      │                   │
│  │ 10. De-encapsulate           │                   │
│  │ 11. Route to 10.42.2.8       │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ api pod (10.42.2.8)          │                   │
│  │                              │                   │
│  │ 12. Receive HTTP request     │                   │
│  │ 13. Process and respond      │                   │
│  └──────────────────────────────┘                   │
└─────────────────────────────────────────────────────┘
         (Response follows reverse path)

Key Points:

  • Pods use service DNS names, not IPs
  • kube-proxy translates the Service IP to pod IPs
  • Flannel handles routing between nodes
  • All transparent to the application (just an HTTP call)


Storage Architecture

Understanding storage in Kubernetes is critical because it's separate from both cluster state (etcd/SQLite) and compute resources.

Two Types of Storage

1. Cluster State Storage (etcd/SQLite)

What it stores: Kubernetes objects and configuration

  • Pod definitions
  • Deployments, Services, ConfigMaps
  • RBAC rules
  • Cluster settings

Size: Typically <1GB (just metadata, not application data)

Current setup: SQLite on the master VM

  • Location: /var/lib/rancher/k3s/server/db/state.db
  • Replicas: None (single master)
  • HA option: Use --cluster-init for embedded etcd across 3+ masters

Important: This is NOT where your application data lives!

2. Application Data Storage (Persistent Volumes)

What it stores: Actual application data

  • Database files (PostgreSQL, MySQL)
  • User uploads
  • Application state
  • Logs and metrics

Size: Can be TBs of data

Managed by: Storage providers (Longhorn, NFS, Ceph, cloud providers)


Kubernetes Storage Concepts

PersistentVolume (PV)

  • A piece of storage in the cluster (e.g., 10GB disk)
  • Created by admin or dynamically provisioned
  • Independent lifecycle (survives pod deletion)

PersistentVolumeClaim (PVC)

  • A request for storage by a pod
  • "I need 10GB of SSD storage"
  • Gets bound to a matching PV

StorageClass

  • Defines "types" of storage available
  • Examples: fast-ssd, slow-hdd, network-nfs
  • Enables dynamic provisioning (auto-creates PVs)

Flow:

1. Pod needs storage
2. Creates PVC: "I need 10GB, class=longhorn-ssd"
3. StorageClass provisions a PV (creates 10GB Longhorn volume)
4. PVC binds to PV
5. Pod mounts the volume
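
Step 5 in YAML terms: a pod that mounts a claimed volume (the claim and pod names here are illustrative):

```yaml
# Illustrative pod spec: mounts whatever PV the claim "my-data" is bound to.
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data          # the PV appears here inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-data          # hypothetical PVC from step 2
```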


Storage Options for Homelab

Option 1: Longhorn (Distributed Block Storage) - Current Plan

What is Longhorn?

  • Cloud-native distributed block storage for Kubernetes
  • Created by Rancher (same company as k3s)
  • Replicates data across nodes
  • Automatic snapshots and backups

Our Current Architecture:

┌─────────────────────────────────────────────────┐
│ Proxmox Host (tower)                            │
│                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌────────┐│
│  │ VM 201       │  │ VM 202       │  │ VM 203 ││
│  │ (master)     │  │ (worker-1)   │  │ (w-2)  ││
│  │              │  │              │  │        ││
│  │ 80GB disk    │  │ 80GB disk    │  │ 80GB   ││
│  │ ↓            │  │ ↓            │  │ ↓      ││
│  │ Longhorn     │  │ Longhorn     │  │Longhorn││
│  │ storage      │  │ storage      │  │ storage││
│  └──────────────┘  └──────────────┘  └────────┘│
└─────────────────────────────────────────────────┘

Total: 240GB raw → 120GB usable (2-replica)
Each PV stored on 2 nodes for redundancy

How it works:

# Create a PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi

# Longhorn automatically:
# 1. Creates a 10GB volume
# 2. Replicates to 2 nodes (master + worker-1)
# 3. Binds to PVC
# 4. Pod can mount it

Pros:

  • ✅ Survives single node failure (2-replica redundancy)
  • ✅ Automatic replication and healing
  • ✅ Web UI for management
  • ✅ Snapshots and backups
  • ✅ Pure Kubernetes-native (no external dependencies)
  • ✅ Great for learning cloud-native storage

Cons:

  • ❌ Limited by VM disk size (240GB total, 120GB usable)
  • ❌ Single Proxmox host = single point of failure
  • ❌ Network overhead (replica sync between nodes)
  • ❌ Not ideal for large media files

Best for:

  • Databases (PostgreSQL, MySQL, MongoDB)
  • Application state
  • Small to medium datasets (<100GB)
  • Learning distributed storage concepts


Option 2: NFS from /vault (Centralized Storage)

What is NFS?

  • Network File System
  • Centralized storage accessed over the network
  • Multiple pods can mount the same volume (ReadWriteMany)

Architecture with /vault:

┌─────────────────────────────────────────────────┐
│ Proxmox Host (tower)                            │
│                                                 │
│  ┌────────────────┐                             │
│  │ LXC 101 (NAS)  │                             │
│  │ ZFS pool       │                             │
│  │ /vault         │ ← Large ZFS dataset         │
│  └────────┬───────┘                             │
│           │ NFS export: /vault/k8s-volumes      │
│           ↓                                     │
│  ┌─────────────────────────────────────┐        │
│  │ k8s cluster (VMs 201-203)           │        │
│  │                                     │        │
│  │ NFS CSI driver installed            │        │
│  │ Mounts /vault/k8s-volumes           │        │
│  │                                     │        │
│  │ PVs: /vault/k8s-volumes/pvc-abc/    │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘

Setup:

# On NAS (LXC 101)
mkdir -p /vault/k8s-volumes
echo "/vault/k8s-volumes 10.89.97.0/24(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -ra

# In k8s cluster
# Install NFS CSI driver
helm repo add nfs-csi https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm install nfs-csi nfs-csi/csi-driver-nfs

# Create StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-vault
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.89.97.101  # NAS IP
  share: /vault/k8s-volumes
EOF

Pros:

  • ✅ Large capacity (your /vault ZFS pool)
  • ✅ Leverage existing ZFS features (snapshots, compression, dedup)
  • ✅ Easy backups (ZFS send/receive)
  • ✅ Multiple pods can access the same volume (ReadWriteMany)
  • ✅ Good for large datasets (media, logs)

Cons:

  • ❌ Single point of failure (NAS dies = all data gone)
  • ❌ Network bottleneck (all I/O over network)
  • ❌ Not distributed (no HA)
  • ❌ Slower than local storage

Best for:

  • Media libraries (read-only or read-mostly)
  • Shared config files
  • Log aggregation
  • Backups
  • Large datasets that don't fit in Longhorn


Option 3: Ceph (Enterprise Distributed Storage)

What is Ceph?

  • Enterprise-grade distributed storage system
  • Provides object, block, and file storage
  • Industry standard (used by OpenStack, Proxmox, etc.)

Architecture:

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Ceph Node 1  │  │ Ceph Node 2  │  │ Ceph Node 3  │
│              │  │              │  │              │
│ OSD:         │  │ OSD:         │  │ OSD:         │
│ - 2TB SSD    │  │ - 2TB SSD    │  │ - 2TB SSD    │
│ - 4TB HDD    │  │ - 4TB HDD    │  │ - 4TB HDD    │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
              Ceph Cluster (min 3 nodes)
              ┌──────────────────────┐
              │ k8s cluster          │
              │                      │
              │ Rook-Ceph operator   │ ← Manages Ceph
              │ StorageClasses:      │
              │ - ceph-rbd (block)   │
              │ - cephfs (file)      │
              └──────────────────────┘

Components:

  • MON (Monitor): Maintains cluster state (min 3 for quorum)
  • OSD (Object Storage Daemon): Actual storage disks (each disk = 1 OSD)
  • MDS (Metadata Server): For the CephFS file system
  • Rook: Kubernetes operator that manages Ceph

Pros:

  • ✅ True HA (survives multiple node failures)
  • ✅ Highly scalable (add more OSDs to grow)
  • ✅ Multiple access methods (RBD block, CephFS file, S3 object)
  • ✅ Production-grade (used by enterprises)
  • ✅ Excellent resume material

Cons:

  • ❌ Very complex to set up and manage
  • ❌ Resource hungry (min 3 nodes, each with multiple disks + RAM)
  • ❌ Steep learning curve
  • ❌ Requires dedicated nodes (shouldn't run workloads on Ceph nodes)
  • ❌ Min 3 OSDs per storage pool (needs lots of disks)
  • ❌ Overkill for homelab

Requirements:

  • Min 3 physical servers (or VMs with dedicated disks)
  • Each node: 4+ cores, 8GB+ RAM, multiple disks
  • Low-latency network between nodes

Best for:

  • Large homelab (3+ physical servers)
  • Learning enterprise storage
  • High availability requirements
  • Large storage needs (multi-TB)


Tier your storage by use case:

# Fast, HA, small - Longhorn on VM SSDs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ssd
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"

---
# Large, centralized - NFS from /vault
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-bulk
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.89.97.101
  share: /vault/k8s-volumes

---
# Fast, no HA - local VM disk
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path  # k3s default

Usage guide:

Workload               StorageClass   Why
─────────────────────  ─────────────  ─────────────────────────────────
PostgreSQL database    longhorn-ssd   Needs HA, small size, random I/O
MongoDB database       longhorn-ssd   Needs HA, small size
Prometheus metrics     longhorn-ssd   Needs HA, time-series data
Media library (Plex)   nfs-bulk       Large, read-heavy, shared
Log aggregation        nfs-bulk       Large, append-only
Build cache            local-path     Fastest, ephemeral, no HA needed
Temp files             local-path     Ephemeral, speed matters

Example PVC:

# High-performance database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: longhorn-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---
# Media library (read-only, shared)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-library
spec:
  storageClassName: nfs-bulk
  accessModes:
    - ReadOnlyMany  # Multiple pods can read
  resources:
    requests:
      storage: 1Ti


Scaling Storage: Future Options

Scenario 1: Add More Nodes (Same Proxmox)

Easy - Just scale horizontally:

# Create VMs 204, 205, etc.
# Join as workers
# Longhorn automatically uses their disks

Result:

  • Larger Longhorn storage pool (5 nodes × 80GB = 400GB raw → 200GB usable at 2 replicas)
  • Better fault tolerance (lose 1 of 5 nodes vs 1 of 3)

Limitation: Still single Proxmox host (all VMs die if host dies)


Scenario 2: Multiple Proxmox Hosts (True HA)

Multi-host architecture:

┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
│ Proxmox Host 1      │  │ Proxmox Host 2      │  │ Proxmox Host 3      │
│ (tower)             │  │ (fortress)          │  │ (citadel)           │
│ 192.168.1.10        │  │ 192.168.1.11        │  │ 192.168.1.12        │
│                     │  │                     │  │                     │
│ ┌─────────────────┐ │  │ ┌─────────────────┐ │  │ ┌─────────────────┐ │
│ │ k3s-master-1    │ │  │ │ k3s-master-2    │ │  │ │ k3s-master-3    │ │
│ │ (etcd member)   │ │  │ │ (etcd member)   │ │  │ │ (etcd member)   │ │
│ └─────────────────┘ │  │ └─────────────────┘ │  │ └─────────────────┘ │
│                     │  │                     │  │                     │
│ ┌─────────────────┐ │  │ ┌─────────────────┐ │  │ ┌─────────────────┐ │
│ │ k3s-worker-1    │ │  │ │ k3s-worker-2    │ │  │ │ k3s-worker-3    │ │
│ │ Longhorn        │ │  │ │ Longhorn        │ │  │ │ Longhorn        │ │
│ └─────────────────┘ │  │ └─────────────────┘ │  │ └─────────────────┘ │
└─────────────────────┘  └─────────────────────┘  └─────────────────────┘
         │                        │                        │
         └────────────────────────┴────────────────────────┘
                    Network: LAN or VPN

Benefits:

✅ True HA: entire Proxmox host can die, cluster survives
✅ Control plane HA (etcd quorum: 3 masters across 3 hosts)
✅ Storage HA (Longhorn replicas on different physical hosts)
✅ Lose 1 host → still operational

Requirements:

  • 3+ physical servers
  • Network between hosts (same LAN or VPN)
  • Low latency for etcd (<10ms ideal, <50ms acceptable)

Setup notes:

  • Use --cluster-init on the first master
  • Install k3s on each host's VMs (or bare metal)
  • Longhorn automatically spreads replicas across hosts
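
k3s also reads /etc/rancher/k3s/config.yaml at startup, so the multi-host flags can live in a file instead of install-command arguments. A sketch (the IPs reuse the diagram's host addresses for illustration — your k3s VMs will have their own):

```yaml
# /etc/rancher/k3s/config.yaml on k3s-master-1 (first master)
cluster-init: true
node-ip: 192.168.1.10
tls-san:
  - 192.168.1.10

# On k3s-master-2 and k3s-master-3, point at the first master instead:
# server: https://192.168.1.10:6443
# token: <node-token from master-1>
# node-ip: 192.168.1.11
```

With the file in place, the standard `curl -sfL https://get.k3s.io | sh -` install picks the settings up automatically.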


Scenario 3: Hybrid Storage Tiers

Combine everything:

Current Phase (Phase 2):
┌────────────────────────────────────────┐
│ Longhorn on VM disks (120GB usable)    │
│ - Use for: Databases, app state        │
│ - Replicas: 2                          │
│ - Performance: Good                    │
└────────────────────────────────────────┘

Future Phase 3:
┌────────────────────────────────────────┐
│ Add NFS from /vault                    │
│ - Use for: Media, logs, backups        │
│ - Capacity: Multi-TB                   │
│ - Performance: Medium                  │
└────────────────────────────────────────┘

Future Phase 4 (Multi-host):
┌────────────────────────────────────────┐
│ Longhorn across 3 Proxmox hosts        │
│ - Use for: Critical HA data            │
│ - Survives host failure                │
│ - Performance: Good                    │
└────────────────────────────────────────┘

Optional Phase 5 (Enterprise):
┌────────────────────────────────────────┐
│ Ceph cluster (3+ nodes)                │
│ - Use for: Everything (consolidate)    │
│ - RBD, CephFS, S3                      │
│ - Resume material                      │
└────────────────────────────────────────┘

Storage Best Practices

1. Choose the Right Storage for the Workload

Don't use Longhorn for everything:

✅ Databases → Longhorn (needs HA, small size)
❌ Media files → use NFS instead (large, centralized is better)
❌ Build artifacts → use local-path instead (ephemeral, speed matters)

2. Set Resource Limits

# Don't let one app consume all storage
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: my-app
spec:
  hard:
    requests.storage: "50Gi"
    persistentvolumeclaims: "5"

3. Monitor Storage Usage

# Check Longhorn volumes
kubectl get pv

# Check PVC usage
kubectl get pvc -A

# Longhorn UI (Phase 2)
# Access at: http://<node-ip>:30080

4. Backup Regularly

Longhorn: Built-in snapshots and backups to S3/NFS

NFS: ZFS snapshots on /vault

# On NAS
zfs snapshot rpool/data/subvol-101-disk-0@k8s-backup-$(date +%Y%m%d)

Best practice: Backup to external location (not same Proxmox host)
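
Longhorn backups can also be scheduled declaratively via its RecurringJob CRD. A sketch (the schedule and retention are example values, and the S3/NFS backup target must be configured separately in Longhorn's settings):

```yaml
# Nightly Longhorn backup job (longhorn.io/v1beta2 RecurringJob)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"   # every night at 03:00
  task: "backup"       # use "snapshot" for local-only snapshots
  groups:
    - default          # applies to all volumes in the default group
  retain: 7            # keep the last 7 backups
  concurrency: 2       # at most 2 volumes backed up in parallel
```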

5. Plan for Growth

Current: 120GB usable (Longhorn 2-replica on 3×80GB)

Growth path:

1. Add more worker nodes (more disks in pool)
2. Increase VM disk size (qm resize on Proxmox)
3. Add NFS tier for bulk data
4. Multi-host for true HA
5. Consider Ceph for 500GB+ persistent data


Common Questions

Q: Can I use multiple storage providers at once? A: Yes! That's the hybrid approach. Longhorn + NFS + local-path all coexist.

Q: Can I migrate PVs between storage classes? A: Not directly. Need to backup data, delete PVC, create new PVC with different class, restore data.

Q: What happens if a Longhorn replica node dies? A: Longhorn automatically rebuilds the replica on a healthy node. No data loss (if 2+ replicas exist).

Q: Can I expand a PV after creation? A: Yes, if StorageClass has allowVolumeExpansion: true. Edit PVC size, Longhorn auto-expands.
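
For example, expansion only works if the StorageClass opted in. This sketch revisits the longhorn-ssd class from above with expansion enabled:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ssd
provisioner: driver.longhorn.io
allowVolumeExpansion: true   # without this, PVC size edits are rejected
parameters:
  numberOfReplicas: "2"
```

Then growing a claim is a one-liner: `kubectl patch pvc postgres-data -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'` (sizes can only increase, never shrink).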

Q: Should I use Longhorn or NFS for Prometheus? A: Longhorn. Prometheus does lots of random I/O (time-series database), needs fast local storage.

Q: Can I access /vault directly from pods? A: Yes, via NFS CSI driver. Or mount /vault as hostPath (not recommended - ties pod to specific node).
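
To illustrate the direct-mount option: a pod can declare an in-tree `nfs` volume with no StorageClass at all. This sketch assumes the NAS export from the nfs-bulk example (server 10.89.97.101; the /vault path is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vault-reader
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: vault
          mountPath: /vault
          readOnly: true
  volumes:
    - name: vault
      nfs:                   # in-tree NFS mount, no CSI driver needed
        server: 10.89.97.101
        path: /vault
```

Unlike hostPath, this works from any node, but you lose dynamic provisioning — the CSI driver is still the better default.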

Q: Do I need Ceph? A: No, not for homelab. Longhorn + NFS covers 99% of use cases. Ceph is for large-scale deployments.


Further Reading

  • Longhorn docs: https://longhorn.io/docs/
  • Kubernetes storage: https://kubernetes.io/docs/concepts/storage/
  • Ceph: https://docs.ceph.com/
  • Rook (Ceph operator): https://rook.io/
  • NFS CSI driver: https://github.com/kubernetes-csi/csi-driver-nfs

What Makes Our Setup Easy

1. Same LAN (10.89.97.0/24)

Benefit: Nodes can talk to each other directly

  • No VPN needed
  • No port forwarding
  • No NAT traversal
  • Low latency (<1ms between nodes)

Comparison to multi-site:

  • Multi-site needs VPN or public IPs
  • Higher latency (50-100ms+ typical)
  • More complex firewall rules

2. All on Same Proxmox Host

Benefits:

  • Easy VM management (qm list, qm start, etc.)
  • Shared storage access (could mount host paths if needed)
  • Console access via Proxmox UI
  • Backup entire cluster with Proxmox backup
  • Resource monitoring in one place

Comparison to separate hardware:

  • Separate hardware = physically visiting machines
  • Different management tools per hypervisor
  • Harder to troubleshoot (no centralized console)

3. Fresh Install / No Legacy Constraints

Benefits:

  • Latest k3s version (v1.33.5)
  • Clean networking (no old iptables rules)
  • No migration concerns (not moving from an existing system)
  • Can make breaking changes freely (learning environment)

4. k3s Defaults Work Out of the Box

What we didn't need to configure:

✅ Flannel CNI (pod networking) - included
✅ CoreDNS (DNS) - included
✅ local-path storage - included
✅ Traefik ingress - included (we'll replace with MetalLB)
✅ Service account tokens - automatic

Comparison to vanilla k8s (kubeadm):

  • Must install a CNI separately (Calico, Flannel, etc.)
  • CoreDNS is deployed, but its pods stay Pending until the CNI is up
  • No default StorageClass
  • No ingress controller by default
  • Much more YAML to write
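
To make the gap concrete, here's roughly what those extra steps look like on vanilla k8s (a sketch; the pod CIDR is Flannel's documented default and the apply URL is Flannel's published manifest):

```shell
# Vanilla k8s: steps k3s does for you
sudo kubeadm init --pod-network-cidr=10.244.0.0/16   # control plane only

# CNI is not included - install Flannel manually
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Still no default StorageClass or ingress controller; each needs
# its own install (e.g. local-path-provisioner, ingress-nginx)
```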

5. Learning Environment (Can Break Things)

Benefits:

  • Can experiment without fear
  • Easy to rebuild (VMs are disposable)
  • Mistakes are learning opportunities
  • No SLA requirements


What Changes for Multi-Site

Scenario: Master at Home, Workers in Cloud

Challenges:

1. Connectivity

  • Home IP likely dynamic (DHCP from ISP)
  • NAT/firewall at home router
  • Cloud nodes have public IPs
  • Can't directly connect without setup

Solutions:

  • Dynamic DNS: DuckDNS or Cloudflare DDNS for the home IP
  • VPN: Tailscale/WireGuard for direct node-to-node traffic
  • Port forwarding: Forward 6443 on the home router (less secure)

2. Latency

  • Home to cloud: 50-150ms typical
  • etcd's Raft consensus degrades badly above ~100ms latency
  • Impact: HA masters won't work reliably

Solutions:

  • Single master at home (accept a single point of failure)
  • Master in cloud (lower latency to cloud workers)
  • Separate clusters (home cluster + cloud cluster)

3. Security

  • Exposing home network to internet (if not using VPN)
  • More attack surface

Solutions:

  • Mandatory VPN (Tailscale recommended)
  • Client certificates (already in place - k3s default)
  • Firewall rules on each node (allow only cluster node IPs)
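
A hedged sketch of the per-node firewall idea using ufw, assuming the Tailscale setup described below (100.64.0.0/10 is Tailscale's address range; adjust for your VPN and distro):

```shell
# Default deny inbound, then allow only cluster traffic over the VPN
sudo ufw default deny incoming
sudo ufw allow in on tailscale0                                 # node-to-node traffic via VPN only
sudo ufw allow from 100.64.0.0/10 to any port 6443 proto tcp    # k8s API from VPN peers
sudo ufw allow 22/tcp                                           # keep SSH reachable
sudo ufw enable
```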

4. Node Network

Current (same LAN):

Node Network: 10.89.97.0/24
- All nodes see each other directly

Multi-site with VPN:

Physical Networks:
- Home: 192.168.1.0/24
- Cloud: (public IPs)

VPN Network (Tailscale): 100.64.0.0/10
- Master: 100.64.1.1
- Worker-1: 100.64.1.2
- Worker-2: 100.64.1.3

k3s uses VPN IPs for node communication

Installation changes:

# Master (home, Tailscale IP 100.64.1.1)
curl -sfL https://get.k3s.io | sh -s - server \
  --node-ip 100.64.1.1 \
  --advertise-address 100.64.1.1 \
  --tls-san 100.64.1.1

# Worker (cloud, Tailscale IP 100.64.1.2)
curl -sfL https://get.k3s.io | \
  K3S_URL=https://100.64.1.1:6443 \
  K3S_TOKEN=xxx \
  sh -s - --node-ip 100.64.1.2

5. Flannel Configuration

Flannel needs to know which interface to use for VXLAN.

Current (same LAN): Auto-detects eth0 (10.89.97.x)

Multi-site with VPN:

# Tell Flannel to use Tailscale interface
curl -sfL https://get.k3s.io | sh -s - server \
  --flannel-iface tailscale0

Option 1: VPN Mesh (Best for Learning)

Home ←→ Tailscale/Wireguard ←→ Cloud

✅ Secure
✅ Same experience as LAN
✅ No public k8s API exposure
❌ Latency still limits HA masters

Option 2: Cloud-Only Cluster with Home Access

Cloud: k3s cluster (master + workers)
Home: kubectl only (via kubeconfig)

✅ Low latency (all nodes in same datacenter)
✅ Can do HA masters
✅ Home just uses kubectl
❌ Home not part of cluster (can't run workloads at home)

Option 3: Edge Clusters with Federation

Home Cluster: k3s (master + workers)
Cloud Cluster: k3s (master + workers)
Federation: ArgoCD/Flux sync apps to both

✅ Each cluster independent (no cross-site latency)
✅ Can do HA masters in each cluster
✅ Deploy apps to both with GitOps
❌ More complex (managing 2 clusters)

For homelab learning: Option 1 (VPN mesh) is best. Gives you multi-site experience without complexity.


Practical Examples

Add a Worker Node

# 1. Get the join token from master
ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token'
# Output: K10abc123...::server:xyz789

# 2. On new node (10.89.97.204), install k3s as agent
curl -sfL https://get.k3s.io | \
  K3S_URL=https://10.89.97.201:6443 \
  K3S_TOKEN=K10abc123...::server:xyz789 \
  sh -

# 3. Verify from master
kubectl get nodes
# Should show new node in "Ready" state within 30 seconds

Check Cluster Certificates

# View API server certificate
openssl s_client -connect 10.89.97.201:6443 -showcerts 2>/dev/null | \
  openssl x509 -noout -text | grep -A 2 "Subject:"

# View kubelet certificate
ssh root@10.89.97.202 'openssl x509 -in /var/lib/rancher/k3s/agent/client-kubelet.crt -noout -text | grep -A 2 "Subject:"'

# Check certificate expiration
ssh root@10.89.97.201 'openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -enddate'

Troubleshoot Network Issues

# 1. Check node connectivity
ping 10.89.97.202  # Can nodes reach each other?

# 2. Check API server access
curl -k https://10.89.97.201:6443/healthz
# Should return "ok"

# 3. Check Flannel VXLAN
ip -d link show flannel.1  # Should see VXLAN interface

# 4. Test pod-to-pod networking
# Deploy test pods
kubectl run test1 --image=busybox --command -- sleep 3600
kubectl run test2 --image=busybox --command -- sleep 3600

# Get pod IPs
kubectl get pods -o wide
# test1: 10.42.1.5 on worker-1
# test2: 10.42.2.8 on worker-2

# Exec into test1 and ping test2
kubectl exec -it test1 -- ping 10.42.2.8
# Should work!

# 5. Check DNS
kubectl exec -it test1 -- nslookup kubernetes.default
# Should resolve to service IP (10.43.0.1)

# Cleanup
kubectl delete pod test1 test2

Migrate from Single to HA Masters

# Current: 1 master (SQLite backend)
# Goal: 3 masters (embedded etcd)

# 1. BACKUP FIRST!
ssh root@10.89.97.201 'tar czf /tmp/k3s-backup.tar.gz /var/lib/rancher/k3s'
scp root@10.89.97.201:/tmp/k3s-backup.tar.gz ./

# 2. Drain workloads from master
kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data

# 3. Stop k3s on master
ssh root@10.89.97.201 'systemctl stop k3s'

# 4. Convert to embedded etcd (cluster-init)
# Simplest: set the flag in the k3s config file instead of editing the unit
ssh root@10.89.97.201 'echo "cluster-init: true" >> /etc/rancher/k3s/config.yaml'

# 5. Restart
ssh root@10.89.97.201 'systemctl start k3s'

# 6. Verify
kubectl get nodes

# 7. Join second master
# (on new master VM 204)
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# 8. Join third master
# (on new master VM 205)
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# 9. Verify etcd cluster
ssh root@10.89.97.201 'k3s etcd-snapshot ls'

# 10. Point kubectl at the HA control plane (optional)
# A kubeconfig cluster entry holds ONE server URL - either add a
# context per master, or put a load balancer/VIP in front of :6443
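
Since a kubeconfig cluster entry can only name one server, "all master IPs" in practice means one context per master (or a load balancer). A sketch of the multi-context variant (names, the second master's IP, and the credential placeholders are illustrative):

```yaml
# ~/.kube/config sketch: one context per master, switch manually on failure
apiVersion: v1
kind: Config
clusters:
  - name: tower-m1
    cluster:
      server: https://10.89.97.201:6443
      certificate-authority-data: <base64 CA>   # same cluster CA for every master
  - name: tower-m2
    cluster:
      server: https://10.89.97.204:6443
      certificate-authority-data: <base64 CA>
users:
  - name: admin
    user:
      client-certificate-data: <base64 cert>
      client-key-data: <base64 key>
contexts:
  - name: m1
    context: {cluster: tower-m1, user: admin}
  - name: m2
    context: {cluster: tower-m2, user: admin}
current-context: m1
```

If master 1 goes down: `kubectl config use-context m2`. A small HAProxy/keepalived VIP removes even that manual step.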

View Component Logs

# API server logs
ssh root@10.89.97.201 'journalctl -u k3s -f | grep apiserver'

# kubelet logs
ssh root@10.89.97.202 'journalctl -u k3s-agent -f'

# Scheduler decisions
ssh root@10.89.97.201 'journalctl -u k3s -f | grep scheduler'

# Controller logs
ssh root@10.89.97.201 'journalctl -u k3s -f | grep controller'

# All k3s logs
ssh root@10.89.97.201 'journalctl -u k3s -f'

Further Reading

  • Official k3s docs: https://docs.k3s.io
  • Kubernetes concepts: https://kubernetes.io/docs/concepts/
  • Flannel networking: https://github.com/flannel-io/flannel
  • etcd documentation: https://etcd.io/docs/
  • Kubernetes networking guide: https://kubernetes.io/docs/concepts/cluster-administration/networking/
  • RBAC: https://kubernetes.io/docs/reference/access-authn-authz/rbac/

Questions This Answers

  • ✅ What is Kubernetes and why use it?
  • ✅ How is k3s different from k8s?
  • ✅ What runs on master vs worker nodes?
  • ✅ How do I add more worker nodes?
  • ✅ How do I add more master nodes?
  • ✅ What's the difference between master and worker?
  • ✅ How does networking work in our same-LAN setup?
  • ✅ How would networking work across different networks/machines?
  • ✅ How does authentication work?
  • ✅ How does routing work?
  • ✅ How does storage work (etcd/SQLite vs Longhorn/NFS)?
  • ✅ What if I add more servers or media shares?
  • ✅ What are the storage options (Longhorn, NFS, Ceph)?
  • ✅ When do I need Ceph vs Longhorn?
  • ✅ How does Longhorn provide HA?
  • ✅ What makes our setup easy?
  • ✅ What would change for multi-site deployments?

Next: 02-kubectl-kubeconfig.md - How to manage the cluster with kubectl