
Understanding Kubernetes

Last Updated: 2025-11-09

This guide explains what Kubernetes is, how it works, and specifically how it operates in our Tower Fleet homelab environment.


Table of Contents

  1. What is Kubernetes?
  2. k3s vs k8s
  3. Cluster Architecture Deep Dive
  4. Scaling the Cluster
  5. Networking Explained
  6. Authentication & Security
  7. How Requests Flow
  8. Storage Architecture
  9. What Makes Our Setup Easy
  10. What Changes for Multi-Site
  11. Practical Examples

What is Kubernetes?

Kubernetes (k8s) is a container orchestration platform that automates deploying, scaling, and managing containerized applications across a cluster of machines.

Core Concepts

Container Orchestration means:

  • Automatically scheduling containers onto available machines
  • Restarting failed containers
  • Scaling up/down based on demand
  • Load balancing traffic across containers
  • Managing storage, networking, and secrets

Key Building Blocks:

  • Pod: Smallest deployable unit - one or more containers that share network/storage
  • Deployment: Defines desired state (e.g., "run 3 replicas of nginx")
  • Service: Stable network endpoint to access pods (even as they move/restart)
  • Node: A physical or virtual machine in the cluster
  • Cluster: A set of nodes working together
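
These building blocks fit together in a short manifest. The following is a sketch of a hypothetical nginx app (all names illustrative), declaring a Deployment and a Service:

```yaml
# app.yaml (illustrative): a Deployment declares desired state,
# a Service gives the pods a stable endpoint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                # desired state: three pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx               # routes to any pod with this label
  ports:
    - port: 80
```

Applying it with kubectl apply -f app.yaml hands the desired state to the cluster; k8s does the rest.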

Why Container Orchestration Matters

Without k8s:

  • Manually SSH to servers to deploy apps
  • Write custom scripts for health checks and restarts
  • Hard to scale (need to manually distribute load)
  • No standardization (every app deployed differently)

With k8s:

  • Declare desired state in YAML (kubectl apply -f app.yaml)
  • k8s ensures current state matches desired state
  • Self-healing (crashed pods restart automatically)
  • Standardized (all apps deployed the same way)
  • Portable (same YAML works on any k8s cluster)


k3s vs k8s

What is k3s?

k3s is a lightweight, certified Kubernetes distribution created by Rancher (now SUSE). It's a full k8s implementation, optimized for:

  • Edge computing
  • IoT devices
  • CI/CD environments
  • Homelabs (our use case!)

Key Differences

Feature             Kubernetes (k8s)                k3s
------------------  ------------------------------  --------------------------------
Binary size         ~1GB+ (multiple binaries)       ~70MB (single binary)
Memory footprint    ~1-2GB per node                 ~512MB per node
Installation        Complex (kubeadm, many steps)   One command
Default storage     None (must install)             local-path provisioner included
Default ingress     None (must install)             Traefik included
Database            Requires etcd cluster           Embedded SQLite (or etcd for HA)
Cloud integrations  Many built-in                   Removed (reduces bloat)
ARM support         Limited                         Excellent (Raspberry Pi ready)
Certification       N/A (reference impl)            100% certified Kubernetes

What k3s Removed

  • Legacy/alpha features
  • Cloud provider integrations (AWS, GCP, Azure)
  • In-tree storage drivers (uses CSI instead)
  • Non-essential admission controllers

Important: k3s is fully certified Kubernetes. Any app that runs on k8s will run on k3s. It's not a fork or subset - it's the same API.

Why We Chose k3s for Homelab

  1. Resource efficient: Leaves more RAM/CPU for actual workloads
  2. Easy installation: Single command vs hours of kubeadm troubleshooting
  3. Batteries included: Storage and ingress work out of the box
  4. Perfect for learning: Same k8s APIs you'll use professionally
  5. Great documentation: Rancher maintains excellent guides

Cluster Architecture Deep Dive

A Kubernetes cluster has two types of nodes:

Control Plane (Master) Node

The "brain" of the cluster. Runs these critical components:

1. kube-apiserver (The API Server)

  • What it does: Central management point - every operation goes through here
  • Who talks to it: kubectl, web dashboards, other control plane components, node agents
  • Port: 6443 (HTTPS)
  • Example: When you run kubectl get pods, you're talking to the API server

2. etcd (Distributed Database)

  • What it does: Stores all cluster state (what pods exist, what their status is, etc.)
  • Type: Distributed key-value store with strong consistency
  • Critical: If etcd data is lost, cluster state is lost
  • In k3s: Can use embedded SQLite (single-master) or embedded etcd (multi-master)
  • Port: 2379-2380

3. kube-scheduler (The Scheduler)

  • What it does: Decides which worker node should run each new pod
  • Factors it considers:
      • Available CPU/memory on nodes
      • Node affinity rules (e.g., "must run on SSD nodes")
      • Resource requests from pods
      • Taints and tolerations
  • Example: You create a deployment → scheduler assigns pods to workers

4. kube-controller-manager (The Control Loop)

  • What it does: Runs controllers that maintain desired state
  • Controllers include:
      • Deployment controller: Ensures correct number of replicas
      • ReplicaSet controller: Creates/deletes pods to match desired count
      • Node controller: Detects when nodes go down
      • Service controller: Creates load balancers, updates endpoints
  • Pattern: Continuously loops asking "is current state = desired state?"
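
The control-loop pattern can be sketched in a few lines of shell. This is a toy reconciler, not real controller code; desired and current state here are just counters:

```shell
# Toy reconcile loop: drive current state toward desired state.
desired=3   # e.g., a Deployment asks for 3 replicas
current=0   # e.g., 0 pods exist right now

reconcile() {
  # A real controller would create or delete one pod via the API here.
  current=$((current + 1))
}

while [ "$current" -lt "$desired" ]; do
  reconcile
done
echo "current=$current desired=$desired"   # current=3 desired=3
```

Real controllers run this loop forever, re-checking state on every change event rather than exiting once they converge.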

5. cloud-controller-manager (Cloud Integrations)

  • What it does: Talks to cloud provider APIs (AWS, GCP, Azure)
  • In k3s: Not included (we don't need it for bare-metal)

Worker Nodes

Run your actual application workloads. Components:

1. kubelet (The Node Agent)

  • What it does: Agent on every node (including master in our case)
  • Responsibilities:
      • Watches API server for pods assigned to this node
      • Starts/stops containers via container runtime
      • Reports pod status back to API server
      • Performs health checks on pods
  • Port: 10250

2. kube-proxy (Network Proxy)

  • What it does: Manages network rules for pod-to-pod and pod-to-service communication
  • How it works: Updates iptables rules to route traffic to correct pods
  • Example: Service has IP 10.43.0.10 → kube-proxy routes to backend pods

3. Container Runtime (Runs Containers)

  • What it does: Actually starts/stops containers
  • In k3s: Uses containerd (industry standard)
  • Interface: CRI (Container Runtime Interface) - standardized API

Our Cluster Layout

┌─────────────────────────────────────────────────────────────┐
│ k3s-master (10.89.97.201)                                   │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Control Plane                                           │ │
│ │ - kube-apiserver (port 6443)                           │ │
│ │ - embedded SQLite datastore                             │ │
│ │ - kube-scheduler                                       │ │
│ │ - kube-controller-manager                              │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components (master also runs workloads)         │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ k3s-worker-1 (10.89.97.202)                                 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components Only                                  │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ k3s-worker-2 (10.89.97.203)                                 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Worker Components Only                                  │ │
│ │ - kubelet                                              │ │
│ │ - kube-proxy                                           │ │
│ │ - containerd                                           │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Note: In our setup, the master node also runs workloads (no taints), so it has both control plane AND worker components.


Scaling the Cluster

Adding Worker Nodes

Easy! Workers just need to contact the master and provide a token.

# On the NEW worker node (e.g., 10.89.97.204)

# Get the token from master first
ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token'
# Example output: K10abc123def456...::server:xyz789

# Install k3s as agent (worker)
curl -sfL https://get.k3s.io | \
  K3S_URL=https://10.89.97.201:6443 \
  K3S_TOKEN=K10abc123def456...::server:xyz789 \
  sh -

# Verify on master
kubectl get nodes

That's it! The new node:

  1. Connects to master:6443
  2. Authenticates with the token
  3. kubelet registers with the API server
  4. Scheduler starts assigning pods to it

Requirements:

  • Network access to the master on port 6443
  • Shared token (stored in /var/lib/rancher/k3s/server/node-token)

Adding Master Nodes (HA Control Plane)

More complex - requires etcd quorum (odd number: 3, 5, 7 masters).

# First master (already done)
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Second master
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# Third master
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

Why odd numbers? etcd needs quorum (majority) to agree on state:

  • 3 masters = tolerates 1 failure (2/3 quorum)
  • 5 masters = tolerates 2 failures (3/5 quorum)
  • 2 masters = tolerates 0 failures (not HA!)

When do you need HA masters?

  • Production environments (can't afford downtime)
  • Critical infrastructure
  • Not needed for homelab (we can tolerate restarts)
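
The quorum arithmetic can be checked with two small shell functions (a sketch, not part of any k3s tooling):

```shell
# quorum: majority of N members; tolerated: how many failures N can survive.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 2 3 5; do
  echo "$n masters: quorum=$(quorum $n), tolerates $(tolerated $n) failure(s)"
done
```

Note that going from 2 to 3 masters adds fault tolerance, but going from 3 to 4 does not — which is why the counts are always odd.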

Removing Nodes

# Drain node (move pods elsewhere)
kubectl drain k3s-worker-2 --ignore-daemonsets --delete-emptydir-data

# Delete from cluster
kubectl delete node k3s-worker-2

# On the node itself, uninstall k3s
ssh root@10.89.97.203 '/usr/local/bin/k3s-agent-uninstall.sh'

Networking Explained

Kubernetes networking can be confusing because there are three separate networks:

1. Node Network (Physical/VM Network)

Our case: 10.89.97.0/24

  • 10.89.97.201 - k3s-master
  • 10.89.97.202 - k3s-worker-1
  • 10.89.97.203 - k3s-worker-2

Purpose: Nodes communicate with each other
Requirement: All nodes must be able to reach each other

2. Pod Network (Overlay Network)

Our case: 10.42.0.0/16 (k3s default)

  • 10.42.0.x - Pods on master
  • 10.42.1.x - Pods on worker-1
  • 10.42.2.x - Pods on worker-2

Purpose: Every pod gets a unique IP

How it works:

  • CNI plugin (Flannel in k3s) creates an overlay network
  • Pods on different nodes can talk directly via these IPs
  • Flannel uses VXLAN tunnels to route traffic between nodes

Example:

Pod A on worker-1 (10.42.1.5) → Pod B on worker-2 (10.42.2.8)

1. Pod A sends packet to 10.42.2.8
2. Flannel on worker-1 sees destination is on worker-2
3. Encapsulates packet in VXLAN tunnel to 10.89.97.203
4. worker-2 receives, de-encapsulates, delivers to Pod B

3. Service Network (Virtual IPs)

Our case: 10.43.0.0/16 (k3s default)

Purpose: Stable IPs for accessing groups of pods

  • Pods come and go (deployments, restarts, scaling)
  • Services provide a stable IP that load balances to healthy pods

Types:

  • ClusterIP: Only accessible within the cluster (default)
  • NodePort: Accessible on every node's IP at a specific port
  • LoadBalancer: Gets an external IP (needs MetalLB in our case)

How it works:

  • Service gets a virtual IP (e.g., 10.43.0.10)
  • kube-proxy on each node creates iptables rules
  • Traffic to 10.43.0.10 is load-balanced to backend pods
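
A minimal ClusterIP Service manifest, as a sketch (the api name and ports are illustrative):

```yaml
# Stable virtual IP (allocated from 10.43.0.0/16) in front of matching pods.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP        # default; NodePort and LoadBalancer are the other types
  selector:
    app: api             # traffic goes to ready pods carrying this label
  ports:
    - port: 80           # the Service port clients call
      targetPort: 8080   # the container port on the backend pods
```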

Same Network (Our Case)

Why it's easy:

  • All nodes on 10.89.97.0/24 can directly talk
  • No NAT, no firewall rules needed between nodes
  • Flannel just needs to know node IPs (which it gets from the k8s API)

Required ports between nodes:

  • 6443: API server (workers → master)
  • 10250: kubelet API (master → workers for logs, exec)
  • 8472: VXLAN for Flannel (pod traffic between nodes)

Different Networks/Machines

Scenario: Master on home network (192.168.1.0/24), workers on VPS (different ISP)

Challenges:

  1. Connectivity: Nodes can't directly reach each other
  2. Latency: High latency breaks etcd (needs <10ms for HA masters)
  3. Firewalls: Need to open/forward ports

Solutions:

Option 1: VPN Mesh (Recommended)

  • Use Tailscale or WireGuard
  • Creates an overlay network (e.g., 100.64.0.0/10)
  • All nodes get IPs on the same VPN subnet
  • From the k3s perspective, it looks like the same network!

# Install Tailscale on all nodes
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

# Nodes get IPs like 100.64.1.1, 100.64.1.2, etc.
# Use these IPs for k3s installation

# On master
curl -sfL https://get.k3s.io | sh -s - server --node-ip 100.64.1.1

# On worker
curl -sfL https://get.k3s.io | \
  K3S_URL=https://100.64.1.1:6443 \
  K3S_TOKEN=xxx \
  sh -s - --node-ip 100.64.1.2

Option 2: Public IPs + Firewall

  • Use public IPs for node communication
  • Lock down with firewall rules (only allow specific node IPs)
  • Not recommended for multi-master (latency + security)

Option 3: Separate Clusters

  • Run an independent cluster per site
  • Use tools like Kubefed to federate clusters
  • More complex but better for geo-distributed

DNS in the Cluster

CoreDNS runs as a pod and provides DNS for the cluster:

  • Services get DNS names: <service-name>.<namespace>.svc.cluster.local
  • Example: nginx.default.svc.cluster.local → resolves to service IP
  • Pods can use short names within same namespace: curl http://nginx
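
The naming scheme is mechanical, so a one-line helper can spell it out (a sketch; assumes the default cluster.local domain):

```shell
# Build the in-cluster DNS name for a Service: <name>.<namespace>.svc.<domain>
svc_fqdn() { printf '%s.%s.svc.cluster.local\n' "$1" "$2"; }

svc_fqdn nginx default        # nginx.default.svc.cluster.local
svc_fqdn grafana monitoring   # grafana.monitoring.svc.cluster.local
```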

Authentication & Security

Node-to-Master Authentication

How workers authenticate to master:

  1. During join: Worker provides K3S_TOKEN from /var/lib/rancher/k3s/server/node-token
  2. After join: Worker uses client certificate issued by k3s CA
  3. Ongoing: All communication is mTLS (mutual TLS)

The token is secret! Anyone with the token can join the cluster.

To rotate the token: newer k3s releases ship a k3s token rotate command; after rotation, existing nodes keep their client certificates and stay joined.

kubectl Authentication

How kubectl authenticates:

Reads ~/.kube/config (kubeconfig):

clusters:
  - name: default
    cluster:
      server: https://10.89.97.201:6443
      certificate-authority-data: <base64-encoded-ca-cert>

users:
  - name: default
    user:
      client-certificate-data: <base64-encoded-client-cert>
      client-key-data: <base64-encoded-client-key>

contexts:
  - name: default
    context:
      cluster: default
      user: default

Flow:

  1. kubectl presents its client certificate to the API server
  2. API server verifies the cert is signed by the cluster CA
  3. The certificate's CN (Common Name) is the username and its O (Organization) the group; RBAC maps these to permissions
  4. The default k3s kubeconfig uses an admin-level certificate

Pod-to-API Authentication

Service Accounts:

  • Every pod gets a ServiceAccount (default: the default SA in its namespace)
  • The ServiceAccount has a token auto-mounted at /var/run/secrets/kubernetes.io/serviceaccount/token
  • Apps can use this token to talk to the k8s API

RBAC (Role-Based Access Control):

  • Define Roles (what permissions: list pods, create deployments, etc.)
  • Bind Roles to ServiceAccounts
  • Pods run as a ServiceAccount → get those permissions

Example: Prometheus needs to list pods/nodes to scrape metrics → create ServiceAccount with read-only RBAC
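
That Prometheus example translates into three small manifests (illustrative names): a ServiceAccount, a read-only ClusterRole, and the binding between them:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# Read-only access to the objects a scraper needs to discover targets.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-read
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-read
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

Any pod running as that ServiceAccount can list pods and nodes, and nothing else.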

Encryption

All cluster traffic is encrypted:

  • API server ↔ etcd: TLS
  • kubectl ↔ API server: TLS
  • kubelet ↔ API server: mTLS
  • Pod network: Not encrypted by default (use a service mesh like Istio for pod-to-pod encryption)

Cross-Network Security

Same security model regardless of network:

  • All auth is based on certificates/tokens, not network location
  • Firewalls should still restrict access to port 6443 (only allow known IPs)
  • Use a VPN for multi-site (adds network-level security)

Best practices:

  • Don't expose the API server to the public internet
  • Rotate tokens periodically
  • Use least-privilege RBAC
  • Enable audit logging (k3s: --kube-apiserver-arg=audit-log-path=/var/log/k3s-audit.log)


How Requests Flow

Example 1: kubectl get pods

┌──────────────┐
│ Your Laptop  │
│ $ kubectl    │
│   get pods   │
└──────┬───────┘
       │ 1. Read ~/.kube/config
       │    - API server: https://10.89.97.201:6443
       │    - Client cert for authentication
┌──────────────────────────────────────┐
│ k3s-master (10.89.97.201)            │
│                                      │
│  2. TLS connection to port 6443      │
│     ┌─────────────────────────────┐  │
│     │ kube-apiserver              │  │
│     │                             │  │
│     │ 3. Verify client cert       │  │
│     │ 4. Check RBAC permissions   │  │
│     └──────────┬──────────────────┘  │
│                │                     │
│                ▼                     │
│     ┌─────────────────────────────┐  │
│     │ etcd                        │  │
│     │                             │  │
│     │ 5. Query for pod objects    │  │
│     │    in all namespaces        │  │
│     └──────────┬──────────────────┘  │
│                │                     │
└────────────────┼─────────────────────┘
          6. Return pod list
             (JSON format)
         7. kubectl formats
            as table and prints

Timeline: Typically 10-50ms

Example 2: Create a Deployment

┌──────────────────────────────────────────┐
│ $ kubectl create deployment nginx        │
│     --image=nginx --replicas=3           │
└──────┬───────────────────────────────────┘
       │ 1. kubectl sends Deployment object
       │    to API server
┌────────────────────────────────────────────────────┐
│ kube-apiserver                                     │
│  2. Validate Deployment spec                       │
│  3. Store in etcd                                  │
│  4. Return success to kubectl                      │
└────────────────────────────────────────────────────┘
       │ (Controllers are watching for new objects)
┌────────────────────────────────────────────────────┐
│ kube-controller-manager                            │
│                                                    │
│  Deployment Controller sees new Deployment:        │
│  5. Create a ReplicaSet (desired: 3 replicas)      │
│     → stores in etcd via API                       │
│                                                    │
│  ReplicaSet Controller sees new ReplicaSet:        │
│  6. Create 3 Pod objects                           │
│     → stores in etcd via API                       │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ kube-scheduler                                     │
│                                                    │
│  7. Watches for unscheduled pods                   │
│  8. For each pod:                                  │
│     - Score all nodes (available resources)        │
│     - Pick best node                               │
│     - Update pod.spec.nodeName in etcd             │
│                                                    │
│  Result:                                           │
│  - Pod 1 → k3s-worker-1                            │
│  - Pod 2 → k3s-worker-2                            │
│  - Pod 3 → k3s-master                              │
└────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ kubelet (on each node)                             │
│                                                    │
│  9. Watches API for pods assigned to this node     │
│ 10. Sees new pod assigned                          │
│ 11. Tells containerd to pull image & start         │
│ 12. Reports pod status back to API server          │
└────────────────────────────────────────────────────┘
    Pod running!

Timeline: Typically 5-30 seconds (depends on image size)

Example 3: Pod-to-Pod Communication (Different Nodes)

Scenario: nginx pod on worker-1 calls API pod on worker-2

┌─────────────────────────────────────────────────────┐
│ k3s-worker-1 (10.89.97.202)                         │
│                                                     │
│  ┌──────────────────────────────┐                   │
│  │ nginx pod (10.42.1.5)        │                   │
│  │                              │                   │
│  │ 1. curl http://api.default.  │                   │
│  │         svc.cluster.local    │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ CoreDNS (on some node)       │                   │
│  │                              │                   │
│  │ 2. DNS query                 │                   │
│  │    Returns: 10.43.0.10       │                   │
│  │    (Service IP)              │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  3. Send packet to 10.43.0.10                       │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ kube-proxy (iptables rules)  │                   │
│  │                              │                   │
│  │ 4. Intercept packet to       │                   │
│  │    10.43.0.10                │                   │
│  │ 5. Load balance to backend:  │                   │
│  │    → 10.42.2.8 (on worker-2) │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
└─────────────────┼───────────────────────────────────┘
   6. Destination 10.42.2.8 is on different node
┌─────────────────┼───────────────────────────────────┐
│                 │                                   │
│  ┌──────────────▼───────────────┐                   │
│  │ Flannel CNI (worker-1)       │                   │
│  │                              │                   │
│  │ 7. Encapsulate in VXLAN:     │                   │
│  │    Outer: 10.89.97.202 →     │                   │
│  │           10.89.97.203       │                   │
│  │    Inner: 10.42.1.5 →        │                   │
│  │           10.42.2.8          │                   │
│  └──────────────┬───────────────┘                   │
└─────────────────┼───────────────────────────────────┘
                  │ 8. Send over node network
                  │    (10.89.97.202 → 10.89.97.203)
┌─────────────────────────────────────────────────────┐
│ k3s-worker-2 (10.89.97.203)                         │
│                                                     │
│  ┌──────────────────────────────┐                   │
│  │ Flannel CNI (worker-2)       │                   │
│  │                              │                   │
│  │ 9. Receive VXLAN packet      │                   │
│  │ 10. De-encapsulate           │                   │
│  │ 11. Route to 10.42.2.8       │                   │
│  └──────────────┬───────────────┘                   │
│                 │                                   │
│                 ▼                                   │
│  ┌──────────────────────────────┐                   │
│  │ api pod (10.42.2.8)          │                   │
│  │                              │                   │
│  │ 12. Receive HTTP request     │                   │
│  │ 13. Process and respond      │                   │
│  └──────────────────────────────┘                   │
└─────────────────────────────────────────────────────┘
         (Response follows reverse path)

Key Points:

  • Pods use service DNS names, not IPs
  • kube-proxy translates the Service IP to pod IPs
  • Flannel handles routing between nodes
  • All transparent to the application (just an HTTP call)


Storage Architecture

Understanding storage in Kubernetes is critical because it's separate from both cluster state (etcd/SQLite) and compute resources.

Two Types of Storage

1. Cluster State Storage (etcd/SQLite)

What it stores: Kubernetes objects and configuration

  • Pod definitions
  • Deployments, Services, ConfigMaps
  • RBAC rules
  • Cluster settings

Size: Typically <1GB (just metadata, not application data)

Current setup: SQLite on the master VM

  • Location: /var/lib/rancher/k3s/server/db/state.db
  • Replicas: None (single master)
  • HA option: Use --cluster-init for embedded etcd across 3+ masters

Important: This is NOT where your application data lives!

2. Application Data Storage (Persistent Volumes)

What it stores: Actual application data

  • Database files (PostgreSQL, MySQL)
  • User uploads
  • Application state
  • Logs and metrics

Size: Can be TBs of data

Managed by: Storage providers (Longhorn, NFS, Ceph, cloud providers)


Kubernetes Storage Concepts

PersistentVolume (PV)

  • A piece of storage in the cluster (e.g., 10GB disk)
  • Created by admin or dynamically provisioned
  • Independent lifecycle (survives pod deletion)

PersistentVolumeClaim (PVC)

  • A request for storage by a pod
  • "I need 10GB of SSD storage"
  • Gets bound to a matching PV

StorageClass

  • Defines "types" of storage available
  • Examples: fast-ssd, slow-hdd, network-nfs
  • Enables dynamic provisioning (auto-creates PVs)

Flow:

1. Pod needs storage
2. Creates PVC: "I need 10GB, class=longhorn-ssd"
3. StorageClass provisions a PV (creates 10GB Longhorn volume)
4. PVC binds to PV
5. Pod mounts the volume
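
Step 5 in YAML terms: a pod that mounts a claimed volume (the claim and pod names here are illustrative):

```yaml
# Illustrative pod spec: mounts whatever PV the claim "my-data" is bound to.
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data          # the PV appears here inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-data          # hypothetical PVC from step 2
```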


Storage Options for Homelab

Option 1: Longhorn (Distributed Block Storage) - Current Plan

What is Longhorn?

  • Cloud-native distributed block storage for Kubernetes
  • Created by Rancher (same company as k3s)
  • Replicates data across nodes
  • Automatic snapshots and backups

Our Current Architecture:

┌─────────────────────────────────────────────────┐
│ Proxmox Host (tower)                            │
│                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌────────┐│
│  │ VM 201       │  │ VM 202       │  │ VM 203 ││
│  │ (master)     │  │ (worker-1)   │  │ (w-2)  ││
│  │              │  │              │  │        ││
│  │ 80GB disk    │  │ 80GB disk    │  │ 80GB   ││
│  │ ↓            │  │ ↓            │  │ ↓      ││
│  │ Longhorn     │  │ Longhorn     │  │Longhorn││
│  │ storage      │  │ storage      │  │ storage││
│  └──────────────┘  └──────────────┘  └────────┘│
└─────────────────────────────────────────────────┘

Total: 240GB raw → 120GB usable (2-replica)
Each PV stored on 2 nodes for redundancy

How it works:

# Create a PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi

# Longhorn automatically:
# 1. Creates a 10GB volume
# 2. Replicates to 2 nodes (master + worker-1)
# 3. Binds to PVC
# 4. Pod can mount it

Pros:

  • ✅ Survives single node failure (2-replica redundancy)
  • ✅ Automatic replication and healing
  • ✅ Web UI for management
  • ✅ Snapshots and backups
  • ✅ Pure Kubernetes-native (no external dependencies)
  • ✅ Great for learning cloud-native storage

Cons:

  • ❌ Limited by VM disk size (240GB total, 120GB usable)
  • ❌ Single Proxmox host = single point of failure
  • ❌ Network overhead (replica sync between nodes)
  • ❌ Not ideal for large media files

Best for:

  • Databases (PostgreSQL, MySQL, MongoDB)
  • Application state
  • Small to medium datasets (<100GB)
  • Learning distributed storage concepts


Option 2: NFS from /vault (Centralized Storage)

What is NFS?

  • Network File System
  • Centralized storage accessed over the network
  • Multiple pods can mount the same volume (ReadWriteMany)

Architecture with /vault:

┌─────────────────────────────────────────────────┐
│ Proxmox Host (tower)                            │
│                                                 │
│  ┌────────────────┐                             │
│  │ LXC 101 (NAS)  │                             │
│  │ ZFS pool       │                             │
│  │ /vault         │ ← Large ZFS dataset         │
│  └────────┬───────┘                             │
│           │ NFS export: /vault/k8s-volumes      │
│           ↓                                     │
│  ┌─────────────────────────────────────┐        │
│  │ k8s cluster (VMs 201-203)           │        │
│  │                                     │        │
│  │ NFS CSI driver installed            │        │
│  │ Mounts /vault/k8s-volumes           │        │
│  │                                     │        │
│  │ PVs: /vault/k8s-volumes/pvc-abc/    │        │
│  └─────────────────────────────────────┘        │
└─────────────────────────────────────────────────┘

Setup:

# On NAS (LXC 101)
mkdir -p /vault/k8s-volumes
echo "/vault/k8s-volumes 10.89.97.0/24(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -ra

# In k8s cluster
# Install NFS CSI driver
helm repo add nfs-csi https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm install nfs-csi nfs-csi/csi-driver-nfs

# Create StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-vault
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.89.97.101  # NAS IP
  share: /vault/k8s-volumes
EOF

Pros:

  • ✅ Large capacity (your /vault ZFS pool)
  • ✅ Leverage existing ZFS features (snapshots, compression, dedup)
  • ✅ Easy backups (ZFS send/receive)
  • ✅ Multiple pods can access the same volume (ReadWriteMany)
  • ✅ Good for large datasets (media, logs)

Cons:

  • ❌ Single point of failure (NAS dies = all data gone)
  • ❌ Network bottleneck (all I/O over network)
  • ❌ Not distributed (no HA)
  • ❌ Slower than local storage

Best for:

  • Media libraries (read-only or read-mostly)
  • Shared config files
  • Log aggregation
  • Backups
  • Large datasets that don't fit in Longhorn


Option 3: Ceph (Enterprise Distributed Storage)

What is Ceph?

  • Enterprise-grade distributed storage system
  • Provides object, block, and file storage
  • Industry standard (used by OpenStack, Proxmox, etc.)

Architecture:

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Ceph Node 1  │  │ Ceph Node 2  │  │ Ceph Node 3  │
│              │  │              │  │              │
│ OSD:         │  │ OSD:         │  │ OSD:         │
│ - 2TB SSD    │  │ - 2TB SSD    │  │ - 2TB SSD    │
│ - 4TB HDD    │  │ - 4TB HDD    │  │ - 4TB HDD    │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
              Ceph Cluster (min 3 nodes)
              ┌──────────────────────┐
              │ k8s cluster          │
              │                      │
              │ Rook-Ceph operator   │ ← Manages Ceph
              │ StorageClasses:      │
              │ - ceph-rbd (block)   │
              │ - cephfs (file)      │
              └──────────────────────┘

Components:

  • MON (Monitor): Maintains cluster state (min 3 for quorum)
  • OSD (Object Storage Daemon): Actual storage disks (each disk = 1 OSD)
  • MDS (Metadata Server): For the CephFS file system
  • Rook: Kubernetes operator that manages Ceph

Pros:

  • ✅ True HA (survives multiple node failures)
  • ✅ Highly scalable (add more OSDs to grow)
  • ✅ Multiple access methods (RBD block, CephFS file, S3 object)
  • ✅ Production-grade (used by enterprises)
  • ✅ Excellent resume material

Cons:

  • ❌ Very complex to set up and manage
  • ❌ Resource hungry (min 3 nodes, each with multiple disks + RAM)
  • ❌ Steep learning curve
  • ❌ Requires dedicated nodes (shouldn't run workloads on Ceph nodes)
  • ❌ Min 3 OSDs per storage pool (needs lots of disks)
  • ❌ Overkill for homelab

Requirements:

  • Min 3 physical servers (or VMs with dedicated disks)
  • Each node: 4+ cores, 8GB+ RAM, multiple disks
  • Low-latency network between nodes

Best for:

  • Large homelab (3+ physical servers)
  • Learning enterprise storage
  • High availability requirements
  • Large storage needs (multi-TB)


Tier your storage by use case:

# Fast, HA, small - Longhorn on VM SSDs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ssd
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"

---
# Large, centralized - NFS from /vault
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-bulk
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.89.97.101
  share: /vault/k8s-volumes

---
# Fast, no HA - local VM disk
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path  # k3s default

Usage guide:

Workload               StorageClass   Why
─────────────────────  ─────────────  ─────────────────────────────────
PostgreSQL database    longhorn-ssd   Needs HA, small size, random I/O
MongoDB database       longhorn-ssd   Needs HA, small size
Prometheus metrics     longhorn-ssd   Needs HA, time-series data
Media library (Plex)   nfs-bulk       Large, read-heavy, shared
Log aggregation        nfs-bulk       Large, append-only
Build cache            local-path     Fastest, ephemeral, no HA needed
Temp files             local-path     Ephemeral, speed matters

Example PVC:

# High-performance database
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: longhorn-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---
# Media library (read-only, shared)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-library
spec:
  storageClassName: nfs-bulk
  accessModes:
    - ReadOnlyMany  # Multiple pods can read
  resources:
    requests:
      storage: 1Ti


Scaling Storage: Future Options

Scenario 1: Add More Nodes (Same Proxmox)

Easy - Just scale horizontally:

# Create VMs 204, 205, etc.
# Join as workers
# Longhorn automatically uses their disks

Result:

  • Larger Longhorn storage pool (5 nodes × 80GB = 400GB raw → 200GB usable at 2 replicas)
  • Better fault tolerance (lose 1 of 5 nodes vs 1 of 3)

Limitation: Still single Proxmox host (all VMs die if host dies)


Scenario 2: Multiple Proxmox Hosts (True HA)

Multi-host architecture:

┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
│ Proxmox Host 1      │  │ Proxmox Host 2      │  │ Proxmox Host 3      │
│ (tower)             │  │ (fortress)          │  │ (citadel)           │
│ 192.168.1.10        │  │ 192.168.1.11        │  │ 192.168.1.12        │
│                     │  │                     │  │                     │
│ ┌─────────────────┐ │  │ ┌─────────────────┐ │  │ ┌─────────────────┐ │
│ │ k3s-master-1    │ │  │ │ k3s-master-2    │ │  │ │ k3s-master-3    │ │
│ │ (etcd member)   │ │  │ │ (etcd member)   │ │  │ │ (etcd member)   │ │
│ └─────────────────┘ │  │ └─────────────────┘ │  │ └─────────────────┘ │
│                     │  │                     │  │                     │
│ ┌─────────────────┐ │  │ ┌─────────────────┐ │  │ ┌─────────────────┐ │
│ │ k3s-worker-1    │ │  │ │ k3s-worker-2    │ │  │ │ k3s-worker-3    │ │
│ │ Longhorn        │ │  │ │ Longhorn        │ │  │ │ Longhorn        │ │
│ └─────────────────┘ │  │ └─────────────────┘ │  │ └─────────────────┘ │
└─────────────────────┘  └─────────────────────┘  └─────────────────────┘
         │                        │                        │
         └────────────────────────┴────────────────────────┘
                    Network: LAN or VPN

Benefits:

✅ True HA: entire Proxmox host can die, cluster survives
✅ Control plane HA (etcd quorum: 3 masters across 3 hosts)
✅ Storage HA (Longhorn replicas on different physical hosts)
✅ Lose 1 host → still operational

Requirements:

  • 3+ physical servers
  • Network between hosts (same LAN or VPN)
  • Low latency for etcd (<10ms ideal, <50ms acceptable)

Setup notes:

  • Use --cluster-init on the first master
  • Install k3s on each host's VMs (or bare metal)
  • Longhorn automatically spreads replicas across hosts
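
k3s also reads /etc/rancher/k3s/config.yaml at startup, so the multi-host flags can live in a file instead of install-command arguments. A sketch (the IPs reuse the diagram's host addresses for illustration — your k3s VMs will have their own):

```yaml
# /etc/rancher/k3s/config.yaml on k3s-master-1 (first master)
cluster-init: true
node-ip: 192.168.1.10
tls-san:
  - 192.168.1.10

# On k3s-master-2 and k3s-master-3, point at the first master instead:
# server: https://192.168.1.10:6443
# token: <node-token from master-1>
# node-ip: 192.168.1.11
```

With the file in place, the standard `curl -sfL https://get.k3s.io | sh -` install picks the settings up automatically.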


Scenario 3: Hybrid Storage Tiers

Combine everything:

Current Phase (Phase 2):
┌────────────────────────────────────────┐
│ Longhorn on VM disks (120GB usable)    │
│ - Use for: Databases, app state        │
│ - Replicas: 2                          │
│ - Performance: Good                    │
└────────────────────────────────────────┘

Future Phase 3:
┌────────────────────────────────────────┐
│ Add NFS from /vault                    │
│ - Use for: Media, logs, backups        │
│ - Capacity: Multi-TB                   │
│ - Performance: Medium                  │
└────────────────────────────────────────┘

Future Phase 4 (Multi-host):
┌────────────────────────────────────────┐
│ Longhorn across 3 Proxmox hosts        │
│ - Use for: Critical HA data            │
│ - Survives host failure                │
│ - Performance: Good                    │
└────────────────────────────────────────┘

Optional Phase 5 (Enterprise):
┌────────────────────────────────────────┐
│ Ceph cluster (3+ nodes)                │
│ - Use for: Everything (consolidate)    │
│ - RBD, CephFS, S3                      │
│ - Resume material                      │
└────────────────────────────────────────┘

Storage Best Practices

1. Choose the Right Storage for the Workload

Don't use Longhorn for everything:

✅ Databases → Longhorn (needs HA, small size)
❌ Media files → use NFS instead (large, centralized is better)
❌ Build artifacts → use local-path instead (ephemeral, speed matters)

2. Set Resource Limits

# Don't let one app consume all storage
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: my-app
spec:
  hard:
    requests.storage: "50Gi"
    persistentvolumeclaims: "5"

3. Monitor Storage Usage

# Check Longhorn volumes
kubectl get pv

# Check PVC usage
kubectl get pvc -A

# Longhorn UI (Phase 2)
# Access at: http://<node-ip>:30080

4. Backup Regularly

Longhorn: Built-in snapshots and backups to S3/NFS

NFS: ZFS snapshots on /vault

# On NAS
zfs snapshot rpool/data/subvol-101-disk-0@k8s-backup-$(date +%Y%m%d)

Best practice: Backup to external location (not same Proxmox host)
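
Longhorn backups can also be scheduled declaratively via its RecurringJob CRD. A sketch (the schedule and retention are example values, and the S3/NFS backup target must be configured separately in Longhorn's settings):

```yaml
# Nightly Longhorn backup job (longhorn.io/v1beta2 RecurringJob)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"   # every night at 03:00
  task: "backup"       # use "snapshot" for local-only snapshots
  groups:
    - default          # applies to all volumes in the default group
  retain: 7            # keep the last 7 backups
  concurrency: 2       # at most 2 volumes backed up in parallel
```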

5. Plan for Growth

Current: 120GB usable (Longhorn 2-replica on 3×80GB)

Growth path:

1. Add more worker nodes (more disks in pool)
2. Increase VM disk size (qm resize on Proxmox)
3. Add NFS tier for bulk data
4. Multi-host for true HA
5. Consider Ceph for 500GB+ persistent data


Common Questions

Q: Can I use multiple storage providers at once? A: Yes! That's the hybrid approach. Longhorn + NFS + local-path all coexist.

Q: Can I migrate PVs between storage classes? A: Not directly. Need to backup data, delete PVC, create new PVC with different class, restore data.

Q: What happens if a Longhorn replica node dies? A: Longhorn automatically rebuilds the replica on a healthy node. No data loss (if 2+ replicas exist).

Q: Can I expand a PV after creation? A: Yes, if StorageClass has allowVolumeExpansion: true. Edit PVC size, Longhorn auto-expands.
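
For example, expansion only works if the StorageClass opted in. This sketch revisits the longhorn-ssd class from above with expansion enabled:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ssd
provisioner: driver.longhorn.io
allowVolumeExpansion: true   # without this, PVC size edits are rejected
parameters:
  numberOfReplicas: "2"
```

Then growing a claim is a one-liner: `kubectl patch pvc postgres-data -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'` (sizes can only increase, never shrink).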

Q: Should I use Longhorn or NFS for Prometheus? A: Longhorn. Prometheus does lots of random I/O (time-series database), needs fast local storage.

Q: Can I access /vault directly from pods? A: Yes, via NFS CSI driver. Or mount /vault as hostPath (not recommended - ties pod to specific node).
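
To illustrate the direct-mount option: a pod can declare an in-tree `nfs` volume with no StorageClass at all. This sketch assumes the NAS export from the nfs-bulk example (server 10.89.97.101; the /vault path is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vault-reader
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: vault
          mountPath: /vault
          readOnly: true
  volumes:
    - name: vault
      nfs:                   # in-tree NFS mount, no CSI driver needed
        server: 10.89.97.101
        path: /vault
```

Unlike hostPath, this works from any node, but you lose dynamic provisioning — the CSI driver is still the better default.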

Q: Do I need Ceph? A: No, not for homelab. Longhorn + NFS covers 99% of use cases. Ceph is for large-scale deployments.


Further Reading

  • Longhorn docs: https://longhorn.io/docs/
  • Kubernetes storage: https://kubernetes.io/docs/concepts/storage/
  • Ceph: https://docs.ceph.com/
  • Rook (Ceph operator): https://rook.io/
  • NFS CSI driver: https://github.com/kubernetes-csi/csi-driver-nfs

What Makes Our Setup Easy

1. Same LAN (10.89.97.0/24)

Benefit: Nodes can talk to each other directly

  • No VPN needed
  • No port forwarding
  • No NAT traversal
  • Low latency (<1ms between nodes)

Comparison to multi-site:

  • Multi-site needs VPN or public IPs
  • Higher latency (50-100ms+ typical)
  • More complex firewall rules

2. All on Same Proxmox Host

Benefits:

  • Easy VM management (qm list, qm start, etc.)
  • Shared storage access (could mount host paths if needed)
  • Console access via Proxmox UI
  • Backup entire cluster with Proxmox backup
  • Resource monitoring in one place

Comparison to separate hardware:

  • Separate hardware = physically visiting machines
  • Different management tools per hypervisor
  • Harder to troubleshoot (no centralized console)

3. Fresh Install / No Legacy Constraints

Benefits:

  • Latest k3s version (v1.33.5)
  • Clean networking (no old iptables rules)
  • No migration concerns (not moving from an existing system)
  • Can make breaking changes freely (learning environment)

4. k3s Defaults Work Out of the Box

What we didn't need to configure:

✅ Flannel CNI (pod networking) - included
✅ CoreDNS (DNS) - included
✅ local-path storage - included
✅ Traefik ingress - included (we'll replace with MetalLB)
✅ Service account tokens - automatic

Comparison to vanilla k8s (kubeadm):

  • Must install a CNI separately (Calico, Flannel, etc.)
  • CoreDNS is deployed, but its pods stay Pending until the CNI is up
  • No default StorageClass
  • No ingress controller by default
  • Much more YAML to write
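
To make the gap concrete, here's roughly what those extra steps look like on vanilla k8s (a sketch; the pod CIDR is Flannel's documented default and the apply URL is Flannel's published manifest):

```shell
# Vanilla k8s: steps k3s does for you
sudo kubeadm init --pod-network-cidr=10.244.0.0/16   # control plane only

# CNI is not included - install Flannel manually
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Still no default StorageClass or ingress controller; each needs
# its own install (e.g. local-path-provisioner, ingress-nginx)
```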

5. Learning Environment (Can Break Things)

Benefits:

  • Can experiment without fear
  • Easy to rebuild (VMs are disposable)
  • Mistakes are learning opportunities
  • No SLA requirements


What Changes for Multi-Site

Scenario: Master at Home, Workers in Cloud

Challenges:

1. Connectivity

  • Home IP likely dynamic (DHCP from ISP)
  • NAT/firewall at home router
  • Cloud nodes have public IPs
  • Can't directly connect without setup

Solutions:

  • Dynamic DNS: DuckDNS or Cloudflare DDNS for the home IP
  • VPN: Tailscale/WireGuard for direct node-to-node traffic
  • Port forwarding: Forward 6443 on the home router (less secure)

2. Latency

  • Home to cloud: 50-150ms typical
  • etcd's Raft consensus degrades badly above ~100ms latency
  • Impact: HA masters won't work reliably

Solutions:

  • Single master at home (accept a single point of failure)
  • Master in cloud (lower latency to cloud workers)
  • Separate clusters (home cluster + cloud cluster)

3. Security

  • Exposing home network to internet (if not using VPN)
  • More attack surface

Solutions:

  • Mandatory VPN (Tailscale recommended)
  • Client certificates (already in place - k3s default)
  • Firewall rules on each node (allow only cluster node IPs)
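
A hedged sketch of the per-node firewall idea using ufw, assuming the Tailscale setup described below (100.64.0.0/10 is Tailscale's address range; adjust for your VPN and distro):

```shell
# Default deny inbound, then allow only cluster traffic over the VPN
sudo ufw default deny incoming
sudo ufw allow in on tailscale0                                 # node-to-node traffic via VPN only
sudo ufw allow from 100.64.0.0/10 to any port 6443 proto tcp    # k8s API from VPN peers
sudo ufw allow 22/tcp                                           # keep SSH reachable
sudo ufw enable
```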

4. Node Network

Current (same LAN):

Node Network: 10.89.97.0/24
- All nodes see each other directly

Multi-site with VPN:

Physical Networks:
- Home: 192.168.1.0/24
- Cloud: (public IPs)

VPN Network (Tailscale): 100.64.0.0/10
- Master: 100.64.1.1
- Worker-1: 100.64.1.2
- Worker-2: 100.64.1.3

k3s uses VPN IPs for node communication

Installation changes:

# Master (home, Tailscale IP 100.64.1.1)
curl -sfL https://get.k3s.io | sh -s - server \
  --node-ip 100.64.1.1 \
  --advertise-address 100.64.1.1 \
  --tls-san 100.64.1.1

# Worker (cloud, Tailscale IP 100.64.1.2)
curl -sfL https://get.k3s.io | \
  K3S_URL=https://100.64.1.1:6443 \
  K3S_TOKEN=xxx \
  sh -s - --node-ip 100.64.1.2

5. Flannel Configuration

Flannel needs to know which interface to use for VXLAN.

Current (same LAN): Auto-detects eth0 (10.89.97.x)

Multi-site with VPN:

# Tell Flannel to use Tailscale interface
curl -sfL https://get.k3s.io | sh -s - server \
  --flannel-iface tailscale0

Option 1: VPN Mesh (Best for Learning)

Home ←→ Tailscale/Wireguard ←→ Cloud

✅ Secure
✅ Same experience as LAN
✅ No public k8s API exposure
❌ Latency still limits HA masters

Option 2: Cloud-Only Cluster with Home Access

Cloud: k3s cluster (master + workers)
Home: kubectl only (via kubeconfig)

✅ Low latency (all nodes in same datacenter)
✅ Can do HA masters
✅ Home just uses kubectl
❌ Home not part of cluster (can't run workloads at home)

Option 3: Edge Clusters with Federation

Home Cluster: k3s (master + workers)
Cloud Cluster: k3s (master + workers)
Federation: ArgoCD/Flux sync apps to both

✅ Each cluster independent (no cross-site latency)
✅ Can do HA masters in each cluster
✅ Deploy apps to both with GitOps
❌ More complex (managing 2 clusters)

For homelab learning: Option 1 (VPN mesh) is best. Gives you multi-site experience without complexity.


Practical Examples

Add a Worker Node

# 1. Get the join token from master
ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token'
# Output: K10abc123...::server:xyz789

# 2. On new node (10.89.97.204), install k3s as agent
curl -sfL https://get.k3s.io | \
  K3S_URL=https://10.89.97.201:6443 \
  K3S_TOKEN=K10abc123...::server:xyz789 \
  sh -

# 3. Verify from master
kubectl get nodes
# Should show new node in "Ready" state within 30 seconds

Check Cluster Certificates

# View API server certificate
openssl s_client -connect 10.89.97.201:6443 -showcerts 2>/dev/null | \
  openssl x509 -noout -text | grep -A 2 "Subject:"

# View kubelet certificate
ssh root@10.89.97.202 'openssl x509 -in /var/lib/rancher/k3s/agent/client-kubelet.crt -noout -text | grep -A 2 "Subject:"'

# Check certificate expiration
ssh root@10.89.97.201 'openssl x509 -in /var/lib/rancher/k3s/server/tls/server-ca.crt -noout -enddate'

Troubleshoot Network Issues

# 1. Check node connectivity
ping 10.89.97.202  # Can nodes reach each other?

# 2. Check API server access
curl -k https://10.89.97.201:6443/healthz
# Should return "ok"

# 3. Check Flannel VXLAN
ip -d link show flannel.1  # Should see VXLAN interface

# 4. Test pod-to-pod networking
# Deploy test pods
kubectl run test1 --image=busybox --command -- sleep 3600
kubectl run test2 --image=busybox --command -- sleep 3600

# Get pod IPs
kubectl get pods -o wide
# test1: 10.42.1.5 on worker-1
# test2: 10.42.2.8 on worker-2

# Exec into test1 and ping test2
kubectl exec -it test1 -- ping 10.42.2.8
# Should work!

# 5. Check DNS
kubectl exec -it test1 -- nslookup kubernetes.default
# Should resolve to service IP (10.43.0.1)

# Cleanup
kubectl delete pod test1 test2

Migrate from Single to HA Masters

# Current: 1 master (SQLite backend)
# Goal: 3 masters (embedded etcd)

# 1. BACKUP FIRST!
ssh root@10.89.97.201 'tar czf /tmp/k3s-backup.tar.gz /var/lib/rancher/k3s'
scp root@10.89.97.201:/tmp/k3s-backup.tar.gz ./

# 2. Drain workloads from master
kubectl drain k3s-master --ignore-daemonsets --delete-emptydir-data

# 3. Stop k3s on master
ssh root@10.89.97.201 'systemctl stop k3s'

# 4. Convert to embedded etcd (cluster-init)
# Simplest: set the flag in the k3s config file instead of editing the unit
ssh root@10.89.97.201 'echo "cluster-init: true" >> /etc/rancher/k3s/config.yaml'

# 5. Restart
ssh root@10.89.97.201 'systemctl start k3s'

# 6. Verify
kubectl get nodes

# 7. Join second master
# (on new master VM 204)
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# 8. Join third master
# (on new master VM 205)
curl -sfL https://get.k3s.io | sh -s - server \
  --server https://10.89.97.201:6443 \
  --token K10abc123...

# 9. Verify etcd cluster
ssh root@10.89.97.201 'k3s etcd-snapshot ls'

# 10. Point kubectl at the HA control plane (optional)
# A kubeconfig cluster entry holds ONE server URL - either add a
# context per master, or put a load balancer/VIP in front of :6443
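
Since a kubeconfig cluster entry can only name one server, "all master IPs" in practice means one context per master (or a load balancer). A sketch of the multi-context variant (names, the second master's IP, and the credential placeholders are illustrative):

```yaml
# ~/.kube/config sketch: one context per master, switch manually on failure
apiVersion: v1
kind: Config
clusters:
  - name: tower-m1
    cluster:
      server: https://10.89.97.201:6443
      certificate-authority-data: <base64 CA>   # same cluster CA for every master
  - name: tower-m2
    cluster:
      server: https://10.89.97.204:6443
      certificate-authority-data: <base64 CA>
users:
  - name: admin
    user:
      client-certificate-data: <base64 cert>
      client-key-data: <base64 key>
contexts:
  - name: m1
    context: {cluster: tower-m1, user: admin}
  - name: m2
    context: {cluster: tower-m2, user: admin}
current-context: m1
```

If master 1 goes down: `kubectl config use-context m2`. A small HAProxy/keepalived VIP removes even that manual step.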

View Component Logs

# API server logs
ssh root@10.89.97.201 'journalctl -u k3s -f | grep apiserver'

# kubelet logs
ssh root@10.89.97.202 'journalctl -u k3s-agent -f'

# Scheduler decisions
ssh root@10.89.97.201 'journalctl -u k3s -f | grep scheduler'

# Controller logs
ssh root@10.89.97.201 'journalctl -u k3s -f | grep controller'

# All k3s logs
ssh root@10.89.97.201 'journalctl -u k3s -f'

Further Reading

  • Official k3s docs: https://docs.k3s.io
  • Kubernetes concepts: https://kubernetes.io/docs/concepts/
  • Flannel networking: https://github.com/flannel-io/flannel
  • etcd documentation: https://etcd.io/docs/
  • Kubernetes networking guide: https://kubernetes.io/docs/concepts/cluster-administration/networking/
  • RBAC: https://kubernetes.io/docs/reference/access-authn-authz/rbac/

Questions This Answers

  • ✅ What is Kubernetes and why use it?
  • ✅ How is k3s different from k8s?
  • ✅ What runs on master vs worker nodes?
  • ✅ How do I add more worker nodes?
  • ✅ How do I add more master nodes?
  • ✅ What's the difference between master and worker?
  • ✅ How does networking work in our same-LAN setup?
  • ✅ How would networking work across different networks/machines?
  • ✅ How does authentication work?
  • ✅ How does routing work?
  • ✅ How does storage work (etcd/SQLite vs Longhorn/NFS)?
  • ✅ What if I add more servers or media shares?
  • ✅ What are the storage options (Longhorn, NFS, Ceph)?
  • ✅ When do I need Ceph vs Longhorn?
  • ✅ How does Longhorn provide HA?
  • ✅ What makes our setup easy?
  • ✅ What would change for multi-site deployments?

Next: 02-kubectl-kubeconfig.md - How to manage the cluster with kubectl