
Tower Fleet - K3s Installation Guide

Last Updated: 2025-12-10 | Version: v1.1 | Tested On: Proxmox VE 8.3.5, Debian 13 (Trixie)

This guide documents the complete process of building the tower-fleet k3s cluster from scratch. Follow these steps to recreate the cluster or troubleshoot issues.


Prerequisites

Hardware Requirements:
- Proxmox host with 12+ cores, 24GB+ RAM available
- 240GB+ storage for VMs (ZFS recommended)
- Network: 10.89.97.0/24 with DHCP or static IPs available

Access:
- Root SSH access to Proxmox host
- GitHub account (for GitOps with Flux)
- SSH keys configured in ~/.ssh/authorized_keys on Proxmox

Knowledge Level:
- Basic Linux/SSH skills
- Understanding of Kubernetes concepts (pods, deployments, services)
- Git basics (clone, commit, push)


Phase 1: VM Creation

Step 1.1: Download Debian 13 (Trixie) Cloud Image

The Debian cloud image is optimized for VMs with cloud-init support.

# Download to Proxmox template storage
cd /var/lib/vz/template/iso
wget https://cloud.debian.org/images/cloud/trixie/daily/latest/debian-13-generic-amd64-daily.qcow2

# Verify download
ls -lh debian-13-generic-amd64-daily.qcow2
# Should be ~450-500MB
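
Optionally, verify the image checksum as well. This is a minimal sketch assuming the same directory also publishes a SHA512SUMS file (Debian's cloud image mirrors normally do):

# Fetch the checksum list from the same directory and verify the download
wget -q https://cloud.debian.org/images/cloud/trixie/daily/latest/SHA512SUMS
sha512sum -c --ignore-missing SHA512SUMS
# Expected: debian-13-generic-amd64-daily.qcow2: OK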

Why cloud image?
- Pre-configured for cloud-init (automated provisioning)
- Smaller than a full ISO
- Faster boot times
- No manual installation needed


Step 1.2: Check Available Storage

# List Proxmox storage pools
pvesm status

# Example output:
# Name          Type      Status    Total         Used      Available
# local         dir       active    1.4T          500G      900G
# local-zfs     zfspool   active    1.3T          344G      1009G

Decision Point: Note your storage name (we use local-zfs)
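
If the list is long, pvesm can also filter to storages that hold VM disk images; this is a small convenience and assumes your Proxmox version supports the --content filter:

# Show only storages that support VM disk images
pvesm status --content images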


Step 1.3: Create the VMs

Create the VM creation script:

cat > /tmp/create-k3s-vms.sh << 'EOF'
#!/bin/bash
set -e

# Configuration
TEMPLATE_IMG="/var/lib/vz/template/iso/debian-13-generic-amd64-daily.qcow2"
STORAGE="local-zfs"  # ← CHANGE THIS to match your storage from Step 1.2
BRIDGE="vmbr0"       # ← CHANGE if you use different bridge
GATEWAY="10.89.97.1" # ← CHANGE to your network gateway

# VM configurations (VMID:NAME:IP)
declare -A VMS=(
  [201]="k3s-master:10.89.97.201"
  [202]="k3s-worker-1:10.89.97.202"
  [203]="k3s-worker-2:10.89.97.203"
)

# Create each VM
for VMID in "${!VMS[@]}"; do
  IFS=':' read -r NAME IP <<< "${VMS[$VMID]}"

  echo "Creating VM $VMID: $NAME ($IP)..."

  # Create base VM
  qm create $VMID \
    --name "$NAME" \
    --memory 8192 \
    --cores 4 \
    --net0 virtio,bridge=$BRIDGE \
    --scsihw virtio-scsi-single \
    --ostype l26 \
    --cpu host \
    --agent enabled=1

  # Import and attach disk
  qm importdisk $VMID "$TEMPLATE_IMG" $STORAGE
  qm set $VMID --scsi0 ${STORAGE}:vm-${VMID}-disk-0
  qm set $VMID --boot order=scsi0

  # IMPORTANT: Resize disk to 80GB (importdisk creates ~3GB disk from cloud image)
  qm resize $VMID scsi0 +77G

  # Add cloud-init drive
  qm set $VMID --ide2 ${STORAGE}:cloudinit

  # Configure cloud-init (network, SSH keys)
  qm set $VMID \
    --ciuser root \
    --cipassword "k3s-temp-password" \
    --ipconfig0 "ip=${IP}/24,gw=${GATEWAY}" \
    --nameserver 1.1.1.1 \
    --searchdomain local \
    --sshkeys ~/.ssh/authorized_keys

  echo "✓ VM $VMID created"
done

# Start all VMs
echo ""
echo "Starting VMs..."
for VMID in "${!VMS[@]}"; do
  qm start $VMID
done

# Wait for boot
echo "Waiting 30 seconds for VMs to boot..."
sleep 30

# Test SSH connectivity
echo ""
echo "Testing SSH connectivity..."
for VMID in "${!VMS[@]}"; do
  IFS=':' read -r NAME IP <<< "${VMS[$VMID]}"
  if ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 root@$IP "echo 'VM $VMID reachable'" 2>/dev/null; then
    echo "✓ $NAME ($IP) - SSH ready"
  else
    echo "✗ $NAME ($IP) - SSH not ready (give it more time)"
  fi
done

echo ""
echo "VM creation complete!"
echo "Next: Install k3s (see Phase 2)"
EOF

chmod +x /tmp/create-k3s-vms.sh

Run the script:

/tmp/create-k3s-vms.sh

Expected output:

Creating VM 201: k3s-master (10.89.97.201)...
✓ VM 201 created
Creating VM 202: k3s-worker-1 (10.89.97.202)...
✓ VM 202 created
Creating VM 203: k3s-worker-2 (10.89.97.203)...
✓ VM 203 created
Starting VMs...
...
✓ k3s-master (10.89.97.201) - SSH ready
✓ k3s-worker-1 (10.89.97.202) - SSH ready
✓ k3s-worker-2 (10.89.97.203) - SSH ready
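
The script enables the QEMU guest agent option (--agent enabled=1) on each VM, but the Debian cloud image may not ship the agent package itself. If you want Proxmox to report guest IPs and perform clean agent-based shutdowns, you can optionally install it inside each VM (skip this if the package is already present):

# Optional: install the guest agent on each node
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  ssh root@$ip 'apt-get update -qq && apt-get install -y qemu-guest-agent'
done
# The agent device was enabled at VM creation, so the service should start once installed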


Step 1.4: Verify VM Configuration

CRITICAL: Verify disk sizes before proceeding!

# Check disk configuration for all VMs
for vmid in 201 202 203; do
  echo "=== VM $vmid ==="
  qm config $vmid | grep scsi0
done

Expected output:

=== VM 201 ===
scsi0: local-zfs:vm-201-disk-0,size=80G
=== VM 202 ===
scsi0: local-zfs:vm-202-disk-0,size=80G
=== VM 203 ===
scsi0: local-zfs:vm-203-disk-0,size=80G

If you see size=3G or a similarly small size:

This means the disk resize didn't happen. The cloud image is ~3GB and wasn't expanded. Fix it now:

# Resize each VM disk to 80GB
for vmid in 201 202 203; do
  echo "Resizing VM $vmid..."
  qm resize $vmid scsi0 80G
done

# Verify
for vmid in 201 202 203; do
  qm config $vmid | grep scsi0
done

Then extend the filesystem inside each VM:

# For each node
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== Extending filesystem on $ip ==="
  ssh root@$ip 'growpart /dev/sda 1 && resize2fs /dev/sda1 && df -h /'
done

Expected output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        79G  1.8G   74G   3% /

Why this matters:
- Longhorn needs substantial disk space (we allocated 80GB per node = 240GB total)
- With 2 replicas per volume, that's roughly 120GB of usable storage
- Small disks (3GB) cause DiskPressure errors and pod evictions
- Phase 3 (Longhorn installation) will fail without proper disk space


Troubleshooting:

| Issue | Solution |
|-------|----------|
| "storage 'local-zfs' does not exist" | Run pvesm status and update STORAGE= in the script |
| "VMID 201 already exists" | Remove the old VMs first (qm destroy takes one VMID at a time; see the cleanup loop below) |
| SSH not ready after 30s | Wait longer (some systems take 60s), then test: ssh root@10.89.97.201 |
| Network unreachable | Verify GATEWAY IP is correct for your network |
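
For the "VMID already exists" case, qm destroy only accepts a single VMID, so a small loop cleans up all three (stopping each one first in case it is running):

# Remove leftover VMs before re-running the creation script
for vmid in 201 202 203; do
  qm stop $vmid 2>/dev/null || true   # ignore errors if already stopped
  qm destroy $vmid
done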

Phase 2: k3s Installation

Step 2.1: Install k3s on Master Node

What we're doing:
- Installing k3s "server" (control plane + worker combined)
- Disabling Traefik (we'll use our own ingress later)
- Disabling servicelb (we'll use MetalLB for LoadBalancer support)

Command:

ssh root@10.89.97.201 'curl -sfL https://get.k3s.io | sh -s - server \
  --write-kubeconfig-mode=644 \
  --disable=traefik \
  --disable=servicelb'

Explanation of flags:
- server: installs k3s in server mode (control plane)
- --write-kubeconfig-mode=644: makes kubeconfig readable by all users (homelab only!)
- --disable=traefik: we'll install our own ingress controller later
- --disable=servicelb: we'll use MetalLB for better LoadBalancer support

Expected output:

[INFO]  Finding release for channel stable
[INFO]  Using v1.33.5+k3s1 as release
[INFO]  Downloading binary...
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  systemd: Starting k3s

Verify installation:

ssh root@10.89.97.201 'kubectl get nodes'

Expected output:

NAME         STATUS   ROLES                  AGE   VERSION
k3s-master   Ready    control-plane,master   30s   v1.33.5+k3s1
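
Because traefik and servicelb were disabled at install time, a quick scan of kube-system should turn up no Traefik or svclb pods. This is just a convenience check built from standard kubectl and grep:

ssh root@10.89.97.201 "kubectl get pods -n kube-system | grep -iE 'traefik|svclb' || echo 'none found (expected)'"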

Troubleshooting:

| Issue | Solution |
|-------|----------|
| "Connection refused" | k3s still starting, wait 30s and retry |
| Node shows "NotReady" | Check logs: ssh root@10.89.97.201 'journalctl -u k3s -f' |
| Download fails | Check internet connectivity: ssh root@10.89.97.201 'ping -c 3 google.com' |

Step 2.2: Get Join Token for Workers

What is this? The join token is a secret that authenticates worker nodes when they join the cluster.

Command:

K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
echo "Join token: $K3S_TOKEN"

Expected output:

Join token: K10c6775699...::server:11d2e255...

Save this token - you'll need it if you add more workers later.
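
If you want to keep it around, one option is to stash it in a root-only file on the Proxmox host. The path below is just an example; no later step depends on it:

# Save the join token with restrictive permissions (example path)
(umask 077; echo "$K3S_TOKEN" > /root/.k3s-join-token)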


Step 2.3: Join Worker Nodes

What we're doing:
- Installing k3s in "agent" mode (worker only)
- Pointing agents to the master's API server
- Using the join token for authentication

Join worker-1:

K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
K3S_URL="https://10.89.97.201:6443"

ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -"

Expected output:

[INFO]  Finding release for channel stable
[INFO]  Using v1.33.5+k3s1 as release
[INFO]  Downloading binary...
[INFO]  systemd: Starting k3s-agent

Join worker-2:

ssh root@10.89.97.203 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -"

Or join both in parallel:

K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
K3S_URL="https://10.89.97.201:6443"

# Join both workers simultaneously
ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -" &
ssh root@10.89.97.203 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -" &
wait

echo "Both workers joined!"

Step 2.4: Verify Cluster Health

Check all nodes are ready:

ssh root@10.89.97.201 'kubectl get nodes -o wide'

Expected output:

NAME            STATUS   ROLES                  AGE   VERSION        INTERNAL-IP
k3s-master      Ready    control-plane,master   5m    v1.33.5+k3s1   10.89.97.201
k3s-worker-1    Ready    <none>                 2m    v1.33.5+k3s1   10.89.97.202
k3s-worker-2    Ready    <none>                 2m    v1.33.5+k3s1   10.89.97.203

Check system pods:

ssh root@10.89.97.201 'kubectl get pods -A'

Expected output:

NAMESPACE     NAME                                     READY   STATUS    RESTARTS   AGE
kube-system   coredns-...                              1/1     Running   0          5m
kube-system   local-path-provisioner-...               1/1     Running   0          5m
kube-system   metrics-server-...                       1/1     Running   0          5m
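
As a quick scan for anything unhealthy, you can filter on pod phase. Completed Jobs (phase Succeeded) would also show up here, so treat this as a convenience check rather than a strict test:

ssh root@10.89.97.201 'kubectl get pods -A --field-selector=status.phase!=Running'
# Expect "No resources found" on a healthy, freshly installed cluster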

Troubleshooting:

| Issue | Solution |
|-------|----------|
| Node shows "NotReady" | Wait 60s (node registration takes time) |
| Node missing from list | Check worker logs: ssh root@10.89.97.202 'journalctl -u k3s-agent -f' |
| CoreDNS in CrashLoopBackOff | Check master logs: ssh root@10.89.97.201 'journalctl -u k3s -f' |

Step 2.5: Set Up kubectl Access from Proxmox Host

Why? You'll want to manage the cluster from the Proxmox host without SSHing to the master every time.

Copy kubeconfig:

# Create kube config directory
mkdir -p ~/.kube

# Copy kubeconfig from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config

# Update server IP (it defaults to 127.0.0.1)
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config

# Set permissions
chmod 600 ~/.kube/config

Install kubectl on Proxmox (if not already installed):

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl /usr/local/bin/

# Verify
kubectl version --client
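
Optionally, verify the binary against its published checksum; the release server hosts a matching .sha256 file for each binary. The check below runs against the installed path from the step above:

# Download the checksum for the same release and compare
VER=$(curl -L -s https://dl.k8s.io/release/stable.txt)
curl -LO "https://dl.k8s.io/release/${VER}/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  /usr/local/bin/kubectl" | sha256sum --check
# Expected: /usr/local/bin/kubectl: OK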

Test access:

kubectl get nodes

Expected output:

NAME            STATUS   ROLES                  AGE   VERSION
k3s-master      Ready    control-plane,master   10m   v1.33.5+k3s1
k3s-worker-1    Ready    <none>                 7m    v1.33.5+k3s1
k3s-worker-2    Ready    <none>                 7m    v1.33.5+k3s1

Add alias for convenience:

echo "alias k='kubectl'" >> ~/.bashrc
source ~/.bashrc

# Now you can use 'k' instead of 'kubectl'
k get nodes
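
If you also want tab completion for kubectl (and for the k alias), kubectl can generate a bash completion script. This sketch assumes the bash-completion package is available on the Proxmox host:

apt-get install -y bash-completion
echo 'source <(kubectl completion bash)' >> ~/.bashrc
echo 'complete -o default -F __start_kubectl k' >> ~/.bashrc
source ~/.bashrc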

Phase 2 Complete! ✓

What we've accomplished:
- ✅ 3-node k3s cluster running
- ✅ All nodes in "Ready" state
- ✅ kubectl access from Proxmox host
- ✅ System pods (CoreDNS, metrics-server) running

Next Steps:
- Phase 3: Core Infrastructure (MetalLB, Longhorn, cert-manager)
- Phase 4: GitOps with Flux
- Phase 5: Observability Stack


Quick Reference Commands

# Check cluster health
kubectl get nodes
kubectl get pods -A

# SSH to nodes
ssh root@10.89.97.201  # master
ssh root@10.89.97.202  # worker-1
ssh root@10.89.97.203  # worker-2

# View k3s logs
ssh root@10.89.97.201 'journalctl -u k3s -f'           # master
ssh root@10.89.97.202 'journalctl -u k3s-agent -f'     # worker

# Restart k3s
ssh root@10.89.97.201 'systemctl restart k3s'          # master
ssh root@10.89.97.202 'systemctl restart k3s-agent'    # worker

# Stop/start VMs (qm takes one VMID at a time)
for vmid in 201 202 203; do qm start $vmid; done
for vmid in 201 202 203; do qm stop $vmid; done

Disaster Recovery

Rebuild Cluster from Scratch

If something goes horribly wrong:

# 1. Stop and destroy VMs (qm takes one VMID at a time)
for vmid in 201 202 203; do
  qm stop $vmid
  qm destroy $vmid
done

# 2. Re-run VM creation script
/tmp/create-k3s-vms.sh

# 3. Reinstall k3s (follow Phase 2 steps)

Add Additional Worker Node

# 1. Create VM (use VMID 204+)
# 2. Get token from master
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')

# 3. Join new worker
ssh root@NEW_WORKER_IP "curl -sfL https://get.k3s.io | K3S_URL='https://10.89.97.201:6443' K3S_TOKEN='${K3S_TOKEN}' sh -"

Lessons Learned

2025-11-09: Initial Installation

What Went Well:
- Cloud-init made VM provisioning fast (<5 min total)
- k3s installation is straightforward and well-documented
- Debian 13 (Trixie) images are stable and modern

Challenges:
- Had to identify the correct Proxmox storage name (local-zfs, not local-lvm)
- Initial script attempt failed, needed cleanup with qm destroy

Improvements Made:
- Added storage name check step to guide
- Added troubleshooting section for common issues
- Documented all commands with explanations for learning

Would Do Differently:
- Could use Terraform/Ansible for fully automated provisioning
- Consider setting up HA control plane (3 master nodes) for production-like experience


Next: Phase 3 - Core Infrastructure

Continue to Phase 3 to install:
- MetalLB (LoadBalancer services)
- Longhorn (distributed storage)
- cert-manager (SSL certificates)

See TOWER_FLEET.md for overall project timeline.