Tower Fleet - K3s Installation Guide¶
Last Updated: 2025-12-10 · Version: v1.1 · Tested On: Proxmox VE 8.3.5, Debian 13 (Trixie)
This guide documents the complete process of building the tower-fleet k3s cluster from scratch. Follow these steps to recreate the cluster or troubleshoot issues.
Prerequisites¶
Hardware Requirements:

- Proxmox host with 12+ cores, 24GB+ RAM available
- 240GB+ storage for VMs (ZFS recommended)
- Network: 10.89.97.0/24 with DHCP or static IPs available
Access:
- Root SSH access to Proxmox host
- GitHub account (for GitOps with Flux)
- SSH keys configured in ~/.ssh/authorized_keys on Proxmox
Knowledge Level:

- Basic Linux/SSH skills
- Understanding of Kubernetes concepts (pods, deployments, services)
- Git basics (clone, commit, push)
Phase 1: VM Creation¶
Step 1.1: Download Debian 13 (Trixie) Cloud Image¶
The Debian cloud image is optimized for VMs with cloud-init support.
# Download to Proxmox template storage
cd /var/lib/vz/template/iso
wget https://cloud.debian.org/images/cloud/trixie/daily/latest/debian-13-generic-amd64-daily.qcow2
# Verify download
ls -lh debian-13-generic-amd64-daily.qcow2
# Should be ~450-500MB
Why cloud image?

- Pre-configured for cloud-init (automated provisioning)
- Smaller than a full ISO
- Faster boot times
- No manual installation needed
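If you want to verify the image beyond its size, Debian publishes a SHA512SUMS file alongside each cloud image (this assumes the usual mirror layout; on the Proxmox host you would fetch SHA512SUMS from the same directory as the image and run `sha512sum -c --ignore-missing SHA512SUMS`). The demo below exercises the same mechanics on a throwaway file so it runs anywhere:

```shell
# Demo of checksum verification with a throwaway file (stands in for the
# real qcow2 image and Debian's published SHA512SUMS).
echo 'demo image contents' > /tmp/demo.qcow2
( cd /tmp && sha512sum demo.qcow2 > SHA512SUMS.demo && sha512sum -c SHA512SUMS.demo )
# prints: demo.qcow2: OK
```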
Step 1.2: Check Available Storage¶
# List Proxmox storage pools
pvesm status
# Example output:
# Name        Type      Status   Total   Used   Available
# local       dir       active   1.4T    500G   900G
# local-zfs   zfspool   active   1.3T    344G   1009G
Decision Point: Note your storage name (we use local-zfs)
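If you're unsure which name to use, a quick filter over `pvesm status` lists only the active pools. The sample text below mirrors the example output above so the snippet runs anywhere; on the real host, pipe `pvesm status` straight into the same awk filter:

```shell
# Extract names of active storage pools (column 3 == "active").
# sample_status stands in for real `pvesm status` output.
sample_status='Name             Type     Status     Total      Used       Available
local            dir      active     1.4T       500G       900G
local-zfs        zfspool  active     1.3T       344G       1009G'

echo "$sample_status" | awk 'NR > 1 && $3 == "active" { print $1 }'
# prints:
# local
# local-zfs
```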
Step 1.3: Create the VMs¶
Create the VM creation script:
cat > /tmp/create-k3s-vms.sh << 'EOF'
#!/bin/bash
set -e

# Configuration
TEMPLATE_IMG="/var/lib/vz/template/iso/debian-13-generic-amd64-daily.qcow2"
STORAGE="local-zfs"    # ← CHANGE THIS to match your storage from Step 1.2
BRIDGE="vmbr0"         # ← CHANGE if you use a different bridge
GATEWAY="10.89.97.1"   # ← CHANGE to your network gateway

# VM configurations (VMID:NAME:IP)
declare -A VMS=(
  [201]="k3s-master:10.89.97.201"
  [202]="k3s-worker-1:10.89.97.202"
  [203]="k3s-worker-2:10.89.97.203"
)

# Create each VM
for VMID in "${!VMS[@]}"; do
  IFS=':' read -r NAME IP <<< "${VMS[$VMID]}"
  echo "Creating VM $VMID: $NAME ($IP)..."

  # Create base VM
  qm create $VMID \
    --name "$NAME" \
    --memory 8192 \
    --cores 4 \
    --net0 virtio,bridge=$BRIDGE \
    --scsihw virtio-scsi-single \
    --ostype l26 \
    --cpu host \
    --agent enabled=1

  # Import and attach disk
  qm importdisk $VMID "$TEMPLATE_IMG" $STORAGE
  qm set $VMID --scsi0 ${STORAGE}:vm-${VMID}-disk-0
  qm set $VMID --boot order=scsi0

  # IMPORTANT: Resize disk to 80GB (importdisk creates a ~3GB disk from the cloud image)
  qm resize $VMID scsi0 +77G

  # Add cloud-init drive
  qm set $VMID --ide2 ${STORAGE}:cloudinit

  # Configure cloud-init (network, SSH keys)
  qm set $VMID \
    --ciuser root \
    --cipassword "k3s-temp-password" \
    --ipconfig0 "ip=${IP}/24,gw=${GATEWAY}" \
    --nameserver 1.1.1.1 \
    --searchdomain local \
    --sshkeys ~/.ssh/authorized_keys

  echo "✓ VM $VMID created"
done

# Start all VMs
echo ""
echo "Starting VMs..."
for VMID in "${!VMS[@]}"; do
  qm start $VMID
done

# Wait for boot
echo "Waiting 30 seconds for VMs to boot..."
sleep 30

# Test SSH connectivity
echo ""
echo "Testing SSH connectivity..."
for VMID in "${!VMS[@]}"; do
  IFS=':' read -r NAME IP <<< "${VMS[$VMID]}"
  if ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 root@$IP "echo 'VM $VMID reachable'" 2>/dev/null; then
    echo "✓ $NAME ($IP) - SSH ready"
  else
    echo "✗ $NAME ($IP) - SSH not ready (give it more time)"
  fi
done

echo ""
echo "VM creation complete!"
echo "Next: Install k3s (see Phase 2)"
EOF
chmod +x /tmp/create-k3s-vms.sh
Run the script:

/tmp/create-k3s-vms.sh

Expected output:
Creating VM 201: k3s-master (10.89.97.201)...
✓ VM 201 created
Creating VM 202: k3s-worker-1 (10.89.97.202)...
✓ VM 202 created
Creating VM 203: k3s-worker-2 (10.89.97.203)...
✓ VM 203 created
Starting VMs...
...
✓ k3s-master (10.89.97.201) - SSH ready
✓ k3s-worker-1 (10.89.97.202) - SSH ready
✓ k3s-worker-2 (10.89.97.203) - SSH ready
Step 1.4: Verify VM Configuration¶
CRITICAL: Verify disk sizes before proceeding!
# Check disk configuration for all VMs
for vmid in 201 202 203; do
echo "=== VM $vmid ==="
qm config $vmid | grep scsi0
done
Expected output:
=== VM 201 ===
scsi0: local-zfs:vm-201-disk-0,size=80G
=== VM 202 ===
scsi0: local-zfs:vm-202-disk-0,size=80G
=== VM 203 ===
scsi0: local-zfs:vm-203-disk-0,size=80G
If you see size=3G or similar small size:
This means the disk resize didn't happen. The cloud image is ~3GB and wasn't expanded. Fix it now:
# Resize each VM disk to 80GB
for vmid in 201 202 203; do
echo "Resizing VM $vmid..."
qm resize $vmid scsi0 80G
done
# Verify
for vmid in 201 202 203; do
qm config $vmid | grep scsi0
done
Then extend the filesystem inside each VM:
# For each node
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== Extending filesystem on $ip ==="
ssh root@$ip 'growpart /dev/sda 1 && resize2fs /dev/sda1 && df -h /'
done
Each node should report the partition change (growpart prints a CHANGED: line), and df -h / should then show roughly 78G total.
Why this matters:

- Longhorn needs substantial disk space (we allocated 80GB per node = 240GB total)
- With 2-replica volumes, that's 120GB usable storage
- Small disks (3GB) cause DiskPressure errors and pod evictions
- Phase 3 (Longhorn installation) will fail without proper disk space
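The size check can also be scripted so a too-small disk fails fast. The sample line below stands in for real output; on the host you would pipe `qm config $vmid | grep scsi0` through the same sed:

```shell
# Hypothetical guard: extract the size=NNG field and compare against 80.
cfg_line='scsi0: local-zfs:vm-201-disk-0,size=80G'
size=$(echo "$cfg_line" | sed -n 's/.*size=\([0-9]*\)G.*/\1/p')
if [ "$size" -ge 80 ]; then
  echo "disk OK (${size}G)"
else
  echo "disk too small (${size}G) - run: qm resize <vmid> scsi0 80G"
fi
# prints: disk OK (80G)
```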
Troubleshooting:
| Issue | Solution |
|---|---|
| "storage 'local-zfs' does not exist" | Run pvesm status and update STORAGE= in script |
| "VMID 201 already exists" | Remove old VMs first (qm takes one VMID at a time): for vmid in 201 202 203; do qm destroy $vmid; done |
| SSH not ready after 30s | Wait longer (some systems take 60s), then test: ssh root@10.89.97.201 |
| Network unreachable | Verify GATEWAY IP is correct for your network |
Phase 2: k3s Installation¶
Step 2.1: Install k3s on Master Node¶
What we're doing:

- Installing k3s "server" (control plane + worker combined)
- Disabling Traefik (we'll use our own ingress later)
- Disabling servicelb (we'll use MetalLB for LoadBalancer support)
Command:
ssh root@10.89.97.201 'curl -sfL https://get.k3s.io | sh -s - server \
--write-kubeconfig-mode=644 \
--disable=traefik \
--disable=servicelb'
Explanation of flags:
- server - Installs k3s in server mode (control plane)
- --write-kubeconfig-mode=644 - Makes kubeconfig readable by all users (homelab only!)
- --disable=traefik - We'll install our own ingress controller later
- --disable=servicelb - We'll use MetalLB for better LoadBalancer support
Expected output:
[INFO] Finding release for channel stable
[INFO] Using v1.33.5+k3s1 as release
[INFO] Downloading binary...
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] systemd: Starting k3s
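As an optional alternative to command-line flags, k3s also reads its settings from /etc/rancher/k3s/config.yaml on the node, which is handy because the file survives reinstalls and upgrades of the binary. The demo below writes the equivalent of the flags above to a /tmp path so it is safe to run anywhere; on the master you would write the same content to /etc/rancher/k3s/config.yaml before installing:

```shell
# Declarative equivalent of the install flags (demo copy under /tmp).
mkdir -p /tmp/k3s-demo
cat > /tmp/k3s-demo/config.yaml << 'YAML'
write-kubeconfig-mode: "644"
disable:
  - traefik
  - servicelb
YAML
cat /tmp/k3s-demo/config.yaml
```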
Verify installation:

ssh root@10.89.97.201 'kubectl get nodes'

Expected output:

NAME         STATUS   ROLES                  AGE   VERSION
k3s-master   Ready    control-plane,master   1m    v1.33.5+k3s1
Troubleshooting:
| Issue | Solution |
|---|---|
| "Connection refused" | k3s still starting, wait 30s and retry |
| Node shows "NotReady" | Check logs: ssh root@10.89.97.201 'journalctl -u k3s -f' |
| Download fails | Check internet connectivity: ssh root@10.89.97.201 'ping -c 3 google.com' |
Step 2.2: Get Join Token for Workers¶
What is this? The join token is a secret that authenticates worker nodes when they join the cluster.
Command:
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
echo "Join token: $K3S_TOKEN"
Expected output: a long token of the form K10&lt;hash&gt;::server:&lt;secret&gt; (treat it like a password).
Save this token - you'll need it if you add more workers later.
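For reference, the node-token has a fixed layout: a K10 prefix followed by a hash of the cluster CA certificate, then `::`, then the server credential the agent presents when joining. The snippet below splits a fabricated sample token (the value is made up purely to show the field layout):

```shell
# Fabricated sample token - real tokens are much longer.
sample_token='K10abc123def456::server:0123456789abcdef'
ca_hash=${sample_token%%::*}   # K10 prefix + CA cert hash
cred=${sample_token##*::}      # credential used by joining agents
echo "ca hash field:    $ca_hash"
echo "credential field: $cred"
```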
Step 2.3: Join Worker Nodes¶
What we're doing:

- Installing k3s in "agent" mode (worker only)
- Pointing agents to the master's API server
- Using the join token for authentication
Join worker-1:
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
K3S_URL="https://10.89.97.201:6443"
ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -"
Expected output:
[INFO] Finding release for channel stable
[INFO] Using v1.33.5+k3s1 as release
[INFO] Downloading binary...
[INFO] systemd: Starting k3s-agent
Join worker-2:
ssh root@10.89.97.203 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -"
Or join both in parallel:
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
K3S_URL="https://10.89.97.201:6443"
# Join both workers simultaneously
ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -" &
ssh root@10.89.97.203 "curl -sfL https://get.k3s.io | K3S_URL='${K3S_URL}' K3S_TOKEN='${K3S_TOKEN}' sh -" &
wait
echo "Both workers joined!"
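One caveat with the parallel form: a bare `wait` returns success even if one background job failed, so a broken join can go unnoticed. This sketch waits on each PID individually to catch failures; the two placeholder functions stand in for the real ssh join commands:

```shell
# Placeholders simulate one successful and one failed join.
join_worker_1() { true; }    # stands in for: ssh root@10.89.97.202 "curl ... | sh -"
join_worker_2() { false; }   # simulates a failed join

join_worker_1 & pid1=$!
join_worker_2 & pid2=$!

status=0
wait $pid1 || { echo "worker-1 join failed"; status=1; }
wait $pid2 || { echo "worker-2 join failed"; status=1; }
echo "overall status: $status"
# prints: worker-2 join failed
#         overall status: 1
```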
Step 2.4: Verify Cluster Health¶
Check that all nodes are ready:

ssh root@10.89.97.201 'kubectl get nodes -o wide'

Expected output:
NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP
k3s-master     Ready    control-plane,master   5m    v1.33.5+k3s1   10.89.97.201
k3s-worker-1   Ready    <none>                 2m    v1.33.5+k3s1   10.89.97.202
k3s-worker-2   Ready    <none>                 2m    v1.33.5+k3s1   10.89.97.203
Check system pods:

ssh root@10.89.97.201 'kubectl get pods -A'

Expected output:
NAMESPACE     NAME                         READY   STATUS    RESTARTS   AGE
kube-system   coredns-...                  1/1     Running   0          5m
kube-system   local-path-provisioner-...   1/1     Running   0          5m
kube-system   metrics-server-...           1/1     Running   0          5m
Troubleshooting:
| Issue | Solution |
|---|---|
| Node shows "NotReady" | Wait 60s (node registration takes time) |
| Node missing from list | Check worker logs: ssh root@10.89.97.202 'journalctl -u k3s-agent -f' |
| CoreDNS in CrashLoopBackOff | Check master logs: ssh root@10.89.97.201 'journalctl -u k3s -f' |
Step 2.5: Set Up kubectl Access from Proxmox Host¶
Why? You'll want to manage the cluster from the Proxmox host without SSHing to the master every time.
Copy kubeconfig:
# Create kube config directory
mkdir -p ~/.kube
# Copy kubeconfig from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config
# Update server IP (it defaults to 127.0.0.1)
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config
# Set permissions
chmod 600 ~/.kube/config
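The server-address rewrite is easy to sanity-check on a throwaway fragment before touching ~/.kube/config (k3s writes https://127.0.0.1:6443 by default):

```shell
# Demo of the sed rewrite on a minimal kubeconfig fragment.
cat > /tmp/kubeconfig-demo << 'YAML'
clusters:
- cluster:
    server: https://127.0.0.1:6443
  name: default
YAML
sed -i 's/127.0.0.1/10.89.97.201/g' /tmp/kubeconfig-demo
grep 'server:' /tmp/kubeconfig-demo
# the server line now reads https://10.89.97.201:6443
```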
Install kubectl on Proxmox (if not already installed):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
mv kubectl /usr/local/bin/
# Verify
kubectl version --client
Test access (now directly from the Proxmox host):

kubectl get nodes

Expected output:
NAME           STATUS   ROLES                  AGE   VERSION
k3s-master     Ready    control-plane,master   10m   v1.33.5+k3s1
k3s-worker-1   Ready    <none>                 7m    v1.33.5+k3s1
k3s-worker-2   Ready    <none>                 7m    v1.33.5+k3s1
Add alias for convenience:
echo "alias k='kubectl'" >> ~/.bashrc
source ~/.bashrc
# Now you can use 'k' instead of 'kubectl'
k get nodes
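Tab completion can follow the alias too: kubectl ships a bash completion script, and the `complete -o default -F __start_kubectl k` line from kubectl's documentation wires it to `k`. The demo writes the lines to a /tmp file so the snippet runs without touching your shell config; append the same three lines to ~/.bashrc on the Proxmox host:

```shell
# Lines to append to ~/.bashrc (demo copy under /tmp).
cat > /tmp/bashrc-demo << 'RC'
source <(kubectl completion bash)
alias k='kubectl'
complete -o default -F __start_kubectl k
RC
cat /tmp/bashrc-demo
```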
Phase 2 Complete! ✓¶
What we've accomplished:

- ✅ 3-node k3s cluster running
- ✅ All nodes in "Ready" state
- ✅ kubectl access from Proxmox host
- ✅ System pods (CoreDNS, metrics-server) running
Next Steps:

- Phase 3: Core Infrastructure (MetalLB, Longhorn, cert-manager)
- Phase 4: GitOps with Flux
- Phase 5: Observability Stack
Quick Reference Commands¶
# Check cluster health
kubectl get nodes
kubectl get pods -A
# SSH to nodes
ssh root@10.89.97.201 # master
ssh root@10.89.97.202 # worker-1
ssh root@10.89.97.203 # worker-2
# View k3s logs
ssh root@10.89.97.201 'journalctl -u k3s -f' # master
ssh root@10.89.97.202 'journalctl -u k3s-agent -f' # worker
# Restart k3s
ssh root@10.89.97.201 'systemctl restart k3s' # master
ssh root@10.89.97.202 'systemctl restart k3s-agent' # worker
# Stop/start VMs (qm takes one VMID at a time)
for vmid in 201 202 203; do qm start $vmid; done
for vmid in 201 202 203; do qm stop $vmid; done
Disaster Recovery¶
Rebuild Cluster from Scratch¶
If something goes horribly wrong:
# 1. Stop and destroy the VMs (qm takes one VMID at a time)
for vmid in 201 202 203; do
  qm stop $vmid
  qm destroy $vmid
done
# 2. Re-run VM creation script
/tmp/create-k3s-vms.sh
# 3. Reinstall k3s (follow Phase 2 steps)
Add Additional Worker Node¶
# 1. Create VM (use VMID 204+)
# 2. Get token from master
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
# 3. Join new worker
ssh root@NEW_WORKER_IP "curl -sfL https://get.k3s.io | K3S_URL='https://10.89.97.201:6443' K3S_TOKEN='${K3S_TOKEN}' sh -"
Lessons Learned¶
2025-11-09: Initial Installation¶
What Went Well:

- Cloud-init made VM provisioning fast (<5 min total)
- k3s installation is straightforward and well-documented
- Debian 13 (Trixie) images are stable and modern
Challenges:
- Had to identify correct Proxmox storage name (local-zfs not local-lvm)
- Initial script attempt failed, needed cleanup with qm destroy
Improvements Made:

- Added a storage-name check step to the guide
- Added troubleshooting sections for common issues
- Documented all commands with explanations for learning
Would Do Differently:

- Could use Terraform/Ansible for fully automated provisioning
- Consider setting up an HA control plane (3 master nodes) for a production-like experience
Next: Phase 3 - Core Infrastructure¶
Continue to Phase 3 to install:

- MetalLB (LoadBalancer services)
- Longhorn (distributed storage)
- cert-manager (SSL certificates)
See TOWER_FLEET.md for overall project timeline.