Cluster Testing Guide¶
How to validate the tower-fleet k3s cluster at different stages.
Phase 1.5: Basic Cluster Testing (Before Core Infrastructure)¶
Purpose: Validate cluster fundamentals work before adding MetalLB/Longhorn
Time: ~5 minutes
Test 1: Deploy Test Application¶
# Deploy nginx
kubectl create deployment nginx-test --image=nginx
# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=nginx-test --timeout=60s
# Check pod status
kubectl get pods -l app=nginx-test -o wide
Expected output: the pod reports STATUS Running (1/1 ready), and the -o wide output shows it scheduled on one of the worker nodes.
✅ Validates:
- Pod scheduling works
- Scheduler places pods on worker nodes
- Container runtime (containerd) works
- Image pull from Docker Hub works
Test 2: NodePort Service¶
# Expose as NodePort
kubectl expose deployment nginx-test --port=80 --type=NodePort
# Get assigned port
kubectl get svc nginx-test
Expected output: a service of TYPE NodePort with a port mapping such as 80:3XXXX/TCP (NodePorts are allocated from the 30000-32767 range by default).
Test access on all nodes:
# Get the NodePort number
NODEPORT=$(kubectl get svc nginx-test -o jsonpath='{.spec.ports[0].nodePort}')
# Test on master
curl -I http://10.89.97.201:$NODEPORT
# Test on worker-1
curl -I http://10.89.97.202:$NODEPORT
# Test on worker-2
curl -I http://10.89.97.203:$NODEPORT
Expected: All return HTTP/1.1 200 OK
✅ Validates:
- kube-proxy is working
- NodePort service type works
- Network routing to all nodes works
Test 3: ClusterIP and DNS¶
# Test internal cluster DNS and networking
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -- curl -s http://nginx-test
Expected output: the HTML of the nginx welcome page ("Welcome to nginx!").
✅ Validates:
- CoreDNS is working
- Service discovery works
- Pod-to-pod networking works
- ClusterIP service type works
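For a more direct look at CoreDNS, a name lookup can be run from a throwaway pod. A minimal sketch, assuming the deployment lives in the default namespace and the cluster uses the standard cluster.local domain (busybox:1.36 is just a convenient image that ships nslookup):
# Resolve the service name through CoreDNS
kubectl run dns-test --image=busybox:1.36 --rm -i --restart=Never -- nslookup nginx-test.default.svc.cluster.local
The lookup should return the service's ClusterIP (a 10.43.x.x address on a default k3s install).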
Test 4: LoadBalancer (Should Fail - Expected)¶
# Try to create LoadBalancer service
kubectl expose deployment nginx-test --name=nginx-lb --port=80 --type=LoadBalancer
# Check status
kubectl get svc nginx-lb
Expected output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb LoadBalancer 10.43.x.x <pending> 80:31568/TCP 5s
❌ Expected to show <pending> - This is normal! MetalLB not installed yet.
✅ Validates:
- Service gets created (just no external IP yet)
- After Phase 2 (MetalLB), this will get an IP from 10.89.97.210-220
Cleanup¶
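Remove the resources created by Tests 1-4 (the names match the commands above):
# Delete the test deployment and both services
kubectl delete deployment nginx-test
kubectl delete svc nginx-test nginx-lb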
Test Results Summary (Phase 1.5)¶
Date: 2025-11-09
| Test | Result | Notes |
|---|---|---|
| Pod deployment | ✅ PASS | Deployed to k3s-worker-1 |
| Image pull | ✅ PASS | nginx:latest from Docker Hub |
| NodePort access | ✅ PASS | All 3 nodes return HTTP 200 |
| ClusterIP/DNS | ✅ PASS | Service discovery works |
| LoadBalancer | ⏳ PENDING | Expected - MetalLB not installed |
Conclusion: Cluster fundamentals are working correctly! Ready for Phase 2.
Phase 2.5: Full Testing (After Core Infrastructure)¶
Purpose: Validate MetalLB, Longhorn, cert-manager
Time: ~10 minutes
Test 5: LoadBalancer with MetalLB¶
# Deploy nginx
kubectl create deployment nginx-test --image=nginx
# Expose as LoadBalancer
kubectl expose deployment nginx-test --port=80 --type=LoadBalancer
# Wait for external IP
kubectl get svc nginx-test --watch
Expected output (after MetalLB installed):
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-test LoadBalancer 10.43.x.x 10.89.97.210 80:31234/TCP 10s
Test access:
# Get LoadBalancer IP
LB_IP=$(kubectl get svc nginx-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Test access
curl http://$LB_IP
Expected: Returns nginx welcome page
✅ Validates:
- MetalLB is working
- IP pool configuration is correct
- LoadBalancer service type now works
Test 6: Persistent Storage with Longhorn¶
# Create test PVC
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF
# Check PVC status
kubectl get pvc test-pvc
Expected output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
test-pvc Bound pvc-xxx-xxx-xxx 1Gi RWO longhorn
Create pod using PVC:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test
    image: busybox
    command: ['sh', '-c', 'echo "Hello Longhorn" > /data/test.txt && sleep 3600']
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
# Wait for pod
kubectl wait --for=condition=ready pod/test-pod --timeout=60s
# Verify data was written
kubectl exec test-pod -- cat /data/test.txt
Expected: Returns Hello Longhorn
✅ Validates:
- Longhorn is installed correctly
- PVC creation works
- Volume mounting works
- Data persistence works
Check replication:
# Access Longhorn UI (optional)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Then open: http://localhost:8080
# Should see 2 replicas for the volume
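The same information is available from the CLI through Longhorn's custom resources; a quick sketch, assuming Longhorn runs in its default longhorn-system namespace:
# List Longhorn volumes and their replica objects
kubectl get volumes.longhorn.io -n longhorn-system
kubectl get replicas.longhorn.io -n longhorn-system
There should be one replica object per configured replica for the test volume (2 expected here).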
Cleanup:
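Delete the pod first, since the PVC cannot be removed while it is still mounted:
kubectl delete pod test-pod
kubectl delete pvc test-pvc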
Test 7: Full Application Deployment¶
# Deploy complete app with LoadBalancer + PVC
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: test-app-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-app-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: test-app
spec:
  type: LoadBalancer
  selector:
    app: test-app
  ports:
    - port: 80
      targetPort: 80
EOF
# Wait for deployment
kubectl wait --for=condition=available deployment/test-app --timeout=120s
# Get LoadBalancer IP
kubectl get svc test-app
✅ Validates:
- Complete app deployment works
- LoadBalancer + PVC together
- Multi-replica deployment
- Pod scheduling across workers
Note: the PVC uses ReadWriteOnce, so both replicas can mount the volume only if they are scheduled on the same node. If the wait times out with one pod stuck in ContainerCreating, that is the likely cause; use replicas: 1, or a ReadWriteMany volume, if you want to test spreading across workers.
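To see where the replicas actually landed (and whether both are Running), list them with the node column:
# Show replica placement across nodes
kubectl get pods -l app=test-app -o wide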
Cleanup:
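Everything from Test 7 is removed with:
kubectl delete deployment test-app
kubectl delete svc test-app
kubectl delete pvc test-app-pvc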
Continuous Health Checks¶
Run these regularly to ensure cluster health:
# Quick health check
kubectl get nodes # All should be Ready
kubectl get pods -A # All should be Running
# Detailed check
kubectl top nodes # CPU/RAM usage
kubectl top pods -A # Pod resource usage
# Check for issues
kubectl get events --sort-by='.lastTimestamp' -A | head -20
# Longhorn health (after Phase 2)
kubectl get nodes,pv,pvc -A
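For a closer look at Longhorn itself, its pods and per-node storage objects can be checked directly (assuming the default longhorn-system namespace):
# Longhorn components and per-node storage state
kubectl get pods -n longhorn-system
kubectl get nodes.longhorn.io -n longhorn-system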
Troubleshooting Failed Tests¶
Pod stuck in Pending¶
Common causes:
- No resources (CPU/RAM exhausted)
- PVC not bound
- Node selector mismatch
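To pinpoint which of these it is, the scheduler's reasoning is in the pod's events:
# Replace POD_NAME; the Events section explains why scheduling failed
kubectl describe pod POD_NAME
kubectl get events --sort-by='.lastTimestamp' | tail -20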
Service shows <pending> for LoadBalancer¶
# Check MetalLB is running
kubectl get pods -n metallb-system
# Check MetalLB logs
kubectl logs -n metallb-system -l app=metallb
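If the MetalLB pods look healthy, check that an address pool and an L2 advertisement actually exist; a sketch assuming MetalLB 0.13+, which is configured through CRDs:
# Address pool and L2 advertisement resources
kubectl get ipaddresspools.metallb.io -n metallb-system
kubectl get l2advertisements.metallb.io -n metallb-system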
PVC stuck in Pending¶
kubectl describe pvc/PVC_NAME
# Check Longhorn is running
kubectl get pods -n longhorn-system
# Check StorageClass exists
kubectl get sc
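If the StorageClass exists but provisioning still hangs, the Longhorn manager logs usually say why (assuming the default app=longhorn-manager label):
# Provisioning errors show up here
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=50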
Performance Testing (Optional)¶
Network throughput¶
# Run an iperf3 server pod
kubectl run iperf-server --image=networkstatic/iperf3 -- -s
# Expose it so the client can reach it by DNS name
kubectl expose pod iperf-server --port=5201
# Run the client against the server service
kubectl run iperf-client --image=networkstatic/iperf3 --rm -it --restart=Never -- -c iperf-server
Storage IOPS¶
# Run a one-off fio random-write benchmark (this writes to the container's own filesystem)
kubectl run fio --image=clusterhq/fio --rm -it --restart=Never -- fio --name=test --ioengine=libaio --size=1G --rw=randwrite --bs=4k --numjobs=1 --time_based --runtime=60
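The command above measures whatever local disk the pod lands on, not Longhorn. To benchmark a Longhorn volume instead, mount a Longhorn-backed PVC and point fio at it. A minimal sketch, assuming a 1Gi PVC named test-pvc exists (as in Test 6; recreate it if already cleaned up) and that the image's fio binary is on its PATH:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fio-longhorn
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: clusterhq/fio
    # --size kept below the 1Gi volume so the test file fits after filesystem overhead
    command: ['fio', '--name=test', '--directory=/data', '--size=512M', '--ioengine=libaio', '--rw=randwrite', '--bs=4k', '--numjobs=1', '--time_based', '--runtime=60']
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
# Read the results, then clean up
kubectl logs -f fio-longhorn
kubectl delete pod fio-longhorn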
Automated Testing Script¶
Create: /root/scripts/test-cluster.sh
#!/bin/bash
set -e
echo "=== K3s Cluster Health Check ==="
echo ""
# Check nodes
echo "1. Checking nodes..."
if kubectl get nodes | grep -q "NotReady"; then
echo " ❌ FAIL: Some nodes are NotReady"
kubectl get nodes
exit 1
else
echo " ✅ PASS: All nodes Ready"
fi
# Check system pods
echo ""
echo "2. Checking system pods..."
NOT_RUNNING=$(kubectl get pods -n kube-system --no-headers | grep -v "Running\|Completed" | wc -l)
if [ "$NOT_RUNNING" -gt 0 ]; then
echo " ❌ FAIL: Some system pods not running"
kubectl get pods -n kube-system | grep -v "Running\|Completed"
exit 1
else
echo " ✅ PASS: All system pods running"
fi
# Test deployment
echo ""
echo "3. Testing deployment..."
kubectl create deployment test --image=nginx --dry-run=client -o yaml | kubectl apply -f -
kubectl wait --for=condition=available deployment/test --timeout=60s
echo " ✅ PASS: Deployment works"
# Cleanup
kubectl delete deployment test
echo ""
echo "=== All tests passed! ==="
Run:
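chmod +x /root/scripts/test-cluster.sh
/root/scripts/test-cluster.sh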
Next Steps¶
- After Phase 1.5 tests pass → Proceed to Core Infrastructure
- After Phase 2.5 tests pass → Ready for application deployment