Cluster Testing Guide¶
How to validate the tower-fleet k3s cluster at different stages.
Phase 1.5: Basic Cluster Testing (Before Core Infrastructure)¶
Purpose: Validate cluster fundamentals work before adding MetalLB/Longhorn
Time: ~5 minutes
Test 1: Deploy Test Application¶
# Deploy nginx
kubectl create deployment nginx-test --image=nginx
# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=nginx-test --timeout=60s
# Check pod status
kubectl get pods -l app=nginx-test -o wide
Expected output: the pod reports STATUS Running (1/1 ready), and the -o wide output shows it scheduled on one of the worker nodes.
✅ Validates:
- Pod scheduling works
- Scheduler places pods on worker nodes
- Container runtime (containerd) works
- Image pull from Docker Hub works
Test 2: NodePort Service¶
# Expose as NodePort
kubectl expose deployment nginx-test --port=80 --type=NodePort
# Get assigned port
kubectl get svc nginx-test
Expected output: a service of TYPE NodePort with a port mapping such as 80:3XXXX/TCP (NodePorts are allocated from the 30000-32767 range by default).
Test access on all nodes:
# Get the NodePort number
NODEPORT=$(kubectl get svc nginx-test -o jsonpath='{.spec.ports[0].nodePort}')
# Test on master
curl -I http://10.89.97.201:$NODEPORT
# Test on worker-1
curl -I http://10.89.97.202:$NODEPORT
# Test on worker-2
curl -I http://10.89.97.203:$NODEPORT
Expected: All return HTTP/1.1 200 OK
✅ Validates:
- kube-proxy is working
- NodePort service type works
- Network routing to all nodes works
Test 3: ClusterIP and DNS¶
# Test internal cluster DNS and networking
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -- curl -s http://nginx-test
Expected output: the HTML of the nginx welcome page ("Welcome to nginx!").
✅ Validates:
- CoreDNS is working
- Service discovery works
- Pod-to-pod networking works
- ClusterIP service type works
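For a more direct look at CoreDNS, a name lookup can be run from a throwaway pod. A minimal sketch, assuming the deployment lives in the default namespace and the cluster uses the standard cluster.local domain (busybox:1.36 is just a convenient image that ships nslookup):
# Resolve the service name through CoreDNS
kubectl run dns-test --image=busybox:1.36 --rm -i --restart=Never -- nslookup nginx-test.default.svc.cluster.local
The lookup should return the service's ClusterIP (a 10.43.x.x address on a default k3s install).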
Test 4: LoadBalancer (Should Fail - Expected)¶
# Try to create LoadBalancer service
kubectl expose deployment nginx-test --name=nginx-lb --port=80 --type=LoadBalancer
# Check status
kubectl get svc nginx-lb
Expected output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-lb LoadBalancer 10.43.x.x <pending> 80:31568/TCP 5s
❌ Expected to show <pending> - This is normal! MetalLB not installed yet.
✅ Validates:
- Service gets created (just no external IP yet)
- After Phase 2 (MetalLB), this will get an IP from 10.89.97.210-220
Cleanup¶
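Remove the resources created by Tests 1-4 (the names match the commands above):
# Delete the test deployment and both services
kubectl delete deployment nginx-test
kubectl delete svc nginx-test nginx-lb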
Test Results Summary (Phase 1.5)¶
Date: 2025-11-09
| Test | Result | Notes |
|---|---|---|
| Pod deployment | ✅ PASS | Deployed to k3s-worker-1 |
| Image pull | ✅ PASS | nginx:latest from Docker Hub |
| NodePort access | ✅ PASS | All 3 nodes return HTTP 200 |
| ClusterIP/DNS | ✅ PASS | Service discovery works |
| LoadBalancer | ⏳ PENDING | Expected - MetalLB not installed |
Conclusion: Cluster fundamentals are working correctly! Ready for Phase 2.
Phase 2.5: Full Testing (After Core Infrastructure)¶
Purpose: Validate MetalLB, Longhorn, cert-manager
Time: ~10 minutes
Test 5: LoadBalancer with MetalLB¶
# Deploy nginx
kubectl create deployment nginx-test --image=nginx
# Expose as LoadBalancer
kubectl expose deployment nginx-test --port=80 --type=LoadBalancer
# Wait for external IP
kubectl get svc nginx-test --watch
Expected output (after MetalLB installed):
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-test LoadBalancer 10.43.x.x 10.89.97.210 80:31234/TCP 10s
Test access:
# Get LoadBalancer IP
LB_IP=$(kubectl get svc nginx-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Test access
curl http://$LB_IP
Expected: Returns nginx welcome page
✅ Validates:
- MetalLB is working
- IP pool configuration is correct
- LoadBalancer service type now works
Test 6: Persistent Storage with Longhorn¶
# Create test PVC
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF
# Check PVC status
kubectl get pvc test-pvc
Expected output:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
test-pvc Bound pvc-xxx-xxx-xxx 1Gi RWO longhorn
Create pod using PVC:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test
    image: busybox
    command: ['sh', '-c', 'echo "Hello Longhorn" > /data/test.txt && sleep 3600']
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
# Wait for pod
kubectl wait --for=condition=ready pod/test-pod --timeout=60s
# Verify data was written
kubectl exec test-pod -- cat /data/test.txt
Expected: Returns Hello Longhorn
✅ Validates:
- Longhorn is installed correctly
- PVC creation works
- Volume mounting works
- Data persistence works
Check replication:
# Access Longhorn UI (optional)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Then open: http://localhost:8080
# Should see 2 replicas for the volume
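The same information is available from the CLI through Longhorn's custom resources; a quick sketch, assuming Longhorn runs in its default longhorn-system namespace:
# List Longhorn volumes and their replica objects
kubectl get volumes.longhorn.io -n longhorn-system
kubectl get replicas.longhorn.io -n longhorn-system
There should be one replica object per configured replica for the test volume (2 expected here).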
Cleanup:
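Delete the pod first, since the PVC cannot be removed while it is still mounted:
kubectl delete pod test-pod
kubectl delete pvc test-pvc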
Test 7: Full Application Deployment¶
# Deploy complete app with LoadBalancer + PVC
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: test-app-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-app-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: test-app
spec:
  type: LoadBalancer
  selector:
    app: test-app
  ports:
    - port: 80
      targetPort: 80
EOF
# Wait for deployment
kubectl wait --for=condition=available deployment/test-app --timeout=120s
# Get LoadBalancer IP
kubectl get svc test-app
✅ Validates:
- Complete app deployment works
- LoadBalancer + PVC together
- Multi-replica deployment
- Pod scheduling across workers
Note: the PVC uses ReadWriteOnce, so both replicas can mount the volume only if they are scheduled on the same node. If the wait times out with one pod stuck in ContainerCreating, that is the likely cause; use replicas: 1, or a ReadWriteMany volume, if you want to test spreading across workers.
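To see where the replicas actually landed (and whether both are Running), list them with the node column:
# Show replica placement across nodes
kubectl get pods -l app=test-app -o wide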
Cleanup:
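Everything from Test 7 is removed with:
kubectl delete deployment test-app
kubectl delete svc test-app
kubectl delete pvc test-app-pvc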
Continuous Health Checks¶
Run these regularly to ensure cluster health:
# Quick health check
kubectl get nodes # All should be Ready
kubectl get pods -A # All should be Running
# Detailed check
kubectl top nodes # CPU/RAM usage
kubectl top pods -A # Pod resource usage
# Check for issues
kubectl get events --sort-by='.lastTimestamp' -A | head -20
# Longhorn health (after Phase 2)
kubectl get nodes,pv,pvc -A
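For a closer look at Longhorn itself, its pods and per-node storage objects can be checked directly (assuming the default longhorn-system namespace):
# Longhorn components and per-node storage state
kubectl get pods -n longhorn-system
kubectl get nodes.longhorn.io -n longhorn-system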
Troubleshooting Failed Tests¶
Pod stuck in Pending¶
Common causes:
- No resources (CPU/RAM exhausted)
- PVC not bound
- Node selector mismatch
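To pinpoint which of these it is, the scheduler's reasoning is in the pod's events:
# Replace POD_NAME; the Events section explains why scheduling failed
kubectl describe pod POD_NAME
kubectl get events --sort-by='.lastTimestamp' | tail -20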
Service shows <pending> for LoadBalancer¶
# Check MetalLB is running
kubectl get pods -n metallb-system
# Check MetalLB logs
kubectl logs -n metallb-system -l app=metallb
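If the MetalLB pods look healthy, check that an address pool and an L2 advertisement actually exist; a sketch assuming MetalLB 0.13+, which is configured through CRDs:
# Address pool and L2 advertisement resources
kubectl get ipaddresspools.metallb.io -n metallb-system
kubectl get l2advertisements.metallb.io -n metallb-system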
PVC stuck in Pending¶
kubectl describe pvc/PVC_NAME
# Check Longhorn is running
kubectl get pods -n longhorn-system
# Check StorageClass exists
kubectl get sc
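If the StorageClass exists but provisioning still hangs, the Longhorn manager logs usually say why (assuming the default app=longhorn-manager label):
# Provisioning errors show up here
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=50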
Performance Testing (Optional)¶
Network throughput¶
# Run an iperf3 server pod
kubectl run iperf-server --image=networkstatic/iperf3 -- -s
# Expose it so the client can reach it by DNS name
kubectl expose pod iperf-server --port=5201
# Run the client against the server service
kubectl run iperf-client --image=networkstatic/iperf3 --rm -it --restart=Never -- -c iperf-server
Storage IOPS¶
# Run a one-off fio random-write benchmark (this writes to the container's own filesystem)
kubectl run fio --image=clusterhq/fio --rm -it --restart=Never -- fio --name=test --ioengine=libaio --size=1G --rw=randwrite --bs=4k --numjobs=1 --time_based --runtime=60
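The command above measures whatever local disk the pod lands on, not Longhorn. To benchmark a Longhorn volume instead, mount a Longhorn-backed PVC and point fio at it. A minimal sketch, assuming a 1Gi PVC named test-pvc exists (as in Test 6; recreate it if already cleaned up) and that the image's fio binary is on its PATH:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fio-longhorn
spec:
  restartPolicy: Never
  containers:
  - name: fio
    image: clusterhq/fio
    # --size kept below the 1Gi volume so the test file fits after filesystem overhead
    command: ['fio', '--name=test', '--directory=/data', '--size=512M', '--ioengine=libaio', '--rw=randwrite', '--bs=4k', '--numjobs=1', '--time_based', '--runtime=60']
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF
# Read the results, then clean up
kubectl logs -f fio-longhorn
kubectl delete pod fio-longhorn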
Automated Testing Script¶
Create: /root/scripts/test-cluster.sh
#!/bin/bash
set -e
echo "=== K3s Cluster Health Check ==="
echo ""
# Check nodes
echo "1. Checking nodes..."
if kubectl get nodes | grep -q "NotReady"; then
echo " ❌ FAIL: Some nodes are NotReady"
kubectl get nodes
exit 1
else
echo " ✅ PASS: All nodes Ready"
fi
# Check system pods
echo ""
echo "2. Checking system pods..."
NOT_RUNNING=$(kubectl get pods -n kube-system --no-headers | grep -v "Running\|Completed" | wc -l)
if [ "$NOT_RUNNING" -gt 0 ]; then
echo " ❌ FAIL: Some system pods not running"
kubectl get pods -n kube-system | grep -v "Running\|Completed"
exit 1
else
echo " ✅ PASS: All system pods running"
fi
# Test deployment
echo ""
echo "3. Testing deployment..."
kubectl create deployment test --image=nginx --dry-run=client -o yaml | kubectl apply -f -
kubectl wait --for=condition=available deployment/test --timeout=60s
echo " ✅ PASS: Deployment works"
# Cleanup
kubectl delete deployment test
echo ""
echo "=== All tests passed! ==="
Run:
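chmod +x /root/scripts/test-cluster.sh
/root/scripts/test-cluster.sh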
Next Steps¶
- After Phase 1.5 tests pass → Proceed to Core Infrastructure
- After Phase 2.5 tests pass → Ready for application deployment