Troubleshooting Guide¶
Common issues and solutions for the tower-fleet k3s cluster.
Node Issues¶
Node shows "NotReady"¶
Symptoms:
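Typically the node appears as NotReady in kubectl output (node name and columns illustrative):
kubectl get nodes
# NAME           STATUS     ROLES    AGE
# k3s-worker-1   NotReady   <none>   30d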
Diagnosis:
# Check node conditions
kubectl describe node/k3s-worker-1
# Check kubelet logs
ssh root@10.89.97.202 'journalctl -u k3s-agent -n 100'
Common causes:
1. Network issues - Node can't reach the master
2. k3s-agent not running
3. Disk pressure / resource exhaustion
Solution:
# Restart k3s-agent
ssh root@10.89.97.202 'systemctl restart k3s-agent'
# Wait 30 seconds, then check
kubectl get nodes
DiskPressure causing pod evictions¶
Symptoms:
kubectl get nodes
# NODE STATUS with DiskPressure condition
kubectl get events -n longhorn-system
# Warning Evicted pod/longhorn-manager-xxx The node had condition: [DiskPressure]
Diagnosis:
# Check disk usage on all nodes
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== $ip ==="
ssh root@$ip 'df -h /'
done
# Check VM disk configuration
for vmid in 201 202 203; do
echo "=== VM $vmid ==="
qm config $vmid | grep scsi0
done
Common cause: VMs have insufficient disk space (often 3GB from cloud image instead of 80GB)
Root cause: The qm importdisk command imports the cloud image at its original size (~3GB), and qm set --scsi0 ...,size=80G does NOT resize imported disks (only works for new disks).
Solution:
# 1. Stop affected services (if Longhorn is failing)
helm uninstall longhorn -n longhorn-system 2>/dev/null || true
kubectl delete namespace longhorn-system --force --grace-period=0 &
# 2. Resize VM disks at Proxmox level
for vmid in 201 202 203; do
echo "Resizing VM $vmid to 80GB..."
qm resize $vmid scsi0 80G
done
# 3. Extend partitions and filesystems inside VMs
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== Extending filesystem on $ip ==="
ssh root@$ip 'growpart /dev/sda 1 && resize2fs /dev/sda1'
done
# 4. Verify disk space
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
ssh root@$ip 'df -h / | tail -1'
done
# Should show ~79G total, ~74G available
# 5. Wait for namespace deletion to complete
kubectl get namespace longhorn-system 2>&1 || echo "Namespace deleted"
# 6. Reinstall Longhorn
kubectl create namespace longhorn-system
helm install longhorn longhorn/longhorn --namespace longhorn-system
Prevention:
Always include qm resize in VM creation scripts after qm importdisk:
# Correct way:
qm importdisk $VMID "$TEMPLATE_IMG" $STORAGE
qm set $VMID --scsi0 ${STORAGE}:vm-${VMID}-disk-0
qm resize $VMID scsi0 +77G # ← This actually resizes the disk!
Verification after fix:
# Check Longhorn pods are running
kubectl get pods -n longhorn-system
# Check available storage
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.ephemeral-storage
Node not appearing in cluster¶
Diagnosis:
# Check if k3s-agent is running
ssh root@10.89.97.202 'systemctl status k3s-agent'
# Check logs for errors
ssh root@10.89.97.202 'journalctl -u k3s-agent -f'
Common causes:
1. Wrong join token - Token mismatch
2. Wrong master IP - Can't find the API server
3. Firewall blocking port 6443
Solution:
# Re-fetch token from master
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')
# Uninstall k3s-agent
ssh root@10.89.97.202 '/usr/local/bin/k3s-agent-uninstall.sh'
# Reinstall with correct token
ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='https://10.89.97.201:6443' K3S_TOKEN='${K3S_TOKEN}' sh -"
Pod Issues¶
Pod stuck in "Pending"¶
Symptoms:
Diagnosis:
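Typical checks (pod name my-app-xxx follows the examples used elsewhere in this guide):
# The Events section usually states why scheduling failed
kubectl describe pod/my-app-xxx
# Check recent events in the namespace
kubectl get events --sort-by='.lastTimestamp'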
Common causes:
1. No resources available - All nodes are full. Solution: scale down other apps or add more nodes.
2. PVC not bound - Waiting for storage. Solution: check PVC status with kubectl get pvc.
3. Node selector mismatch - Pod requires a specific node. Solution: check the pod spec for nodeSelector/affinity.
Pod stuck in "ContainerCreating"¶
Diagnosis:
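A reasonable starting point (pod name illustrative):
# Events show what the kubelet is waiting on (image pull, volume mount, etc.)
kubectl describe pod/my-app-xxx
# Check PVCs referenced by the pod
kubectl get pvc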
Common causes:
1. Image pull error - Can't download the container image. Solution: check the image name/tag is correct, check the registry is accessible, and add imagePullSecrets for a private registry.
2. Volume mount issues - Can't mount the PVC. Solution: check the PVC exists:
kubectl get pvc
Pod CrashLoopBackOff¶
Symptoms:
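The pod restarts repeatedly; output resembles the following (names and counts illustrative):
kubectl get pods
# my-app-xxx   0/1   CrashLoopBackOff   5   4m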
Diagnosis:
# Check logs
kubectl logs my-app-xxx
# Check previous container logs (if restarted)
kubectl logs my-app-xxx --previous
# Describe pod
kubectl describe pod/my-app-xxx
Common causes:
1. Application error - App crashes on startup. Check logs for error messages, verify environment variables are correct, and check that ConfigMaps/Secrets are mounted properly.
2. Liveness/Readiness probe failing - Probe checks fail → k8s kills the pod → it restarts → it fails again. Solution: adjust probe timing or fix the app's health endpoint.
3. Missing dependencies - Database not ready, etc. Solution: use init containers or retry logic.
Pod "ImagePullBackOff"¶
Symptoms:
Diagnosis:
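Typical diagnosis (pod name illustrative):
# The Events section shows the exact pull error (manifest not found, auth required, etc.)
kubectl describe pod/my-app-xxx
# Check the image reference in the pod spec
kubectl get pod my-app-xxx -o jsonpath='{.spec.containers[*].image}'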
Solutions:
1. Typo in the image name
2. Private registry without credentials
3. Image doesn't exist - Verify the image exists:
docker pull IMAGE_NAME
kubectl Issues¶
"connection refused" when running kubectl¶
Symptoms:
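kubectl commands fail with an error along these lines:
kubectl get nodes
# The connection to the server 10.89.97.201:6443 was refused - did you specify the right host or port?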
Diagnosis:
# Check if master is running
qm status 201
# Check if k3s is running on master
ssh root@10.89.97.201 'systemctl status k3s'
# Check network connectivity
ping -c 3 10.89.97.201
nc -zv 10.89.97.201 6443
Solutions:
1. Master VM is stopped
2. k3s service not running
3. Network issue - Check firewall rules and verify the IP address in the kubeconfig is correct
"Unable to connect to the server: x509: certificate signed by unknown authority"¶
Cause: Kubeconfig has wrong CA certificate
Solution:
# Re-copy kubeconfig from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config
chmod 600 ~/.kube/config
"error: You must be logged in to the server (Unauthorized)"¶
Cause: Wrong or expired credentials
Solution:
# Check kubeconfig is present
cat ~/.kube/config
# Re-copy from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config
Service/Network Issues¶
LoadBalancer service stuck in "Pending"¶
Symptoms:
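The service's EXTERNAL-IP never leaves <pending> (service name illustrative):
kubectl get svc my-app
# my-app   LoadBalancer   10.43.x.x   <pending>   80:31234/TCP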
Cause: MetalLB not installed (Phase 3)
Temporary workaround:
# Use NodePort instead
kubectl expose deployment my-app --port=80 --type=NodePort
# Access via node IP:port
kubectl get svc my-app
# my-app NodePort 10.43.x.x <none> 80:31234/TCP
# Access at: http://10.89.97.201:31234
Permanent solution: Install MetalLB (see 03-core-infrastructure.md)
Can't access service from outside cluster¶
Diagnosis:
# Check service type
kubectl get svc my-app
# Test from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- http://my-app
Solutions:
1. Service is ClusterIP - Only accessible inside the cluster
2. MetalLB not configured - See above
3. Port mismatch - Service port doesn't match the container port
Storage Issues¶
PVC stuck in "Pending"¶
Symptoms:
Diagnosis:
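Typical checks (PVC name illustrative):
# Events explain why the claim is not being bound
kubectl describe pvc my-pvc
# List available StorageClasses and check a default exists
kubectl get sc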
Common causes:
1. No default StorageClass (before Longhorn installation). Temporary: use k3s's built-in local-path StorageClass.
2. StorageClass doesn't exist. Verify in the PVC spec:
storageClassName: longhorn
Then check it exists:
kubectl get sc longhorn
Supabase Issues¶
404 Not Found on API Calls (Wrong Schema)¶
Symptoms:
# Browser console or network tab shows:
POST http://10.89.97.214:8000/rest/v1/accounts 404 (Not Found)
{"code":"42P01","details":null,"message":"relation \"home_portal.accounts\" does not exist"}
Cause: Supabase client is using the wrong schema. In a multi-schema setup, clients default to the first schema in PostgREST configuration (typically home_portal), causing queries to target the wrong schema.
Solution: Configure the Supabase client to use the correct schema:
// lib/supabase/client.ts
import { createBrowserClient } from "@supabase/ssr"
export function createClient() {
return createBrowserClient(
process.env.NEXT_PUBLIC_SUPABASE_URL!,
process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
{
db: { schema: "your_app_schema" } // ← CRITICAL: Specify your schema!
}
)
}
// lib/supabase/server.ts
import { createServerClient } from "@supabase/ssr"
import { cookies } from "next/headers"
export async function createClient() {
const cookieStore = await cookies()
return createServerClient(
process.env.NEXT_PUBLIC_SUPABASE_URL!,
process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
{
db: { schema: "your_app_schema" }, // ← CRITICAL: Specify your schema!
cookies: {
get(name: string) {
return cookieStore.get(name)?.value
},
},
}
)
}
Schema names by app:
- home_portal - Home Portal
- money_tracker - Money Tracker
- rms - Recipe Management System
Verification:
# After fixing, rebuild and redeploy
cd /root/projects/your-app
npm run build
# Test API access
ANON_KEY=$(kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d)
curl -X GET "http://10.89.97.214:8000/rest/v1/your_table" \
-H "apikey: $ANON_KEY" \
-H "Accept-Profile: your_app_schema" \
-H "Content-Profile: your_app_schema"
See Supabase Multi-App Architecture for details.
Tables Exist But Can't Be Accessed¶
Symptoms:
- Tables visible in Supabase Studio
- API returns 403 Forbidden or empty results
- Queries fail with permission errors
Cause: Missing table permissions. Tables created without proper grants to authenticated, anon, and service_role roles.
Diagnosis:
# Check table permissions
kubectl exec -n supabase postgres-0 -- psql -U postgres << 'EOF'
\dp your_schema.*
EOF
Solution:
# Grant permissions on all existing tables
kubectl exec -n supabase postgres-0 -- psql -U postgres << 'EOF'
GRANT ALL ON ALL TABLES IN SCHEMA your_schema TO postgres, authenticated, service_role;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO anon;
GRANT ALL ON ALL SEQUENCES IN SCHEMA your_schema TO postgres, authenticated, service_role;
EOF
Prevention: Set default privileges when creating a new schema:
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
GRANT ALL ON TABLES TO postgres, authenticated, service_role;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
GRANT SELECT ON TABLES TO anon;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
GRANT ALL ON SEQUENCES TO postgres, authenticated, service_role;
New Schema Not Recognized by PostgREST¶
Symptoms:
- Schema added to PostgREST configuration
- Tables created successfully
- API still returns "schema does not exist" errors
Cause: PostgREST hasn't reloaded its configuration. PostgREST caches the schema list on startup.
Diagnosis:
# Check PostgREST logs
kubectl logs -n supabase -l app=rest --tail=50
# Verify schema is in config
kubectl get configmap -n supabase supabase-config -o yaml | grep PGRST_DB_SCHEMA
Solution:
# Restart PostgREST to reload configuration
kubectl rollout restart deployment -n supabase rest
# Wait for rollout to complete
kubectl rollout status deployment -n supabase rest --timeout=2m
# Verify PostgREST recognizes the schema
ANON_KEY=$(kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d)
curl -I -H "apikey: $ANON_KEY" "http://10.89.97.214:8000/rest/v1/"
# Should return 200 OK
When to restart PostgREST:
- After adding a new schema to PGRST_DB_SCHEMA
- After changing PostgREST configuration
- After modifying database permissions
PostgREST Can't Find Table Relationships¶
Symptoms:
- Nested queries fail: Could not find a relationship between 'table_a' and 'table_b' in the schema cache
- Foreign key exists in database
- Individual table queries work fine
- Nested selects like select('*, child_table(*)') fail
Cause: PostgREST caches table relationships on startup. After applying migrations that add foreign keys, the cache is stale.
Diagnosis:
# Verify the foreign key exists
kubectl exec -n supabase postgres-0 -- psql -U postgres -c "\d <schema>.<table>"
# Look for "Foreign-key constraints" section
# Check PostgREST logs for relationship errors
kubectl logs -n supabase -l app=rest --tail=50 | grep -i relationship
Solution:
# Option 1: Send NOTIFY to reload schema (less disruptive)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c "NOTIFY pgrst, 'reload schema';"
# Option 2: Restart PostgREST if NOTIFY doesn't work
kubectl rollout restart deployment -n supabase rest
kubectl rollout status deployment -n supabase rest --timeout=2m
When to reload the PostgREST schema cache:
- After applying migrations that add/modify foreign keys
- After adding new tables with relationships
- After modifying RLS policies
- When nested queries fail with "relationship not found"
Note: The NOTIFY approach requires PostgREST to be configured to listen for pg_notify events. If it doesn't work, use the restart approach.
Can't Connect to Supabase from App¶
Symptoms:
- Local development works fine
- K8s deployment can't reach Supabase
- Timeout errors or connection refused
Diagnosis:
# Check if Supabase services are running
kubectl get pods -n supabase
# Test connectivity from your app's namespace
kubectl run -it --rm debug --image=busybox -n your-namespace -- sh
# Inside the pod:
wget -O- http://postgres-service.supabase.svc.cluster.local:5432
Common causes:
1. Wrong Supabase URL in ConfigMap - Should be http://10.89.97.214:8000 (the k8s LoadBalancer IP). Verify with:
kubectl get configmap -n your-namespace your-app-config -o yaml
# Verify NEXT_PUBLIC_SUPABASE_URL is correct
2. Missing or wrong ANON_KEY
3. Network policy blocking traffic - Check for NetworkPolicies:
kubectl get networkpolicies -A
Solution: Verify environment variables match k8s Supabase instance:
# Get correct ANON_KEY
kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d
# Update your app's secret
kubectl create secret generic your-app-secrets \
-n your-namespace \
--from-literal=NEXT_PUBLIC_SUPABASE_ANON_KEY="<anon-key>" \
--dry-run=client -o yaml | kubectl apply -f -
# Restart deployment to pick up changes
kubectl rollout restart deployment -n your-namespace your-app
Complete Cluster Failure¶
All nodes down¶
Recovery:
# Start all VMs
qm start 201 202 203
# Wait 2 minutes for full startup
# Check cluster
kubectl get nodes
Master node failure / k3s won't start¶
Diagnosis:
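A reasonable starting point, using the master IP from this guide:
# Is the k3s server service running?
ssh root@10.89.97.201 'systemctl status k3s'
# Check recent k3s server logs for startup errors
ssh root@10.89.97.201 'journalctl -u k3s -n 200 --no-pager'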
Nuclear option - rebuild cluster:
# 1. Backup important data
kubectl get all -A -o yaml > cluster-backup.yaml
# 2. Destroy VMs
qm destroy 201 202 203
# 3. Recreate from documentation
# See 01-installation.md
# 4. Restore applications via GitOps (Phase 4+)
General Debugging Commands¶
# View all events (super useful!)
kubectl get events --sort-by='.lastTimestamp' -A
# Describe shows events + details
kubectl describe pod/my-pod
kubectl describe node/k3s-worker-1
# Logs
kubectl logs pod-name
kubectl logs pod-name --previous # Previous crashed container
# Execute commands in pod
kubectl exec -it pod-name -- bash
# Resource usage
kubectl top nodes
kubectl top pods -A
# API server logs (on master)
ssh root@10.89.97.201 'journalctl -u k3s -f'
LXC/NAS Storage Issues¶
Arr-Stack Permission Errors (Sonarr/Radarr/Lidarr)¶
Symptoms:
[v4.0.15.2941] System.UnauthorizedAccessException: Access to the path '/data/media/tv/...' is denied.
---> System.IO.IOException: Permission denied
Cause: UID mapping mismatch between:
- LXC 101 (NAS): Unprivileged container with UID mapping (container UID 0 → host UID 100000)
- VM 100 (arr-stack): Docker containers running as UID 1000
- Samba shares: Files copied from macOS/Windows are created as root inside the LXC → become UID 100000 on the host
This creates a situation where:
1. Sonarr/Radarr/Lidarr (running as UID 1000) create files as 1000:1000 on the host
2. Files copied via Samba are created as 100000:100000 on the host (root in container)
3. Arr apps can't delete/modify files they don't own → permission errors
Solution: Configure Custom UID Mapping for LXC 101
This solution maps container UID 1000 to host UID 1000, ensuring consistent file ownership across all access methods.
Step 1: Backup and stop LXC 101
# Backup configuration
pct config 101 > /root/lxc-101-config-backup-$(date +%Y%m%d-%H%M%S).txt
# Stop container
pct stop 101
Step 2: Enable UID mapping in Proxmox
# Uncomment the root:1000:1 mapping in subuid/subgid
sed -i 's/#root:1000:1/root:1000:1/' /etc/subuid /etc/subgid
# Verify
cat /etc/subuid /etc/subgid
# Should show:
# root:100000:65536
# root:1000:1
Step 3: Add custom UID mapping to LXC configuration
# Edit /etc/pve/lxc/101.conf and add these lines after "unprivileged: 1":
lxc.idmap: u 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 0 100000 1000
lxc.idmap: g 1000 1000 1
lxc.idmap: g 1001 101001 64535
Explanation:
- u 0 100000 1000 - Map container UIDs 0-999 to host UIDs 100000-100999
- u 1000 1000 1 - Map container UID 1000 to host UID 1000 (direct mapping!)
- u 1001 101001 64535 - Map container UIDs 1001-65535 to host UIDs 101001-165535
- Same for GIDs (g instead of u)
Step 4: Start container and verify
# Start LXC 101
pct start 101
# Verify mapping is working
pct exec 101 -- stat -c "%U:%G (%u:%g)" /mnt/vault/media/tv /mnt/vault/media/movies
# Should show: jake:admin (1000:1000)
Step 5: Fix jake's primary group
# Change jake's primary group from root (0) to admin (1000)
pct exec 101 -- usermod -g 1000 jake
# Verify
pct exec 101 -- id jake
# Should show: uid=1000(jake) gid=1000(admin) groups=1000(admin)...
Step 6: Update Samba configuration
# Add force user/group to vault share
pct exec 101 -- bash -c "cat >> /etc/samba/smb.conf << 'EOF'
# Updated vault section for consistent file ownership
[vault]
force user = jake
force group = admin
valid users = root,jake,@root
write list = root,jake,@root
create mode = 0664
path = /mnt/vault
directory mode = 0775
writeable = yes
EOF"
# Remove old vault section (manually edit to avoid conflicts)
# Then restart Samba
pct exec 101 -- systemctl restart smbd nmbd
Step 7: Fix existing file ownership
# Change all media files to 1000:1000
ssh root@10.89.97.50 "chown -R 1000:1000 /mnt/media"
# This may take several minutes depending on file count
Verification:
Test file creation from all three sources:
# 1. From inside LXC 101 (jake user)
pct exec 101 -- su -s /bin/sh jake -c 'touch /mnt/vault/media/test-lxc.txt'
ls -l /vault/subvol-101-disk-0/media/test-lxc.txt
# Should show: 1000:1000
# 2. From arr-stack (Sonarr container)
ssh root@10.89.97.50 "docker exec sonarr su -s /bin/sh abc -c 'touch /data/media/test-sonarr.txt'"
ls -l /vault/subvol-101-disk-0/media/test-sonarr.txt
# Should show: 1000:1000
# 3. From Mac/Windows via Samba
# - Disconnect and reconnect Samba share
# - Create a new test file
ssh root@10.89.97.50 "ls -l /mnt/media/[your-test-file]"
# Should show: jake:jake (1000:1000)
# Cleanup
rm -f /vault/subvol-101-disk-0/media/test-*.txt
Expected result: All three methods create files as 1000:1000 on the host, ensuring arr-stack apps can always manage media files regardless of how they were created.
Reconnecting Samba shares on macOS/Windows:
After updating Samba configuration, existing connections use old settings. You must disconnect and reconnect:
- Eject/unmount the share
- Wait 5 seconds
- Reconnect:
smb://10.89.97.89/vault
- Authenticate (as root or jake - it doesn't matter, files are forced to jake)
- Create new files - they will now be owned by 1000:1000
Persistence: These changes are permanent and survive container restarts. No manual chown needed in the future.
Ingress Issues¶
404 Not Found (Default Backend)¶
Symptoms: Accessing app via hostname returns NGINX 404 "default backend - 404"
Cause: Host header doesn't match any Ingress rule
Solutions (example commands below):
1. Check the Ingress configuration
2. Verify the hostname matches exactly (case-sensitive)
3. Test with an explicit Host header
4. Check DNS resolution
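A minimal sketch of these checks, reusing the myapp.internal hostname and 10.89.97.220 Ingress IP from this guide ({app} placeholders as used elsewhere):
# 1-2. Inspect Ingress rules and hosts
kubectl get ingress -A
kubectl describe ingress {app} -n {app}
# 3. Bypass DNS and send the Host header straight to the Ingress IP
curl -H "Host: myapp.internal" http://10.89.97.220/
# 4. Confirm the hostname resolves to the Ingress IP via OPNsense
nslookup myapp.internal 10.89.97.1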
DNS Not Resolving¶
Symptoms: curl: (6) Could not resolve host: myapp.internal
Solutions:
1. Check OPNsense DNS override configured:
- Services → Unbound DNS → Overrides → Host Overrides
- Should show: myapp.internal → 10.89.97.220
2. Or add to /etc/hosts:
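For example, using the same mapping as the OPNsense override above:
# /etc/hosts on your workstation
10.89.97.220  myapp.internal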
Connection Timeout via Ingress¶
Symptoms: Request hangs or times out
Debugging:
# Check NGINX Ingress pod logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Check backend service
kubectl get svc -n {app}
# Should be type: ClusterIP (not LoadBalancer)
# Check backend pod
kubectl get pods -n {app}
kubectl logs -n {app} -l app={app}
# Test service directly
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://{app}.{app}.svc.cluster.local
Wrong Service Type¶
Symptoms: App has both LoadBalancer IP and Ingress, causing conflicts
Solution:
# Change Service to ClusterIP
kubectl patch svc {app} -n {app} -p '{"spec":{"type":"ClusterIP"}}'
# Verify
kubectl get svc -n {app}
# Should show type: ClusterIP, no EXTERNAL-IP
Ingress Not Getting IP Address¶
Symptoms: kubectl get ingress shows empty ADDRESS column
Causes:
- NGINX Ingress Controller not running
- IngressClass not specified
Solutions:
# Check Ingress controller pods
kubectl get pods -n ingress-nginx
# Check IngressClass
kubectl get ingressclass
# Should show 'nginx' class
# Verify Ingress spec has ingressClassName
kubectl get ingress {app} -n {app} -o yaml | grep ingressClassName
# Should show: ingressClassName: nginx
DNS Issues¶
Internet Outage Causes Authentication 500 Errors¶
Symptoms:
- Internet goes down
- Apps with Authentik SSO (arr-stack, etc.) return 500 Internal Server Error
- Direct IP access works (e.g., http://10.89.97.50:8265)
- Domain-based access fails (e.g., https://radarr.bogocat.com)
- After some delay, services "magically" start working again
Root Cause: K3s nodes were using external DNS (1.1.1.1 Cloudflare) instead of OPNsense (10.89.97.1).
When internet drops:
1. Browser resolves *.bogocat.com via OPNsense (works - has local overrides)
2. Request hits K8s Ingress
3. NGINX Ingress calls auth-url to verify authentication
4. Authentik/CoreDNS needs to resolve external URLs
5. CoreDNS forwards to 1.1.1.1 → TIMEOUT (no internet)
6. Auth check fails → 500 error
Why services "suddenly work": Chrome's DNS cache expires, the query is retried, and OPNsense responds with the internal IP.
Diagnosis:
# Check what DNS K3s nodes are using
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== $ip ==="
ssh root@$ip 'cat /etc/resolv.conf | grep nameserver'
done
# BAD: nameserver 1.1.1.1
# GOOD: nameserver 10.89.97.1
Solution: Configure K3s Nodes to Use OPNsense DNS
# Run on each K3s node (201, 202, 203)
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== Configuring $ip ==="
ssh root@$ip "mkdir -p /etc/systemd/resolved.conf.d && cat > /etc/systemd/resolved.conf.d/local-dns.conf << 'EOF'
[Resolve]
DNS=10.89.97.1
FallbackDNS=1.1.1.1
Domains=~.
EOF
systemctl restart systemd-resolved"
done
# Restart CoreDNS to pick up new resolv.conf
kubectl rollout restart deployment coredns -n kube-system
# Verify DNS resolution from inside K8s
kubectl run dns-test --rm -it --restart=Never --image=busybox:latest -- nslookup auth.bogocat.com
# Should return: 10.89.97.220 (internal Ingress IP)
Verification:
# Check each node's DNS configuration
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
echo "=== $ip ==="
ssh root@$ip 'resolvectl status | head -10'
done
# Should show:
# DNS Servers 10.89.97.1
# Fallback DNS Servers 1.1.1.1
# Test DNS resolution
ssh root@10.89.97.201 'resolvectl query auth.bogocat.com'
# Should return: 10.89.97.220
Why this works:
- OPNsense has DNS overrides for *.bogocat.com → internal IPs
- During internet outage, local DNS still works
- Fallback to 1.1.1.1 only if OPNsense is down
Related services affected:
- All arr-stack apps (radarr, sonarr, prowlarr, etc.) with Authentik forward auth
- Any K8s app using external DNS resolution
- Authentik itself if it needs to resolve external URLs
K3s Node DNS Configuration Reference¶
Location: /etc/systemd/resolved.conf.d/local-dns.conf
Standard configuration for all K3s nodes:
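This is the same drop-in applied in the fix above:
# /etc/systemd/resolved.conf.d/local-dns.conf
[Resolve]
DNS=10.89.97.1
FallbackDNS=1.1.1.1
Domains=~.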
Explanation:
- DNS=10.89.97.1 - Primary DNS is OPNsense (has local overrides)
- FallbackDNS=1.1.1.1 - Fallback to Cloudflare if OPNsense is down
- Domains=~. - Use this DNS for all domains (not just specific ones)
Node IPs:
- k3s-master: 10.89.97.201
- k3s-worker-1: 10.89.97.202
- k3s-worker-2: 10.89.97.203
Docker Build Issues¶
BuildKit "spawn sh EACCES" Error¶
Symptoms:
Cause: Corrupted Docker buildx builder after system upgrades (apt dist-upgrade).
Diagnosis:
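A quick check, assuming the default buildx setup:
# List builders and their status
docker buildx ls
# Inspect the currently selected builder for errors
docker buildx inspect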
Solution:
# Create fresh builder
docker buildx create --name fresh-builder --use
docker buildx inspect --bootstrap
# Use in builds
docker buildx build --builder fresh-builder --load -t app:v1.0.0 .
Full details: See Docker Deployment Guide
Getting Help¶
If issues persist:
1. Check events: kubectl get events -A --sort-by='.lastTimestamp'
2. Check logs: kubectl logs pod-name and ssh root@NODE 'journalctl -u k3s'
3. Search k3s GitHub issues: https://github.com/k3s-io/k3s/issues
4. Kubernetes documentation: https://kubernetes.io/docs/
5. NGINX Ingress docs: https://kubernetes.github.io/ingress-nginx/