Troubleshooting Guide

Common issues and solutions for the tower-fleet k3s cluster.


Node Issues

Node shows "NotReady"

Symptoms:

kubectl get nodes
# k3s-worker-1   NotReady   <none>   5m

Diagnosis:

# Check node conditions
kubectl describe node/k3s-worker-1

# Check kubelet logs
ssh root@10.89.97.202 'journalctl -u k3s-agent -n 100'

Common causes:

  1. Network issues - node can't reach the master

    ssh root@10.89.97.202 'ping -c 3 10.89.97.201'

  2. k3s-agent not running

    ssh root@10.89.97.202 'systemctl status k3s-agent'
    ssh root@10.89.97.202 'systemctl restart k3s-agent'

  3. Disk pressure / resource exhaustion

    ssh root@10.89.97.202 'df -h'
    ssh root@10.89.97.202 'free -h'

Solution:

# Restart k3s-agent
ssh root@10.89.97.202 'systemctl restart k3s-agent'

# Wait 30 seconds, then check
kubectl get nodes


DiskPressure causing pod evictions

Symptoms:

kubectl get nodes
# NODE STATUS with DiskPressure condition

kubectl get events -n longhorn-system
# Warning  Evicted  pod/longhorn-manager-xxx  The node had condition: [DiskPressure]

Diagnosis:

# Check disk usage on all nodes
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== $ip ==="
  ssh root@$ip 'df -h /'
done

# Check VM disk configuration
for vmid in 201 202 203; do
  echo "=== VM $vmid ==="
  qm config $vmid | grep scsi0
done

Common cause: VMs have insufficient disk space (often 3GB from cloud image instead of 80GB)

Root cause: The qm importdisk command imports the cloud image at its original size (~3GB), and qm set --scsi0 ...,size=80G does NOT resize imported disks (only works for new disks).

Solution:

# 1. Stop affected services (if Longhorn is failing)
helm uninstall longhorn -n longhorn-system 2>/dev/null || true
kubectl delete namespace longhorn-system --force --grace-period=0 &

# 2. Resize VM disks at Proxmox level
for vmid in 201 202 203; do
  echo "Resizing VM $vmid to 80GB..."
  qm resize $vmid scsi0 80G
done

# 3. Extend partitions and filesystems inside VMs
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== Extending filesystem on $ip ==="
  ssh root@$ip 'growpart /dev/sda 1 && resize2fs /dev/sda1'
done

# 4. Verify disk space
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  ssh root@$ip 'df -h / | tail -1'
done
# Should show ~79G total, ~74G available

# 5. Wait for namespace deletion to complete
kubectl get namespace longhorn-system 2>&1 || echo "Namespace deleted"

# 6. Reinstall Longhorn
kubectl create namespace longhorn-system
helm install longhorn longhorn/longhorn --namespace longhorn-system

Prevention:

Always include qm resize in VM creation scripts after qm importdisk:

# Correct way:
qm importdisk $VMID "$TEMPLATE_IMG" $STORAGE
qm set $VMID --scsi0 ${STORAGE}:vm-${VMID}-disk-0
qm resize $VMID scsi0 +77G  # ← This actually resizes the disk!

Verification after fix:

# Check Longhorn pods are running
kubectl get pods -n longhorn-system

# Check available storage
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.ephemeral-storage

Node not appearing in cluster

Diagnosis:

# Check if k3s-agent is running
ssh root@10.89.97.202 'systemctl status k3s-agent'

# Check logs for errors
ssh root@10.89.97.202 'journalctl -u k3s-agent -f'

Common causes:

  1. Wrong join token - token mismatch
  2. Wrong master IP - agent can't find the API server
  3. Firewall blocking port 6443

Solution:

# Re-fetch token from master
K3S_TOKEN=$(ssh root@10.89.97.201 'cat /var/lib/rancher/k3s/server/node-token')

# Uninstall k3s-agent
ssh root@10.89.97.202 '/usr/local/bin/k3s-agent-uninstall.sh'

# Reinstall with correct token
ssh root@10.89.97.202 "curl -sfL https://get.k3s.io | K3S_URL='https://10.89.97.201:6443' K3S_TOKEN='${K3S_TOKEN}' sh -"


Pod Issues

Pod stuck in "Pending"

Symptoms:

kubectl get pods
# my-app-xxx   0/1   Pending   0   5m

Diagnosis:

kubectl describe pod/my-app-xxx
# Look at Events section at bottom

Common causes:

  1. No resources available - all nodes are full

    Warning  FailedScheduling  ... 0/3 nodes available: insufficient memory

    Solution: Scale down other apps or add more nodes (see the scheduling checks after this list)

  2. PVC not bound - waiting for storage

    Warning  FailedScheduling  ... persistentvolumeclaim "data" not found

    Solution: Check PVC status: kubectl get pvc

  3. Node selector mismatch - pod requires a specific node

    Solution: Check the pod spec for nodeSelector/affinity (see the scheduling checks after this list)
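
A quick set of scheduling checks covering causes 1 and 3; a sketch assuming the pod name my-app-xxx from the symptoms above - substitute the real pod name.

# What the pod asks for: nodeSelector and resource requests
kubectl get pod my-app-xxx -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.containers[*].resources}{"\n"}'

# Node labels (must contain any nodeSelector key/value shown above)
kubectl get nodes --show-labels

# Per-node requests vs. allocatable capacity
kubectl describe nodes | grep -A 8 'Allocated resources'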


Pod stuck in "ContainerCreating"

Diagnosis:

kubectl describe pod/my-app-xxx
# Check Events

Common causes:

  1. Image pull error - can't download the container image

    Warning  Failed  ... Failed to pull image "myapp:v1": rpc error: code = Unknown

    Solution:
    - Check the image name/tag is correct
    - Check the image registry is accessible
    - Add imagePullSecrets if using a private registry

  2. Volume mount issues - can't mount the PVC

    Solution: Check the PVC exists: kubectl get pvc

Pod CrashLoopBackOff

Symptoms:

kubectl get pods
# my-app-xxx   0/1   CrashLoopBackOff   5   3m

Diagnosis:

# Check logs
kubectl logs my-app-xxx

# Check previous container logs (if restarted)
kubectl logs my-app-xxx --previous

# Describe pod
kubectl describe pod/my-app-xxx

Common causes:

  1. Application error - app crashes on startup
    - Check logs for error messages
    - Verify environment variables are correct
    - Check ConfigMaps/Secrets are mounted properly

  2. Liveness/Readiness probe failing
    - Probe checks fail → k8s kills the pod → it restarts → fails again
    - Solution: Adjust probe timing (see the sketch after this list) or fix the app's health endpoint

  3. Missing dependencies - database not ready, etc.
    - Solution: Use init containers or retry logic
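
If the probe is the culprit, relaxing its timing is often enough to let a slow-starting app come up. A minimal sketch using a strategic merge patch; the deployment/container name my-app and the timing values are hypothetical - adjust them to the real workload.

# Assumes the container already defines a livenessProbe with a handler;
# the strategic merge only adjusts its timing fields.
# "my-app" must match both the deployment name and the container name.
kubectl patch deployment my-app --type=strategic -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "my-app",
    "livenessProbe": {"initialDelaySeconds": 30, "periodSeconds": 10, "failureThreshold": 5}
  }]}}}'

# Watch the pod settle after the rollout
kubectl get pods -w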

Pod "ImagePullBackOff"

Symptoms:

kubectl get pods
# my-app-xxx   0/1   ImagePullBackOff   0   2m

Diagnosis:

kubectl describe pod/my-app-xxx
# Look for "Failed to pull image" in Events

Solutions:

  1. Typo in the image name

    image: nginx:lates  # Should be "latest"

  2. Private registry without credentials

    # Create secret
    kubectl create secret docker-registry regcred \
      --docker-server=ghcr.io \
      --docker-username=myuser \
      --docker-password=mytoken

    # Reference in pod spec
    imagePullSecrets:
    - name: regcred

  3. Image doesn't exist

    Verify the image exists: docker pull IMAGE_NAME
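
Because the cluster nodes run containerd rather than Docker, the same check can be done directly on a node with the crictl bundled into k3s; the image reference below is a placeholder - use the one that is failing.

# Pull the image on a worker node to confirm the name, tag, and registry access
ssh root@10.89.97.202 'k3s crictl pull docker.io/library/nginx:latest'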

kubectl Issues

"connection refused" when running kubectl

Symptoms:

kubectl get nodes
# The connection to the server 10.89.97.201:6443 was refused

Diagnosis:

# Check if master is running
qm status 201

# Check if k3s is running on master
ssh root@10.89.97.201 'systemctl status k3s'

# Check network connectivity
ping -c 3 10.89.97.201
nc -zv 10.89.97.201 6443

Solutions:

  1. Master VM is stopped

    qm start 201
    # Wait 30 seconds for k3s to start

  2. k3s service not running

    ssh root@10.89.97.201 'systemctl start k3s'

  3. Network issue
    - Check firewall rules
    - Verify the IP address in the kubeconfig is correct

"Unable to connect to the server: x509: certificate signed by unknown authority"

Cause: Kubeconfig has wrong CA certificate

Solution:

# Re-copy kubeconfig from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config
chmod 600 ~/.kube/config


"error: You must be logged in to the server (Unauthorized)"

Cause: Wrong or expired credentials

Solution:

# Check kubeconfig is present
cat ~/.kube/config

# Re-copy from master
scp root@10.89.97.201:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i 's/127.0.0.1/10.89.97.201/g' ~/.kube/config


Service/Network Issues

LoadBalancer service stuck in "Pending"

Symptoms:

kubectl get svc
# my-app   LoadBalancer   10.43.x.x   <pending>   80:31234/TCP

Cause: MetalLB not installed (Phase 3)

Temporary workaround:

# Use NodePort instead
kubectl expose deployment my-app --port=80 --type=NodePort

# Access via node IP:port
kubectl get svc my-app
# my-app   NodePort   10.43.x.x   <none>   80:31234/TCP

# Access at: http://10.89.97.201:31234

Permanent solution: Install MetalLB (see 03-core-infrastructure.md)
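
For reference, MetalLB in layer-2 mode only needs an address pool and an advertisement once the controller itself is installed. A minimal sketch, assuming the MetalLB 0.13+ CRDs and the metallb-system namespace; the address range is a placeholder - use whatever range 03-core-infrastructure.md reserves for this cluster.

kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.89.97.210-10.89.97.230   # placeholder range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
EOF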


Can't access service from outside cluster

Diagnosis:

# Check service type
kubectl get svc my-app

# Test from within cluster
kubectl run test --rm -it --image=busybox -- wget -O- http://my-app

Solutions:

  1. Service is ClusterIP - only accessible inside the cluster

    # Change to LoadBalancer or NodePort
    kubectl patch svc my-app -p '{"spec":{"type":"LoadBalancer"}}'

  2. MetalLB not configured - see above

  3. Port mismatch

    kubectl describe svc my-app
    # Verify Port: and TargetPort: are correct


Storage Issues

PVC stuck in "Pending"

Symptoms:

kubectl get pvc
# data-pvc   Pending   ...

Diagnosis:

kubectl describe pvc data-pvc
# Check Events section

Common causes:

  1. No default StorageClass (before Longhorn installation)

    kubectl get storageclass
    # Should have one marked as (default)

    Temporary: Use k3s's built-in local-path class:

    storageClassName: local-path

  2. StorageClass doesn't exist
    - Verify in the PVC spec: storageClassName: longhorn
    - Check it exists: kubectl get sc longhorn
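
For reference, a minimal PVC that requests Longhorn storage explicitly; the name, namespace, and size below are placeholders.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc          # placeholder
  namespace: default      # placeholder
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn   # or local-path before Longhorn is installed
  resources:
    requests:
      storage: 5Gi
EOF

# Should move from Pending to Bound once the provisioner picks it up
kubectl get pvc data-pvc -w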

Supabase Issues

404 Not Found on API Calls (Wrong Schema)

Symptoms:

# Browser console or network tab shows:
POST http://10.89.97.214:8000/rest/v1/accounts 404 (Not Found)
{"code":"42P01","details":null,"message":"relation \"home_portal.accounts\" does not exist"}

Cause: Supabase client is using the wrong schema. In a multi-schema setup, clients default to the first schema in PostgREST configuration (typically home_portal), causing queries to target the wrong schema.

Solution: Configure the Supabase client to use the correct schema:

// lib/supabase/client.ts
import { createBrowserClient } from "@supabase/ssr"

export function createClient() {
  return createBrowserClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    {
      db: { schema: "your_app_schema" }  // ← CRITICAL: Specify your schema!
    }
  )
}

// lib/supabase/server.ts
import { createServerClient } from "@supabase/ssr"
import { cookies } from "next/headers"

export async function createClient() {
  const cookieStore = await cookies()

  return createServerClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    {
      db: { schema: "your_app_schema" },  // ← CRITICAL: Specify your schema!
      cookies: {
        get(name: string) {
          return cookieStore.get(name)?.value
        },
      },
    }
  )
}

Schema names by app:
- home_portal - Home Portal
- money_tracker - Money Tracker
- rms - Recipe Management System

Verification:

# After fixing, rebuild and redeploy
cd /root/projects/your-app
npm run build

# Test API access
ANON_KEY=$(kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d)
curl -X GET "http://10.89.97.214:8000/rest/v1/your_table" \
  -H "apikey: $ANON_KEY" \
  -H "Accept-Profile: your_app_schema" \
  -H "Content-Profile: your_app_schema"

See Supabase Multi-App Architecture for details.


Tables Exist But Can't Be Accessed

Symptoms:
- Tables visible in Supabase Studio
- API returns 403 Forbidden or empty results
- Queries fail with permission errors

Cause: Missing table permissions. Tables created without proper grants to authenticated, anon, and service_role roles.

Diagnosis:

# Check table permissions
kubectl exec -i -n supabase postgres-0 -- psql -U postgres << 'EOF'
\dp your_schema.*
EOF

Solution:

# Grant permissions on all existing tables
kubectl exec -i -n supabase postgres-0 -- psql -U postgres << 'EOF'
GRANT ALL ON ALL TABLES IN SCHEMA your_schema TO postgres, authenticated, service_role;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO anon;
GRANT ALL ON ALL SEQUENCES IN SCHEMA your_schema TO postgres, authenticated, service_role;
EOF

Prevention: Set default privileges when creating a new schema:

ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT ALL ON TABLES TO postgres, authenticated, service_role;

ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT SELECT ON TABLES TO anon;

ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT ALL ON SEQUENCES TO postgres, authenticated, service_role;
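
These statements can be applied the same way as the grants above; a sketch assuming the same postgres-0 pod, with your_schema replaced by the real schema name.

kubectl exec -i -n supabase postgres-0 -- psql -U postgres << 'EOF'
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT ALL ON TABLES TO postgres, authenticated, service_role;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT SELECT ON TABLES TO anon;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema
  GRANT ALL ON SEQUENCES TO postgres, authenticated, service_role;
EOF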

New Schema Not Recognized by PostgREST

Symptoms:
- Schema added to PostgREST configuration
- Tables created successfully
- API still returns "schema does not exist" errors

Cause: PostgREST hasn't reloaded its configuration. PostgREST caches the schema list on startup.

Diagnosis:

# Check PostgREST logs
kubectl logs -n supabase -l app=rest --tail=50

# Verify schema is in config
kubectl get configmap -n supabase supabase-config -o yaml | grep PGRST_DB_SCHEMA

Solution:

# Restart PostgREST to reload configuration
kubectl rollout restart deployment -n supabase rest

# Wait for rollout to complete
kubectl rollout status deployment -n supabase rest --timeout=2m

# Verify PostgREST recognizes the schema
ANON_KEY=$(kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d)
curl -I -H "apikey: $ANON_KEY" "http://10.89.97.214:8000/rest/v1/"
# Should return 200 OK

When to restart PostgREST:
- After adding a new schema to PGRST_DB_SCHEMA
- After changing PostgREST configuration
- After modifying database permissions


PostgREST Can't Find Table Relationships

Symptoms:
- Nested queries fail: Could not find a relationship between 'table_a' and 'table_b' in the schema cache
- Foreign key exists in the database
- Individual table queries work fine
- Nested selects like select('*, child_table(*)') fail

Cause: PostgREST caches table relationships on startup. After applying migrations that add foreign keys, the cache is stale.

Diagnosis:

# Verify the foreign key exists
kubectl exec -n supabase postgres-0 -- psql -U postgres -c "\d <schema>.<table>"
# Look for "Foreign-key constraints" section

# Check PostgREST logs for relationship errors
kubectl logs -n supabase -l app=rest --tail=50 | grep -i relationship

Solution:

# Option 1: Send NOTIFY to reload schema (less disruptive)
kubectl exec -n supabase postgres-0 -- psql -U postgres -c "NOTIFY pgrst, 'reload schema';"

# Option 2: Restart PostgREST if NOTIFY doesn't work
kubectl rollout restart deployment -n supabase rest
kubectl rollout status deployment -n supabase rest --timeout=2m

When to reload the PostgREST schema cache:
- After applying migrations that add/modify foreign keys
- After adding new tables with relationships
- After modifying RLS policies
- When nested queries fail with "relationship not found"

Note: The NOTIFY approach requires PostgREST to be configured to listen for pg_notify events. If it doesn't work, use the restart approach.


Can't Connect to Supabase from App

Symptoms:
- Local development works fine
- K8s deployment can't reach Supabase
- Timeout errors or connection refused

Diagnosis:

# Check if Supabase services are running
kubectl get pods -n supabase

# Test connectivity from your app's namespace
kubectl run -it --rm debug --image=busybox -n your-namespace -- sh
# Inside the pod:
wget -O- http://postgres-service.supabase.svc.cluster.local:5432
# An immediate error means the port is reachable; a hang or timeout points to a network problem

Common causes:

  1. Wrong Supabase URL in ConfigMap

    kubectl get configmap -n your-namespace your-app-config -o yaml
    # Verify NEXT_PUBLIC_SUPABASE_URL is correct
    
    Should be: http://10.89.97.214:8000 (k8s LoadBalancer IP)

  2. Missing or wrong ANON_KEY

    kubectl get secret -n your-namespace your-app-secrets -o yaml
    # Verify NEXT_PUBLIC_SUPABASE_ANON_KEY matches Supabase
    

  3. Network policy blocking traffic

    Check for NetworkPolicies: kubectl get networkpolicies -A

Solution: Verify environment variables match k8s Supabase instance:

# Get correct ANON_KEY
kubectl get secret -n supabase supabase-secrets -o jsonpath='{.data.ANON_KEY}' | base64 -d

# Update your app's secret
kubectl create secret generic your-app-secrets \
  -n your-namespace \
  --from-literal=NEXT_PUBLIC_SUPABASE_ANON_KEY="<anon-key>" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart deployment to pick up changes
kubectl rollout restart deployment -n your-namespace your-app

Complete Cluster Failure

All nodes down

Recovery:

# Start all VMs (qm start takes one VMID at a time)
for vmid in 201 202 203; do qm start $vmid; done

# Wait 2 minutes for full startup

# Check cluster
kubectl get nodes


Master node failure / k3s won't start

Diagnosis:

ssh root@10.89.97.201 'journalctl -u k3s -n 200'

Nuclear option - rebuild cluster:

# 1. Backup important data
kubectl get all -A -o yaml > cluster-backup.yaml

# 2. Destroy VMs (qm destroy takes one VMID at a time)
for vmid in 201 202 203; do qm destroy $vmid; done

# 3. Recreate from documentation
# See 01-installation.md

# 4. Restore applications via GitOps (Phase 4+)


General Debugging Commands

# View all events (super useful!)
kubectl get events --sort-by='.lastTimestamp' -A

# Describe shows events + details
kubectl describe pod/my-pod
kubectl describe node/k3s-worker-1

# Logs
kubectl logs pod-name
kubectl logs pod-name --previous  # Previous crashed container

# Execute commands in pod
kubectl exec -it pod-name -- bash

# Resource usage
kubectl top nodes
kubectl top pods -A

# API server logs (on master)
ssh root@10.89.97.201 'journalctl -u k3s -f'

LXC/NAS Storage Issues

Arr-Stack Permission Errors (Sonarr/Radarr/Lidarr)

Symptoms:

[v4.0.15.2941] System.UnauthorizedAccessException: Access to the path '/data/media/tv/...' is denied.
 ---> System.IO.IOException: Permission denied

Cause: UID mapping mismatch between:
- LXC 101 (NAS): Unprivileged container with UID mapping (container UID 0 → host UID 100000)
- VM 100 (arr-stack): Docker containers running as UID 1000
- Samba shares: Files copied via macOS/Windows are created as root inside the LXC → become UID 100000 on the host

This creates a situation where:

  1. Sonarr/Radarr/Lidarr (running as UID 1000) create files as 1000:1000 on the host
  2. Files copied via Samba are created as 100000:100000 on the host (root in the container)
  3. Arr apps can't delete/modify files they don't own → permission errors
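
To confirm the mismatch before changing anything, listing numeric owners on the Proxmox host usually makes it obvious; a sketch assuming the dataset path used in the verification steps below.

# Numeric UIDs make the mismatch visible
ls -ln /vault/subvol-101-disk-0/media | head
# Files owned by 100000 were written through the LXC as root (e.g. Samba copies);
# files owned by 1000 were written by the arr-stack containers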

Solution: Configure Custom UID Mapping for LXC 101

This solution maps container UID 1000 to host UID 1000, ensuring consistent file ownership across all access methods.

Step 1: Backup and stop LXC 101

# Backup configuration
pct config 101 > /root/lxc-101-config-backup-$(date +%Y%m%d-%H%M%S).txt

# Stop container
pct stop 101

Step 2: Enable UID mapping in Proxmox

# Uncomment the root:1000:1 mapping in subuid/subgid
sed -i 's/#root:1000:1/root:1000:1/' /etc/subuid /etc/subgid

# Verify
cat /etc/subuid /etc/subgid
# Should show:
# root:100000:65536
# root:1000:1

Step 3: Add custom UID mapping to LXC configuration

# Edit /etc/pve/lxc/101.conf and add these lines after "unprivileged: 1":
lxc.idmap: u 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 0 100000 1000
lxc.idmap: g 1000 1000 1
lxc.idmap: g 1001 101001 64535

Explanation:
- u 0 100000 1000     - map container UIDs 0-999 to host UIDs 100000-100999
- u 1000 1000 1       - map container UID 1000 to host UID 1000 (direct mapping!)
- u 1001 101001 64535 - map container UIDs 1001-65535 to host UIDs 101001-165535
- The g lines do the same for GIDs

Step 4: Start container and verify

# Start LXC 101
pct start 101

# Verify mapping is working
pct exec 101 -- stat -c "%U:%G (%u:%g)" /mnt/vault/media/tv /mnt/vault/media/movies
# Should show: jake:admin (1000:1000)

Step 5: Fix jake's primary group

# Change jake's primary group from root (0) to admin (1000)
pct exec 101 -- usermod -g 1000 jake

# Verify
pct exec 101 -- id jake
# Should show: uid=1000(jake) gid=1000(admin) groups=1000(admin)...

Step 6: Update Samba configuration

# Add force user/group to vault share
pct exec 101 -- bash -c "cat >> /etc/samba/smb.conf << 'EOF'

# Updated vault section for consistent file ownership
[vault]
    force user = jake
    force group = admin
    valid users = root,jake,@root
    write list = root,jake,@root
    create mode = 0664
    path = /mnt/vault
    directory mode = 0775
    writeable = yes
EOF"

# Remove old vault section (manually edit to avoid conflicts)
# Then restart Samba
pct exec 101 -- systemctl restart smbd nmbd

Step 7: Fix existing file ownership

# Change all media files to 1000:1000
ssh root@10.89.97.50 "chown -R 1000:1000 /mnt/media"
# This may take several minutes depending on file count

Verification:

Test file creation from all three sources:

# 1. From inside LXC 101 (jake user)
pct exec 101 -- su -s /bin/sh jake -c 'touch /mnt/vault/media/test-lxc.txt'
ls -l /vault/subvol-101-disk-0/media/test-lxc.txt
# Should show: 1000:1000

# 2. From arr-stack (Sonarr container)
ssh root@10.89.97.50 "docker exec sonarr su -s /bin/sh abc -c 'touch /data/media/test-sonarr.txt'"
ls -l /vault/subvol-101-disk-0/media/test-sonarr.txt
# Should show: 1000:1000

# 3. From Mac/Windows via Samba
# - Disconnect and reconnect Samba share
# - Create a new test file
ssh root@10.89.97.50 "ls -l /mnt/media/[your-test-file]"
# Should show: jake:jake (1000:1000)

# Cleanup
rm -f /vault/subvol-101-disk-0/media/test-*.txt

Expected result: All three methods create files as 1000:1000 on the host, ensuring arr-stack apps can always manage media files regardless of how they were created.

Reconnecting Samba shares on macOS/Windows:

After updating Samba configuration, existing connections use old settings. You must disconnect and reconnect:

  1. Eject/unmount the share
  2. Wait 5 seconds
  3. Reconnect: smb://10.89.97.89/vault
  4. Authenticate (as root or jake - doesn't matter, files are forced to jake)
  5. Create new files - they will now be owned by 1000:1000

Persistence: These changes are permanent and survive container restarts. No manual chown needed in the future.


Ingress Issues

404 Not Found (Default Backend)

Symptoms: Accessing app via hostname returns NGINX 404 "default backend - 404"

Cause: Host header doesn't match any Ingress rule

Solutions:

  1. Check the Ingress configuration (a reference manifest is sketched after this list):

    kubectl get ingress -n {app}
    kubectl describe ingress {app} -n {app}

  2. Verify the hostname matches exactly (case-sensitive)

  3. Test with an explicit Host header:

    curl -H "Host: myapp.internal" http://10.89.97.220

  4. Check DNS resolution:

    nslookup myapp.internal
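
For comparison, a host-routed Ingress for the NGINX controller generally looks like the sketch below; the names, namespace, hostname, and port are placeholders.

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: myapp
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.internal          # must match the Host header exactly
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp           # ClusterIP Service in the same namespace
            port:
              number: 80
EOF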

DNS Not Resolving

Symptoms: curl: (6) Could not resolve host: myapp.internal

Solutions:

  1. Check the OPNsense DNS override is configured:
    - Services → Unbound DNS → Overrides → Host Overrides
    - Should show: myapp.internal → 10.89.97.220

  2. Or add to /etc/hosts:

    echo "10.89.97.220 myapp.internal" | sudo tee -a /etc/hosts

  3. Clear the DNS cache (if applicable)

  4. Restart Unbound DNS in OPNsense
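
To separate a client-side cache problem from a missing override, query the OPNsense resolver directly; this assumes Unbound answers on 10.89.97.1 as described in the DNS Issues section below.

# Ask OPNsense (Unbound) directly, bypassing the local resolver cache
nslookup myapp.internal 10.89.97.1
# If this returns 10.89.97.220 but a plain nslookup does not, the problem is local DNS caching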

Connection Timeout via Ingress

Symptoms: Request hangs or times out

Debugging:

# Check NGINX Ingress pod logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Check backend service
kubectl get svc -n {app}
# Should be type: ClusterIP (not LoadBalancer)

# Check backend pod
kubectl get pods -n {app}
kubectl logs -n {app} -l app={app}

# Test service directly
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://{app}.{app}.svc.cluster.local

Wrong Service Type

Symptoms: App has both LoadBalancer IP and Ingress, causing conflicts

Solution:

# Change Service to ClusterIP
kubectl patch svc {app} -n {app} -p '{"spec":{"type":"ClusterIP"}}'

# Verify
kubectl get svc -n {app}
# Should show type: ClusterIP, no EXTERNAL-IP

Ingress Not Getting IP Address

Symptoms: kubectl get ingress shows empty ADDRESS column

Causes: - NGINX Ingress Controller not running - IngressClass not specified

Solutions:

# Check Ingress controller pods
kubectl get pods -n ingress-nginx

# Check IngressClass
kubectl get ingressclass
# Should show 'nginx' class

# Verify Ingress spec has ingressClassName
kubectl get ingress {app} -n {app} -o yaml | grep ingressClassName
# Should show: ingressClassName: nginx


DNS Issues

Internet Outage Causes Authentication 500 Errors

Symptoms:
- Internet goes down
- Apps with Authentik SSO (arr-stack, etc.) return 500 Internal Server Error
- Direct IP access works (e.g., http://10.89.97.50:8265)
- Domain-based access fails (e.g., https://radarr.bogocat.com)
- After some delay, services "magically" start working again

Root Cause: K3s nodes were using external DNS (1.1.1.1 Cloudflare) instead of OPNsense (10.89.97.1).

When internet drops:

  1. Browser resolves *.bogocat.com via OPNsense (works - has local overrides)
  2. Request hits the K8s Ingress
  3. NGINX Ingress calls auth-url to verify authentication
  4. Authentik/CoreDNS needs to resolve external URLs
  5. CoreDNS forwards to 1.1.1.1 → TIMEOUT (no internet)
  6. Auth check fails → 500 error

Why services "suddenly work": Chrome's DNS cache expires, retries query, OPNsense responds with internal IP.

Diagnosis:

# Check what DNS K3s nodes are using
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== $ip ==="
  ssh root@$ip 'cat /etc/resolv.conf | grep nameserver'
done
# BAD: nameserver 1.1.1.1
# GOOD: nameserver 10.89.97.1

Solution: Configure K3s Nodes to Use OPNsense DNS

# Run on each K3s node (201, 202, 203)
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== Configuring $ip ==="
  ssh root@$ip "mkdir -p /etc/systemd/resolved.conf.d && cat > /etc/systemd/resolved.conf.d/local-dns.conf << 'EOF'
[Resolve]
DNS=10.89.97.1
FallbackDNS=1.1.1.1
Domains=~.
EOF
systemctl restart systemd-resolved"
done

# Restart CoreDNS to pick up new resolv.conf
kubectl rollout restart deployment coredns -n kube-system

# Verify DNS resolution from inside K8s
kubectl run dns-test --rm -it --restart=Never --image=busybox:latest -- nslookup auth.bogocat.com
# Should return: 10.89.97.220 (internal Ingress IP)

Verification:

# Check each node's DNS configuration
for ip in 10.89.97.201 10.89.97.202 10.89.97.203; do
  echo "=== $ip ==="
  ssh root@$ip 'resolvectl status | head -10'
done
# Should show:
# DNS Servers 10.89.97.1
# Fallback DNS Servers 1.1.1.1

# Test DNS resolution
ssh root@10.89.97.201 'resolvectl query auth.bogocat.com'
# Should return: 10.89.97.220

Why this works:
- OPNsense has DNS overrides for *.bogocat.com → internal IPs
- During an internet outage, local DNS still works
- Fallback to 1.1.1.1 only happens if OPNsense is down

Related services affected:
- All arr-stack apps (radarr, sonarr, prowlarr, etc.) with Authentik forward auth
- Any K8s app using external DNS resolution
- Authentik itself if it needs to resolve external URLs


K3s Node DNS Configuration Reference

Location: /etc/systemd/resolved.conf.d/local-dns.conf

Standard configuration for all K3s nodes:

[Resolve]
DNS=10.89.97.1
FallbackDNS=1.1.1.1
Domains=~.

Explanation:
- DNS=10.89.97.1 - primary DNS is OPNsense (has the local overrides)
- FallbackDNS=1.1.1.1 - fall back to Cloudflare if OPNsense is down
- Domains=~. - use this DNS for all domains (not just specific ones)

Node IPs:
- k3s-master: 10.89.97.201
- k3s-worker-1: 10.89.97.202
- k3s-worker-2: 10.89.97.203


Docker Build Issues

BuildKit "spawn sh EACCES" Error

Symptoms:

npm error Error: spawn sh EACCES
npm error   code: 'EACCES',
npm error   syscall: 'spawn sh',

Cause: Corrupted Docker buildx builder after system upgrades (apt dist-upgrade).

Diagnosis:

docker buildx ls
# If you see "v0.0.0+unknown", the builder is corrupted

Solution:

# Create fresh builder
docker buildx create --name fresh-builder --use
docker buildx inspect --bootstrap

# Use in builds
docker buildx build --builder fresh-builder --load -t app:v1.0.0 .

Full details: See Docker Deployment Guide


Getting Help

If issues persist:

  1. Check events: kubectl get events -A --sort-by='.lastTimestamp'
  2. Check logs: kubectl logs pod-name and ssh root@NODE 'journalctl -u k3s'
  3. Search k3s GitHub issues: https://github.com/k3s-io/k3s/issues
  4. Kubernetes documentation: https://kubernetes.io/docs/
  5. NGINX Ingress docs: https://kubernetes.github.io/ingress-nginx/