Alert Investigation & Resolution Guide

Last Updated: 2025-11-12 | Status: Production

Overview

This guide walks you through investigating and resolving alerts using both CLI and UI dashboards (Prometheus, Grafana, AlertManager).


Accessing the Monitoring Dashboards

All monitoring services are exposed via LoadBalancer for direct access:

Grafana - Pre-built dashboards, visualizations:
- URL: http://10.89.97.211
- Credentials: admin / prom-operator
- What to use it for: Visual exploration, pre-built dashboards, log queries via Loki

Prometheus UI - Query metrics, view targets, check alert rules:
- URL: http://10.89.97.216:9090
- What to use it for: PromQL queries, alert rule testing, target health

AlertManager UI - Active alerts, silences:
- URL: http://10.89.97.217:9093
- What to use it for: View firing alerts, create/manage silences

Loki API - Log aggregation (best accessed via Grafana):
- URL: http://10.89.97.218:3100
- What to use it for: Direct API access (use Grafana → Explore for UI)
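
If the LoadBalancer IPs ever drift, you can list the currently assigned addresses (this assumes all of the services above live in the monitoring namespace, as in the port-forward commands below):

kubectl get svc -n monitoring | grep LoadBalancer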

Alternative: Port-Forward (If LoadBalancer IPs Change)

If you need to access the services via localhost, or if the LoadBalancer IPs change:

# Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093

# Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

Alert #1: KubeMemoryOvercommit

Alert Details:
- Severity: Warning
- Description: Cluster has overcommitted memory resource requests
- Impact: Cannot tolerate node failure without pod evictions

CLI Investigation

# 1. Check node memory capacity
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
MEMORY:.status.capacity.memory

# 2. Sum pod memory requests by namespace (normalized to Mi; handles Ki/Mi/Gi and plain bytes)
kubectl get pods -A -o json | \
  jq -r '.items[] | .metadata.namespace as $ns | .spec.containers[] |
    select(.resources.requests.memory != null) |
    "\($ns) \(.resources.requests.memory)"' | \
  awk '{v=$2;
        if (v ~ /Gi$/) m=v*1024; else if (v ~ /Mi$/) m=v+0; else if (v ~ /Ki$/) m=v/1024; else m=v/1048576;
        ns[$1]+=m} END {for (n in ns) printf "%s %.0fMi\n", n, ns[n]}' | \
  sort -k2 -rn

# 3. Find top memory-requesting containers (normalized to Mi so the sort is meaningful)
kubectl get pods -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)" as $id | .spec.containers[] |
    select(.resources.requests.memory != null) |
    "\($id) \(.resources.requests.memory)"' | \
  awk '{v=$2;
        if (v ~ /Gi$/) m=v*1024; else if (v ~ /Mi$/) m=v+0; else if (v ~ /Ki$/) m=v/1024; else m=v/1048576;
        printf "%s %.0fMi\n", $1, m}' | \
  sort -k2 -rn | head -20

# 4. Calculate total requests vs capacity
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

Prometheus UI Investigation

  1. Access Prometheus: http://10.89.97.216:9090

  2. Check Total Memory Requests:

    sum(kube_pod_container_resource_requests{resource="memory"})
    

  3. Check Node Capacity:

    sum(kube_node_status_capacity{resource="memory"})
    

  4. Calculate Overcommit Ratio:

    sum(kube_pod_container_resource_requests{resource="memory"}) /
    sum(kube_node_status_capacity{resource="memory"})
    

  5. Find Top Memory Requesters:

    topk(10, sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}))
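
The same checks can be run from a terminal against the Prometheus HTTP API; a minimal sketch using the LoadBalancer address above (any of the queries in this list can be substituted):

    # Overcommit ratio straight from the API; values above 1 mean requests exceed capacity
    curl -s http://10.89.97.216:9090/api/v1/query \
      --data-urlencode 'query=sum(kube_pod_container_resource_requests{resource="memory"}) / sum(kube_node_status_capacity{resource="memory"})' \
      | jq -r '.data.result[0].value[1]'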
    

Grafana Dashboard Investigation

  1. Access Grafana: http://10.89.97.211 (admin/prom-operator)

  2. Pre-built Dashboards:
     - Kubernetes / Compute Resources / Cluster - Shows total capacity vs requests
     - Kubernetes / Compute Resources / Namespace (Pods) - Memory by namespace
     - Node Exporter / Nodes - Per-node memory usage

  3. What to Look For:
     - Red bars on resource usage graphs (>100% = overcommitted)
     - Namespaces consuming the most memory
     - Pods with high memory requests but low actual usage

Resolution Options

Option 1: Reduce Pod Memory Requests (Recommended)

# Find pods with excessive requests vs actual usage
kubectl top pods -A --sort-by=memory

# Edit deployment to lower memory request
kubectl edit deployment <name> -n <namespace>

# Look for:
resources:
  requests:
    memory: "2Gi"  # ← Reduce this if actual usage is much lower

Option 2: Increase Repeat Interval in AlertManager

# /root/k8s/alertmanager-config.yaml
route:
  repeat_interval: 24h  # Only notify once per day (was 12h)

Option 3: Accept the Risk
- Single-node K3s clusters often overcommit
- Risk: If the node fails, there is no backup node to reschedule pods
- Mitigation: Regular backups, quick node recovery procedures


Alert #2: KubePodNotReady / KubeStatefulSetReplicasMismatch

Alert Details:
- Pod: loki-chunks-cache-0 in monitoring namespace
- Severity: Warning
- Description: Pod stuck in Pending state, StatefulSet has 0/1 replicas

CLI Investigation

# 1. Check pod status
kubectl get pod loki-chunks-cache-0 -n monitoring

# 2. Describe pod for events
kubectl describe pod loki-chunks-cache-0 -n monitoring

# 3. Check for PVC issues (common cause)
kubectl get pvc -n monitoring | grep loki-chunks-cache

# 4. Check PVC details if exists
kubectl describe pvc <pvc-name> -n monitoring

# 5. Check StatefulSet
kubectl get statefulset loki-chunks-cache -n monitoring
kubectl describe statefulset loki-chunks-cache -n monitoring

# 6. Check storage class availability
kubectl get storageclass
kubectl get pv | grep loki

# 7. View pod logs (if running)
kubectl logs loki-chunks-cache-0 -n monitoring
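
If the describe output is noisy, the scheduling events alone usually tell the story (the field selector matches events whose involved object is this pod):

kubectl get events -n monitoring \
  --field-selector involvedObject.name=loki-chunks-cache-0 \
  --sort-by='.lastTimestamp'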

Prometheus UI Investigation

  1. Check Pod Phase:

    kube_pod_status_phase{namespace="monitoring",pod="loki-chunks-cache-0"}

     The metric reports one series per phase label (Pending, Running, Succeeded, Failed, Unknown); the series with value 1 is the pod's current phase.

  2. Check Container Ready Status:

    kube_pod_container_status_ready{namespace="monitoring",pod="loki-chunks-cache-0"}

  3. Check StatefulSet Replicas:

    kube_statefulset_status_replicas{namespace="monitoring",statefulset="loki-chunks-cache"}
    kube_statefulset_status_replicas_ready{namespace="monitoring",statefulset="loki-chunks-cache"}

Grafana Dashboard Investigation

  1. Dashboard: Kubernetes / Compute Resources / Namespace (Pods)
     - Filter namespace: monitoring
     - Look for loki-chunks-cache-0
     - Check CPU/Memory usage (will be 0 if Pending)

  2. Dashboard: Kubernetes / Persistent Volumes
     - Check if the PVC is bound
     - Look for storage capacity issues

Resolution Options

Background: loki-chunks-cache is optional
- This component caches Loki index queries for improved query performance
- The default configuration requests 9.6GB memory, far too large for homelab clusters
- Loki SingleBinary mode works fine without it
- Performance impact is minimal for homelab workloads (<10GB/day log ingestion)
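
To see what the cache actually requests on your cluster before deciding, this reads the requests straight from the StatefulSet spec (container layout can differ between chart versions, so treat it as a quick sanity check):

kubectl get statefulset loki-chunks-cache -n monitoring \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests.memory}'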

Root Cause Analysis:

# Check pod scheduling failure
kubectl describe pod loki-chunks-cache-0 -n monitoring

# Common error: "0/3 nodes are available: 3 Insufficient memory"
# The pod requests 9830Mi (9.6GB) but nodes only have ~7.75GB each

Option 1: Disable Chunks Cache via Helm (Recommended)

# Permanently disable the cache by updating Helm values
helm upgrade loki grafana/loki \
  -n monitoring \
  --reuse-values \
  --set chunksCache.enabled=false

# Verify removal
kubectl get statefulset loki-chunks-cache -n monitoring
# Should return: Error from server (NotFound)

# Check Loki is healthy
kubectl get pod loki-0 -n monitoring
# Should be: 2/2 Running

Why this is better than scaling to 0:
- Configuration persists across Helm upgrades
- StatefulSet is fully removed (cleaner)
- No confusion about why replicas=0
- Documented in Helm release history
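
To confirm the override is actually recorded in the release values (assuming the release is named loki, as in the commands above):

helm get values loki -n monitoring | grep -A1 chunksCache
# Expected to show: enabled: false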

Option 2: Reduce Memory to Reasonable Levels

# Keep the cache but reduce memory to 1-2GB
helm upgrade loki grafana/loki \
  -n monitoring \
  --reuse-values \
  --set chunksCache.resources.requests.memory=1Gi \
  --set chunksCache.resources.limits.memory=1229Mi

# This matches the results-cache configuration which runs successfully

Option 3: Increase Node Memory (Not Recommended for Homelab)

# Add 4-8GB RAM to each k3s VM
# Requires VM shutdown and reconfiguration
# Overkill for typical homelab log volumes

# Only pursue this if you have other workloads requiring more memory

Option 4: Fix PVC Issues (Only if PVC exists and is Pending)

# Check if PVC is the issue
kubectl get pvc -n monitoring | grep loki-chunks-cache

# If PVC exists and is Pending:
kubectl describe pvc <pvc-name> -n monitoring

# Check storage availability
kubectl get pv
kubectl get sc

# If using Longhorn, check UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Verified Resolution (2025-11-16):
- Disabled chunks cache via Helm on the production cluster
- All three alerts (KubePodNotReady, KubeStatefulSetReplicasMismatch, KubeMemoryOvercommit) resolved
- Loki continues operating normally with results-cache only
- Memory pressure reduced from 1.8GB overcommit to comfortable levels (19-41% per node)


Alert #3: AlertmanagerFailedToSendAlerts

Alert Details:
- Severity: Warning
- Description: AlertManager failing to send 12-14% of notifications to Discord
- Cause: Discord rate limiting (429 errors)

CLI Investigation

# 1. Check AlertManager logs for errors
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager | grep -i "error\|429\|rate"

# 2. Check recent Discord webhook attempts
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager --tail=100 | grep discord

# 3. Count active firing alerts
kubectl get prometheusrules -n monitoring -o json | \
  jq -r '.items[].spec.groups[].rules[] | select(.alert != null) | .alert' | wc -l

# 4. Check AlertManager config
kubectl get secret alertmanager-discord-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
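
A rough sense of notification volume helps here; this counts the alerts AlertManager currently holds via its standard v2 API (large counts are what trip Discord's rate limits):

curl -s http://10.89.97.217:9093/api/v2/alerts | jq length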

Prometheus UI Investigation

  1. Check Notification Queue Size:

    alertmanager_notifications_total
    alertmanager_notifications_failed_total
    

  2. Calculate Failure Rate:

    rate(alertmanager_notifications_failed_total[5m]) /
    rate(alertmanager_notifications_total[5m])
    

  3. Check Alerts by Receiver:

    alertmanager_notifications_total{integration="discord"}
    

AlertManager UI Investigation

  1. Access AlertManager: http://10.89.97.217:9093

  2. View Active Alerts:
     - Shows all currently firing alerts
     - Group by: alertname, namespace, severity

  3. Check Silences:
     - Shows active silences
     - Create new silences to temporarily stop notifications

Resolution Options

Option 1: Increase AlertManager Group Interval (Recommended)

# /root/k8s/alertmanager-config.yaml
route:
  group_interval: 15m  # Increase from 10s to 15 minutes
  repeat_interval: 24h  # Also increase repeat interval

# Apply updated config
kubectl create secret generic alertmanager-discord-config \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart AlertManager pods to reload config
kubectl delete pod -n monitoring -l app.kubernetes.io/name=alertmanager
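
If amtool is installed on your workstation (it ships alongside Alertmanager releases), it is worth validating the file before applying it; a sketch assuming the config path used above:

amtool check-config /root/k8s/alertmanager-config.yaml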

Option 2: Disable Low-Priority Alerts

# Find PrometheusRules generating noise
kubectl get prometheusrules -n monitoring

# Delete or edit rules that are not critical
kubectl edit prometheusrule <name> -n monitoring

Option 3: Use Multiple Discord Webhooks

# /root/k8s/alertmanager-config.yaml
# Split severities across two webhooks so rate limits are hit independently
route:
  routes:
  - match:
      severity: critical
    receiver: 'discord-critical'
  - match:
      severity: warning
    receiver: 'discord-warning'

receivers:
- name: 'discord-critical'
  discord_configs:
  - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK1'
- name: 'discord-warning'
  discord_configs:
  - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK2'

Option 4: Temporarily Silence Noisy Alerts

# Via the AlertManager UI (http://10.89.97.217:9093, or http://localhost:9093 via port-forward):
# 1. Find the alert
# 2. Click "Silence"
# 3. Set duration (e.g., 24h)
# 4. Add reason: "Investigating - reducing noise"


Alert #4: KubePersistentVolumeFillingUp (Docker Registry)

Alert Details:
- Severity: Critical (when <3% free) or Warning (when <15% free)
- Namespace: docker-registry
- Description: PersistentVolume running out of space
- Impact: Registry will become read-only or fail, breaking CI/CD pipelines

CLI Investigation

# 1. Check current disk usage
kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry

# 2. Check PVC size and status
kubectl get pvc -n docker-registry
kubectl describe pvc docker-registry-pvc -n docker-registry

# 3. List all images in registry
curl -s http://10.89.97.201:30500/v2/_catalog | jq

# 4. List tags for each image
for repo in $(curl -s http://10.89.97.201:30500/v2/_catalog | jq -r '.repositories[]'); do
  echo "=== $repo ==="
  curl -s http://10.89.97.201:30500/v2/$repo/tags/list | jq -r '.tags[]' | sort -V
done

# 5. Check blob storage size (where actual data lives)
kubectl exec -n docker-registry deploy/docker-registry -- du -sh /var/lib/registry/docker/registry/v2/blobs/

Prometheus UI Investigation

  1. Check PV Usage Over Time:

    kubelet_volume_stats_used_bytes{namespace="docker-registry"} /
    kubelet_volume_stats_capacity_bytes{namespace="docker-registry"}
    

  2. Check Available Space:

    kubelet_volume_stats_available_bytes{namespace="docker-registry"}
    

  3. Predict When Full (linear projection 30 days out; a result at or below zero means the volume is projected to fill within that window):

    predict_linear(kubelet_volume_stats_available_bytes{namespace="docker-registry"}[7d], 30*24*60*60)
    

Resolution Options

Option 1: Expand PVC (Immediate Relief)

# Longhorn supports online volume expansion
kubectl patch pvc docker-registry-pvc -n docker-registry \
  -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'

# Verify expansion
kubectl get pvc -n docker-registry
# Wait for capacity to update (may take a minute)
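
Expansion only succeeds if the underlying StorageClass allows it. A quick check, assuming the class is named longhorn (substitute whatever class your PVC actually uses):

kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'
# Expected output: true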

Option 2: Clean Up Old Image Tags (Recommended)

# Use the cleanup script (dry run first)
/root/tower-fleet/scripts/registry-cleanup.sh

# Actually delete old tags (keeps latest 5 versions per repo)
/root/tower-fleet/scripts/registry-cleanup.sh --execute

# Delete and run garbage collection
/root/tower-fleet/scripts/registry-cleanup.sh --execute --gc

# Keep only 3 versions instead of 5
/root/tower-fleet/scripts/registry-cleanup.sh --execute --keep 3

Option 3: Manual Tag Deletion

# Get digest for a specific tag
REPO="home-portal"
TAG="v1.0.1"
DIGEST=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  http://10.89.97.201:30500/v2/${REPO}/manifests/${TAG} | \
  grep -i Docker-Content-Digest | awk '{print $2}' | tr -d '\r')

# Delete the tag
curl -X DELETE "http://10.89.97.201:30500/v2/${REPO}/manifests/${DIGEST}"
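# Note: DELETE only works if the registry was started with deletion enabled
# (env REGISTRY_STORAGE_DELETE_ENABLED=true); otherwise the API answers with HTTP 405.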

Option 4: Run Garbage Collection (After Deleting Tags)

Deleting tags only removes manifest references. Actual blob data requires garbage collection:

# Scale down registry (GC requires exclusive access)
kubectl scale deployment docker-registry -n docker-registry --replicas=0

# Wait for pod to terminate
kubectl wait --for=delete pod -l app=docker-registry -n docker-registry --timeout=60s

# Run garbage collection (via temporary pod or exec)
# Note: Must mount the same PVC
kubectl run registry-gc --rm -it --restart=Never \
  -n docker-registry \
  --image=registry:2 \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "registry-gc",
        "image": "registry:2",
        "command": ["registry", "garbage-collect", "/etc/docker/registry/config.yml"],
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/registry"}]
      }],
      "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "docker-registry-pvc"}}]
    }
  }'

# Scale back up
kubectl scale deployment docker-registry -n docker-registry --replicas=1

Registry Management Commands

Task                 | Command
---------------------|---------------------------------------------------------------
List all images      | curl http://10.89.97.201:30500/v2/_catalog
List tags for image  | curl http://10.89.97.201:30500/v2/<name>/tags/list
Check disk usage     | kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry
Resize PVC           | kubectl patch pvc docker-registry-pvc -n docker-registry -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
Cleanup script       | /root/tower-fleet/scripts/registry-cleanup.sh --help

Ongoing Management

Recommended retention policy: Keep 5 most recent versions per image

Automation options:
1. Manual cleanup: Run /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc monthly
2. CI/CD integration: Add a cleanup step after successful deployments
3. Cron job: Schedule weekly cleanup (see below)

Example cron job (weekly Sunday 3am):

# Add to root crontab on a k8s node or management host
0 3 * * 0 /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc --keep 5 >> /var/log/registry-cleanup.log 2>&1


Alert #5: InfoInhibitor

Alert Details:
- Severity: none
- Description: Meta-alert used to inhibit info-level alerts
- Impact: None - this is intentional behavior

What It Means

InfoInhibitor is a meta-alert from kube-prometheus-stack designed to reduce noise:
1. Fires when severity="info" alerts exist
2. Used to inhibit info alerts from notifying unless there is also a warning/critical alert

Why You See It

The InfoInhibitor alert should be:
1. Routed to a null receiver (no notifications)
2. Used in inhibit_rules to suppress info alerts

If you're receiving Discord notifications for this alert, your AlertManager config needs updating.

Resolution

Check current AlertManager config:

kubectl get secret alertmanager-kube-prometheus-stack-alertmanager -n monitoring \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

Ensure these are configured:

route:
  routes:
    # Route InfoInhibitor to null (no notifications)
    - match:
        alertname: InfoInhibitor
      receiver: 'null'

receivers:
  # Null receiver - drops alerts silently
  - name: 'null'

inhibit_rules:
  # Inhibit info alerts when InfoInhibitor fires (unless warning/critical also firing)
  - source_match:
      alertname: 'InfoInhibitor'
    target_match:
      severity: 'info'
    equal: ['namespace']

Apply updated config:

# Edit your alertmanager config file
vim /root/tower-fleet/k8s/alertmanager-config.yaml

# Apply as secret
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/tower-fleet/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart AlertManager to pick up changes
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
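
To confirm the restarted AlertManager actually loaded the new routing, you can read back the running configuration from the standard v2 status API (using the LoadBalancer address listed earlier):

curl -s http://10.89.97.217:9093/api/v2/status | jq -r '.config.original' | grep -B1 -A2 InfoInhibitor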


General Investigation Workflow

When an Alert Fires:

  1. Check AlertManager UI - What's the current state?
     - Is it still firing?
     - What are the labels?
     - What's the severity?

  2. Check Prometheus UI - Query the alert expression
     - Go to the Alerts page
     - Find the alert rule
     - Click "expr" to see the PromQL query
     - Test the query in Graph view
     - Look at historical data

  3. Check Grafana Dashboards - Visual context
     - Find the relevant pre-built dashboard
     - Look for anomalies in graphs
     - Check related metrics (CPU, memory, network)

  4. Check Logs with Loki - What happened?

    # Via Grafana → Explore → Loki
    # Query: {namespace="<namespace>", pod="<pod>"}

    # Or via CLI:
    kubectl logs <pod> -n <namespace> --tail=100

  5. Check Kubernetes State - Current resource status

    kubectl get pods -n <namespace>
    kubectl describe pod <pod> -n <namespace>
    kubectl get events -n <namespace> --sort-by='.lastTimestamp'
    

Creating Effective Silences

When to Silence:
- During planned maintenance
- Known issues being actively worked on
- False positives until rules can be fixed
- Alert fatigue from noisy but non-critical alerts

How to Create Silence via CLI:

# Port-forward to AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &

# Create silence
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "AlertName", "isRegex": false},
      {"name": "namespace", "value": "my-namespace", "isRegex": false}
    ],
    "startsAt": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
    "endsAt": "'$(date -u -d "+24 hours" +"%Y-%m-%dT%H:%M:%SZ")'",
    "createdBy": "Your Name",
    "comment": "Reason for silence"
  }'
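# The response contains the new silence ID (e.g. {"silenceID":"..."}); keep it if you
# may need to expire the silence early.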

How to Delete Silence:

# List silences
curl http://localhost:9093/api/v2/silences | jq

# Delete by ID
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>


Alert Priority & Response Time

Severity  | Response Time  | Action
----------|----------------|-----------------------------------
Critical  | Immediate      | Drop everything, investigate now
Warning   | Within 1 hour  | Investigate during work hours
Info      | Best effort    | Review during regular maintenance

Critical Alerts Should Be:
- Service completely down
- Data loss imminent
- Security breach

Warning Alerts Should Be:
- Performance degraded
- Resource usage high (but not critical)
- Non-critical component failing

Info Alerts Should Be:
- Informational only
- Scheduled maintenance completed
- Configuration changes


Best Practices

  1. Review alerts weekly - Adjust thresholds based on actual behavior
  2. Silence during maintenance - Prevent alert fatigue
  3. Document resolutions - Add to runbooks for next time
  4. Test alert rules - Verify they fire when expected
  5. Keep Discord clean - Only critical alerts should page you
  6. Use Grafana for exploration - Better than CLI for visual analysis
  7. Check Loki logs first - Often reveals root cause immediately
  8. Correlate metrics - Single metric might not tell full story

Quick Reference: Direct Access URLs

# Grafana (dashboards & log exploration)
http://10.89.97.211
Credentials: admin / prom-operator

# Prometheus (metrics & alerts)
http://10.89.97.216:9090

# AlertManager (alert routing & silences)
http://10.89.97.217:9093

# Loki API (direct access, or use Grafana → Explore)
http://10.89.97.218:3100

# Longhorn (storage management)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# → http://localhost:8080

Bookmark these URLs for quick access during incident investigation!



For additional help:
- Prometheus docs: https://prometheus.io/docs/
- AlertManager docs: https://prometheus.io/docs/alerting/latest/alertmanager/
- Grafana docs: https://grafana.com/docs/