Alert Investigation & Resolution Guide

Last Updated: 2025-11-12 | Status: Production

Overview

This guide walks you through investigating and resolving alerts using both CLI and UI dashboards (Prometheus, Grafana, AlertManager).


Accessing the Monitoring Dashboards

All monitoring services are exposed via LoadBalancer for direct access:

Grafana - Pre-built dashboards, visualizations:
- URL: http://10.89.97.211
- Credentials: admin / prom-operator
- What to use it for: Visual exploration, pre-built dashboards, log queries via Loki

Prometheus UI - Query metrics, view targets, check alert rules:
- URL: http://10.89.97.216:9090
- What to use it for: PromQL queries, alert rule testing, target health

AlertManager UI - Active alerts, silences:
- URL: http://10.89.97.217:9093
- What to use it for: View firing alerts, create/manage silences

Loki API - Log aggregation (best accessed via Grafana):
- URL: http://10.89.97.218:3100
- What to use it for: Direct API access (use Grafana → Explore for UI)
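
If the LoadBalancer IPs ever drift, you can list the currently assigned addresses (this assumes all of the services above live in the monitoring namespace, as in the port-forward commands below):

kubectl get svc -n monitoring | grep LoadBalancer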

Alternative: Port-Forward (If LoadBalancer IPs Change)

If you need to access the services via localhost, or if the LoadBalancer IPs change:

# Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093

# Loki
kubectl port-forward -n monitoring svc/loki 3100:3100

Alert #1: KubeMemoryOvercommit

Alert Details:
- Severity: Warning
- Description: Cluster has overcommitted memory resource requests
- Impact: Cannot tolerate node failure without pod evictions

CLI Investigation

# 1. Check node memory capacity
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
MEMORY:.status.capacity.memory

# 2. Sum pod memory requests by namespace (normalized to Mi; handles Ki/Mi/Gi and plain bytes)
kubectl get pods -A -o json | \
  jq -r '.items[] | .metadata.namespace as $ns | .spec.containers[] |
    select(.resources.requests.memory != null) |
    "\($ns) \(.resources.requests.memory)"' | \
  awk '{v=$2;
        if (v ~ /Gi$/) m=v*1024; else if (v ~ /Mi$/) m=v+0; else if (v ~ /Ki$/) m=v/1024; else m=v/1048576;
        ns[$1]+=m} END {for (n in ns) printf "%s %.0fMi\n", n, ns[n]}' | \
  sort -k2 -rn

# 3. Find top memory-requesting containers (normalized to Mi so the sort is meaningful)
kubectl get pods -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)" as $id | .spec.containers[] |
    select(.resources.requests.memory != null) |
    "\($id) \(.resources.requests.memory)"' | \
  awk '{v=$2;
        if (v ~ /Gi$/) m=v*1024; else if (v ~ /Mi$/) m=v+0; else if (v ~ /Ki$/) m=v/1024; else m=v/1048576;
        printf "%s %.0fMi\n", $1, m}' | \
  sort -k2 -rn | head -20

# 4. Calculate total requests vs capacity
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

Prometheus UI Investigation

  1. Access Prometheus: http://10.89.97.216:9090

  2. Check Total Memory Requests:

    sum(kube_pod_container_resource_requests{resource="memory"})
    

  3. Check Node Capacity:

    sum(kube_node_status_capacity{resource="memory"})
    

  4. Calculate Overcommit Ratio:

    sum(kube_pod_container_resource_requests{resource="memory"}) /
    sum(kube_node_status_capacity{resource="memory"})
    

  5. Find Top Memory Requesters:

    topk(10, sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}))
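
The same checks can be run from a terminal against the Prometheus HTTP API; a minimal sketch using the LoadBalancer address above (any of the queries in this list can be substituted):

    # Overcommit ratio straight from the API; values above 1 mean requests exceed capacity
    curl -s http://10.89.97.216:9090/api/v1/query \
      --data-urlencode 'query=sum(kube_pod_container_resource_requests{resource="memory"}) / sum(kube_node_status_capacity{resource="memory"})' \
      | jq -r '.data.result[0].value[1]'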
    

Grafana Dashboard Investigation

  1. Access Grafana: http://10.89.97.211 (admin/prom-operator)

  2. Pre-built Dashboards:
     - Kubernetes / Compute Resources / Cluster - Shows total capacity vs requests
     - Kubernetes / Compute Resources / Namespace (Pods) - Memory by namespace
     - Node Exporter / Nodes - Per-node memory usage

  3. What to Look For:
     - Red bars on resource usage graphs (>100% = overcommitted)
     - Namespaces consuming the most memory
     - Pods with high memory requests but low actual usage

Resolution Options

Option 1: Reduce Pod Memory Requests (Recommended)

# Find pods with excessive requests vs actual usage
kubectl top pods -A --sort-by=memory

# Edit deployment to lower memory request
kubectl edit deployment <name> -n <namespace>

# Look for:
resources:
  requests:
    memory: "2Gi"  # ← Reduce this if actual usage is much lower

Option 2: Increase Repeat Interval in AlertManager

# /root/k8s/alertmanager-config.yaml
route:
  repeat_interval: 24h  # Only notify once per day (was 12h)

Option 3: Accept the Risk
- Single-node K3s clusters often overcommit
- Risk: If the node fails, there is no backup node to reschedule pods
- Mitigation: Regular backups, quick node recovery procedures


Alert #2: KubePodNotReady / KubeStatefulSetReplicasMismatch

Alert Details:
- Pod: loki-chunks-cache-0 in monitoring namespace
- Severity: Warning
- Description: Pod stuck in Pending state, StatefulSet has 0/1 replicas

CLI Investigation

# 1. Check pod status
kubectl get pod loki-chunks-cache-0 -n monitoring

# 2. Describe pod for events
kubectl describe pod loki-chunks-cache-0 -n monitoring

# 3. Check for PVC issues (common cause)
kubectl get pvc -n monitoring | grep loki-chunks-cache

# 4. Check PVC details if exists
kubectl describe pvc <pvc-name> -n monitoring

# 5. Check StatefulSet
kubectl get statefulset loki-chunks-cache -n monitoring
kubectl describe statefulset loki-chunks-cache -n monitoring

# 6. Check storage class availability
kubectl get storageclass
kubectl get pv | grep loki

# 7. View pod logs (if running)
kubectl logs loki-chunks-cache-0 -n monitoring
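
If the describe output is noisy, the scheduling events alone usually tell the story (the field selector matches events whose involved object is this pod):

kubectl get events -n monitoring \
  --field-selector involvedObject.name=loki-chunks-cache-0 \
  --sort-by='.lastTimestamp'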

Prometheus UI Investigation

  1. Check Pod Phase:

    kube_pod_status_phase{namespace="monitoring",pod="loki-chunks-cache-0"}

     The metric reports one series per phase label (Pending, Running, Succeeded, Failed, Unknown); the series with value 1 is the pod's current phase.

  2. Check Container Ready Status:

    kube_pod_container_status_ready{namespace="monitoring",pod="loki-chunks-cache-0"}

  3. Check StatefulSet Replicas:

    kube_statefulset_status_replicas{namespace="monitoring",statefulset="loki-chunks-cache"}
    kube_statefulset_status_replicas_ready{namespace="monitoring",statefulset="loki-chunks-cache"}

Grafana Dashboard Investigation

  1. Dashboard: Kubernetes / Compute Resources / Namespace (Pods)
     - Filter namespace: monitoring
     - Look for loki-chunks-cache-0
     - Check CPU/Memory usage (will be 0 if Pending)

  2. Dashboard: Kubernetes / Persistent Volumes
     - Check if the PVC is bound
     - Look for storage capacity issues

Resolution Options

Background: loki-chunks-cache is optional
- This component caches Loki index queries for improved query performance
- The default configuration requests 9.6GB memory, far too large for homelab clusters
- Loki SingleBinary mode works fine without it
- Performance impact is minimal for homelab workloads (<10GB/day log ingestion)
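
To see what the cache actually requests on your cluster before deciding, this reads the requests straight from the StatefulSet spec (container layout can differ between chart versions, so treat it as a quick sanity check):

kubectl get statefulset loki-chunks-cache -n monitoring \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests.memory}'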

Root Cause Analysis:

# Check pod scheduling failure
kubectl describe pod loki-chunks-cache-0 -n monitoring

# Common error: "0/3 nodes are available: 3 Insufficient memory"
# The pod requests 9830Mi (9.6GB) but nodes only have ~7.75GB each

Option 1: Disable Chunks Cache via Helm (Recommended)

# Permanently disable the cache by updating Helm values
helm upgrade loki grafana/loki \
  -n monitoring \
  --reuse-values \
  --set chunksCache.enabled=false

# Verify removal
kubectl get statefulset loki-chunks-cache -n monitoring
# Should return: Error from server (NotFound)

# Check Loki is healthy
kubectl get pod loki-0 -n monitoring
# Should be: 2/2 Running

Why this is better than scaling to 0:
- Configuration persists across Helm upgrades
- StatefulSet is fully removed (cleaner)
- No confusion about why replicas=0
- Documented in Helm release history
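
To confirm the override is actually recorded in the release values (assuming the release is named loki, as in the commands above):

helm get values loki -n monitoring | grep -A1 chunksCache
# Expected to show: enabled: false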

Option 2: Reduce Memory to Reasonable Levels

# Keep the cache but reduce memory to 1-2GB
helm upgrade loki grafana/loki \
  -n monitoring \
  --reuse-values \
  --set chunksCache.resources.requests.memory=1Gi \
  --set chunksCache.resources.limits.memory=1229Mi

# This matches the results-cache configuration which runs successfully

Option 3: Increase Node Memory (Not Recommended for Homelab)

# Add 4-8GB RAM to each k3s VM
# Requires VM shutdown and reconfiguration
# Overkill for typical homelab log volumes

# Only pursue this if you have other workloads requiring more memory

Option 4: Fix PVC Issues (Only if PVC exists and is Pending)

# Check if PVC is the issue
kubectl get pvc -n monitoring | grep loki-chunks-cache

# If PVC exists and is Pending:
kubectl describe pvc <pvc-name> -n monitoring

# Check storage availability
kubectl get pv
kubectl get sc

# If using Longhorn, check UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

Verified Resolution (2025-11-16):
- Disabled chunks cache via Helm on the production cluster
- All three alerts (KubePodNotReady, KubeStatefulSetReplicasMismatch, KubeMemoryOvercommit) resolved
- Loki continues operating normally with results-cache only
- Memory pressure reduced from 1.8GB overcommit to comfortable levels (19-41% per node)


Alert #3: AlertmanagerFailedToSendAlerts

Alert Details:
- Severity: Warning
- Description: AlertManager failing to send 12-14% of notifications to Discord
- Cause: Discord rate limiting (429 errors)

CLI Investigation

# 1. Check AlertManager logs for errors
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager | grep -i "error\|429\|rate"

# 2. Check recent Discord webhook attempts
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager --tail=100 | grep discord

# 3. Count active firing alerts
kubectl get prometheusrules -n monitoring -o json | \
  jq -r '.items[].spec.groups[].rules[] | select(.alert != null) | .alert' | wc -l

# 4. Check AlertManager config
kubectl get secret alertmanager-discord-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
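
A rough sense of notification volume helps here; this counts the alerts AlertManager currently holds via its standard v2 API (large counts are what trip Discord's rate limits):

curl -s http://10.89.97.217:9093/api/v2/alerts | jq length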

Prometheus UI Investigation

  1. Check Notification Queue Size:

    alertmanager_notifications_total
    alertmanager_notifications_failed_total
    

  2. Calculate Failure Rate:

    rate(alertmanager_notifications_failed_total[5m]) /
    rate(alertmanager_notifications_total[5m])
    

  3. Check Alerts by Receiver:

    alertmanager_notifications_total{integration="discord"}
    

AlertManager UI Investigation

  1. Access AlertManager: http://10.89.97.217:9093

  2. View Active Alerts:
     - Shows all currently firing alerts
     - Group by: alertname, namespace, severity

  3. Check Silences:
     - Shows active silences
     - Create new silences to temporarily stop notifications

Resolution Options

Option 1: Increase AlertManager Group Interval (Recommended)

# /root/k8s/alertmanager-config.yaml
route:
  group_interval: 15m  # Increase from 10s to 15 minutes
  repeat_interval: 24h  # Also increase repeat interval

# Apply updated config
kubectl create secret generic alertmanager-discord-config \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart AlertManager pods to reload config
kubectl delete pod -n monitoring -l app.kubernetes.io/name=alertmanager
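
If amtool is installed on your workstation (it ships alongside Alertmanager releases), it is worth validating the file before applying it; a sketch assuming the config path used above:

amtool check-config /root/k8s/alertmanager-config.yaml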

Option 2: Disable Low-Priority Alerts

# Find PrometheusRules generating noise
kubectl get prometheusrules -n monitoring

# Delete or edit rules that are not critical
kubectl edit prometheusrule <name> -n monitoring

Option 3: Use Multiple Discord Webhooks

# /root/k8s/alertmanager-config.yaml
# Split severities across two webhooks so rate limits are hit independently
route:
  routes:
  - match:
      severity: critical
    receiver: 'discord-critical'
  - match:
      severity: warning
    receiver: 'discord-warning'

receivers:
- name: 'discord-critical'
  discord_configs:
  - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK1'
- name: 'discord-warning'
  discord_configs:
  - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK2'

Option 4: Temporarily Silence Noisy Alerts

# Via the AlertManager UI (http://10.89.97.217:9093, or http://localhost:9093 via port-forward):
# 1. Find the alert
# 2. Click "Silence"
# 3. Set duration (e.g., 24h)
# 4. Add reason: "Investigating - reducing noise"


Alert #4: KubePersistentVolumeFillingUp (Docker Registry)

Alert Details:
- Severity: Critical (when <3% free) or Warning (when <15% free)
- Namespace: docker-registry
- Description: PersistentVolume running out of space
- Impact: Registry will become read-only or fail, breaking CI/CD pipelines

CLI Investigation

# 1. Check current disk usage
kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry

# 2. Check PVC size and status
kubectl get pvc -n docker-registry
kubectl describe pvc docker-registry-pvc -n docker-registry

# 3. List all images in registry
curl -s http://10.89.97.201:30500/v2/_catalog | jq

# 4. List tags for each image
for repo in $(curl -s http://10.89.97.201:30500/v2/_catalog | jq -r '.repositories[]'); do
  echo "=== $repo ==="
  curl -s http://10.89.97.201:30500/v2/$repo/tags/list | jq -r '.tags[]' | sort -V
done

# 5. Check blob storage size (where actual data lives)
kubectl exec -n docker-registry deploy/docker-registry -- du -sh /var/lib/registry/docker/registry/v2/blobs/

Prometheus UI Investigation

  1. Check PV Usage Over Time:

    kubelet_volume_stats_used_bytes{namespace="docker-registry"} /
    kubelet_volume_stats_capacity_bytes{namespace="docker-registry"}
    

  2. Check Available Space:

    kubelet_volume_stats_available_bytes{namespace="docker-registry"}
    

  3. Predict When Full (linear projection 30 days out; a result at or below zero means the volume is projected to fill within that window):

    predict_linear(kubelet_volume_stats_available_bytes{namespace="docker-registry"}[7d], 30*24*60*60)
    

Resolution Options

Option 1: Expand PVC (Immediate Relief)

# Longhorn supports online volume expansion
kubectl patch pvc docker-registry-pvc -n docker-registry \
  -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'

# Verify expansion
kubectl get pvc -n docker-registry
# Wait for capacity to update (may take a minute)
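
Expansion only succeeds if the underlying StorageClass allows it. A quick check, assuming the class is named longhorn (substitute whatever class your PVC actually uses):

kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'
# Expected output: true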

Option 2: Clean Up Old Image Tags (Recommended)

# Use the cleanup script (dry run first)
/root/tower-fleet/scripts/registry-cleanup.sh

# Actually delete old tags (keeps latest 5 versions per repo)
/root/tower-fleet/scripts/registry-cleanup.sh --execute

# Delete and run garbage collection
/root/tower-fleet/scripts/registry-cleanup.sh --execute --gc

# Keep only 3 versions instead of 5
/root/tower-fleet/scripts/registry-cleanup.sh --execute --keep 3

Option 3: Manual Tag Deletion

# Get digest for a specific tag
REPO="home-portal"
TAG="v1.0.1"
DIGEST=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  http://10.89.97.201:30500/v2/${REPO}/manifests/${TAG} | \
  grep -i Docker-Content-Digest | awk '{print $2}' | tr -d '\r')

# Delete the tag
curl -X DELETE "http://10.89.97.201:30500/v2/${REPO}/manifests/${DIGEST}"
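# Note: DELETE only works if the registry was started with deletion enabled
# (env REGISTRY_STORAGE_DELETE_ENABLED=true); otherwise the API answers with HTTP 405.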

Option 4: Run Garbage Collection (After Deleting Tags)

Deleting tags only removes manifest references. Actual blob data requires garbage collection:

# Scale down registry (GC requires exclusive access)
kubectl scale deployment docker-registry -n docker-registry --replicas=0

# Wait for pod to terminate
kubectl wait --for=delete pod -l app=docker-registry -n docker-registry --timeout=60s

# Run garbage collection (via temporary pod or exec)
# Note: Must mount the same PVC
kubectl run registry-gc --rm -it --restart=Never \
  -n docker-registry \
  --image=registry:2 \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "registry-gc",
        "image": "registry:2",
        "command": ["registry", "garbage-collect", "/etc/docker/registry/config.yml"],
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/registry"}]
      }],
      "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "docker-registry-pvc"}}]
    }
  }'

# Scale back up
kubectl scale deployment docker-registry -n docker-registry --replicas=1

Registry Management Commands

Task                 | Command
---------------------|---------------------------------------------------------------
List all images      | curl http://10.89.97.201:30500/v2/_catalog
List tags for image  | curl http://10.89.97.201:30500/v2/<name>/tags/list
Check disk usage     | kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry
Resize PVC           | kubectl patch pvc docker-registry-pvc -n docker-registry -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
Cleanup script       | /root/tower-fleet/scripts/registry-cleanup.sh --help

Ongoing Management

Recommended retention policy: Keep 5 most recent versions per image

Automation options:
1. Manual cleanup: Run /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc monthly
2. CI/CD integration: Add a cleanup step after successful deployments
3. Cron job: Schedule weekly cleanup (see below)

Example cron job (weekly Sunday 3am):

# Add to root crontab on a k8s node or management host
0 3 * * 0 /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc --keep 5 >> /var/log/registry-cleanup.log 2>&1


Alert #5: InfoInhibitor

Alert Details:
- Severity: none
- Description: Meta-alert used to inhibit info-level alerts
- Impact: None - this is intentional behavior

What It Means

InfoInhibitor is a meta-alert from kube-prometheus-stack designed to reduce noise:
1. Fires when severity="info" alerts exist
2. Used to inhibit info alerts from notifying unless there is also a warning/critical alert

Why You See It

The InfoInhibitor alert should be:
1. Routed to a null receiver (no notifications)
2. Used in inhibit_rules to suppress info alerts

If you're receiving Discord notifications for this alert, your AlertManager config needs updating.

Resolution

Check current AlertManager config:

kubectl get secret alertmanager-kube-prometheus-stack-alertmanager -n monitoring \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

Ensure these are configured:

route:
  routes:
    # Route InfoInhibitor to null (no notifications)
    - match:
        alertname: InfoInhibitor
      receiver: 'null'

receivers:
  # Null receiver - drops alerts silently
  - name: 'null'

inhibit_rules:
  # Inhibit info alerts when InfoInhibitor fires (unless warning/critical also firing)
  - source_match:
      alertname: 'InfoInhibitor'
    target_match:
      severity: 'info'
    equal: ['namespace']

Apply updated config:

# Edit your alertmanager config file
vim /root/tower-fleet/k8s/alertmanager-config.yaml

# Apply as secret
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/tower-fleet/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart AlertManager to pick up changes
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
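
To confirm the restarted AlertManager actually loaded the new routing, you can read back the running configuration from the standard v2 status API (using the LoadBalancer address listed earlier):

curl -s http://10.89.97.217:9093/api/v2/status | jq -r '.config.original' | grep -B1 -A2 InfoInhibitor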


General Investigation Workflow

When an Alert Fires:

  1. Check AlertManager UI - What's the current state?
     - Is it still firing?
     - What are the labels?
     - What's the severity?

  2. Check Prometheus UI - Query the alert expression
     - Go to the Alerts page
     - Find the alert rule
     - Click "expr" to see the PromQL query
     - Test the query in Graph view
     - Look at historical data

  3. Check Grafana Dashboards - Visual context
     - Find the relevant pre-built dashboard
     - Look for anomalies in graphs
     - Check related metrics (CPU, memory, network)

  4. Check Logs with Loki - What happened?

    # Via Grafana → Explore → Loki
    # Query: {namespace="<namespace>", pod="<pod>"}

    # Or via CLI:
    kubectl logs <pod> -n <namespace> --tail=100

  5. Check Kubernetes State - Current resource status

    kubectl get pods -n <namespace>
    kubectl describe pod <pod> -n <namespace>
    kubectl get events -n <namespace> --sort-by='.lastTimestamp'
    

Creating Effective Silences

When to Silence:
- During planned maintenance
- Known issues being actively worked on
- False positives until rules can be fixed
- Alert fatigue from noisy but non-critical alerts

How to Create Silence via CLI:

# Port-forward to AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &

# Create silence
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "AlertName", "isRegex": false},
      {"name": "namespace", "value": "my-namespace", "isRegex": false}
    ],
    "startsAt": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
    "endsAt": "'$(date -u -d "+24 hours" +"%Y-%m-%dT%H:%M:%SZ")'",
    "createdBy": "Your Name",
    "comment": "Reason for silence"
  }'
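# The response contains the new silence ID (e.g. {"silenceID":"..."}); keep it if you
# may need to expire the silence early.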

How to Delete Silence:

# List silences
curl http://localhost:9093/api/v2/silences | jq

# Delete by ID
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>


Alert Priority & Response Time

Severity  | Response Time  | Action
----------|----------------|-----------------------------------
Critical  | Immediate      | Drop everything, investigate now
Warning   | Within 1 hour  | Investigate during work hours
Info      | Best effort    | Review during regular maintenance

Critical Alerts Should Be:
- Service completely down
- Data loss imminent
- Security breach

Warning Alerts Should Be:
- Performance degraded
- Resource usage high (but not critical)
- Non-critical component failing

Info Alerts Should Be:
- Informational only
- Scheduled maintenance completed
- Configuration changes


Best Practices

  1. Review alerts weekly - Adjust thresholds based on actual behavior
  2. Silence during maintenance - Prevent alert fatigue
  3. Document resolutions - Add to runbooks for next time
  4. Test alert rules - Verify they fire when expected
  5. Keep Discord clean - Only critical alerts should page you
  6. Use Grafana for exploration - Better than CLI for visual analysis
  7. Check Loki logs first - Often reveals root cause immediately
  8. Correlate metrics - Single metric might not tell full story

Quick Reference: Direct Access URLs

# Grafana (dashboards & log exploration)
http://10.89.97.211
Credentials: admin / prom-operator

# Prometheus (metrics & alerts)
http://10.89.97.216:9090

# AlertManager (alert routing & silences)
http://10.89.97.217:9093

# Loki API (direct access, or use Grafana → Explore)
http://10.89.97.218:3100

# Longhorn (storage management)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# → http://localhost:8080

Bookmark these URLs for quick access during incident investigation!



For additional help:
- Prometheus docs: https://prometheus.io/docs/
- AlertManager docs: https://prometheus.io/docs/alerting/latest/alertmanager/
- Grafana docs: https://grafana.com/docs/