# Alert Investigation & Resolution Guide

Last Updated: 2025-11-12 | Status: Production
## Overview

This guide walks you through investigating and resolving alerts using both the CLI and the monitoring UIs (Prometheus, Grafana, AlertManager).
## Accessing the Monitoring Dashboards

### Direct Access (Recommended)

All monitoring services are exposed via LoadBalancer for direct access:

- Grafana - Pre-built dashboards, visualizations
  - URL: http://10.89.97.211
  - Credentials: admin / prom-operator
  - What to use it for: Visual exploration, pre-built dashboards, log queries via Loki
- Prometheus UI - Query metrics, view targets, check alert rules
  - URL: http://10.89.97.216:9090
  - What to use it for: PromQL queries, alert rule testing, target health
- AlertManager UI - Active alerts, silences
  - URL: http://10.89.97.217:9093
  - What to use it for: View firing alerts, create/manage silences
- Loki API - Log aggregation (best accessed via Grafana)
  - URL: http://10.89.97.218:3100
  - What to use it for: Direct API access (use Grafana → Explore for UI)
### Alternative: Port-Forward (If LoadBalancer IPs Change)

If the LoadBalancer IPs change or you need to access the services via localhost:
# Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Loki
kubectl port-forward -n monitoring svc/loki 3100:3100
## Alert #1: KubeMemoryOvercommit

Alert Details:
- Severity: Warning
- Description: Cluster has overcommitted memory resource requests
- Impact: Cannot tolerate node failure without pod evictions
### CLI Investigation
# 1. Check node memory capacity
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
MEMORY:.status.capacity.memory
# 2. Sum pod memory requests by namespace (normalize Gi/Mi/Ki to Mi so the totals are comparable)
kubectl get pods -A -o json | \
  jq -r '.items[] | .metadata.namespace as $ns |
    .spec.containers[].resources.requests.memory // empty |
    (if endswith("Gi") then (rtrimstr("Gi") | tonumber * 1024)
     elif endswith("Mi") then (rtrimstr("Mi") | tonumber)
     elif endswith("Ki") then (rtrimstr("Ki") | tonumber / 1024)
     else ((tonumber? // 0) / 1048576) end | floor) |
    "\($ns) \(.)"' | \
  awk '{ns[$1]+=$2} END {for (n in ns) print n, ns[n] "Mi"}' | \
  sort -k2 -rn
# 3. Find top memory-requesting pods (requests normalized to Mi)
kubectl get pods -A -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name)" as $pod |
    .spec.containers[].resources.requests.memory // empty |
    (if endswith("Gi") then (rtrimstr("Gi") | tonumber * 1024)
     elif endswith("Mi") then (rtrimstr("Mi") | tonumber)
     elif endswith("Ki") then (rtrimstr("Ki") | tonumber / 1024)
     else ((tonumber? // 0) / 1048576) end | floor) |
    "\($pod) \(.)Mi"' | \
  sort -k2 -rn | head -20
# 4. Calculate total requests vs capacity
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
### Prometheus UI Investigation
1. Access Prometheus: http://10.89.97.216:9090
2. Check Total Memory Requests
3. Check Node Capacity
4. Calculate Overcommit Ratio
5. Find Top Memory Requesters
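The queries behind steps 2-5 are not reproduced on this page. The PromQL below is an illustrative sketch that assumes the kube-state-metrics v2 series shipped with kube-prometheus-stack (the metric names are an assumption, not taken from the original runbook):

# Total memory requested by all pods (bytes)
sum(kube_pod_container_resource_requests{resource="memory"})

# Total allocatable node memory (bytes)
sum(kube_node_status_allocatable{resource="memory"})

# Overcommit ratio - anything above 1 means requests exceed what the nodes can provide
sum(kube_pod_container_resource_requests{resource="memory"})
  / sum(kube_node_status_allocatable{resource="memory"})

# Top memory requesters, grouped by namespace
sort_desc(sum by (namespace) (kube_pod_container_resource_requests{resource="memory"}))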
### Grafana Dashboard Investigation

1. Access Grafana: http://10.89.97.211 (admin/prom-operator)
2. Pre-built Dashboards:
   - Kubernetes / Compute Resources / Cluster - Shows total capacity vs requests
   - Kubernetes / Compute Resources / Namespace (Pods) - Memory by namespace
   - Node Exporter / Nodes - Per-node memory usage
3. What to Look For:
   - Red bars on resource usage graphs (>100% = overcommitted)
   - Namespaces consuming the most memory
   - Pods with high memory requests but low actual usage
### Resolution Options
Option 1: Reduce Pod Memory Requests (Recommended)
# Find pods with excessive requests vs actual usage
kubectl top pods -A --sort-by=memory
# Edit deployment to lower memory request
kubectl edit deployment <name> -n <namespace>
# Look for:
resources:
  requests:
    memory: "2Gi"  # ← Reduce this if actual usage is much lower
Option 2: Increase Repeat Interval in AlertManager
# /root/k8s/alertmanager-config.yaml
route:
  repeat_interval: 24h  # Only notify once per day (was 12h)
Option 3: Accept the Risk
- Single-node K3s clusters often overcommit
- Risk: If the node fails, there is no backup node to reschedule pods
- Mitigation: Regular backups, quick node recovery procedures
## Alert #2: KubePodNotReady / KubeStatefulSetReplicasMismatch
Alert Details:
- Pod: loki-chunks-cache-0 in monitoring namespace
- Severity: Warning
- Description: Pod stuck in Pending state, StatefulSet has 0/1 replicas
### CLI Investigation
# 1. Check pod status
kubectl get pod loki-chunks-cache-0 -n monitoring
# 2. Describe pod for events
kubectl describe pod loki-chunks-cache-0 -n monitoring
# 3. Check for PVC issues (common cause)
kubectl get pvc -n monitoring | grep loki-chunks-cache
# 4. Check PVC details if exists
kubectl describe pvc <pvc-name> -n monitoring
# 5. Check StatefulSet
kubectl get statefulset loki-chunks-cache -n monitoring
kubectl describe statefulset loki-chunks-cache -n monitoring
# 6. Check storage class availability
kubectl get storageclass
kubectl get pv | grep loki
# 7. View pod logs (if running)
kubectl logs loki-chunks-cache-0 -n monitoring
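It can also help to pull the scheduling events for the pod directly; this event filter is a convenience addition, not a step from the original runbook:

# Show recent events for the stuck pod, newest last
kubectl get events -n monitoring \
  --field-selector involvedObject.name=loki-chunks-cache-0 \
  --sort-by=.lastTimestamp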
### Prometheus UI Investigation

1. Check Pod Phase:
   - 1 = Running
   - 2 = Pending
   - 3 = Failed
2. Check Container Ready Status
3. Check StatefulSet Replicas
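The underlying queries are not included above. The following kube-state-metrics queries are one reasonable sketch (the 1/2/3 phase codes in the list are the runbook's own shorthand; kube_pod_status_phase itself exposes a separate 0/1 series per phase):

# Which phase is the pod reporting? (returns the series whose value is 1)
kube_pod_status_phase{namespace="monitoring", pod="loki-chunks-cache-0"} == 1

# Container readiness (0 = not ready)
kube_pod_container_status_ready{namespace="monitoring", pod="loki-chunks-cache-0"}

# StatefulSet replicas: current vs ready
kube_statefulset_status_replicas{namespace="monitoring", statefulset="loki-chunks-cache"}
kube_statefulset_status_replicas_ready{namespace="monitoring", statefulset="loki-chunks-cache"}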
### Grafana Dashboard Investigation

1. Dashboard: Kubernetes / Compute Resources / Namespace (Pods)
   - Filter namespace: monitoring
   - Look for loki-chunks-cache-0
   - Check CPU/Memory usage (will be 0 if Pending)
2. Dashboard: Kubernetes / Persistent Volumes
   - Check if the PVC is bound
   - Look for storage capacity issues
### Resolution Options

Background: loki-chunks-cache is optional
- This component caches Loki index queries for improved query performance
- Default configuration requests 9.6GB memory - far too large for homelab clusters
- Loki SingleBinary mode works fine without it
- Performance impact is minimal for homelab workloads (<10GB/day log ingestion)
Root Cause Analysis:
# Check pod scheduling failure
kubectl describe pod loki-chunks-cache-0 -n monitoring
# Common error: "0/3 nodes are available: 3 Insufficient memory"
# The pod requests 9830Mi (9.6GB) but nodes only have ~7.75GB each
Option 1: Disable Chunks Cache via Helm (Recommended)
# Permanently disable the cache by updating Helm values
helm upgrade loki grafana/loki \
-n monitoring \
--reuse-values \
--set chunksCache.enabled=false
# Verify removal
kubectl get statefulset loki-chunks-cache -n monitoring
# Should return: Error from server (NotFound)
# Check Loki is healthy
kubectl get pod loki-0 -n monitoring
# Should be: 2/2 Running
Why this is better than scaling to 0:
- Configuration persists across Helm upgrades
- StatefulSet is fully removed (cleaner)
- No confusion about why replicas=0
- Documented in Helm release history
Option 2: Reduce Memory to Reasonable Levels
# Keep the cache but reduce memory to 1-2GB
helm upgrade loki grafana/loki \
-n monitoring \
--reuse-values \
--set chunksCache.resources.requests.memory=1Gi \
--set chunksCache.resources.limits.memory=1229Mi
# This matches the results-cache configuration which runs successfully
Option 3: Increase Node Memory (Not Recommended for Homelab)
# Add 4-8GB RAM to each k3s VM
# Requires VM shutdown and reconfiguration
# Overkill for typical homelab log volumes
# Only pursue this if you have other workloads requiring more memory
Option 4: Fix PVC Issues (Only if PVC exists and is Pending)
# Check if PVC is the issue
kubectl get pvc -n monitoring | grep loki-chunks-cache
# If PVC exists and is Pending:
kubectl describe pvc <pvc-name> -n monitoring
# Check storage availability
kubectl get pv
kubectl get sc
# If using Longhorn, check UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
Verified Resolution (2025-11-16):
- Disabled chunks cache via Helm on the production cluster
- All three alerts (KubePodNotReady, KubeStatefulSetReplicasMismatch, KubeMemoryOvercommit) resolved
- Loki continues operating normally with results-cache only
- Memory pressure reduced from 1.8GB overcommit to comfortable levels (19-41% per node)
## Alert #3: AlertmanagerFailedToSendAlerts

Alert Details:
- Severity: Warning
- Description: AlertManager failing to send 12-14% of notifications to Discord
- Cause: Discord rate limiting (429 errors)
### CLI Investigation
# 1. Check AlertManager logs for errors
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager | grep -i "error\|429\|rate"
# 2. Check recent Discord webhook attempts
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager --tail=100 | grep discord
# 3. Count active firing alerts
kubectl get prometheusrules -n monitoring -o json | \
jq -r '.items[].spec.groups[].rules[] | select(.alert != null) | .alert' | wc -l
# 4. Check AlertManager config
kubectl get secret alertmanager-discord-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
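If amtool is available in the AlertManager container (the upstream prom/alertmanager image ships it, so this is a reasonable assumption for kube-prometheus-stack), you can also inspect the live alert list without leaving the CLI:

# List currently firing alerts via amtool (assumes amtool is present in the image)
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093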
### Prometheus UI Investigation

1. Check Notification Queue Size
2. Calculate Failure Rate
3. Check Alerts by Receiver
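The exact queries are not reproduced on this page; the following is an illustrative sketch using metrics expected in this stack (the queue metric shown is Prometheus's own send queue to AlertManager, which may or may not be what the original step referred to):

# Alerts queued on the Prometheus side waiting to be pushed to AlertManager
prometheus_notifications_queue_length

# Notification failure rate over the last hour (Discord failures appear under its integration label)
sum(rate(alertmanager_notifications_failed_total[1h]))
  / sum(rate(alertmanager_notifications_total[1h]))

# Active alerts as AlertManager sees them (per-receiver grouping is easiest in the AlertManager UI)
alertmanager_alerts{state="active"}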
### AlertManager UI Investigation

1. Access AlertManager: http://10.89.97.217:9093
2. View Active Alerts:
   - Shows all currently firing alerts
   - Group by: alertname, namespace, severity
3. Check Silences:
   - Shows active silences
   - Create new silences to temporarily stop notifications
### Resolution Options
Option 1: Increase AlertManager Group Interval (Recommended)
# /root/k8s/alertmanager-config.yaml
route:
  group_interval: 15m   # Increase from 10s to 15 minutes
  repeat_interval: 24h  # Also increase repeat interval
# Apply updated config
kubectl create secret generic alertmanager-discord-config \
-n monitoring \
--from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
--dry-run=client -o yaml | kubectl apply -f -
# Restart AlertManager pods to reload config
kubectl delete pod -n monitoring -l app.kubernetes.io/name=alertmanager
Option 2: Disable Low-Priority Alerts
# Find PrometheusRules generating noise
kubectl get prometheusrules -n monitoring
# Delete or edit rules that are not critical
kubectl edit prometheusrule <name> -n monitoring
Option 3: Use Multiple Discord Webhooks
# /root/k8s/alertmanager-config.yaml
receivers:
  - name: 'discord-critical'
    discord_configs:
      - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK1'
  - name: 'discord-warning'
    discord_configs:
      - webhook_url: 'https://discord.com/api/webhooks/WEBHOOK2'
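Defining two receivers alone does not split traffic; routes matching on severity are also needed. A sketch of what that routing could look like, reusing the receiver names above (exact severity labels depend on your alert rules):

# Route critical alerts to one webhook and everything else to the other
route:
  receiver: 'discord-warning'        # default receiver
  routes:
    - match:
        severity: critical
      receiver: 'discord-critical'
    - match:
        severity: warning
      receiver: 'discord-warning'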
Option 4: Temporarily Silence Noisy Alerts
# Via AlertManager UI (http://10.89.97.217:9093, or http://localhost:9093 when port-forwarding):
# 1. Find the alert
# 2. Click "Silence"
# 3. Set duration (e.g., 24h)
# 4. Add reason: "Investigating - reducing noise"
## Alert #4: KubePersistentVolumeFillingUp (Docker Registry)

Alert Details:
- Severity: Critical (when <3% free) or Warning (when <15% free)
- Namespace: docker-registry
- Description: PersistentVolume running out of space
- Impact: Registry will become read-only or fail, breaking CI/CD pipelines
### CLI Investigation
# 1. Check current disk usage
kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry
# 2. Check PVC size and status
kubectl get pvc -n docker-registry
kubectl describe pvc docker-registry-pvc -n docker-registry
# 3. List all images in registry
curl -s http://10.89.97.201:30500/v2/_catalog | jq
# 4. List tags for each image
for repo in $(curl -s http://10.89.97.201:30500/v2/_catalog | jq -r '.repositories[]'); do
echo "=== $repo ==="
curl -s http://10.89.97.201:30500/v2/$repo/tags/list | jq -r '.tags[]' | sort -V
done
# 5. Check blob storage size (where actual data lives)
kubectl exec -n docker-registry deploy/docker-registry -- du -sh /var/lib/registry/docker/registry/v2/blobs/
### Prometheus UI Investigation

1. Check PV Usage Over Time
2. Check Available Space
3. Predict When Full (linear projection)
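The queries themselves are not included on this page; the kubelet volume-stats metrics below are an illustrative sketch (the PVC name reuses docker-registry-pvc from the commands above):

# Percentage of the registry PVC in use
kubelet_volume_stats_used_bytes{persistentvolumeclaim="docker-registry-pvc"}
  / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="docker-registry-pvc"} * 100

# Bytes still available on the volume
kubelet_volume_stats_available_bytes{persistentvolumeclaim="docker-registry-pvc"}

# Rough linear projection of available bytes 7 days out (negative = will fill before then)
predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim="docker-registry-pvc"}[6h], 7 * 24 * 3600)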
### Resolution Options
Option 1: Expand PVC (Immediate Relief)
# Longhorn supports online volume expansion
kubectl patch pvc docker-registry-pvc -n docker-registry \
-p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
# Verify expansion
kubectl get pvc -n docker-registry
# Wait for capacity to update (may take a minute)
Option 2: Clean Up Old Image Tags (Recommended)
# Use the cleanup script (dry run first)
/root/tower-fleet/scripts/registry-cleanup.sh
# Actually delete old tags (keeps latest 5 versions per repo)
/root/tower-fleet/scripts/registry-cleanup.sh --execute
# Delete and run garbage collection
/root/tower-fleet/scripts/registry-cleanup.sh --execute --gc
# Keep only 3 versions instead of 5
/root/tower-fleet/scripts/registry-cleanup.sh --execute --keep 3
Option 3: Manual Tag Deletion
# Get digest for a specific tag
REPO="home-portal"
TAG="v1.0.1"
DIGEST=$(curl -sI -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
http://10.89.97.201:30500/v2/${REPO}/manifests/${TAG} | \
grep -i Docker-Content-Digest | awk '{print $2}' | tr -d '\r')
# Delete the tag
curl -X DELETE "http://10.89.97.201:30500/v2/${REPO}/manifests/${DIGEST}"
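Manifest DELETEs only succeed if the registry has deletion enabled; if the call returns 405, check the deployment for something like the setting below (the env-var approach is the common way to configure the registry:2 image, but your chart may set it in config.yml instead):

# Confirm deletes are enabled on the registry (commonly REGISTRY_STORAGE_DELETE_ENABLED=true)
kubectl get deployment docker-registry -n docker-registry \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | grep -i DELETE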
Option 4: Run Garbage Collection (After Deleting Tags)
Deleting tags only removes manifest references. Actual blob data requires garbage collection:
# Scale down registry (GC requires exclusive access)
kubectl scale deployment docker-registry -n docker-registry --replicas=0
# Wait for pod to terminate
kubectl wait --for=delete pod -l app=docker-registry -n docker-registry --timeout=60s
# Run garbage collection (via temporary pod or exec)
# Note: Must mount the same PVC
kubectl run registry-gc --rm -it --restart=Never \
  -n docker-registry \
  --image=registry:2 \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "containers": [{
        "name": "registry-gc",
        "image": "registry:2",
        "command": ["registry", "garbage-collect", "/etc/docker/registry/config.yml"],
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/registry"}]
      }],
      "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "docker-registry-pvc"}}]
    }
  }'
# Scale back up
kubectl scale deployment docker-registry -n docker-registry --replicas=1
### Registry Management Commands
| Task | Command |
|---|---|
| List all images | curl http://10.89.97.201:30500/v2/_catalog |
| List tags for image | curl http://10.89.97.201:30500/v2/<name>/tags/list |
| Check disk usage | kubectl exec -n docker-registry deploy/docker-registry -- df -h /var/lib/registry |
| Resize PVC | kubectl patch pvc docker-registry-pvc -n docker-registry -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}' |
| Cleanup script | /root/tower-fleet/scripts/registry-cleanup.sh --help |
### Ongoing Management
Recommended retention policy: Keep 5 most recent versions per image
Automation options:
1. Manual cleanup: Run /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc monthly
2. CI/CD integration: Add cleanup step after successful deployments
3. Cron job: Schedule weekly cleanup (see below)
Example cron job (weekly Sunday 3am):
# Add to root crontab on a k8s node or management host
0 3 * * 0 /root/tower-fleet/scripts/registry-cleanup.sh --execute --gc --keep 5 >> /var/log/registry-cleanup.log 2>&1
## Alert #5: InfoInhibitor

Alert Details:
- Severity: none
- Description: Meta-alert used to inhibit info-level alerts
- Impact: None - this is intentional behavior
### What It Means
InfoInhibitor is a meta-alert from kube-prometheus-stack designed to reduce noise:
1. Fires when severity="info" alerts exist
2. Used to inhibit info alerts from firing unless there's also a warning/critical alert
### Why You See It

The InfoInhibitor alert should be:
1. Routed to a null receiver (no notifications)
2. Used in inhibit_rules to suppress info alerts
If you're receiving Discord notifications for this alert, your AlertManager config needs updating.
### Resolution
Check current AlertManager config:
kubectl get secret alertmanager-kube-prometheus-stack-alertmanager -n monitoring \
-o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
Ensure these are configured:
route:
  routes:
    # Route InfoInhibitor to null (no notifications)
    - match:
        alertname: InfoInhibitor
      receiver: 'null'

receivers:
  # Null receiver - drops alerts silently
  - name: 'null'

inhibit_rules:
  # Inhibit info alerts when InfoInhibitor fires (unless warning/critical also firing)
  - source_match:
      alertname: 'InfoInhibitor'
    target_match:
      severity: 'info'
    equal: ['namespace']
Apply updated config:
# Edit your alertmanager config file
vim /root/tower-fleet/k8s/alertmanager-config.yaml
# Apply as secret
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
-n monitoring \
--from-file=alertmanager.yaml=/root/tower-fleet/k8s/alertmanager-config.yaml \
--dry-run=client -o yaml | kubectl apply -f -
# Restart AlertManager to pick up changes
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
## General Investigation Workflow

### When an Alert Fires

1. Check AlertManager UI - What's the current state?
   - Is it still firing?
   - What are the labels?
   - What's the severity?
2. Check Prometheus UI - Query the alert expression
   - Go to the Alerts page
   - Find the alert rule
   - Click "expr" to see the PromQL query
   - Test the query in Graph view
   - Look at historical data
3. Check Grafana Dashboards - Visual context
   - Find the relevant pre-built dashboard
   - Look for anomalies in graphs
   - Check related metrics (CPU, memory, network)
4. Check Logs with Loki - What happened? (example LogQL query below)
5. Check Kubernetes State - Current resource status
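As a starting point for step 4, a LogQL query along these lines in Grafana → Explore (Loki data source) pulls suspicious lines for a namespace; the namespace and pod label names assume the default Kubernetes log labeling, and the example targets are illustrative:

# Recent error lines from a namespace
{namespace="monitoring"} |= "error"

# Narrow to a single pod and match panics or OOM kills (case-insensitive)
{namespace="monitoring", pod="loki-chunks-cache-0"} |~ "(?i)(panic|oom|out of memory)"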
### Creating Effective Silences

When to Silence:
- During planned maintenance
- Known issues being actively worked on
- False positives until rules can be fixed
- Alert fatigue from noisy but non-critical alerts
How to Create Silence via CLI:
# Port-forward to AlertManager
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Create silence
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name": "alertname", "value": "AlertName", "isRegex": false},
{"name": "namespace", "value": "my-namespace", "isRegex": false}
],
"startsAt": "'$(date -u +"%Y-%m-%dT%H:%M:%SZ")'",
"endsAt": "'$(date -u -d "+24 hours" +"%Y-%m-%dT%H:%M:%SZ")'",
"createdBy": "Your Name",
"comment": "Reason for silence"
}'
How to Delete Silence:
# List silences
curl http://localhost:9093/api/v2/silences | jq
# Delete by ID
curl -X DELETE http://localhost:9093/api/v2/silence/<silence-id>
## Alert Priority & Response Time
| Severity | Response Time | Action |
|---|---|---|
| Critical | Immediate | Drop everything, investigate now |
| Warning | Within 1 hour | Investigate during work hours |
| Info | Best effort | Review during regular maintenance |
Critical Alerts Should Be:
- Service completely down
- Data loss imminent
- Security breach

Warning Alerts Should Be:
- Performance degraded
- Resource usage high (but not critical)
- Non-critical component failing

Info Alerts Should Be:
- Informational only
- Scheduled maintenance completed
- Configuration changes
## Best Practices
- Review alerts weekly - Adjust thresholds based on actual behavior
- Silence during maintenance - Prevent alert fatigue
- Document resolutions - Add to runbooks for next time
- Test alert rules - Verify they fire when expected
- Keep Discord clean - Only critical alerts should page you
- Use Grafana for exploration - Better than CLI for visual analysis
- Check Loki logs first - Often reveals root cause immediately
- Correlate metrics - Single metric might not tell full story
## Quick Reference: Direct Access URLs
# Grafana (dashboards & log exploration)
http://10.89.97.211
Credentials: admin / prom-operator
# Prometheus (metrics & alerts)
http://10.89.97.216:9090
# AlertManager (alert routing & silences)
http://10.89.97.217:9093
# Loki API (direct access, or use Grafana → Explore)
http://10.89.97.218:3100
# Longhorn (storage management)
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# → http://localhost:8080
Bookmark these URLs for quick access during incident investigation!
## Related Documentation
- Alerting Guide - Full alerting configuration reference
- Observability Standards - Metrics, logging, tracing standards
- Loki Operations - Log aggregation and queries
- Troubleshooting Guide - General K8s troubleshooting
For additional help:
- Prometheus docs: https://prometheus.io/docs/
- AlertManager docs: https://prometheus.io/docs/alerting/latest/alertmanager/
- Grafana docs: https://grafana.com/docs/