# Alerting & Notifications Guide

Last Updated: 2025-11-12 | Status: Production

## Overview
This guide covers the alerting and notification system configured for the K3s cluster, including AlertManager, Discord integration, and alert rules.
## Architecture

Components:

- **Prometheus**: Evaluates alert rules based on metrics
- **AlertManager**: Routes and manages alert notifications
- **Discord**: Receives formatted alert notifications via webhook
- **PrometheusRules**: CRDs defining when alerts fire
## Alert Rules

### Home Portal Alerts

Location: `/root/k8s/home-portal-alerts.yaml`
| Alert | Severity | Threshold | Description |
|---|---|---|---|
| HomePortalDown | critical | 2 minutes | Application is not responding |
| HomePortalHighErrorRate | warning | >5% for 5min | High 5xx error rate |
| HomePortalHighLatency | warning | P95 >2s for 10min | Slow response times |
| HomePortalHighMemory | warning | >85% for 10min | Memory usage near limit |
| HomePortalFrequentRestarts | warning | Any restarts in 15min | Pod is crash looping |
| HomePortalMetricsDown | warning | 5 minutes | Metrics endpoint not responding |
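
For orientation, a hypothetical sketch of how a rule like HomePortalDown might be expressed. The real definition lives in `/root/k8s/home-portal-alerts.yaml`; the metric and label names here are assumptions.

```yaml
# Sketch only - see /root/k8s/home-portal-alerts.yaml for the actual rule.
- alert: HomePortalDown
  expr: up{namespace="home-portal"} == 0   # scrape target is not responding
  for: 2m                                  # matches the 2-minute threshold above
  labels:
    severity: critical
  annotations:
    summary: "Home Portal is not responding"
```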
### Cluster-Wide Alerts

Location: `/root/k8s/cluster-alerts.yaml`

Infrastructure:

- NodeDown - Node is offline
- NodeHighCPU - >85% CPU for 15min
- NodeHighMemory - >85% memory for 15min
- NodeHighDiskUsage - <15% free space
- NodeCriticalDiskUsage - <5% free space

Kubernetes Resources:

- PodCrashLooping - Pod restarting frequently
- PodNotReady - Pod stuck in non-running state
- DeploymentReplicasMismatch - Deployment not at desired replicas

Storage (Longhorn):

- LonghornVolumeDegraded - Volume in degraded state
- LonghornVolumeFaulted - Volume in faulted state (critical)

Monitoring Stack:

- PrometheusDown - Prometheus offline
- LokiDown - Loki offline
- GrafanaDown - Grafana offline
## Discord Configuration

### Webhook Setup

1. **Create Discord Webhook:**
   - Open the Discord channel
   - Channel Settings → Integrations → Webhooks
   - Create webhook, copy URL

2. **Configure AlertManager:**

   ```bash
   # Edit AlertManager config
   vim /root/k8s/alertmanager-config.yaml
   # Update the webhook URL in the receivers section

   # Apply changes
   kubectl create secret generic alertmanager-discord-config \
     -n monitoring \
     --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
     --dry-run=client -o yaml | kubectl apply -f -

   # Patch AlertManager to use the new config
   kubectl patch alertmanager kube-prometheus-stack-alertmanager \
     -n monitoring \
     --type='json' \
     -p='[{"op": "replace", "path": "/spec/configSecret", "value": "alertmanager-discord-config"}]'
   ```
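For reference, a sketch of what the route and receivers sections of `alertmanager.yaml` might look like. This assumes Alertmanager v0.25+ with native `discord_configs`; the actual file at `/root/k8s/alertmanager-config.yaml` may use a generic webhook receiver or a bridge instead, so treat names and values as placeholders.

```yaml
# Sketch - assumes Alertmanager >= 0.25 with native discord_configs.
route:
  receiver: discord
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/YOUR_WEBHOOK_URL
        send_resolved: true
```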
### Message Format

Standard Alert:

```
**Alert:** [AlertName]
**Severity:** [critical/warning/info]
**Namespace:** [namespace]
**Description:** [description]
**Summary:** [summary]
```

Critical Alert:
## Managing Alerts

### Adding New Alert Rules

For Application-Specific Alerts:

1. **Create PrometheusRule:**

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-app-alerts
     namespace: monitoring
     labels:
       prometheus: kube-prometheus-stack-prometheus
       role: alert-rules
   spec:
     groups:
       - name: my-app
         interval: 30s
         rules:
           - alert: MyAppDown
             expr: up{namespace="my-app"} == 0
             for: 2m
             labels:
               severity: critical
               namespace: my-app
             annotations:
               summary: "My App is down"
               description: "My App has been down for 2+ minutes"
   ```

2. **Apply:** see the commands below.

3. **Verify:** see the commands below.
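A minimal sketch for steps 2 and 3, assuming the manifest above is saved as `my-app-alerts.yaml`:

```bash
# Apply the rule
kubectl apply -f my-app-alerts.yaml

# Verify it was created in the cluster
kubectl get prometheusrule my-app-alerts -n monitoring
# Then confirm the alert appears at http://localhost:9090/alerts
# (see "Viewing Active Alerts" below)
```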
### Viewing Active Alerts
Via Prometheus UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/alerts
```

Via AlertManager UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093
```
Via Discord: Active alerts are automatically sent to the configured Discord channel.
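
Active alerts can also be listed from the AlertManager API once the port-forward above is running. A sketch (the `jq` filter is optional and assumes `jq` is installed):

```bash
# List currently firing alerts via the AlertManager v2 API
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```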
### Silencing Alerts

Temporary Silence via AlertManager:

1. Access AlertManager UI (see above)
2. Find the alert
3. Click "Silence"
4. Set duration and reason
5. Confirm
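
Silences can also be created from the command line with `amtool`, assuming it is installed locally and the port-forward above is active. A sketch; the matcher is an example, not a required alert name:

```bash
# Silence a specific alert for 2 hours
amtool silence add alertname=HomePortalHighMemory \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --author="ops" \
  --comment="Investigating memory usage"
```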
Permanent Disable:

```bash
# Edit the PrometheusRule to remove a specific rule
kubectl edit prometheusrule <rule-name> -n monitoring

# Or delete the entire PrometheusRule
kubectl delete prometheusrule <rule-name> -n monitoring
```
## Alert Severity Levels
| Severity | Use Case | Example |
|---|---|---|
| critical | System down, data loss risk | App offline, node down |
| warning | Performance degraded, may lead to critical | High latency, high memory |
| info | Informational, no action needed | Deployment completed |
Inhibition Rules (see the configuration sketch below):

- Critical alerts suppress warnings for the same issue
- Warnings suppress info alerts for the same issue
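A sketch of how such inhibition rules are typically expressed in `alertmanager.yaml`; the actual rules in `/root/k8s/alertmanager-config.yaml` may use different matchers.

```yaml
# Higher-severity alerts mute lower-severity ones for the same alert and namespace.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'namespace']
  - source_matchers:
      - severity="warning"
    target_matchers:
      - severity="info"
    equal: ['alertname', 'namespace']
```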
## Troubleshooting

### Alerts Not Firing

Check that Prometheus is evaluating rules:

```bash
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus | grep -i "rule\|alert"
```
Check PrometheusRule syntax:
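One way to do this (a sketch; `promtool` is optional and only needed for the second step):

```bash
# Inspect the PrometheusRule as stored in the cluster
kubectl get prometheusrule <rule-name> -n monitoring -o yaml

# Optionally lint the rule group locally. Note that promtool expects a plain
# Prometheus rule file (the contents of spec.groups), not the full manifest;
# my-rules.yaml is a hypothetical extracted file.
promtool check rules my-rules.yaml
```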
Test alert expression in Prometheus UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/graph
# Paste the alert expr and execute
```
### Alerts Not Reaching Discord
Check AlertManager logs:
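A likely command, assuming the operator-generated pod follows the usual `alertmanager-<name>-0` naming:

```bash
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager | grep -i "discord\|error\|429"
```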
Common issues:

- Invalid webhook URL
- Discord rate limiting (429 errors)
- Network connectivity

Test the webhook directly:

```bash
curl -H "Content-Type: application/json" \
  -d '{"content": "Test message"}' \
  https://discord.com/api/webhooks/YOUR_WEBHOOK_URL
```
### Discord Rate Limiting

If you see 429 errors in the AlertManager logs, Discord is rate limiting. Solutions:

- Increase `group_interval` in the AlertManager config (default: 5m; see the sketch below)
- Reduce alert frequency by adjusting `for:` durations in rules
- Use inhibition rules to suppress duplicate alerts
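A sketch of the relevant route settings in `alertmanager.yaml`; the values are examples, not the current configuration.

```yaml
route:
  receiver: discord
  group_wait: 30s
  group_interval: 10m    # raised from the 5m default to batch notifications
  repeat_interval: 12h   # resend still-firing alerts less often
```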
## K3s-Specific Considerations

### False Positive Alerts - RESOLVED

K3s combines components differently than standard Kubernetes. The default kube-prometheus-stack includes alert rules for components that K3s embeds into the main process.

These alerts were false positives on K3s:

- KubeProxyDown (K3s uses kube-proxy differently)
- KubeSchedulerDown (the scheduler is embedded in the k3s process)
- KubeControllerManagerDown (the controller manager is embedded)
Resolution:

The problematic PrometheusRules have been deleted:

```bash
kubectl delete prometheusrule -n monitoring \
  kube-prometheus-stack-kubernetes-system-kube-proxy \
  kube-prometheus-stack-kubernetes-system-scheduler \
  kube-prometheus-stack-kubernetes-system-controller-manager
```

These alerts will no longer fire. K3s cluster health is still monitored through:

- Node-level metrics (NodeDown, NodeHighCPU, etc.)
- Pod-level metrics (PodCrashLooping, PodNotReady, etc.)
- kube-apiserver health (the primary K3s component to watch)
## Alert Configuration Files

| File | Purpose |
|---|---|
| `/root/k8s/alertmanager-config.yaml` | AlertManager routing and receivers |
| `/root/k8s/home-portal-alerts.yaml` | Home Portal alert rules |
| `/root/k8s/cluster-alerts.yaml` | Cluster-wide alert rules |
Apply changes:

```bash
# After editing alert rules
kubectl apply -f /root/k8s/<alerts-file>.yaml

# After editing the AlertManager config
kubectl create secret generic alertmanager-discord-config \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
```
## Best Practices

- **Start with few critical alerts** - Avoid alert fatigue
- **Use appropriate `for:` durations** - Prevent flapping alerts
- **Group related alerts** - Use inhibition rules
- **Test thoroughly** - Verify alerts fire correctly before relying on them
- **Document alert response** - What to do when each alert fires
- **Review regularly** - Adjust thresholds based on actual behavior
- **Silence during maintenance** - Use AlertManager silences
## References
- Prometheus Alerting
- AlertManager Configuration
- PrometheusRule CRD
- Discord Webhook API: https://discord.com/developers/docs/resources/webhook
For additional help:
- Observability Standards: /root/tower-fleet/docs/reference/observability-standards.md
- Loki Operations: /root/tower-fleet/docs/operations/loki-operations.md