# Alerting & Notifications Guide

Last Updated: 2025-11-12 | Status: Production

## Overview
This guide covers the alerting and notification system configured for the K3s cluster, including AlertManager, Discord integration, and alert rules.
## Architecture

Components:

- **Prometheus**: Evaluates alert rules based on metrics
- **AlertManager**: Routes and manages alert notifications
- **Discord**: Receives formatted alert notifications via webhook
- **PrometheusRules**: CRDs defining when alerts fire
## Alert Rules

### Home Portal Alerts

Location: `/root/k8s/home-portal-alerts.yaml`
| Alert | Severity | Threshold | Description |
|---|---|---|---|
| HomePortalDown | critical | 2 minutes | Application is not responding |
| HomePortalHighErrorRate | warning | >5% for 5min | High 5xx error rate |
| HomePortalHighLatency | warning | P95 >2s for 10min | Slow response times |
| HomePortalHighMemory | warning | >85% for 10min | Memory usage near limit |
| HomePortalFrequentRestarts | warning | Any restarts in 15min | Pod is crash looping |
| HomePortalMetricsDown | warning | 5 minutes | Metrics endpoint not responding |
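
For orientation, a hypothetical sketch of how a rule like HomePortalDown might be expressed. The real definition lives in `/root/k8s/home-portal-alerts.yaml`; the metric and label names here are assumptions.

```yaml
# Sketch only - see /root/k8s/home-portal-alerts.yaml for the actual rule.
- alert: HomePortalDown
  expr: up{namespace="home-portal"} == 0   # scrape target is not responding
  for: 2m                                  # matches the 2-minute threshold above
  labels:
    severity: critical
  annotations:
    summary: "Home Portal is not responding"
```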
### Cluster-Wide Alerts

Location: `/root/k8s/cluster-alerts.yaml`

Infrastructure:

- NodeDown - Node is offline
- NodeHighCPU - >85% CPU for 15min
- NodeHighMemory - >85% memory for 15min
- NodeHighDiskUsage - <15% free space
- NodeCriticalDiskUsage - <5% free space

Kubernetes Resources:

- PodCrashLooping - Pod restarting frequently
- PodNotReady - Pod stuck in non-running state
- DeploymentReplicasMismatch - Deployment not at desired replicas

Storage (Longhorn):

- LonghornVolumeDegraded - Volume in degraded state
- LonghornVolumeFaulted - Volume in faulted state (critical)

Monitoring Stack:

- PrometheusDown - Prometheus offline
- LokiDown - Loki offline
- GrafanaDown - Grafana offline
## Discord Configuration

### Webhook Setup

1. **Create Discord Webhook:**
   - Open the Discord channel
   - Channel Settings → Integrations → Webhooks
   - Create webhook, copy URL

2. **Configure AlertManager:**

   ```bash
   # Edit AlertManager config
   vim /root/k8s/alertmanager-config.yaml
   # Update the webhook URL in the receivers section

   # Apply changes
   kubectl create secret generic alertmanager-discord-config \
     -n monitoring \
     --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
     --dry-run=client -o yaml | kubectl apply -f -

   # Patch AlertManager to use the new config
   kubectl patch alertmanager kube-prometheus-stack-alertmanager \
     -n monitoring \
     --type='json' \
     -p='[{"op": "replace", "path": "/spec/configSecret", "value": "alertmanager-discord-config"}]'
   ```
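For reference, a sketch of what the route and receivers sections of `alertmanager.yaml` might look like. This assumes Alertmanager v0.25+ with native `discord_configs`; the actual file at `/root/k8s/alertmanager-config.yaml` may use a generic webhook receiver or a bridge instead, so treat names and values as placeholders.

```yaml
# Sketch - assumes Alertmanager >= 0.25 with native discord_configs.
route:
  receiver: discord
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/YOUR_WEBHOOK_URL
        send_resolved: true
```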
### Message Format

Standard Alert:

```
**Alert:** [AlertName]
**Severity:** [critical/warning/info]
**Namespace:** [namespace]
**Description:** [description]
**Summary:** [summary]
```

Critical Alert:
## Managing Alerts

### Adding New Alert Rules

For Application-Specific Alerts:

1. **Create PrometheusRule:**

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: my-app-alerts
     namespace: monitoring
     labels:
       prometheus: kube-prometheus-stack-prometheus
       role: alert-rules
   spec:
     groups:
       - name: my-app
         interval: 30s
         rules:
           - alert: MyAppDown
             expr: up{namespace="my-app"} == 0
             for: 2m
             labels:
               severity: critical
               namespace: my-app
             annotations:
               summary: "My App is down"
               description: "My App has been down for 2+ minutes"
   ```

2. **Apply:** see the commands below.

3. **Verify:** see the commands below.
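A minimal sketch for steps 2 and 3, assuming the manifest above is saved as `my-app-alerts.yaml`:

```bash
# Apply the rule
kubectl apply -f my-app-alerts.yaml

# Verify it was created in the cluster
kubectl get prometheusrule my-app-alerts -n monitoring
# Then confirm the alert appears at http://localhost:9090/alerts
# (see "Viewing Active Alerts" below)
```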
### Viewing Active Alerts
Via Prometheus UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/alerts
```

Via AlertManager UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093
```
Via Discord: Active alerts are automatically sent to the configured Discord channel.
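
Active alerts can also be listed from the AlertManager API once the port-forward above is running. A sketch (the `jq` filter is optional and assumes `jq` is installed):

```bash
# List currently firing alerts via the AlertManager v2 API
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
```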
### Silencing Alerts

Temporary Silence via AlertManager:

1. Access AlertManager UI (see above)
2. Find the alert
3. Click "Silence"
4. Set duration and reason
5. Confirm
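
Silences can also be created from the command line with `amtool`, assuming it is installed locally and the port-forward above is active. A sketch; the matcher is an example, not a required alert name:

```bash
# Silence a specific alert for 2 hours
amtool silence add alertname=HomePortalHighMemory \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --author="ops" \
  --comment="Investigating memory usage"
```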
Permanent Disable:

```bash
# Edit the PrometheusRule to remove a specific rule
kubectl edit prometheusrule <rule-name> -n monitoring

# Or delete the entire PrometheusRule
kubectl delete prometheusrule <rule-name> -n monitoring
```
## Alert Severity Levels
| Severity | Use Case | Example |
|---|---|---|
| critical | System down, data loss risk | App offline, node down |
| warning | Performance degraded, may lead to critical | High latency, high memory |
| info | Informational, no action needed | Deployment completed |
Inhibition Rules (see the configuration sketch below):

- Critical alerts suppress warnings for the same issue
- Warnings suppress info alerts for the same issue
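A sketch of how such inhibition rules are typically expressed in `alertmanager.yaml`; the actual rules in `/root/k8s/alertmanager-config.yaml` may use different matchers.

```yaml
# Higher-severity alerts mute lower-severity ones for the same alert and namespace.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'namespace']
  - source_matchers:
      - severity="warning"
    target_matchers:
      - severity="info"
    equal: ['alertname', 'namespace']
```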
## Troubleshooting

### Alerts Not Firing

Check that Prometheus is evaluating rules:

```bash
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus | grep -i "rule\|alert"
```
Check PrometheusRule syntax:
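One way to do this (a sketch; `promtool` is optional and only needed for the second step):

```bash
# Inspect the PrometheusRule as stored in the cluster
kubectl get prometheusrule <rule-name> -n monitoring -o yaml

# Optionally lint the rule group locally. Note that promtool expects a plain
# Prometheus rule file (the contents of spec.groups), not the full manifest;
# my-rules.yaml is a hypothetical extracted file.
promtool check rules my-rules.yaml
```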
Test alert expression in Prometheus UI:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/graph
# Paste the alert expr and execute
```
### Alerts Not Reaching Discord
Check AlertManager logs:
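A likely command, assuming the operator-generated pod follows the usual `alertmanager-<name>-0` naming:

```bash
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager | grep -i "discord\|error\|429"
```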
Common issues:

- Invalid webhook URL
- Discord rate limiting (429 errors)
- Network connectivity

Test the webhook directly:

```bash
curl -H "Content-Type: application/json" \
  -d '{"content": "Test message"}' \
  https://discord.com/api/webhooks/YOUR_WEBHOOK_URL
```
### Discord Rate Limiting

If you see 429 errors in the AlertManager logs, Discord is rate limiting. Solutions:

- Increase `group_interval` in the AlertManager config (default: 5m; see the sketch below)
- Reduce alert frequency by adjusting `for:` durations in rules
- Use inhibition rules to suppress duplicate alerts
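A sketch of the relevant route settings in `alertmanager.yaml`; the values are examples, not the current configuration.

```yaml
route:
  receiver: discord
  group_wait: 30s
  group_interval: 10m    # raised from the 5m default to batch notifications
  repeat_interval: 12h   # resend still-firing alerts less often
```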
## K3s-Specific Considerations

### False Positive Alerts - RESOLVED

K3s combines components differently than standard Kubernetes. The default kube-prometheus-stack includes alert rules for components that K3s embeds into the main process.

These alerts were false positives on K3s:

- KubeProxyDown (K3s uses kube-proxy differently)
- KubeSchedulerDown (the scheduler is embedded in the k3s process)
- KubeControllerManagerDown (the controller manager is embedded)
Resolution:

The problematic PrometheusRules have been deleted:

```bash
kubectl delete prometheusrule -n monitoring \
  kube-prometheus-stack-kubernetes-system-kube-proxy \
  kube-prometheus-stack-kubernetes-system-scheduler \
  kube-prometheus-stack-kubernetes-system-controller-manager
```

These alerts will no longer fire. K3s cluster health is still monitored through:

- Node-level metrics (NodeDown, NodeHighCPU, etc.)
- Pod-level metrics (PodCrashLooping, PodNotReady, etc.)
- kube-apiserver health (the primary K3s component to watch)
## Alert Configuration Files

| File | Purpose |
|---|---|
| `/root/k8s/alertmanager-config.yaml` | AlertManager routing and receivers |
| `/root/k8s/home-portal-alerts.yaml` | Home Portal alert rules |
| `/root/k8s/cluster-alerts.yaml` | Cluster-wide alert rules |
Apply changes:

```bash
# After editing alert rules
kubectl apply -f /root/k8s/<alerts-file>.yaml

# After editing the AlertManager config
kubectl create secret generic alertmanager-discord-config \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
```
## Best Practices

- **Start with few critical alerts** - Avoid alert fatigue
- **Use appropriate `for:` durations** - Prevent flapping alerts
- **Group related alerts** - Use inhibition rules
- **Test thoroughly** - Verify alerts fire correctly before relying on them
- **Document alert response** - What to do when each alert fires
- **Review regularly** - Adjust thresholds based on actual behavior
- **Silence during maintenance** - Use AlertManager silences
## References
- Prometheus Alerting
- AlertManager Configuration
- PrometheusRule CRD
- Discord Webhook API: https://discord.com/developers/docs/resources/webhook
For additional help:
- Observability Standards: /root/tower-fleet/docs/reference/observability-standards.md
- Loki Operations: /root/tower-fleet/docs/operations/loki-operations.md