
Alerting & Notifications Guide

Last Updated: 2025-11-12 | Status: Production

Overview

This guide covers the alerting and notification system configured for the K3s cluster, including AlertManager, Discord integration, and alert rules.

Architecture

PrometheusRules (Alert Definitions)
        │
        ▼
Prometheus → AlertManager → Discord Webhook

Components:
- Prometheus: Evaluates alert rules based on metrics
- AlertManager: Routes and manages alert notifications
- Discord: Receives formatted alert notifications via webhook
- PrometheusRules: CRDs defining when alerts fire
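
To see how these pieces are wired together, you can inspect the Prometheus custom resource managed by the operator. The resource name below is an assumption based on the kube-prometheus-stack naming used elsewhere in this guide; adjust it if your release name differs.

# Show which PrometheusRule objects Prometheus loads and where it sends alerts
# (the Prometheus CR name is assumed, not confirmed from the cluster)
kubectl get prometheus kube-prometheus-stack-prometheus -n monitoring -o yaml \
  | grep -A 3 -E 'ruleSelector|alerting'
# ruleSelector -> label selector deciding which PrometheusRule CRDs become alert rules
# alerting     -> the AlertManager endpoints that fired alerts are sent to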


Alert Rules

Home Portal Alerts

Location: /root/k8s/home-portal-alerts.yaml

| Alert | Severity | Threshold | Description |
|-------|----------|-----------|-------------|
| HomePortalDown | critical | 2 minutes | Application is not responding |
| HomePortalHighErrorRate | warning | >5% for 5min | High 5xx error rate |
| HomePortalHighLatency | warning | P95 >2s for 10min | Slow response times |
| HomePortalHighMemory | warning | >85% for 10min | Memory usage near limit |
| HomePortalFrequentRestarts | warning | Any restarts in 15min | Pod is crash looping |
| HomePortalMetricsDown | warning | 5 minutes | Metrics endpoint not responding |
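
To show how a row in this table maps to a rule, here is a sketch of what HomePortalHighErrorRate could look like. The metric and label names (http_requests_total with a status label) are assumptions; the real expression lives in /root/k8s/home-portal-alerts.yaml.

# Hypothetical rule body for HomePortalHighErrorRate (>5% 5xx for 5min);
# metric and label names are assumed, not copied from home-portal-alerts.yaml
- alert: HomePortalHighErrorRate
  expr: |
    sum(rate(http_requests_total{namespace="home-portal", status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{namespace="home-portal"}[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
    namespace: home-portal
  annotations:
    summary: "Home Portal 5xx error rate above 5%"
    description: "More than 5% of requests returned 5xx over the last 5 minutes"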

Cluster-Wide Alerts

Location: /root/k8s/cluster-alerts.yaml

Infrastructure:
- NodeDown - Node is offline
- NodeHighCPU - >85% CPU for 15min
- NodeHighMemory - >85% memory for 15min
- NodeHighDiskUsage - <15% free space
- NodeCriticalDiskUsage - <5% free space

Kubernetes Resources:
- PodCrashLooping - Pod restarting frequently
- PodNotReady - Pod stuck in non-running state
- DeploymentReplicasMismatch - Deployment not at desired replicas

Storage (Longhorn):
- LonghornVolumeDegraded - Volume in degraded state
- LonghornVolumeFaulted - Volume in faulted state (critical)

Monitoring Stack:
- PrometheusDown - Prometheus offline
- LokiDown - Loki offline
- GrafanaDown - Grafana offline
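
As a reference for the disk thresholds above, a common pattern uses node_exporter's filesystem metrics. This is a sketch only; the filesystem filters and for: duration in the real /root/k8s/cluster-alerts.yaml may differ.

# Hypothetical shape of NodeHighDiskUsage (<15% free); filters and duration are assumptions
- alert: NodeHighDiskUsage
  expr: |
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.15
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} has less than 15% free disk space"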


Discord Configuration

Webhook Setup

  1. Create Discord Webhook:
     - Open Discord channel
     - Channel Settings → Integrations → Webhooks
     - Create webhook, copy URL

  2. Configure AlertManager (an example receivers section is sketched below):

    # Edit AlertManager config
    vim /root/k8s/alertmanager-config.yaml
    
    # Update webhook URL in receivers section
    # Apply changes
    kubectl create secret generic alertmanager-discord-config \
      -n monitoring \
      --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
      --dry-run=client -o yaml | kubectl apply -f -
    
    # Patch AlertManager to use new config
    kubectl patch alertmanager kube-prometheus-stack-alertmanager \
      -n monitoring \
      --type='json' \
      -p='[{"op": "replace", "path": "/spec/configSecret", "value": "alertmanager-discord-config"}]'
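
The receivers section referenced in step 2 might look roughly like the sketch below. It assumes AlertManager v0.25 or newer, which added a native Discord receiver; the receiver name and templates are illustrative, not copied from the live alertmanager-config.yaml.

# Illustrative receivers block for alertmanager-config.yaml
receivers:
- name: discord
  discord_configs:                  # native Discord support, AlertManager >= 0.25
  - webhook_url: https://discord.com/api/webhooks/YOUR_WEBHOOK_URL
    title: '{{ .CommonLabels.alertname }}'
    message: '{{ .CommonAnnotations.description }}'
# Older AlertManager versions have no discord_configs; a webhook_configs entry
# pointing at a Discord bridge service would be needed instead.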
    

Message Format

Standard Alert:

**Alert:** [AlertName]
**Severity:** [critical/warning/info]
**Namespace:** [namespace]
**Description:** [description]
**Summary:** [summary]

Critical Alert:

🚨 CRITICAL: [AlertName]
🔴 CRITICAL ALERT
[same fields as above]


Managing Alerts

Adding New Alert Rules

For Application-Specific Alerts:

  1. Create PrometheusRule:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      namespace: monitoring
      labels:
        prometheus: kube-prometheus-stack-prometheus
        role: alert-rules
    spec:
      groups:
      - name: my-app
        interval: 30s
        rules:
        - alert: MyAppDown
          expr: up{namespace="my-app"} == 0
          for: 2m
          labels:
            severity: critical
            namespace: my-app
          annotations:
            summary: "My App is down"
            description: "My App has been down for 2+ minutes"
    

  2. Apply:

    kubectl apply -f my-app-alerts.yaml
    

  3. Verify (see also the rules API check below):

    # Check if Prometheus loaded the rules
    kubectl get prometheusrules -n monitoring
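
kubectl get prometheusrules only confirms the CRD exists. To verify Prometheus itself loaded the group, query its rules API through the same port-forward used in the next section; the group name my-app matches the example above.

# Confirm the rule group was loaded by Prometheus (not just created as a CRD)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"my-app"'
# If the group name prints, Prometheus has picked up the new rules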
    

Viewing Active Alerts

Via Prometheus UI:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/alerts

Via AlertManager UI:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093

Via Discord: Active alerts are automatically sent to the configured Discord channel.
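
Via amtool: Alerts can also be listed from the command line. amtool is bundled in the upstream AlertManager image, so this sketch assumes it is available in the pod used in the troubleshooting section below.

# List alerts currently held by AlertManager using amtool
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- amtool alert query --alertmanager.url=http://localhost:9093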

Silencing Alerts

Temporary Silence via AlertManager (an amtool equivalent is sketched after these steps):

  1. Access AlertManager UI (see above)
  2. Find the alert
  3. Click "Silence"
  4. Set duration and reason
  5. Confirm
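
The same silence can be created without the UI via amtool inside the AlertManager pod; the matcher, duration, and comment below are examples only.

# Example amtool silence (matcher and values are illustrative)
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- amtool silence add alertname=HomePortalHighMemory \
  --author="ops" --comment="planned node maintenance" --duration=2h \
  --alertmanager.url=http://localhost:9093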

Permanently Disable:

# Edit the PrometheusRule
kubectl edit prometheusrule <rule-name> -n monitoring

# Delete specific rule or entire PrometheusRule
kubectl delete prometheusrule <rule-name> -n monitoring

Alert Severity Levels

| Severity | Use Case | Example |
|----------|----------|---------|
| critical | System down, data loss risk | App offline, node down |
| warning | Performance degraded, may lead to critical | High latency, high memory |
| info | Informational, no action needed | Deployment completed |

Inhibition Rules:
- Critical alerts suppress warnings for the same issue
- Warnings suppress info alerts for the same issue
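
These suppressions are implemented with inhibit_rules in the AlertManager config. The sketch below shows the general shape; the equal labels are an assumption about what "the same issue" means and may differ from the live config.

# Illustrative inhibit_rules for alertmanager-config.yaml (equal labels are assumptions)
inhibit_rules:
- source_matchers: [ 'severity = critical' ]
  target_matchers: [ 'severity = warning' ]
  equal: [alertname, namespace]     # only inhibit when these labels match
- source_matchers: [ 'severity = warning' ]
  target_matchers: [ 'severity = info' ]
  equal: [alertname, namespace]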


Troubleshooting

Alerts Not Firing

Check Prometheus is evaluating rules:

kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus | grep -i "rule\|alert"

Check PrometheusRule syntax:

kubectl describe prometheusrule <name> -n monitoring

Test alert expression in Prometheus UI:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/graph
# Paste alert expr and execute

Alerts Not Reaching Discord

Check AlertManager logs:

kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager

Common issues:
- Invalid webhook URL
- Discord rate limiting (429 errors)
- Network connectivity

Test webhook directly:

curl -H "Content-Type: application/json" \
  -d '{"content": "Test message"}' \
  https://discord.com/api/webhooks/YOUR_WEBHOOK_URL

Discord Rate Limiting

If you see 429 errors in AlertManager logs, Discord is rate limiting. Solutions:

  1. Increase group_interval in AlertManager config (default: 5m; see the route sketch after this list)
  2. Reduce alert frequency - adjust for: duration in rules
  3. Use inhibition rules to suppress duplicate alerts
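
The first option lives in the route section of the AlertManager config. The values below sketch a less chatty configuration; they are examples, not the cluster's current settings.

# Illustrative route timings for alertmanager-config.yaml (values are examples)
route:
  receiver: discord                  # must match a receiver defined under receivers
  group_by: [alertname, namespace]   # batch related alerts into a single message
  group_wait: 30s
  group_interval: 10m                # up from the 5m default, so fewer follow-up messages
  repeat_interval: 12h               # re-notify unresolved alerts less often (default 4h)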

K3s-Specific Considerations

False Positive Alerts - RESOLVED

K3s combines components differently than standard Kubernetes. The default kube-prometheus-stack includes alert rules for components that K3s embeds into the main process.

These alerts fired as false positives on K3s:
- KubeProxyDown (K3s uses kube-proxy differently)
- KubeSchedulerDown (Scheduler is embedded in k3s)
- KubeControllerManagerDown (Controller Manager is embedded)

Resolution:

The problematic PrometheusRules have been deleted:

kubectl delete prometheusrule -n monitoring \
  kube-prometheus-stack-kubernetes-system-kube-proxy \
  kube-prometheus-stack-kubernetes-system-scheduler \
  kube-prometheus-stack-kubernetes-system-controller-manager

These alerts will no longer fire. The K3s cluster health is still monitored through:
- Node-level metrics (NodeDown, NodeHighCPU, etc.)
- Pod-level metrics (PodCrashLooping, PodNotReady, etc.)
- kube-apiserver health (which is the primary K3s component to watch)


Alert Configuration Files

| File | Purpose |
|------|---------|
| /root/k8s/alertmanager-config.yaml | AlertManager routing and receivers |
| /root/k8s/home-portal-alerts.yaml | Home Portal alert rules |
| /root/k8s/cluster-alerts.yaml | Cluster-wide alert rules |

Apply changes:

# After editing alert rules
kubectl apply -f /root/k8s/<alerts-file>.yaml

# After editing AlertManager config
kubectl create secret generic alertmanager-discord-config \
  -n monitoring \
  --from-file=alertmanager.yaml=/root/k8s/alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -


Best Practices

  1. Start with few critical alerts - Avoid alert fatigue
  2. Use appropriate for: durations - Prevent flapping alerts
  3. Group related alerts - Use inhibition rules
  4. Test thoroughly - Verify alerts fire correctly before relying on them
  5. Document alert response - What to do when each alert fires
  6. Review regularly - Adjust thresholds based on actual behavior
  7. Silence during maintenance - Use AlertManager silences

References


For additional help:
- Observability Standards: /root/tower-fleet/docs/reference/observability-standards.md
- Loki Operations: /root/tower-fleet/docs/operations/loki-operations.md