Observability Stack¶
Monitoring, logging, and alerting for the Tower Fleet Kubernetes cluster using Grafana, Loki, and Prometheus.
Overview¶
Dashboard: Grafana at http://10.89.97.231:3000
Logging: Loki with Promtail agents
Metrics: Prometheus (included with Grafana stack)
Namespace: monitoring
Architecture¶
┌─────────────┐
│   Grafana   │  ← User interface (dashboards, alerts)
│    :3000    │
└──────┬──────┘
       │
       ├──────────┐
       │          │
┌──────▼─────┐  ┌─▼────────┐
│    Loki    │  │Prometheus│
│   (logs)   │  │ (metrics)│
└──────▲─────┘  └─▲────────┘
       │          │
       │          │
┌──────┴─────┐    │
│  Promtail  │    │  (scrapes)
│  (agents)  │    │
└────────────┘    │
       ▲          │
       │          │
       └──────────┴── Kubernetes pods/nodes
Components¶
Grafana¶
Access: http://10.89.97.231:3000
Purpose: Visualization, dashboards, and alerting UI

Key Features:
- Pre-built dashboards for Kubernetes metrics
- Log exploration via Loki
- Alert management
- Multi-datasource support (Loki + Prometheus)
Default Credentials: (Check deployment manifests or secrets)
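If Grafana was installed by the kube-prometheus-stack chart (the pod and secret names used elsewhere on this page suggest it was), the admin password typically lives in a secret. The secret name below is an assumption; adjust it to match the actual deployment:
# Assumed secret name for a kube-prometheus-stack Grafana install
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d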
Loki¶
Purpose: Log aggregation and querying
Access: Via Grafana (datasource)
Loki stores logs from all cluster pods and nodes, indexed by labels (namespace, pod, container).
Query Example:
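For instance, using labels that appear elsewhere on this page:
# PostgreSQL errors in the supabase namespace
{namespace="supabase", container="postgres"} |= "ERROR"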
Promtail¶
Purpose: Log collection agent
Deployment: DaemonSet (runs on every node)

Promtail scrapes logs from:
- Container stdout/stderr
- System logs (via /var/log)
- Kubernetes events
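As a rough sketch, Promtail discovers pods via the Kubernetes API and maps pod metadata onto the labels used in queries. The job name and relabeling below are illustrative, not the exact config deployed here:
# Illustrative Promtail scrape config (not the deployed configuration)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Expose namespace/pod/container as Loki labels for querying
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container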
Prometheus¶
Purpose: Metrics collection and time-series database
Access: Via Grafana (datasource)

Prometheus scrapes metrics from:
- Kubernetes API (cluster metrics)
- Node exporters (CPU, RAM, disk)
- Application metrics (if exposed; see the ServiceMonitor sketch below)
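For application metrics, a ServiceMonitor is the usual way to register a scrape target with this stack. A minimal sketch; the app name, labels, and port name are placeholders:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                      # placeholder
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # required for discovery (same convention as PrometheusRule below)
spec:
  selector:
    matchLabels:
      app: my-app                   # must match the Service's labels
  endpoints:
    - port: http                    # named Service port exposing /metrics
      path: /metrics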
Accessing Grafana¶
Via LoadBalancer (Direct)¶
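Grafana is exposed via a LoadBalancer service on the IP listed in the Overview; open it directly in a browser. The service name below matches the one used in the Troubleshooting section:
# Open http://10.89.97.231:3000 in a browser
# Confirm the LoadBalancer EXTERNAL-IP if needed
kubectl get svc -n monitoring grafana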
Via Port Forward (Alternative)¶
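If the LoadBalancer IP is not reachable from your network, a port-forward works from any machine with cluster access (service name and port assumed to match the direct access above):
# Forward local port 3000 to the Grafana service
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Then open http://localhost:3000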
Using Grafana¶
Viewing Logs (Loki)¶
- Navigate to Explore (compass icon)
- Select Loki datasource
- Build query with label filters:
{namespace="supabase"}- All logs from supabase namespace{pod="postgres-0"}- Specific pod logs{namespace="monitoring"} |= "error"- Filter for "error" string
LogQL Examples:
# All Supabase logs
{namespace="supabase"}
# PostgreSQL errors
{namespace="supabase", container="postgres"} |= "ERROR"
# Home portal access logs
{namespace="home-portal"} |~ "GET|POST"
# Log line count over the last 5 minutes (a bare range selector is not a valid query on its own)
count_over_time({namespace="money-tracker"}[5m])
See Loki Operations Guide for advanced queries.
Viewing Metrics (Prometheus)¶
- Navigate to Explore (compass icon)
- Select Prometheus datasource
- Enter PromQL query
PromQL Examples:
# CPU usage per node
rate(node_cpu_seconds_total[5m])
# Pod memory usage
container_memory_usage_bytes{namespace="supabase"}
# Disk usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
Pre-Built Dashboards¶
Grafana includes dashboards for:
- Kubernetes Cluster - Node health, resource usage
- Pod Resources - CPU/memory per pod
- Storage - PVC usage, Longhorn health
- Application Metrics - Custom app dashboards
Import Additional Dashboards:
- Navigate to Dashboards → Import
- Enter dashboard ID from grafana.com/dashboards
- Select Prometheus/Loki datasources
Popular Dashboard IDs:
- 1860: Node Exporter Full
- 13639: Loki Logs
- 11074: Kubernetes Cluster Monitoring
Alerting¶
Current Configuration¶
Alerting is configured via Alertmanager with Discord notifications:
Discord Channel: #captain-hook (via webhook)
Routing:
- severity=critical → Discord with 🚨 prefix
- severity=warning|info → Discord standard format
- Watchdog, InfoInhibitor → Silenced (null receiver)
Timing:
- Group wait: 10s
- Group interval: 10s
- Repeat interval: 12h (alerts re-fire every 12 hours if still active)

Inhibition Rules:
- Critical alerts suppress warning/info for the same alertname+namespace
- Warning alerts suppress info for the same alertname+namespace
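The routing, timing, and inhibition behaviour above maps onto an Alertmanager configuration roughly like this. This is a sketch for orientation only; receiver names are assumptions, and the actual Discord webhook settings live in the generated secret described below:
route:
  receiver: discord-standard        # default: warning/info go to Discord in standard format
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - matchers: ['alertname =~ "Watchdog|InfoInhibitor"']
      receiver: "null"              # silenced
    - matchers: ['severity = "critical"']
      receiver: discord-critical    # 🚨 prefix
inhibit_rules:
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity =~ "warning|info"']
    equal: [alertname, namespace]
  - source_matchers: ['severity = "warning"']
    target_matchers: ['severity = "info"']
    equal: [alertname, namespace]
receivers:
  - name: "null"
  - name: discord-standard          # webhook config stored in the secret
  - name: discord-critical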
Alert Sources¶
Pre-configured alerts (100+), from kube-prometheus-stack:
- Kubernetes: Pod crashes, deployments stuck, resource overcommit
- Nodes: CPU, memory, disk, network issues
- Storage: Longhorn volume degraded/faulted
- Monitoring: Prometheus/Loki/Grafana down
Custom alerts:
- cluster-alerts: NodeDown, PodCrashLooping, DeploymentReplicasMismatch
- hardware-health-alerts: SMART disk health, ZFS pool status
- home-portal-alerts: App-specific HTTP errors, latency, memory
Viewing Alerts¶
# Current firing alerts
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
-c alertmanager -- amtool alert -o simple
# Or via Prometheus
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
-- promtool query instant http://localhost:9090 'ALERTS{alertstate="firing"}'
Creating Custom Alerts¶
Create a PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Required for discovery
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MyApp is down"
            description: "MyApp has been unreachable for >2 minutes"
Modifying Alertmanager Config¶
The Discord webhook is stored in a secret and merged into Alertmanager config:
# View current config
kubectl get secret alertmanager-kube-prometheus-stack-alertmanager-generated \
-n monitoring -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip
# Config is managed via AlertmanagerConfig CRD or helm values
See Alerting Guide for detailed configuration.
Log Retention¶
Loki Retention: Configurable (default: 30 days)
Prometheus Retention: Configurable (default: 15 days)
Adjust retention:
# Loki config
limits_config:
  retention_period: 720h  # 30 days

# Prometheus (kube-prometheus-stack Helm values; equivalent to --storage.tsdb.retention.time=15d)
prometheus:
  prometheusSpec:
    retention: 15d
Storage Usage¶
Monitor observability stack storage:
# Check PVCs
kubectl get pvc -n monitoring
# Loki storage usage
kubectl exec -n monitoring <loki-pod> -- du -sh /loki
# Prometheus storage usage
kubectl exec -n monitoring <prometheus-pod> -- du -sh /prometheus
Performance Tuning¶
Reduce Log Volume¶
Filter noisy logs at Promtail level:
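One option is a drop stage in the Promtail pipeline; the job name and expression below are illustrative:
scrape_configs:
  - job_name: kubernetes-pods
    pipeline_stages:
      # Drop health/readiness probe noise before it is shipped to Loki
      - drop:
          expression: ".*(healthz|readyz).*"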
Optimize Queries¶
- Use specific label matchers: {namespace="supabase"} vs {}
- Limit time range: [5m] vs [24h]
- Use aggregation functions instead of raw queries (see the example below)
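For example, an aggregation over a narrow selector returns a small metric series instead of raw log lines (namespace taken from examples on this page):
# Log line count per pod over the last 5 minutes
sum(count_over_time({namespace="supabase"}[5m])) by (pod)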
Scale Components¶
# Scale Loki replicas
kubectl scale deployment loki -n monitoring --replicas=2
# Increase Prometheus memory (Prometheus is operator-managed here, so edit the
# Prometheus custom resource rather than a deployment)
kubectl edit prometheus -n monitoring kube-prometheus-stack-prometheus
# Set: spec.resources.limits.memory: 4Gi
Troubleshooting¶
Grafana Not Accessible¶
# Check Grafana pod
kubectl get pods -n monitoring -l app=grafana
# Check service
kubectl get svc -n monitoring grafana
# Should show LoadBalancer with EXTERNAL-IP: 10.89.97.231
# View logs
kubectl logs -n monitoring -l app=grafana
Loki Query Timeout¶
# Check Loki pod status
kubectl get pods -n monitoring -l app=loki
# View Loki logs
kubectl logs -n monitoring -l app=loki
# Common issues:
# - Out of memory (increase limits)
# - Too many logs (adjust retention)
# - Slow disk I/O (check Longhorn performance)
Missing Logs¶
# Check Promtail DaemonSet
kubectl get pods -n monitoring -l app=promtail
# View Promtail logs (check for scrape errors)
kubectl logs -n monitoring -l app=promtail
# Verify Promtail is scraping target namespace
kubectl logs -n monitoring <promtail-pod> | grep "supabase"
Prometheus Not Scraping¶
# Check Prometheus pod
kubectl get pods -n monitoring -l app=prometheus
# View Prometheus targets (via port-forward)
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit: http://localhost:9090/targets
# Check for failed scrapes
Best Practices¶
- Use Label Selectors: Always filter logs/metrics by namespace or pod
- Set Appropriate Retention: Balance storage vs historical data needs
- Create Dashboards for Apps: Custom dashboards for each application
- Alert on Key Metrics: CPU, memory, disk, error rates
- Test Alerts: Verify notification channels work
- Document Queries: Save useful LogQL/PromQL queries
- Monitor the Monitors: Alert if Loki/Prometheus/Grafana are down
Reference Queries¶
Useful Loki Queries¶
# Application startup logs
{namespace="home-portal"} |= "Server listening"
# Database migrations
{namespace="supabase", container="postgres"} |= "migration"
# HTTP errors
{namespace="money-tracker"} |~ "HTTP/1.1\" (4|5)\\d{2}"
# Slow queries
{namespace="supabase"} |= "duration" | json | duration > 1000
Useful Prometheus Queries¶
# Top 5 CPU consuming pods
topk(5, rate(container_cpu_usage_seconds_total[5m]))
# Network traffic per namespace
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace)
# Pod restart count
sum(kube_pod_container_status_restarts_total) by (namespace, pod)
Related Documentation:
- Kubernetes Overview
- Loki Operations Guide
- Alert Investigation
- Observability Standards