Observability Stack¶
Monitoring, logging, and alerting for the Tower Fleet Kubernetes cluster using Grafana, Loki, and Prometheus.
Overview¶
Dashboard: Grafana at http://10.89.97.231:3000
Logging: Loki with Promtail agents
Metrics: Prometheus (included with Grafana stack)
Namespace: monitoring
Architecture¶
┌─────────────┐
│   Grafana   │  ← User interface (dashboards, alerts)
│    :3000    │
└──────┬──────┘
       │
       ├──────────┐
       │          │
┌──────▼─────┐  ┌─▼────────┐
│    Loki    │  │Prometheus│
│   (logs)   │  │ (metrics)│
└──────▲─────┘  └─▲────────┘
       │          │
       │          │
┌──────┴─────┐    │
│  Promtail  │    │  (scrapes)
│  (agents)  │    │
└────────────┘    │
       ▲          │
       │          │
       └──────────┴── Kubernetes pods/nodes
Components¶
Grafana¶
Access: http://10.89.97.231:3000
Purpose: Visualization, dashboards, and alerting UI

Key Features:
- Pre-built dashboards for Kubernetes metrics
- Log exploration via Loki
- Alert management
- Multi-datasource support (Loki + Prometheus)
Default Credentials: (Check deployment manifests or secrets)
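If Grafana was installed by the kube-prometheus-stack chart (the pod and secret names used elsewhere on this page suggest it was), the admin password typically lives in a secret. The secret name below is an assumption; adjust it to match the actual deployment:
# Assumed secret name for a kube-prometheus-stack Grafana install
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d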
Loki¶
Purpose: Log aggregation and querying
Access: Via Grafana (datasource)
Loki stores logs from all cluster pods and nodes, indexed by labels (namespace, pod, container).
Query Example:
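For instance, using labels that appear elsewhere on this page:
# PostgreSQL errors in the supabase namespace
{namespace="supabase", container="postgres"} |= "ERROR"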
Promtail¶
Purpose: Log collection agent
Deployment: DaemonSet (runs on every node)

Promtail scrapes logs from:
- Container stdout/stderr
- System logs (via /var/log)
- Kubernetes events
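As a rough sketch, Promtail discovers pods via the Kubernetes API and maps pod metadata onto the labels used in queries. The job name and relabeling below are illustrative, not the exact config deployed here:
# Illustrative Promtail scrape config (not the deployed configuration)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Expose namespace/pod/container as Loki labels for querying
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container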
Prometheus¶
Purpose: Metrics collection and time-series database
Access: Via Grafana (datasource)

Prometheus scrapes metrics from:
- Kubernetes API (cluster metrics)
- Node exporters (CPU, RAM, disk)
- Application metrics (if exposed; see the ServiceMonitor sketch below)
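For application metrics, a ServiceMonitor is the usual way to register a scrape target with this stack. A minimal sketch; the app name, labels, and port name are placeholders:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                      # placeholder
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # required for discovery (same convention as PrometheusRule below)
spec:
  selector:
    matchLabels:
      app: my-app                   # must match the Service's labels
  endpoints:
    - port: http                    # named Service port exposing /metrics
      path: /metrics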
Accessing Grafana¶
Via LoadBalancer (Direct)¶
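Grafana is exposed via a LoadBalancer service on the IP listed in the Overview; open it directly in a browser. The service name below matches the one used in the Troubleshooting section:
# Open http://10.89.97.231:3000 in a browser
# Confirm the LoadBalancer EXTERNAL-IP if needed
kubectl get svc -n monitoring grafana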
Via Port Forward (Alternative)¶
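If the LoadBalancer IP is not reachable from your network, a port-forward works from any machine with cluster access (service name and port assumed to match the direct access above):
# Forward local port 3000 to the Grafana service
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Then open http://localhost:3000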
Using Grafana¶
Viewing Logs (Loki)¶
- Navigate to Explore (compass icon)
- Select Loki datasource
- Build query with label filters:
{namespace="supabase"}- All logs from supabase namespace{pod="postgres-0"}- Specific pod logs{namespace="monitoring"} |= "error"- Filter for "error" string
LogQL Examples:
# All Supabase logs
{namespace="supabase"}
# PostgreSQL errors
{namespace="supabase", container="postgres"} |= "ERROR"
# Home portal access logs
{namespace="home-portal"} |~ "GET|POST"
# Log line count over the last 5 minutes (a bare range selector is not a valid query on its own)
count_over_time({namespace="money-tracker"}[5m])
See Loki Operations Guide for advanced queries.
Viewing Metrics (Prometheus)¶
- Navigate to Explore (compass icon)
- Select Prometheus datasource
- Enter PromQL query
PromQL Examples:
# CPU usage per node
rate(node_cpu_seconds_total[5m])
# Pod memory usage
container_memory_usage_bytes{namespace="supabase"}
# Disk usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
Pre-Built Dashboards¶
Grafana includes dashboards for:
- Kubernetes Cluster - Node health, resource usage
- Pod Resources - CPU/memory per pod
- Storage - PVC usage, Longhorn health
- Application Metrics - Custom app dashboards
Import Additional Dashboards:
- Navigate to Dashboards → Import
- Enter dashboard ID from grafana.com/dashboards
- Select Prometheus/Loki datasources
Popular Dashboard IDs:
- 1860: Node Exporter Full
- 13639: Loki Logs
- 11074: Kubernetes Cluster Monitoring
Alerting¶
Current Configuration¶
Alerting is configured via Alertmanager with Discord notifications:
Discord Channel: #captain-hook (via webhook)
Routing:
- severity=critical → Discord with 🚨 prefix
- severity=warning|info → Discord standard format
- Watchdog, InfoInhibitor → Silenced (null receiver)
Timing:
- Group wait: 10s
- Group interval: 10s
- Repeat interval: 12h (alerts re-fire every 12 hours if still active)

Inhibition Rules:
- Critical alerts suppress warning/info for the same alertname+namespace
- Warning alerts suppress info for the same alertname+namespace
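The routing, timing, and inhibition behaviour above maps onto an Alertmanager configuration roughly like this. This is a sketch for orientation only; receiver names are assumptions, and the actual Discord webhook settings live in the generated secret described below:
route:
  receiver: discord-standard        # default: warning/info go to Discord in standard format
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - matchers: ['alertname =~ "Watchdog|InfoInhibitor"']
      receiver: "null"              # silenced
    - matchers: ['severity = "critical"']
      receiver: discord-critical    # 🚨 prefix
inhibit_rules:
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity =~ "warning|info"']
    equal: [alertname, namespace]
  - source_matchers: ['severity = "warning"']
    target_matchers: ['severity = "info"']
    equal: [alertname, namespace]
receivers:
  - name: "null"
  - name: discord-standard          # webhook config stored in the secret
  - name: discord-critical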
Alert Sources¶
Pre-configured alerts (100+), from kube-prometheus-stack:
- Kubernetes: Pod crashes, deployments stuck, resource overcommit
- Nodes: CPU, memory, disk, network issues
- Storage: Longhorn volume degraded/faulted
- Monitoring: Prometheus/Loki/Grafana down
Custom alerts:
- cluster-alerts: NodeDown, PodCrashLooping, DeploymentReplicasMismatch
- hardware-health-alerts: SMART disk health, ZFS pool status
- home-portal-alerts: App-specific HTTP errors, latency, memory
Viewing Alerts¶
# Current firing alerts
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
-c alertmanager -- amtool alert -o simple
# Or via Prometheus
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
-- promtool query instant http://localhost:9090 'ALERTS{alertstate="firing"}'
Creating Custom Alerts¶
Create a PrometheusRule resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Required for discovery
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppDown
          expr: up{job="my-app"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MyApp is down"
            description: "MyApp has been unreachable for >2 minutes"
Modifying Alertmanager Config¶
The Discord webhook is stored in a secret and merged into Alertmanager config:
# View current config
kubectl get secret alertmanager-kube-prometheus-stack-alertmanager-generated \
-n monitoring -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip
# Config is managed via AlertmanagerConfig CRD or helm values
See Alerting Guide for detailed configuration.
Log Retention¶
Loki Retention: Configurable (default: 30 days)
Prometheus Retention: Configurable (default: 15 days)
Adjust retention:
# Loki config
limits_config:
  retention_period: 720h  # 30 days

# Prometheus (kube-prometheus-stack Helm values; equivalent to --storage.tsdb.retention.time=15d)
prometheus:
  prometheusSpec:
    retention: 15d
Storage Usage¶
Monitor observability stack storage:
# Check PVCs
kubectl get pvc -n monitoring
# Loki storage usage
kubectl exec -n monitoring <loki-pod> -- du -sh /loki
# Prometheus storage usage
kubectl exec -n monitoring <prometheus-pod> -- du -sh /prometheus
Performance Tuning¶
Reduce Log Volume¶
Filter noisy logs at Promtail level:
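One option is a drop stage in the Promtail pipeline; the job name and expression below are illustrative:
scrape_configs:
  - job_name: kubernetes-pods
    pipeline_stages:
      # Drop health/readiness probe noise before it is shipped to Loki
      - drop:
          expression: ".*(healthz|readyz).*"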
Optimize Queries¶
- Use specific label matchers: {namespace="supabase"} vs {}
- Limit time range: [5m] vs [24h]
- Use aggregation functions instead of raw queries (see the example below)
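For example, an aggregation over a narrow selector returns a small metric series instead of raw log lines (namespace taken from examples on this page):
# Log line count per pod over the last 5 minutes
sum(count_over_time({namespace="supabase"}[5m])) by (pod)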
Scale Components¶
# Scale Loki replicas
kubectl scale deployment loki -n monitoring --replicas=2
# Increase Prometheus memory (Prometheus is operator-managed here, so edit the
# Prometheus custom resource rather than a deployment)
kubectl edit prometheus -n monitoring kube-prometheus-stack-prometheus
# Set: spec.resources.limits.memory: 4Gi
Troubleshooting¶
Grafana Not Accessible¶
# Check Grafana pod
kubectl get pods -n monitoring -l app=grafana
# Check service
kubectl get svc -n monitoring grafana
# Should show LoadBalancer with EXTERNAL-IP: 10.89.97.231
# View logs
kubectl logs -n monitoring -l app=grafana
Loki Query Timeout¶
# Check Loki pod status
kubectl get pods -n monitoring -l app=loki
# View Loki logs
kubectl logs -n monitoring -l app=loki
# Common issues:
# - Out of memory (increase limits)
# - Too many logs (adjust retention)
# - Slow disk I/O (check Longhorn performance)
Missing Logs¶
# Check Promtail DaemonSet
kubectl get pods -n monitoring -l app=promtail
# View Promtail logs (check for scrape errors)
kubectl logs -n monitoring -l app=promtail
# Verify Promtail is scraping target namespace
kubectl logs -n monitoring <promtail-pod> | grep "supabase"
Prometheus Not Scraping¶
# Check Prometheus pod
kubectl get pods -n monitoring -l app=prometheus
# View Prometheus targets (via port-forward)
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit: http://localhost:9090/targets
# Check for failed scrapes
Best Practices¶
- Use Label Selectors: Always filter logs/metrics by namespace or pod
- Set Appropriate Retention: Balance storage vs historical data needs
- Create Dashboards for Apps: Custom dashboards for each application
- Alert on Key Metrics: CPU, memory, disk, error rates
- Test Alerts: Verify notification channels work
- Document Queries: Save useful LogQL/PromQL queries
- Monitor the Monitors: Alert if Loki/Prometheus/Grafana are down
Reference Queries¶
Useful Loki Queries¶
# Application startup logs
{namespace="home-portal"} |= "Server listening"
# Database migrations
{namespace="supabase", container="postgres"} |= "migration"
# HTTP errors
{namespace="money-tracker"} |~ "HTTP/1.1\" (4|5)\\d{2}"
# Slow queries
{namespace="supabase"} |= "duration" | json | duration > 1000
Useful Prometheus Queries¶
# Top 5 CPU consuming pods
topk(5, rate(container_cpu_usage_seconds_total[5m]))
# Network traffic per namespace
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace)
# Pod restart count
sum(kube_pod_container_status_restarts_total) by (namespace, pod)
Related Documentation:
- Kubernetes Overview
- Loki Operations Guide
- Alert Investigation
- Observability Standards