Observability Stack

Monitoring, logging, and alerting for the Tower Fleet Kubernetes cluster using Grafana, Loki, and Prometheus.

Overview

Dashboard: Grafana at http://10.89.97.231:3000
Logging: Loki with Promtail agents
Metrics: Prometheus (included with Grafana stack)
Namespace: monitoring
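
A quick way to see what is actually running in the namespace:

# List the observability components and their services
kubectl get pods -n monitoring
kubectl get svc -n monitoring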

Architecture

┌─────────────┐
│   Grafana   │  ← User interface (dashboards, alerts)
│  :3000      │
└──────┬──────┘
       ├──────────┐
       │          │
┌──────▼─────┐ ┌─▼────────┐
│    Loki    │ │Prometheus│
│   (logs)   │ │ (metrics)│
└──────▲─────┘ └─▲────────┘
       │         │
       │         │
┌──────┴─────┐   │
│  Promtail  │   │ (scrapes)
│  (agents)  │   │
└────────────┘   │
  ▲              │
  │              │
  └──────────────┴── Kubernetes pods/nodes

Components

Grafana

Access: http://10.89.97.231:3000
Purpose: Visualization, dashboards, alerting UI

Key Features:
  • Pre-built dashboards for Kubernetes metrics
  • Log exploration via Loki
  • Alert management
  • Multi-datasource support (Loki + Prometheus)

Default Credentials: (Check deployment manifests or secrets)
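
With a Helm-based install, the admin credentials normally live in a Kubernetes secret; the secret and key names below are assumptions and may differ in this cluster:

# Retrieve the Grafana admin password (secret/key names assumed; adjust to match the deployment)
kubectl get secret -n monitoring grafana -o jsonpath='{.data.admin-password}' | base64 -d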

Loki

Purpose: Log aggregation and querying
Access: Via Grafana (datasource)

Loki stores logs from all cluster pods and nodes, indexed by labels (namespace, pod, container).

Query Example:

{namespace="supabase", container="postgres"}

Promtail

Purpose: Log collection agent
Deployment: DaemonSet (runs on every node)

Promtail scrapes logs from:
  • Container stdout/stderr
  • System logs (via /var/log)
  • Kubernetes events
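
Because Promtail is a DaemonSet, a quick sanity check is that its desired/ready pod counts match the node count (pod names are assumed to contain "promtail"):

# One Promtail pod should be scheduled on every node
kubectl get daemonset -n monitoring
kubectl get pods -n monitoring -o wide | grep promtail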

Prometheus

Purpose: Metrics collection and time-series database
Access: Via Grafana (datasource)

Prometheus scrapes metrics from:
  • Kubernetes API (cluster metrics)
  • Node exporters (CPU, RAM, disk)
  • Application metrics (if exposed)
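
With kube-prometheus-stack, application metrics are usually exposed to Prometheus via a ServiceMonitor pointing at the app's Service. A minimal sketch, assuming a hypothetical my-app and that monitors are discovered by the same release label used for PrometheusRules later in this document:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # assumed discovery label
spec:
  selector:
    matchLabels:
      app: my-app      # must match the labels on the app's Service
  endpoints:
  - port: metrics      # named port on the Service that serves /metrics
    interval: 30s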

Accessing Grafana

Via LoadBalancer (Direct)

# Access directly
open http://10.89.97.231:3000

Via Port Forward (Alternative)

kubectl port-forward -n monitoring svc/grafana 3000:80
open http://localhost:3000

Using Grafana

Viewing Logs (Loki)

  1. Navigate to Explore (compass icon)
  2. Select Loki datasource
  3. Build query with label filters:
     • {namespace="supabase"} - All logs from supabase namespace
     • {pod="postgres-0"} - Specific pod logs
     • {namespace="monitoring"} |= "error" - Filter for "error" string

LogQL Examples:

# All Supabase logs
{namespace="supabase"}

# PostgreSQL errors
{namespace="supabase", container="postgres"} |= "ERROR"

# Home portal access logs
{namespace="home-portal"} |~ "GET|POST"

# Log volume over the last 5 minutes (a bare range selector is not a valid
# standalone query; wrap it in a range aggregation)
count_over_time({namespace="money-tracker"}[5m])

See Loki Operations Guide for advanced queries.

Viewing Metrics (Prometheus)

  1. Navigate to Explore (compass icon)
  2. Select Prometheus datasource
  3. Enter PromQL query

PromQL Examples:

# CPU usage per node (percentage of non-idle time)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Pod memory usage
container_memory_usage_bytes{namespace="supabase"}

# Disk usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Pre-Built Dashboards

Grafana includes dashboards for:
  • Kubernetes Cluster - Node health, resource usage
  • Pod Resources - CPU/memory per pod
  • Storage - PVC usage, Longhorn health
  • Application Metrics - Custom app dashboards

Import Additional Dashboards:

  1. Navigate to Dashboards → Import
  2. Enter dashboard ID from grafana.com/dashboards
  3. Select Prometheus/Loki datasources

Popular Dashboard IDs:
  • 1860: Node Exporter Full
  • 13639: Loki Logs
  • 11074: Kubernetes Cluster Monitoring
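
As an alternative to entering the ID in step 2, the dashboard JSON can be downloaded from grafana.com and pasted into the import dialog (URL pattern assumed):

# Download dashboard 1860 (Node Exporter Full) as JSON
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download -o node-exporter-full.json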

Alerting

Current Configuration

Alerting is configured via Alertmanager with Discord notifications:

Discord Channel: #captain-hook (via webhook)

Routing:
  • severity=critical → Discord with 🚨 prefix
  • severity=warning|info → Discord standard format
  • Watchdog, InfoInhibitor → Silenced (null receiver)

Timing:
  • Group wait: 10s
  • Group interval: 10s
  • Repeat interval: 12h (alerts re-fire every 12 hours if still active)

Inhibition Rules:
  • Critical alerts suppress warning/info for the same alertname+namespace
  • Warning alerts suppress info for the same alertname+namespace
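
A minimal sketch of what this routing could look like in Alertmanager configuration; receiver names, the title template, and the webhook placeholder are assumptions, and the live config should always be read from the Alertmanager secret (see "Modifying Alertmanager Config" below):

route:
  receiver: discord            # default: warning/info use the standard Discord format
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
  - matchers:
    - alertname =~ "Watchdog|InfoInhibitor"
    receiver: "null"           # silenced
  - matchers:
    - severity = "critical"
    receiver: discord-critical

receivers:
- name: "null"
- name: discord
  discord_configs:
  - webhook_url: <discord-webhook-url>   # stored in a secret in practice
- name: discord-critical
  discord_configs:
  - webhook_url: <discord-webhook-url>
    title: '🚨 {{ .CommonLabels.alertname }}'

inhibit_rules:
- source_matchers: ['severity = "critical"']
  target_matchers: ['severity =~ "warning|info"']
  equal: [alertname, namespace]
- source_matchers: ['severity = "warning"']
  target_matchers: ['severity = "info"']
  equal: [alertname, namespace]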

Alert Sources

Pre-configured alerts (100+), from kube-prometheus-stack:
  • Kubernetes: Pod crashes, deployments stuck, resource overcommit
  • Nodes: CPU, memory, disk, network issues
  • Storage: Longhorn volume degraded/faulted
  • Monitoring: Prometheus/Loki/Grafana down

Custom alerts:
  • cluster-alerts: NodeDown, PodCrashLooping, DeploymentReplicasMismatch
  • hardware-health-alerts: SMART disk health, ZFS pool status
  • home-portal-alerts: App-specific HTTP errors, latency, memory

Viewing Alerts

# Current firing alerts
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- amtool alert -o simple

# Or via Prometheus
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -- promtool query instant http://localhost:9090 'ALERTS{alertstate="firing"}'

Creating Custom Alerts

Create a PrometheusRule resource:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Required for discovery
spec:
  groups:
  - name: my-app
    rules:
    - alert: MyAppDown
      expr: up{job="my-app"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "MyApp is down"
        description: "MyApp has been unreachable for >2 minutes"
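
After applying the rule, confirm it was created and that Prometheus loaded it (the operator reloads Prometheus automatically; pod name as used elsewhere in this document):

# The PrometheusRule object should exist in the app namespace
kubectl get prometheusrule -n my-app my-app-alerts

# Once the alert is pending or firing it appears in the ALERTS metric
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -- promtool query instant http://localhost:9090 'ALERTS{alertname="MyAppDown"}'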

Modifying Alertmanager Config

The Discord webhook is stored in a secret and merged into Alertmanager config:

# View current config
kubectl get secret alertmanager-kube-prometheus-stack-alertmanager-generated \
  -n monitoring -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip

# Config is managed via AlertmanagerConfig CRD or helm values

See Alerting Guide for detailed configuration.

Log Retention

Loki Retention: Configurable (default: 30 days)
Prometheus Retention: Configurable (default: 15 days)

Adjust retention:

# Loki config (retention also requires the compactor with retention_enabled: true)
limits_config:
  retention_period: 720h  # 30 days

# Prometheus retention (kube-prometheus-stack Helm values; Prometheus takes this
# as a startup flag, not a config-file setting)
prometheus:
  prometheusSpec:
    retention: 15d

Storage Usage

Monitor observability stack storage:

# Check PVCs
kubectl get pvc -n monitoring

# Loki storage usage
kubectl exec -n monitoring <loki-pod> -- du -sh /loki

# Prometheus storage usage
kubectl exec -n monitoring <prometheus-pod> -- du -sh /prometheus

Performance Tuning

Reduce Log Volume

Filter noisy logs at Promtail level:

# Promtail config
pipeline_stages:
  - drop:
      expression: "healthcheck|ping"  # Drop health check logs
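
The change only takes effect once Promtail reloads its config; with the usual DaemonSet deployment that means a rollout (DaemonSet name assumed):

# Restart Promtail so the new pipeline stages are picked up
kubectl rollout restart daemonset/promtail -n monitoring
kubectl rollout status daemonset/promtail -n monitoring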

Optimize Queries

  • Use specific label matchers: {namespace="supabase"} vs {}
  • Limit time range: [5m] vs [24h]
  • Use aggregation functions instead of raw queries
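
For example, on the last point: when only a count is needed, a metric-style query is far cheaper than streaming raw log lines:

# Error count per pod over the last 5 minutes, instead of returning every matching line
sum by (pod) (count_over_time({namespace="supabase"} |= "ERROR" [5m]))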

Scale Components

# Scale Loki replicas (Loki may be a StatefulSet, depending on how it was installed)
kubectl scale deployment loki -n monitoring --replicas=2

# Increase Prometheus memory (Prometheus is managed by the operator, so edit the
# Prometheus custom resource rather than a Deployment)
kubectl edit prometheus -n monitoring kube-prometheus-stack-prometheus
# Set: spec.resources.limits.memory: 4Gi

Troubleshooting

Grafana Not Accessible

# Check Grafana pod
kubectl get pods -n monitoring -l app=grafana

# Check service
kubectl get svc -n monitoring grafana
# Should show LoadBalancer with EXTERNAL-IP: 10.89.97.231

# View logs
kubectl logs -n monitoring -l app=grafana

Loki Query Timeout

# Check Loki pod status
kubectl get pods -n monitoring -l app=loki

# View Loki logs
kubectl logs -n monitoring -l app=loki

# Common issues:
# - Out of memory (increase limits)
# - Too many logs (adjust retention)
# - Slow disk I/O (check Longhorn performance)

Missing Logs

# Check Promtail DaemonSet
kubectl get pods -n monitoring -l app=promtail

# View Promtail logs (check for scrape errors)
kubectl logs -n monitoring -l app=promtail

# Verify Promtail is scraping target namespace
kubectl logs -n monitoring <promtail-pod> | grep "supabase"

Prometheus Not Scraping

# Check Prometheus pod
kubectl get pods -n monitoring -l app=prometheus

# View Prometheus targets (via port-forward)
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit: http://localhost:9090/targets

# Check for failed scrapes through the port-forward
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"down"' | wc -l

Best Practices

  1. Use Label Selectors: Always filter logs/metrics by namespace or pod
  2. Set Appropriate Retention: Balance storage vs historical data needs
  3. Create Dashboards for Apps: Custom dashboards for each application
  4. Alert on Key Metrics: CPU, memory, disk, error rates
  5. Test Alerts: Verify notification channels work (see the example after this list)
  6. Document Queries: Save useful LogQL/PromQL queries
  7. Monitor the Monitors: Alert if Loki/Prometheus/Grafana are down
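
For point 5, a throwaway alert can be injected straight into Alertmanager with amtool; this sketch reuses the pod name shown earlier, and the label values are arbitrary:

# Fire a test alert and confirm it reaches the Discord channel
kubectl exec -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager -- amtool alert add alertname=TestAlert severity=info \
  --annotation=summary="Test notification" --alertmanager.url=http://localhost:9093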

Reference Queries

Useful Loki Queries

# Application startup logs
{namespace="home-portal"} |= "Server listening"

# Database migrations
{namespace="supabase", container="postgres"} |= "migration"

# HTTP errors
{namespace="money-tracker"} |~ "HTTP/1.1\" (4|5)\\d{2}"

# Slow queries
{namespace="supabase"} |= "duration" | json | duration > 1000

Useful Prometheus Queries

# Top 5 CPU-consuming pods
topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))

# Network traffic per namespace
sum(rate(container_network_transmit_bytes_total[5m])) by (namespace)

# Pod restart count
sum(kube_pod_container_status_restarts_total) by (namespace, pod)

Related Documentation:
  • Kubernetes Overview
  • Loki Operations Guide
  • Alert Investigation
  • Observability Standards