
Observability Standards

This document defines the standard observability approach for all applications deployed in the Kubernetes cluster.


Overview

Every application must implement three pillars of observability:

  1. Metrics - Quantitative measurements of application behavior
  2. Logs - Event records of what happened
  3. Traces - Request flow through the system (future enhancement)

Standard Metrics Collection

Level 1: Infrastructure Metrics (Required for ALL apps)

Automatically collected by Kubernetes - no code changes needed:

  • Pod metrics: CPU, memory, network I/O
  • Container metrics: Restarts, status, resource usage
  • Service metrics: Endpoint availability, request counts

Implementation:

  • Collected by kube-state-metrics and node-exporter
  • Available in Prometheus immediately after deployment
  • Create a Grafana dashboard using standard queries (see template below)

Standard Dashboard Panels (example queries below):

  1. Pod Status (up/down)
  2. Container Restarts
  3. CPU Usage (%)
  4. Memory Usage (bytes)
  5. Network I/O (bytes/sec)
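
These panels can be built from the standard cAdvisor and kube-state-metrics series already scraped by the stack; the queries below are typical starting points (substitute your own namespace and pod name prefix):

# Pod status (1 = up)
up{namespace="home-portal", pod=~"home-portal-.*"}

# Container restarts over the last 15 minutes
increase(kube_pod_container_status_restarts_total{namespace="home-portal"}[15m])

# CPU usage (% of one core)
sum(rate(container_cpu_usage_seconds_total{namespace="home-portal", pod=~"home-portal-.*"}[5m])) * 100

# Memory usage (bytes)
sum(container_memory_working_set_bytes{namespace="home-portal", pod=~"home-portal-.*"})

# Network I/O (bytes/sec)
sum(rate(container_network_receive_bytes_total{namespace="home-portal"}[5m]))
sum(rate(container_network_transmit_bytes_total{namespace="home-portal"}[5m]))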

Level 2: Application Metrics (Recommended)

Custom metrics exposed by your application using prom-client:

Step 1: Install prom-client

npm install prom-client

Step 2: Create Metrics Registry

// lib/metrics/registry.ts
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

export const register = new Registry();

// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({
  register,
  prefix: 'app_name_'  // Replace with your app name
});

// HTTP metrics
export const httpRequestsTotal = new Counter({
  name: 'app_name_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

export const httpRequestDuration = new Histogram({
  name: 'app_name_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

export const httpErrors = new Counter({
  name: 'app_name_http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

// Auth metrics
export const authLoginAttempts = new Counter({
  name: 'app_name_auth_login_attempts_total',
  help: 'Total number of login attempts',
  labelNames: ['status'],  // success/failure
  registers: [register],
});
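
// Gauge (imported above) suits values that can go up and down between scrapes.
// Hypothetical example - adjust the metric name and semantics for your app:
export const activeSessions = new Gauge({
  name: 'app_name_active_sessions',
  help: 'Number of currently active user sessions',
  registers: [register],
});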

// Add more application-specific metrics as needed

Step 3: Create Helper Functions

// lib/metrics/collectors.ts
import {
  httpRequestsTotal,
  httpRequestDuration,
  httpErrors,
  authLoginAttempts
} from './registry';

export function recordHttpRequest(
  method: string,
  path: string,
  statusCode: number,
  durationMs: number
) {
  const status = String(statusCode);
  const durationSeconds = durationMs / 1000;

  httpRequestsTotal.inc({ method, path, status });
  httpRequestDuration.observe({ method, path, status }, durationSeconds);

  if (statusCode >= 400) {
    httpErrors.inc({ method, path, status });
  }
}

export function recordAuthAttempt(success: boolean) {
  authLoginAttempts.inc({ status: success ? 'success' : 'failure' });
}

export function startTimer() {
  const start = Date.now();
  return () => Date.now() - start;
}

Step 4: Create Metrics Middleware

// lib/metrics/middleware.ts
import { NextRequest, NextResponse } from 'next/server';
import { recordHttpRequest } from './collectors';

type RouteHandler = (
  req: NextRequest,
  context?: any
) => Promise<NextResponse> | NextResponse;

export function withMetrics(
  handler: RouteHandler,
  routeName?: string
): RouteHandler {
  return async (req: NextRequest, context?: any) => {
    const startTime = Date.now();
    const method = req.method;
    const path = routeName || new URL(req.url).pathname;

    try {
      const response = await handler(req, context);
      const duration = Date.now() - startTime;
      recordHttpRequest(method, path, response.status, duration);
      return response;
    } catch (error) {
      const duration = Date.now() - startTime;
      recordHttpRequest(method, path, 500, duration);
      throw error;
    }
  };
}
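
Note on label cardinality: new URL(req.url).pathname records one label value per concrete URL, so wrap dynamic routes with an explicit routeName to keep the path label bounded:

// Hypothetical dynamic route: app/api/posts/[id]/route.ts
export const GET = withMetrics(handler, '/api/posts/[id]');  // one stable path label instead of one per post ID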

Step 5: Expose Metrics Endpoint

// app/api/metrics/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { register } from '@/lib/metrics/registry';

export async function GET(req: NextRequest) {
  try {
    const metrics = await register.metrics();
    return new NextResponse(metrics, {
      status: 200,
      headers: { 'Content-Type': register.contentType },
    });
  } catch (error) {
    console.error('Error generating metrics:', error);
    return NextResponse.json(
      { error: 'Failed to generate metrics' },
      { status: 500 }
    );
  }
}
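
Depending on the Next.js version and build settings, route handlers can be statically optimized and serve cached output; if scrapes return stale values, forcing dynamic rendering is one fix (a sketch, verify against your Next.js version):

// app/api/metrics/route.ts (alongside the GET handler)
export const dynamic = 'force-dynamic';  // evaluate the route on every Prometheus scrape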

Step 6: Wrap API Routes

// Example: app/api/some-endpoint/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { withMetrics } from '@/lib/metrics/middleware';
import { recordAuthAttempt, startTimer } from '@/lib/metrics/collectors';

async function handler(request: NextRequest) {
  const timer = startTimer();  // call timer() later if you want elapsed ms for a custom duration metric

  try {
    // Your route logic here
    const result = await doSomething();

    // Record custom metrics
    recordAuthAttempt(true);

    return NextResponse.json(result);
  } catch (error) {
    recordAuthAttempt(false);
    throw error;
  }
}

// Wrap with metrics middleware
export const POST = withMetrics(handler, '/api/some-endpoint');

Step 7: Allow Unauthenticated Access to Metrics

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

export async function middleware(request: NextRequest) {
  // ... existing auth logic ...

  // Allow unauthenticated access to /api/metrics for Prometheus scraping
  if (request.nextUrl.pathname.startsWith('/api/metrics')) {
    return NextResponse.next();
  }

  // ... rest of your middleware ...
}
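
An alternative is to exclude the metrics path in the middleware matcher so the middleware never runs for it at all; a sketch using Next.js matcher syntax (adjust the other exclusions to your app):

// middleware.ts
export const config = {
  // Run the middleware on everything except the metrics endpoint and Next.js static assets
  matcher: ['/((?!api/metrics|_next/static|_next/image|favicon.ico).*)'],
};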

Standard application metrics to track:

  • http_requests_total{method,status,path} - Total HTTP requests by method, status code, path
  • http_request_duration_seconds{method,path} - Request latency histogram
  • http_errors_total{method,path,status} - HTTP errors by endpoint
  • auth_login_attempts_total{status} - Authentication attempts (success/failure)
  • process_cpu_* - CPU usage (from default metrics)
  • process_resident_memory_bytes - Memory usage (from default metrics)
  • nodejs_eventloop_lag_* - Event loop lag (from default metrics)

Application-specific metrics examples:

  • Blog operations: blog_posts_created_total, blog_images_uploaded_total{status}
  • Database: db_queries_total{operation}, db_query_duration_seconds{operation}
  • Cache: cache_hits_total, cache_misses_total
  • Business: user_signups_total, feature_usage_total{feature}

Reference implementation: See home-portal (/root/projects/home-portal/lib/metrics/) for a complete working example.

Level 3: Business Metrics (Optional, app-specific)

Application-specific business metrics:

  • User signups
  • Feature usage counts
  • Transaction volumes
  • Custom domain metrics


Standard Logging

Logging Standards

ALL applications must:

  1. Log to stdout/stderr - Kubernetes captures these automatically
  2. Use structured logging (JSON format preferred)
  3. Include standard fields in every log entry
  4. Use appropriate log levels

Log Levels

FATAL   - Application cannot continue, immediate action required
ERROR   - Operation failed, but app continues (e.g., failed API call)
WARN    - Something unexpected, but handled (e.g., deprecated feature used)
INFO    - General informational messages (e.g., "Server started")
DEBUG   - Detailed information for debugging (not in production)
TRACE   - Very detailed, fine-grained (development only)

Standard Log Fields

Every log entry should include:

{
  "timestamp": "2025-11-11T17:30:00.000Z",
  "level": "INFO",
  "service": "home-portal",
  "environment": "production",
  "pod": "home-portal-abc123",
  "message": "User logged in successfully",
  "userId": "user-123",
  "requestId": "req-456",
  "duration": 150,
  "metadata": {
    "ip": "10.89.97.100",
    "userAgent": "Mozilla/5.0..."
  }
}

For Next.js Applications

Use a logging library:

npm install pino pino-pretty

Create a logger:

// lib/logger.ts
import pino from 'pino'

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => {
      return { level: label.toUpperCase() }
    },
  },
  base: {
    service: 'home-portal',
    environment: process.env.NODE_ENV,
    pod: process.env.HOSTNAME,
  },
})

Use in your app:

import { logger } from '@/lib/logger'

// Info log
logger.info({ userId: user.id }, 'User logged in')

// Error log
logger.error({ error: err.message, stack: err.stack }, 'Failed to fetch data')

// With request context
logger.info({ 
  requestId: req.id,
  method: req.method,
  path: req.path,
  duration: Date.now() - startTime
}, 'Request completed')
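
For per-request context, pino child loggers avoid repeating the same fields on every call; a minimal sketch:

// Create a child logger once per request; every entry inherits requestId automatically
const requestLogger = logger.child({ requestId: req.id })

requestLogger.info('Fetching user profile')
requestLogger.error({ error: err.message }, 'Profile lookup failed')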

What to Log

DO log:

  • Application startup/shutdown
  • User authentication events (success/failure)
  • API calls (requests/responses at INFO level)
  • Database operations (at DEBUG level)
  • Errors and exceptions (with stack traces)
  • Performance issues (slow queries, timeouts)
  • Security events (unauthorized access attempts)

DO NOT log (see the redaction sketch below):

  • Passwords or secrets
  • Full credit card numbers
  • Personally identifiable information (PII) without proper masking
  • Session tokens or API keys
  • Full request/response bodies containing sensitive data
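
pino can also mask sensitive fields before they reach stdout via its redact option; a minimal sketch, where the listed field paths are examples to replace with the keys your app actually logs:

// lib/logger.ts (extended) - same logger as above, plus redaction of sensitive fields
import pino from 'pino'

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'home-portal',
    environment: process.env.NODE_ENV,
    pod: process.env.HOSTNAME,
  },
  // Example field paths - replace with the keys your app actually logs
  redact: {
    paths: ['password', 'token', 'apiKey', 'req.headers.authorization'],
    censor: '[REDACTED]',
  },
})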


Kubernetes Deployment Standards

1. Labels

Required labels on all resources:

metadata:
  labels:
    app: home-portal           # Application name
    version: v1.0.0            # Version number
    component: frontend        # Component type (frontend, backend, database)
    environment: production    # Environment

2. Annotations

Useful for metrics:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
    prometheus.io/path: "/api/metrics"

3. Health Checks

Always include liveness and readiness probes:

spec:
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /api/health
        port: 3000
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /api/health
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3

Health endpoint example:

// app/api/health/route.ts
import { NextResponse } from 'next/server'

export async function GET() {
  // Check database connection
  const dbHealthy = await checkDatabase()

  // Check external dependencies
  const supabaseHealthy = await checkSupabase()

  const healthy = dbHealthy && supabaseHealthy

  return NextResponse.json(
    {
      status: healthy ? 'healthy' : 'unhealthy',
      timestamp: new Date().toISOString(),
      checks: {
        database: dbHealthy ? 'ok' : 'failed',
        supabase: supabaseHealthy ? 'ok' : 'failed',
      }
    },
    { status: healthy ? 200 : 503 }
  )
}
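
checkDatabase and checkSupabase above are application-specific. One hedged sketch of a reusable helper is to wrap any dependency check in a timeout so a hung dependency fails the probe instead of blocking it:

// Hypothetical helper: run any async dependency check with a timeout so a hung
// dependency fails the probe instead of blocking it indefinitely.
async function checkWithTimeout(check: () => Promise<unknown>, ms = 2000): Promise<boolean> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('health check timed out')), ms)
  )
  try {
    await Promise.race([check(), timeout])
    return true
  } catch {
    return false
  }
}

// Usage (assumes a "db" client with a query method exists in your app):
// const dbHealthy = await checkWithTimeout(() => db.query('SELECT 1'))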

4. Resource Limits

Always define requests and limits:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Prometheus & Grafana Setup

ServiceMonitor (for apps with custom metrics)

Only create if your app exposes /api/metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: home-portal
  namespace: home-portal
  labels:
    app: home-portal
    release: kube-prometheus-stack  # Required for Prometheus to discover it
spec:
  selector:
    matchLabels:
      app: home-portal
  endpoints:
  - port: http
    path: /api/metrics
    interval: 30s
    scrapeTimeout: 10s
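
Note that port: http refers to a named port on the Service selected by matchLabels, so the Service must declare that name. A sketch of a matching Service, using the values from the examples above:

apiVersion: v1
kind: Service
metadata:
  name: home-portal
  namespace: home-portal
  labels:
    app: home-portal          # Must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: home-portal
  ports:
  - name: http                # Must match "port: http" in the ServiceMonitor endpoint
    port: 3000
    targetPort: 3000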

PrometheusRule (Alerts)

Standard alerts for every app:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {app}-alerts
  namespace: {namespace}
  labels:
    release: kube-prometheus-stack  # Required
spec:
  groups:
  - name: {app}
    interval: 30s
    rules:
    # Pod down
    - alert: {App}PodDown
      expr: up{namespace="{namespace}", pod=~"{app}-.*"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "{App} pod is down"
        description: "Pod {{ $labels.pod }} has been down for >2min"

    # High CPU
    - alert: {App}HighCPU
      expr: sum(rate(container_cpu_usage_seconds_total{namespace="{namespace}", pod=~"{app}-.*"}[5m])) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{App} high CPU usage"
        description: "CPU usage >80% for >5min"

    # High Memory
    - alert: {App}HighMemory
      expr: sum(container_memory_working_set_bytes{namespace="{namespace}", pod=~"{app}-.*"}) / sum(container_spec_memory_limit_bytes{namespace="{namespace}", pod=~"{app}-.*"}) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{App} high memory usage"
        description: "Memory usage >85% of limit for >5min"

    # Frequent restarts
    - alert: {App}FrequentRestarts
      expr: rate(kube_pod_container_status_restarts_total{namespace="{namespace}", pod=~"{app}-.*"}[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{App} container restarting"
        description: "Container restarting frequently"

    # Pod not ready
    - alert: {App}PodNotReady
      expr: kube_pod_status_ready{namespace="{namespace}", pod=~"{app}-.*", condition="true"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{App} pod not ready"
        description: "Pod not ready for >5min"

Grafana Dashboard

Store as ConfigMap for GitOps:

apiVersion: v1
kind: ConfigMap
metadata:
  name: {app}-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # Grafana will auto-import
data:
  {app}-dashboard.json: |
    {
      "title": "{App} - Standard Metrics",
      "uid": "{app}-std",
      # ... dashboard JSON ...
    }

Standard panels to include (example queries for the app-level panels follow this list):

  1. Pod Status
  2. Container Restarts
  3. CPU Usage
  4. Memory Usage
  5. Network I/O
  6. HTTP Request Rate (if app metrics available)
  7. HTTP Error Rate (if app metrics available)
  8. Response Time (if app metrics available)
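
For the application-level panels, the prom-client metrics defined earlier can be queried directly; typical expressions, assuming the app_name_ prefix from the registry example:

# HTTP request rate by path and status
sum(rate(app_name_http_requests_total[5m])) by (path, status)

# HTTP error rate (% of all requests)
sum(rate(app_name_http_errors_total[5m])) / sum(rate(app_name_http_requests_total[5m])) * 100

# P95 response time (seconds)
histogram_quantile(0.95, sum(rate(app_name_http_request_duration_seconds_bucket[5m])) by (le))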

Troubleshooting: Dashboard Not Appearing

Problem: Dashboard ConfigMap created but doesn't appear in Grafana

Common Cause: Wrong JSON format for file-based provisioning

Explanation:

  • File-based provisioning (via ConfigMap) requires unwrapped JSON - the dashboard object directly
  • API imports use the wrapped format: {"dashboard": {...}, "overwrite": true}

Check your JSON format:

# WRONG - Wrapped format (for API imports only)
{
  "dashboard": {
    "title": "My Dashboard",
    "uid": "my-dash",
    ...
  },
  "overwrite": true
}

# CORRECT - Unwrapped format (for ConfigMap/file provisioning)
{
  "title": "My Dashboard",
  "uid": "my-dash",
  ...
}

Fix:

If you accidentally used the wrapped format, unwrap it:

# Unwrap the dashboard JSON
jq '.dashboard' /path/to/dashboard.json > /path/to/dashboard-fixed.json

# Update the ConfigMap
kubectl create configmap my-dashboard \
  --from-file=/path/to/dashboard-fixed.json \
  -n monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

# Ensure label is set
kubectl label configmap my-dashboard -n monitoring grafana_dashboard=1 --overwrite

Verify:

# Check sidecar picked up the dashboard
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana-sc-dashboard --tail=20 | grep -i "my-dashboard"

# Should see: Writing /tmp/dashboards/my-dashboard.json

Additional checks:

  • Verify the label: kubectl get cm -n monitoring my-dashboard -o yaml | grep grafana_dashboard
  • Restart Grafana if needed: kubectl rollout restart deployment -n monitoring kube-prometheus-stack-grafana
  • Wait 30-60 seconds for the sidecar to detect changes


Implementation Checklist

For Every New Application

  • [ ] Deployment manifest includes:
    • [ ] Standard labels (app, version, component, environment)
    • [ ] Resource requests and limits
    • [ ] Liveness probe
    • [ ] Readiness probe
    • [ ] Environment variables from ConfigMap/Secret

  • [ ] Observability configured:
    • [ ] Grafana dashboard created (ConfigMap in monitoring namespace)
    • [ ] PrometheusRule alerts created
    • [ ] Application logs to stdout/stderr
    • [ ] Structured logging implemented (JSON format)

  • [ ] Level 2 metrics (optional but recommended):
    • [ ] /api/health endpoint implemented
    • [ ] /api/metrics endpoint implemented (Prometheus format)
    • [ ] ServiceMonitor created if exposing custom metrics
    • [ ] Custom application metrics instrumented

  • [ ] Documentation:
    • [ ] Application documented in /root/k8s/docs/applications/{app}.md
    • [ ] Deployment process documented
    • [ ] Runbook for common issues created

Example: home-portal Reference Implementation

See /root/tower-fleet/docs/applications/home-portal.md for a complete example of this standard in action.

Kubernetes resources:

  • Deployment: home-portal namespace with LoadBalancer service (http://10.89.97.213)
  • ServiceMonitor: /root/k8s/home-portal-servicemonitor.yaml
  • Grafana Dashboard: ConfigMap home-portal-dashboard in monitoring namespace

Application code (Level 2 metrics):

  • Metrics Registry: /root/projects/home-portal/lib/metrics/registry.ts
  • Collectors: /root/projects/home-portal/lib/metrics/collectors.ts
  • Middleware: /root/projects/home-portal/lib/metrics/middleware.ts
  • Metrics Endpoint: /root/projects/home-portal/app/api/metrics/route.ts

Metrics exposed:

  • HTTP requests, duration, errors (with method, path, status labels)
  • Authentication events (login attempts, logouts, session refreshes)
  • Blog operations (posts created, images uploaded with success/failure)
  • Node.js process metrics (CPU, memory, event loop lag)

Dashboard panels:

  1. HTTP Request Rate (by endpoint and status)
  2. HTTP Request Duration (P50, P95, P99 percentiles)
  3. HTTP Errors by Status Code
  4. Total Requests and Error Rate stats
  5. Authentication Metrics (logins, logouts, refreshes)
  6. Blog Operations (posts, image uploads)
  7. Memory Usage (RSS and heap)
  8. CPU Usage (user and system)
  9. Event Loop Lag
  10. Blog Operation Duration (P95)
  11. Service Health Status

To view Grafana:

# Access Grafana
http://10.89.97.211

# Default credentials
Username: admin
Password: prom-operator

To view Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090


Future Enhancements

See Tower Fleet Roadmap - Observability for planned enhancements, including:

  • Distributed Tracing (Phase 2): Jaeger/Tempo with trace IDs, HTTP instrumentation
  • Application Performance Monitoring (APM): Performance profiling, slow query tracking
  • Synthetic Monitoring: External health checks and uptime monitoring

Level 2 Logging: Centralized Log Aggregation

Status: Implemented (home-portal reference)

Centralized log aggregation provides a unified view of application logs across the cluster, enabling efficient debugging, error tracking, and system monitoring.

Architecture

The logging stack consists of three main components:

  1. Loki - Log aggregation and storage (monolithic mode)
  2. Promtail - Log collection agent (DaemonSet on all nodes)
  3. Grafana - Unified visualization (logs + metrics)

Kubernetes Pods → Container Logs → Promtail (DaemonSet) → Loki (Storage) → Grafana (Visualization)

Implementation

1. Deploy Loki

Configuration: /root/k8s/loki-values.yaml

# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update grafana

# Deploy Loki
helm install loki grafana/loki \
  -n monitoring \
  -f /root/k8s/loki-values.yaml

Key configuration details:

  • Mode: SingleBinary (monolithic)
  • Storage: 20Gi Longhorn PVC with filesystem backend
  • Retention: 30 days with compactor enabled
  • Resources: 200m-500m CPU, 512Mi-1Gi memory

IMPORTANT: When enabling retention, you must configure delete_request_store:

loki:
  compactor:
    retention_enabled: true
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    delete_request_store: filesystem  # Required!

2. Deploy Promtail

Configuration: /root/k8s/promtail-values.yaml

# Deploy Promtail
helm install promtail grafana/promtail \
  -n monitoring \
  -f /root/k8s/promtail-values.yaml

Promtail configuration:

  • DaemonSet: Runs on all nodes (including the master)
  • Scrape configs: Supports both standard and JSON logging
  • Label extraction: Namespace, pod, container, app labels
  • Resources: 50m-200m CPU, 128Mi-256Mi memory per pod

3. Configure Grafana Datasource

Configuration: /root/k8s/loki-datasource.yaml

# Apply datasource ConfigMap
kubectl apply -f /root/k8s/loki-datasource.yaml

The datasource is automatically discovered by Grafana's sidecar via the grafana_datasource: "1" label.

4. Create Dashboards

Create dashboard ConfigMaps and label them for auto-discovery:

# Create dashboard ConfigMap
kubectl create configmap my-logs-dashboard \
  --from-file=/path/to/dashboard.json \
  -n monitoring

# Label for auto-discovery
kubectl label configmap my-logs-dashboard \
  -n monitoring \
  grafana_dashboard=1

IMPORTANT: Ensure dashboard JSON is in unwrapped format (direct dashboard object), not wrapped format ({"dashboard": {...}}). See Troubleshooting section for details.

LogQL Query Examples

Basic Queries:

# All logs from home-portal namespace
{namespace="home-portal"}

# Logs from specific pod
{namespace="home-portal", pod="home-portal-app-xyz"}

# Logs from specific container
{namespace="home-portal", container="home-portal"}

Filtering with Line Filters:

# Error logs
{namespace="home-portal"} |~ "(?i)(error|fail|exception)"

# HTTP 4xx/5xx errors
{namespace="home-portal"} |~ "(4[0-9]{2}|5[0-9]{2})"

# Warning logs
{namespace="home-portal"} |~ "(?i)(warn|warning)"

# Exclude health checks
{namespace="home-portal"} != "/health"

Aggregations and Metrics:

# Count logs over time
count_over_time({namespace="home-portal"}[5m])

# Rate of error logs
rate({namespace="home-portal"} |~ "error"[5m])

# Total errors in 15 minutes
sum(count_over_time({namespace="home-portal"} |~ "(?i)error"[15m]))

JSON Log Parsing:

# Parse JSON and filter by level
{namespace="home-portal"} | json | level="error"

# Extract and filter by status code
{namespace="home-portal"} | json | status >= 400

Dashboard Examples

Home Portal - Logs & Metrics Combined (/root/k8s/home-portal-logs-dashboard.json):

  • Error rate and request metrics
  • Live log tail for all pods
  • Filtered error logs (4xx/5xx)
  • HTTP error breakdown by endpoint
  • Request duration percentiles
  • Memory and event loop metrics

Cluster-Wide Logs (/root/k8s/cluster-logs-dashboard.json):

  • Total log volume statistics
  • Error and warning counts
  • Live logs by namespace
  • Filtered error/warning views across all namespaces

Best Practices

1. Structured Logging

Use JSON format for application logs to enable rich querying:

// Good - structured JSON logging
console.log(JSON.stringify({
  level: 'error',
  message: 'Failed to process request',
  userId: user.id,
  path: req.path,
  statusCode: 500,
  error: error.message
}))

// Less ideal - plain text
console.error('Failed to process request')

2. Label Strategy

Keep labels minimal and high-cardinality values in the log line:

# Good - use pod/namespace labels
{namespace="home-portal", pod="home-portal-app"}

# Bad - don't create labels for high-cardinality data
# {user_id="12345"}  ← This will create too many label combinations

3. Log Retention

Balance storage costs with debugging needs:

  • Development: 7-14 days
  • Production: 30-90 days
  • Compliance: As required by regulations

4. Query Optimization

  • Use specific label filters to reduce data scanned
  • Limit time ranges when possible
  • Use limit parameter for large result sets
  • Consider using --since for recent logs

Validation

After deployment, verify the stack is working:

# 1. Check Loki pod
kubectl get pod loki-0 -n monitoring
# Should show 2/2 Running

# 2. Check Promtail DaemonSet
kubectl get daemonset promtail -n monitoring
# Should show 3/3 pods (or matching node count)

# 3. Check datasource
kubectl get configmap loki-datasource -n monitoring

# 4. Test log ingestion
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -H "Content-Type: application/json" \
  -XPOST "http://localhost:3100/loki/api/v1/push" \
  --data-raw '{"streams": [{"stream": {"job": "test"}, "values": [["'$(date +%s)000000000'", "test log"]]}]}'

# 5. Query test log
curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="test"}' | jq

Troubleshooting

Loki Pod CrashLoopBackOff

If you see errors like:

CONFIG ERROR: invalid compactor config: compactor.delete-request-store should be configured when retention is enabled

Solution: Add delete_request_store: filesystem to the compactor configuration (see Implementation section above).

Promtail Not Collecting Logs

Check Promtail logs:

kubectl logs -n monitoring -l app.kubernetes.io/name=promtail

Common issues:

  • Incorrect volume mounts (check /var/log/pods)
  • Permissions (Promtail runs privileged)
  • Loki connection (check loki:3100 reachability)

No Logs Appearing in Grafana

  1. Verify the Loki datasource is configured: open Grafana → Configuration → Data Sources; a "Loki" datasource should be listed.

  2. Check Grafana sidecar logs:

    kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana-sc-datasources

  3. Test the Loki API directly:

    kubectl port-forward -n monitoring svc/loki 3100:3100
    curl "http://localhost:3100/loki/api/v1/label"

Dashboard Not Appearing

See the "Troubleshooting: Dashboard Not Appearing" section earlier in this document for detailed steps.

Cost and Resource Considerations

Storage:

  • Logs are compressed (typically 5-10x compression)
  • 20Gi of storage supports roughly 2-3 weeks of logs for moderate-traffic applications
  • Monitor Loki PVC usage: kubectl get pvc -n monitoring | grep loki

Memory:

  • Loki: 512Mi-1Gi per replica
  • Promtail: 128Mi-256Mi per node
  • Query performance degrades with large time ranges

Operations

Backup Logs

Loki stores data in the PVC. Backup strategies:

  1. Longhorn volume snapshots
  2. Export via LogCLI for specific time ranges
  3. S3 backend (future enhancement)

Scaling

For increased log volume:

  1. Increase the Loki PVC size
  2. Add more CPU/memory to the Loki pod
  3. Consider microservices mode (separate read/write/backend)

Monitoring Loki Itself

Loki exposes Prometheus metrics at :3100/metrics:

  • loki_ingester_streams_created_total - Stream creation rate
  • loki_request_duration_seconds - Query performance
  • loki_ingester_chunk_age_seconds - Data freshness

A ServiceMonitor is automatically created when deploying with the provided values.

---

Last Updated: 2025-11-11
Maintained By: Infrastructure Team
Status: Active - home-portal fully implemented as reference