Observability Standards¶
This document defines the standard observability approach for all applications deployed in the Kubernetes cluster.
Overview¶
Every application must implement three pillars of observability:
- Metrics - Quantitative measurements of application behavior
- Logs - Event records of what happened
- Traces - Request flow through the system (future enhancement)
Standard Metrics Collection¶
Level 1: Infrastructure Metrics (Required for ALL apps)¶
Automatically collected by Kubernetes - no code changes needed:
- Pod metrics: CPU, memory, network I/O
- Container metrics: Restarts, status, resource usage
- Service metrics: Endpoint availability, request counts
Implementation:
- Collected by kube-state-metrics and node-exporter
- Available in Prometheus immediately after deployment
- Create Grafana dashboard using standard queries (see template below)
Standard Dashboard Panels:
1. Pod Status (up/down)
2. Container Restarts
3. CPU Usage (%)
4. Memory Usage (bytes)
5. Network I/O (bytes/sec)
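These panels can be built from the standard kube-state-metrics and cAdvisor series. A hedged sketch of typical panel queries, using the same {app}/{namespace} placeholders as the alert rules later in this document:

# Pod status (up/down)
up{namespace="{namespace}", pod=~"{app}-.*"}

# Container restarts (last 15 minutes)
increase(kube_pod_container_status_restarts_total{namespace="{namespace}", pod=~"{app}-.*"}[15m])

# CPU usage (%)
sum(rate(container_cpu_usage_seconds_total{namespace="{namespace}", pod=~"{app}-.*"}[5m])) * 100

# Memory usage (bytes)
sum(container_memory_working_set_bytes{namespace="{namespace}", pod=~"{app}-.*"})

# Network I/O (bytes/sec)
sum(rate(container_network_receive_bytes_total{namespace="{namespace}", pod=~"{app}-.*"}[5m]))
sum(rate(container_network_transmit_bytes_total{namespace="{namespace}", pod=~"{app}-.*"}[5m]))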
Level 2: Application Metrics (Recommended for production apps)¶
Custom metrics exposed by your application using prom-client:
Step 1: Install prom-client¶
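The install step itself, assuming an npm-based project (use the yarn or pnpm equivalent as needed):

npm install prom-client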
Step 2: Create Metrics Registry¶
// lib/metrics/registry.ts
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

export const register = new Registry();

// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({
  register,
  prefix: 'app_name_' // Replace with your app name
});

// HTTP metrics
export const httpRequestsTotal = new Counter({
  name: 'app_name_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

export const httpRequestDuration = new Histogram({
  name: 'app_name_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

export const httpErrors = new Counter({
  name: 'app_name_http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

// Auth metrics
export const authLoginAttempts = new Counter({
  name: 'app_name_auth_login_attempts_total',
  help: 'Total number of login attempts',
  labelNames: ['status'], // success/failure
  registers: [register],
});

// Add more application-specific metrics as needed
Step 3: Create Helper Functions¶
// lib/metrics/collectors.ts
import {
  httpRequestsTotal,
  httpRequestDuration,
  httpErrors,
  authLoginAttempts
} from './registry';

export function recordHttpRequest(
  method: string,
  path: string,
  statusCode: number,
  durationMs: number
) {
  const status = String(statusCode);
  const durationSeconds = durationMs / 1000;

  httpRequestsTotal.inc({ method, path, status });
  httpRequestDuration.observe({ method, path, status }, durationSeconds);

  if (statusCode >= 400) {
    httpErrors.inc({ method, path, status });
  }
}

export function recordAuthAttempt(success: boolean) {
  authLoginAttempts.inc({ status: success ? 'success' : 'failure' });
}

// Returns a function that reports elapsed milliseconds when invoked
export function startTimer() {
  const start = Date.now();
  return () => Date.now() - start;
}
Step 4: Create Metrics Middleware¶
// lib/metrics/middleware.ts
import { NextRequest, NextResponse } from 'next/server';
import { recordHttpRequest } from './collectors';

type RouteHandler = (
  req: NextRequest,
  context?: any
) => Promise<NextResponse> | NextResponse;

export function withMetrics(
  handler: RouteHandler,
  routeName?: string
): RouteHandler {
  return async (req: NextRequest, context?: any) => {
    const startTime = Date.now();
    const method = req.method;
    const path = routeName || new URL(req.url).pathname;

    try {
      const response = await handler(req, context);
      const duration = Date.now() - startTime;
      recordHttpRequest(method, path, response.status, duration);
      return response;
    } catch (error) {
      const duration = Date.now() - startTime;
      recordHttpRequest(method, path, 500, duration);
      throw error;
    }
  };
}
Step 5: Expose Metrics Endpoint¶
// app/api/metrics/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { register } from '@/lib/metrics/registry';

export async function GET(req: NextRequest) {
  try {
    const metrics = await register.metrics();
    return new NextResponse(metrics, {
      status: 200,
      headers: { 'Content-Type': register.contentType },
    });
  } catch (error) {
    console.error('Error generating metrics:', error);
    return NextResponse.json(
      { error: 'Failed to generate metrics' },
      { status: 500 }
    );
  }
}
Step 6: Wrap API Routes¶
// Example: app/api/some-endpoint/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { withMetrics } from '@/lib/metrics/middleware';
import { recordAuthAttempt, startTimer } from '@/lib/metrics/collectors';

async function handler(request: NextRequest) {
  const timer = startTimer(); // call timer() later to get elapsed ms for custom duration metrics

  try {
    // Your route logic here
    const result = await doSomething();

    // Record custom metrics
    recordAuthAttempt(true);

    return NextResponse.json(result);
  } catch (error) {
    recordAuthAttempt(false);
    throw error;
  }
}

// Wrap with metrics middleware
export const POST = withMetrics(handler, '/api/some-endpoint');
Step 7: Allow Unauthenticated Access to Metrics¶
// middleware.ts
export async function middleware(request: NextRequest) {
  // Allow unauthenticated access to /api/metrics for Prometheus scraping.
  // This check must run before any auth checks, or the scrape will be redirected.
  if (request.nextUrl.pathname.startsWith('/api/metrics')) {
    return NextResponse.next();
  }

  // ... existing auth logic ...

  // ... rest of your middleware ...
}
Standard application metrics to track:
- http_requests_total{method,status,path} - Total HTTP requests by method, status code, path
- http_request_duration_seconds{method,path} - Request latency histogram
- http_errors_total{method,path,status} - HTTP errors by endpoint
- auth_login_attempts_total{status} - Authentication attempts (success/failure)
- process_cpu_* - CPU usage (from default metrics)
- process_resident_memory_bytes - Memory usage (from default metrics)
- nodejs_eventloop_lag_* - Event loop lag (from default metrics)
Application-specific metrics examples:
- Blog operations: blog_posts_created_total, blog_images_uploaded_total{status}
- Database: db_queries_total{operation}, db_query_duration_seconds{operation}
- Cache: cache_hits_total, cache_misses_total
- Business: user_signups_total, feature_usage_total{feature}
Reference implementation: See home-portal (/root/projects/home-portal/lib/metrics/) for a complete working example.
Level 3: Business Metrics (Optional, app-specific)¶
Application-specific business metrics:
- User signups
- Feature usage counts
- Transaction volumes
- Custom domain metrics
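A minimal sketch, assuming the prom-client registry from Step 2 above (the file name is illustrative; the metric names mirror the examples listed in the Level 2 section):

// lib/metrics/business.ts (illustrative example)
import { Counter } from 'prom-client';
import { register } from './registry';

export const userSignups = new Counter({
  name: 'app_name_user_signups_total',
  help: 'Total number of user signups',
  registers: [register],
});

export const featureUsage = new Counter({
  name: 'app_name_feature_usage_total',
  help: 'Feature usage count by feature name',
  labelNames: ['feature'],
  registers: [register],
});

// Call from your route handlers, e.g.:
// userSignups.inc();
// featureUsage.inc({ feature: 'blog_editor' });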
Standard Logging¶
Logging Standards¶
ALL applications must:
- Log to stdout/stderr - Kubernetes captures these automatically
- Use structured logging (JSON format preferred)
- Include standard fields in every log entry
- Use appropriate log levels
Log Levels¶
FATAL - Application cannot continue, immediate action required
ERROR - Operation failed, but app continues (e.g., failed API call)
WARN - Something unexpected, but handled (e.g., deprecated feature used)
INFO - General informational messages (e.g., "Server started")
DEBUG - Detailed information for debugging (not in production)
TRACE - Very detailed, fine-grained (development only)
Standard Log Fields¶
Every log entry should include:
{
  "timestamp": "2025-11-11T17:30:00.000Z",
  "level": "INFO",
  "service": "home-portal",
  "environment": "production",
  "pod": "home-portal-abc123",
  "message": "User logged in successfully",
  "userId": "user-123",
  "requestId": "req-456",
  "duration": 150,
  "metadata": {
    "ip": "10.89.97.100",
    "userAgent": "Mozilla/5.0..."
  }
}
For Next.js Applications¶
Use a logging library:
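For example, pino (assuming an npm-based project):

npm install pino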
Create a logger:
// lib/logger.ts
import pino from 'pino'

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => {
      return { level: label.toUpperCase() }
    },
  },
  base: {
    service: 'home-portal',
    environment: process.env.NODE_ENV,
    pod: process.env.HOSTNAME,
  },
})
Use in your app:
import { logger } from '@/lib/logger'

// Info log
logger.info({ userId: user.id }, 'User logged in')

// Error log
logger.error({ error: err.message, stack: err.stack }, 'Failed to fetch data')

// With request context
logger.info({
  requestId: req.id,
  method: req.method,
  path: req.path,
  duration: Date.now() - startTime
}, 'Request completed')
What to Log¶
DO log:
- Application startup/shutdown
- User authentication events (success/failure)
- API calls (requests/responses at INFO level)
- Database operations (at DEBUG level)
- Errors and exceptions (with stack traces)
- Performance issues (slow queries, timeouts)
- Security events (unauthorized access attempts)

DO NOT log:
- Passwords or secrets
- Full credit card numbers
- Personally identifiable information (PII) without proper masking
- Session tokens or API keys
- Full request/response bodies containing sensitive data
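One way to help enforce the "do not log" rules at the logger level is pino's built-in redaction. A minimal sketch (the redact paths are illustrative and should match your own log shape):

// lib/logger.ts (illustrative extension of the pino logger shown above)
import pino from 'pino'

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  redact: {
    // Mask these fields wherever they appear in merged log objects
    paths: ['password', 'token', 'apiKey', 'metadata.authorization'],
    censor: '[REDACTED]',
  },
})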
Kubernetes Deployment Standards¶
1. Labels¶
Required labels on all resources:
metadata:
  labels:
    app: home-portal          # Application name
    version: v1.0.0           # Version number
    component: frontend       # Component type (frontend, backend, database)
    environment: production   # Environment
2. Annotations¶
Useful for metrics:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
    prometheus.io/path: "/api/metrics"
3. Health Checks¶
Always include liveness and readiness probes:
spec:
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /api/health
          port: 3000
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /api/health
          port: 3000
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
Health endpoint example:
// app/api/health/route.ts
import { NextResponse } from 'next/server'

export async function GET() {
  // Check database connection (checkDatabase/checkSupabase are app-specific helpers; a sketch follows below)
  const dbHealthy = await checkDatabase()

  // Check external dependencies
  const supabaseHealthy = await checkSupabase()

  const healthy = dbHealthy && supabaseHealthy

  return NextResponse.json(
    {
      status: healthy ? 'healthy' : 'unhealthy',
      timestamp: new Date().toISOString(),
      checks: {
        database: dbHealthy ? 'ok' : 'failed',
        supabase: supabaseHealthy ? 'ok' : 'failed',
      }
    },
    { status: healthy ? 200 : 503 }
  )
}
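The checkDatabase / checkSupabase helpers above are app-specific. A minimal sketch of the pattern, assuming you have some cheap probe available (the probe itself is a placeholder you supply):

// lib/health.ts (illustrative helper)
// Wraps an app-specific probe in a timeout so a hung dependency
// cannot stall the /api/health endpoint itself.
export async function checkWithTimeout(
  probe: () => Promise<unknown>,
  timeoutMs = 2000
): Promise<boolean> {
  const timeout = new Promise<boolean>((resolve) =>
    setTimeout(() => resolve(false), timeoutMs)
  );
  const result = probe().then(() => true).catch(() => false);
  return Promise.race([result, timeout]);
}

// Usage in the health route, with your own probe function:
// const dbHealthy = await checkWithTimeout(() => db.query('SELECT 1'));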
4. Resource Limits¶
Always define requests and limits:
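A minimal sketch under the deployment's container spec (the values are illustrative starting points, not mandated by this standard; size them from observed usage):

spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi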
Prometheus & Grafana Setup¶
ServiceMonitor (for apps with custom metrics)¶
Only create if your app exposes /api/metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: home-portal
  namespace: home-portal
  labels:
    app: home-portal
    release: kube-prometheus-stack  # Required for Prometheus to discover it
spec:
  selector:
    matchLabels:
      app: home-portal
  endpoints:
    - port: http
      path: /api/metrics
      interval: 30s
      scrapeTimeout: 10s
PrometheusRule (Alerts)¶
Standard alerts for every app:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {app}-alerts
  namespace: {namespace}
  labels:
    release: kube-prometheus-stack  # Required
spec:
  groups:
    - name: {app}
      interval: 30s
      rules:
        # Pod down
        - alert: {App}PodDown
          expr: up{namespace="{namespace}", pod=~"{app}-.*"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{App} pod is down"
            description: "Pod {{ $labels.pod }} has been down for >2min"
        # High CPU
        - alert: {App}HighCPU
          expr: sum(rate(container_cpu_usage_seconds_total{namespace="{namespace}", pod=~"{app}-.*"}[5m])) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{App} high CPU usage"
            description: "CPU usage >80% for >5min"
        # High Memory
        - alert: {App}HighMemory
          expr: sum(container_memory_working_set_bytes{namespace="{namespace}", pod=~"{app}-.*"}) / sum(container_spec_memory_limit_bytes{namespace="{namespace}", pod=~"{app}-.*"}) * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{App} high memory usage"
            description: "Memory usage >85% of limit for >5min"
        # Frequent restarts
        - alert: {App}FrequentRestarts
          expr: rate(kube_pod_container_status_restarts_total{namespace="{namespace}", pod=~"{app}-.*"}[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{App} container restarting"
            description: "Container restarting frequently"
        # Pod not ready
        - alert: {App}PodNotReady
          expr: kube_pod_status_ready{namespace="{namespace}", pod=~"{app}-.*", condition="true"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{App} pod not ready"
            description: "Pod not ready for >5min"
Grafana Dashboard¶
Store as ConfigMap for GitOps:
apiVersion: v1
kind: ConfigMap
metadata:
  name: {app}-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"  # Grafana will auto-import
data:
  {app}-dashboard.json: |
    {
      "title": "{App} - Standard Metrics",
      "uid": "{app}-std",
      # ... dashboard JSON ...
    }
Standard panels to include:
1. Pod Status
2. Container Restarts
3. CPU Usage
4. Memory Usage
5. Network I/O
6. HTTP Request Rate (if app metrics available)
7. HTTP Error Rate (if app metrics available)
8. Response Time (if app metrics available)
Troubleshooting: Dashboard Not Appearing¶
Problem: Dashboard ConfigMap created but doesn't appear in Grafana
Common Cause: Wrong JSON format for file-based provisioning
Explanation:
- File-based provisioning (via ConfigMap) requires unwrapped JSON - the dashboard object directly
- API imports use wrapped format: {"dashboard": {...}, "overwrite": true}
Check your JSON format:
# WRONG - Wrapped format (for API imports only)
{
  "dashboard": {
    "title": "My Dashboard",
    "uid": "my-dash",
    ...
  },
  "overwrite": true
}

# CORRECT - Unwrapped format (for ConfigMap/file provisioning)
{
  "title": "My Dashboard",
  "uid": "my-dash",
  ...
}
Fix:
If you accidentally used the wrapped format, unwrap it:
# Unwrap the dashboard JSON
jq '.dashboard' /path/to/dashboard.json > /path/to/dashboard-fixed.json
# Update the ConfigMap
kubectl create configmap my-dashboard \
--from-file=/path/to/dashboard-fixed.json \
-n monitoring \
--dry-run=client -o yaml | kubectl apply -f -
# Ensure label is set
kubectl label configmap my-dashboard -n monitoring grafana_dashboard=1 --overwrite
Verify:
# Check sidecar picked up the dashboard
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana-sc-dashboard --tail=20 | grep -i "my-dashboard"
# Should see: Writing /tmp/dashboards/my-dashboard.json
Additional checks:
- Verify label: kubectl get cm -n monitoring my-dashboard -o yaml | grep grafana_dashboard
- Restart Grafana if needed: kubectl rollout restart deployment -n monitoring kube-prometheus-stack-grafana
- Wait 30-60 seconds for sidecar to detect changes
Implementation Checklist¶
For Every New Application¶
- [ ] Deployment manifest includes:
  - [ ] Standard labels (app, version, component, environment)
  - [ ] Resource requests and limits
  - [ ] Liveness probe
  - [ ] Readiness probe
- [ ] Environment variables from ConfigMap/Secret
- [ ] Observability configured:
  - [ ] Grafana dashboard created (ConfigMap in monitoring namespace)
  - [ ] PrometheusRule alerts created
  - [ ] Application logs to stdout/stderr
- [ ] Structured logging implemented (JSON format)
- [ ] Level 2 metrics (optional but recommended):
  - [ ] /api/health endpoint implemented
  - [ ] /api/metrics endpoint implemented (Prometheus format)
  - [ ] ServiceMonitor created if exposing custom metrics
- [ ] Custom application metrics instrumented
- [ ] Documentation:
  - [ ] Application documented in /root/k8s/docs/applications/{app}.md
  - [ ] Deployment process documented
  - [ ] Runbook for common issues created
Example: home-portal Reference Implementation¶
See /root/tower-fleet/docs/applications/home-portal.md for a complete example of this standard in action.
Kubernetes resources:
- Deployment: home-portal namespace with LoadBalancer service (http://10.89.97.213)
- ServiceMonitor: /root/k8s/home-portal-servicemonitor.yaml
- Grafana Dashboard: ConfigMap home-portal-dashboard in monitoring namespace
Application code (Level 2 metrics):
- Metrics Registry: /root/projects/home-portal/lib/metrics/registry.ts
- Collectors: /root/projects/home-portal/lib/metrics/collectors.ts
- Middleware: /root/projects/home-portal/lib/metrics/middleware.ts
- Metrics Endpoint: /root/projects/home-portal/app/api/metrics/route.ts
Metrics exposed:
- HTTP requests, duration, errors (with method, path, status labels)
- Authentication events (login attempts, logouts, session refreshes)
- Blog operations (posts created, images uploaded with success/failure)
- Node.js process metrics (CPU, memory, event loop lag)

Dashboard panels:
1. HTTP Request Rate (by endpoint and status)
2. HTTP Request Duration (P50, P95, P99 percentiles)
3. HTTP Errors by Status Code
4. Total Requests and Error Rate stats
5. Authentication Metrics (logins, logouts, refreshes)
6. Blog Operations (posts, image uploads)
7. Memory Usage (RSS and heap)
8. CPU Usage (user and system)
9. Event Loop Lag
10. Blog Operation Duration (P95)
11. Service Health Status
To view Grafana:
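A typical port-forward, assuming the default kube-prometheus-stack Grafana service (adjust the service name and ports to your install):

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000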
To view Prometheus:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090
Future Enhancements¶
See Tower Fleet Roadmap - Observability for planned enhancements, including:
- Distributed Tracing (Phase 2): Jaeger/Tempo with trace IDs, HTTP instrumentation
- Application Performance Monitoring (APM): Performance profiling, slow query tracking
- Synthetic Monitoring: External health checks and uptime monitoring
Level 2 Logging: Centralized Log Aggregation¶
Status: Implemented (home-portal reference)
Centralized log aggregation provides a unified view of application logs across the cluster, enabling efficient debugging, error tracking, and system monitoring.
Architecture¶
The logging stack consists of three main components:
- Loki - Log aggregation and storage (monolithic mode)
- Promtail - Log collection agent (DaemonSet on all nodes)
- Grafana - Unified visualization (logs + metrics)
Implementation¶
1. Deploy Loki
Configuration: /root/k8s/loki-values.yaml
# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update grafana
# Deploy Loki
helm install loki grafana/loki \
-n monitoring \
-f /root/k8s/loki-values.yaml
Key configuration details:
- Mode: SingleBinary (monolithic)
- Storage: 20Gi Longhorn PVC with filesystem backend
- Retention: 30 days with compactor enabled
- Resources: 200m-500m CPU, 512Mi-1Gi memory
IMPORTANT: When enabling retention, you must configure delete_request_store:
loki:
  compactor:
    retention_enabled: true
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    delete_request_store: filesystem  # Required!
2. Deploy Promtail
Configuration: /root/k8s/promtail-values.yaml
# Deploy Promtail
helm install promtail grafana/promtail \
-n monitoring \
-f /root/k8s/promtail-values.yaml
Promtail configuration:
- DaemonSet: Runs on all nodes (including master)
- Scrape configs: Supports both standard and JSON logging
- Label extraction: Namespace, pod, container, app labels
- Resources: 50m-200m CPU, 128Mi-256Mi memory per pod
3. Configure Grafana Datasource
Configuration: /root/k8s/loki-datasource.yaml
The datasource is automatically discovered by Grafana's sidecar via the grafana_datasource: "1" label.
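A minimal sketch of such a datasource ConfigMap (the name and namespace mirror the file referenced above; the URL assumes Loki's in-cluster service at loki:3100, so adjust to your deployment):

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"  # Picked up by the Grafana sidecar
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
        isDefault: false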
4. Create Dashboards
Create dashboard ConfigMaps and label them for auto-discovery:
# Create dashboard ConfigMap
kubectl create configmap my-logs-dashboard \
--from-file=/path/to/dashboard.json \
-n monitoring
# Label for auto-discovery
kubectl label configmap my-logs-dashboard \
-n monitoring \
grafana_dashboard=1
IMPORTANT: Ensure dashboard JSON is in unwrapped format (direct dashboard object), not wrapped format ({"dashboard": {...}}). See Troubleshooting section for details.
LogQL Query Examples¶
Basic Queries:
# All logs from home-portal namespace
{namespace="home-portal"}
# Logs from specific pod
{namespace="home-portal", pod="home-portal-app-xyz"}
# Logs from specific container
{namespace="home-portal", container="home-portal"}
Filtering with Line Filters:
# Error logs
{namespace="home-portal"} |~ "(?i)(error|fail|exception)"
# HTTP 4xx/5xx errors
{namespace="home-portal"} |~ "(4[0-9]{2}|5[0-9]{2})"
# Warning logs
{namespace="home-portal"} |~ "(?i)(warn|warning)"
# Exclude health checks
{namespace="home-portal"} != "/health"
Aggregations and Metrics:
# Count logs over time
count_over_time({namespace="home-portal"}[5m])
# Rate of error logs
rate({namespace="home-portal"} |~ "error"[5m])
# Total errors in 15 minutes
sum(count_over_time({namespace="home-portal"} |~ "(?i)error"[15m]))
JSON Log Parsing:
# Parse JSON and filter by level
{namespace="home-portal"} | json | level="error"
# Extract and filter by status code
{namespace="home-portal"} | json | status >= 400
Dashboard Examples¶
Home Portal - Logs & Metrics Combined (/root/k8s/home-portal-logs-dashboard.json):
- Error rate and request metrics
- Live log tail for all pods
- Filtered error logs (4xx/5xx)
- HTTP error breakdown by endpoint
- Request duration percentiles
- Memory and event loop metrics
Cluster-Wide Logs (/root/k8s/cluster-logs-dashboard.json):
- Total log volume statistics
- Error and warning counts
- Live logs by namespace
- Filtered error/warning views across all namespaces
Best Practices¶
1. Structured Logging
Use JSON format for application logs to enable rich querying:
// Good - structured JSON logging
console.log(JSON.stringify({
  level: 'error',
  message: 'Failed to process request',
  userId: user.id,
  path: req.path,
  statusCode: 500,
  error: error.message
}))

// Less ideal - plain text
console.error('Failed to process request')
2. Label Strategy
Keep labels minimal and high-cardinality values in the log line:
# Good - use pod/namespace labels
{namespace="home-portal", pod="home-portal-app"}
# Bad - don't create labels for high-cardinality data
# {user_id="12345"} ← This will create too many label combinations
3. Log Retention
Balance storage costs with debugging needs:
- Development: 7-14 days
- Production: 30-90 days
- Compliance: As required by regulations
4. Query Optimization
- Use specific label filters to reduce data scanned
- Limit time ranges when possible
- Use the limit parameter for large result sets
- Consider using --since for recent logs
Validation¶
After deployment, verify the stack is working:
# 1. Check Loki pod
kubectl get pod loki-0 -n monitoring
# Should show 2/2 Running
# 2. Check Promtail DaemonSet
kubectl get daemonset promtail -n monitoring
# Should show 3/3 pods (or matching node count)
# 3. Check datasource
kubectl get configmap loki-datasource -n monitoring
# 4. Test log ingestion
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -H "Content-Type: application/json" \
-XPOST "http://localhost:3100/loki/api/v1/push" \
--data-raw '{"streams": [{"stream": {"job": "test"}, "values": [["'$(date +%s)000000000'", "test log"]]}]}'
# 5. Query test log
curl -s "http://localhost:3100/loki/api/v1/query_range" \
--data-urlencode 'query={job="test"}' | jq
Troubleshooting¶
Loki Pod CrashLoopBackOff
If you see errors like:
CONFIG ERROR: invalid compactor config: compactor.delete-request-store should be configured when retention is enabled
Solution: Add delete_request_store: filesystem to the compactor configuration (see Implementation section above).
Promtail Not Collecting Logs
Check Promtail logs:
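For example (tails one pod from the DaemonSet):

kubectl logs -n monitoring daemonset/promtail --tail=50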
Common issues:
- Incorrect volume mounts (check /var/log/pods)
- Permissions (Promtail runs privileged)
- Loki connection (check loki:3100 reachability)
No Logs Appearing in Grafana
- Verify the Loki datasource is configured:
  - Open Grafana → Configuration → Data Sources
  - You should see a "Loki" datasource
- Check the Grafana sidecar logs (see the commands below)
- Test the Loki API directly (see the commands below)
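Hedged examples of those last two checks (the sidecar container name and Loki service/port assume the kube-prometheus-stack and Loki installs referenced earlier):

# Grafana datasource sidecar logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -c grafana-sc-datasources --tail=20

# Query Loki directly via a port-forward
kubectl port-forward -n monitoring svc/loki 3100:3100 &
curl -s "http://localhost:3100/loki/api/v1/labels" | jq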
Dashboard Not Appearing
See the "Troubleshooting: Dashboard Not Appearing" section earlier in this document for detailed steps.
Cost and Resource Considerations¶
Storage:
- Logs are compressed (typically 5-10x compression)
- 20Gi storage supports ~2-3 weeks for moderate traffic applications
- Monitor Loki PVC usage: kubectl get pvc -n monitoring | grep loki
Memory:
- Loki: 512Mi-1Gi per replica
- Promtail: 128Mi-256Mi per node
- Query performance degrades with large time ranges
Operations¶
Backup Logs
Loki stores data in the PVC. Backup strategies:
1. Longhorn volume snapshots
2. Export via LogCLI for specific time ranges
3. S3 backend (future enhancement)
Scaling
For increased log volume:
1. Increase Loki PVC size
2. Add more CPU/memory to Loki pod
3. Consider microservices mode (separate read/write/backend)
Monitoring Loki Itself
Loki exposes Prometheus metrics at :3100/metrics:
- loki_ingester_streams_created_total - Stream creation rate
- loki_request_duration_seconds - Query performance
- loki_ingester_chunk_age_seconds - Data freshness
A ServiceMonitor is automatically created when deploying with the provided values.
---
Resources¶
- Prometheus Naming Best Practices
- Kubernetes Logging Architecture
- The Twelve-Factor App - Logs
- OpenTelemetry - Future standard for traces
Last Updated: 2025-11-11
Maintained By: Infrastructure Team
Status: Active - home-portal fully implemented as reference