Intent System: Observability & Learning Loop

Status: Design
Created: 2025-12-23


Overview

Extend the intent system from "execute and forget" to a closed-loop system that observes outcomes, detects anomalies, files incidents, and iterates on policies.

Propose → Validate → Execute → Observe → Detect → File → Learn → Iterate
   │         │          │         │         │        │       │        │
   ▼         ▼          ▼         ▼         ▼        ▼       ▼        ▼
  LLM     Schema     Scripts   Metrics   Anomaly  Incident  Update   New
  Match   + Policy   + K8s     + Logs    Rules    Drafts   Runbooks  Intents

Goals

  1. Autonomous operation: Agents can execute multi-step tasks with confidence
  2. Full auditability: Every action is logged with context, replayable
  3. Anomaly detection: System notices when outcomes don't match expectations
  4. Incident automation: Automatically draft incident reports for review
  5. Continuous improvement: Learnings feed back into policies and runbooks

Components

1. Outcome Specification

Each intent defines expected outcomes that can be verified:

# Example: deploy-app intent
outcomes:
  success:
    - condition: "pods_ready"
      check: "kubectl get pods -n {{app}} -l app={{app}} -o jsonpath='{.items[*].status.phase}' | grep -v Running | wc -l"
      expected: "0"
      timeout: 300s
    - condition: "endpoint_healthy"
      check: "curl -s -o /dev/null -w '%{http_code}' https://{{app}}.bogocat.com/health"
      expected: "200"
      timeout: 60s

  failure_indicators:
    - pattern: "CrashLoopBackOff"
      severity: P1
    - pattern: "ImagePullBackOff"
      severity: P2
    - pattern: "OOMKilled"
      severity: P1
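
A minimal sketch of what the Phase 1 scripts/check-outcome.sh utility could look like, treating timeout as a polling deadline rather than a hard kill (an assumption; the spec above does not say whether checks retry). The argument order and 5-second poll interval are also assumptions, and the check command is expected to have template parameters like {{app}} already substituted:

#!/usr/bin/env bash
# check-outcome.sh -- sketch: poll one outcome check until it returns the
# expected value or the timeout expires.
# Usage: check-outcome.sh <condition> <expected> <timeout-seconds> <check-command>
set -euo pipefail

condition="$1"; expected="$2"; timeout="$3"; check="$4"
deadline=$(( $(date +%s) + timeout ))

while true; do
  actual="$(bash -c "${check}" 2>/dev/null || true)"
  if [[ "${actual}" == "${expected}" ]]; then
    echo "PASS ${condition}"
    exit 0
  fi
  if (( $(date +%s) >= deadline )); then
    echo "FAIL ${condition}: expected '${expected}', got '${actual}'"
    exit 1
  fi
  sleep 5   # poll interval: an assumption, not part of the spec
done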

2. Execution Context Capture

Every intent execution captures:

execution_record:
  id: "exec-20251223-143052-deploy-home-portal"
  timestamp: "2025-12-23T14:30:52Z"
  intent: "deploy-app"
  params:
    app: "home-portal"
    tag: "latest"

  # Pre-execution state
  before:
    pods: [...snapshot...]
    metrics:
      cpu_usage: 45%
      memory_usage: 62%
    recent_logs: [...last 50 lines...]

  # Execution
  commands_run:
    - cmd: "docker build..."
      exit_code: 0
      duration: 45s
    - cmd: "kubectl apply..."
      exit_code: 0
      duration: 2s

  # Post-execution state
  after:
    pods: [...snapshot...]
    metrics:
      cpu_usage: 48%
      memory_usage: 65%
    outcome_checks:
      pods_ready: pass
      endpoint_healthy: pass

  # Result
  status: "success"
  duration: 180s
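
The before/after snapshots could come from a thin wrapper around kubectl. A sketch of scripts/capture-context.sh (Phase 2), assuming the namespace and app label both match the intent's {{app}} parameter as in the outcome spec above, and that metrics-server is installed for kubectl top:

#!/usr/bin/env bash
# capture-context.sh -- sketch: snapshot cluster state before or after a run.
# Usage: capture-context.sh <app> <before|after> <record-dir>
set -euo pipefail
app="$1"; phase="$2"; dir="$3"
mkdir -p "${dir}/${phase}"

# Pod snapshot (includes restart counts for later comparison)
kubectl get pods -n "${app}" -l app="${app}" -o yaml > "${dir}/${phase}/pods.yaml"

# Resource usage (requires metrics-server; tolerate its absence)
kubectl top pods -n "${app}" --no-headers > "${dir}/${phase}/metrics.txt" || true

# Last 50 log lines, matching the recent_logs field in the record
kubectl logs -n "${app}" -l app="${app}" --tail=50 --prefix > "${dir}/${phase}/recent-logs.txt" || true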

3. Anomaly Detection Rules

Define what constitutes unexpected behavior:

anomaly_rules:
  # Metric-based
  - name: "memory_spike"
    condition: "after.metrics.memory_usage - before.metrics.memory_usage > 20"
    severity: P2
    action: "flag_for_review"

  # Log-based
  - name: "error_rate_increase"
    condition: "error_count_after > error_count_before * 2"
    severity: P2
    action: "create_incident_draft"

  # State-based
  - name: "unexpected_restart"
    condition: "pod_restarts_after > pod_restarts_before"
    severity: P2
    action: "create_incident_draft"

  # Timeout-based
  - name: "slow_deployment"
    condition: "duration > expected_duration * 2"
    severity: P3
    action: "log_warning"

4. Incident Auto-Filing

When anomalies are detected, auto-generate incident drafts:

incident_draft:
  title: "Deploy home-portal: unexpected memory spike"
  severity: P2
  status: draft  # Requires human review before filing

  generated_from:
    execution_id: "exec-20251223-143052-deploy-home-portal"
    anomaly_rule: "memory_spike"

  auto_populated:
    timeline: [...from execution record...]
    before_state: [...snapshot...]
    after_state: [...snapshot...]
    commands_run: [...from execution record...]

  needs_human_input:
    - root_cause
    - lessons_learned
    - action_items
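
A sketch of scripts/draft-incident.sh (Phase 4); the argument list is an assumption, and the output filename follows the pattern shown in the Data Model section below:

#!/usr/bin/env bash
# draft-incident.sh -- sketch: turn a detected anomaly into an incident draft.
# Usage: draft-incident.sh <execution-id> <anomaly-rule> <app> <title>
set -euo pipefail
exec_id="$1"; app="$3"; title="$4"
rule="${2//_/-}"   # normalize to the filename convention (memory-spike)

out="docs/incidents/drafts/draft-$(date +%Y%m%d)-${rule}-${app}.md"
mkdir -p "$(dirname "${out}")"

cat > "${out}" <<EOF
# ${title}

Severity: P2
Status: draft (requires human review before filing)
Generated from: ${exec_id} (rule: ${rule})

## Needs human input
- root_cause:
- lessons_learned:
- action_items:
EOF

echo "wrote ${out}"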

5. Learning Feedback Loop

Track patterns across incidents and executions to drive improvements:

learning_triggers:
  # Same anomaly 3+ times → suggest new policy
  - pattern: "repeated_anomaly"
    threshold: 3
    action: "suggest_policy_update"
    example: "Add memory limit check before deploy"

  # Manual fix repeated → suggest new intent
  - pattern: "repeated_manual_fix"
    threshold: 2
    action: "suggest_new_intent"
    example: "Create 'clear-cache' intent for this common fix"

  # Runbook step frequently needed → add to intent
  - pattern: "post_intent_manual_step"
    threshold: 3
    action: "suggest_intent_enhancement"
    example: "Add health check wait to deploy-app"

Implementation Phases

Phase 1: Outcome Checking (MVP)

Add success/failure verification to existing intents:

  1. Define outcome specs for each intent
  2. Run checks after execution
  3. Log pass/fail with context
  4. Surface failures in intent output

Deliverables:
- intents/outcomes/ folder with outcome specs
- scripts/check-outcome.sh utility
- Updated run-intent.sh to call outcome checks

Phase 2: Context Capture

Capture before/after state for each execution:

  1. Snapshot relevant state before execution
  2. Capture all commands and outputs
  3. Snapshot state after execution
  4. Store in structured format

Deliverables:
- scripts/capture-context.sh utility
- logs/executions/YYYY-MM-DD/ structured logs
- Retention policy (30 days; see the sketch below)
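
The retention policy could be a daily cron job; a one-line sketch assuming one directory per day as in the Data Model section:

# prune execution-record directories older than 30 days (run daily via cron)
find logs/executions -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +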

Phase 3: Anomaly Detection

Add rules to detect unexpected outcomes:

  1. Define anomaly rules per intent
  2. Compare before/after snapshots
  3. Flag anomalies with severity
  4. Integrate with alerting (optional)

Deliverables:
- intents/anomaly-rules/ folder
- scripts/detect-anomalies.sh utility
- Anomaly log/report format

Phase 4: Incident Drafting

Auto-generate incident drafts from anomalies:

  1. Template population from execution context
  2. Draft storage in docs/incidents/drafts/
  3. Human review workflow
  4. Promotion to filed incident

Deliverables:
- scripts/draft-incident.sh utility
- Draft → Filed workflow
- Integration with the incident-management.md process

Phase 5: Learning & Iteration

Track patterns and suggest improvements:

  1. Aggregate anomalies across executions
  2. Detect repeated patterns
  3. Suggest policy/intent updates
  4. Track suggestion → implementation

Deliverables:
- Pattern detection logic
- Suggestion queue/backlog
- Feedback tracking


Data Model

Execution Record

logs/executions/
└── 2025-12-23/
    └── exec-143052-deploy-home-portal.yaml

Anomaly Log

logs/anomalies/
└── 2025-12-23/
    └── anomaly-143052-memory-spike.yaml

Incident Drafts

docs/incidents/drafts/
└── draft-20251223-memory-spike-home-portal.md

Integration Points

Component  | Integration
-----------|--------------------------------------------
Prometheus | Query metrics for before/after comparison
Loki       | Query logs for error patterns
kubectl    | Capture pod/deployment state
Velero     | Backup status for the verify-backups intent
Grafana    | Optional: dashboard for execution outcomes
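
For the Prometheus integration, before/after values can be sampled at the execution's start and end timestamps through the standard /api/v1/query HTTP API. A sketch, where the server URL, the metric, and the use of jq are assumptions:

#!/usr/bin/env bash
# sketch: fetch a metric value at a given instant for before/after comparison.
# Usage: prom-sample.sh <unix-timestamp>
set -euo pipefail
prom="${PROM_URL:-http://prometheus:9090}"
ts="$1"
query='sum(container_memory_working_set_bytes{namespace="home-portal"})'

curl -sG "${prom}/api/v1/query" \
  --data-urlencode "query=${query}" \
  --data-urlencode "time=${ts}" \
  | jq -r '.data.result[0].value[1]'   # instant vector value is [ts, "value"]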

Success Criteria

  1. Visibility: Can see outcome of every intent execution
  2. Anomaly detection: System flags unexpected behavior within 5 minutes
  3. Incident coverage: 80%+ of incidents have auto-populated context
  4. Learning velocity: Suggestions lead to policy updates within 1 week
  5. Audit trail: Can replay any execution from logs

Open Questions

  1. Storage: How long to retain execution records? (Suggest 30 days)
  2. Alerting: Should anomalies trigger PagerDuty/Slack? (Suggest Phase 3+)
  3. Human-in-loop: When can agents proceed without confirmation? (Policy decision)
  4. Playback: Do we need actual replay or just audit trail? (Suggest audit first)