# Intent System: Observability & Learning Loop

Status: Design
Created: 2025-12-23
## Overview
Extend the intent system from "execute and forget" to a closed-loop system that observes outcomes, detects anomalies, files incidents, and iterates on policies.
```text
Propose → Validate → Execute → Observe → Detect → File → Learn → Iterate
   │         │          │         │         │       │       │        │
   ▼         ▼          ▼         ▼         ▼       ▼       ▼        ▼
  LLM      Schema    Scripts   Metrics  Anomaly Incident Update     New
           Match     + K8s     + Logs   Rules   Drafts  Runbooks  Intents
          + Policy
```
## Goals
- Autonomous operation: Agents can execute multi-step tasks with confidence
- Full auditability: Every action is logged with context, replayable
- Anomaly detection: System notices when outcomes don't match expectations
- Incident automation: Automatically draft incident reports for review
- Continuous improvement: Learnings feed back into policies and runbooks
## Components

### 1. Outcome Specification
Each intent defines expected outcomes that can be verified:
```yaml
# Example: deploy-app intent
outcomes:
  success:
    - condition: "pods_ready"
      check: "kubectl get pods -n {{app}} -l app={{app}} -o jsonpath='{.items[*].status.phase}' | grep -v Running | wc -l"
      expected: "0"
      timeout: 300s
    - condition: "endpoint_healthy"
      check: "curl -s -o /dev/null -w '%{http_code}' https://{{app}}.bogocat.com/health"
      expected: "200"
      timeout: 60s
  failure_indicators:
    - pattern: "CrashLoopBackOff"
      severity: P1
    - pattern: "ImagePullBackOff"
      severity: P2
    - pattern: "OOMKilled"
      severity: P1
```
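A minimal sketch of what the checker could look like, assuming specs are parsed with mikefarah yq v4 and that `{{app}}` templating is substituted before the script runs; the argument interface and paths are illustrative, not the final design:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/check-outcome.sh: evaluate one success
# check from an outcome spec and compare its output to the expected value.
# Assumes yq v4 for YAML parsing; {{app}} templating is assumed to be
# already substituted.
set -euo pipefail

spec="$1"        # e.g. intents/outcomes/deploy-app.yaml
index="${2:-0}"  # which success check to evaluate

name=$(yq ".outcomes.success[$index].condition" "$spec")
cmd=$(yq ".outcomes.success[$index].check" "$spec")
expected=$(yq ".outcomes.success[$index].expected" "$spec")
limit=$(yq ".outcomes.success[$index].timeout" "$spec" | tr -d 's')

# Run the check under its timeout; a failed or timed-out command simply
# yields output that won't match the expectation.
actual=$(timeout "$limit" bash -c "$cmd" || true)

if [[ "$actual" == "$expected" ]]; then
  echo "PASS: $name"
else
  echo "FAIL: $name (expected '$expected', got '$actual')" >&2
  exit 1
fi
```

`run-intent.sh` would call this once per success condition and surface any FAIL in the intent output.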
### 2. Execution Context Capture
Every intent execution captures:
```yaml
execution_record:
  id: "exec-20251223-143052-deploy-home-portal"
  timestamp: "2025-12-23T14:30:52Z"
  intent: "deploy-app"
  params:
    app: "home-portal"
    tag: "latest"

  # Pre-execution state
  before:
    pods: [...snapshot...]
    metrics:
      cpu_usage: 45%
      memory_usage: 62%
    recent_logs: [...last 50 lines...]

  # Execution
  commands_run:
    - cmd: "docker build..."
      exit_code: 0
      duration: 45s
    - cmd: "kubectl apply..."
      exit_code: 0
      duration: 2s

  # Post-execution state
  after:
    pods: [...snapshot...]
    metrics:
      cpu_usage: 48%
      memory_usage: 65%
    outcome_checks:
      pods_ready: pass
      endpoint_healthy: pass

  # Result
  status: "success"
  duration: 180s
```
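How the snapshots get captured is sketched below, assuming one namespace per app and metrics-server behind `kubectl top`; the argument interface and file layout are illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/capture-context.sh: snapshot state on
# either side of an execution. Assumes the app runs in a namespace of the
# same name and that metrics-server backs `kubectl top`.
set -euo pipefail

app="$1"      # e.g. home-portal
phase="$2"    # "before" or "after"
exec_id="$3"  # e.g. exec-20251223-143052-deploy-home-portal

dir="logs/executions/$(date +%F)/$exec_id"
mkdir -p "$dir"

# Pod snapshot, last 50 log lines, and coarse resource usage.
kubectl get pods -n "$app" -o json > "$dir/pods-$phase.json"
kubectl logs -n "$app" -l "app=$app" --tail=50 > "$dir/logs-$phase.txt" || true
kubectl top pods -n "$app" --no-headers > "$dir/metrics-$phase.txt" || true
```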
### 3. Anomaly Detection Rules
Define what constitutes unexpected behavior:
```yaml
anomaly_rules:
  # Metric-based
  - name: "memory_spike"
    condition: "after.metrics.memory_usage - before.metrics.memory_usage > 20"
    severity: P2
    action: "flag_for_review"

  # Log-based
  - name: "error_rate_increase"
    condition: "error_count_after > error_count_before * 2"
    severity: P2
    action: "create_incident_draft"

  # State-based
  - name: "unexpected_restart"
    condition: "pod_restarts_after > pod_restarts_before"
    severity: P2
    action: "create_incident_draft"

  # Timeout-based
  - name: "slow_deployment"
    condition: "duration > expected_duration * 2"
    severity: P3
    action: "log_warning"
```
### 4. Incident Auto-Filing
When anomalies are detected, auto-generate incident drafts:
```yaml
incident_draft:
  title: "Deploy home-portal: unexpected memory spike"
  severity: P2
  status: draft  # Requires human review before filing
  generated_from:
    execution_id: "exec-20251223-143052-deploy-home-portal"
    anomaly_rule: "memory_spike"
  auto_populated:
    timeline: [...from execution record...]
    before_state: [...snapshot...]
    after_state: [...snapshot...]
    commands_run: [...from execution record...]
  needs_human_input:
    - root_cause
    - lessons_learned
    - action_items
```
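A sketch of the drafting step; the markdown layout and file naming are illustrative, and the fields mirror the `incident_draft` example above:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/draft-incident.sh: write a reviewable
# incident draft from a detected anomaly. Layout and naming are illustrative.
set -euo pipefail

exec_id="$1"   # execution that triggered the anomaly
rule="$2"      # e.g. memory_spike
severity="$3"  # e.g. P2
title="$4"

draft="docs/incidents/drafts/$(date +%F)-$rule.md"
mkdir -p "$(dirname "$draft")"

cat > "$draft" <<EOF
# $title

- Severity: $severity
- Status: draft (requires human review before filing)
- Generated from: $exec_id / $rule

## Needs human input
- Root cause:
- Lessons learned:
- Action items:
EOF

echo "Draft written to $draft"
```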
### 5. Learning Feedback Loop

Track patterns across incidents and turn repeats into policy, intent, and runbook improvements:
```yaml
learning_triggers:
  # Same anomaly 3+ times → suggest new policy
  - pattern: "repeated_anomaly"
    threshold: 3
    action: "suggest_policy_update"
    example: "Add memory limit check before deploy"

  # Manual fix repeated → suggest new intent
  - pattern: "repeated_manual_fix"
    threshold: 2
    action: "suggest_new_intent"
    example: "Create 'clear-cache' intent for this common fix"

  # Runbook step frequently needed → add to intent
  - pattern: "post_intent_manual_step"
    threshold: 3
    action: "suggest_intent_enhancement"
    example: "Add health check wait to deploy-app"
```
## Implementation Phases

### Phase 1: Outcome Checking (MVP)
Add success/failure verification to existing intents:
- Define outcome specs for each intent
- Run checks after execution
- Log pass/fail with context
- Surface failures in intent output
Deliverables:

- `intents/outcomes/` folder with outcome specs
- `scripts/check-outcome.sh` utility
- Updated `run-intent.sh` to call outcome checks
### Phase 2: Context Capture
Capture before/after state for each execution:
- Snapshot relevant state before execution
- Capture all commands and outputs
- Snapshot state after execution
- Store in structured format
Deliverables:

- `scripts/capture-context.sh` utility
- `logs/executions/YYYY-MM-DD/` structured logs
- Retention policy (30 days)
### Phase 3: Anomaly Detection
Add rules to detect unexpected outcomes:
- Define anomaly rules per intent
- Compare before/after snapshots
- Flag anomalies with severity
- Integrate with alerting (optional)
Deliverables:

- `intents/anomaly-rules/` folder
- `scripts/detect-anomalies.sh` utility
- Anomaly log/report format
### Phase 4: Incident Drafting
Auto-generate incident drafts from anomalies:
- Template population from execution context
- Draft storage in `docs/incidents/drafts/`
- Human review workflow
- Promotion to filed incident
Deliverables:

- `scripts/draft-incident.sh` utility
- Draft → Filed workflow
- Integration with the `incident-management.md` process
### Phase 5: Learning & Iteration
Track patterns and suggest improvements:
- Aggregate anomalies across executions
- Detect repeated patterns
- Suggest policy/intent updates
- Track suggestion → implementation
Deliverables:

- Pattern detection logic
- Suggestion queue/backlog
- Feedback tracking
## Data Model

### Execution Record

One record per intent run, following the `execution_record` structure in Component 2, stored under `logs/executions/YYYY-MM-DD/`.

### Anomaly Log

An append-only log of triggered anomaly rules; each entry references the execution ID and the rule that fired (see Component 3).

### Incident Drafts

Draft files under `docs/incidents/drafts/` using the `incident_draft` structure in Component 4, promoted to filed incidents after human review.
## Integration Points
| Component | Integration |
|---|---|
| Prometheus | Query metrics for before/after comparison |
| Loki | Query logs for error patterns |
| kubectl | Capture pod/deployment state |
| Velero | Backup status for verify-backups intent |
| Grafana | Optional: Dashboard for execution outcomes |
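For example, the Prometheus integration can be a single instant query per snapshot; the Prometheus URL and metric name here are assumptions:

```bash
# Hypothetical example of the Prometheus integration: fetch a point-in-time
# memory figure for the before/after comparison. URL and metric are assumed.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{namespace="home-portal"})' \
  | jq -r '.data.result[0].value[1]'
```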
## Success Criteria
- Visibility: Can see outcome of every intent execution
- Anomaly detection: System flags unexpected behavior within 5 minutes
- Incident coverage: 80%+ of incidents have auto-populated context
- Learning velocity: Suggestions lead to policy updates within 1 week
- Audit trail: Can replay any execution from logs
## Open Questions
- Storage: How long to retain execution records? (Suggest 30 days)
- Alerting: Should anomalies trigger PagerDuty/Slack? (Suggest Phase 3+)
- Human-in-loop: When can agents proceed without confirmation? (Policy decision)
- Playback: Do we need actual replay or just audit trail? (Suggest audit first)