Phase 6: Hardening - Implementation Plan

Target Version: 0.7.0
Status: Planning
Created: 2025-12-23

Overview

Phase 6 transforms the intent system from "working" to "production-ready" with:

- Reproducibility (replay from audit logs)
- Integrity verification (automated chain checks)
- Observability (metrics, alerting)
- Usability (documentation)


Task Breakdown

Task 1: Replay Capability (P1)

Goal: Re-run any past intent execution from its audit log.

Use Cases:

- Debugging: "Why did deploy fail last Tuesday?"
- Disaster recovery: "Re-run the migration that worked"
- Testing: "Replay this intent against staging"

Script: scripts/replay-intent.sh

Interface:

# Replay exact execution (same params, context)
./scripts/replay-intent.sh <request_id>

# Replay with dry-run first
./scripts/replay-intent.sh <request_id> --dry-run

# Replay with fresh context (new git sha, etc.)
./scripts/replay-intent.sh <request_id> --fresh-context

# Replay with modified params
./scripts/replay-intent.sh <request_id> --override app=money-tracker

Implementation Steps:

  1. Extract intent and params from audit log

    # Locate the audit log for <request_id> (path layout assumed), then
    # read the intent and params from the intent_received event
    log_file=$(find logs/ -name "${request_id}*.jsonl" | head -1)
    intent=$(jq -r 'select(.event=="intent_received") | .intent' < "$log_file")
    params=$(jq -c 'select(.event=="intent_received") | .params' < "$log_file")
    intent_file=$(find intents/ -name "${intent}.yaml" | head -1)
    

  2. Reconstruct execution command

    # Build a key=value string from the params object
    params_str=$(echo "$params" | jq -r 'to_entries | map("\(.key)=\(.value)") | join(" ")')

    # Execute (word-splitting of $params_str is intentional; values
    # containing spaces would need per-param quoting)
    ./scripts/run-intent.sh "$intent_file" --params $params_str
    

  3. Add replay metadata to audit log

    {
      "event": "intent_received",
      "replay_of": "req_original_123",
      "replay_mode": "exact|fresh-context|override"
    }
    

  4. Handle edge cases (a combined sketch follows this list)

     - Intent file changed since original run → warn user
     - Context capture commands fail → use cached values or fail
     - Original run used --confirm → require again

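Putting the steps together, a minimal sketch of the replay flow. The logs/ path layout, the intent_sha field used for the drift check, and the --override merge semantics are illustrative assumptions, not existing behavior:

#!/bin/bash
# replay-intent.sh <request_id> [--override k=v] -- minimal sketch
set -euo pipefail

request_id="$1"
log_file=$(find logs/ -name "${request_id}*.jsonl" | head -1)
[[ -n "$log_file" ]] || { echo "ERROR: no audit log for ${request_id}" >&2; exit 1; }

# Step 1: recover intent and params from the original intent_received event
intent=$(jq -r 'select(.event=="intent_received") | .intent' < "$log_file")
params=$(jq -c 'select(.event=="intent_received") | .params' < "$log_file")
intent_file=$(find intents/ -name "${intent}.yaml" | head -1)

# --override k=v: merge a new value into the original params (assumed semantics)
if [[ "${2:-}" == "--override" ]]; then
  ovr="${3:?usage: --override key=value}"
  params=$(echo "$params" | jq -c --arg k "${ovr%%=*}" --arg v "${ovr#*=}" '. + {($k): $v}')
fi

# Edge case: warn if the intent file drifted, assuming a hypothetical
# intent_sha field was recorded at original execution time
orig_sha=$(jq -r 'select(.event=="intent_received") | .intent_sha // empty' < "$log_file")
curr_sha=$(sha256sum "$intent_file" | cut -d' ' -f1)
if [[ -n "$orig_sha" && "$orig_sha" != "$curr_sha" ]]; then
  echo "WARN: ${intent_file} changed since the original run" >&2
fi

# Step 2: rebuild the params string and re-execute
params_str=$(echo "$params" | jq -r 'to_entries | map("\(.key)=\(.value)") | join(" ")')
./scripts/run-intent.sh "$intent_file" --params $params_str
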
Deliverable: scripts/replay-intent.sh (~100 lines)

Acceptance Criteria:

- [ ] Can replay any completed intent from audit log
- [ ] Replay is logged with reference to original request
- [ ] --dry-run shows what would execute
- [ ] --fresh-context captures new environment state


Task 2: Audit Integrity Check (P1)

Goal: Automatically verify hash chains haven't been tampered with.

Existing: audit-viewer.sh verify <request_id> - verifies a single log

New: scripts/audit-integrity-check.sh - batch verification + cron

Interface:

# Check all logs from today
./scripts/audit-integrity-check.sh

# Check specific date
./scripts/audit-integrity-check.sh --date 2025-12-22

# Check last N days
./scripts/audit-integrity-check.sh --days 7

# Output format
./scripts/audit-integrity-check.sh --format json

Implementation Steps (a runnable sketch of the batch loop follows this list):

  1. Create batch verification script

    #!/bin/bash
    # For each log file in date range:
    #   - Verify hash chain
    #   - Record result
    #   - Alert on failure
    

  2. Add cron job

    # /etc/cron.d/intent-audit-check
    0 2 * * * root /root/tower-fleet/scripts/audit-integrity-check.sh --days 1 --alert
    

  3. Alert on failure

     - Write to /root/tower-fleet/logs/alerts/audit-integrity.log
     - Exit non-zero (for cron email)
     - Future: webhook notification

  4. Summary output

    Audit Integrity Check - 2025-12-23
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Logs checked: 18
    Valid chains: 18
    Broken chains: 0
    Status: PASS
    

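A minimal sketch combining steps 1 and 3, assuming audit logs live under logs/audit/<date>/ as <request_id>.jsonl (a hypothetical layout) and reusing the existing audit-viewer.sh verify command:

#!/bin/bash
# audit-integrity-check.sh -- batch-verify hash chains for one date (sketch)
set -euo pipefail

date="${1:-$(date +%F)}"
alert_log="/root/tower-fleet/logs/alerts/audit-integrity.log"
checked=0 broken=0

for log in logs/audit/"$date"/*.jsonl; do
  [[ -e "$log" ]] || continue                    # no logs for this date
  req_id=$(basename "$log" .jsonl)
  checked=$((checked + 1))
  if ! ./scripts/audit-viewer.sh verify "$req_id" >/dev/null 2>&1; then
    broken=$((broken + 1))
    echo "$(date -u +%FT%TZ) BROKEN CHAIN: ${req_id}" >> "$alert_log"
  fi
done

echo "Logs checked: ${checked}  Broken chains: ${broken}"
[[ "$broken" -eq 0 ]]   # non-zero exit on any broken chain (for cron email)
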
Deliverables:

- scripts/audit-integrity-check.sh (~80 lines)
- Cron configuration in scripts/cron/audit-check.cron

Acceptance Criteria:

- [ ] Batch verification of all logs in date range
- [ ] Cron job runs daily at 2am
- [ ] Broken chain alerts written to log
- [ ] Exit code reflects success/failure


Task 3: Failure Alerting (P2)

Goal: Notify when intents fail so issues don't go unnoticed.

Approach: Start simple (log file), add webhook later.

Implementation Steps (a sketch of the log_failure helper follows this list):

  1. Add failure hook to run-intent.sh

    # On execution_completed with outcome=failed:
    log_failure "$request_id" "$intent" "$failed_step" "$error"
    

  2. Create failure log

    # /root/tower-fleet/logs/alerts/intent-failures.log
    2025-12-23T15:30:00Z | req_abc123 | deploy-app | build_image | Exit code 1
    

  3. Add daily failure summary (optional)

    # scripts/intent-failure-summary.sh
    # Summarizes failures from last 24h
    

  4. Future: Webhook integration

    # intents/config.yaml
    alerts:
      webhook_url: "https://discord.com/api/webhooks/..."
      on_failure: true
      on_rollback: true
    

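A minimal sketch of the log_failure helper from step 1, emitting the pipe-delimited line format shown in step 2:

# Failure hook for run-intent.sh (sketch)
ALERT_DIR="/root/tower-fleet/logs/alerts"

log_failure() {
  local request_id="$1" intent="$2" failed_step="$3" error="$4"
  mkdir -p "$ALERT_DIR"
  printf '%s | %s | %s | %s | %s\n' \
    "$(date -u +%FT%TZ)" "$request_id" "$intent" "$failed_step" "$error" \
    >> "$ALERT_DIR/intent-failures.log"
}

# Example call: log_failure "req_abc123" "deploy-app" "build_image" "Exit code 1"
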
Deliverables:

- Failure logging in run-intent.sh (modify existing)
- logs/alerts/ directory structure
- Optional: scripts/intent-failure-summary.sh

Acceptance Criteria:

- [ ] All failures logged to logs/alerts/intent-failures.log
- [ ] Log includes: timestamp, request_id, intent, step, error
- [ ] Easy to grep/tail for monitoring


Task 4: Metrics Collection (P2)

Goal: Track success rates, durations, and patterns.

Script: scripts/intent-metrics.sh

Interface:

# Summary for today
./scripts/intent-metrics.sh

# Summary for date range
./scripts/intent-metrics.sh --from 2025-12-20 --to 2025-12-23

# Per-intent breakdown
./scripts/intent-metrics.sh --by-intent

# JSON output (for dashboards)
./scripts/intent-metrics.sh --format json

Metrics to Collect:

| Metric | Description |
| --- | --- |
| total_executions | Count of intent runs |
| success_count | Completed successfully |
| failure_count | Failed (with or without rollback) |
| rollback_count | Failures that triggered rollback |
| avg_duration_ms | Average execution time |
| p95_duration_ms | 95th percentile duration |
| by_intent | Breakdown per intent type |
| by_step | Which steps fail most often |

Implementation Steps (an aggregation sketch follows this list):

  1. Parse audit logs

    # For each log file:
    #   - Extract intent name
    #   - Extract outcome (success/failed)
    #   - Calculate duration (first to last timestamp)
    #   - Record failed step if applicable
    

  2. Aggregate statistics

    # Use awk/jq to compute:
    #   - Counts by outcome
    #   - Duration percentiles
    #   - Failure hotspots
    

  3. Format output

    Intent System Metrics - 2025-12-23
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Overall:
      Total executions: 18
      Success rate: 94.4% (17/18)
      Avg duration: 45.2s
    
    By Intent:
      deploy-app:    8 runs, 87.5% success, avg 52s
      observe-app:   6 runs, 100% success, avg 3s
      restart-app:   4 runs, 100% success, avg 15s
    
    Top Failure Steps:
      build_image: 1 failure (npm install timeout)
    

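A minimal sketch of steps 1 and 2. The per-event ts field, the execution_completed outcome values, and the logs/audit/<date>/ layout are assumptions to check against the real log schema; durations are computed in seconds here rather than milliseconds:

#!/bin/bash
# intent-metrics.sh -- aggregate outcomes and durations from audit logs (sketch)
set -euo pipefail

total=0 ok=0
durations=$(mktemp)

for log in logs/audit/*/*.jsonl; do
  [[ -e "$log" ]] || continue
  total=$((total + 1))
  outcome=$(jq -r 'select(.event=="execution_completed") | .outcome' < "$log" | tail -1)
  [[ "$outcome" == "success" ]] && ok=$((ok + 1))
  # Duration = first event timestamp to last event timestamp (ts field assumed)
  jq -rs '(.[-1].ts | fromdateiso8601) - (.[0].ts | fromdateiso8601)' < "$log" >> "$durations"
done

avg=$(awk '{s+=$1} END{if (NR) printf "%.1f", s/NR}' "$durations")
p95=$(sort -n "$durations" | awk '{a[NR]=$1} END{if (NR) {i=int(NR*0.95); if (i<NR*0.95) i++; print a[i]}}')
rm -f "$durations"

echo "Total executions: ${total}"
awk -v o="$ok" -v t="$total" 'BEGIN{if (t) printf "Success rate: %.1f%% (%d/%d)\n", 100*o/t, o, t}'
echo "Avg duration: ${avg}s  p95: ${p95}s"
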
Deliverable: scripts/intent-metrics.sh (~150 lines)

Acceptance Criteria:

- [ ] Aggregates metrics from audit logs
- [ ] Shows success rate, duration, by-intent breakdown
- [ ] JSON output available for programmatic use
- [ ] Date range filtering works


Task 5: Documentation (P2)

Goal: Complete user guide for the intent system.

Document: docs/guides/intent-system-user-guide.md

Outline:

# Intent System User Guide

## Quick Start
- Running your first intent
- Dry-run mode
- Understanding the output

## Available Intents
- deploy-app
- observe-app
- check-logs
- restart-app
- scale-app
- migrate-schema
- create-nextjs-app

## Using Slash Commands
- Command reference
- Examples

## Natural Language (LLM) Integration
- How intent matching works
- When to use slash commands vs natural language

## Audit Logs
- Viewing execution history
- Verifying integrity
- Replaying past executions

## Troubleshooting
- Common errors
- Lock issues
- Template resolution problems

## Extending the System
- Creating new intents
- Adding policy rules
- Custom verification checks

Deliverable: docs/guides/intent-system-user-guide.md (~300 lines)

Acceptance Criteria:

- [ ] Quick start gets user running intent in <5 minutes
- [ ] All 7 intents documented with examples
- [ ] Troubleshooting covers common issues
- [ ] Syncs to OtterWiki


Implementation Order

Based on dependencies and priority:

Week 1: P1 Tasks (Core Reliability)
├── Task 2: Audit Integrity Check (foundation for trust)
└── Task 1: Replay Capability (depends on stable audit logs)

Week 2: P2 Tasks (Observability)
├── Task 3: Failure Alerting (simple, high value)
└── Task 4: Metrics Collection (builds on alerting)

Week 3: P2 Tasks (Usability)
└── Task 5: Documentation (captures learnings from above)

Estimated Effort

| Task | New Code | Modify Existing | Total Effort |
| --- | --- | --- | --- |
| Replay Capability | ~100 lines | ~20 lines | ~2 hours |
| Audit Integrity | ~80 lines | cron setup | ~1.5 hours |
| Failure Alerting | ~30 lines | ~40 lines | ~1 hour |
| Metrics Collection | ~150 lines | - | ~2 hours |
| Documentation | ~300 lines | - | ~2 hours |
| Total | ~660 lines | ~60 lines | ~8.5 hours |

Success Criteria for Phase 6

Phase 6 is complete when:

  1. Replay works: Any past intent can be re-executed from its audit log
  2. Integrity verified daily: Cron job checks all audit chains
  3. Failures visible: All failures logged to dedicated alert log
  4. Metrics available: Can answer "what's our deploy success rate?"
  5. Documented: New user can run intents from reading the guide

Files to Create/Modify

New Files:

scripts/
├── replay-intent.sh           # Task 1
├── audit-integrity-check.sh   # Task 2
├── intent-metrics.sh          # Task 4
└── cron/
    └── audit-check.cron       # Task 2

logs/alerts/                   # Task 3
├── intent-failures.log
└── audit-integrity.log

docs/guides/
└── intent-system-user-guide.md  # Task 5

Modified Files:

scripts/run-intent.sh          # Task 3: Add failure logging hook
docs/architecture/intent-system-roadmap.md  # Update status


Next Action

Start with Task 2: Audit Integrity Check because:

  1. It's foundational (ensures audit logs are trustworthy)
  2. It reuses existing audit-viewer.sh verify logic
  3. It's a quick win (cron setup is straightforward)
  4. It enables Task 1 (replay depends on trusted logs)

Ready to implement?