Phase 6: Hardening - Implementation Plan
Target Version: 0.7.0
Status: Planning
Created: 2025-12-23
Overview
Phase 6 transforms the intent system from "working" to "production-ready" with:
- Reproducibility (replay from audit logs)
- Integrity verification (automated chain checks)
- Observability (metrics, alerting)
- Usability (documentation)
Task Breakdown
Task 1: Replay Capability (P1)
Goal: Re-run any past intent execution from its audit log.
Use Cases:
- Debugging: "Why did deploy fail last Tuesday?"
- Disaster recovery: "Re-run the migration that worked"
- Testing: "Replay this intent against staging"
Script: scripts/replay-intent.sh
Interface:
# Replay exact execution (same params, context)
./scripts/replay-intent.sh <request_id>
# Replay with dry-run first
./scripts/replay-intent.sh <request_id> --dry-run
# Replay with fresh context (new git sha, etc.)
./scripts/replay-intent.sh <request_id> --fresh-context
# Replay with modified params
./scripts/replay-intent.sh <request_id> --override app=money-tracker
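The interface above could be parsed along these lines. This is a minimal POSIX shell sketch, not the planned implementation: the flag names come from the interface, while the function names, the variable names, and the space-separated key=value format are assumptions made for illustration.

```shell
# Sketch only. Flag names come from the interface above; the function names
# and the key=value parameter format are hypothetical.

# Merge key=value pairs given as arguments. Later pairs override earlier
# ones, and keys keep the order in which they first appeared.
merge_params() {
  printf '%s\n' "$@" | awk -F= '
    NF < 2 { next }                                # ignore anything without "="
    {
      key = $1
      if (!(key in value)) order[++n] = key
      value[key] = substr($0, length(key) + 2)     # keep "=" inside values
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }'
}

# Parse replay-intent.sh arguments into REQUEST_ID, DRY_RUN, FRESH_CONTEXT
# and OVERRIDES (space-separated, so values must not contain spaces).
parse_replay_args() {
  REQUEST_ID="" DRY_RUN=0 FRESH_CONTEXT=0 OVERRIDES=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --dry-run)       DRY_RUN=1 ;;
      --fresh-context) FRESH_CONTEXT=1 ;;
      --override)
        [ $# -ge 2 ] || { echo "--override needs key=value" >&2; return 2; }
        OVERRIDES="${OVERRIDES:+$OVERRIDES }$2"
        shift ;;
      -*) echo "unknown option: $1" >&2; return 2 ;;
      *)  REQUEST_ID=$1 ;;
    esac
    shift
  done
  [ -n "$REQUEST_ID" ] || { echo "usage: replay-intent.sh <request_id> [options]" >&2; return 2; }
}
```

With the original parameters held in a hypothetical ORIGINAL_PARAMS variable, `merge_params $ORIGINAL_PARAMS $OVERRIDES` would yield the effective parameter set for the replay.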
Implementation Steps:
- Extract intent and params from audit log
- Reconstruct execution command
- Add replay metadata to audit log
- Handle edge cases
  - Intent file changed since original run → warn user
  - Context capture commands fail → use cached values or fail
  - Original run used --confirm → require again
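One way to detect the "intent file changed" edge case is to compare checksums. The sketch below assumes the original audit log recorded a SHA-256 of the intent file (a hypothetical field, not confirmed by this plan) and that `sha256sum` from GNU coreutils is available. The function name is illustrative.

```shell
# Sketch only: assumes the audit log stored a SHA-256 of the intent file
# (hypothetical field) and that sha256sum (GNU coreutils) is installed.

# Exit status: 0 = unchanged, 1 = changed (warning printed), 2 = unreadable.
check_intent_unchanged() {
  intent_file=$1
  recorded_sha=$2
  if [ ! -r "$intent_file" ]; then
    echo "ERROR: cannot read $intent_file" >&2
    return 2
  fi
  current_sha=$(sha256sum "$intent_file" | cut -d' ' -f1)
  if [ "$current_sha" != "$recorded_sha" ]; then
    echo "WARNING: $intent_file changed since the original run" >&2
    return 1
  fi
  return 0
}
```

Returning distinct codes for "changed" and "unreadable" lets the caller warn and continue in the first case but abort in the second.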
Deliverable: scripts/replay-intent.sh (~100 lines)
Acceptance Criteria:
- [ ] Can replay any completed intent from audit log
- [ ] Replay is logged with reference to original request
- [ ] --dry-run shows what would execute
- [ ] --fresh-context captures new environment state
Task 2: Audit Integrity Check (P1)
Goal: Automatically verify hash chains haven't been tampered with.
Existing: audit-viewer.sh verify <request_id> - verifies single log
New: scripts/audit-integrity-check.sh - batch verification + cron
Interface:
# Check all logs from today
./scripts/audit-integrity-check.sh
# Check specific date
./scripts/audit-integrity-check.sh --date 2025-12-22
# Check last N days
./scripts/audit-integrity-check.sh --days 7
# Output format
./scripts/audit-integrity-check.sh --format json
Implementation Steps:
- Create batch verification script
- Add cron job
- Alert on failure
  - Write to /root/tower-fleet/logs/alerts/audit-integrity.log
  - Exit non-zero (for cron email)
- Future: webhook notification
- Summary output
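The steps above might look roughly like the following for a single date. This is a sketch under stated assumptions: the log layout `$AUDIT_DIR/<YYYY-MM-DD>/<request_id>.log` is hypothetical, and the existing single-log verifier is wrapped through an overridable `VERIFY` variable rather than called by a known path.

```shell
# Sketch only. Assumptions: logs live at $AUDIT_DIR/<YYYY-MM-DD>/<request_id>.log
# (hypothetical layout) and audit-viewer.sh is reachable on PATH.
AUDIT_DIR=${AUDIT_DIR:-logs/audit}
VERIFY=${VERIFY:-audit-viewer.sh verify}
ALERT_LOG=${ALERT_LOG:-/root/tower-fleet/logs/alerts/audit-integrity.log}

# Verify every log for one date, print a summary line, and return non-zero
# if any chain fails verification.
check_date() {
  day=$1
  total=0
  failed=0
  for log in "$AUDIT_DIR/$day"/*.log; do
    [ -e "$log" ] || continue              # glob matched nothing
    id=$(basename "$log" .log)
    total=$((total + 1))
    if ! $VERIFY "$id" >/dev/null 2>&1; then
      failed=$((failed + 1))
      mkdir -p "$(dirname "$ALERT_LOG")"
      printf '%s BROKEN_CHAIN date=%s request_id=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$day" "$id" >> "$ALERT_LOG"
    fi
  done
  echo "$day: checked=$total failed=$failed"
  [ "$failed" -eq 0 ]
}
```

Keeping the verifier behind a variable means the batch script reuses the existing verify logic and can be exercised with a stub.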
Deliverable:
- scripts/audit-integrity-check.sh (~80 lines)
- Cron configuration in scripts/cron/audit-check.cron
Acceptance Criteria:
- [ ] Batch verification of all logs in date range
- [ ] Cron job runs daily at 2am
- [ ] Broken chain alerts written to log
- [ ] Exit code reflects success/failure
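As an illustration of the 2am schedule, the helper below prints a user-crontab entry (no user column, unlike /etc/cron.d files). Two points are assumptions: at 02:00 a check of "today" would cover only two hours of logs, so the sketch passes `--days 2` on the guess that the range includes the previous full day. Cron mails whatever a job prints, so the check should print its failure summary as well as exit non-zero. The function name is hypothetical.

```shell
# Sketch only: prints a user-crontab entry that runs the check daily at 02:00.
# The repository path defaults to the one used for the alert log in this plan.
render_audit_cron() {
  repo_dir=${1:-/root/tower-fleet}
  printf '%s\n' \
    "# Daily audit integrity check at 02:00" \
    "0 2 * * * cd $repo_dir && ./scripts/audit-integrity-check.sh --days 2"
}
```

The output could be redirected to scripts/cron/audit-check.cron. Note that installing a file with `crontab <file>` replaces the user's whole crontab, so merging it into an existing one needs care.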
Task 3: Failure Alerting (P2)
Goal: Notify when intents fail so issues don't go unnoticed.
Approach: Start simple (log file), add webhook later.
Implementation Steps:
- Add failure hook to run-intent.sh
- Create failure log
- Add daily failure summary (optional)
- Future: Webhook integration
Deliverable:
- Failure logging in run-intent.sh (modify existing)
- logs/alerts/ directory structure
- Optional: scripts/intent-failure-summary.sh
Acceptance Criteria:
- [ ] All failures logged to logs/alerts/intent-failures.log
- [ ] Log includes: timestamp, request_id, intent, step, error
- [ ] Easy to grep/tail for monitoring
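A failure hook that meets these criteria could be as small as the function below. It is a sketch, not the actual change to run-intent.sh: the function name, argument order, and the key=value line format are assumptions chosen so that each failure stays on one greppable line.

```shell
# Sketch only: the function name and line format are hypothetical.
FAILURE_LOG=${FAILURE_LOG:-logs/alerts/intent-failures.log}

# Append one line per failure: timestamp, request_id, intent, step, error.
log_intent_failure() {
  request_id=$1
  intent=$2
  step=$3
  # Collapse newlines so a multi-line error cannot split the record.
  error=$(printf '%s' "$4" | tr '\n' ' ')
  mkdir -p "$(dirname "$FAILURE_LOG")"
  printf '%s request_id=%s intent=%s step=%s error="%s"\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$request_id" "$intent" "$step" "$error" \
    >> "$FAILURE_LOG"
}
```

With this format, `grep 'intent=deploy-app' logs/alerts/intent-failures.log` or `tail -f` on the same file is enough for basic monitoring.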
Task 4: Metrics Collection (P2)
Goal: Track success rates, durations, and patterns.
Script: scripts/intent-metrics.sh
Interface:
# Summary for today
./scripts/intent-metrics.sh
# Summary for date range
./scripts/intent-metrics.sh --from 2025-12-20 --to 2025-12-23
# Per-intent breakdown
./scripts/intent-metrics.sh --by-intent
# JSON output (for dashboards)
./scripts/intent-metrics.sh --format json
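For the JSON output mode, one dependency-free option is to render numeric key=value pairs as a JSON object with awk. This sketch assumes the metrics are already available as `name=number` lines (a hypothetical intermediate format). It rejects non-numeric values rather than attempting string escaping.

```shell
# Sketch only: converts name=number lines on stdin into one JSON object.
# Non-numeric values are rejected (exit status 1) instead of being quoted.
kv_to_json() {
  awk -F= '
    NF < 2 { next }
    {
      key = $1
      val = substr($0, length(key) + 2)
      if (val !~ /^-?(0|[1-9][0-9]*)(\.[0-9]+)?$/) { failed = 1; badkey = key; exit }
      out = out (n++ ? "," : "") "\"" key "\":" val
    }
    END {
      if (failed) { print "non-numeric value for " badkey > "/dev/stderr"; exit 1 }
      print "{" out "}"
    }'
}
```

If jq is available on the target hosts, it would be a more robust choice for nested output such as the per-intent breakdown.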
Metrics to Collect:
| Metric | Description |
|---|---|
| total_executions | Count of intent runs |
| success_count | Completed successfully |
| failure_count | Failed (with or without rollback) |
| rollback_count | Failures that triggered rollback |
| avg_duration_ms | Average execution time |
| p95_duration_ms | 95th percentile duration |
| by_intent | Breakdown per intent type |
| by_step | Which steps fail most often |
Implementation Steps:
- Parse audit logs
- Aggregate statistics
- Format output
Intent System Metrics - 2025-12-23
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall:
  Total executions: 18
  Success rate: 94.4% (17/18)
  Avg duration: 45.2s
By Intent:
  deploy-app: 8 runs, 87.5% success, avg 52s
  observe-app: 6 runs, 100% success, avg 3s
  restart-app: 4 runs, 100% success, avg 15s
Top Failure Steps:
  build_image: 1 failure (npm install timeout)
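The aggregation step can be sketched with sort and awk. The input format here is an assumption: one execution per line, tab-separated as intent, status, duration in milliseconds, with `success` as the passing status. Real audit logs would need an extraction step to produce it. The p95 uses the nearest-rank method on the sorted durations.

```shell
# Sketch only. Input on stdin (hypothetical format), tab-separated:
#   intent <TAB> status <TAB> duration_ms
summarize_metrics() {
  sort -t "$(printf '\t')" -k3,3n | awk -F'\t' '
    {
      total++
      dur[total] = $3          # durations arrive in ascending order
      sum += $3
      if ($2 == "success") ok++
      else fail++
    }
    END {
      if (total == 0) { print "total_executions=0"; exit }
      rank = int((95 * total + 99) / 100)   # ceil(0.95 * total), nearest-rank p95
      printf "total_executions=%d\n", total
      printf "success_count=%d\n", ok
      printf "failure_count=%d\n", fail
      printf "avg_duration_ms=%d\n", sum / total
      printf "p95_duration_ms=%d\n", dur[rank]
    }'
}
```

Emitting key=value lines keeps the aggregation separate from presentation, so the same numbers can feed both the text report and a JSON formatter.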
Deliverable: scripts/intent-metrics.sh (~150 lines)
Acceptance Criteria:
- [ ] Aggregates metrics from audit logs
- [ ] Shows success rate, duration, by-intent breakdown
- [ ] JSON output available for programmatic use
- [ ] Date range filtering works
Task 5: Documentation (P2)
Goal: Complete user guide for the intent system.
Document: docs/guides/intent-system-user-guide.md
Outline:
# Intent System User Guide
## Quick Start
- Running your first intent
- Dry-run mode
- Understanding the output
## Available Intents
- deploy-app
- observe-app
- check-logs
- restart-app
- scale-app
- migrate-schema
- create-nextjs-app
## Using Slash Commands
- Command reference
- Examples
## Natural Language (LLM) Integration
- How intent matching works
- When to use slash commands vs natural language
## Audit Logs
- Viewing execution history
- Verifying integrity
- Replaying past executions
## Troubleshooting
- Common errors
- Lock issues
- Template resolution problems
## Extending the System
- Creating new intents
- Adding policy rules
- Custom verification checks
Deliverable: docs/guides/intent-system-user-guide.md (~300 lines)
Acceptance Criteria:
- [ ] Quick start gets user running intent in <5 minutes
- [ ] All 7 intents documented with examples
- [ ] Troubleshooting covers common issues
- [ ] Syncs to OtterWiki
Implementation Order
Based on dependencies and priority:
Week 1: P1 Tasks (Core Reliability)
├── Task 2: Audit Integrity Check (foundation for trust)
└── Task 1: Replay Capability (depends on stable audit logs)
Week 2: P2 Tasks (Observability)
├── Task 3: Failure Alerting (simple, high value)
└── Task 4: Metrics Collection (builds on alerting)
Week 3: P2 Tasks (Usability)
└── Task 5: Documentation (captures learnings from above)
Estimated Effort
| Task | New Code | Modify Existing | Estimated Time |
|---|---|---|---|
| Replay Capability | ~100 lines | ~20 lines | ~2 hours |
| Audit Integrity | ~80 lines | cron setup | ~1.5 hours |
| Failure Alerting | ~30 lines | ~40 lines | ~1 hour |
| Metrics Collection | ~150 lines | - | ~2 hours |
| Documentation | ~300 lines | - | ~2 hours |
| Total | ~660 lines | ~60 lines | ~8.5 hours |
Success Criteria for Phase 6
Phase 6 is complete when:
- Replay works: Any past intent can be re-executed from its audit log
- Integrity verified daily: Cron job checks all audit chains
- Failures visible: All failures logged to dedicated alert log
- Metrics available: Can answer "what's our deploy success rate?"
- Documented: New user can run intents from reading the guide
Files to Create/Modify
New Files:
scripts/
├── replay-intent.sh # Task 1
├── audit-integrity-check.sh # Task 2
├── intent-metrics.sh # Task 4
└── cron/
└── audit-check.cron # Task 2
logs/alerts/ # Task 3
├── intent-failures.log
└── audit-integrity.log
docs/guides/
└── intent-system-user-guide.md # Task 5
Modified Files:
scripts/run-intent.sh # Task 3: Add failure logging hook
docs/architecture/intent-system-roadmap.md # Update status
Next Action
Start with Task 2: Audit Integrity Check because:
1. It's foundational (ensures audit logs are trustworthy)
2. Uses existing audit-viewer.sh verify logic
3. Quick win (cron setup is straightforward)
4. Enables Task 1 (replay depends on trusted logs)
Ready to implement?