Incident Management Process
Guidelines for handling and documenting infrastructure incidents.
Incident Archive: /docs/incidents/
Severity Levels
| Level | Description | Examples | Response |
|---|---|---|---|
| P0 | Critical outage | All apps down, data loss risk, security breach | Immediate - drop everything |
| P1 | Major degradation | Single critical app down, backup failures, auth broken | Within 1 hour |
| P2 | Partial degradation | Feature broken, non-critical service down, performance issues | Within 24 hours |
| P3 | Minor issue | Cosmetic issues, non-urgent improvements | Best effort |
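The response targets in the table can be expressed as a small lookup, which may be handy for alerting or tracking scripts. This is only a sketch: the names `RESPONSE_WINDOWS` and `response_deadline` are hypothetical, "Immediate" is modeled as a zero delay, and "Best effort" is modeled as having no deadline.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Response windows taken from the severity table.
# Assumptions: P0 "Immediate" is a zero delay; P3 "Best effort" has no deadline (None).
RESPONSE_WINDOWS = {
    "P0": timedelta(0),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=24),
    "P3": None,
}


def response_deadline(severity: str, observed_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should begin, or None for best effort."""
    window = RESPONSE_WINDOWS[severity]
    return None if window is None else observed_at + window


observed = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
print(response_deadline("P1", observed))  # 2024-01-15 10:30:00+00:00
```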
When to File an Incident Report
File a report when:
- P0 or P1 incident occurs (always)
- P2 incident required investigation or caused data impact
- Issue revealed a systemic problem or process gap
- Fix required non-obvious troubleshooting steps
- Lessons learned would benefit future debugging
Skip for:
- Simple config fixes with obvious cause
- Known issues with existing runbooks
- P3 issues unless they reveal larger problems
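The filing criteria above can be read as a decision function. The sketch below is one possible interpretation, not a definitive policy: the `Incident` fields and `should_file_report` are hypothetical names, and the order in which the "file" and "skip" rules are checked is an assumption, since the guidelines do not say which wins when both apply.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record of the facts needed to apply the filing criteria."""
    severity: str  # "P0", "P1", "P2", or "P3"
    required_investigation: bool = False
    data_impact: bool = False
    systemic_problem: bool = False
    non_obvious_fix: bool = False
    useful_lessons: bool = False
    obvious_config_fix: bool = False
    has_runbook: bool = False


def should_file_report(incident: Incident) -> bool:
    # P0 and P1 incidents always get a report.
    if incident.severity in ("P0", "P1"):
        return True
    # A systemic problem or process gap warrants a report at any severity,
    # including P3 issues that reveal a larger problem.
    if incident.systemic_problem:
        return True
    # Other P3 issues are skipped.
    if incident.severity == "P3":
        return False
    # P2: investigation or data impact means a report is expected.
    if incident.required_investigation or incident.data_impact:
        return True
    # Assumption: the skip rules take precedence over the remaining criteria.
    if incident.obvious_config_fix or incident.has_runbook:
        return False
    return incident.non_obvious_fix or incident.useful_lessons


print(should_file_report(Incident(severity="P2", data_impact=True)))  # True
print(should_file_report(Incident(severity="P3")))                    # False
```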
Incident Report Template
# Incident: [Brief Title]
**Date:** YYYY-MM-DD
**Severity:** P0/P1/P2/P3
**Duration:** X hours/minutes
**Status:** Resolved / Mitigated / Ongoing
---
## Summary
One or two sentences describing what happened and the impact.
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Issue first observed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Verified resolved |
---
## Impact
- What services/users were affected
- Duration of impact
- Data loss (if any)
---
## Root Cause
Technical explanation of what caused the issue.
---
## Resolution
Steps taken to resolve the issue.
---
## Lessons Learned
What we learned and what could be improved.
---
## Action Items
| Item | Owner | Status |
|------|-------|--------|
| Example action | - | Pending |
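The Duration field in the template can be derived from the first and last rows of the Timeline table. The helper below is a minimal sketch under stated assumptions: `format_duration` is a hypothetical name, times are `HH:MM` strings in UTC as in the template, and an incident is assumed to last less than 24 hours (an end time earlier than the start is treated as crossing midnight).

```python
from datetime import datetime, timedelta


def format_duration(first_observed: str, verified_resolved: str) -> str:
    """Compute the elapsed time between two HH:MM (UTC) timeline entries."""
    fmt = "%H:%M"
    start = datetime.strptime(first_observed, fmt)
    end = datetime.strptime(verified_resolved, fmt)
    if end < start:
        # Assumption: the incident crossed midnight and lasted under 24 hours.
        end += timedelta(days=1)
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes} minutes"


print(format_duration("09:30", "11:05"))  # 1h 35m
print(format_duration("23:50", "00:20"))  # 30 minutes
```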
Post-Incident Review
For P0/P1 incidents:
- Within 24 hours: File initial incident report
- Within 1 week: Review with stakeholders (if applicable)
- Track action items: Add to roadmap or project backlog
For P2 incidents:
- File report at your discretion
- Focus on lessons learned for future reference
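The review timing for P0/P1 incidents can be turned into concrete due dates. This is a sketch only: `review_deadlines` is a hypothetical helper, and it assumes both windows are measured from the time the incident was resolved, which the guidelines do not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def review_deadlines(severity: str, resolved_at: datetime) -> Dict[str, datetime]:
    """Return due dates for post-incident steps; empty for P2/P3 (discretionary)."""
    if severity not in ("P0", "P1"):
        return {}
    return {
        # Assumption: both windows start when the incident is resolved.
        "initial_report": resolved_at + timedelta(hours=24),
        "stakeholder_review": resolved_at + timedelta(weeks=1),
    }


resolved = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
for step, due in review_deadlines("P0", resolved).items():
    print(f"{step}: {due.isoformat()}")
```

For a P2 incident the function returns an empty dictionary, reflecting that the report is filed at your discretion.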