
Incident Management Process

Guidelines for handling and documenting infrastructure incidents.

Incident Archive: /docs/incidents/


Severity Levels

| Level | Description | Examples | Response |
|-------|-------------|----------|----------|
| P0 | Critical outage | All apps down, data loss risk, security breach | Immediate - drop everything |
| P1 | Major degradation | Single critical app down, backup failures, auth broken | Within 1 hour |
| P2 | Partial degradation | Feature broken, non-critical service down, performance issues | Within 24 hours |
| P3 | Minor issue | Cosmetic issues, non-urgent improvements | Best effort |
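
The response targets above can be encoded so that tooling (an alert router or a ticket bot, say) computes a respond-by time automatically. The following is a minimal sketch, not an existing tool: the names `Severity`, `RESPONSE_TARGET`, and `respond_by` are hypothetical, "Immediate" is modeled as a zero delay, and "Best effort" is modeled as no deadline. The timestamp in the usage line is illustrative only.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "Critical outage"
    P1 = "Major degradation"
    P2 = "Partial degradation"
    P3 = "Minor issue"


# Response targets from the severity table. Assumptions: P0 "immediate"
# is a zero delay, and P3 "best effort" has no deadline (None).
RESPONSE_TARGET = {
    Severity.P0: timedelta(0),
    Severity.P1: timedelta(hours=1),
    Severity.P2: timedelta(hours=24),
    Severity.P3: None,
}


def respond_by(severity: Severity, observed_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should start, or None for best effort."""
    target = RESPONSE_TARGET[severity]
    return None if target is None else observed_at + target


# Illustrative timestamp, not taken from a real incident.
observed = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
print(respond_by(Severity.P1, observed))  # 2024-01-15 10:30:00+00:00
```

Keeping all times in UTC matches the `Time (UTC)` column used in the report template.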

When to File an Incident Report

File a report when:

- P0 or P1 incident occurs (always)
- P2 incident required investigation or caused data impact
- Issue revealed a systemic problem or process gap
- Fix required non-obvious troubleshooting steps
- Lessons learned would benefit future debugging

Skip for:

- Simple config fixes with obvious cause
- Known issues with existing runbooks
- P3 issues unless they reveal larger problems
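
The filing criteria above can be read as a small decision function. The sketch below is one possible interpretation, not an official rule engine: the `Incident` fields and `should_file_report` are hypothetical names, and the ordering of the checks (severity first, then the cross-cutting criteria, then the P2-specific ones) is an assumption about how the lists combine.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str  # "P0", "P1", "P2", or "P3"
    required_investigation: bool = False
    caused_data_impact: bool = False
    revealed_systemic_problem: bool = False
    non_obvious_fix: bool = False
    useful_lessons: bool = False


def should_file_report(incident: Incident) -> bool:
    """Apply the filing criteria; the precedence used here is an assumption."""
    # P0 and P1 incidents always get a report.
    if incident.severity in ("P0", "P1"):
        return True
    # These criteria apply at any severity, including P3 issues that
    # reveal a larger problem.
    if (incident.revealed_systemic_problem
            or incident.non_obvious_fix
            or incident.useful_lessons):
        return True
    # P2 incidents need a report only if they required investigation
    # or caused data impact.
    if incident.severity == "P2":
        return incident.required_investigation or incident.caused_data_impact
    # Everything else (simple config fixes, known issues, plain P3) is skipped.
    return False


print(should_file_report(Incident("P2", caused_data_impact=True)))  # True
print(should_file_report(Incident("P3")))  # False
```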


Incident Report Template

# Incident: [Brief Title]

**Date:** YYYY-MM-DD
**Severity:** P0/P1/P2/P3
**Duration:** X hours/minutes
**Status:** Resolved / Mitigated / Ongoing

---

## Summary

One or two sentences describing what happened and the impact.

---

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | Issue first observed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Verified resolved |

---

## Impact

- What services/users were affected
- Duration of impact
- Data loss (if any)

---

## Root Cause

Technical explanation of what caused the issue.

---

## Resolution

Steps taken to resolve the issue.

---

## Lessons Learned

What we learned and what could be improved.

---

## Action Items

| Item | Owner | Status |
|------|-------|--------|
| Example action | - | Pending |
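
To keep new reports consistent, the header of the template can be generated rather than typed by hand. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the `YYYY-MM-DD-slug.md` file naming scheme is an illustration rather than an established convention of the archive. Only the `/docs/incidents/` location comes from this page.

```python
import re
from datetime import date

ARCHIVE_DIR = "/docs/incidents/"  # archive location named on this page


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def report_path(day: date, title: str) -> str:
    # Hypothetical naming scheme: YYYY-MM-DD-slug.md
    return f"{ARCHIVE_DIR}{day.isoformat()}-{slugify(title)}.md"


def report_header(title: str, day: date, severity: str,
                  duration: str, status: str) -> str:
    """Render the title and metadata lines of the report template."""
    return "\n".join([
        f"# Incident: {title}",
        "",
        f"**Date:** {day.isoformat()}",
        f"**Severity:** {severity}",
        f"**Duration:** {duration}",
        f"**Status:** {status}",
    ])


# Illustrative values only.
print(report_path(date(2024, 1, 15), "Auth service down"))
# /docs/incidents/2024-01-15-auth-service-down.md
```

The remaining sections (Summary through Action Items) are best copied from the template unchanged, since they are free-form.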

Post-Incident Review

For P0/P1 incidents:

  1. Within 24 hours: File initial incident report
  2. Within 1 week: Review with stakeholders (if applicable)
  3. Track action items: Add to roadmap or project backlog

For P2 incidents:

- File report at your discretion
- Focus on lessons learned for future reference
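
The review timeline above can be turned into concrete due dates. The sketch below assumes the clock starts when the incident is resolved; this page does not say whether "within 24 hours" is measured from detection or from resolution, so adjust the starting point to match your practice. The function name `review_deadlines` and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def review_deadlines(severity: str, resolved_at: datetime) -> dict:
    """Return follow-up deadlines for P0/P1 incidents; empty for others."""
    if severity not in ("P0", "P1"):
        # P2 reports are discretionary, so no deadlines are computed.
        return {}
    return {
        # Assumption: both windows are measured from resolution time.
        "initial_report": resolved_at + timedelta(hours=24),
        "stakeholder_review": resolved_at + timedelta(weeks=1),
    }


# Illustrative timestamp only.
resolved = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
for name, due in review_deadlines("P0", resolved).items():
    print(f"{name}: {due.isoformat()}")
```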