# Incident: Supabase Sandbox Storage Pod Stuck in ContainerCreating
**Date:** 2025-01-05 | **Severity:** P3 (sandbox environment) | **Duration:** ~21 days | **Status:** Resolved
## Summary
The storage deployment in the `supabase-sandbox` namespace had a rolling update stuck for 21 days. Two ReplicaSets were competing for the same RWO (ReadWriteOnce) PVC, leaving one pod permanently stuck in the `ContainerCreating` state. Alerting worked correctly: Discord notifications were sent every 12 hours.
## Timeline
| Time | Event |
|---|---|
| 2024-12-15 ~12:00 | Storage deployment updated (revision 5) |
| 2024-12-15 ~12:10 | New pod stuck in ContainerCreating (RWO volume conflict) |
| 2024-12-16 00:20 | Deployment marked as ProgressDeadlineExceeded |
| 2024-12-16 00:30 | First alert: KubeDeploymentRolloutStuck |
| 2025-01-04 07:38 | Alert escalation: KubePodNotReady (repeated every 12h) |
| 2025-01-05 18:00 | Issue investigated during observability stack review |
| 2025-01-05 18:05 | Root cause identified: RWO PVC conflict between ReplicaSets |
| 2025-01-05 18:06 | Fix applied: deleted stuck pod, triggering rollout completion |
| 2025-01-05 18:07 | Deployment successfully rolled out |
## Impact
- `supabase-sandbox` storage: one pod stuck, but the service remained available via the existing working pod
- Data loss: None
- Production impact: None (sandbox environment)
- Alert noise: 4 Discord alerts per day for 21 days
## Root Cause
**Failed rolling update with an RWO PersistentVolumeClaim.**
The storage deployment uses a PVC with the `ReadWriteOnce` access mode:
```yaml
volumes:
  - name: storage-data
    persistentVolumeClaim:
      claimName: storage-data  # accessModes: [ReadWriteOnce]
```
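For context, the claim behind that volume looks roughly like the sketch below; only the name, namespace, and access mode come from this incident, while the requested size is an illustrative placeholder:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-data
  namespace: supabase-sandbox
spec:
  accessModes:
    - ReadWriteOnce        # volume can be mounted read-write by a single node at a time
  resources:
    requests:
      storage: 10Gi        # placeholder size, not taken from the actual manifest
```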
During the rolling update:

1. The old ReplicaSet (`storage-54477789cd`) had one running pod with the volume mounted.
2. The new ReplicaSet (`storage-75f688dd58`) created a pod that required the same volume.
3. RWO allows the volume to be mounted by only one node at a time, so the new pod could not attach it.
4. The new pod sat in `ContainerCreating`, waiting for the volume.
5. The old pod could not be terminated because the new pod never became ready.
6. Deadlock: the deployment hit `ProgressDeadlineExceeded`.
State before fix: the old ReplicaSet's pod was `Running` with the volume mounted, while the new ReplicaSet's pod (`storage-75f688dd58-tk88g`) remained in `ContainerCreating`.
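This state can be confirmed with standard kubectl queries; a minimal sketch, assuming the deployment's pods carry an `app=storage` label (the actual selector is not recorded in this incident):

```bash
# Both ReplicaSets' pods; the new one stays in ContainerCreating
kubectl get pods -n supabase-sandbox -l app=storage -o wide

# Pod events typically show a FailedAttachVolume / Multi-Attach warning for the RWO volume
kubectl describe pod storage-75f688dd58-tk88g -n supabase-sandbox

# Deployment conditions show Progressing=False with reason ProgressDeadlineExceeded
kubectl describe deploy/storage -n supabase-sandbox
```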
## Resolution
```bash
# Delete the stuck pod - this freed the deployment controller
kubectl delete pod storage-75f688dd58-tk88g -n supabase-sandbox

# Rollout completed automatically after pod deletion
kubectl rollout status deploy/storage -n supabase-sandbox
# "deployment successfully rolled out"
```
The deployment controller detected the change and completed the rollout:

- Old ReplicaSet scaled to 0
- New ReplicaSet scaled to 1 with a fresh pod
- Volume successfully mounted
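To verify the final state, standard kubectl checks are enough (no output is reproduced here, since it was not captured at the time):

```bash
# The old ReplicaSet should report 0 replicas, the new one 1/1
kubectl get rs -n supabase-sandbox

# The surviving pod should be Running with the PVC bound and mounted
kubectl get pods -n supabase-sandbox -o wide
kubectl get pvc storage-data -n supabase-sandbox
```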
## Lessons Learned
- RWO volumes require the Recreate strategy or careful handling
  - Rolling updates with RWO PVCs will deadlock
  - Consider using `strategy.type: Recreate` for single-replica deployments with RWO volumes (see the sketch after this list)
- Sandbox alerts should be silenced or routed separately
  - 21 days of alerts for a non-production issue creates noise
  - Consider routing `alertname=~"Kube.*", namespace="supabase-sandbox"` to a null receiver
- Investigate alerts promptly
  - Alerting worked correctly but was ignored
  - Regular triage of firing alerts would have caught this sooner
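A minimal sketch of the `Recreate` change for the storage deployment; the labels, image, and mount path below are placeholders rather than the real manifest, and only the `strategy` block is the point:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storage
  namespace: supabase-sandbox
spec:
  replicas: 1
  strategy:
    type: Recreate            # terminate the old pod (releasing the RWO volume) before creating the new one
  selector:
    matchLabels:
      app: storage            # placeholder label; the real selector is not recorded here
  template:
    metadata:
      labels:
        app: storage
    spec:
      containers:
        - name: storage
          image: supabase/storage-api     # placeholder image
          volumeMounts:
            - name: storage-data
              mountPath: /var/lib/storage # placeholder mount path
      volumes:
        - name: storage-data
          persistentVolumeClaim:
            claimName: storage-data
```

With `Recreate`, the controller scales the old ReplicaSet to zero first, so the RWO volume is released before the replacement pod tries to mount it. The trade-off is a brief downtime window on every update, which is acceptable for a single-replica sandbox service.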
## Action Items
- [ ] Consider `strategy.type: Recreate` for the Supabase storage deployment
- [ ] Evaluate silencing sandbox namespace alerts or routing them to a separate channel (see the sketch after this list)
- [ ] Add weekly alert triage to the operational routine
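A sketch of the routing change from the second action item, assuming an Alertmanager configuration with an existing Discord receiver; the receiver names here are assumptions, not taken from the actual config:

```yaml
route:
  receiver: discord            # assumed default receiver name
  routes:
    - receiver: "null"         # drop sandbox Kubernetes alerts instead of paging Discord
      matchers:
        - alertname =~ "Kube.*"
        - namespace = "supabase-sandbox"

receivers:
  - name: discord              # existing receiver, configuration omitted
  - name: "null"               # receiver with no notification integrations
```

An alternative to dropping these alerts entirely is routing them to a separate low-priority channel, which keeps visibility without the paging noise.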
## Related
- Alert: `KubeDeploymentRolloutStuck`
- Alert: `KubePodNotReady`
- Alert: `KubeContainerWaiting`
- Namespace: `supabase-sandbox`