Incident: Supabase Sandbox Storage Pod Stuck in ContainerCreating

Date: 2025-01-05
Severity: P3 (sandbox environment)
Duration: ~21 days
Status: Resolved


Summary

The storage deployment in the supabase-sandbox namespace had a rolling update that was stuck for 21 days. Two ReplicaSets were competing for the same ReadWriteOnce (RWO) PVC, leaving one pod permanently stuck in ContainerCreating. Alerting worked correctly: Discord notifications were sent every 12 hours.


Timeline

Time               Event
2024-12-15 ~12:00  Storage deployment updated (revision 5)
2024-12-15 ~12:10  New pod stuck in ContainerCreating (RWO volume conflict)
2024-12-16 00:20   Deployment marked as ProgressDeadlineExceeded
2024-12-16 00:30   First alert: KubeDeploymentRolloutStuck
2025-01-04 07:38   Alert escalation: KubePodNotReady (repeated every 12h)
2025-01-05 18:00   Issue investigated during observability stack review
2025-01-05 18:05   Root cause identified: RWO PVC conflict between ReplicaSets
2025-01-05 18:06   Fix applied: deleted stuck pod, triggering rollout completion
2025-01-05 18:07   Deployment successfully rolled out

Impact

  • supabase-sandbox storage: One pod stuck, but the service remained available via the existing pod
  • Data loss: None
  • Production impact: None (sandbox environment)
  • Alert noise: 4 Discord alerts per day for 21 days

Root Cause

Failed rolling update with RWO PersistentVolumeClaim

The storage deployment uses a PVC with ReadWriteOnce access mode:

volumes:
  - name: storage-data
    persistentVolumeClaim:
      claimName: storage-data  # accessModes: [ReadWriteOnce]
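
For reference, the claim itself would look roughly like this (a sketch; the requested size is an assumption, not a value taken from the cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-data
  namespace: supabase-sandbox
spec:
  accessModes:
    - ReadWriteOnce  # the volume can be attached to a single node at a time
  resources:
    requests:
      storage: 10Gi  # assumed size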

During the rolling update:

  1. The old ReplicaSet (storage-54477789cd) had one running pod with the volume mounted
  2. The new ReplicaSet (storage-75f688dd58) created a new pod requiring the same volume
  3. RWO allows the volume to be attached to only one node at a time, so the two pods could not mount it simultaneously
  4. The new pod got stuck in ContainerCreating, waiting for the volume
  5. The old pod couldn't terminate (the new pod was not ready)
  6. Deadlock: the deployment hit ProgressDeadlineExceeded
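
The conflict is visible in the stuck pod's events. A quick way to confirm it (the label selector is an assumption, and the exact event text depends on the storage driver; attachable volume types typically report a Multi-Attach warning):

# List pods from both ReplicaSets (app=storage label is assumed)
kubectl get pods -n supabase-sandbox -l app=storage

# Inspect the stuck pod's events; for attachable volumes the
# attachdetach-controller emits a warning along the lines of:
#   Warning  FailedAttachVolume  Multi-Attach error for volume "pvc-..."
#   Volume is already used by pod(s) storage-54477789cd-...
kubectl describe pod storage-75f688dd58-tk88g -n supabase-sandbox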

State before fix:

storage-54477789cd  1/1 ready (has volume)
storage-75f688dd58  0/1 ready (waiting for volume)


Resolution

# Delete the stuck pod - this freed the deployment controller
kubectl delete pod storage-75f688dd58-tk88g -n supabase-sandbox

# Rollout completed automatically after pod deletion
kubectl rollout status deploy/storage -n supabase-sandbox
# "deployment successfully rolled out"

The deployment controller detected the change and completed the rollout:

  • Old ReplicaSet scaled to 0
  • New ReplicaSet scaled to 1 with a fresh pod
  • Volume successfully mounted
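
The end state can be double-checked from the ReplicaSets (the label selector is assumed):

# Old ReplicaSet should show 0 desired, new ReplicaSet 1 ready
kubectl get rs -n supabase-sandbox -l app=storage

# Confirm the replacement pod is Running with the volume mounted
kubectl get pods -n supabase-sandbox -l app=storage -o wide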


Lessons Learned

  1. RWO volumes require Recreate strategy or careful handling
     • Rolling updates with RWO PVCs will deadlock
     • Consider strategy.type: Recreate for single-replica deployments with RWO volumes (see the sketch after this list)
  2. Sandbox alerts should be silenced or routed separately
     • 21 days of alerts for a non-production issue creates noise
     • Consider routing alertname=~"Kube.*" with namespace="supabase-sandbox" to the null receiver (see the routing sketch after this list)
  3. Investigate alerts promptly
     • Alerting worked correctly but was ignored
     • Regular triage of firing alerts would have caught this sooner
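
As a follow-up to the first lesson, a minimal sketch of the strategy change (the label, image, and mount path are assumed placeholders; strategy.type is the relevant part):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: storage
  namespace: supabase-sandbox
spec:
  replicas: 1
  strategy:
    type: Recreate  # delete the old pod before creating the new one,
                    # so the RWO volume is never claimed by two pods at once
  selector:
    matchLabels:
      app: storage  # assumed label
  template:
    metadata:
      labels:
        app: storage
    spec:
      containers:
        - name: storage
          image: supabase/storage-api:latest  # assumed image
          volumeMounts:
            - name: storage-data
              mountPath: /var/lib/storage  # assumed mount path
      volumes:
        - name: storage-data
          persistentVolumeClaim:
            claimName: storage-data

The trade-off is a brief gap in availability during each rollout, which is generally acceptable for a single-replica sandbox service.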

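For the second lesson, a hedged Alertmanager routing sketch (matcher syntax for Alertmanager 0.22+; the "discord" receiver name is an assumption, and a "null" receiver must exist, as in kube-prometheus-stack defaults):

route:
  receiver: discord  # existing default receiver (assumed name)
  routes:
    # Swallow kube-state alerts from the sandbox namespace before they
    # reach the Discord receiver
    - matchers:
        - alertname =~ "Kube.*"
        - namespace = "supabase-sandbox"
      receiver: "null"

receivers:
  - name: discord  # existing Discord webhook config elided
  - name: "null"

Routing to a separate low-priority channel instead of "null" would keep the signal while cutting the noise.
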
Action Items

  • [ ] Consider strategy.type: Recreate for supabase storage deployment
  • [ ] Evaluate silencing sandbox namespace alerts or routing to separate channel
  • [ ] Add weekly alert triage to operational routine

Related Alerts

  • Alert: KubeDeploymentRolloutStuck
  • Alert: KubePodNotReady
  • Alert: KubeContainerWaiting
  • Namespace: supabase-sandbox