Incident: Supabase Sandbox Storage Pod Stuck in ContainerCreating

Date: 2025-01-05
Severity: P3 (sandbox environment)
Duration: ~21 days
Status: Resolved


Summary

The storage deployment in the supabase-sandbox namespace had a rolling update that was stuck for 21 days. Two ReplicaSets were competing for the same ReadWriteOnce (RWO) PVC, leaving one pod permanently stuck in ContainerCreating. Alerting worked correctly: Discord notifications were sent every 12 hours.


Timeline

Time               Event
2024-12-15 ~12:00  Storage deployment updated (revision 5)
2024-12-15 ~12:10  New pod stuck in ContainerCreating (RWO volume conflict)
2024-12-16 00:20   Deployment marked as ProgressDeadlineExceeded
2024-12-16 00:30   First alert: KubeDeploymentRolloutStuck
2025-01-04 07:38   Alert escalation: KubePodNotReady (repeated every 12h)
2025-01-05 18:00   Issue investigated during observability stack review
2025-01-05 18:05   Root cause identified: RWO PVC conflict between ReplicaSets
2025-01-05 18:06   Fix applied: deleted stuck pod, triggering rollout completion
2025-01-05 18:07   Deployment successfully rolled out

Impact

  • supabase-sandbox storage: One pod stuck, but the service remained available via the existing pod
  • Data loss: None
  • Production impact: None (sandbox environment)
  • Alert noise: 4 Discord alerts per day for 21 days

Root Cause

Failed rolling update with RWO PersistentVolumeClaim

The storage deployment uses a PVC with ReadWriteOnce access mode:

volumes:
  - name: storage-data
    persistentVolumeClaim:
      claimName: storage-data  # accessModes: [ReadWriteOnce]
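
For reference, the claim itself would look roughly like this (a sketch; the requested size is an assumption, not a value taken from the cluster):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-data
  namespace: supabase-sandbox
spec:
  accessModes:
    - ReadWriteOnce  # the volume can be attached to a single node at a time
  resources:
    requests:
      storage: 10Gi  # assumed size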

During the rolling update:

  1. The old ReplicaSet (storage-54477789cd) had one running pod with the volume mounted
  2. The new ReplicaSet (storage-75f688dd58) created a new pod requiring the same volume
  3. RWO allows the volume to be attached to only one node at a time, so the two pods could not mount it simultaneously
  4. The new pod got stuck in ContainerCreating, waiting for the volume
  5. The old pod couldn't terminate (the new pod was not ready)
  6. Deadlock: the deployment hit ProgressDeadlineExceeded
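
The conflict is visible in the stuck pod's events. A quick way to confirm it (the label selector is an assumption, and the exact event text depends on the storage driver; attachable volume types typically report a Multi-Attach warning):

# List pods from both ReplicaSets (app=storage label is assumed)
kubectl get pods -n supabase-sandbox -l app=storage

# Inspect the stuck pod's events; for attachable volumes the
# attachdetach-controller emits a warning along the lines of:
#   Warning  FailedAttachVolume  Multi-Attach error for volume "pvc-..."
#   Volume is already used by pod(s) storage-54477789cd-...
kubectl describe pod storage-75f688dd58-tk88g -n supabase-sandbox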

State before fix:

storage-54477789cd  1/1 ready (has volume)
storage-75f688dd58  0/1 ready (waiting for volume)


Resolution

# Delete the stuck pod - this freed the deployment controller
kubectl delete pod storage-75f688dd58-tk88g -n supabase-sandbox

# Rollout completed automatically after pod deletion
kubectl rollout status deploy/storage -n supabase-sandbox
# "deployment successfully rolled out"

The deployment controller detected the change and completed the rollout:

  • Old ReplicaSet scaled to 0
  • New ReplicaSet scaled to 1 with a fresh pod
  • Volume successfully mounted
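
The end state can be double-checked from the ReplicaSets (the label selector is assumed):

# Old ReplicaSet should show 0 desired, new ReplicaSet 1 ready
kubectl get rs -n supabase-sandbox -l app=storage

# Confirm the replacement pod is Running with the volume mounted
kubectl get pods -n supabase-sandbox -l app=storage -o wide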


Lessons Learned

  1. RWO volumes require Recreate strategy or careful handling
     • Rolling updates with RWO PVCs will deadlock
     • Consider strategy.type: Recreate for single-replica deployments with RWO volumes (see the sketch after this list)
  2. Sandbox alerts should be silenced or routed separately
     • 21 days of alerts for a non-production issue creates noise
     • Consider routing alertname=~"Kube.*" with namespace="supabase-sandbox" to the null receiver (see the routing sketch after this list)
  3. Investigate alerts promptly
     • Alerting worked correctly but was ignored
     • Regular triage of firing alerts would have caught this sooner
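
As a follow-up to the first lesson, a minimal sketch of the strategy change (the label, image, and mount path are assumed placeholders; strategy.type is the relevant part):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: storage
  namespace: supabase-sandbox
spec:
  replicas: 1
  strategy:
    type: Recreate  # delete the old pod before creating the new one,
                    # so the RWO volume is never claimed by two pods at once
  selector:
    matchLabels:
      app: storage  # assumed label
  template:
    metadata:
      labels:
        app: storage
    spec:
      containers:
        - name: storage
          image: supabase/storage-api:latest  # assumed image
          volumeMounts:
            - name: storage-data
              mountPath: /var/lib/storage  # assumed mount path
      volumes:
        - name: storage-data
          persistentVolumeClaim:
            claimName: storage-data

The trade-off is a brief gap in availability during each rollout, which is generally acceptable for a single-replica sandbox service.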

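For the second lesson, a hedged Alertmanager routing sketch (matcher syntax for Alertmanager 0.22+; the "discord" receiver name is an assumption, and a "null" receiver must exist, as in kube-prometheus-stack defaults):

route:
  receiver: discord  # existing default receiver (assumed name)
  routes:
    # Swallow kube-state alerts from the sandbox namespace before they
    # reach the Discord receiver
    - matchers:
        - alertname =~ "Kube.*"
        - namespace = "supabase-sandbox"
      receiver: "null"

receivers:
  - name: discord  # existing Discord webhook config elided
  - name: "null"

Routing to a separate low-priority channel instead of "null" would keep the signal while cutting the noise.
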
Action Items

  • [ ] Consider strategy.type: Recreate for supabase storage deployment
  • [ ] Evaluate silencing sandbox namespace alerts or routing to separate channel
  • [ ] Add weekly alert triage to operational routine

Related Alerts

  • Alert: KubeDeploymentRolloutStuck
  • Alert: KubePodNotReady
  • Alert: KubeContainerWaiting
  • Namespace: supabase-sandbox