
Incident: Velero B2 Kopia Backup Controller Panic

Date: 2025-12-23
Severity: P2
Duration: ~6 days (unnoticed until investigation)
Status: Resolved


Summary

The Velero v1.16.0 controller repeatedly crashed with a nil pointer dereference while processing B2 offsite BackupRepository maintenance. Weekly B2 backups were failing, and Kopia maintenance jobs were stuck in the Error state. The cause was a bug in BackupRepoReconciler, which panics when maintenance jobs are deleted before their results are retrieved.


Timeline

Time (UTC)        | Event
------------------|------------------------------------------------------------
2025-12-17 ~23:00 | First B2 backup attempt (init-b2-repo) partially failed
2025-12-18 05:05  | Kopia maintenance jobs start failing with "repository not initialized"
2025-12-21 04:00  | Weekly B2 backup (velero-weekly-backup-20251221040037) partially failed
2025-12-23 21:30  | Issue discovered during Velero dashboard setup
2025-12-23 21:55  | Root cause identified: controller panic in recallMaintenance
2025-12-23 22:05  | Fix applied: deleted orphaned BackupRepositories
2025-12-23 22:09  | Test backup (b2-reinit-20251223) completed successfully
2025-12-23 22:30  | Controller restarted to clear stale in-memory state
2025-12-24 00:35  | Verified no more panics occurring

Impact

  • B2 offsite backups: Failed for 6 namespaces (authentik, docker-registry, immich, monitoring, otterwiki, pelican)
  • Local MinIO backups: Unaffected - continued working normally
  • Data loss: None - local backups provided coverage
  • Controller stability: The Velero pod panicked repeatedly but recovered automatically each time

Root Cause

Bug in Velero v1.16.0 (backup_repository_controller.go:368)

The BackupRepoReconciler.recallMaintenance function panics with "invalid memory address or nil pointer dereference" after the following sequence of events:

  1. BackupRepositories are created for a storage location
  2. Initial backup/data-move fails for some namespaces
  3. Kopia maintenance jobs are created but fail with "repository not initialized"
  4. Failed job pods are deleted (manually or by TTL)
  5. Controller tries to retrieve results from deleted jobs
  6. Nil pointer dereference occurs in the result handling code

Stack trace excerpt:

    panic: runtime error: invalid memory address or nil pointer dereference
    github.com/vmware-tanzu/velero/pkg/controller.(*BackupRepoReconciler).recallMaintenance.func1
      backup_repository_controller.go:368
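
To confirm whether a running controller is hitting this panic, the stack frame above can be searched for in the controller logs. The sketch below is one way to do that. The helper name `count_recall_panics` is hypothetical, and the commented usage assumes the `velero` namespace and `deployment/velero` used elsewhere in this report, plus a container that has restarted at least once so that `--previous` has logs to return.

```shell
# Hypothetical helper: count occurrences of the recallMaintenance panic frame
# in a log stream read from stdin. Prints 0 when there are no matches.
count_recall_panics() {
  # -F matches the frame as a fixed string; "|| true" keeps the exit status
  # at 0 when grep finds nothing (grep itself exits 1 in that case).
  grep -cF 'BackupRepoReconciler).recallMaintenance' || true
}

# Assumed usage (requires cluster access, so it is left commented out):
#   kubectl logs -n velero deployment/velero --previous | count_recall_panics
```

A non-zero count from the previous container's logs suggests that the last restart was caused by this panic rather than by an unrelated failure.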


Resolution

  1. Deleted orphaned B2 BackupRepositories:

    kubectl delete backuprepository -n velero -l velero.io/storage-location=b2-offsite
    

  2. Cleaned up failed Kopia maintenance job pods:

    # Pod names come back as "pod/<name>"; keep only those for the B2 location
    for pod in $(kubectl get pods -n velero -o name | grep "b2-offsite"); do
      kubectl delete "$pod" -n velero --ignore-not-found
    done
    

  3. Created fresh B2 backup to reinitialize repositories:

    velero backup create b2-reinit-20251223 \
      --storage-location b2-offsite \
      --include-namespaces otterwiki \
      --snapshot-volumes \
      --snapshot-move-data \
      --wait
    

  4. Restarted Velero controller to clear stale in-memory state:

    kubectl rollout restart deployment/velero -n velero
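
After these steps, it can be useful to check that the recreated B2 repositories have actually reached the Ready phase. The following is a minimal sketch, not the procedure used during the incident. The function names `list_b2_repos` and `not_ready_repos` are hypothetical, and the label selector is the one from step 1.

```shell
# Hypothetical helper: list B2 BackupRepositories as "name<TAB>phase" lines.
# Needs cluster access, so it is only defined here and not called.
list_b2_repos() {
  kubectl get backuprepositories.velero.io -n velero \
    -l velero.io/storage-location=b2-offsite \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Hypothetical helper: read "name<TAB>phase" lines on stdin and print the
# names of repositories whose phase is anything other than Ready.
not_ready_repos() {
  awk -F'\t' '$2 != "Ready" { print $1 }'
}

# Assumed usage:
#   list_b2_repos | not_ready_repos
```

Empty output from `list_b2_repos | not_ready_repos` means every listed repository reports Ready. Any name that is printed is worth inspecting with `kubectl describe` before the next scheduled backup.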
    


Lessons Learned

  1. Monitoring gap: No alerts for Velero controller panics or backup failures. Controller auto-recovered after each panic, masking the issue.

  2. B2 initial setup incomplete: The B2 offsite backup was configured, but the initial full backup was never verified to have completed successfully.

  3. Velero bug: v1.16.0 has a nil pointer bug in maintenance handling. We should monitor upstream for a fix and consider upgrading.

  4. Dashboard timing: Adding the Velero Grafana dashboard earlier would have surfaced this issue sooner.


Action Items

Item                                                             | Owner | Status
-----------------------------------------------------------------|-------|----------
Add Prometheus alert for Velero backup failures                  | -     | Pending
Add Prometheus alert for pod CrashLoopBackOff in velero namespace | -     | Pending
Monitor Velero GitHub for fix to recallMaintenance panic         | -     | Pending
Trigger and verify weekly B2 backup after fix                    | -     | Completed
Add Velero dashboard to Grafana                                  | -     | Completed
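
Until the Prometheus alerts are in place, a simple scripted check could act as a stopgap for spotting failed backups. The sketch below is an assumption about how such a check might look, not an existing tool. The function names `list_backups` and `failed_backups` are hypothetical, and it relies on the Backup resource reporting its result in `.status.phase`.

```shell
# Hypothetical helper: list Velero backups as "name<TAB>phase" lines.
# Needs cluster access, so it is only defined here and not called.
list_backups() {
  kubectl get backups.velero.io -n velero \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Hypothetical helper: read "name<TAB>phase" lines on stdin and print the
# backups that ended in Failed or PartiallyFailed.
failed_backups() {
  awk -F'\t' '$2 == "Failed" || $2 == "PartiallyFailed" { print $1 " (" $2 ")" }'
}

# Assumed usage, e.g. from a cron job that sends mail on any output:
#   list_backups | failed_backups
```

This only covers backup results. It does not detect controller panics, so it is not a replacement for the CrashLoopBackOff alert listed above.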

References