Incident: Velero B2 Kopia Backup Controller Panic
Date: 2025-12-23
Severity: P2
Duration: ~6 days (unnoticed until investigation)
Status: Resolved
Summary
The Velero v1.16.0 controller crashed repeatedly with a nil pointer dereference while processing maintenance for the B2 offsite BackupRepository resources. Weekly B2 backups were failing, and Kopia maintenance jobs were stuck in the Error state. The cause was a bug in BackupRepoReconciler, which panics when maintenance jobs are deleted before their results are retrieved.
Timeline
| Time (UTC) | Event |
|---|---|
| 2025-12-17 ~23:00 | First B2 backup attempt (init-b2-repo) partially failed |
| 2025-12-18 05:05 | Kopia maintenance jobs start failing with "repository not initialized" |
| 2025-12-21 04:00 | Weekly B2 backup (velero-weekly-backup-20251221040037) partially failed |
| 2025-12-23 21:30 | Issue discovered during Velero dashboard setup |
| 2025-12-23 21:55 | Root cause identified: controller panic in recallMaintenance |
| 2025-12-23 22:05 | Fix applied: deleted orphaned BackupRepositories |
| 2025-12-23 22:09 | Test backup (b2-reinit-20251223) completed successfully |
| 2025-12-23 22:30 | Controller restarted to clear stale in-memory state |
| 2025-12-24 00:35 | Verified no more panics occurring |
Impact
- B2 offsite backups: Failed for 6 namespaces (authentik, docker-registry, immich, monitoring, otterwiki, pelican)
- Local MinIO backups: Unaffected - continued working normally
- Data loss: None - local backups provided coverage
- Controller stability: The Velero pod panicked repeatedly and recovered automatically each time
Root Cause
Bug in Velero v1.16.0 (backup_repository_controller.go:368)
The BackupRepoReconciler.recallMaintenance function panics with "invalid memory address or nil pointer dereference" after the following sequence of events:
- BackupRepositories are created for a storage location
- Initial backup/data-move fails for some namespaces
- Kopia maintenance jobs are created but fail with "repository not initialized"
- Failed job pods are deleted (manually or by TTL)
- Controller tries to retrieve results from deleted jobs
- Nil pointer dereference occurs in the result handling code
Stack trace excerpt:
panic: runtime error: invalid memory address or nil pointer dereference
github.com/vmware-tanzu/velero/pkg/controller.(*BackupRepoReconciler).recallMaintenance.func1
backup_repository_controller.go:368
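The failure mode described above is the familiar pattern of using a lookup result without first checking that the lookup found anything. The sketch below is a Python analogy of that pattern, not Velero's Go code: `JobResult`, `fetch_result`, and the job names are hypothetical, and Python's `None` stands in for Go's nil pointer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class JobResult:
    """Hypothetical stand-in for the outcome of one maintenance job."""
    succeeded: bool
    message: str


def fetch_result(jobs: Dict[str, JobResult], name: str) -> Optional[JobResult]:
    # Returns None when the job has already been deleted (manually or by TTL).
    return jobs.get(name)


def record_unguarded(jobs: Dict[str, JobResult], name: str) -> str:
    result = fetch_result(jobs, name)
    # Fails with AttributeError when result is None, which is the Python
    # analogue of the nil pointer dereference in the stack trace.
    return result.message


def record_guarded(jobs: Dict[str, JobResult], name: str) -> str:
    result = fetch_result(jobs, name)
    if result is None:
        # The job is gone: report that and move on instead of crashing.
        return f"maintenance job {name} no longer exists; skipping"
    return result.message
```

The guarded variant shows the kind of check that avoids the crash; whether the upstream fix takes this form is not something this report establishes.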
Resolution
- Deleted orphaned B2 BackupRepositories.
- Cleaned up failed Kopia maintenance job pods.
- Created fresh B2 backup to reinitialize repositories.
- Restarted Velero controller to clear stale in-memory state.
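The exact commands used for these steps are not recorded in this report. As a rough sketch of how the four steps could be scripted, the Python below selects BackupRepository objects by their `spec.backupStorageLocation` field and builds the corresponding `kubectl` and `velero` command lines without running them. The storage location name `b2`, the `velero` Deployment name, and the example job name are assumptions; the backup name comes from the timeline.

```python
import shlex
from typing import Dict, List


def repositories_for_location(repo_list: Dict, location: str) -> List[str]:
    """Names of BackupRepository objects bound to one storage location.

    repo_list is the parsed output of
    `kubectl -n velero get backuprepositories.velero.io -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in repo_list.get("items", [])
        if item.get("spec", {}).get("backupStorageLocation") == location
    ]


def resolution_commands(repo_names: List[str], failed_jobs: List[str],
                        backup_name: str, location: str,
                        namespace: str = "velero") -> List[List[str]]:
    """Build (but do not run) the commands for the four resolution steps."""
    commands = []
    # 1. Delete the orphaned BackupRepository objects.
    for name in repo_names:
        commands.append(["kubectl", "-n", namespace, "delete",
                         "backuprepositories.velero.io", name])
    # 2. Delete failed maintenance Jobs; their pods are removed with them.
    for job in failed_jobs:
        commands.append(["kubectl", "-n", namespace, "delete", "job", job])
    # 3. Run a fresh backup so the repositories are re-created.
    commands.append(["velero", "--namespace", namespace, "backup", "create",
                     backup_name, "--storage-location", location, "--wait"])
    # 4. Restart the controller (Deployment name "velero" is an assumption).
    commands.append(["kubectl", "-n", namespace, "rollout", "restart",
                     "deployment/velero"])
    return commands


def render(commands: List[List[str]]) -> str:
    """Shell-quoted, one command per line, for review before execution."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Printing `render(...)` and reviewing the output before running anything keeps the destructive delete steps under manual control.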
Lessons Learned
- Monitoring gap: No alerts for Velero controller panics or backup failures. Controller auto-recovered after each panic, masking the issue.
- B2 initial setup incomplete: The B2 offsite backup was configured but the initial full backup was never verified to complete successfully.
- Velero bug: v1.16.0 has a nil pointer bug in maintenance handling. Should monitor upstream for fix and consider upgrade.
- Dashboard timing: Adding the Velero Grafana dashboard earlier would have surfaced this issue sooner.
Action Items
| Item | Owner | Status |
|---|---|---|
| Add Prometheus alert for Velero backup failures | - | Pending |
| Add Prometheus alert for pod CrashLoopBackOff in velero namespace | - | Pending |
| Monitor Velero GitHub for fix to recallMaintenance panic | - | Pending |
| Trigger and verify weekly B2 backup after fix | - | Completed |
| Add Velero dashboard to Grafana | - | Completed |
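Before the pending backup-failure alert is written as a Prometheus rule, its logic could be prototyped against the text of Velero's metrics endpoint. The sketch below assumes the metric names `velero_backup_failure_total`, `velero_backup_partial_failure_total`, and `velero_backup_last_successful_timestamp` with a `schedule` label; these should be confirmed against the deployed version's metrics output. It compares two scrapes because the failure metrics are cumulative counters, and the seven-day staleness threshold in the usage note is only an illustrative choice for a weekly schedule.

```python
import re
from typing import Dict, List, Tuple

# Assumed metric names; verify them against the running Velero version.
FAILURE_COUNTERS = ("velero_backup_failure_total",
                    "velero_backup_partial_failure_total")
LAST_SUCCESS = "velero_backup_last_successful_timestamp"

SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Samples = Dict[Tuple[str, str], float]


def parse_samples(text: str) -> Samples:
    """Map (metric name, schedule label) to value from exposition-format text."""
    samples: Samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples[(name, labels.get("schedule", ""))] = float(value)
    return samples


def backup_problems(previous: Samples, current: Samples,
                    now: float, max_age_seconds: float) -> List[str]:
    """Flag counters that grew between scrapes and stale last-success times."""
    problems = []
    for (name, schedule), value in sorted(current.items()):
        if name in FAILURE_COUNTERS:
            if value > previous.get((name, schedule), 0.0):
                problems.append(f"{schedule}: {name} increased")
        elif name == LAST_SUCCESS and now - value > max_age_seconds:
            problems.append(f"{schedule}: last success older than {max_age_seconds}s")
    return problems
```

Calling `backup_problems(prev, curr, now=time.time(), max_age_seconds=7 * 86400)` on two successive scrapes returns an empty list when nothing is wrong. This sketch does not cover the separate CrashLoopBackOff alert.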