AnythingLLM OOM and CPU Throttling
Date: 2025-01-07
Severity: Warning
Duration: ~4 hours
Services Affected: AnythingLLM, doc-sync CronJob
Summary
AnythingLLM experienced repeated OOM kills (66 restarts) and 90% CPU throttling during document sync operations. The doc-sync CronJob failed 50% of runs due to AnythingLLM crashing mid-embedding.
Detection
Alertmanager fired two alerts:
- CPUThrottlingHigh - 90.61% CPU throttling in anythingllm namespace
- KubeJobFailed - Multiple doc-sync jobs failed to complete
Root Cause
Resource limits were insufficient for the document embedding workload:
- Memory limit: 2Gi (caused OOM during batch embedding of 192 docs)
- CPU limit: 2 cores (caused 90% throttling during native embedder processing)
The doc-sync CronJob uploads 192 markdown files and then embeds them in batches of 20 (10 batches, the last holding 12 files). The native embedder (Xenova/nomic-embed-text-v1) is CPU- and memory-intensive, and its usage during embedding exceeded the original limits.
Timeline
- ~02:00 UTC - Doc-sync CronJob deployed with 30-minute schedule
- ~03:00 UTC - First successful sync completed (took 18 minutes)
- ~03:30-07:00 UTC - Alternating success/failure pattern as jobs stress AnythingLLM
- ~13:00 UTC - Alerts fired, 66 restarts observed
- ~13:20 UTC - Root cause identified (exit code 137 = 128 + SIGKILL, consistent with an OOM kill)
- ~13:25 UTC - Resource limits increased, pod restarted
Resolution
Increased resource limits in manifests/apps/anythingllm/anythingllm.yaml:
# Before
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "2000m"

# After
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "4000m"
The failed doc-sync jobs were also cleaned up so that the KubeJobFailed alert would clear.
Prevention
- Load testing - Test resource-intensive features before deploying CronJobs
- Gradual rollout - Start with fewer documents or longer intervals
- Monitoring - Add memory usage alerts for AnythingLLM specifically
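The monitoring item could be implemented along the following lines. This is a minimal sketch, assuming the cluster runs the Prometheus Operator and scrapes cAdvisor and kube-state-metrics (plausible given the Alertmanager alerts, but not stated in this report). The rule name, alert name, 85% threshold, and 5-minute window are illustrative choices, not values from the existing manifests.

```yaml
# Hypothetical PrometheusRule: warn before AnythingLLM reaches its memory limit.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: anythingllm-memory          # assumed name
  namespace: anythingllm
spec:
  groups:
    - name: anythingllm.memory
      rules:
        - alert: AnythingLLMMemoryNearLimit   # assumed alert name
          # Working-set memory divided by the container memory limit.
          # container!="" drops the pod-level cgroup series from cAdvisor.
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{namespace="anythingllm", container!=""}
            )
            /
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{namespace="anythingllm", resource="memory"}
            ) > 0.85
          for: 5m                   # assumed window; tune to the sync duration
          labels:
            severity: warning
          annotations:
            summary: "AnythingLLM container memory is above 85% of its limit"
```

An alert on the ratio rather than on an absolute byte count keeps working unchanged if the limit is raised again.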
Lessons Learned
- Native embedders are more resource-intensive than expected
- Batch embedding 192 documents requires significant memory headroom
- The 0.5s delay between uploads isn't enough to prevent resource spikes during embedding
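Since an 18-minute sync on a 30-minute schedule leaves little slack, one option is to space runs out and forbid overlap. The fragment below is a sketch only: it assumes the CronJob is named doc-sync and lives in the anythingllm namespace, and the hourly schedule, history limits, deadline, and image are placeholders rather than values taken from doc-sync-cronjob.yaml.

```yaml
# Hypothetical CronJob settings for a gentler doc-sync cadence.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: doc-sync
  namespace: anythingllm            # assumed namespace
spec:
  schedule: "0 * * * *"             # assumed: hourly instead of every 30 minutes
  concurrencyPolicy: Forbid         # skip a run if the previous one is still active
  successfulJobsHistoryLimit: 3     # assumed retention
  failedJobsHistoryLimit: 1         # assumed retention
  jobTemplate:
    spec:
      backoffLimit: 0               # do not retry immediately against a struggling pod
      activeDeadlineSeconds: 2400   # assumed cap (40 min) on a single sync run
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: doc-sync
              image: example.invalid/doc-sync:latest   # placeholder image
```

With concurrencyPolicy set to Forbid, a slow run delays the next one instead of stacking a second embedding workload on top of the first.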
Related Files
- manifests/apps/anythingllm/anythingllm.yaml - Main deployment
- manifests/apps/anythingllm/doc-sync-cronjob.yaml - Sync CronJob