AnythingLLM OOM and CPU Throttling

Date: 2025-01-07
Severity: Warning
Duration: ~4 hours
Services Affected: AnythingLLM, doc-sync CronJob

Summary

AnythingLLM experienced repeated OOM kills (66 restarts) and 90% CPU throttling during document sync operations. The doc-sync CronJob failed 50% of runs due to AnythingLLM crashing mid-embedding.

Detection

Alertmanager fired two alerts:

  • CPUThrottlingHigh - 90.61% CPU throttling in the anythingllm namespace
  • KubeJobFailed - Multiple doc-sync jobs failed to complete

Root Cause

Resource limits were insufficient for the document embedding workload:

  • Memory limit: 2Gi (caused OOM during batch embedding of 192 docs)
  • CPU limit: 2 cores (caused 90% throttling during native embedder processing)

The doc-sync CronJob uploads 192 markdown files, then embeds them in batches of 20. The native embedder (Xenova/nomic-embed-text-v1) is CPU- and memory-intensive, and this workload exceeded the original limits.
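
To make the batching concrete, here is a minimal sketch of the arithmetic behind a single sync run. The helper names batch_count and last_batch_size are illustrative and not part of the doc-sync script; only the figures 192 and 20 come from this incident.

```shell
# Ceiling division: number of batches needed for a given document count.
batch_count() {
  docs=$1
  size=$2
  echo $(( (docs + size - 1) / size ))
}

# Size of the final, possibly partial, batch.
last_batch_size() {
  docs=$1
  size=$2
  rem=$(( docs % size ))
  if [ "$rem" -eq 0 ]; then
    echo "$size"
  else
    echo "$rem"
  fi
}

batch_count 192 20      # prints 10
last_batch_size 192 20  # prints 12
```

Each run therefore issues ten embedding batches: nine of 20 documents and a final one of 12.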

Timeline

  • ~02:00 UTC - Doc-sync CronJob deployed with 30-minute schedule
  • ~03:00 UTC - First successful sync completed (took 18 minutes)
  • ~03:30-07:00 UTC - Alternating success/failure pattern as jobs stress AnythingLLM
  • ~13:00 UTC - Alerts fired, 66 restarts observed
  • ~13:20 UTC - Root cause identified (exit code 137 = 128 + 9, i.e. SIGKILL from the OOM killer)
  • ~13:25 UTC - Resource limits increased, pod restarted
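
An exit code above 128 means the container was terminated by signal number (code - 128), so 137 maps to signal 9 (SIGKILL). The kernel OOM killer is one common sender, and Kubernetes records that case as the OOMKilled termination reason. The sketch below shows one way to check both. The function names are hypothetical; the kubectl query reads standard pod status fields and assumes read access to the namespace.

```shell
# Map a container exit code to the signal that terminated it, if any.
exit_code_signal() {
  code=$1
  if [ "$code" -gt 128 ]; then
    kill -l $(( code - 128 ))
  else
    echo "none"
  fi
}

# Print pod name, last termination reason, and last exit code
# for every pod in the given namespace.
last_termination() {
  kubectl get pods -n "$1" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}{end}'
}

exit_code_signal 137   # prints KILL
# last_termination anythingllm
```

A reason of OOMKilled next to exit code 137 distinguishes a memory-limit kill from other sources of SIGKILL.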

Resolution

Increased resource limits in manifests/apps/anythingllm/anythingllm.yaml:

# Before
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "2000m"

# After
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

Also cleaned up failed jobs to clear alerts:

kubectl delete job -n anythingllm anythingllm-doc-sync-29463360 ...
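
The command above names each failed job explicitly. As a sketch of a bulk alternative, kubectl accepts a status.successful field selector on Jobs. Selecting status.successful=0 matches every Job without a successful completion, which also includes Jobs that are still running, so it should only be used when no sync is in progress. The wrapper name and the KUBECTL override are assumptions for illustration.

```shell
# Delete all Jobs in a namespace that have no successful completion.
# Caution: this also matches Jobs that are still running.
# KUBECTL can be overridden, e.g. KUBECTL=echo prints the command
# instead of running it.
delete_unsuccessful_jobs() {
  namespace=$1
  ${KUBECTL:-kubectl} delete jobs -n "$namespace" \
    --field-selector status.successful=0
}
```

Running `KUBECTL=echo delete_unsuccessful_jobs anythingllm` prints the command without touching the cluster, which is a cheap way to review it first.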

Prevention

  1. Load testing - Test resource-intensive features before deploying CronJobs
  2. Gradual rollout - Start with fewer documents or longer intervals
  3. Monitoring - Add memory usage alerts for AnythingLLM specifically
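
Until a proper alert rule exists, memory usage can be compared against the limit from the command line. The sketch below is an ad hoc check, not an Alertmanager rule. flag_high_memory is a hypothetical helper; it assumes metrics-server is installed so that kubectl top works, and that memory is reported in Mi. The 85% threshold in the usage line is an arbitrary example value.

```shell
# Flag containers whose memory usage is at or above a percentage of the limit.
# Reads `kubectl top pod --containers --no-headers` output on stdin, with
# columns: POD  CONTAINER  CPU(cores)  MEMORY(bytes)
flag_high_memory() {
  limit_mi=$1
  threshold_pct=$2
  awk -v limit="$limit_mi" -v pct="$threshold_pct" '
    $4 ~ /Mi$/ {
      used = $4
      sub(/Mi$/, "", used)
      if (used * 100 >= limit * pct) print $1 "/" $2 " " used "Mi"
    }'
}

# Usage against the new 4Gi (4096Mi) limit:
# kubectl top pod -n anythingllm --containers --no-headers | flag_high_memory 4096 85
```

Any line printed is a container close to its limit and a candidate for an OOM kill during the next embedding run.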

Lessons Learned

  • Native embedders are more resource-intensive than expected
  • Batch embedding 192 documents requires significant memory headroom
  • The 0.5s delay between uploads isn't enough to prevent resource spikes during embedding

Related Files

  • manifests/apps/anythingllm/anythingllm.yaml - Main deployment
  • manifests/apps/anythingllm/doc-sync-cronjob.yaml - Sync CronJob