Local LLM Setup¶
This guide documents the local LLM infrastructure running on the Windows gaming PC with an RTX 3080, providing Claude Code-like functionality without API costs.
Note: This infrastructure uses llama.cpp exclusively. Ollama was previously used but has been deprecated in favor of llama.cpp's simpler single-model architecture and OpenAI-compatible API.
Architecture¶
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ Proxmox Host │ │ Proxmox Host │ │ Windows PC │
│ │ │ │ │ 10.89.97.100 │
│ clhome ────────┼─────┼──────────────────┼────►│ llama.cpp:8080 │
│ │ │ │ │ RTX 3080 (10GB) │
│ claude-local ──┼────►│ proxy:8082 ─────┼────►│ │
└─────────────────┘ └──────────────────┘ └─────────────────────┘
Components:
| Component | Location | Purpose |
|---|---|---|
| llama.cpp | Windows PC (10.89.97.100:8080) | Model inference with GPU |
| claude-code-proxy | Proxmox (:8082) | Translates Anthropic API → OpenAI API |
| clhome | /usr/local/bin/clhome | Quick CLI queries (direct) |
| claude-local | /usr/local/bin/claude-local | Claude Code with local backend |
Quick Start¶
Simple CLI Queries (clhome)¶
# One-shot queries - goes direct to llama.cpp
clhome "explain git rebase in 2 sentences"
# Pipe input
echo "fix this code: def foo(x) return x+1" | clhome -
# Multi-word prompts
clhome write a bash function to check if a port is open
Claude Code with Local LLM (claude-local)¶
# Requires proxy to be running first
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'
# Then use claude-local instead of claude
claude-local
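A quick way to confirm the proxy is up before launching is to hit the same models endpoint used under Service Management below:
# If this returns JSON, the proxy is reachable and claude-local should work
curl -sf http://localhost:8082/v1/models && claude-local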
Service Management¶
Windows llama-server (Auto-starts on boot)¶
# Start/stop/restart (use the full path - nssm is not in PATH via SSH)
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe start llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe stop llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe restart llama-server"
# Check status
ssh jakec@10.89.97.100 "tasklist | findstr llama"
# Test endpoint
curl http://10.89.97.100:8080/v1/models
Proxmox Proxy (Manual start)¶
# Start in tmux
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'
# Attach to view logs
tmux attach -t claude-proxy
# Kill
tmux kill-session -t claude-proxy
# Test endpoint
curl http://localhost:8082/v1/models
Configuration¶
Proxy Configuration¶
File: /root/tools/claude-code-proxy/.env
OPENAI_API_KEY="not-needed"
OPENAI_BASE_URL="http://10.89.97.100:8080/v1"
ANTHROPIC_API_KEY="local-llm"
# All models point to same local model
BIG_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
MIDDLE_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
SMALL_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
HOST="0.0.0.0"
PORT="8082"
REQUEST_TIMEOUT="120"
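To verify the translation path end to end, an Anthropic-style request can be sent straight to the proxy. This is a hedged example: it assumes the proxy exposes the standard Anthropic /v1/messages route (the one Claude Code itself calls) and maps whatever model name is supplied onto the local model configured above.
# Assumes the proxy serves the Anthropic /v1/messages route; model name is mapped to the local model
curl -s http://localhost:8082/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: local-llm" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "claude-3-5-sonnet", "max_tokens": 128, "messages": [{"role": "user", "content": "Say hello"}]}'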
llama.cpp Server Parameters¶
Service command (configured via NSSM):
llama-server.exe
--model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
--host 0.0.0.0
--port 8080
--ctx-size 32768
--n-gpu-layers -1
--threads 8
| Parameter | Value | Description |
|---|---|---|
| --ctx-size | 32768 | Context window (tokens) |
| --n-gpu-layers | -1 | Offload all layers to GPU |
| --threads | 8 | CPU threads for CPU ops |
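For reference, re-registering the service with NSSM would look roughly like this. This is a sketch based on the paths and flags above, run from an elevated prompt on the Windows PC; the existing service was presumably created the same way.
REM Sketch only - assumes the paths and flags documented above
C:\llama.cpp\nssm.exe install llama-server C:\llama.cpp\llama-server.exe
C:\llama.cpp\nssm.exe set llama-server AppDirectory C:\llama.cpp
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe set llama-server Start SERVICE_AUTO_START
C:\llama.cpp\nssm.exe start llama-server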
Directory Structure¶
Windows PC (10.89.97.100):
C:\llama.cpp\
├── llama-server.exe # Main server binary
├── nssm.exe # Service manager
├── start-server.bat # Manual startup script
├── models\
│ └── Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf # 4.68GB model
└── *.dll # CUDA runtime libraries
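start-server.bat is not reproduced in this guide; it presumably just runs the same command line as the service, along these lines (a hypothetical sketch, assuming the paths and flags above):
@echo off
REM Hypothetical sketch of start-server.bat - actual contents may differ
cd /d C:\llama.cpp
llama-server.exe --model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8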
Proxmox Host:
/root/tower-fleet/scripts/llm/
├── clhome # Quick CLI queries
├── claude-local # Claude Code wrapper
└── llama-model # Model management
/root/tools/claude-code-proxy/ # API translation proxy
├── .env # Configuration
├── start-proxy.sh # Startup script
└── venv/ # Python virtual environment
Model Information¶
Current Model: Qwen3-8B-Q4_K_M
| Spec | Value |
|---|---|
| Parameters | 8.19B |
| File Size | 4.8 GB |
| Quantization | Q4_K_M |
| Context Window | 40K tokens |
| VRAM Usage | ~5-6GB |
Available Models (on Windows PC):
| Model | Size | Type | Status |
|---|---|---|---|
| Qwen3-8B-Q4_K_M | 4.8GB | Chat | Current |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | 4.7GB | Chat | Available |
| nomic-embed-text-v1.5.f16 | 274MB | Embedding | Available |
Recommended Models (for RTX 3080 10GB):
| Model | Quant | Size | Best For |
|---|---|---|---|
| Qwen3-8B | Q4_K_M | 4.8GB | Balanced, thinking mode |
| Qwen3-8B | Q5_K_M | 5.8GB | Higher quality |
| Qwen3-4B | Q6_K | 3.3GB | Fast, long context |
| Qwen2.5-Coder-7B | Q5_K_M | 5.5GB | Pure coding focus |
| Qwen2.5-Coder-14B | IQ3_M | 6-7GB | Better code, lower quant |
| Mistral-7B-v0.3 | Q4_K_M | 4.4GB | General purpose |
Performance Expectations¶
- Response time: 5-30 seconds depending on prompt complexity
- Quality: Lower than Claude Sonnet, suitable for simpler tasks
- Context: 40K tokens (Qwen3), 32K tokens (Qwen2.5)
- Best for: Quick queries, simple code generation, explanations
- Qwen3 bonus: Supports thinking mode for step-by-step reasoning
OpenAI API Compatibility¶
The llama-server exposes OpenAI-compatible endpoints that other services can use directly.
Endpoints available:
GET http://10.89.97.100:8080/v1/models
POST http://10.89.97.100:8080/v1/chat/completions
POST http://10.89.97.100:8080/v1/completions
POST http://10.89.97.100:8080/v1/embeddings
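For example, a chat completion can be requested directly with curl; the model field can be any string (see the note below):
curl -s http://10.89.97.100:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Write a one-line bash tip"}],
    "max_tokens": 256
  }' | jq -r '.choices[0].message.content'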
Configuration for other services:
Python example:
from openai import OpenAI
client = OpenAI(
    base_url="http://10.89.97.100:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local",  # Model name is ignored, uses loaded model
    messages=[{"role": "user", "content": "Hello"}]
)
Note: The model name in requests is largely ignored since llama-server loads a single model at startup. You can pass any string.
Swapping Models¶
Quick Swap (From Proxmox)¶
Use the helper script to swap models without manually SSH'ing:
# List available models
llama-model list
# Swap to a different model (stops server, updates config, restarts)
llama-model swap Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Download a new model from HuggingFace
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# Check current model
llama-model current
Manual Swap (SSH to Windows)¶
# SSH to Windows
ssh jakec@10.89.97.100
# Stop service (use full path - nssm not in PATH via SSH)
C:\llama.cpp\nssm.exe stop llama-server
# Update NSSM parameters with new model path
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\NewModel.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"
# Start service
C:\llama.cpp\nssm.exe start llama-server
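Either way, confirm the new model actually loaded once the service is back up:
# From Proxmox - the reported model id should reflect the new GGUF file
curl -s http://10.89.97.100:8080/v1/models | jq '.data[].id'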
Downloading New Models¶
Models are downloaded from HuggingFace. Use GGUF format for llama.cpp.
# From Proxmox (recommended)
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# From Windows (PowerShell) - for manual downloads
cd C:\llama.cpp\models
curl -L -o model-name.gguf "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
Download commands for recommended models:
# Qwen3 (recommended - has thinking mode)
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q4_K_M.gguf"
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q5_K_M.gguf"
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# Qwen2.5 Coder (pure coding focus)
llama-model download "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" "qwen2.5-coder-7b-instruct-q5_k_m.gguf"
llama-model download "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF" "Qwen2.5-Coder-14B-Instruct-IQ3_M.gguf"
# General purpose
llama-model download "mistralai/Mistral-7B-Instruct-v0.3-GGUF" "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"
Context size adjustments: Larger models may need reduced context. Use --ctx-size 8192 or 16384 if OOM errors occur.
Embeddings for RAG¶
Why Swap Models?¶
llama.cpp loads one model at a time. For RAG systems like AnythingLLM:

- Chat needs a large language model (Qwen3-8B)
- Embeddings need a specialized embedding model (nomic-embed-text)
Since embedding is a batch operation (index documents once, query many times), swapping is acceptable.
Embedding Model¶
Model: nomic-embed-text-v1.5.f16.gguf (274MB)
| Spec | Value |
|---|---|
| Parameters | 137M |
| Dimensions | 768 |
| Context | 8192 tokens |
| MTEB Score | ~62% (matches OpenAI text-embedding-3-small) |
Embedding Workflow¶
# Switch to embedding mode for document indexing
llama-model embed-docs
# This will:
# 1. Save current model (e.g., Qwen3-8B)
# 2. Swap to nomic-embed-text with embedding-optimized params
# 3. Wait for you to trigger indexing in AnythingLLM
# 4. Restore original chat model when you press ENTER
Manual Embedding (If Needed)¶
# Swap to embedding model manually
llama-model swap nomic-embed-text-v1.5.f16.gguf
# Generate embeddings via API
curl http://10.89.97.100:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "nomic", "input": "Your text here"}'
# Swap back to chat model
llama-model swap Qwen3-8B-Q4_K_M.gguf
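The response follows the OpenAI embeddings format, so jq can pull the vector out; for example, checking the dimension count (768 expected for nomic-embed-text):
curl -s http://10.89.97.100:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic", "input": "Your text here"}' \
  | jq '.data[0].embedding | length'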
Cloud vs Local Comparison¶
| Model | Type | Quality (MTEB) | Cost |
|---|---|---|---|
| nomic-embed-text-v1.5 | Local | ~62% | Free |
| OpenAI text-embedding-3-small | Cloud | ~62% | $0.02/1M tokens |
| OpenAI text-embedding-3-large | Cloud | ~64% | $0.13/1M tokens |
| Voyage-3 | Cloud | ~67% | $0.06/1M tokens |
Recommendation: Use local nomic-embed-text for AnythingLLM (free, same quality as OpenAI small). Use cloud for real-time applications like Palimpsest.
Troubleshooting¶
Server won't start¶
# Check if port in use
ssh jakec@10.89.97.100 "netstat -an | findstr 8080"
# Check CUDA available
ssh jakec@10.89.97.100 "nvidia-smi"
# View service status (use the full path - nssm not in PATH via SSH); for logs, check Windows Event Viewer
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe status llama-server"
Connection refused from Proxmox¶
# Test network connectivity
ping 10.89.97.100
# Check Windows firewall rule exists
ssh jakec@10.89.97.100 "netsh advfirewall firewall show rule name=\"llama.cpp Server\""
Proxy errors¶
# Check proxy logs
tmux attach -t claude-proxy
# Verify .env configuration
cat /root/tools/claude-code-proxy/.env
# Test direct to llama.cpp (bypassing proxy)
curl http://10.89.97.100:8080/v1/models
Out of memory errors¶
- Reduce context size: edit the NSSM parameters to use --ctx-size 8192 (see the example below)
- Use smaller quantization (Q3_K_M instead of Q4_K_M)
- Check VRAM with nvidia-smi
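For example, dropping to an 8K context follows the same NSSM pattern as a manual model swap (current model and paths assumed from above):
# On the Windows PC (via SSH), update parameters and restart
C:\llama.cpp\nssm.exe stop llama-server
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen3-8B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 8192 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe start llama-server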
Scripts Reference¶
clhome¶
Location: /root/tower-fleet/scripts/llm/clhome → /usr/local/bin/clhome
#!/bin/bash
# Quick LLM queries direct to llama.cpp
API_URL="http://10.89.97.100:8080/v1/chat/completions"

if [ $# -eq 0 ]; then
    echo "Usage: clhome \"your prompt here\""
    exit 1
fi

if [ "$1" = "-" ]; then
    PROMPT=$(cat)
else
    PROMPT="$*"
fi

curl -s "$API_URL" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf\",
        \"messages\": [{\"role\": \"user\", \"content\": $(echo "$PROMPT" | jq -Rs .)}],
        \"max_tokens\": 2048,
        \"temperature\": 0.7
    }" | jq -r '.choices[0].message.content'
claude-local¶
Location: /root/tower-fleet/scripts/llm/claude-local → /usr/local/bin/claude-local
#!/bin/bash
# Claude Code with local LLM backend
export ANTHROPIC_BASE_URL="http://localhost:8082"
export ANTHROPIC_AUTH_TOKEN="local-llm"
exec claude "$@"
start-proxy.sh¶
Location: /root/tools/claude-code-proxy/start-proxy.sh
#!/bin/bash
cd /root/tools/claude-code-proxy
source venv/bin/activate
export $(grep -v '^#' .env | xargs)
python start_proxy.py
llama-model¶
Location: /root/tower-fleet/scripts/llm/llama-model → /usr/local/bin/llama-model
Helper script to manage models on the Windows PC from Proxmox.
# List available models
llama-model list
# Check current model and server status
llama-model current
llama-model status
# Swap to a different model (stops server, updates config, restarts)
llama-model swap DeepSeek-Coder-6.7B-Q4_K_M.gguf
# Download new model from HuggingFace
llama-model download "TheBloke/Mistral-7B-Instruct-v0.2-GGUF" "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
Related Documentation¶
- PROJECTS.md - Quick reference in Infrastructure Tools section
- claude-code-proxy GitHub - Upstream project
- llama.cpp GitHub - Inference engine