Local LLM Setup¶
This guide documents the local LLM infrastructure running on the Windows gaming PC with an RTX 3080, providing Claude Code-like functionality without API costs.
Note: This infrastructure uses llama.cpp exclusively. Ollama was previously used but has been deprecated in favor of llama.cpp's simpler single-model architecture and OpenAI-compatible API.
Architecture¶
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────────┐
│ Proxmox Host │ │ Proxmox Host │ │ Windows PC │
│ │ │ │ │ 10.89.97.100 │
│ clhome ────────┼─────┼──────────────────┼────►│ llama.cpp:8080 │
│ │ │ │ │ RTX 3080 (10GB) │
│ claude-local ──┼────►│ proxy:8082 ─────┼────►│ │
└─────────────────┘ └──────────────────┘ └─────────────────────┘
Components:
| Component | Location | Purpose |
|---|---|---|
| llama.cpp | Windows PC (10.89.97.100:8080) | Model inference with GPU |
| claude-code-proxy | Proxmox (:8082) | Translates Anthropic API → OpenAI API |
| clhome | /usr/local/bin/clhome | Quick CLI queries (direct) |
| claude-local | /usr/local/bin/claude-local | Claude Code with local backend |
Quick Start¶
Simple CLI Queries (clhome)¶
# One-shot queries - goes direct to llama.cpp
clhome "explain git rebase in 2 sentences"
# Pipe input
echo "fix this code: def foo(x) return x+1" | clhome -
# Multi-word prompts
clhome write a bash function to check if a port is open
Claude Code with Local LLM (claude-local)¶
# Requires proxy to be running first
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'
# Then use claude-local instead of claude
claude-local
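A quick way to confirm the proxy is up before launching is to hit the same models endpoint used under Service Management below:
# If this returns JSON, the proxy is reachable and claude-local should work
curl -sf http://localhost:8082/v1/models && claude-local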
Service Management¶
Windows llama-server (Auto-starts on boot)¶
# Start/stop/restart (use the full path - nssm is not in PATH via SSH)
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe start llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe stop llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe restart llama-server"
# Check status
ssh jakec@10.89.97.100 "tasklist | findstr llama"
# Test endpoint
curl http://10.89.97.100:8080/v1/models
Proxmox Proxy (Manual start)¶
# Start in tmux
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'
# Attach to view logs
tmux attach -t claude-proxy
# Kill
tmux kill-session -t claude-proxy
# Test endpoint
curl http://localhost:8082/v1/models
Configuration¶
Proxy Configuration¶
File: /root/tools/claude-code-proxy/.env
OPENAI_API_KEY="not-needed"
OPENAI_BASE_URL="http://10.89.97.100:8080/v1"
ANTHROPIC_API_KEY="local-llm"
# All models point to same local model
BIG_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
MIDDLE_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
SMALL_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
HOST="0.0.0.0"
PORT="8082"
REQUEST_TIMEOUT="120"
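To verify the translation path end to end, an Anthropic-style request can be sent straight to the proxy. This is a hedged example: it assumes the proxy exposes the standard Anthropic /v1/messages route (the one Claude Code itself calls) and maps whatever model name is supplied onto the local model configured above.
# Assumes the proxy serves the Anthropic /v1/messages route; model name is mapped to the local model
curl -s http://localhost:8082/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: local-llm" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "claude-3-5-sonnet", "max_tokens": 128, "messages": [{"role": "user", "content": "Say hello"}]}'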
llama.cpp Server Parameters¶
Service command (configured via NSSM):
llama-server.exe
--model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
--host 0.0.0.0
--port 8080
--ctx-size 32768
--n-gpu-layers -1
--threads 8
| Parameter | Value | Description |
|---|---|---|
| --ctx-size | 32768 | Context window (tokens) |
| --n-gpu-layers | -1 | Offload all layers to GPU |
| --threads | 8 | CPU threads for CPU ops |
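For reference, re-registering the service with NSSM would look roughly like this. This is a sketch based on the paths and flags above, run from an elevated prompt on the Windows PC; the existing service was presumably created the same way.
REM Sketch only - assumes the paths and flags documented above
C:\llama.cpp\nssm.exe install llama-server C:\llama.cpp\llama-server.exe
C:\llama.cpp\nssm.exe set llama-server AppDirectory C:\llama.cpp
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe set llama-server Start SERVICE_AUTO_START
C:\llama.cpp\nssm.exe start llama-server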
Directory Structure¶
Windows PC (10.89.97.100):
C:\llama.cpp\
├── llama-server.exe # Main server binary
├── nssm.exe # Service manager
├── start-server.bat # Manual startup script
├── models\
│ └── Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf # 4.68GB model
└── *.dll # CUDA runtime libraries
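start-server.bat is not reproduced in this guide; it presumably just runs the same command line as the service, along these lines (a hypothetical sketch, assuming the paths and flags above):
@echo off
REM Hypothetical sketch of start-server.bat - actual contents may differ
cd /d C:\llama.cpp
llama-server.exe --model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8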
Proxmox Host:
/root/tower-fleet/scripts/llm/
├── clhome # Quick CLI queries
├── claude-local # Claude Code wrapper
└── llama-model # Model management
/root/tools/claude-code-proxy/ # API translation proxy
├── .env # Configuration
├── start-proxy.sh # Startup script
└── venv/ # Python virtual environment
Model Information¶
Current Model: Qwen3-8B-Q4_K_M
| Spec | Value |
|---|---|
| Parameters | 8.19B |
| File Size | 4.8 GB |
| Quantization | Q4_K_M |
| Context Window | 40K tokens |
| VRAM Usage | ~5-6GB |
Available Models (on Windows PC):
| Model | Size | Type | Status |
|---|---|---|---|
| Qwen3-8B-Q4_K_M | 4.8GB | Chat | Current |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | 4.7GB | Chat | Available |
| nomic-embed-text-v1.5.f16 | 274MB | Embedding | Available |
Recommended Models (for RTX 3080 10GB):
| Model | Quant | Size | Best For |
|---|---|---|---|
| Qwen3-8B | Q4_K_M | 4.8GB | Balanced, thinking mode |
| Qwen3-8B | Q5_K_M | 5.8GB | Higher quality |
| Qwen3-4B | Q6_K | 3.3GB | Fast, long context |
| Qwen2.5-Coder-7B | Q5_K_M | 5.5GB | Pure coding focus |
| Qwen2.5-Coder-14B | IQ3_M | 6-7GB | Better code, lower quant |
| Mistral-7B-v0.3 | Q4_K_M | 4.4GB | General purpose |
Performance Expectations¶
- Response time: 5-30 seconds depending on prompt complexity
- Quality: Lower than Claude Sonnet, suitable for simpler tasks
- Context: 40K tokens (Qwen3), 32K tokens (Qwen2.5)
- Best for: Quick queries, simple code generation, explanations
- Qwen3 bonus: Supports thinking mode for step-by-step reasoning
OpenAI API Compatibility¶
The llama-server exposes OpenAI-compatible endpoints that other services can use directly.
Endpoints available:
GET http://10.89.97.100:8080/v1/models
POST http://10.89.97.100:8080/v1/chat/completions
POST http://10.89.97.100:8080/v1/completions
POST http://10.89.97.100:8080/v1/embeddings
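For example, a chat completion can be requested directly with curl; the model field can be any string (see the note below):
curl -s http://10.89.97.100:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Write a one-line bash tip"}],
    "max_tokens": 256
  }' | jq -r '.choices[0].message.content'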
Configuration for other services:
Python example:
from openai import OpenAI
client = OpenAI(
    base_url="http://10.89.97.100:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local",  # Model name is ignored, uses loaded model
    messages=[{"role": "user", "content": "Hello"}]
)
Note: The model name in requests is largely ignored since llama-server loads a single model at startup. You can pass any string.
Swapping Models¶
Quick Swap (From Proxmox)¶
Use the helper script to swap models without manually SSH'ing:
# List available models
llama-model list
# Swap to a different model (stops server, updates config, restarts)
llama-model swap Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
# Download a new model from HuggingFace
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# Check current model
llama-model current
Manual Swap (SSH to Windows)¶
# SSH to Windows
ssh jakec@10.89.97.100
# Stop service (use full path - nssm not in PATH via SSH)
C:\llama.cpp\nssm.exe stop llama-server
# Update NSSM parameters with new model path
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\NewModel.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"
# Start service
C:\llama.cpp\nssm.exe start llama-server
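Either way, confirm the new model actually loaded once the service is back up:
# From Proxmox - the reported model id should reflect the new GGUF file
curl -s http://10.89.97.100:8080/v1/models | jq '.data[].id'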
Downloading New Models¶
Models are downloaded from HuggingFace. Use GGUF format for llama.cpp.
# From Proxmox (recommended)
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# From Windows (PowerShell) - for manual downloads
cd C:\llama.cpp\models
curl -L -o model-name.gguf "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
Download commands for recommended models:
# Qwen3 (recommended - has thinking mode)
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q4_K_M.gguf"
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q5_K_M.gguf"
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"
# Qwen2.5 Coder (pure coding focus)
llama-model download "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" "qwen2.5-coder-7b-instruct-q5_k_m.gguf"
llama-model download "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF" "Qwen2.5-Coder-14B-Instruct-IQ3_M.gguf"
# General purpose
llama-model download "mistralai/Mistral-7B-Instruct-v0.3-GGUF" "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"
Context size adjustments: Larger models may need reduced context. Use --ctx-size 8192 or 16384 if OOM errors occur.
Embeddings for RAG¶
Why Swap Models?¶
llama.cpp loads one model at a time. For RAG systems like AnythingLLM:

- Chat needs a large language model (Qwen3-8B)
- Embeddings need a specialized embedding model (nomic-embed-text)
Since embedding is a batch operation (index documents once, query many times), swapping is acceptable.
Embedding Model¶
Model: nomic-embed-text-v1.5.f16.gguf (274MB)
| Spec | Value |
|---|---|
| Parameters | 137M |
| Dimensions | 768 |
| Context | 8192 tokens |
| MTEB Score | ~62% (matches OpenAI text-embedding-3-small) |
Embedding Workflow¶
# Switch to embedding mode for document indexing
llama-model embed-docs
# This will:
# 1. Save current model (e.g., Qwen3-8B)
# 2. Swap to nomic-embed-text with embedding-optimized params
# 3. Wait for you to trigger indexing in AnythingLLM
# 4. Restore original chat model when you press ENTER
Manual Embedding (If Needed)¶
# Swap to embedding model manually
llama-model swap nomic-embed-text-v1.5.f16.gguf
# Generate embeddings via API
curl http://10.89.97.100:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "nomic", "input": "Your text here"}'
# Swap back to chat model
llama-model swap Qwen3-8B-Q4_K_M.gguf
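The response follows the OpenAI embeddings format, so jq can pull the vector out; for example, checking the dimension count (768 expected for nomic-embed-text):
curl -s http://10.89.97.100:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic", "input": "Your text here"}' \
  | jq '.data[0].embedding | length'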
Cloud vs Local Comparison¶
| Model | Type | Quality (MTEB) | Cost |
|---|---|---|---|
| nomic-embed-text-v1.5 | Local | ~62% | Free |
| OpenAI text-embedding-3-small | Cloud | ~62% | $0.02/1M tokens |
| OpenAI text-embedding-3-large | Cloud | ~64% | $0.13/1M tokens |
| Voyage-3 | Cloud | ~67% | $0.06/1M tokens |
Recommendation: Use local nomic-embed-text for AnythingLLM (free, same quality as OpenAI small). Use cloud for real-time applications like Palimpsest.
Troubleshooting¶
Server won't start¶
# Check if port in use
ssh jakec@10.89.97.100 "netstat -an | findstr 8080"
# Check CUDA available
ssh jakec@10.89.97.100 "nvidia-smi"
# View service status (use the full path - nssm not in PATH via SSH); for logs, check Windows Event Viewer
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe status llama-server"
Connection refused from Proxmox¶
# Test network connectivity
ping 10.89.97.100
# Check Windows firewall rule exists
ssh jakec@10.89.97.100 "netsh advfirewall firewall show rule name=\"llama.cpp Server\""
Proxy errors¶
# Check proxy logs
tmux attach -t claude-proxy
# Verify .env configuration
cat /root/tools/claude-code-proxy/.env
# Test direct to llama.cpp (bypassing proxy)
curl http://10.89.97.100:8080/v1/models
Out of memory errors¶
- Reduce context size: edit the NSSM parameters to use --ctx-size 8192 (see the example below)
- Use smaller quantization (Q3_K_M instead of Q4_K_M)
- Check VRAM with nvidia-smi
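For example, dropping to an 8K context follows the same NSSM pattern as a manual model swap (current model and paths assumed from above):
# On the Windows PC (via SSH), update parameters and restart
C:\llama.cpp\nssm.exe stop llama-server
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen3-8B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 8192 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe start llama-server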
Scripts Reference¶
clhome¶
Location: /root/tower-fleet/scripts/llm/clhome → /usr/local/bin/clhome
#!/bin/bash
# Quick LLM queries direct to llama.cpp
API_URL="http://10.89.97.100:8080/v1/chat/completions"

if [ $# -eq 0 ]; then
    echo "Usage: clhome \"your prompt here\""
    exit 1
fi

if [ "$1" = "-" ]; then
    PROMPT=$(cat)
else
    PROMPT="$*"
fi

curl -s "$API_URL" \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf\",
        \"messages\": [{\"role\": \"user\", \"content\": $(echo "$PROMPT" | jq -Rs .)}],
        \"max_tokens\": 2048,
        \"temperature\": 0.7
    }" | jq -r '.choices[0].message.content'
claude-local¶
Location: /root/tower-fleet/scripts/llm/claude-local → /usr/local/bin/claude-local
#!/bin/bash
# Claude Code with local LLM backend
export ANTHROPIC_BASE_URL="http://localhost:8082"
export ANTHROPIC_AUTH_TOKEN="local-llm"
exec claude "$@"
start-proxy.sh¶
Location: /root/tools/claude-code-proxy/start-proxy.sh
#!/bin/bash
cd /root/tools/claude-code-proxy
source venv/bin/activate
export $(grep -v '^#' .env | xargs)
python start_proxy.py
llama-model¶
Location: /root/tower-fleet/scripts/llm/llama-model → /usr/local/bin/llama-model
Helper script to manage models on the Windows PC from Proxmox.
# List available models
llama-model list
# Check current model and server status
llama-model current
llama-model status
# Swap to a different model (stops server, updates config, restarts)
llama-model swap DeepSeek-Coder-6.7B-Q4_K_M.gguf
# Download new model from HuggingFace
llama-model download "TheBloke/Mistral-7B-Instruct-v0.2-GGUF" "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
Related Documentation¶
- PROJECTS.md - Quick reference in Infrastructure Tools section
- claude-code-proxy GitHub - Upstream project
- llama.cpp GitHub - Inference engine