
Local LLM Setup

This guide documents the local LLM infrastructure running on the Windows gaming PC with an RTX 3080, providing Claude Code-like functionality without API costs.

Note: This infrastructure uses llama.cpp exclusively. Ollama was previously used but has been deprecated in favor of llama.cpp's simpler single-model architecture and OpenAI-compatible API.

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Proxmox Host   │     │  Proxmox Host    │     │  Windows PC         │
│                 │     │                  │     │  10.89.97.100       │
│  clhome ────────┼─────┼──────────────────┼────►│  llama.cpp:8080     │
│                 │     │                  │     │  RTX 3080 (10GB)    │
│  claude-local ──┼────►│  proxy:8082 ─────┼────►│                     │
└─────────────────┘     └──────────────────┘     └─────────────────────┘

Components:

| Component | Location | Purpose |
|---|---|---|
| llama.cpp | Windows PC (10.89.97.100:8080) | Model inference with GPU |
| claude-code-proxy | Proxmox (:8082) | Translates Anthropic API → OpenAI API |
| clhome | /usr/local/bin/clhome | Quick CLI queries (direct) |
| claude-local | /usr/local/bin/claude-local | Claude Code with local backend |
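
To sanity-check the whole chain from the Proxmox host, each hop can be probed in one go. A minimal sketch, assuming the endpoints listed above:

# Check each hop: llama.cpp directly, then the proxy
curl -sf http://10.89.97.100:8080/v1/models >/dev/null && echo "llama.cpp: up" || echo "llama.cpp: DOWN"
curl -sf http://localhost:8082/v1/models >/dev/null && echo "proxy: up" || echo "proxy: DOWN"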

Quick Start

Simple CLI Queries (clhome)

# One-shot queries go directly to llama.cpp
clhome "explain git rebase in 2 sentences"

# Pipe input
echo "fix this code: def foo(x) return x+1" | clhome -

# Multi-word prompts
clhome write a bash function to check if a port is open

Claude Code with Local LLM (claude-local)

# Requires proxy to be running first
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'

# Then use claude-local instead of claude
claude-local
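
If you are not sure whether the proxy session already exists, a small guard avoids duplicate tmux sessions. A sketch, assuming the session name and script path used above:

# Start the proxy only if it is not already running, then launch claude-local
tmux has-session -t claude-proxy 2>/dev/null || \
  tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'
claude-local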

Service Management

Windows llama-server (Auto-starts on boot)

# Start/stop/restart (use the full path - nssm is not in PATH via SSH)
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe start llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe stop llama-server"
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe restart llama-server"

# Check status
ssh jakec@10.89.97.100 "tasklist | findstr llama"

# Test endpoint
curl http://10.89.97.100:8080/v1/models
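
Listing models only proves the server is up; a small generation request confirms the model is loaded and inference works end to end. A sketch using the OpenAI-compatible chat endpoint (the model field can be any string, see OpenAI API Compatibility below):

# End-to-end inference smoke test
curl -s http://10.89.97.100:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say OK"}], "max_tokens": 8}' \
  | jq -r '.choices[0].message.content'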

Proxmox Proxy (Manual start)

# Start in tmux
tmux new -s claude-proxy -d '/root/tools/claude-code-proxy/start-proxy.sh'

# Attach to view logs
tmux attach -t claude-proxy

# Kill
tmux kill-session -t claude-proxy

# Test endpoint
curl http://localhost:8082/v1/models

Configuration

Proxy Configuration

File: /root/tools/claude-code-proxy/.env

OPENAI_API_KEY="not-needed"
OPENAI_BASE_URL="http://10.89.97.100:8080/v1"
ANTHROPIC_API_KEY="local-llm"

# All model tiers point to the same local model
BIG_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
MIDDLE_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
SMALL_MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"

HOST="0.0.0.0"
PORT="8082"
REQUEST_TIMEOUT="120"
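
To verify the translation layer itself, send an Anthropic-style request to the proxy and check that a completion comes back from the local model. A sketch based on the standard Anthropic Messages API; the /v1/messages path, headers, and model name are assumptions, since the proxy maps any Claude model name onto the BIG/MIDDLE/SMALL entries above:

# Anthropic-style request; the proxy translates it to an OpenAI request upstream
curl -s http://localhost:8082/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: local-llm" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "claude-3-5-sonnet-20241022", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say OK"}]}'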

llama.cpp Server Parameters

Service command (configured via NSSM):

llama-server.exe
  --model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
  --host 0.0.0.0
  --port 8080
  --ctx-size 32768
  --n-gpu-layers -1
  --threads 8

| Parameter | Value | Description |
|---|---|---|
| --ctx-size | 32768 | Context window (tokens) |
| --n-gpu-layers | -1 | Offload all layers to GPU |
| --threads | 8 | CPU threads for CPU ops |
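
For reference, the service could be (re)registered with NSSM roughly as follows, from an elevated prompt on the Windows PC. A sketch: the AppParameters line matches the command above, while the log and startup settings are assumptions:

C:\llama.cpp\nssm.exe install llama-server C:\llama.cpp\llama-server.exe
C:\llama.cpp\nssm.exe set llama-server AppDirectory C:\llama.cpp
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe set llama-server AppStdout C:\llama.cpp\llama-server.log
C:\llama.cpp\nssm.exe set llama-server Start SERVICE_AUTO_START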

Directory Structure

Windows PC (10.89.97.100):
C:\llama.cpp\
├── llama-server.exe      # Main server binary
├── nssm.exe              # Service manager
├── start-server.bat      # Manual startup script
├── models\
│   └── Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf  # 4.68GB model
└── *.dll                 # CUDA runtime libraries

Proxmox Host:
/root/tower-fleet/scripts/llm/
├── clhome                # Quick CLI queries
├── claude-local          # Claude Code wrapper
└── llama-model           # Model management

/root/tools/claude-code-proxy/    # API translation proxy
├── .env                          # Configuration
├── start-proxy.sh                # Startup script
└── venv/                         # Python virtual environment

Model Information

Current Model: Qwen3-8B-Q4_K_M

| Spec | Value |
|---|---|
| Parameters | 8.19B |
| File Size | 4.8 GB |
| Quantization | Q4_K_M |
| Context Window | 40K tokens |
| VRAM Usage | ~5-6GB |

Available Models (on Windows PC):

| Model | Size | Type | Status |
|---|---|---|---|
| Qwen3-8B-Q4_K_M | 4.8GB | Chat | Current |
| Qwen2.5-Coder-7B-Instruct-Q4_K_M | 4.7GB | Chat | Available |
| nomic-embed-text-v1.5.f16 | 274MB | Embedding | Available |

Recommended Models (for RTX 3080 10GB):

| Model | Quant | Size | Best For |
|---|---|---|---|
| Qwen3-8B | Q4_K_M | 4.8GB | Balanced, thinking mode |
| Qwen3-8B | Q5_K_M | 5.8GB | Higher quality |
| Qwen3-4B | Q6_K | 3.3GB | Fast, long context |
| Qwen2.5-Coder-7B | Q5_K_M | 5.5GB | Pure coding focus |
| Qwen2.5-Coder-14B | IQ3_M | 6-7GB | Better code, lower quant |
| Mistral-7B-v0.3 | Q4_K_M | 4.4GB | General purpose |

Performance Expectations

  • Response time: 5-30 seconds depending on prompt complexity
  • Quality: Lower than Claude Sonnet, suitable for simpler tasks
  • Context: 40K tokens (Qwen3), 32K tokens (Qwen2.5)
  • Best for: Quick queries, simple code generation, explanations
  • Qwen3 bonus: Supports thinking mode for step-by-step reasoning (see the example below)
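
Qwen3's thinking mode can reportedly be toggled per request with soft switches in the prompt; whether this works depends on the chat template embedded in the GGUF, so treat it as a hedged example rather than a guarantee:

# Ask for a direct answer without the step-by-step reasoning block
clhome "/no_think explain git rebase in 2 sentences"

# Explicitly request reasoning
clhome "/think why does this regex backtrack badly: (a+)+$"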

OpenAI API Compatibility

The llama-server exposes OpenAI-compatible endpoints that other services can use directly.

Endpoints available:

GET  http://10.89.97.100:8080/v1/models
POST http://10.89.97.100:8080/v1/chat/completions
POST http://10.89.97.100:8080/v1/completions
POST http://10.89.97.100:8080/v1/embeddings

Configuration for other services:

OPENAI_API_KEY=not-needed        # Any non-empty string works
OPENAI_BASE_URL=http://10.89.97.100:8080/v1

Python example:

from openai import OpenAI

client = OpenAI(
    base_url="http://10.89.97.100:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local",  # Model name is ignored, uses loaded model
    messages=[{"role": "user", "content": "Hello"}]
)

Note: The model name in requests is largely ignored since llama-server loads a single model at startup. You can pass any string.
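
Streaming works the same way as with the OpenAI API. A sketch using curl (the -N flag disables output buffering so tokens appear as they arrive):

curl -N http://10.89.97.100:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "stream": true, "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]}'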

Swapping Models

Quick Swap (From Proxmox)

Use the helper script to swap models without manually SSH'ing:

# List available models
llama-model list

# Swap to a different model (stops server, updates config, restarts)
llama-model swap Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

# Download a new model from HuggingFace
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"

# Check current model
llama-model current

Manual Swap (SSH to Windows)

# SSH to Windows
ssh jakec@10.89.97.100

# Stop service (use full path - nssm not in PATH via SSH)
C:\llama.cpp\nssm.exe stop llama-server

# Update NSSM parameters with new model path
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\NewModel.gguf --host 0.0.0.0 --port 8080 --ctx-size 32768 --n-gpu-layers -1 --threads 8"

# Start service
C:\llama.cpp\nssm.exe start llama-server

Downloading New Models

Models are downloaded from HuggingFace. Use GGUF format for llama.cpp.

# From Proxmox (recommended)
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"

# From Windows (PowerShell) - for manual downloads
cd C:\llama.cpp\models
curl -L -o model-name.gguf "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"

Download commands for recommended models:

# Qwen3 (recommended - has thinking mode)
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q4_K_M.gguf"
llama-model download "Qwen/Qwen3-8B-GGUF" "Qwen3-8B-Q5_K_M.gguf"
llama-model download "Qwen/Qwen3-4B-GGUF" "Qwen3-4B-Q6_K.gguf"

# Qwen2.5 Coder (pure coding focus)
llama-model download "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF" "qwen2.5-coder-7b-instruct-q5_k_m.gguf"
llama-model download "bartowski/Qwen2.5-Coder-14B-Instruct-GGUF" "Qwen2.5-Coder-14B-Instruct-IQ3_M.gguf"

# General purpose
llama-model download "mistralai/Mistral-7B-Instruct-v0.3-GGUF" "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf"

Context size adjustments: Larger models may need reduced context. Use --ctx-size 8192 or 16384 if OOM errors occur.

Embeddings for RAG

Why Swap Models?

llama.cpp loads one model at a time. For RAG systems like AnythingLLM:

  • Chat needs a large language model (Qwen3-8B)
  • Embeddings need a specialized embedding model (nomic-embed-text)

Since embedding is a batch operation (index documents once, query many times), swapping is acceptable.

Embedding Model

Model: nomic-embed-text-v1.5.f16.gguf (274MB)

| Spec | Value |
|---|---|
| Parameters | 137M |
| Dimensions | 768 |
| Context | 8192 tokens |
| MTEB Score | ~62% (matches OpenAI text-embedding-3-small) |
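
Once the embedding model is loaded (see the workflows below), the dimensionality can be sanity-checked against the table above. A sketch using jq:

# Should print 768
curl -s http://10.89.97.100:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic", "input": "hello world"}' \
  | jq '.data[0].embedding | length'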

Embedding Workflow

# Switch to embedding mode for document indexing
llama-model embed-docs

# This will:
# 1. Save current model (e.g., Qwen3-8B)
# 2. Swap to nomic-embed-text with embedding-optimized params
# 3. Wait for you to trigger indexing in AnythingLLM
# 4. Restore original chat model when you press ENTER

Manual Embedding (If Needed)

# Swap to embedding model manually
llama-model swap nomic-embed-text-v1.5.f16.gguf

# Generate embeddings via API
curl http://10.89.97.100:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic", "input": "Your text here"}'

# Swap back to chat model
llama-model swap Qwen3-8B-Q4_K_M.gguf

Cloud vs Local Comparison

| Model | Type | Quality (MTEB) | Cost |
|---|---|---|---|
| nomic-embed-text-v1.5 | Local | ~62% | Free |
| OpenAI text-embedding-3-small | Cloud | ~62% | $0.02/1M tokens |
| OpenAI text-embedding-3-large | Cloud | ~64% | $0.13/1M tokens |
| Voyage-3 | Cloud | ~67% | $0.06/1M tokens |

Recommendation: Use local nomic-embed-text for AnythingLLM (free, same quality as OpenAI small). Use cloud for real-time applications like Palimpsest.

Troubleshooting

Server won't start

# Check if port in use
ssh jakec@10.89.97.100 "netstat -an | findstr 8080"

# Check CUDA available
ssh jakec@10.89.97.100 "nvidia-smi"

# Check service status (use the full path - nssm is not in PATH via SSH; logs are in the Windows Event Viewer)
ssh jakec@10.89.97.100 "C:\llama.cpp\nssm.exe status llama-server"

Connection refused from Proxmox

# Test network connectivity
ping 10.89.97.100

# Check Windows firewall rule exists
ssh jakec@10.89.97.100 "netsh advfirewall firewall show rule name=\"llama.cpp Server\""

Proxy errors

# Check proxy logs
tmux attach -t claude-proxy

# Verify .env configuration
cat /root/tools/claude-code-proxy/.env

# Test direct to llama.cpp (bypassing proxy)
curl http://10.89.97.100:8080/v1/models

Out of memory errors

  • Reduce context size: edit NSSM parameters to use --ctx-size 8192 (see the example below)
  • Use smaller quantization (Q3_K_M instead of Q4_K_M)
  • Check VRAM with nvidia-smi
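
For example, dropping the context window to 8192 while keeping the rest of the command intact, mirroring the Manual Swap steps above (a sketch; adjust the model filename to whatever is currently loaded):

# From the Windows PC (or via SSH with the full nssm path)
C:\llama.cpp\nssm.exe stop llama-server
C:\llama.cpp\nssm.exe set llama-server AppParameters "--model C:\llama.cpp\models\Qwen3-8B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --ctx-size 8192 --n-gpu-layers -1 --threads 8"
C:\llama.cpp\nssm.exe start llama-server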

Scripts Reference

clhome

Location: /root/tower-fleet/scripts/llm/clhome (installed to /usr/local/bin/clhome)

#!/bin/bash
# Quick LLM queries direct to llama.cpp

API_URL="http://10.89.97.100:8080/v1/chat/completions"

if [ $# -eq 0 ]; then
    echo "Usage: clhome \"your prompt here\""
    exit 1
fi

if [ "$1" = "-" ]; then
    PROMPT=$(cat)
else
    PROMPT="$*"
fi

curl -s "$API_URL" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf\",
    \"messages\": [{\"role\": \"user\", \"content\": $(echo "$PROMPT" | jq -Rs .)}],
    \"max_tokens\": 2048,
    \"temperature\": 0.7
  }" | jq -r '.choices[0].message.content'

claude-local

Location: /root/tower-fleet/scripts/llm/claude-local (installed to /usr/local/bin/claude-local)

#!/bin/bash
# Claude Code with local LLM backend

export ANTHROPIC_BASE_URL="http://localhost:8082"
export ANTHROPIC_AUTH_TOKEN="local-llm"

exec claude "$@"

start-proxy.sh

Location: /root/tools/claude-code-proxy/start-proxy.sh

#!/bin/bash
cd /root/tools/claude-code-proxy
source venv/bin/activate
export $(grep -v '^#' .env | xargs)
python start_proxy.py

llama-model

Location: /root/tower-fleet/scripts/llm/llama-model (installed to /usr/local/bin/llama-model)

Helper script to manage models on the Windows PC from Proxmox.

# List available models
llama-model list

# Check current model and server status
llama-model current
llama-model status

# Swap to a different model (stops server, updates config, restarts)
llama-model swap DeepSeek-Coder-6.7B-Q4_K_M.gguf

# Download new model from HuggingFace
llama-model download "TheBloke/Mistral-7B-Instruct-v0.2-GGUF" "mistral-7b-instruct-v0.2.Q4_K_M.gguf"