Self-Hosted LLM Guide: Run DeepSeek and Qwen Locally
DeepSeek R1 shocked the AI world in early 2025 by matching o1-level reasoning at a fraction of the training cost. Qwen 2.5 and Qwen3 from Alibaba brought frontier-class coding ability to open-weight models. Both are available under permissive licenses. Both run locally via Ollama. And both eliminate the API costs and privacy concerns of using cloud LLM services.
This guide covers hardware requirements for every model size, exact Ollama commands, cost comparison vs. cloud APIs, and what to actually expect from local inference in 2026.
For reasoning tasks (math, coding, logic): DeepSeek R1 8B or 32B depending on your hardware.
For coding specifically: Qwen2.5-Coder-32B on a 24GB GPU is currently the best local coding model — matches GPT-4o-mini on most benchmarks.
Budget hardware: DeepSeek R1 Distill 7B or Qwen3 8B run on consumer GPUs from 2019–2020.
No GPU: 7B models on Apple Silicon (M1+) are fully usable for real work.
Privacy: No data leaves your machine. Prompts containing code, customer data, medical records, or proprietary information stay local.
Cost at scale: GPT-4o at $2.50/1M input tokens adds up. 1M tokens/day = $75/month. Self-hosted: $0 marginal cost after hardware.
No rate limits: Commercial APIs throttle requests. Local inference is limited only by your GPU/CPU.
Latency: With a good GPU, local 7B models respond in <200ms for short prompts.
Offline capability: Works without internet. Useful in air-gapped environments, travel, or unreliable connections.
Experimentation: Try 20 different models in an afternoon without billing anxiety.
Model files are distributed in quantized formats that trade quality for size and speed. The key formats you'll encounter:
| Format | Size reduction | Quality loss | Best for |
|---|---|---|---|
| Q4_K_M | ~75% vs FP16 | Minimal | Default choice; best quality-per-GB |
| Q4_0 | ~75% vs FP16 | Slight | Faster than Q4_K_M, marginally lower quality |
| Q8_0 | ~50% vs FP16 | Negligible | When you have VRAM to spare |
| FP16 | No compression | None | Full quality; requires large VRAM |
| GGUF | Varies | Varies | Container file format used by Ollama/llama.cpp; holds any quant level above |
For daily use, Q4_K_M is the right default. Quality is nearly indistinguishable from full precision for conversational and coding tasks.
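As a rough illustration, a model's memory footprint can be estimated from its parameter count and bits per weight. This is a back-of-the-envelope sketch, not an Ollama figure; the ~20% overhead factor for KV cache and runtime buffers is my assumption:

```python
def estimate_vram_gib(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 2**30 * overhead

# Q4_K_M averages roughly 4.5-4.8 bits per weight
print(f"7B  @ Q4: ~{estimate_vram_gib(7, 4.7):.1f} GiB")   # → ~4.6 GiB
print(f"32B @ Q4: ~{estimate_vram_gib(32, 4.7):.1f} GiB")  # → ~21.0 GiB
```

The estimates line up with the tiers below: a Q4 7B model fits an 8GB card with room for context, and a Q4 32B model just fits in 24GB.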
7B–8B models
Requirement: 6–8GB VRAM or 8–16GB RAM (CPU)
GPU options: RTX 3060 12GB, RTX 4060 8GB, RX 6700 XT, M1/M2/M3 MacBook
ollama pull deepseek-r1:7b # DeepSeek R1 Distill 7B
ollama pull qwen2.5-coder:7b # Qwen 2.5 Coder 7B
ollama pull qwen3:8b # Qwen3 8B
Expected speed: 30–80 t/s on GPU, 5–15 t/s on CPU (M2 MacBook gets ~25–40 t/s)
Real-world capability: Solid for Q&A, summarization, basic coding, classification. Reasoning quality is noticeably below GPT-4o for complex multi-step problems.
14B–32B models
Requirement: 12–24GB VRAM or 32GB RAM
GPU options: RTX 3090 (24GB), RTX 4090 (24GB), RTX 4080 (16GB for 14B), M2/M3 Max/Ultra
ollama pull deepseek-r1:14b # DeepSeek R1 Distill 14B
ollama pull deepseek-r1:32b # DeepSeek R1 Distill 32B (needs 24GB VRAM)
ollama pull qwen2.5-coder:32b # Qwen 2.5 Coder 32B — best local coding model
ollama pull qwen2.5:14b # Qwen 2.5 14B general
Expected speed: 20–60 t/s on RTX 4090
Real-world capability: 32B models are where local LLMs become genuinely impressive. Qwen2.5-Coder-32B benchmarks at GPT-4o-mini level on coding tasks. DeepSeek R1 32B handles multi-step reasoning that smaller models struggle with.
70B+ models
Requirement: 2× 24GB VRAM (2× RTX 3090/4090) or 64GB+ unified memory (M2 Ultra, M3 Ultra)
GPU options: Dual RTX 4090 (NVLink not required), M2/M3 Ultra Mac Studio
ollama pull deepseek-r1:70b # DeepSeek R1 70B
ollama pull qwen2.5:72b # Qwen 2.5 72B
ollama pull llama3.3:70b # Meta Llama 3.3 70B (general)
Expected speed: 20–40 t/s on dual RTX 4090 (offloads layers across GPUs automatically)
Real-world capability: Near-frontier reasoning quality. The quality gap vs. GPT-4o is minimal for most tasks.
DeepSeek V3/R1 full 671B models require 8× H100 80GB or equivalent — not consumer hardware. The distilled models above (7B–70B) are Qwen and Llama checkpoints fine-tuned on R1's reasoning traces, and they are the practical choice for local deployment.
| GPU | VRAM | Max Model (Q4_K_M) |
|---|---|---|
| RTX 4060 | 8GB | 7B |
| RTX 3060 12GB | 12GB | 13B |
| RTX 4080 | 16GB | 13–14B |
| RTX 3090 / 4090 | 24GB | 32B |
| 2× RTX 4090 | 48GB | 70B |
| M2 Max (32GB unified) | 32GB | 32B |
| M3 Ultra (512GB unified) | 512GB | 671B (quantized) |
Apple Silicon's unified memory architecture means the GPU and CPU share the same memory pool — an M2 Max with 32GB can run 30B models that would need a dedicated 24GB VRAM GPU on a Windows machine.
# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run DeepSeek R1
ollama run deepseek-r1:14b
# With explicit thinking visible
# DeepSeek R1 shows its chain-of-thought in <think> tags
DeepSeek R1's reasoning chains are displayed by default — you see the model "thinking" through problems step by step. This is useful for debugging and understanding the model's approach, not just the final answer.
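When consuming R1 output programmatically, it is often useful to separate the `<think>` chain-of-thought from the final answer. A minimal sketch (`split_reasoning` is a name of my choosing, not an Ollama API):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (chain_of_thought, final_answer) from DeepSeek R1 output."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return "", text.strip()  # no thinking block present
    thought = match.group(1).strip()
    answer = text[match.end():].strip()
    return thought, answer

raw = "<think>2+2 is basic arithmetic.</think>The answer is 4."
thought, answer = split_reasoning(raw)
print(answer)  # → The answer is 4.
```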
# Chat with DeepSeek R1 (requires: pip install ollama)
import ollama

response = ollama.chat(
model='deepseek-r1:14b',
messages=[
{
'role': 'user',
'content': 'Implement a binary search tree in Python with insert, search, and delete methods.'
}
]
)
print(response['message']['content'])
# Or with curl (OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1:14b",
"messages": [{"role": "user", "content": "Explain recursion with a simple example"}]
}'
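The same OpenAI-compatible endpoint can be called from Python with nothing but the standard library. A sketch assuming Ollama is serving on the default port 11434; `build_chat_request` and `chat` are illustrative helper names, not part of any SDK:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(model: str, prompt: str) -> str:
    """Send the request and extract the assistant's reply."""
    with request.urlopen(build_chat_request(model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("deepseek-r1:14b", "Explain recursion with a simple example"))
```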
Qwen2.5-Coder-32B is the recommended local model for software development in 2026. It supports 100+ programming languages, fill-in-the-middle completion, and long-context code understanding.
# Pull the coding model
ollama pull qwen2.5-coder:32b
# Or the 7B version for constrained hardware
ollama pull qwen2.5-coder:7b
Continue is a VS Code/JetBrains extension that connects to Ollama for local AI coding assistance:
{
"models": [
{
"title": "Qwen2.5-Coder 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5-Coder 7B (fast)",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Use the 32B model for chat/explanations and the 7B for tab autocomplete (autocomplete needs to be fast — 7B delivers better latency).
Assumptions: 500K tokens/day, 50% input / 50% output, 22 working days/month = 11M tokens/month
| Provider | Cost | Notes |
|---|---|---|
| GPT-4o | $2.50/$10 per 1M | ~$69/month |
| Claude Sonnet 4.6 | $3/$15 per 1M | ~$99/month |
| Groq (Llama 3.3 70B) | $0.59/$0.79 per 1M | ~$7.60/month |
| Qwen2.5-Coder 32B local | $0/month | Hardware amortized |
Hardware payback period: at the table's rates, an $1,800 RTX 4090 offsets a ~$69/month GPT-4o bill in roughly two years at 500K tokens/day, proportionally faster at higher volume or when the GPU is shared.
The math favors local when: you use AI heavily (>500K tokens/day), have an existing GPU, or already have a desktop workstation where GPU cost is shared with gaming/other work.
Cloud wins for low-volume: If you're doing 50K tokens/day, Groq at roughly $0.76/month is unbeatable.
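The break-even arithmetic behind these numbers can be sketched in a few lines. Prices and the 22-working-day month come from the table above; the $1,800 RTX 4090 price is an assumption:

```python
def monthly_api_cost(tokens_per_day: float, in_price: float, out_price: float,
                     in_share: float = 0.5, days: int = 22) -> float:
    """Monthly API spend in dollars, given per-1M-token input/output prices."""
    monthly_tokens = tokens_per_day * days
    return (monthly_tokens * in_share * in_price
            + monthly_tokens * (1 - in_share) * out_price) / 1e6

def payback_months(hardware_cost: float, monthly_savings: float) -> float:
    """Months until local hardware pays for itself vs. an API bill."""
    return hardware_cost / monthly_savings

gpt4o = monthly_api_cost(500_000, 2.50, 10.00)
print(f"GPT-4o: ${gpt4o:.2f}/month")                          # → GPT-4o: $68.75/month
print(f"RTX 4090 payback: {payback_months(1800, gpt4o):.1f} months")
```

Doubling the daily volume halves the payback period, which is why heavy users and shared team servers reach break-even so much sooner than solo low-volume users.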
Don't believe the hype or the FUD. Realistic assessment for 2026:
| Task | Local 7B | Local 32B | GPT-4o | Claude Opus 4.6 |
|---|---|---|---|---|
| Simple Q&A | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Code generation (common patterns) | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Complex reasoning | ⚠️ Mediocre | ✅ Good | ✅ Great | ✅ Best |
| Math (competition level) | ❌ Poor | ✅ Good (R1) | ✅ Good | ✅ Great |
| Long document analysis | ❌ Limited | ✅ Good | ✅ Great | ✅ Best |
| Creative writing | ✅ Good | ✅ Great | ✅ Great | ✅ Best |
| Multilingual | ✅ Good (Qwen) | ✅ Great | ✅ Great | ✅ Great |
The 32B tier is genuinely competitive for day-to-day development work. The gap vs. frontier models shows most clearly in complex multi-step reasoning, mathematical proofs, and tasks requiring judgment about nuanced tradeoffs.
- Budget hardware: deepseek-r1:7b or qwen3:8b
- Best local coding: qwen2.5-coder:32b or deepseek-r1:32b
- Maximum quality: qwen2.5-coder:32b (single 4090) or llama3.3:70b (dual 3090)

Qwen3, released mid-2025, introduced a "thinking mode" similar to DeepSeek R1's chain-of-thought reasoning. Qwen3 models support two modes:
- Thinking mode (`/think`): extended reasoning with a visible thought process. Use for complex coding, math, and multi-step problems.
- Non-thinking mode (`/no_think`): fast conversational responses. Use for Q&A, summarization, and simple code completions.
# Qwen3 with thinking mode
ollama run qwen3:8b
>>> /think Write a recursive function to flatten nested lists in Python.
# Qwen3 without thinking (faster)
>>> /no_think What does the zip() function do?
For teams that previously used separate models for "fast chat" and "deep reasoning," Qwen3 consolidates this into a single model. Run the 8B for solo developers; the 30B (requires 24GB VRAM) for team-shared endpoints.
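When calling Qwen3 over the API rather than the interactive shell, the same switches can be appended to the end of the user prompt. A tiny helper sketch (`with_mode` is my name for it; the `/think` and `/no_think` soft switches are Qwen3's own):

```python
def with_mode(prompt: str, thinking: bool) -> str:
    """Append Qwen3's soft switch to pick a reasoning mode per request."""
    switch = "/think" if thinking else "/no_think"
    return f"{prompt} {switch}"

print(with_mode("What does the zip() function do?", thinking=False))
# → What does the zip() function do? /no_think
```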
With Ollama + Open WebUI, you can run a local inference server that your whole team connects to:
# docker-compose.yml — team AI server
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
volumes:
- open-webui:/app/backend/data
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
A single RTX 4090 machine serves 3–5 concurrent users on 32B models. Five developers sharing a $1,800 GPU works out to $360 each, about 18 months of a $20/month ChatGPT Plus subscription per person, with no per-token metering or rate limits.
Not every use case needs the same model. Here's how to match model to task:
| Use case | Recommended model | Why |
|---|---|---|
| Code autocomplete | qwen2.5-coder:7b |
Fast response time matters more than depth |
| Code review / refactoring | qwen2.5-coder:32b |
Needs broad context and reasoning |
| Math / logic problems | deepseek-r1:14b |
Chain-of-thought reasoning |
| Document summarization | qwen2.5:14b |
Strong instruction following, context length |
| Multilingual content | qwen2.5:7b |
Qwen models excel at non-English languages |
| Agentic workflows | deepseek-r1:32b |
Better multi-step planning |
The biggest mistake new local LLM users make is running a single large model for everything. Use 7B models where speed matters (autocomplete, quick lookups) and 14B–32B models where quality matters (code review, complex reasoning). Ollama can keep several models loaded at once when VRAM allows, swapping them in and out on demand.
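In application code, the table above amounts to a small routing map. A sketch (the model tags are the real Ollama tags from the table; the routing function itself is illustrative):

```python
# Route each task type to the model tier recommended above
MODEL_FOR_TASK = {
    "autocomplete": "qwen2.5-coder:7b",
    "code_review": "qwen2.5-coder:32b",
    "math": "deepseek-r1:14b",
    "summarize": "qwen2.5:14b",
    "multilingual": "qwen2.5:7b",
    "agent": "deepseek-r1:32b",
}

def pick_model(task: str, default: str = "qwen2.5:14b") -> str:
    """Return the Ollama model tag for a task, falling back to a generalist."""
    return MODEL_FOR_TASK.get(task, default)

print(pick_model("math"))     # → deepseek-r1:14b
print(pick_model("unknown"))  # → qwen2.5:14b
```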
Model loads but responses are slow (CPU offloading)
If Ollama outputs llm_load_tensors: offloaded X/Y layers to GPU, some model layers are running on CPU because they don't fit in VRAM. Options: use a smaller model, use a more aggressive quantization (Q4_0 instead of Q4_K_M), or add more VRAM.
# Check what's happening during load
OLLAMA_DEBUG=1 ollama run deepseek-r1:14b
Out of memory errors
# Reduce the context window (default is the model max, often 4096–32768)
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 2048
Reducing context from 32K to 4K can cut VRAM usage by 30–50% with minimal impact for most conversational use.
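The same `num_ctx` reduction can be applied per request through the `options` field of Ollama's REST API, which is useful for servers you can't drive interactively. A stdlib sketch (the helper name is mine; `num_ctx` is a documented Ollama option):

```python
import json
from urllib import request

def generate_request(model: str, prompt: str, num_ctx: int = 2048) -> request.Request:
    """Build an /api/generate call with a reduced context window to save VRAM."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the model's default context
    }
    return request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = generate_request("deepseek-r1:14b", "Summarize this file.", num_ctx=2048)
print(json.loads(req.data)["options"])  # → {'num_ctx': 2048}
```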
Multiple users hitting the same endpoint
Ollama processes one request at a time by default. For concurrent users, start the server with:
OLLAMA_NUM_PARALLEL=4 ollama serve   # process up to 4 requests simultaneously
This enables batching but increases per-request VRAM usage.
Browse all AI self-hosting guides at OSSAlt.
Related: Self-Host Your AI: Ollama + Open WebUI 2026 · 10 Open-Source Tools to Replace SaaS in 2026