Self-Hosted LLM: DeepSeek and Qwen 2026

#deepseek #qwen #ollama #llm
By Royce


Self-Hosted LLM Guide: Run DeepSeek and Qwen Locally

DeepSeek R1 shocked the AI world in early 2025 by matching o1-level reasoning at a fraction of the training cost. Qwen 2.5 and Qwen3 from Alibaba brought frontier-class coding ability to open-weight models. Both are available under permissive licenses. Both run locally via Ollama. And both eliminate the API costs and privacy concerns of using cloud LLM services.

This guide covers hardware requirements for every model size, exact Ollama commands, cost comparison vs. cloud APIs, and what to actually expect from local inference in 2026.

Quick Verdict

  • For reasoning tasks (math, coding, logic): DeepSeek R1 8B or 32B, depending on your hardware.
  • For coding specifically: Qwen2.5-Coder-32B on a 24GB GPU is currently the best local coding model — matches GPT-4o-mini on most benchmarks.
  • Budget hardware: DeepSeek R1 Distill 7B or Qwen3 8B run on consumer GPUs from 2019–2020.
  • No GPU: 7B models on Apple Silicon (M1+) are fully usable for real work.


Why Run Models Locally in 2026?

Privacy: No data leaves your machine. Prompts containing code, customer data, medical records, or proprietary information stay local.

Cost at scale: GPT-4o at $2.50/1M input tokens adds up: 1M input tokens/day is roughly $75/month. Self-hosted: $0 marginal cost after hardware.

No rate limits: Commercial APIs throttle requests. Local inference is limited only by your GPU/CPU.

Latency: With a good GPU, local 7B models respond in <200ms for short prompts.

Offline capability: Works without internet. Useful in air-gapped environments, travel, or unreliable connections.

Experimentation: Try 20 different models in an afternoon without billing anxiety.


Understanding Quantization

Model files are distributed in quantized formats that trade quality for size and speed. The key formats you'll encounter:

| Format | Size reduction | Quality loss | Best for |
|--------|----------------|--------------|----------|
| Q4_K_M | ~75% vs FP16 | Minimal | Default choice; best quality-per-GB |
| Q4_0 | ~75% vs FP16 | Slight | Faster than Q4_K_M, marginally lower quality |
| Q8_0 | ~50% vs FP16 | Negligible | When you have VRAM to spare |
| FP16 | No compression | None | Full quality; requires large VRAM |
| GGUF | n/a (container) | n/a | The file format Ollama/llama.cpp use; holds any of the above |

For daily use, Q4_K_M is the right default. Quality is nearly indistinguishable from full precision for conversational and coding tasks.
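To see why Q4_K_M is the sweet spot, a back-of-envelope size estimate helps: file size is roughly parameter count times effective bits per weight. A minimal sketch (the bits-per-weight figures are approximations; real GGUF files add metadata and keep some layers at higher precision):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Back-of-envelope GGUF file size: parameters x bits per weight.

    Real files are slightly larger (embeddings, metadata, mixed-precision
    layers in K-quants), so treat this as a lower-bound estimate.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# Approximate effective bits per weight for common quantizations
QUANT_BITS = {"Q4_K_M": 4.8, "Q4_0": 4.5, "Q8_0": 8.5, "FP16": 16.0}

if __name__ == "__main__":
    for fmt, bits in QUANT_BITS.items():
        print(f"32B @ {fmt}: ~{approx_model_size_gb(32, bits):.1f} GB")
```

A 32B model at Q4_K_M lands around 19–20GB of weights, which is exactly why 24GB cards are the ceiling for that tier.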


Hardware Requirements by Model Size

7B–8B Models (Entry Level)

Requirement: 6–8GB VRAM or 8–16GB RAM (CPU)
GPU options: RTX 3060 12GB, RTX 4060 8GB, RX 6700 XT, M1/M2/M3 MacBook

ollama pull deepseek-r1:7b           # DeepSeek R1 Distill 7B
ollama pull qwen2.5-coder:7b         # Qwen 2.5 Coder 7B
ollama pull qwen3:8b                 # Qwen3 8B

Expected speed: 30–80 t/s on GPU, 5–15 t/s on CPU (M2 MacBook gets ~25–40 t/s)

Real-world capability: Solid for Q&A, summarization, basic coding, classification. Reasoning quality is noticeably below GPT-4o for complex multi-step problems.

14B–32B Models (Mid-Range)

Requirement: 12–24GB VRAM or 32GB RAM
GPU options: RTX 3090 (24GB), RTX 4090 (24GB), RTX 4080 (16GB for 14B), M2/M3 Max/Ultra

ollama pull deepseek-r1:14b          # DeepSeek R1 Distill 14B
ollama pull deepseek-r1:32b          # DeepSeek R1 Distill 32B (needs 24GB VRAM)
ollama pull qwen2.5-coder:32b        # Qwen 2.5 Coder 32B — best local coding model
ollama pull qwen2.5:14b              # Qwen 2.5 14B general

Expected speed: 20–60 t/s on RTX 4090

Real-world capability: 32B models are where local LLMs become genuinely impressive. Qwen2.5-Coder-32B benchmarks at GPT-4o-mini level on coding tasks. DeepSeek R1 32B handles multi-step reasoning that smaller models struggle with.

70B Models (High-End)

Requirement: 2× 24GB VRAM (2× RTX 3090/4090) or 64GB+ unified memory (M2 Ultra, M3 Ultra)
GPU options: Dual RTX 4090 (NVLink not required), M2/M3 Ultra Mac Studio

ollama pull deepseek-r1:70b          # DeepSeek R1 70B
ollama pull qwen2.5:72b              # Qwen 2.5 72B
ollama pull llama3.3:70b             # Meta Llama 3.3 70B (general)

Expected speed: 20–40 t/s on dual RTX 4090 (offloads layers across GPUs automatically)

Real-world capability: Near-frontier reasoning quality. The quality gap vs. GPT-4o is minimal for most tasks.

671B Models (Requires Server Hardware)

DeepSeek V3/R1 full 671B models require 8× H100 80GB or equivalent — not consumer hardware. The distilled models above (7B–70B) use DeepSeek's knowledge in smaller architectures and are the practical choice for local deployment.


VRAM Quick Reference

| GPU | VRAM | Max model (Q4_K_M) |
|-----|------|--------------------|
| RTX 4060 | 8GB | 7B |
| RTX 3060 12GB | 12GB | 13B |
| RTX 4080 | 16GB | 13–14B |
| RTX 3090 / 4090 | 24GB | 32B |
| 2× RTX 4090 | 48GB | 70B |
| M2 Max (32GB unified) | 32GB | 32B |
| M3 Ultra (192GB unified) | 192GB | 671B (quantized) |

Apple Silicon's unified memory architecture means the GPU and CPU share the same memory pool: an M2 Max with 32GB can run 32B models that would need a dedicated 24GB-VRAM GPU in a desktop PC.
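The table above collapses into a tiny helper: a sketch that picks the largest tier whose Q4_K_M weights fit, leaving headroom for the KV cache (the tier sizes are rough figures, not exact file sizes):

```python
# Rough Q4_K_M weight sizes in GB per model tier (approximate)
TIERS = [(5, "7B"), (9, "13B"), (19, "32B"), (42, "70B")]

def largest_tier(vram_gb: float, headroom_gb: float = 1.5):
    """Largest model tier whose weights fit in VRAM, reserving headroom
    for the KV cache and runtime overhead. Returns None if nothing fits."""
    usable = vram_gb - headroom_gb
    fitting = [label for size, label in TIERS if size <= usable]
    return fitting[-1] if fitting else None

print(largest_tier(8))   # RTX 4060 class → 7B
print(largest_tier(24))  # RTX 4090 class → 32B
```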


Setting Up DeepSeek R1 with Ollama

# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run DeepSeek R1
ollama run deepseek-r1:14b

# With explicit thinking visible
# DeepSeek R1 shows its chain-of-thought in <think> tags

DeepSeek R1's reasoning chains are displayed by default — you see the model "thinking" through problems step by step. This is useful for debugging and understanding the model's approach, not just the final answer.
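When you pipe R1's output into scripts or other tools, you usually want the final answer without the reasoning. A minimal sketch for stripping the <think> block from a response string:

```python
import re

# Matches a <think>...</think> block, including newlines inside it
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Drop DeepSeek R1's <think>...</think> chain-of-thought, keeping
    only the final answer (useful when piping output into other tools)."""
    return THINK_RE.sub("", text).strip()

raw = "<think>User wants a sum. 2+2=4.</think>\n2 + 2 = 4"
print(strip_reasoning(raw))  # → 2 + 2 = 4
```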

Running via API


# Chat with DeepSeek R1 (requires `pip install ollama`)
import ollama
response = ollama.chat(
    model='deepseek-r1:14b',
    messages=[
        {
            'role': 'user',
            'content': 'Implement a binary search tree in Python with insert, search, and delete methods.'
        }
    ]
)
print(response['message']['content'])
# Or with curl (OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:14b",
    "messages": [{"role": "user", "content": "Explain recursion with a simple example"}]
  }'

Setting Up Qwen2.5-Coder for Development

Qwen2.5-Coder-32B is the recommended local model for software development in 2026. It supports 100+ programming languages, fill-in-the-middle completion, and long-context code understanding.
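Fill-in-the-middle uses special sentinel tokens rather than a chat prompt. A sketch of assembling a FIM request with the sentinel tokens documented for Qwen2.5-Coder (send it to Ollama's /api/generate in raw mode so no chat template is applied; serving details may vary by version):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model generates the code
    that belongs between prefix and suffix."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Ask the model to complete the body of add() given the code around it
prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
print(prompt)
```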

# Pull the coding model
ollama pull qwen2.5-coder:32b

# Or the 7B version for constrained hardware
ollama pull qwen2.5-coder:7b

Integrate with VS Code via Continue

Continue is a VS Code/JetBrains extension that connects to Ollama for local AI coding assistance:

  1. Install Continue extension
  2. Open Continue settings → add model:
{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B (fast)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Use the 32B model for chat/explanations and the 7B for tab autocomplete (autocomplete needs to be fast — 7B delivers better latency).


Cost Comparison: Local vs. Cloud

Scenario: Developer using AI coding assistant

Assumptions: 500K tokens/day, 50% input / 50% output, 22 working days/month = 11M tokens/month

| Provider | Price per 1M (input / output) | Monthly cost |
|----------|-------------------------------|--------------|
| GPT-4o | $2.50 / $10.00 | ~$69 |
| Claude Sonnet 4.6 | $3.00 / $15.00 | ~$99 |
| Groq (Llama 3.3 70B) | $0.59 / $0.79 | ~$7.60 |
| Qwen2.5-Coder 32B (local) | $0 marginal | Hardware amortized |

(Monthly cost = 5.5M input + 5.5M output tokens at the listed rates.)

Hardware payback period:

  • RTX 4090 (24GB): ~$1,600–1,800 new
  • At ~$69/month savings vs GPT-4o: payback in roughly 23–26 months
  • At ~$7.60/month savings vs Groq: payback in 18+ years (Groq wins for low-volume)

The math favors local when: You use AI heavily (>500K tokens/day), have an existing GPU, or already have a desktop workstation where GPU cost is shared with gaming/other work.

Cloud wins for low-volume: If you're doing 50K tokens/day (about 1.1M/month at 22 working days), Groq comes to roughly $0.76/month, which is unbeatable.
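The break-even logic above is easy to parameterize. A sketch, with electricity as an optional offset against the savings (the power figure in the comment is an assumption, not a measured number):

```python
def payback_months(hardware_usd: float, cloud_usd_per_month: float,
                   power_usd_per_month: float = 0.0) -> float:
    """Months until a local GPU pays for itself vs. a cloud API bill.

    Electricity (very roughly $10–25/month for a 4090 under heavy use)
    can be passed as an offset against the cloud savings.
    """
    savings = cloud_usd_per_month - power_usd_per_month
    if savings <= 0:
        return float("inf")  # cloud is cheaper; the GPU never pays back
    return hardware_usd / savings

print(round(payback_months(1700, 69.00), 1))  # ≈ 24.6 months vs. GPT-4o
print(round(payback_months(1700, 7.60), 1))   # ≈ 223.7 months (~18.6 years) vs. Groq
```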


Model Quality Reality Check

Don't believe the hype or the FUD. Realistic assessment for 2026:

| Task | Local 7B | Local 32B | GPT-4o | Claude Opus 4.6 |
|------|----------|-----------|--------|-----------------|
| Simple Q&A | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Code generation (common patterns) | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Complex reasoning | ⚠️ Mediocre | ✅ Good | ✅ Great | ✅ Best |
| Math (competition level) | ❌ Poor | ✅ Good (R1) | ✅ Good | ✅ Great |
| Long document analysis | ❌ Limited | ✅ Good | ✅ Great | ✅ Best |
| Creative writing | ✅ Good | ✅ Great | ✅ Great | ✅ Best |
| Multilingual | ✅ Good (Qwen) | ✅ Great | ✅ Great | ✅ Great |

The 32B tier is genuinely competitive for day-to-day development work. The gap vs. frontier models shows most clearly in complex multi-step reasoning, mathematical proofs, and tasks requiring judgment about nuanced tradeoffs.


Recommended Setups by Budget

Budget: No Dedicated GPU ($0 extra)

  • Hardware: M1/M2/M3 MacBook (8–16GB), or any modern CPU with 16GB RAM
  • Model: deepseek-r1:7b or qwen3:8b
  • Speed: 15–30 t/s on Apple Silicon, 5–10 t/s on CPU-only x86
  • Best for: Light coding assistance, Q&A, summarization

Mid-Range: ~$400–800

  • Hardware: Used RTX 3090 (~$500) or new RTX 4070 Ti Super (~$800)
  • Model: qwen2.5-coder:32b or deepseek-r1:32b on the 24GB 3090; on the 16GB card, stick to 14B models (qwen2.5-coder:14b, deepseek-r1:14b) to avoid CPU offloading
  • Speed: 30–50 t/s
  • Best for: Full-time development assistant, replaces GitHub Copilot + ChatGPT

High-End: ~$1,600–3,600

  • Hardware: RTX 4090 (24GB) or 2× RTX 3090
  • Model: qwen2.5-coder:32b (single 4090) or llama3.3:70b (dual 3090)
  • Speed: 40–80 t/s
  • Best for: Production inference server, team-shared AI endpoint, agentic workflows

Qwen3 vs Qwen2.5: What Changed

Qwen3, released mid-2025, introduced a "thinking mode" similar to DeepSeek R1's chain-of-thought reasoning. Qwen3 models support two modes:

  • Thinking mode (/think): Extended reasoning with visible thought process. Use for complex coding, math, and multi-step problems.
  • Non-thinking mode (/no_think): Fast conversational responses. Use for Q&A, summarization, and simple code completions.
# Qwen3 with thinking mode
ollama run qwen3:8b
>>> /think Write a recursive function to flatten nested lists in Python.

# Qwen3 without thinking (faster)
>>> /no_think What does the zip() function do?

For teams that previously used separate models for "fast chat" and "deep reasoning," Qwen3 consolidates this into a single model. Run the 8B for solo developers; the 30B (requires 24GB VRAM) for team-shared endpoints.
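Programmatically, the mode switch is just a soft-switch tag prepended to the user message. A minimal sketch (behavior of the soft switch may vary across Qwen3 builds and Ollama versions):

```python
def qwen3_prompt(user_text: str, thinking: bool) -> str:
    """Prepend Qwen3's soft-switch tag to toggle extended reasoning
    per request: /think for chain-of-thought, /no_think for fast replies."""
    tag = "/think" if thinking else "/no_think"
    return f"{tag} {user_text}"

print(qwen3_prompt("What does zip() do?", thinking=False))
```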


Running a Shared Team Endpoint

With Ollama + Open WebUI, you can run a local inference server that your whole team connects to:

# docker-compose.yml — team AI server
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

A single RTX 4090 machine serves 3–5 concurrent users on 32B models. Five developers sharing a $1,800 GPU works out to $360 each, paid back in about five months against a ~$69/month GPT-4o API bill (or about 18 months against $20/month ChatGPT Plus subscriptions).


Choosing Models for Specific Tasks

Not every use case needs the same model. Here's how to match model to task:

| Use case | Recommended model | Why |
|----------|-------------------|-----|
| Code autocomplete | qwen2.5-coder:7b | Fast response time matters more than depth |
| Code review / refactoring | qwen2.5-coder:32b | Needs broad context and reasoning |
| Math / logic problems | deepseek-r1:14b | Chain-of-thought reasoning |
| Document summarization | qwen2.5:14b | Strong instruction following, context length |
| Multilingual content | qwen2.5:7b | Qwen models excel at non-English languages |
| Agentic workflows | deepseek-r1:32b | Better multi-step planning |

The biggest mistake new local LLM users make is running a single large model for everything. Use 7B models where speed matters (autocomplete, quick lookups) and 14B–32B models where quality matters (code review, complex reasoning). Ollama handles multiple loaded models with separate GPU memory allocation.
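That routing policy fits in a few lines. A sketch of a task-to-model dispatcher mirroring the table above (the task names are made up for illustration):

```python
# Task → model routing table mirroring the recommendations above
ROUTES = {
    "autocomplete": "qwen2.5-coder:7b",
    "code_review": "qwen2.5-coder:32b",
    "math": "deepseek-r1:14b",
    "summarize": "qwen2.5:14b",
    "multilingual": "qwen2.5:7b",
    "agent": "deepseek-r1:32b",
}

def pick_model(task: str, default: str = "qwen2.5:14b") -> str:
    """Route a request to the smallest model that handles the task well,
    falling back to a general-purpose mid-size model."""
    return ROUTES.get(task, default)

print(pick_model("autocomplete"))  # → qwen2.5-coder:7b
```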


Troubleshooting Common Issues

Model loads but responses are slow (CPU offloading)

If Ollama outputs llm_load_tensors: offloaded X/Y layers to GPU, some model layers are running on CPU because they don't fit in VRAM. Options: use a smaller model, use a more aggressive quantization (Q4_0 instead of Q4_K_M), or add more VRAM.

# Check what's happening during load
OLLAMA_DEBUG=1 ollama run deepseek-r1:14b

Out of memory errors

# Reduce the context window (default is model-max, often 4096–32768)
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 2048

Reducing context from 32K to 4K can cut VRAM usage by 30–50% with minimal impact for most conversational use.
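The savings come from the KV cache, which grows linearly with context length: 2 tensors (K and V) × layers × KV heads × head dimension × tokens × bytes per element. A sketch with illustrative architecture numbers for a 14B-class model using grouped-query attention (not the exact config of any specific checkpoint):

```python
def kv_cache_gb(ctx_tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV-cache size in decimal GB: 2 (K and V) x layers x KV heads x
    head dim x context length x bytes per element (FP16 = 2).

    Defaults are illustrative for a 14B-class GQA model, not exact for
    any specific checkpoint.
    """
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_value / 1e9

print(f"{kv_cache_gb(32768):.2f} GB")  # 32K context → ≈ 6.44 GB
print(f"{kv_cache_gb(4096):.2f} GB")   # 4K context  → ≈ 0.81 GB
```

The cache scales with tokens, not with model weights, so shrinking the context is the one VRAM lever that costs no model quality.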

Multiple users hitting the same endpoint

Ollama processes one request at a time by default. For concurrent users, set:

OLLAMA_NUM_PARALLEL=4 ollama serve   # process up to 4 requests simultaneously

This enables batching but increases per-request VRAM usage.


Browse all AI self-hosting guides at OSSAlt.

Related: Self-Host Your AI: Ollama + Open WebUI 2026 · 10 Open-Source Tools to Replace SaaS in 2026