
Part 2 of 3: LLM Fundamentals Series
Updated February 2026 - The LLM landscape transformed in 2025. This guide explains the architectures powering today's frontier models, from the transformer foundation to the reasoning revolution.
2025's Big Shift: It's not about bigger models anymore - it's about smarter training. RLVR (Reinforcement Learning from Verifiable Rewards) changed everything.
For a 5-year-old: Think of a transformer like a really smart reading buddy who can look at ALL the words in a story at the same time, instead of reading one word at a time like you do.
For developers: Transformers use self-attention to process entire sequences in parallel, replacing the sequential processing of RNNs/LSTMs.
Before transformers (RNNs/LSTMs):
Input: "The cat sat on the mat"
Processing: Sequential
Step 1: Process "The"
Step 2: Process "cat" (remembering "The")
Step 3: Process "sat" (remembering "The cat")
... and so on
Problem: Slow, hard to parallelize, forgets long-range dependencies
With transformers (Attention):
Input: "The cat sat on the mat"
Processing: Parallel
ALL words processed simultaneously
Each word "attends to" all other words
Can capture long-range dependencies
Result: 10-100x faster training, better quality
Query: "What does 'it' refer to?"
Sentence: "The cat sat on the mat because it was comfortable"
Attention mechanism:
"it" ā looks at all words
ā
scores each word
ā
"mat" gets highest score (0.87)
"cat" gets medium score (0.45)
"The" gets low score (0.03)
ā
Determines "it" = "mat"
Mathematical intuition (simplified):
# Conceptual pseudocode - for each word in the sentence:
for word in sentence:
    # Calculate "how much attention" to pay to every other word
    attention_scores = compute_similarity(word, all_other_words)
    # Use those scores to create a weighted combination
    context = weighted_sum(all_other_words, attention_scores)
    # This context helps understand the word better
    enhanced_representation = combine(word, context)
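To make the idea concrete, here is a minimal runnable sketch of single-head scaled dot-product self-attention in NumPy. The toy embeddings, dimensions, and random projection matrices are made up for illustration; real models use learned weights and many heads.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # every token scores every other token
    weights = softmax(scores, axis=-1)         # attention weights per token
    return weights @ V                         # weighted combination = context

# Toy example: 6 tokens ("The cat sat on the mat"), embedding dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (6, 8) - one enhanced vector per token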
Every transformer has the same core ingredients: token embeddings, positional information, stacked blocks of self-attention plus feed-forward layers, and an output projection. What changed in 2025 is how these models are trained:
Old paradigm (2020-2024):
Bigger model + More data + More compute = Better performance
New paradigm (2025+):
Base model + RLVR training + Test-time compute = Reasoning ability
Simple explanation: Instead of just teaching a model what answers LOOK like, we make it solve actual problems and reward it when it gets them right!
Technical explanation:
RLVR = Reinforcement Learning from Verifiable Rewards
Traditional training:
Input: "What is 2+2?"
Target: "The answer is 4"
Loss: How close is model output to target text?
RLVR training:
Input: "What is 2+2?"
Model generates: "Let me think... 2+2... that's 4"
Verification: Check if 4 is mathematically correct ✓
Reward: +1 for correct, -1 for incorrect
Update: Reinforce behaviors that led to correct answer
Key insight: Model learns to develop intermediate reasoning steps because those strategies lead to correct final answers!
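A minimal sketch of what a verifiable reward could look like for math problems, following the toy example above. The answer-extraction regex and the ±1 reward values are illustrative assumptions; real RLVR pipelines use far more robust verifiers.

import re

def verify_math_answer(model_output: str, correct_answer: float) -> float:
    """Return +1 if the model's final number matches the verified answer, else -1."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return -1.0                      # no answer produced at all
    final_answer = float(numbers[-1])    # treat the last number as the final answer
    return 1.0 if abs(final_answer - correct_answer) < 1e-6 else -1.0

# Example: this reward signal is what reinforces the reasoning that produced it
print(verify_math_answer("Let me think... 2+2... that's 4", correct_answer=4))  # 1.0
print(verify_math_answer("The answer is 5", correct_answer=4))                  # -1.0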
Models trained with RLVR spontaneously develop:
Example - DeepSeek-R1-Zero (pure RL, no supervised training):
Problem: "If x² - 5x + 6 = 0, find x"
Model's internal reasoning (discovered on its own!):
<think>
Let me factor this equation.
I need two numbers that multiply to 6 and add to -5.
Those numbers are -2 and -3.
So (x - 2)(x - 3) = 0
This means x = 2 or x = 3
Let me verify:
For x = 2: 2² - 5(2) + 6 = 4 - 10 + 6 = 0 ✓
For x = 3: 3² - 5(3) + 6 = 9 - 15 + 6 = 0 ✓
</think>
The solutions are x = 2 and x = 3.
Nobody taught it to "show work" - it learned this helps get correct answers!
Old scaling:
Better performance = Bigger model
New scaling:
Better performance = More thinking time
Practical example:
GPT-5 on math problem:
Low effort (fast):
- Thinking tokens: ~100
- Time: 3 seconds
- Accuracy: 65%
Medium effort:
- Thinking tokens: ~500
- Time: 12 seconds
- Accuracy: 82%
High effort:
- Thinking tokens: ~2,500
- Time: 45 seconds
- Accuracy: 94%
Same model, different compute budget!
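As a purely hypothetical illustration: some reasoning APIs expose an effort knob along these lines. The client calls, model name, and the `reasoning_effort` parameter values below are assumptions made for illustration, not a documented interface; check your provider's docs before using anything like this.

# Hypothetical sketch - assumes an OpenAI-style client and a `reasoning_effort`
# parameter taking "low" / "medium" / "high" (assumed names, not confirmed).
def solve(client, problem: str, effort: str = "medium") -> str:
    response = client.chat.completions.create(
        model="gpt-5",                    # assumed model name
        reasoning_effort=effort,          # trade latency/cost for accuracy
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content

# Same model, different compute budget:
# solve(client, hard_math_problem, effort="high")  -> slower, more accurate
# solve(client, simple_lookup, effort="low")       -> fast, cheap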
Simple explanation: GPT-5 is like a kid who can choose whether to answer quickly or think really hard depending on the question.
Architecture:
Type: Decoder-only transformer
Parameters: ~1.8T total (estimated)
Context: 400K tokens
Special features:
- Adaptive reasoning (Instant/Auto/Thinking modes)
- Unified architecture (one model, multiple modes)
- Extended multimodal (text, image, audio)
Training:
Phase 1: Pre-training on internet-scale data
Phase 2: Supervised fine-tuning
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Phase 4: RLVR (Reasoning training)
Key innovations: adaptive reasoning modes, a unified architecture that spans instant and thinking modes, and RLVR as a dedicated reasoning-training phase.
Architecture diagram:
Input → Embedding → Transformer Blocks (N layers) → Output
                          ↓
                  [Reasoning path]
                          ↓
                  Extra processing
                          ↓
               More transformer passes
                          ↓
            Output with chain-of-thought (CoT)
Simple explanation: Claude not only thinks hard but also makes sure its thinking follows important rules about being helpful and harmless.
Architecture:
Type: Decoder-only transformer + Constitutional AI
Parameters: ~500B (Opus), ~200B (Sonnet)
Context: 200K tokens
Special features:
- Extended thinking with tool use
- Constitutional AI alignment
- Chrome browser integration
Training:
Phase 1: Pre-training
Phase 2: Constitutional AI (AI-generated feedback)
Phase 3: RLHF with harmlessness criteria
Phase 4: Reasoning fine-tuning
Constitutional AI explained:
Instead of just training on human feedback:
Step 1: Generate many responses
Step 2: AI critiques its own responses against principles
Step 3: AI revises responses to be better
Step 4: Train on revised responses
Step 5: RLHF with harmlessness as key criteria
Principles include:
- Be helpful and harmless
- Respect user agency
- Be honest about limitations
- Avoid deception
Why it matters: Built-in safety without sacrificing capability.
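A minimal sketch of the critique-and-revise loop described above. The `generate()` helper and the prompt wording are hypothetical stand-ins for illustration, not Anthropic's actual pipeline.

def constitutional_revision(generate, prompt: str, principles: list[str]) -> str:
    """One critique-and-revise pass; `generate` is any text-generation function."""
    response = generate(prompt)
    for principle in principles:
        # The model critiques its own response against one principle...
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # ...then revises the response to address the critique.
        response = generate(
            f"Revise the response to address this critique:\n{critique}\n\nResponse:\n{response}"
        )
    return response  # revised responses become the training targets for fine-tuning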
Simple explanation: DeepSeek figured out how to be as smart as the expensive models while using way less computer power - like getting straight A's while studying half as long!
Architecture:
Type: Mixture-of-Experts (MoE) + Multi-head Latent Attention
Total parameters: 671B
Active parameters: 37B (per token)
Context: 128K tokens (extensible to 256K)
MoE structure:
- 256 routed experts
- 1 shared expert (always active)
- Top-8 routing per token
Special features:
- Multi-head Latent Attention (MLA)
- FP8 precision training
- DeepSeek Sparse Attention (V3.2)
The MoE Magic:
Traditional dense model:
Every token → All 671B parameters
Cost: High memory, slow inference
DeepSeek MoE:
Every token → 37B active parameters
              (8 experts + 1 shared expert)
Cost: 94% reduction in active compute!
How it works:
Input token: "quantum"
Router scores all 256 experts
Selects top 8: [Physics, Math, Chemistry, ...]
Activates those + shared expert
Other 248 experts: dormant
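A toy illustration of the routing step above and of where the "94% reduction" figure comes from: only the selected experts (plus the shared expert and attention layers) are touched per token. The random router scores are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K = 256, 8

# Router produces one score per expert for the token "quantum"
router_scores = rng.normal(size=NUM_EXPERTS)
top8 = np.argsort(router_scores)[-TOP_K:]        # indices of the 8 selected experts
print(f"Selected experts: {sorted(top8.tolist())}; the other {NUM_EXPERTS - TOP_K} stay dormant")

# Rough compute saving implied by the published parameter counts
total_params, active_params = 671e9, 37e9
print(f"Active fraction: {active_params / total_params:.1%}")       # ~5.5%
print(f"Reduction:       {1 - active_params / total_params:.0%}")   # ~94%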
Multi-head Latent Attention (MLA):
Problem: Standard attention stores huge KV cache
- Memory explodes with long contexts
- Costs $$$ in inference
MLA solution: Compress KV representations
- Low-rank compression of keys/values
- 50% reduction in memory
- Actually BETTER performance (!?)
Training efficiency:
GPT-4: ~$100M+ estimated training cost
DeepSeek-V3: $5.5M actual training cost
18x cheaper, competitive performance
V3 → R1 → V3.1 → V3.2 Evolution:
V3 (Dec 2024): Base MoE model
      ↓
R1 (Jan 2025): Pure RL reasoning (no SFT!)
      ↓
V3.1 (Aug 2025): Hybrid with mode switching
      ↓
V3.2 (Sep 2025): Added DeepSeek Sparse Attention
Simple explanation: Gemini is like a kid who's not just good with words, but can see pictures, hear sounds, watch videos, AND read code - all at once!
Architecture:
Type: Natively multimodal decoder transformer
Parameters: ~1.5T (estimated for Pro)
Context: 1M tokens (massive!)
Special features:
- Native multimodal (not adapted text model)
- Deep Think reasoning mode
- Extreme long-context optimization
Modalities:
- Text (all languages)
- Images (high resolution)
- Audio
- Video (native, not frame-by-frame)
- Code
Native multimodality:
Old approach (GPT-4V):
1. Train text model
2. Add vision encoder
3. Bridge with projection layer
4. Fine-tune together
Gemini approach:
1. Train on ALL modalities from day 1
2. Unified token space for text/image/audio
3. Attention across all modalities
4. No separate encoders
Result: Better cross-modal understanding
1M context window:
Techniques used:
- Ring attention (distributed attention)
- Optimized position embeddings (ALiBi)
- Sparse attention patterns
- Efficient KV caching
Enables:
- Entire codebases in context
- Full books
- Hours of video transcripts
- Massive document analysis
Simple explanation: LLaMA is Meta's gift to everyone - powerful AI that anyone can use, modify, and run on their own computers!
Architecture:
Type: Decoder-only + MoE (Maverick)
Parameters:
Scout: 70B (dense)
Maverick: 400B (MoE with ~50B active)
Context:
Scout: 10M tokens (!!!)
Maverick: 1M tokens
License: Llama 4 Community License
LLaMA 4 Scout - The Context King:
10 MILLION token context!
What fits:
- Entire Harry Potter series
- Complete large codebases
- 20,000 pages of documents
- Years of company emails
How they did it:
- Grouped Query Attention (GQA) - sketched in code below
- Optimized for single-GPU inference
- Aggressive KV cache compression
- Special training for long contexts
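A minimal NumPy sketch of the grouped-query attention idea from the list above: many query heads share a smaller set of key/value heads, which shrinks the KV cache. The head counts and dimensions here are illustrative, not LLaMA 4's actual configuration.

import numpy as np

def grouped_query_attention(Q, K, V, num_q_heads=8, num_kv_heads=2):
    """Q: [heads_q, seq, d]; K, V: [heads_kv, seq, d]. Each KV head serves a group of Q heads."""
    group_size = num_q_heads // num_kv_heads
    d = Q.shape[-1]
    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                               # which KV head this query head shares
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V[kv])
    return np.stack(outputs)                               # [heads_q, seq, d]

rng = np.random.default_rng(0)
seq, d = 16, 32
Q = rng.normal(size=(8, seq, d))
K = rng.normal(size=(2, seq, d))   # only 2 KV heads cached instead of 8 -> 4x smaller KV cache
V = rng.normal(size=(2, seq, d))
print(grouped_query_attention(Q, K, V).shape)   # (8, 16, 32)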
MoE Design (Maverick):
Different from DeepSeek:
DeepSeek: MoE in every layer
LLaMA 4: Alternating MoE and dense layers
Layer pattern (sketched in code below):
1. Dense FFN
2. MoE FFN
3. Dense FFN
4. MoE FFN
...
Trade-off:
- More stable training
- Better dense computation
- Slightly less parameter-efficient
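A toy sketch of the trade-off above: placing MoE FFNs only in alternating layers versus in every layer. All sizes are in arbitrary units and the expert counts are invented for illustration, not Meta's or DeepSeek's actual configurations.

def stack_params(num_layers, dense_ffn=1.0, expert_ffn=1.0, num_experts=16, top_k=2, moe_every=2):
    """Toy FFN parameter totals for a stack with an MoE FFN every `moe_every` layers."""
    total = active = 0.0
    for i in range(num_layers):
        if i % moe_every == moe_every - 1:        # MoE layer: many experts, few active
            total += num_experts * expert_ffn
            active += top_k * expert_ffn
        else:                                     # dense layer: all parameters always active
            total += dense_ffn
            active += dense_ffn
    return total, active

t, a = stack_params(48, moe_every=2)
print(f"Alternating (LLaMA 4 style):     total={t:.0f}, active={a:.0f}")
t, a = stack_params(48, moe_every=1)
print(f"MoE every layer (DeepSeek style): total={t:.0f}, active={a:.0f}")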
Simple explanation: Kimi has an absolutely massive memory - it can remember entire libraries of books!
Architecture:
Type: MoE (based on DeepSeek V3)
Parameters: ~671B total, ~40B active
Context: 256K tokens (Think variant)
Special features:
- More experts than DeepSeek
- Fewer attention heads
- Extended context training
Differences from DeepSeek V3:
Experts: 320 (vs 256 in DeepSeek)
Attention heads: 64 (vs 128 in DeepSeek)
Training: More long-context data
Why more experts?
More experts = More specialization
Example routing:
Token: "quantum entanglement"
DeepSeek (256 experts):
Activates: Physics, Quantum, ...
Kimi (320 experts):
Activates: Quantum Physics, Quantum Computing, Particle Physics, ...
Result: More fine-grained expertise
Simple explanation: Mixtral has 8 experts and picks the best 2 for each task - like having a team where you always pick the right specialists!
Architecture:
Type: Sparse Mixture-of-Experts
Models:
8x7B: 47B total, 13B active
8x22B: 141B total, 39B active
Context: 32K tokens
License: Apache 2.0
MoE structure:
- 8 experts per layer
- Top-2 routing
- Simple, clean design
Mistral Medium 3.1 (2025):
The value champion:
Performance: ~90% of Claude Sonnet 3.7
Cost: $0.40/M tokens (8x cheaper!)
Use case: Cost-sensitive production workloads
Architecture: Same MoE, better training
Standard Multi-Head Attention:
# Conceptual pseudocode
def multi_head_attention(query, key, value):
    # Split into multiple heads
    Q = split_heads(query, num_heads=128)
    K = split_heads(key, num_heads=128)
    V = split_heads(value, num_heads=128)
    # For each head: scaled dot-product attention
    attention_scores = softmax(Q @ K.T / sqrt(d_k))
    output = attention_scores @ V
    # Combine heads
    return concat_heads(output)

Memory cost (KV cache): O(sequence_length × num_heads × head_dim)
Multi-head Latent Attention (MLA - DeepSeek):
def multi_head_latent_attention(query, key, value):
    # Compress K, V to a low-rank latent space
    K_latent = compress(key)      # low-rank projection
    V_latent = compress(value)
    # Attention in the compressed space
    attention = softmax(query @ K_latent.T)
    output = attention @ V_latent
    # Expand back to the full dimension
    return expand(output)

Memory cost (KV cache): O(sequence_length × latent_dim)
where latent_dim << num_heads × head_dim
Savings: 50-70% memory reduction!
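A quick back-of-the-envelope comparison of per-layer KV-cache size using the two cost formulas above. The dimensions (128 heads of size 128, a latent width of 6144, fp16 storage) are assumptions chosen so the savings land in the 50-70% range quoted above; the real ratio depends on the model's configuration.

def kv_cache_bytes(seq_len, width, bytes_per_value=2, kv=2):
    """Per-layer KV-cache size: keys + values, each seq_len x width, stored in fp16/bf16."""
    return kv * seq_len * width * bytes_per_value

seq_len = 128_000
standard = kv_cache_bytes(seq_len, width=128 * 128)   # num_heads x head_dim
latent = kv_cache_bytes(seq_len, width=6144)          # assumed compressed latent width
print(f"Standard MHA: {standard / 1e9:.1f} GB per layer")
print(f"MLA (latent): {latent / 1e9:.1f} GB per layer")
print(f"Reduction:    {1 - latent / standard:.1%}")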
Sparse Attention (V3.2):
def sparse_attention(query, key, value):
    # Don't attend to ALL tokens; use a local + global pattern
    local_window = 256    # attend to nearby tokens
    global_stride = 64    # attend to distant tokens periodically
    # Build the sparse attention mask (one possible construction is sketched below)
    mask = create_sparse_mask(len(query), local_window, global_stride)
    # Apply attention only where the mask allows; masked-out positions get -inf before softmax
    scores = softmax(where(mask, query @ key.T, -inf))
    output = scores @ value
    return output

Computation: O(sequence_length × (local_window + sequence_length / global_stride))
vs O(sequence_length²) for dense attention
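The `create_sparse_mask` helper above isn't defined in the original text, so here is one plausible NumPy construction of a local-window plus strided-global pattern. This is an assumption about the pattern's shape, not DeepSeek's actual implementation.

import numpy as np

def create_sparse_mask(seq_len, local_window=256, global_stride=64):
    """Boolean [seq_len, seq_len] mask: attend locally plus to every `global_stride`-th token."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = np.abs(i - j) < local_window          # nearby tokens
    global_ = (j % global_stride) == 0            # periodic "landmark" tokens
    causal = j <= i                               # decoder-only models attend to the past only
    return (local | global_) & causal

mask = create_sparse_mask(4096)
print(f"Fraction of positions attended: {mask.mean():.1%}")   # far below the dense causal ~50%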
The Routing Problem:
def route_to_experts(token, experts, k=8):
    """
    Given a token, decide which experts should process it
    """
    # Each token gets a routing score for each expert
    router_weights = compute_router_scores(token)
    # Shape: [num_experts], e.g. [0.1, 0.05, 0.8, 0.3, ..., 0.2]

    # Select the top-k experts
    top_k_indices = topk(router_weights, k)
    top_k_weights = router_weights[top_k_indices]
    # Normalize the selected weights (sum to 1)
    top_k_weights = softmax(top_k_weights)

    # Process the token through the selected experts
    outputs = []
    for expert_idx, weight in zip(top_k_indices, top_k_weights):
        expert_output = experts[expert_idx](token)
        outputs.append(weight * expert_output)

    # Weighted combination of the expert outputs
    final_output = sum(outputs)
    return final_output
Load Balancing Challenge:
# Problem: all tokens might route to the same "favorite" experts
#   Expert 1: processes 80% of tokens (overloaded!)
#   Experts 2-8: process 20% of tokens (underutilized)
# Solution: auxiliary load-balancing loss

def load_balancing_loss(router_weights, expert_assignments):
    # Measure how evenly tokens are distributed across experts
    expert_usage = count_assignments_per_expert(expert_assignments)
    # Ideal: each expert gets 1/num_experts of the tokens
    ideal_usage = total_tokens / num_experts
    # Penalize deviation from the ideal
    loss = mean((expert_usage - ideal_usage) ** 2)
    return loss

# Added to the training loss to encourage balanced routing
Shared Expert (DeepSeek innovation):
def moe_with_shared_expert(token, routed_experts, shared_expert):
    # Standard routing to the top-k experts
    routed_output = route_to_experts(token, routed_experts, k=8)
    # ALWAYS process through the shared expert
    shared_output = shared_expert(token)
    # Combine
    final_output = routed_output + shared_output
    return final_output

# Why this helps:
# - The shared expert learns common patterns
# - The routed experts specialize more
# - Better gradient flow during training
Why we need them:
Attention is permutation-invariant!
"The cat chased the mouse"
vs
"The mouse chased the cat"
Without positions → same attention patterns → wrong!
RoPE (Rotary Position Embedding):
# Used by LLaMA, DeepSeek, and many modern models
def rope_encoding(embeddings, positions):
    # Rotate each embedding based on its position.
    # The embedding is treated as d/2 pairs of dimensions; each pair (2j, 2j+1)
    # is rotated by an angle that depends on the position and the pair index.
    d = embeddings.shape[-1]
    for i, pos in enumerate(positions):
        for j in range(d // 2):
            theta = pos / 10000 ** (2 * j / d)
            rotation = [[cos(theta), -sin(theta)],
                        [sin(theta),  cos(theta)]]
            embeddings[i, 2*j:2*j+2] = rotation @ embeddings[i, 2*j:2*j+2]
    return embeddings

# Properties:
# - Relative positions are naturally encoded in the dot products
# - Extrapolates to longer sequences
# - No learned parameters
ALiBi (Attention with Linear Biases):
# Used by Gemini for extreme long contexts
def alibi_attention(Q, K, positions):
    # Compute base attention scores
    scores = Q @ K.T
    # Add a position-based bias: the farther apart two tokens are,
    # the larger the penalty on their attention score
    for i in positions:
        for j in positions:
            distance = abs(i - j)
            scores[i][j] += -distance * slope   # slope is a fixed per-head constant
    # Penalizes attending to distant tokens
    # Allows scaling to very long contexts
    return softmax(scores)
Pre-2025 Scaling (Chinchilla):
Optimal compute allocation:
For compute budget C:
Model params ∝ C^0.5
Training tokens ∝ C^0.5
Example:
2x compute → ~1.4x params, ~1.4x tokens
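A tiny sketch of the square-root allocation above. The constant is chosen so the numbers roughly match the Chinchilla-style "~20 tokens per parameter" rule of thumb; treat the outputs as illustrative, not exact.

import math

def chinchilla_allocation(compute_flops: float):
    """Split a compute budget C ≈ 6 * N * D between parameters N and tokens D,
    assuming D ≈ 20 * N (so C ≈ 120 * N**2)."""
    params = math.sqrt(compute_flops / 120)
    tokens = 20 * params
    return params, tokens

for c in (1e23, 2e23):
    n, d = chinchilla_allocation(c)
    print(f"C={c:.0e}: ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
# Doubling compute scales params and tokens by ~sqrt(2) ≈ 1.4x each.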
Post-2025 Scaling (with RLVR):
New dimensions:
1. Pre-training compute
2. Pre-training data
3. RL compute budget
4. Test-time compute
Observation: RL often > pre-training for ROI
Example (DeepSeek):
$5.5M pre-training → GPT-4-class model
+ RL training → o1-class reasoning
More efficient than scaling pre-training!
USE CASE | RECOMMENDED
----------------------------|---------------------------
General chat/assistance | GPT-5, Claude Sonnet 4.5
Complex reasoning | GPT-5 Thinking, o3, R1
Code generation | GPT-5.2-Codex, Opus 4.5
Long documents (100K+ tok) | Gemini 3, LLaMA 4 Scout
Multimodal (vision+text) | Gemini 3, GPT-5
Cost-sensitive production | DeepSeek V3.2, Mixtral
Open-source/self-hosting | LLaMA 4, DeepSeek, Qwen
Fine-tuning needed | LLaMA 4, Mistral, Qwen
Research/experimentation | Open models (LLaMA, DS)
Dense vs Sparse (MoE):
Dense models (GPT-5, Claude):
✅ Simpler architecture
✅ Easier to train
✅ More stable
❌ Higher inference cost
❌ Less parameter-efficient

MoE models (DeepSeek, Mixtral):
✅ More parameters, same compute
✅ Lower inference cost
✅ Better specialization
❌ More complex training
❌ Load balancing challenges
Reasoning vs Non-Reasoning:
Standard models:
✅ Fast inference
✅ Predictable costs
✅ Good for simple tasks
❌ Struggle with complex logic

Reasoning models:
✅ Excellent on hard problems
✅ Show their work (interpretable)
✅ Self-correcting
❌ 5-10x slower
❌ 5-10x more expensive
❌ Overkill for simple tasks
Large vs Small Context:
Standard context (128K-400K):
✅ Efficient
✅ Lower cost
✅ Sufficient for most tasks
❌ May need chunking/RAG

Mega context (1M-10M):
✅ Process massive documents
✅ No chunking needed
✅ Better coherence
❌ Higher cost per token
❌ Slower processing
TIER 1: Premium ($10-15/M input)
- Claude Opus 4.5
- GPT-5 Pro
Use when: Quality > cost, mission-critical
TIER 2: Standard ($1-3/M input)
- GPT-5
- Claude Sonnet 4.5
- Gemini 3 Pro
Use when: Balanced quality and cost
TIER 3: Budget ($0.40-0.60/M input)
- DeepSeek-V3.2
- Mistral Medium 3.1
- Gemini 3 Flash
Use when: High volume, cost-sensitive
TIER 4: Open-source (self-hosting)
- LLaMA 4
- Qwen 3
- Mistral
Use when: Data privacy, customization
1. Hybrid Architectures
Combining strengths:
- Transformer + State Space Models (Mamba)
- Dense + Sparse (dynamic routing)
- Reasoning + Fast modes in one model
2. Smaller, Smarter Models
2024: 70B to match GPT-3.5
2025: 7B to match GPT-4
2026: 1B to match GPT-4?
Techniques:
- Better distillation
- Improved training
- Architectural efficiency
3. Multimodal Standard
Text-only → Exception
Native multimodal → Standard
Next: Touch, sensor data, robotics
4. Extreme Context
Current: 10M tokens
Future: 100M+ tokens?
Enables:
- Lifetime conversation history
- Company-wide knowledge
- Real-time learning
ARCHITECTURE TYPES
==================
Decoder-only: GPT-5, Claude, LLaMA
Best for: Generation, chat
Encoder-only: BERT (legacy)
Best for: Classification, embeddings
Encoder-decoder: T5 (legacy)
Best for: Translation, old systems
MoE: DeepSeek, Mixtral, LLaMA 4 Maverick
Best for: Efficiency, specialization
Multimodal: Gemini, GPT-5
Best for: Vision + text tasks
KEY INNOVATIONS (2025)
======================
✅ RLVR training
✅ Test-time compute scaling
✅ Multi-head Latent Attention
✅ Sparse attention patterns
✅ 1M-10M context windows
✅ Native multimodality
✅ Open-source parity
SELECTION CHECKLIST
===================
☐ Task complexity (simple vs reasoning)?
☐ Context requirements (<128K vs >1M)?
☐ Budget (premium vs economy)?
☐ Latency tolerance (fast vs accurate)?
☐ Multimodal needed?
☐ Open-source preferred?
☐ Fine-tuning planned?
The 2025 architecture revolution wasn't about new building blocks - transformers still dominate. The revolution was in how we train models: verifiable-reward RL (RLVR), test-time compute scaling, and efficiency techniques like MoE routing and latent attention.
For developers: Choose based on your use case, not hype. Test-time compute and efficient architectures often beat brute-force scaling.
Next in series: Part 3: Fine-Tuning Fundamentals - When and how to customize models for your needs
Further Reading: