
Part 2 of 3: LLM Fundamentals Series
Updated February 2026 - The LLM landscape transformed in 2025. This guide explains the architectures powering today's frontier models, from the transformer foundation to the reasoning revolution.
2025's Big Shift: It's not about bigger models anymore - it's about smarter training. RLVR (Reinforcement Learning from Verifiable Rewards) changed everything.
For a 5-year-old: Think of a transformer like a really smart reading buddy who can look at ALL the words in a story at the same time, instead of reading one word at a time like you do.
For developers: Transformers use self-attention to process entire sequences in parallel, replacing the sequential processing of RNNs/LSTMs.
Before transformers (RNNs/LSTMs):
Input: "The cat sat on the mat"
Processing: Sequential
Step 1: Process "The"
Step 2: Process "cat" (remembering "The")
Step 3: Process "sat" (remembering "The cat")
... and so on
Problem: Slow, hard to parallelize, forgets long-range dependencies
With transformers (Attention):
Input: "The cat sat on the mat"
Processing: Parallel
ALL words processed simultaneously
Each word "attends to" all other words
Can capture long-range dependencies
Result: 10-100x faster training, better quality
Query: "What does 'it' refer to?"
Sentence: "The cat sat on the mat because it was comfortable"
Attention mechanism:
"it" ā looks at all words
ā
scores each word
ā
"mat" gets highest score (0.87)
"cat" gets medium score (0.45)
"The" gets low score (0.03)
ā
Determines "it" = "mat"
Mathematical intuition (simplified):
# Conceptual pseudocode - for each word in the sentence:
for word in sentence:
    # Calculate "how much attention" to pay to every other word
    attention_scores = compute_similarity(word, all_other_words)
    # Use those scores to create a weighted combination
    context = weighted_sum(all_other_words, attention_scores)
    # This context helps understand the word better
    enhanced_representation = combine(word, context)
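To make the idea concrete, here is a minimal runnable sketch of single-head scaled dot-product self-attention in NumPy. The toy embeddings, dimensions, and random projection matrices are made up for illustration; real models use learned weights and many heads.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # every token scores every other token
    weights = softmax(scores, axis=-1)         # attention weights per token
    return weights @ V                         # weighted combination = context

# Toy example: 6 tokens ("The cat sat on the mat"), embedding dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (6, 8) - one enhanced vector per token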
Every transformer has the same core ingredients: token embeddings, positional information, stacked blocks of self-attention plus feed-forward layers, and an output projection. What changed in 2025 is how these models are trained:
Old paradigm (2020-2024):
Bigger model + More data + More compute = Better performance
New paradigm (2025+):
Base model + RLVR training + Test-time compute = Reasoning ability
Simple explanation: Instead of just teaching a model what answers LOOK like, we make it solve actual problems and reward it when it gets them right!
Technical explanation:
RLVR = Reinforcement Learning from Verifiable Rewards
Traditional training:
Input: "What is 2+2?"
Target: "The answer is 4"
Loss: How close is model output to target text?
RLVR training:
Input: "What is 2+2?"
Model generates: "Let me think... 2+2... that's 4"
Verification: Check if 4 is mathematically correct ✓
Reward: +1 for correct, -1 for incorrect
Update: Reinforce behaviors that led to correct answer
Key insight: Model learns to develop intermediate reasoning steps because those strategies lead to correct final answers!
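A minimal sketch of what a verifiable reward could look like for math problems, following the toy example above. The answer-extraction regex and the ±1 reward values are illustrative assumptions; real RLVR pipelines use far more robust verifiers.

import re

def verify_math_answer(model_output: str, correct_answer: float) -> float:
    """Return +1 if the model's final number matches the verified answer, else -1."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return -1.0                      # no answer produced at all
    final_answer = float(numbers[-1])    # treat the last number as the final answer
    return 1.0 if abs(final_answer - correct_answer) < 1e-6 else -1.0

# Example: this reward signal is what reinforces the reasoning that produced it
print(verify_math_answer("Let me think... 2+2... that's 4", correct_answer=4))  # 1.0
print(verify_math_answer("The answer is 5", correct_answer=4))                  # -1.0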
Models trained with RLVR spontaneously develop:
Example - DeepSeek-R1-Zero (pure RL, no supervised training):
Problem: "If x² - 5x + 6 = 0, find x"
Model's internal reasoning (discovered on its own!):
<think>
Let me factor this equation.
I need two numbers that multiply to 6 and add to -5.
Those numbers are -2 and -3.
So (x - 2)(x - 3) = 0
This means x = 2 or x = 3
Let me verify:
For x = 2: 2² - 5(2) + 6 = 4 - 10 + 6 = 0 ✓
For x = 3: 3² - 5(3) + 6 = 9 - 15 + 6 = 0 ✓
</think>
The solutions are x = 2 and x = 3.
Nobody taught it to "show work" - it learned this helps get correct answers!
Old scaling:
Better performance = Bigger model
New scaling:
Better performance = More thinking time
Practical example:
GPT-5 on math problem:
Low effort (fast):
- Thinking tokens: ~100
- Time: 3 seconds
- Accuracy: 65%
Medium effort:
- Thinking tokens: ~500
- Time: 12 seconds
- Accuracy: 82%
High effort:
- Thinking tokens: ~2,500
- Time: 45 seconds
- Accuracy: 94%
Same model, different compute budget!
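As a purely hypothetical illustration: some reasoning APIs expose an effort knob along these lines. The client calls, model name, and the `reasoning_effort` parameter values below are assumptions made for illustration, not a documented interface; check your provider's docs before using anything like this.

# Hypothetical sketch - assumes an OpenAI-style client and a `reasoning_effort`
# parameter taking "low" / "medium" / "high" (assumed names, not confirmed).
def solve(client, problem: str, effort: str = "medium") -> str:
    response = client.chat.completions.create(
        model="gpt-5",                    # assumed model name
        reasoning_effort=effort,          # trade latency/cost for accuracy
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content

# Same model, different compute budget:
# solve(client, hard_math_problem, effort="high")  -> slower, more accurate
# solve(client, simple_lookup, effort="low")       -> fast, cheap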
Simple explanation: GPT-5 is like a kid who can choose whether to answer quickly or think really hard depending on the question.
Architecture:
Type: Decoder-only transformer
Parameters: ~1.8T total (estimated)
Context: 400K tokens
Special features:
- Adaptive reasoning (Instant/Auto/Thinking modes)
- Unified architecture (one model, multiple modes)
- Extended multimodal (text, image, audio)
Training:
Phase 1: Pre-training on internet-scale data
Phase 2: Supervised fine-tuning
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Phase 4: RLVR (Reasoning training)
Key innovations: adaptive reasoning modes, a unified architecture that spans instant and thinking modes, and RLVR as a dedicated reasoning-training phase.
Architecture diagram:
Input → Embedding → Transformer Blocks (N layers) → Output
                          ↓
                  [Reasoning path]
                          ↓
                  Extra processing
                          ↓
               More transformer passes
                          ↓
            Output with chain-of-thought (CoT)
Simple explanation: Claude not only thinks hard but also makes sure its thinking follows important rules about being helpful and harmless.
Architecture:
Type: Decoder-only transformer + Constitutional AI
Parameters: ~500B (Opus), ~200B (Sonnet)
Context: 200K tokens
Special features:
- Extended thinking with tool use
- Constitutional AI alignment
- Chrome browser integration
Training:
Phase 1: Pre-training
Phase 2: Constitutional AI (AI-generated feedback)
Phase 3: RLHF with harmlessness criteria
Phase 4: Reasoning fine-tuning
Constitutional AI explained:
Instead of just training on human feedback:
Step 1: Generate many responses
Step 2: AI critiques its own responses against principles
Step 3: AI revises responses to be better
Step 4: Train on revised responses
Step 5: RLHF with harmlessness as key criteria
Principles include:
- Be helpful and harmless
- Respect user agency
- Be honest about limitations
- Avoid deception
Why it matters: Built-in safety without sacrificing capability.
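A minimal sketch of the critique-and-revise loop described above. The `generate()` helper and the prompt wording are hypothetical stand-ins for illustration, not Anthropic's actual pipeline.

def constitutional_revision(generate, prompt: str, principles: list[str]) -> str:
    """One critique-and-revise pass; `generate` is any text-generation function."""
    response = generate(prompt)
    for principle in principles:
        # The model critiques its own response against one principle...
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # ...then revises the response to address the critique.
        response = generate(
            f"Revise the response to address this critique:\n{critique}\n\nResponse:\n{response}"
        )
    return response  # revised responses become the training targets for fine-tuning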
Simple explanation: DeepSeek figured out how to be as smart as the expensive models while using way less computer power - like getting straight A's while studying half as long!
Architecture:
Type: Mixture-of-Experts (MoE) + Multi-head Latent Attention
Total parameters: 671B
Active parameters: 37B (per token)
Context: 128K tokens (extensible to 256K)
MoE structure:
- 256 routed experts
- 1 shared expert (always active)
- Top-8 routing per token
Special features:
- Multi-head Latent Attention (MLA)
- FP8 precision training
- DeepSeek Sparse Attention (V3.2)
The MoE Magic:
Traditional dense model:
Every token → All 671B parameters
Cost: High memory, slow inference
DeepSeek MoE:
Every token → 37B active parameters
              (8 experts + 1 shared expert)
Cost: 94% reduction in active compute!
How it works:
Input token: "quantum"
Router scores all 256 experts
Selects top 8: [Physics, Math, Chemistry, ...]
Activates those + shared expert
Other 248 experts: dormant
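A toy illustration of the routing step above and of where the "94% reduction" figure comes from: only the selected experts (plus the shared expert and attention layers) are touched per token. The random router scores are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K = 256, 8

# Router produces one score per expert for the token "quantum"
router_scores = rng.normal(size=NUM_EXPERTS)
top8 = np.argsort(router_scores)[-TOP_K:]        # indices of the 8 selected experts
print(f"Selected experts: {sorted(top8.tolist())}; the other {NUM_EXPERTS - TOP_K} stay dormant")

# Rough compute saving implied by the published parameter counts
total_params, active_params = 671e9, 37e9
print(f"Active fraction: {active_params / total_params:.1%}")       # ~5.5%
print(f"Reduction:       {1 - active_params / total_params:.0%}")   # ~94%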
Multi-head Latent Attention (MLA):
Problem: Standard attention stores huge KV cache
- Memory explodes with long contexts
- Costs $$$ in inference
MLA solution: Compress KV representations
- Low-rank compression of keys/values
- 50% reduction in memory
- Actually BETTER performance (!?)
Training efficiency:
GPT-4: ~$100M+ estimated training cost
DeepSeek-V3: $5.5M actual training cost
18x cheaper, competitive performance
V3 → R1 → V3.1 → V3.2 Evolution:
V3 (Dec 2024): Base MoE model
      ↓
R1 (Jan 2025): Pure RL reasoning (no SFT!)
      ↓
V3.1 (Aug 2025): Hybrid with mode switching
      ↓
V3.2 (Sep 2025): Added DeepSeek Sparse Attention
Simple explanation: Gemini is like a kid who's not just good with words, but can see pictures, hear sounds, watch videos, AND read code - all at once!
Architecture:
Type: Natively multimodal decoder transformer
Parameters: ~1.5T (estimated for Pro)
Context: 1M tokens (massive!)
Special features:
- Native multimodal (not adapted text model)
- Deep Think reasoning mode
- Extreme long-context optimization
Modalities:
- Text (all languages)
- Images (high resolution)
- Audio
- Video (native, not frame-by-frame)
- Code
Native multimodality:
Old approach (GPT-4V):
1. Train text model
2. Add vision encoder
3. Bridge with projection layer
4. Fine-tune together
Gemini approach:
1. Train on ALL modalities from day 1
2. Unified token space for text/image/audio
3. Attention across all modalities
4. No separate encoders
Result: Better cross-modal understanding
1M context window:
Techniques used:
- Ring attention (distributed attention)
- Optimized position embeddings (ALiBi)
- Sparse attention patterns
- Efficient KV caching
Enables:
- Entire codebases in context
- Full books
- Hours of video transcripts
- Massive document analysis
Simple explanation: LLaMA is Meta's gift to everyone - powerful AI that anyone can use, modify, and run on their own computers!
Architecture:
Type: Decoder-only + MoE (Maverick)
Parameters:
Scout: 70B (dense)
Maverick: 400B (MoE with ~50B active)
Context:
Scout: 10M tokens (!!!)
Maverick: 1M tokens
License: Llama 4 Community License
LLaMA 4 Scout - The Context King:
10 MILLION token context!
What fits:
- Entire Harry Potter series
- Complete large codebases
- 20,000 pages of documents
- Years of company emails
How they did it:
- Grouped Query Attention (GQA) - sketched in code below
- Optimized for single-GPU inference
- Aggressive KV cache compression
- Special training for long contexts
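A minimal NumPy sketch of the grouped-query attention idea from the list above: many query heads share a smaller set of key/value heads, which shrinks the KV cache. The head counts and dimensions here are illustrative, not LLaMA 4's actual configuration.

import numpy as np

def grouped_query_attention(Q, K, V, num_q_heads=8, num_kv_heads=2):
    """Q: [heads_q, seq, d]; K, V: [heads_kv, seq, d]. Each KV head serves a group of Q heads."""
    group_size = num_q_heads // num_kv_heads
    d = Q.shape[-1]
    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                               # which KV head this query head shares
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V[kv])
    return np.stack(outputs)                               # [heads_q, seq, d]

rng = np.random.default_rng(0)
seq, d = 16, 32
Q = rng.normal(size=(8, seq, d))
K = rng.normal(size=(2, seq, d))   # only 2 KV heads cached instead of 8 -> 4x smaller KV cache
V = rng.normal(size=(2, seq, d))
print(grouped_query_attention(Q, K, V).shape)   # (8, 16, 32)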
MoE Design (Maverick):
Different from DeepSeek:
DeepSeek: MoE in every layer
LLaMA 4: Alternating MoE and dense layers
Layer pattern (sketched in code below):
1. Dense FFN
2. MoE FFN
3. Dense FFN
4. MoE FFN
...
Trade-off:
- More stable training
- Better dense computation
- Slightly less parameter-efficient
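A toy sketch of the trade-off above: placing MoE FFNs only in alternating layers versus in every layer. All sizes are in arbitrary units and the expert counts are invented for illustration, not Meta's or DeepSeek's actual configurations.

def stack_params(num_layers, dense_ffn=1.0, expert_ffn=1.0, num_experts=16, top_k=2, moe_every=2):
    """Toy FFN parameter totals for a stack with an MoE FFN every `moe_every` layers."""
    total = active = 0.0
    for i in range(num_layers):
        if i % moe_every == moe_every - 1:        # MoE layer: many experts, few active
            total += num_experts * expert_ffn
            active += top_k * expert_ffn
        else:                                     # dense layer: all parameters always active
            total += dense_ffn
            active += dense_ffn
    return total, active

t, a = stack_params(48, moe_every=2)
print(f"Alternating (LLaMA 4 style):     total={t:.0f}, active={a:.0f}")
t, a = stack_params(48, moe_every=1)
print(f"MoE every layer (DeepSeek style): total={t:.0f}, active={a:.0f}")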
Simple explanation: Kimi has an absolutely massive memory - it can remember entire libraries of books!
Architecture:
Type: MoE (based on DeepSeek V3)
Parameters: ~671B total, ~40B active
Context: 256K tokens (Think variant)
Special features:
- More experts than DeepSeek
- Fewer attention heads
- Extended context training
Differences from DeepSeek V3:
Experts: 320 (vs 256 in DeepSeek)
Attention heads: 64 (vs 128 in DeepSeek)
Training: More long-context data
Why more experts?
More experts = More specialization
Example routing:
Token: "quantum entanglement"
DeepSeek (256 experts):
Activates: Physics, Quantum, ...
Kimi (320 experts):
Activates: Quantum Physics, Quantum Computing, Particle Physics, ...
Result: More fine-grained expertise
Simple explanation: Mixtral has 8 experts and picks the best 2 for each task - like having a team where you always pick the right specialists!
Architecture:
Type: Sparse Mixture-of-Experts
Models:
8x7B: 47B total, 13B active
8x22B: 141B total, 39B active
Context: 32K tokens
License: Apache 2.0
MoE structure:
- 8 experts per layer
- Top-2 routing
- Simple, clean design
Mistral Medium 3.1 (2025):
The value champion:
Performance: ~90% of Claude Sonnet 3.7
Cost: $0.40/M tokens (8x cheaper!)
Use case: Cost-sensitive production workloads
Architecture: Same MoE, better training
Standard Multi-Head Attention:
# Conceptual pseudocode
def multi_head_attention(query, key, value):
    # Split into multiple heads
    Q = split_heads(query, num_heads=128)
    K = split_heads(key, num_heads=128)
    V = split_heads(value, num_heads=128)
    # For each head: scaled dot-product attention
    attention_scores = softmax(Q @ K.T / sqrt(d_k))
    output = attention_scores @ V
    # Combine heads
    return concat_heads(output)

Memory cost (KV cache): O(sequence_length × num_heads × head_dim)
Multi-head Latent Attention (MLA - DeepSeek):
def multi_head_latent_attention(query, key, value):
    # Compress K, V to a low-rank latent space
    K_latent = compress(key)      # low-rank projection
    V_latent = compress(value)
    # Attention in the compressed space
    attention = softmax(query @ K_latent.T)
    output = attention @ V_latent
    # Expand back to the full dimension
    return expand(output)

Memory cost (KV cache): O(sequence_length × latent_dim)
where latent_dim << num_heads × head_dim
Savings: 50-70% memory reduction!
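A quick back-of-the-envelope comparison of per-layer KV-cache size using the two cost formulas above. The dimensions (128 heads of size 128, a latent width of 6144, fp16 storage) are assumptions chosen so the savings land in the 50-70% range quoted above; the real ratio depends on the model's configuration.

def kv_cache_bytes(seq_len, width, bytes_per_value=2, kv=2):
    """Per-layer KV-cache size: keys + values, each seq_len x width, stored in fp16/bf16."""
    return kv * seq_len * width * bytes_per_value

seq_len = 128_000
standard = kv_cache_bytes(seq_len, width=128 * 128)   # num_heads x head_dim
latent = kv_cache_bytes(seq_len, width=6144)          # assumed compressed latent width
print(f"Standard MHA: {standard / 1e9:.1f} GB per layer")
print(f"MLA (latent): {latent / 1e9:.1f} GB per layer")
print(f"Reduction:    {1 - latent / standard:.1%}")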
Sparse Attention (V3.2):
def sparse_attention(query, key, value):
    # Don't attend to ALL tokens; use a local + global pattern
    local_window = 256    # attend to nearby tokens
    global_stride = 64    # attend to distant tokens periodically
    # Build the sparse attention mask (one possible construction is sketched below)
    mask = create_sparse_mask(len(query), local_window, global_stride)
    # Apply attention only where the mask allows; masked-out positions get -inf before softmax
    scores = softmax(where(mask, query @ key.T, -inf))
    output = scores @ value
    return output

Computation: O(sequence_length × (local_window + sequence_length / global_stride))
vs O(sequence_length²) for dense attention
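The `create_sparse_mask` helper above isn't defined in the original text, so here is one plausible NumPy construction of a local-window plus strided-global pattern. This is an assumption about the pattern's shape, not DeepSeek's actual implementation.

import numpy as np

def create_sparse_mask(seq_len, local_window=256, global_stride=64):
    """Boolean [seq_len, seq_len] mask: attend locally plus to every `global_stride`-th token."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = np.abs(i - j) < local_window          # nearby tokens
    global_ = (j % global_stride) == 0            # periodic "landmark" tokens
    causal = j <= i                               # decoder-only models attend to the past only
    return (local | global_) & causal

mask = create_sparse_mask(4096)
print(f"Fraction of positions attended: {mask.mean():.1%}")   # far below the dense causal ~50%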
The Routing Problem:
def route_to_experts(token, experts, k=8):
    """
    Given a token, decide which experts should process it
    """
    # Each token gets a routing score for each expert
    router_weights = compute_router_scores(token)
    # Shape: [num_experts], e.g. [0.1, 0.05, 0.8, 0.3, ..., 0.2]

    # Select the top-k experts
    top_k_indices = topk(router_weights, k)
    top_k_weights = router_weights[top_k_indices]
    # Normalize the selected weights (sum to 1)
    top_k_weights = softmax(top_k_weights)

    # Process the token through the selected experts
    outputs = []
    for expert_idx, weight in zip(top_k_indices, top_k_weights):
        expert_output = experts[expert_idx](token)
        outputs.append(weight * expert_output)

    # Weighted combination of the expert outputs
    final_output = sum(outputs)
    return final_output
Load Balancing Challenge:
# Problem: all tokens might route to the same "favorite" experts
#   Expert 1: processes 80% of tokens (overloaded!)
#   Experts 2-8: process 20% of tokens (underutilized)
# Solution: auxiliary load-balancing loss

def load_balancing_loss(router_weights, expert_assignments):
    # Measure how evenly tokens are distributed across experts
    expert_usage = count_assignments_per_expert(expert_assignments)
    # Ideal: each expert gets 1/num_experts of the tokens
    ideal_usage = total_tokens / num_experts
    # Penalize deviation from the ideal
    loss = mean((expert_usage - ideal_usage) ** 2)
    return loss

# Added to the training loss to encourage balanced routing
Shared Expert (DeepSeek innovation):
def moe_with_shared_expert(token, routed_experts, shared_expert):
    # Standard routing to the top-k experts
    routed_output = route_to_experts(token, routed_experts, k=8)
    # ALWAYS process through the shared expert
    shared_output = shared_expert(token)
    # Combine
    final_output = routed_output + shared_output
    return final_output

# Why this helps:
# - The shared expert learns common patterns
# - The routed experts specialize more
# - Better gradient flow during training
Why we need them:
Attention is permutation-invariant!
"The cat chased the mouse"
vs
"The mouse chased the cat"
Without positions → same attention patterns → wrong!
RoPE (Rotary Position Embedding):
# Used by LLaMA, DeepSeek, and many modern models
def rope_encoding(embeddings, positions):
    # Rotate each embedding based on its position.
    # The embedding is treated as d/2 pairs of dimensions; each pair (2j, 2j+1)
    # is rotated by an angle that depends on the position and the pair index.
    d = embeddings.shape[-1]
    for i, pos in enumerate(positions):
        for j in range(d // 2):
            theta = pos / 10000 ** (2 * j / d)
            rotation = [[cos(theta), -sin(theta)],
                        [sin(theta),  cos(theta)]]
            embeddings[i, 2*j:2*j+2] = rotation @ embeddings[i, 2*j:2*j+2]
    return embeddings

# Properties:
# - Relative positions are naturally encoded in the dot products
# - Extrapolates to longer sequences
# - No learned parameters
ALiBi (Attention with Linear Biases):
# Used by Gemini for extreme long contexts
def alibi_attention(Q, K, positions):
    # Compute base attention scores
    scores = Q @ K.T
    # Add a position-based bias: the farther apart two tokens are,
    # the larger the penalty on their attention score
    for i in positions:
        for j in positions:
            distance = abs(i - j)
            scores[i][j] += -distance * slope   # slope is a fixed per-head constant
    # Penalizes attending to distant tokens
    # Allows scaling to very long contexts
    return softmax(scores)
Pre-2025 Scaling (Chinchilla):
Optimal compute allocation:
For compute budget C:
Model params ∝ C^0.5
Training tokens ∝ C^0.5
Example:
2x compute → ~1.4x params, ~1.4x tokens
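A tiny sketch of the square-root allocation above. The constant is chosen so the numbers roughly match the Chinchilla-style "~20 tokens per parameter" rule of thumb; treat the outputs as illustrative, not exact.

import math

def chinchilla_allocation(compute_flops: float):
    """Split a compute budget C ≈ 6 * N * D between parameters N and tokens D,
    assuming D ≈ 20 * N (so C ≈ 120 * N**2)."""
    params = math.sqrt(compute_flops / 120)
    tokens = 20 * params
    return params, tokens

for c in (1e23, 2e23):
    n, d = chinchilla_allocation(c)
    print(f"C={c:.0e}: ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
# Doubling compute scales params and tokens by ~sqrt(2) ≈ 1.4x each.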
Post-2025 Scaling (with RLVR):
New dimensions:
1. Pre-training compute
2. Pre-training data
3. RL compute budget
4. Test-time compute
Observation: RL often > pre-training for ROI
Example (DeepSeek):
$5.5M pre-training → GPT-4-class model
+ RL training → o1-class reasoning
More efficient than scaling pre-training!
USE CASE | RECOMMENDED
----------------------------|---------------------------
General chat/assistance | GPT-5, Claude Sonnet 4.5
Complex reasoning | GPT-5 Thinking, o3, R1
Code generation | GPT-5.2-Codex, Opus 4.5
Long documents (100K+ tok) | Gemini 3, LLaMA 4 Scout
Multimodal (vision+text) | Gemini 3, GPT-5
Cost-sensitive production | DeepSeek V3.2, Mixtral
Open-source/self-hosting | LLaMA 4, DeepSeek, Qwen
Fine-tuning needed | LLaMA 4, Mistral, Qwen
Research/experimentation | Open models (LLaMA, DS)
Dense vs Sparse (MoE):
Dense models (GPT-5, Claude):
✅ Simpler architecture
✅ Easier to train
✅ More stable
❌ Higher inference cost
❌ Less parameter-efficient

MoE models (DeepSeek, Mixtral):
✅ More parameters, same compute
✅ Lower inference cost
✅ Better specialization
❌ More complex training
❌ Load balancing challenges
Reasoning vs Non-Reasoning:
Standard models:
✅ Fast inference
✅ Predictable costs
✅ Good for simple tasks
❌ Struggle with complex logic

Reasoning models:
✅ Excellent on hard problems
✅ Show their work (interpretable)
✅ Self-correcting
❌ 5-10x slower
❌ 5-10x more expensive
❌ Overkill for simple tasks
Large vs Small Context:
Standard context (128K-400K):
✅ Efficient
✅ Lower cost
✅ Sufficient for most tasks
❌ May need chunking/RAG

Mega context (1M-10M):
✅ Process massive documents
✅ No chunking needed
✅ Better coherence
❌ Higher cost per token
❌ Slower processing
TIER 1: Premium ($10-15/M input)
- Claude Opus 4.5
- GPT-5 Pro
Use when: Quality > cost, mission-critical
TIER 2: Standard ($1-3/M input)
- GPT-5
- Claude Sonnet 4.5
- Gemini 3 Pro
Use when: Balanced quality and cost
TIER 3: Budget ($0.40-0.60/M input)
- DeepSeek-V3.2
- Mistral Medium 3.1
- Gemini 3 Flash
Use when: High volume, cost-sensitive
TIER 4: Open-source (self-hosting)
- LLaMA 4
- Qwen 3
- Mistral
Use when: Data privacy, customization
1. Hybrid Architectures
Combining strengths:
- Transformer + State Space Models (Mamba)
- Dense + Sparse (dynamic routing)
- Reasoning + Fast modes in one model
2. Smaller, Smarter Models
2024: 70B to match GPT-3.5
2025: 7B to match GPT-4
2026: 1B to match GPT-4?
Techniques:
- Better distillation
- Improved training
- Architectural efficiency
3. Multimodal Standard
Text-only → Exception
Native multimodal → Standard
Next: Touch, sensor data, robotics
4. Extreme Context
Current: 10M tokens
Future: 100M+ tokens?
Enables:
- Lifetime conversation history
- Company-wide knowledge
- Real-time learning
ARCHITECTURE TYPES
==================
Decoder-only: GPT-5, Claude, LLaMA
Best for: Generation, chat
Encoder-only: BERT (legacy)
Best for: Classification, embeddings
Encoder-decoder: T5 (legacy)
Best for: Translation, old systems
MoE: DeepSeek, Mixtral, LLaMA 4 Maverick
Best for: Efficiency, specialization
Multimodal: Gemini, GPT-5
Best for: Vision + text tasks
KEY INNOVATIONS (2025)
======================
✅ RLVR training
✅ Test-time compute scaling
✅ Multi-head Latent Attention
✅ Sparse attention patterns
✅ 1M-10M context windows
✅ Native multimodality
✅ Open-source parity
SELECTION CHECKLIST
===================
☐ Task complexity (simple vs reasoning)?
☐ Context requirements (<128K vs >1M)?
☐ Budget (premium vs economy)?
☐ Latency tolerance (fast vs accurate)?
☐ Multimodal needed?
☐ Open-source preferred?
☐ Fine-tuning planned?
The 2025 architecture revolution wasn't about new building blocks - transformers still dominate. The revolution was in how we train models: verifiable-reward RL (RLVR), test-time compute scaling, and efficiency techniques like MoE routing and latent attention.
For developers: Choose based on your use case, not hype. Test-time compute and efficient architectures often beat brute-force scaling.
Next in series: Part 3: Fine-Tuning Fundamentals - When and how to customize models for your needs
Further Reading: