
Part 1 of 3: LLM Fundamentals Series
Before you can master LLMs, you need to understand tokens - the fundamental units that make everything work. Whether you're building with GPT-5, Claude, or running LLaMA locally, tokens determine your costs, context limits, and performance.
For a 5-year-old: A token is like a puzzle piece of language! Just like you break LEGO builds into individual bricks, language models break text into small pieces called tokens.
When you say "I love ice cream!" - the model sees it as 5 puzzle pieces: ["I", " love", " ice", " cream", "!"]
Tokens are the fundamental units that LLMs process. They're NOT always whole words:
Examples:
"hello" → 1 token
"understanding" → ["under", "stand", "ing"] → 3 tokens
"ChatGPT" → ["Chat", "GPT"] → 2 tokens
"π" → 1 token
" " → 1 token
"!" → 1 token
Key principle: Tokens represent frequent patterns in the training data. Common words = 1 token. Rare words or combinations = multiple tokens.
APIs charge per token, not per word or character.
Real pricing (February 2026):
GPT-5:
Input: $1.25 per million tokens
Output: $10.00 per million tokens
Claude Opus 4.5:
Input: $15.00 per million tokens
Output: $75.00 per million tokens
DeepSeek-V3.2:
Input: $0.55 per million tokens
Output: $2.20 per million tokens
Gemini 3 Pro:
Input: $1.25 per million tokens
Output: $5.00 per million tokens
Example calculation:
Task: Summarize 100 customer reviews
Average per review: 200 words
Total words: 20,000
Token conversion: 20,000 words ÷ 0.75 = ~26,667 tokens
Costs (assuming ~27K total output tokens across all summaries):
- GPT-5: $0.03 input + $0.27 output
- DeepSeek: $0.01 input + $0.06 output
- Claude Opus: $0.40 input + $2.00 output
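The arithmetic above can be wrapped in a small helper. The prices are hard-coded from the table above; real prices change, so treat this as a snapshot, not a source of truth:

```python
PRICES = {  # $ per million tokens (snapshot from the pricing table above)
    "gpt-5": (1.25, 10.00),
    "claude-opus-4.5": (15.00, 75.00),
    "deepseek-v3.2": (0.55, 2.20),
    "gemini-3-pro": (1.25, 5.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in dollars for one request at the snapshot prices."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 100 reviews (~26,667 tokens in) with ~27,000 tokens of summaries out:
print(f"${request_cost('deepseek-v3.2', 26_667, 27_000):.2f}")  # $0.07
```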
Every model has a maximum token "budget" for each conversation.
Current limits (February 2026):
GPT-5: 400,000 tokens (~300,000 words, ~600 pages)
GPT-5.2-Codex: 400,000 tokens
Claude Opus 4.5: 200,000 tokens (~150,000 words, ~300 pages)
Claude Sonnet 4.5: 200,000 tokens
Gemini 3 Pro: 1,000,000 tokens (~750,000 words, ~1,500 pages)
DeepSeek-V3.2: 128,000 tokens (~96,000 words, ~192 pages)
LLaMA 4 Scout: 10,000,000 tokens (~7.5M words, ~15,000 pages!)
LLaMA 4 Maverick: 1,000,000 tokens
Kimi K2.5: 256,000 tokens
What this means in practice:
128K context = Can fit:
- ~1 novel
- ~50 research papers
- ~500 emails
- ~2,000 Slack messages
1M context = Can fit:
- ~8 novels
- ~400 research papers
- ~4,000 emails
- Entire small codebases
10M context = Can fit:
- Harry Potter series + LOTR + more
- Entire large codebases
- Years of company documents
Processing speed is measured in tokens per second.
Typical speeds:
Fast models (inference-optimized):
GPT-5 Instant: 100-150 tokens/sec
Gemini 3 Flash: 120-180 tokens/sec
Claude Haiku 4.5: 100-130 tokens/sec
Standard models:
GPT-5: 60-80 tokens/sec
Claude Sonnet 4.5: 50-70 tokens/sec
Reasoning models (showing work):
GPT-5 Thinking: 20-40 tokens/sec
o3: 15-30 tokens/sec
DeepSeek-R1: 25-35 tokens/sec
Real-world impact:
Generate 2,000 token response:
Fast model (120 tok/sec): ~17 seconds
Standard model (70 tok/sec): ~29 seconds
Reasoning model (30 tok/sec): ~67 seconds
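Those latency estimates are just tokens divided by throughput. This is a rough model: real latency also includes time-to-first-token and network overhead, which the sketch below ignores:

```python
def generation_time(tokens, tokens_per_sec):
    """Rough time to generate a response, ignoring time-to-first-token."""
    return tokens / tokens_per_sec

# Generating a 2,000-token response at the typical speeds above:
for label, speed in [("fast", 120), ("standard", 70), ("reasoning", 30)]:
    print(f"{label}: ~{generation_time(2_000, speed):.0f} seconds")
# fast: ~17 seconds
# standard: ~29 seconds
# reasoning: ~67 seconds
```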
Most modern models use BPE (Byte Pair Encoding) or SentencePiece:
Input text: "Tokenization is cool!"
# Step 1: Break into candidate pieces
# (Based on trained vocabulary)
pieces = ["Token", "ization", " is", " cool", "!"]
# Step 2: Look up in vocabulary (example IDs)
vocabulary = {
"Token": 5678,
"ization": 1234,
" is": 318,
" cool": 4500,
"!": 0
}
# Step 3: Convert to IDs
token_ids = [5678, 1234, 318, 4500, 0]
# Step 4: Model processes these IDs
# (This is what actually goes through the neural network)
# Step 5: Model predicts next token ID
next_token_id = 2589 # Let's say this means " Amazing"
# Step 6: Convert back to text
output = "Tokenization is cool! Amazing"
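Steps 1-3 can be sketched as a toy greedy longest-match tokenizer. Real BPE applies learned merge rules instead, and the vocabulary and IDs here are the invented ones from the example above, not a real model's:

```python
# Toy vocabulary (IDs invented for illustration; real vocabs have 32K-256K entries)
vocabulary = {"Token": 5678, "ization": 1234, " is": 318, " cool": 4500, "!": 0}
id_to_piece = {v: k for k, v in vocabulary.items()}

def encode(text):
    """Greedy longest-match segmentation (a simplification of real BPE)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest piece that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                ids.append(vocabulary[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"No token matches at position {i}")
    return ids

def decode(ids):
    """IDs back to text: concatenate the pieces."""
    return "".join(id_to_piece[i] for i in ids)

ids = encode("Tokenization is cool!")
print(ids)          # [5678, 1234, 318, 4500, 0]
print(decode(ids))  # Tokenization is cool!
```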
Same text, different token counts:
Text: "The DeepSeek-V3 model achieves state-of-the-art results!"
GPT-4 tokenizer (cl100k_base):
['The', ' Deep', 'Se', 'ek', '-', 'V', '3', ' model', ' achieves',
' state', '-', 'of', '-', 'the', '-', 'art', ' results', '!']
= 18 tokens
Claude tokenizer (~100k vocab):
['The', ' DeepSeek', '-', 'V3', ' model', ' achieves',
' state-of-the-art', ' results', '!']
= 9 tokens (more efficient!)
LLaMA tokenizer (~32k vocab):
['The', ' Deep', 'Se', 'ek', '-', 'V', '3', ' model', ' ach', 'ieves',
' state', '-', 'of', '-', 'the', '-', 'art', ' results', '!']
= 19 tokens
Why the difference? Mostly vocabulary size: a tokenizer with a larger vocabulary has learned more multi-character pieces (like " state-of-the-art"), so it covers the same text with fewer tokens. More on this in the vocabulary section below.
English text (approximate):
1 token ≈ 0.75 words
1 token ≈ 4 characters
Reverse:
1 word ≈ 1.33 tokens
1 character ≈ 0.25 tokens
Pages:
1 page (500 words) ≈ 667 tokens
1,000 tokens ≈ 1.5 pages
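These rules of thumb are easy to turn into quick estimators. They only hold for English prose; always count with a real tokenizer before relying on the numbers:

```python
def words_to_tokens(words):
    """Rough estimate: ~1.33 tokens per English word."""
    return int(words / 0.75)

def tokens_to_pages(tokens, words_per_page=500):
    """Rough estimate: tokens -> words -> pages."""
    return tokens * 0.75 / words_per_page

print(words_to_tokens(20_000))           # 26666 (the article rounds to ~26,667)
print(round(tokens_to_pages(1_000), 1))  # 1.5
```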
Example 1: Blog Post
Content: 2,000 word blog post
Tokens: 2,000 × 1.33 = ~2,667 tokens
Context needed:
- Input (post): 2,667 tokens
- Prompt: 100 tokens
- Output (summary): 200 tokens
Total: ~2,967 tokens
Cost (GPT-5):
- Input: 2,767 × $1.25/M = $0.0035
- Output: 200 × $10/M = $0.002
Total: $0.0055 per blog post
Example 2: Customer Support Bot
Average conversation:
- User messages: 50 words × 5 messages = 250 words
- Bot responses: 75 words × 5 messages = 375 words
- System prompt: 150 words
Total: 775 words = ~1,033 tokens
Daily volume: 1,000 conversations
Monthly tokens: 1,033 × 1,000 × 30 = ~31,000,000 tokens
Cost (DeepSeek-V3.2):
Input (~16M tokens: user messages + system prompt): 16M × $0.55/M = $8.80
Output (~15M tokens: bot responses): 15M × $2.20/M = $33.00
Monthly total: ~$42/month
Example 3: Code Analysis
Codebase: 50,000 lines of Python
Average line: 40 characters
Total characters: 2,000,000
Tokens: 2,000,000 ÷ 4 = ~500,000 tokens
Fits in:
❌ GPT-5 (400K) - NO, need chunking
✅ Gemini 3 Pro (1M) - YES
✅ LLaMA 4 Maverick (1M) - YES
✅ LLaMA 4 Scout (10M) - YES, with room to spare
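When a codebase exceeds the context window, the usual fallback is chunking. A minimal sketch using the 4-characters-per-token heuristic (real pipelines split on file or function boundaries rather than raw character offsets):

```python
def chunk_by_tokens(text, max_tokens, chars_per_token=4):
    """Split text into pieces that fit a token budget (4 chars/token heuristic)."""
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# A 2M-character codebase split into ~100K-token chunks:
chunks = chunk_by_tokens("x" * 2_000_000, max_tokens=100_000)
print(len(chunks))  # 5
```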
The problem: Most tokenizers are optimized for English.
Real examples:
English: "Hello world"
Tokens: ["Hello", " world"] = 2 tokens
Chinese: "你好世界" (same meaning)
Tokens: ["你", "好", "世", "界"] = 4 tokens
Cost: 2x English!
Arabic: "مرحبا بالعالم" (same meaning)
Tokens: ~6-8 tokens (depends on model)
Cost: 3-4x English!
Code: def hello_world():
Tokens: ["def", " hello", "_", "world", "(", ")", ":"] = 7 tokens
Cost: 3.5x the words!
Most Efficient (1.0-1.2 tokens/word):
- Simple English prose
- Common words
- Standard punctuation
Medium Efficiency (1.3-1.8 tokens/word):
- Technical English
- Mixed language
- Marketing copy
Least Efficient (2.0-4.0 tokens/word):
- Code (especially with special characters)
- Non-English languages
- URLs and identifiers
- Mathematical notation
- Emoji-heavy text
Real data from a multilingual app:
Feature: Translate product descriptions
English (baseline): 1,000 words = 1,333 tokens
Spanish: 1,000 words = 1,450 tokens (+9%)
French: 1,000 words = 1,520 tokens (+14%)
German: 1,000 words = 1,680 tokens (+26%)
Chinese: 1,000 words = 2,100 tokens (+58%)
Japanese: 1,000 words = 2,400 tokens (+80%)
Arabic: 1,000 words = 2,300 tokens (+73%)
Monthly cost impact (10K translations):
English baseline: $16.66 (GPT-5)
Japanese: $30.00 (+80% cost!)
GPT-3: ~50,000 tokens; GPT-4/GPT-5: ~100,000+ tokens (cl100k_base and successors)
Claude (all versions): ~100,000 tokens
LLaMA 2/3: ~32,000 tokens
LLaMA 4: ~50,000 tokens
DeepSeek: ~100,000 tokens
Qwen: ~150,000 tokens
Mistral: ~32,000 tokens
Gemini: ~256,000 tokens (largest!)
Larger vocabulary = more efficient encoding
Example text: "The anthropological archaeological investigation"
Small vocab (32K tokens):
["The", " anthrop", "ological", " arch", "ae", "ological", " invest", "igation"]
= 8 tokens
Medium vocab (50K tokens):
["The", " anthropological", " archaeological", " investigation"]
= 4 tokens (50% fewer!)
Large vocab (100K tokens):
["The", " anthropological", " archaeological", " investigation"]
= 4 tokens (similar to 50K, but better for technical terms)
Trade-offs: a larger vocabulary encodes text in fewer tokens, but it means a bigger embedding matrix (more parameters and memory) and more rare tokens that receive less training signal each.
New in 2025 with models like GPT-5, o3, DeepSeek-R1:
Simple explanation: When a model "thinks," it uses special hidden tokens that you don't see but still pay for!
Your question: "Prove the Pythagorean theorem"
Input tokens: 5
🧠 Model's internal thinking (HIDDEN from you):
- "Let's start with a right triangle..."
- "Consider squares on each side..."
- "Area of square on hypotenuse = ..."
- "By geometric rearrangement..."
- [continues for 2,500 tokens]
Reasoning tokens: 2,500 (hidden)
🤖 Model's visible answer: "Here's a proof using..."
Output tokens: 200 (visible)
💰 Total billable tokens: 5 + 2,500 + 200 = 2,705 tokens!
Standard mode vs Reasoning mode:
Simple question: "What's 2+2?"
Standard: 5 input + 10 output = 15 tokens
Reasoning: 5 input + 50 reasoning + 10 output = 65 tokens
Multiplier: 4.3x
Complex question: "Debug this React component"
Standard: 200 input + 300 output = 500 tokens
Reasoning: 200 input + 3,000 reasoning + 300 output = 3,500 tokens
Multiplier: 7x
Math proof: "Prove Fermat's Last Theorem for n=3"
Standard: Might fail or hallucinate
Reasoning: 50 input + 15,000 reasoning + 500 output = 15,550 tokens
Multiplier: Can't compare (only reasoning can do it)
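The multipliers above fall out of simple arithmetic. On most providers reasoning tokens are billed at the output rate, so they dominate cost quickly (check your provider's docs for the exact billing rules):

```python
def billable_tokens(input_t, output_t, reasoning_t=0):
    """Total billable tokens: input + hidden reasoning + visible output."""
    return input_t + reasoning_t + output_t

# "Debug this React component" example from above:
standard = billable_tokens(200, 300)                      # 500
reasoning = billable_tokens(200, 300, reasoning_t=3_000)  # 3500
print(f"{reasoning / standard:.1f}x")                     # 7.0x
```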
✅ Use reasoning when:
- Multi-step math, proofs, and logic puzzles
- Debugging subtle issues in complex code
- Tasks where standard mode fails or hallucinates
❌ Don't use reasoning for:
- Simple Q&A and lookups
- Routine summarization and formatting
- High-volume, low-complexity pipelines
# OpenAI API example
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": "Solve this proof..."}]
)

# Check token usage
usage = response.usage
print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Reasoning tokens: {usage.completion_tokens_details.reasoning_tokens}")
print(f"Total cost: ${calculate_cost(usage)}")  # calculate_cost: your own pricing helper

# Output:
# Input tokens: 45
# Output tokens: 387
# Reasoning tokens: 4522
# Total cost: $0.046
# ❌ Inefficient (187 tokens)
prompt = """
Please carefully analyze the following customer review
and provide a comprehensive breakdown including:
- The overall sentiment (positive, negative, or neutral)
- Key topics and themes mentioned in the review
- Specific action items or recommendations
- An overall customer satisfaction score from 1-10
Here is the customer review:
"Great product, fast shipping!"
Please structure your response clearly.
"""
# ✅ Efficient (52 tokens)
prompt = """Analyze review: "Great product, fast shipping!"
Return: sentiment, topics, actions, score (1-10)"""
# Savings: 72% fewer tokens!
# ❌ Verbose (generates ~200 tokens)
"Please write a summary of this article with key points."
# ✅ Structured (generates ~100 tokens)
"Summarize as JSON: {title, main_points: [3 items], conclusion}"
# Response is more predictable and shorter
Many providers cache repeated context:
# Expensive: sending full context every time
for question in questions:
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": long_context},  # 10K tokens
            {"role": "user", "content": question}
        ]
    )
# Cost: 10K tokens × 100 questions = 1M tokens
# ✅ Cached: reuses context
# First call: 10K tokens (full price)
# Subsequent calls: 10K tokens (50-90% discount)
# Savings: up to 90% on repeated context
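The savings can be simulated with a toy cost model (this is not a provider API; it just assumes a 90% discount on cached input tokens, which is in the range providers advertise):

```python
def batch_cost(context_tokens, question_tokens, n_calls,
               rate_per_m=1.25, cache_discount=0.9):
    """Input cost in dollars, with vs. without prompt caching (toy model)."""
    no_cache = (context_tokens + question_tokens) * n_calls * rate_per_m / 1e6
    cached = (context_tokens                                      # first call: full price
              + context_tokens * (1 - cache_discount) * (n_calls - 1)  # cache reads
              + question_tokens * n_calls) * rate_per_m / 1e6
    return no_cache, cached

# 10K-token context, 50-token questions, 100 calls:
no_cache, cached = batch_cost(10_000, 50, n_calls=100)
print(f"${no_cache:.2f} vs ${cached:.2f}")  # $1.26 vs $0.14
```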
Task matrix:
Simple Q&A:
❌ GPT-5 ($1.25/M input)
✅ GPT-5-mini ($0.25/M input) - 5x cheaper!
Code generation:
✅ GPT-5.2-Codex (specialized)
❌ General GPT-5 (overkill)
Reasoning:
✅ GPT-5 Thinking mode (when needed)
❌ Always-on reasoning (wasteful)
Bulk processing:
❌ Claude Opus ($15/M)
✅ DeepSeek-V3.2 ($0.55/M) - 27x cheaper!
# ❌ Inefficient: 100 API calls
for review in reviews:
    summarize(review)  # 200 tokens each
# Total: 20,000 tokens + 100 API calls
# ✅ Efficient: 1 API call
batch_prompt = f"""Summarize each review (1-2 sentences):
{chr(10).join(f"{i+1}. {r}" for i, r in enumerate(reviews))}
"""
# Total: ~15,000 tokens + 1 API call
# Savings: 25% tokens + 99 fewer API calls
# ❌ Raw: 50,000 tokens
full_document = load_entire_manual()
# ✅ Compressed: 5,000 tokens
key_sections = extract_relevant_sections(full_document, query)
# 10x reduction in tokens
# Track token usage
import logging

def track_tokens(request, response, model="gpt-5"):
    tokens_used = response.usage.total_tokens
    cost = calculate_cost(tokens_used, model)  # your own pricing helper
    logging.info(f"Request: {tokens_used} tokens, ${cost:.4f}")
    # Alert on anomalies
    if tokens_used > 10000:
        alert_team(f"High token usage: {tokens_used}")
    return response
Before sending to API:
OpenAI: tiktoken library (Python)
pip install tiktoken
Anthropic: anthropic tokenizer
https://console.anthropic.com/tokenizer
Universal: tiktokenizer.vercel.app
Paste text, get instant count
HuggingFace: transformers library
tokenizer = AutoTokenizer.from_pretrained(model_name)
# OpenAI token counting
import tiktoken
def count_tokens(text, model="gpt-5"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

text = "How many tokens is this?"
tokens = count_tokens(text)
print(f"Tokens: {tokens}")  # Output: Tokens: 6

# Cost estimation
def estimate_cost(text, model="gpt-5", is_input=True):
    tokens = count_tokens(text, model)
    rate = 1.25 if is_input else 10.0  # $ per million tokens
    return (tokens / 1_000_000) * rate

cost = estimate_cost("Your prompt here")
print(f"Estimated cost: ${cost:.6f}")
# Anthropic token counting
from anthropic import Anthropic
client = Anthropic()
# Count before sending
response = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(f"Input tokens: {response.input_tokens}")
class TokenBudget:
    def __init__(self, monthly_budget_usd, model="gpt-5"):
        self.budget = monthly_budget_usd
        self.rates = {  # $ per million tokens
            "gpt-5": {"input": 1.25, "output": 10.0},
            "claude-opus-4.5": {"input": 15.0, "output": 75.0},
            "deepseek-v3.2": {"input": 0.55, "output": 2.20}
        }
        self.model = model

    def tokens_available(self, ratio_input_output=0.5):
        """Calculate tokens available for the budget."""
        rate = self.rates[self.model]
        avg_rate = (rate["input"] * ratio_input_output +
                    rate["output"] * (1 - ratio_input_output))
        tokens_per_dollar = 1_000_000 / avg_rate
        return int(self.budget * tokens_per_dollar)

    def requests_available(self, tokens_per_request):
        """How many requests can we make?"""
        return int(self.tokens_available() / tokens_per_request)

# Example usage
budget = TokenBudget(monthly_budget_usd=100, model="gpt-5")
print(f"Monthly tokens: {budget.tokens_available():,}")
print(f"Requests (1K tokens each): {budget.requests_available(1000):,}")
# Output:
# Monthly tokens: 17,777,777
# Requests (1K tokens each): 17,777
❌ Wrong assumption:
words = 1000
tokens = 1000  # WRONG!
✅ Correct:
words = 1000
tokens = int(words * 1.33)  # ~1,330 tokens
❌ Dangerous:
# Enabling reasoning for everything
response = client.chat(
    model="gpt-5",
    reasoning_effort="high",  # ⚠️ 5-10x cost!
    messages=[{"role": "user", "content": "Hi"}]
)
✅ Safe:
# Use reasoning selectively
if is_complex_task(question):
    reasoning_effort = "high"
else:
    reasoning_effort = None
❌ Flying blind:
for i in range(10000):
    client.chat(messages=messages)
# Surprise $500 bill!
✅ Monitored:
total_tokens = 0
for i in range(10000):
    response = client.chat(messages=messages)
    total_tokens += response.usage.total_tokens
    if total_tokens > BUDGET_TOKENS:
        raise Exception("Budget exceeded!")
❌ Missing opportunity:
# Using GPT-4 for Chinese text
# Inefficient tokenization = 2x cost
✅ Better:
# Use model optimized for target language
# Qwen for Chinese (better tokenization)
# Or use Claude (larger vocab, more efficient)
TOKEN BASICS
============
1 token ≈ 0.75 English words
1 token ≈ 4 characters
1 page ≈ 667 tokens
CONTEXT LIMITS (Feb 2026)
==========================
GPT-5: 400K tokens
Claude Opus 4.5: 200K tokens
Gemini 3 Pro: 1M tokens
DeepSeek-V3.2: 128K tokens
LLaMA 4 Scout: 10M tokens
PRICING (per million tokens)
============================
           Input    Output
GPT-5      $1.25    $10.00
Claude O   $15.00   $75.00
DeepSeek   $0.55    $2.20
Gemini 3   $1.25    $5.00
OPTIMIZATION CHECKLIST
======================
✓ Count tokens before sending
✓ Use smallest viable model
✓ Enable caching for repeated context
✓ Batch similar requests
✓ Use reasoning mode sparingly
✓ Monitor usage with alerts
✓ Compress long contexts
✓ Use structured outputs
Understanding tokens is fundamental to working effectively with LLMs: they drive your costs, set your context limits, and shape your latency.
Next in series: Part 2: LLM Architectures Explained and Part 3: Fine-Tuning Fundamentals
About this guide: Updated February 2026 with current pricing, context limits, and reasoning token information.
Further reading: