Understanding Tokens in LLMs - The Complete Guide 🧩

By Soumia

Tags: ai, llm, machinelearning, tutorial

Part 1 of 3: LLM Fundamentals Series

Before you can master LLMs, you need to understand tokens - the fundamental units that make everything work. Whether you're building with GPT-5, Claude, or running LLaMA locally, tokens determine your costs, context limits, and performance.


What Exactly is a Token? 🤔

The Simple Explanation

For a 5-year-old: A token is like a puzzle piece of language! Just like you break LEGO builds into individual bricks, language models break text into small pieces called tokens.

When you say "I love ice cream!" - the model sees it as 5 puzzle pieces:

  • "I" (1 token)
  • " love" (1 token)
  • " ice" (1 token)
  • " cream" (1 token)
  • "!" (1 token)

The Technical Reality

Tokens are the fundamental units that LLMs process. They're NOT always whole words:

Examples:
"hello"          → 1 token
"understanding"  → ["under", "stand", "ing"] → 3 tokens
"ChatGPT"        → ["Chat", "GPT"] → 2 tokens
"🎉"             → 1 token
" "              → 1 token
"!"              → 1 token

Key principle: Tokens represent frequent patterns in the training data. Common words = 1 token. Rare words or combinations = multiple tokens.
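That key principle can be sketched in code. The following is a toy greedy longest-match tokenizer with a made-up vocabulary, not a real BPE implementation (real tokenizers learn byte-pair merges from data), but it shows how "frequent pattern = one token, rare word = several pieces" falls out of vocabulary lookup:

```python
# Toy greedy longest-match tokenizer. The vocabulary is invented for
# illustration; real models learn theirs from training data via BPE.
TOY_VOCAB = {"hello", "under", "stand", "ing", "Chat", "GPT", " ", "!"}

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i
        match = next(
            (text[i:i + n] for n in range(len(text) - i, 0, -1)
             if text[i:i + n] in TOY_VOCAB),
            text[i],  # unknown character falls back to a single-char token
        )
        tokens.append(match)
        i += len(match)
    return tokens

print(toy_tokenize("hello"))          # ['hello'] - common word, 1 token
print(toy_tokenize("understanding"))  # ['under', 'stand', 'ing'] - 3 tokens
print(toy_tokenize("ChatGPT!"))       # ['Chat', 'GPT', '!'] - 3 tokens
```

The same mechanism explains why "hello" is cheap and "understanding" costs three tokens: only patterns frequent enough to earn a vocabulary slot get encoded as a single unit.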


Why Tokens Matter: The Three Critical Reasons 💡

1. Tokens = Money 💰

APIs charge per token, not per word or character.

Real pricing (February 2026):

GPT-5:
  Input:  $1.25 per million tokens
  Output: $10.00 per million tokens

Claude Opus 4.5:
  Input:  $15.00 per million tokens  
  Output: $75.00 per million tokens

DeepSeek-V3.2:
  Input:  $0.55 per million tokens
  Output: $2.20 per million tokens

Gemini 3 Pro:
  Input:  $1.25 per million tokens
  Output: $5.00 per million tokens

Example calculation:

Task: Summarize 100 customer reviews
Average per review: 200 words
Total words: 20,000

Token conversion: 20,000 words ÷ 0.75 = ~26,667 tokens

Costs (assuming ~27,000 output tokens, i.e. a ~270-token summary per review):
- GPT-5:        $0.03 input + $0.27 output
- DeepSeek:     $0.01 input + $0.06 output
- Claude Opus:  $0.40 input + $2.00 output
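The cost math above generalizes to a small helper. The prices are the per-million-token rates quoted in this article; swap in current ones before relying on the numbers:

```python
# $ per million tokens: (input, output), as quoted in this article
PRICES = {
    "gpt-5": (1.25, 10.00),
    "claude-opus-4.5": (15.00, 75.00),
    "deepseek-v3.2": (0.55, 2.20),
    "gemini-3-pro": (1.25, 5.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 100 reviews (~26,667 tokens in) with ~270-token summaries (~27,000 tokens out)
print(f"GPT-5:    ${request_cost('gpt-5', 26_667, 27_000):.2f}")          # ~$0.30
print(f"DeepSeek: ${request_cost('deepseek-v3.2', 26_667, 27_000):.2f}")  # ~$0.07
```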

2. Tokens = Context Limits 📏

Every model has a maximum token "budget" for each conversation.

Current limits (February 2026):

GPT-5:              400,000 tokens  (~300,000 words, ~600 pages)
GPT-5.2-Codex:      400,000 tokens
Claude Opus 4.5:    200,000 tokens  (~150,000 words, ~300 pages)
Claude Sonnet 4.5:  200,000 tokens
Gemini 3 Pro:     1,000,000 tokens  (~750,000 words, ~1,500 pages)
DeepSeek-V3.2:      128,000 tokens  (~96,000 words, ~192 pages)
LLaMA 4 Scout:   10,000,000 tokens  (~7.5M words, ~15,000 pages!)
LLaMA 4 Maverick: 1,000,000 tokens
Kimi K2.5:          256,000 tokens

What this means in practice:

128K context = Can fit:
  - ~1 novel
  - ~50 research papers
  - ~500 emails
  - ~2,000 Slack messages

1M context = Can fit:
  - ~8 novels
  - ~400 research papers  
  - ~4,000 emails
  - Entire small codebases

10M context = Can fit:
  - Harry Potter series + LOTR + more
  - Entire large codebases
  - Years of company documents
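A practical use of these limits is a pre-flight check before sending a document. This sketch uses the context limits listed above and the ~1.33 tokens/word English-text rule of thumb from later in this guide; the numbers are the article's, not verified specs:

```python
# Context limits (tokens) as listed in this article, February 2026
CONTEXT_LIMITS = {
    "gpt-5": 400_000,
    "claude-opus-4.5": 200_000,
    "gemini-3-pro": 1_000_000,
    "deepseek-v3.2": 128_000,
    "llama-4-scout": 10_000_000,
}

def fits_in_context(word_count: int, model: str, reserve_for_output: int = 2_000) -> bool:
    """Rough check: does an English document of word_count words fit,
    leaving room for the model's reply?"""
    estimated_tokens = int(word_count * 1.33)  # English rule of thumb
    return estimated_tokens + reserve_for_output <= CONTEXT_LIMITS[model]

# A ~90K-word novel (~120K tokens):
print(fits_in_context(90_000, "deepseek-v3.2"))  # True - just fits in 128K
print(fits_in_context(90_000, "gemini-3-pro"))   # True, with lots of room
```

Always count tokens with the model's real tokenizer before trusting an estimate this coarse.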

3. Tokens = Speed ⚡

Processing speed is measured in tokens per second.

Typical speeds:

Fast models (inference-optimized):
  GPT-5 Instant:     100-150 tokens/sec
  Gemini 3 Flash:    120-180 tokens/sec
  Claude Haiku 4.5:  100-130 tokens/sec

Standard models:
  GPT-5:             60-80 tokens/sec
  Claude Sonnet 4.5: 50-70 tokens/sec

Reasoning models (showing work):
  GPT-5 Thinking:    20-40 tokens/sec
  o3:                15-30 tokens/sec
  DeepSeek-R1:       25-35 tokens/sec

Real-world impact:

Generate 2,000 token response:

Fast model (120 tok/sec):  ~17 seconds
Standard model (70 tok/sec): ~29 seconds  
Reasoning model (30 tok/sec): ~67 seconds
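The latency arithmetic above reduces to one division. Speeds here are the rough tokens/sec figures quoted in this section, not measurements:

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate wall-clock time to generate a response."""
    return output_tokens / tokens_per_sec

# 2,000-token response at this section's representative speeds
for label, speed in [("fast", 120), ("standard", 70), ("reasoning", 30)]:
    print(f"{label}: {generation_seconds(2_000, speed):.0f}s")
# fast: 17s, standard: 29s, reasoning: 67s
```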

How Tokenization Actually Works 🔧

The Process Step-by-Step

Most modern models use BPE (Byte Pair Encoding) or SentencePiece:

Input text: "Tokenization is cool!"

# Step 1: Break into candidate pieces
# (Based on trained vocabulary)
pieces = ["Token", "ization", " is", " cool", "!"]

# Step 2: Look up in vocabulary (example IDs)
vocabulary = {
    "Token": 5678,
    "ization": 1234,
    " is": 318,
    " cool": 4500,
    "!": 0
}

# Step 3: Convert to IDs
token_ids = [5678, 1234, 318, 4500, 0]

# Step 4: Model processes these IDs
# (This is what actually goes through the neural network)

# Step 5: Model predicts next token ID
next_token_id = 2589  # Let's say this means " Amazing"

# Step 6: Convert back to text
output = "Tokenization is cool! Amazing"

Real Example with Different Tokenizers

Same text, different token counts:

Text: "The DeepSeek-V3 model achieves state-of-the-art results!"

GPT-4 tokenizer (cl100k_base):
['The', ' Deep', 'Se', 'ek', '-', 'V', '3', ' model', ' achieves', 
 ' state', '-', 'of', '-', 'the', '-', 'art', ' results', '!']
= 18 tokens

Claude tokenizer (~100k vocab):
['The', ' DeepSeek', '-', 'V3', ' model', ' achieves', 
 ' state-of-the-art', ' results', '!']
= 9 tokens (more efficient!)

LLaMA tokenizer (~32k vocab):
['The', ' Deep', 'Se', 'ek', '-', 'V', '3', ' model', ' ach', 'ieves',
 ' state', '-', 'of', '-', 'the', '-', 'art', ' results', '!']
= 19 tokens

Why the difference?

  • Claude's tokenizer learned longer merges for these patterns (e.g. " state-of-the-art" as one token) → it encodes this text more efficiently
  • LLaMA optimized for different balance (smaller vocab = less memory)
  • Each tokenizer trained on different data

Token Math: Your Practical Guide 📊

Universal Conversion Formulas

English text (approximate):

1 token ≈ 0.75 words
1 token ≈ 4 characters

Reverse:
1 word ≈ 1.33 tokens
1 character ≈ 0.25 tokens

Pages:
1 page (500 words) ≈ 667 tokens
1,000 tokens ≈ 1.5 pages
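The conversion formulas above as small helpers (English-text approximations only; always verify with a real tokenizer before budgeting real money):

```python
def words_to_tokens(words: int) -> int:
    """English rule of thumb: 1 word ≈ 1.33 tokens."""
    return int(words * 1.33)

def tokens_to_words(tokens: int) -> int:
    """Inverse rule of thumb: 1 token ≈ 0.75 words."""
    return int(tokens * 0.75)

def pages_to_tokens(pages: int, words_per_page: int = 500) -> int:
    return words_to_tokens(pages * words_per_page)

print(words_to_tokens(2_000))  # ~2,660 tokens for a 2,000-word post
print(pages_to_tokens(1))      # ~665 tokens per page
print(tokens_to_words(1_000))  # ~750 words per 1K tokens
```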

Real-World Calculations

Example 1: Blog Post

Content: 2,000 word blog post
Tokens: 2,000 × 1.33 = ~2,667 tokens

Context needed:
- Input (post): 2,667 tokens
- Prompt: 100 tokens
- Output (summary): 200 tokens
Total: ~2,967 tokens

Cost (GPT-5):
- Input: 2,767 × $1.25/M = $0.0035
- Output: 200 × $10/M = $0.002
Total: $0.0055 per blog post

Example 2: Customer Support Bot

Average conversation:
- User messages: 50 words × 5 messages = 250 words
- Bot responses: 75 words × 5 messages = 375 words
- System prompt: 150 words
Total: 775 words = ~1,033 tokens

Daily volume: 1,000 conversations
Monthly tokens: 1,033 × 1,000 × 30 = ~31M tokens

Cost (DeepSeek-V3.2):
Input (system + user, ~533 tokens/conv): ~16M × $0.55/M = ~$8.80
Output (bot responses, ~500 tokens/conv): ~15M × $2.20/M = ~$33.00
Monthly total: ~$42/month

Example 3: Code Analysis

Codebase: 50,000 lines of Python
Average line: 40 characters
Total characters: 2,000,000

Tokens: 2,000,000 ÷ 4 = ~500,000 tokens

Fits in:
❌ GPT-5 (400K) - NO, needs chunking
✅ Gemini 3 Pro (1M) - YES
✅ LLaMA 4 Maverick (1M) - YES
✅ LLaMA 4 Scout (10M) - YES, with room to spare

Language & Content Differences 🌍

Why Different Languages Have Different Costs

The problem: Most tokenizers are optimized for English.

Real examples:

English: "Hello world"
Tokens: ["Hello", " world"] = 2 tokens

Chinese: "你好世界" (same meaning)
Tokens: ["你", "好", "世", "界"] = 4 tokens
Cost: 2x English!

Arabic: "مرحبا بالعالم"
Tokens: ~6-8 tokens (depends on model)
Cost: 3-4x English!

Code: def hello_world():
Tokens: ["def", " hello", "_", "world", "(", ")", ":"] = 7 tokens
Cost: 3.5x the word count!

Content Type Token Efficiency

📊 Most Efficient (1.0-1.2 tokens/word):
- Simple English prose
- Common words
- Standard punctuation

🔄 Medium Efficiency (1.3-1.8 tokens/word):
- Technical English
- Mixed language
- Marketing copy

❌ Least Efficient (2.0-4.0 tokens/word):
- Code (especially with special characters)
- Non-English languages
- URLs and identifiers
- Mathematical notation
- Emoji-heavy text

Real data from a multilingual app:

Feature: Translate product descriptions

English (baseline):     1,000 words = 1,333 tokens
Spanish:                1,000 words = 1,450 tokens (+9%)
French:                 1,000 words = 1,520 tokens (+14%)
German:                 1,000 words = 1,680 tokens (+26%)
Chinese:                1,000 words = 2,100 tokens (+58%)
Japanese:               1,000 words = 2,400 tokens (+80%)
Arabic:                 1,000 words = 2,300 tokens (+73%)

Monthly cost impact (10K translations):
English baseline: $16.66 (GPT-5)
Japanese: $30.00 (+80% cost!)
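One way to use these measurements is to bake the per-language overhead into your budget estimates. The multipliers below are the ones reported for this article's sample app, so treat them as illustrative, not universal:

```python
# Token overhead per language relative to English, from the sample data above
LANG_TOKEN_MULTIPLIER = {
    "en": 1.00, "es": 1.09, "fr": 1.14, "de": 1.26,
    "zh": 1.58, "ja": 1.80, "ar": 1.73,
}

def estimated_tokens(word_count: int, lang: str) -> int:
    """English baseline (~1.33 tokens/word) scaled by language overhead."""
    return round(word_count * 1.33 * LANG_TOKEN_MULTIPLIER[lang])

print(estimated_tokens(1_000, "en"))  # ~1,330 tokens
print(estimated_tokens(1_000, "ja"))  # ~2,394 tokens - budget ~80% more
```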

Model-Specific Tokenizer Differences 🔀

Vocabulary Sizes

GPT-3:                 ~50,000 tokens (r50k/p50k)
GPT-4:                ~100,000 tokens (cl100k_base)
GPT-4o/GPT-5:         ~200,000 tokens (o200k family)
Claude (all versions): ~100,000 tokens
LLaMA 2:               ~32,000 tokens
LLaMA 3/4:            ~128,000-200,000 tokens
DeepSeek:             ~100,000 tokens
Qwen:                 ~150,000 tokens
Mistral:               ~32,000 tokens
Gemini:               ~256,000 tokens (largest!)

Why Vocabulary Size Matters

Larger vocabulary = more efficient encoding

Example text: "The anthropological archaeological investigation"

Small vocab (32K tokens):
["The", " anthrop", "ological", " arch", "ae", "ological", " invest", "igation"]
= 8 tokens

Medium vocab (50K tokens):
["The", " anthropological", " archaeological", " investigation"]
= 4 tokens (50% fewer!)

Large vocab (100K tokens):
["The", " anthropological", " archaeological", " investigation"]
= 4 tokens (similar to 50K, but better for technical terms)

Trade-offs:

  • Larger vocab: More efficient, but larger embedding layer (more memory)
  • Smaller vocab: Less memory, but less efficient encoding
  • Sweet spot: 50K-100K for most modern models

The Hidden Cost: Reasoning Tokens 🧠

What Are Reasoning Tokens?

New in 2025 with models like GPT-5, o3, DeepSeek-R1:

Simple explanation: When a model "thinks," it uses special hidden tokens that you don't see but still pay for!

How It Works

Your question: "Prove the Pythagorean theorem"
Input tokens: 5

🧠 Model's internal thinking (HIDDEN from you):
- "Let's start with a right triangle..."
- "Consider squares on each side..."
- "Area of square on hypotenuse = ..."
- "By geometric rearrangement..."
- [continues for 2,500 tokens]

Reasoning tokens: 2,500 (hidden)

📤 Model's visible answer: "Here's a proof using..."
Output tokens: 200 (visible)

💰 Total billable tokens: 5 + 2,500 + 200 = 2,705 tokens!

Real Cost Impact

Standard mode vs Reasoning mode:

Simple question: "What's 2+2?"
Standard: 5 input + 10 output = 15 tokens
Reasoning: 5 input + 50 reasoning + 10 output = 65 tokens
Multiplier: 4.3x

Complex question: "Debug this React component"
Standard: 200 input + 300 output = 500 tokens
Reasoning: 200 input + 3,000 reasoning + 300 output = 3,500 tokens
Multiplier: 7x

Math proof: "Prove Fermat's Last Theorem for n=3"
Standard: Might fail or hallucinate
Reasoning: 50 input + 15,000 reasoning + 500 output = 15,550 tokens
Multiplier: Can't compare (only reasoning can do it)
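The multipliers above are just ratios of total billable tokens; computing them explicitly makes it easy to budget for reasoning mode (token counts are this section's examples):

```python
def total_tokens(input_t: int, reasoning_t: int, output_t: int) -> int:
    """All three components are billed, including hidden reasoning tokens."""
    return input_t + reasoning_t + output_t

# Simple query: "What's 2+2?"
simple_std = total_tokens(5, 0, 10)        # 15 tokens
simple_rsn = total_tokens(5, 50, 10)       # 65 tokens
print(f"simple query multiplier: {simple_rsn / simple_std:.1f}x")  # 4.3x

# Complex query: "Debug this React component"
debug_std = total_tokens(200, 0, 300)      # 500 tokens
debug_rsn = total_tokens(200, 3_000, 300)  # 3,500 tokens
print(f"debugging multiplier: {debug_rsn / debug_std:.1f}x")       # 7.0x
```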

When to Use Reasoning Mode

✅ Use reasoning when:

  • Complex math/logic problems
  • Multi-step debugging
  • Strategic planning
  • Code architecture decisions
  • Scientific analysis
  • Legal/medical document review

❌ Don't use reasoning for:

  • Simple queries ("What's the weather?")
  • Creative writing
  • Translation
  • Summarization
  • Basic Q&A

Monitoring Reasoning Tokens

# OpenAI API example
response = client.chat.completions.create(
    model="gpt-5-2025-08-07",
    messages=[{"role": "user", "content": "Solve this proof..."}]
)

# Check token usage
usage = response.usage
print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Reasoning tokens: {usage.completion_tokens_details.reasoning_tokens}")
print(f"Total cost: ${calculate_cost(usage)}")

# Example output:
# Input tokens: 45
# Output tokens: 387
# Reasoning tokens: 4522
# Total cost: $0.046

Token Optimization Strategies 🎓

1. Trim Unnecessary Tokens

# ❌ Inefficient (187 tokens)
prompt = """
Please carefully analyze the following customer review 
and provide a comprehensive breakdown including:
- The overall sentiment (positive, negative, or neutral)
- Key topics and themes mentioned in the review
- Specific action items or recommendations
- An overall customer satisfaction score from 1-10

Here is the customer review:
"Great product, fast shipping!"

Please structure your response clearly.
"""

# ✅ Efficient (52 tokens)
prompt = """Analyze review: "Great product, fast shipping!"
Return: sentiment, topics, actions, score (1-10)"""

# Savings: 72% fewer tokens!

2. Use Structured Outputs

# ❌ Verbose (generates ~200 tokens)
"Please write a summary of this article with key points."

# ✅ Structured (generates ~100 tokens)
"Summarize as JSON: {title, main_points: [3 items], conclusion}"

# Response is more predictable and shorter

3. Leverage Prompt Caching

Many providers cache repeated context:

# Expensive: sending full context every time
for question in questions:
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": long_context},  # 10K tokens
            {"role": "user", "content": question}
        ]
    )
# Cost: 10K tokens Γ— 100 questions = 1M tokens

# ✅ Cached: reuses context
# First call: 10K tokens (full price)
# Subsequent calls: 10K tokens (50-90% discount)
# Savings: up to 90% on repeated context

4. Choose the Right Model for the Task

Task matrix:

Simple Q&A:
  ❌ GPT-5 ($1.25/M input)
  ✅ GPT-5-mini ($0.25/M input) - 5x cheaper!

Code generation:
  ✅ GPT-5.2-Codex (specialized)
  ❌ General GPT-5 (overkill)

Reasoning:
  ✅ GPT-5 Thinking mode (when needed)
  ❌ Always-on reasoning (wasteful)

Bulk processing:
  ❌ Claude Opus ($15/M)
  ✅ DeepSeek-V3.2 ($0.55/M) - 27x cheaper!
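The task matrix above can be turned into a tiny model router. The tier names and task categories here are placeholders to adapt to your own workload, not an official API:

```python
# Hypothetical model tiers based on this article's task matrix
CHEAP, STANDARD, REASONING = "gpt-5-mini", "gpt-5", "gpt-5-thinking"

def pick_model(task_type: str, needs_reasoning: bool = False) -> str:
    """Route each request to the cheapest model that can handle it."""
    if needs_reasoning:
        return REASONING  # pay the reasoning premium only when required
    if task_type in ("qa", "summarize", "translate"):
        return CHEAP      # 5x cheaper input than the flagship
    return STANDARD

print(pick_model("qa"))                           # gpt-5-mini
print(pick_model("debug", needs_reasoning=True))  # gpt-5-thinking
print(pick_model("codegen"))                      # gpt-5
```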

5. Batch Similar Requests

# ❌ Inefficient: 100 API calls
for review in reviews:
    summarize(review)  # 200 tokens each
# Total: 20,000 tokens + 100 API calls

# ✅ Efficient: 1 API call
batch_prompt = f"""Summarize each review (1-2 sentences):
{chr(10).join(f"{i+1}. {r}" for i, r in enumerate(reviews))}
"""
# Total: ~15,000 tokens + 1 API call
# Savings: 25% tokens + 99 fewer API calls

6. Compress Long Contexts

# ❌ Raw: 50,000 tokens
full_document = load_entire_manual()

# ✅ Compressed: 5,000 tokens
key_sections = extract_relevant_sections(full_document, query)

# 10x reduction in tokens

7. Monitor and Alert

# Track token usage
import logging

def track_tokens(request, response):
    tokens_used = response.usage.total_tokens
    cost = calculate_cost(tokens_used, model)

    logging.info(f"Request: {tokens_used} tokens, ${cost:.4f}")

    # Alert on anomalies
    if tokens_used > 10000:
        alert_team(f"High token usage: {tokens_used}")

    return response
Enter fullscreen mode Exit fullscreen mode

Tools & Resources 🛠️

Token Counting Tools

Before sending to API:

OpenAI: tiktoken library (Python)
  pip install tiktoken

Anthropic: anthropic tokenizer
  https://console.anthropic.com/tokenizer

Universal: tiktokenizer.vercel.app
  Paste text, get instant count

HuggingFace: transformers library
  tokenizer = AutoTokenizer.from_pretrained(model_name)

Example Code

# OpenAI token counting
import tiktoken

def count_tokens(text, model="gpt-5"):
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a recent encoding if tiktoken doesn't know the model name
        encoding = tiktoken.get_encoding("o200k_base")
    return len(encoding.encode(text))

text = "How many tokens is this?"
tokens = count_tokens(text)
print(f"Tokens: {tokens}")  # Output: Tokens: 6

# Cost estimation
def estimate_cost(text, model="gpt-5", is_input=True):
    tokens = count_tokens(text, model)
    rate = 1.25 if is_input else 10.0  # GPT-5 $/million tokens
    cost = (tokens / 1_000_000) * rate
    return cost

cost = estimate_cost("Your prompt here")
print(f"Estimated cost: ${cost:.6f}")
# Anthropic token counting
from anthropic import Anthropic

client = Anthropic()

# Count before sending
response = client.messages.count_tokens(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(f"Input tokens: {response.input_tokens}")

Token Budgeting Calculator

class TokenBudget:
    def __init__(self, monthly_budget_usd, model="gpt-5"):
        self.budget = monthly_budget_usd
        self.rates = {
            "gpt-5": {"input": 1.25, "output": 10.0},
            "claude-opus-4.5": {"input": 15.0, "output": 75.0},
            "deepseek-v3.2": {"input": 0.55, "output": 2.20}
        }
        self.model = model

    def tokens_available(self, ratio_input_output=0.5):
        """Calculate tokens available for budget"""
        rate = self.rates[self.model]
        avg_rate = (rate["input"] * ratio_input_output + 
                   rate["output"] * (1 - ratio_input_output))

        tokens_per_dollar = 1_000_000 / avg_rate
        total_tokens = self.budget * tokens_per_dollar
        return int(total_tokens)

    def requests_available(self, tokens_per_request):
        """How many requests can we make?"""
        total_tokens = self.tokens_available()
        return int(total_tokens / tokens_per_request)

# Example usage
budget = TokenBudget(monthly_budget_usd=100, model="gpt-5")
print(f"Monthly tokens: {budget.tokens_available():,}")
print(f"Requests (1K tokens each): {budget.requests_available(1000):,}")

# Output:
# Monthly tokens: 17,777,777
# Requests (1K tokens each): 17,777

Common Pitfalls & How to Avoid Them ⚠️

Pitfall 1: Assuming Words = Tokens

❌ Wrong assumption:
words = 1000
tokens = 1000  # WRONG!

✅ Correct:
words = 1000
tokens = int(words * 1.33)  # ~1,330 tokens

Pitfall 2: Ignoring Reasoning Token Costs

❌ Dangerous:
# Enabling reasoning for everything
response = client.chat(
    model="gpt-5",
    reasoning_effort="high",  # ⚠️ 5-10x cost!
    messages=[{"role": "user", "content": "Hi"}]
)

✅ Safe:
# Use reasoning selectively
if is_complex_task(question):
    reasoning_effort = "high"
else:
    reasoning_effort = None

Pitfall 3: Not Monitoring Token Usage

❌ Flying blind:
for i in range(10000):
    client.chat(messages=messages)
# Surprise $500 bill!

✅ Monitored:
total_tokens = 0
for i in range(10000):
    response = client.chat(messages=messages)
    total_tokens += response.usage.total_tokens

    if total_tokens > BUDGET_TOKENS:
        raise Exception("Budget exceeded!")

Pitfall 4: Inefficient Encoding for Non-English

❌ Missing opportunity:
# Using GPT-4 for Chinese text
# Inefficient tokenization = 2x cost

✅ Better:
# Use model optimized for target language
# Qwen for Chinese (better tokenization)
# Or use Claude (larger vocab, more efficient)

Quick Reference Card 📋

TOKEN BASICS
============
1 token ≈ 0.75 English words
1 token ≈ 4 characters
1 page ≈ 667 tokens

CONTEXT LIMITS (Feb 2026)
==========================
GPT-5:           400K tokens
Claude Opus 4.5: 200K tokens
Gemini 3 Pro:    1M tokens
DeepSeek-V3.2:   128K tokens
LLaMA 4 Scout:   10M tokens

PRICING (per million tokens)
============================
           Input   Output
GPT-5      $1.25   $10.00
Claude O   $15.00  $75.00
DeepSeek   $0.55   $2.20
Gemini 3   $1.25   $5.00

OPTIMIZATION CHECKLIST
======================
☐ Count tokens before sending
☐ Use smallest viable model
☐ Enable caching for repeated context
☐ Batch similar requests
☐ Use reasoning mode sparingly
☐ Monitor usage with alerts
☐ Compress long contexts
☐ Use structured outputs

Conclusion 🎯

Understanding tokens is fundamental to working effectively with LLMs:

  1. Tokens ≠ Words - Always measure actual token count
  2. Tokens = Cost - Optimize to save real money
  3. Different models, different efficiency - Choose wisely
  4. Reasoning tokens are expensive - Use strategically
  5. Non-English costs more - Factor into budgets
  6. Monitor everything - Prevent surprise bills

Next in series: Part 2: LLM Architectures Explained and Part 3: Fine-Tuning Fundamentals


About this guide: Updated February 2026 with current pricing, context limits, and reasoning token information.
