
Part 3 of 3: LLM Fundamentals Series
Updated February 2026 - Fine-tuning transforms generic models into specialized experts. This guide covers everything from deciding IF you should fine-tune, to HOW to do it effectively, with real-world examples and costs.
Hot take: Most people fine-tune when they shouldn't. But when you SHOULD fine-tune, the results can be game-changing.
For a 5-year-old: Imagine you have a really smart friend who knows about EVERYTHING - space, dinosaurs, cooking, math. But you need them to become a SUPER EXPERT at just dinosaurs. Fine-tuning is like giving them a special dinosaur course so they become the BEST at dinosaurs!
For developers: Fine-tuning is the process of taking a pre-trained model and continuing its training on a specialized dataset to adapt it for specific tasks or domains.
Training from scratch:
- Cost: $5M-$100M+
- Time: Months
- Data needed: Billions of tokens
- Expertise: Research-level ML team
- Result: General-purpose model
Fine-tuning existing model:
- Cost: $10-$10,000
- Time: Hours to days
- Data needed: 100s-10,000s examples
- Expertise: ML engineer
- Result: Specialized model
Pre-trained model = Experienced chef
  ↓
Knows all cooking techniques
Can make any dish reasonably well

Fine-tuning = Teaching your recipes
  ↓
Same chef, now expert at YOUR restaurant
Knows YOUR style, YOUR ingredients
Consistent with YOUR brand

From scratch = Culinary school from age 5
  ↓
Start with nothing
Learn everything
Takes years
1. Specialized Domain Knowledge
Example: Medical diagnosis
❌ GPT-5: "That rash might be eczema or psoriasis"
✅ Fine-tuned: "Based on morphology and distribution pattern,
   differential diagnosis includes:
   1. Psoriasis vulgaris (most likely - 85%)
   2. Seborrheic dermatitis (12%)
   3. Drug eruption (3%)
   Recommend: biopsy for confirmation"
Why: Medical terminology, diagnostic reasoning, treatment protocols
2. Consistent Format/Style
Example: Legal document generation
❌ GPT-5: Creates documents with varying structure
✅ Fine-tuned: Always follows exact company template
   Uses specific clause language
   Maintains consistent formatting
Why: You need EXACT formatting every time
3. Private/Proprietary Knowledge
Example: Company-specific customer support
❌ GPT-5: Doesn't know your products/policies
✅ Fine-tuned: Expert on YOUR product line
   Knows YOUR return policies
   Uses YOUR brand voice
Why: Information not in public training data
4. Performance on Specific Task
Example: SQL query generation from natural language
❌ GPT-5: 70% correct queries
✅ Fine-tuned: 95% correct queries
Why: Learns your database schema, naming conventions
5. Cost at Scale
Example: 1M requests/month for sentiment analysis
❌ GPT-5 API: $1,250/month
✅ Fine-tuned GPT-3.5: $200/month (6x cheaper!)
Why: Smaller models can match performance on narrow tasks
1. Prompt Engineering Can Solve It
❌ Don't fine-tune: "Make responses more concise"
✅ Instead: Add to system prompt:
   "Keep responses under 3 sentences. Be direct."
Cost: Free vs $500+ for fine-tuning
2. RAG (Retrieval-Augmented Generation) is Better
❌ Don't fine-tune: Adding company knowledge base
✅ Instead: Use RAG
   - Embed documents
   - Retrieve relevant context
   - Pass to model in prompt
Why: More flexible, easier to update, no retraining
3. You Have <100 Quality Examples
❌ Don't fine-tune: 50 examples
✅ Instead: Use few-shot learning in prompt
   Show 3-5 examples in each request
Why: Insufficient data = overfitting (see the few-shot sketch after this list)
4. Task Changes Frequently
❌ Don't fine-tune: News categorization (topics change)
✅ Instead: Use general model + updated prompt
Why: Fine-tuning is expensive to repeat
5. Model Already Excels
❌ Don't fine-tune: GPT-5 already 95% accurate
✅ Instead: Use it as-is
Why: Marginal gains not worth effort
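To make option 3 concrete, a few-shot request with the OpenAI chat API looks like this (a minimal sketch; the sentiment task and examples are hypothetical):

from openai import OpenAI

client = OpenAI()

# Show 3 worked examples in the prompt instead of fine-tuning on 50
few_shot = [
    {"role": "user", "content": "Review: 'Great product!' Sentiment?"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: 'Broke after a day.' Sentiment?"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: 'It arrived on time.' Sentiment?"},
    {"role": "assistant", "content": "neutral"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=few_shot + [
        {"role": "user", "content": "Review: 'Love it, use it daily.' Sentiment?"}
    ],
)
print(response.choices[0].message.content)  # -> positive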
Start: Do I need custom behavior?
  ↓
No → Use base model
  ↓
Yes → Can prompt engineering solve it?
  ↓
No → Is it knowledge-based?
  ↓
Yes → Use RAG
  ↓
No → Do I have 500+ quality examples?
  ↓
No → More data collection OR few-shot learning
  ↓
Yes → Do I need <100ms latency OR >1M requests/month?
  ↓
No → Maybe stick with API
  ↓
Yes → FINE-TUNE! 🎯
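If you prefer the same flow as code, here's a hypothetical helper encoding the tree above:

def should_fine_tune(needs_custom_behavior, prompting_solves_it,
                     knowledge_based, num_examples,
                     needs_low_latency_or_scale):
    """Encodes the decision tree above; returns a recommendation string."""
    if not needs_custom_behavior:
        return "Use base model"
    if prompting_solves_it:
        return "Use prompt engineering"
    if knowledge_based:
        return "Use RAG"
    if num_examples < 500:
        return "Collect more data OR use few-shot learning"
    if not needs_low_latency_or_scale:
        return "Maybe stick with API"
    return "Fine-tune!"

print(should_fine_tune(True, False, False, 2000, True))  # -> Fine-tune!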
1. Full Fine-Tuning
What it is:
Update ALL model parameters during training.
How it works:
# Conceptual example
model = load_pretrained_model("llama-4-70b")

# Update EVERY weight in all layers
for batch in training_data:
    loss = model.forward(batch)
    loss.backward()   # Gradients for ALL 70B parameters
    optimizer.step()  # Update ALL parameters
Pros:
- Maximum quality - every parameter adapts to your task
Cons:
- Most expensive and slowest option
- Needs multi-GPU hardware for large models
- Higher risk of catastrophic forgetting (see the pitfalls section below)
When to use:
- Maximum quality is non-negotiable and the budget allows it
Real cost:
LLaMA 4 Maverick (400B):
GPU: 8x H100 (80GB each)
Time: 24-72 hours
Cost: $5,000-$15,000
LLaMA 3.1 70B:
GPU: 4x A100 (80GB each)
Time: 8-24 hours
Cost: $800-$2,400
2. LoRA (Low-Rank Adaptation)
What it is:
Freeze original model weights, train small "adapter" matrices.
Simple explanation:
Imagine the model is a huge library. Instead of rewriting books, you add sticky notes with updates!
How it works:
# Original model: 70B parameters
base_model = load_model("llama-4-70b")  # FROZEN

# Add small trainable matrices
lora_A = nn.Linear(4096, 8)  # Very small!
lora_B = nn.Linear(8, 4096)  # Very small!

# During inference:
output = base_model(x) + lora_B(lora_A(x))
#        ^-- frozen      ^-- trainable (~0.1% of params)
The math (simplified):
Original layer: W (4096 × 4096) = 16M parameters

LoRA decomposition:
W_new = W + B @ A
Where:
  W: 4096 × 4096 (frozen)
  B: 4096 × 8 (trainable)
  A: 8 × 4096 (trainable)

Total trainable: 4096×8 + 8×4096 = 65K parameters
Reduction: 16M → 65K = 99.6% fewer parameters!
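A minimal PyTorch sketch of this decomposition (illustrative only, not the PEFT library's internals; rank and scaling follow the r=8 example above):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 -> the ~65K figure above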
Pros:
- ~99% fewer trainable parameters - trains on a single GPU
- Base model stays frozen, so little catastrophic forgetting
- Adapters are small files you can swap per task
Cons:
- Slightly lower quality ceiling than full fine-tuning
- Full base model still needed in memory at inference
When to use:
- The default choice for most fine-tuning projects
Real cost:
LLaMA 4 Maverick (400B) with LoRA:
GPU: 1x H100 (80GB)
Time: 2-6 hours
Cost: $50-$200
LLaMA 3.1 70B with LoRA:
GPU: 1x A100 (80GB)
Time: 1-3 hours
Cost: $20-$80
GPT-3.5 via OpenAI API:
Cost: ~$8 per 1M training tokens
3. QLoRA (Quantized LoRA)
What it is:
LoRA + quantization = Fine-tune on consumer hardware!
How it works:
# Load model in 4-bit precision
model = load_in_4bit("llama-3.1-70b")
# Memory: 70B × 4 bits = 35 GB (vs 140 GB for 16-bit!)

# Add LoRA adapters (still 16-bit for training)
lora_adapters = create_lora_adapters(rank=8)

# Train only adapters
for batch in data:
    # Forward pass through 4-bit model
    loss = model(batch)
    # Backprop only through adapters
    loss.backward()
Quantization explained:
16-bit (standard):
Number: 3.14159
Precision: Very high
Memory: 2 bytes
4-bit (quantized):
Number: ~3.125 (rounded)
Precision: Lower, but often good enough
Memory: 0.5 bytes (4x reduction!)
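A toy sketch of uniform 4-bit rounding to make this concrete (real schemes such as NF4 use block-wise, non-uniform levels, so treat this purely as illustration):

import numpy as np

def quantize_4bit(x, lo=-4.0, hi=4.0):
    """Round x to the nearest of 16 evenly spaced levels (2^4 values)."""
    levels = np.linspace(lo, hi, 16)
    idx = int(np.abs(levels - x).argmin())  # these 4 bits are what gets stored
    return idx, float(levels[idx])

idx, approx = quantize_4bit(3.14159)
print(idx, round(approx, 3))  # -> 13 2.933: close, but some precision is lost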
Pros:
- Fine-tunes 70B models on a single consumer GPU
- Cheapest training option by far
Cons:
- Slower than 16-bit LoRA training
- Small quality loss from 4-bit quantization
When to use:
- Limited hardware or budget
Real cost:
LLaMA 3.1 70B with QLoRA:
GPU: 1x RTX 4090 (24GB)
Time: 4-12 hours
Cost: $0-40 (if you own GPU)
4. Prompt Tuning
What it is:
Don't tune the model at all - just tune soft prompts!
How it works:
# Instead of modifying weights...
# Add learnable "virtual tokens" to the input
learnable_prefix = nn.Parameter(torch.randn(10, 4096))
# 10 tokens × 4096 dimensions = ~40K parameters only!

# Prepend to all inputs
input_with_prefix = torch.cat([learnable_prefix, user_input])
output = frozen_model(input_with_prefix)
Simple explanation:
Like having a magic phrase that makes the model behave exactly right, but the phrase is learned numbers instead of words!
Pros:
- Tiniest footprint (~0.001% of model parameters)
- Fastest and cheapest method to train
Cons:
- Lowest quality ceiling of the four methods
- Suits behavior/style shifts, not new knowledge
When to use:
- Simple formatting or style changes on a tight budget
Method comparison:

Method        | Params | Memory | Time | Cost | Quality | Use When
--------------|--------|--------|------|------|---------|-------------------
Full FT       | 100%   | 100GB+ | 24h  | $5K+ | ★★★★★   | Max quality needed
LoRA          | 0.1%   | 40GB   | 3h   | $100 | ★★★★☆   | Default choice
QLoRA         | 0.1%   | 20GB   | 8h   | $30  | ★★★☆☆   | Limited hardware
Prompt Tuning | 0.001% | 40GB   | 1h   | $20  | ★★☆☆☆   | Simple changes
OpenAI:
Fine-tunable models (Feb 2026):
- GPT-4o-mini ✅
- GPT-3.5-turbo ✅
- GPT-4o (limited access) ✅
Method: API-based (upload data, they train)
Cost: $8-25 per 1M training tokens
Time: 10 minutes - 2 hours
Data format: JSONL with prompt-completion pairs
Anthropic (via AWS Bedrock):
Fine-tunable models:
- Claude 3 Haiku ✅
- Claude 3.5 Haiku ✅
Method: AWS Bedrock integration
Cost: ~$10-30 per 1M tokens
Time: 1-4 hours
Google:
Fine-tunable models:
- Gemini 1.5 Flash ✅
- Gemini 1.5 Pro (limited) ✅
Method: Vertex AI
Cost: ~$7-20 per 1M tokens
Cohere:
All models fine-tunable ✅
Method: API
Cost: ~$10 per 1M tokens
Meta LLaMA Family:
LLaMA 4 Scout (70B): ✅ Full access
LLaMA 4 Maverick (400B): ✅ Full access
LLaMA 3.3 (70B): ✅ Full access
LLaMA 3.1 (8B, 70B, 405B): ✅ Full access
License: Llama 4 Community License
- Free for research
- Free for commercial use
- Can distribute fine-tuned models
Methods supported:
- Full fine-tuning ✅
- LoRA ✅
- QLoRA ✅
- DPO (preference tuning) ✅
DeepSeek:
DeepSeek-V3 (671B): ✅ Full access
DeepSeek-R1 (671B): ✅ Full access
DeepSeek-Coder (1-236B): ✅ Full access
License: DeepSeek License (MIT-like)
- Free for all use
- Can modify and distribute
Best for: Coding, reasoning tasks
Mistral:
Mistral Medium 3.1: ✅
Mixtral 8x7B: ✅
Mixtral 8x22B: ✅
License: Apache 2.0
- Truly open source
- No restrictions
Best for: European users, cost efficiency
Qwen (Alibaba):
Qwen 3 (235B-A22B): ✅
Qwen 2.5 (all sizes): ✅
License: Qwen License
- Free for research and commercial
Best for: Multilingual, especially Chinese
Comparison Matrix:
Model Family | Open? | License | Best For | FT Difficulty
---------------|-------|------------|------------------------|---------------
GPT-3.5/4o | No | Proprietary| Easy API integration | Easy
Claude | No | Proprietary| Safety, helpfulness | Medium
Gemini | No | Proprietary| Multimodal | Medium
LLaMA 4 | Yes | Permissive | General purpose | Easy-Med
DeepSeek | Yes | MIT-like | Code, reasoning | Medium
Mistral | Yes | Apache 2.0 | EU compliance | Easy
Qwen | Yes | Permissive | Multilingual | Easy
Case Study 1: Customer Support Bot
Company: SaaS startup (project management tool)
Problem:
Generic GPT-4:
User: "How do I share a board?"
GPT-4: "To share a board, typically you'd look for a 'Share'
button in your application..."
❌ Generic, doesn't mention their specific UI
❌ Doesn't use company terminology
Solution: Fine-tuned GPT-3.5
Data collected:
500 support ticket examples:
{
"prompt": "How do I share a board?",
"completion": "Click the 'š„ Collaborators' button in the top-right
of any board. Then click 'Invite Members' and enter
their email. They'll receive an invite link immediately!"
}
{
"prompt": "Can I export to Excel?",
"completion": "Yes! Go to Board Settings > Export > 'Download as Excel'.
Premium plans can also schedule automated exports."
}
... 498 more examples
Training:
# Using the OpenAI Python SDK
from openai import OpenAI

client = OpenAI()
client.fine_tuning.jobs.create(
    training_file="file-abc123",  # ID returned by client.files.create(...)
    model="gpt-3.5-turbo",
    suffix="support-v1"
)
Cost: $12 (~1.5M training tokens × $8/1M)
Time: 18 minutes
Results:
Metric | Before | After | Improvement
------------------------|--------|--------|-------------
Answer accuracy | 65% | 94% | +45%
Uses company terms | 10% | 98% | +880%
Mentions correct UI | 20% | 96% | +380%
Customer satisfaction | 3.2/5 | 4.7/5 | +47%
Cost savings:
- Fewer escalations: -35% ā $8K/month saved
- Faster resolution: -40% time ā 2 FTE saved
Total ROI: ~$25K/month from $12 fine-tune!
Case Study 2: Medical Triage Assistant
Organization: Telemedicine platform
Problem:
Generic model hallucinates medical advice
Can't differentiate severity
Doesn't follow medical reasoning protocols
Solution: Fine-tuned LLaMA 4 Scout (70B) with LoRA
Data collected:
10,000 anonymized case notes from dermatologists:
Format:
{
"input": "Patient: 35F, presents with raised, scaly patches on elbows
and knees, silvery appearance, no pain, worsens in winter",
"output": "<reasoning>
Presentation suggests psoriasis vulgaris:
- Symmetrical distribution (elbows/knees)
- Silvery scales (characteristic)
- Koebner phenomenon possible
- Seasonal variation (common in psoriasis)
Differential diagnosis:
1. Psoriasis vulgaris (90% confidence)
2. Eczema (5%)
3. Fungal infection (5%)
</reasoning>
<recommendation>
- Confirm with skin biopsy
- Start topical corticosteroid
- Refer to dermatology if no improvement in 2 weeks
- Avoid triggers (stress, dry air)
</recommendation>"
}
Training setup:
# Using HuggingFace + PEFT (LoRA)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-70B",
    load_in_8bit=True,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train (Trainer setup omitted here; see the full walkthrough later in this guide)
trainer.train()
Hardware: 2x A100 (80GB)
Time: 12 hours
Cost: ~$300 (cloud GPU rental)
Results:
Metric | Base LLaMA | Fine-tuned | Specialist
----------------------------|------------|------------|-------------
Correct diagnosis (top-1) | 62% | 89% | 92%
Correct diagnosis (top-3) | 78% | 96% | 97%
Follows medical reasoning | 45% | 94% | 98%
Hallucinates treatments | 18% | 2% | 0.5%
Appropriate urgency level | 70% | 95% | 97%
Impact:
- Reduced misdiagnosis: 27% → 4%
- Faster triage: -60% time
- Doctor confidence in AI assist: 85%
Case Study 3: Natural Language to SQL
Company: Business intelligence platform
Problem:
User: "Show me top 10 customers by revenue last quarter"
Generic GPT-4:
SELECT customer_name, revenue
FROM customers
WHERE date >= DATE_SUB(NOW(), INTERVAL 3 MONTH)
ORDER BY revenue DESC
LIMIT 10;
❌ Wrong table names (they use 'clients', not 'customers')
❌ Wrong column names
❌ Doesn't use their data warehouse conventions
Solution: Fine-tuned CodeLlama 34B
Data collected:
2,500 natural language → SQL pairs from their schema:
{
"instruction": "Show me top 10 customers by revenue last quarter",
"input": "Schema: clients (client_id, name, industry),
orders (order_id, client_id, amount, order_date)",
"output": "SELECT
c.name,
c.industry,
SUM(o.amount) as total_revenue
FROM clients c
JOIN orders o ON c.client_id = o.client_id
WHERE o.order_date >= DATE_TRUNC('quarter',
CURRENT_DATE - INTERVAL '1 quarter')
AND o.order_date < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY c.client_id, c.name, c.industry
ORDER BY total_revenue DESC
LIMIT 10;"
}
Training:
# Using Axolotl framework
accelerate launch -m axolotl.cli.train config.yml
Config highlights:
base_model: codellama/CodeLlama-34b-hf
adapter: qlora
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
Hardware: 1x A100 (80GB)
Time: 8 hours
Cost: $200
Results:
Metric | Base Model | Fine-tuned
--------------------------|------------|------------
Syntactically correct SQL | 82% | 98%
Uses correct table names | 45% | 99%
Uses correct columns | 51% | 97%
Follows conventions | 30% | 95%
Executable queries | 68% | 94%
Business impact:
- Queries generated correctly: 68% → 94%
- Time to insights: -70% (analysts)
- Self-service adoption: +300%
Case Study 4: Legal Contract Analysis
Firm: Corporate law firm
Problem:
Need to extract specific clauses from 1000s of contracts
Generic models miss nuanced legal language
Solution: Fine-tuned GPT-4o-mini
Data collected:
1,000 annotated contracts with clause extractions:
{
"prompt": "Extract termination clauses from this contract:\n\n[contract text]",
"completion": "Termination Clauses Found:\n\n
1. Section 12.3 - Either party may terminate with 90 days
written notice\n
2. Section 12.4 - Immediate termination for material breach\n
3. Section 12.5 - Automatic termination if Company acquired\n\n
Key Terms:\n
- Notice period: 90 days
- Cure period: 30 days (breach)
- Survival clauses: Sections 8, 11, 13"
}
Training:
# OpenAI Python SDK
from openai import OpenAI

client = OpenAI()
client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 4,
        "batch_size": 1,
        "learning_rate_multiplier": 0.3
    }
)
Cost: $150 (1,000 examples, ~2K tokens each, over multiple epochs)
Time: 2.5 hours
Results:
Metric | GPT-4o-mini | Fine-tuned | Lawyer
------------------------------|-------------|------------|--------
Finds all relevant clauses | 73% | 96% | 98%
Correctly interprets terms | 68% | 92% | 95%
Misses critical clauses | 15% | 2% | 0.5%
Extraction time (per contract)| 3 min | 1 min | 25 min
ROI:
- Process 50 contracts/day (was 8/day)
- 6.25x productivity increase
- Cost: $150 one-time + $0.15/contract inference
- Savings: $180K/year in paralegal time
Step 1: Define Success Metrics
BEFORE starting:
✓ What does success look like?
✓ How will you measure it?
✓ What's the baseline performance?
Example:
Metric: SQL query correctness
Current: 68% executable queries
Target: >90% executable queries
Measurement: Test set of 200 queries
Step 2: Collect/Create Dataset
Data requirements:
Minimum viable:
- Classification: 100-500 examples
- Generation: 500-2,000 examples
- Complex reasoning: 2,000-10,000 examples
Quality > Quantity:
1 great example > 10 mediocre examples
Data format (most common):
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is photosynthesis?"},
      {"role": "assistant", "content": "Photosynthesis is..."}
    ]
  },
  {
    "messages": [...]
  }
]
Step 3: Clean and Validate Data
def validate_dataset(data):
    """Check for common issues"""
    issues = []
    for i, example in enumerate(data):
        # Check format
        if "messages" not in example:
            issues.append(f"Example {i}: Missing 'messages' key")
            continue
        # Check length (estimate_tokens: see the sketch below)
        messages = example.get("messages", [])
        total_tokens = estimate_tokens(messages)
        if total_tokens > 4000:
            issues.append(f"Example {i}: Too long ({total_tokens} tokens)")
        # Check quality
        if messages and len(messages[-1]["content"]) < 10:
            issues.append(f"Example {i}: Response too short")
    return issues

# Fix common issues
def clean_dataset(data):
    cleaned = []
    for example in data:
        # Remove empty responses
        if len(example["messages"][-1]["content"]) < 10:
            continue
        # Truncate if too long
        truncated = truncate_to_length(example, max_tokens=4000)
        cleaned.append(truncated)
    return cleaned
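The estimate_tokens and truncate_to_length helpers aren't defined above; one plausible implementation uses tiktoken (the encoding name is an assumption and should match your target model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the GPT-3.5/4o families

def estimate_tokens(messages):
    # Rough count: content tokens only, ignoring per-message overhead
    return sum(len(enc.encode(m["content"])) for m in messages)

def truncate_to_length(example, max_tokens=4000):
    # Trim the final message until the whole example fits
    messages = example["messages"]
    while estimate_tokens(messages) > max_tokens and messages[-1]["content"]:
        tokens = enc.encode(messages[-1]["content"])
        messages[-1]["content"] = enc.decode(tokens[:-100])  # drop 100 tokens per pass
    return example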
Step 4: Split Data
Training set: 80% (800 examples)
Validation set: 10% (100 examples)
Test set: 10% (100 examples)
Important: Keep test set COMPLETELY separate
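A minimal, reproducible way to do that split (a sketch; assumes examples are independent of one another):

import random

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(data)

n = len(data)
train_set = data[: int(0.8 * n)]
val_set   = data[int(0.8 * n): int(0.9 * n)]
test_set  = data[int(0.9 * n):]  # don't touch until final evaluation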
Option A: Using OpenAI API (Easiest)
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 0.5
    }
)

# 3. Monitor progress
while True:
    job_status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job_status.status}")
    if job_status.status == "succeeded":
        model_name = job_status.fine_tuned_model
        print(f"Model ready: {model_name}")
        break
    time.sleep(60)

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "test query"}]
)
Option B: Using HuggingFace + LoRA (Most Control)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# 1. Load model and tokenizer
model_name = "meta-llama/Llama-3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # rank - higher = more capacity
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 8.4M || all params: 70.6B || trainable%: 0.01%

# 3. Load and preprocess data
dataset = load_dataset("json", data_files="training_data.jsonl")
# load_dataset yields a single "train" split; carve out a validation set
dataset = dataset["train"].train_test_split(test_size=0.1)

def preprocess(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048
    )

tokenized_dataset = dataset.map(preprocess, batched=True)

# 4. Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
)

# 5. Create trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)
trainer.train()

# 6. Save adapter
model.save_pretrained("./my-fine-tuned-model")
Option C: Using Axolotl (Best for QLoRA)
# config.yml
base_model: NousResearch/Llama-3.1-70B

# LoRA config
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

# Dataset
datasets:
  - path: training_data.jsonl
    type: alpaca

# Training
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.0002
warmup_steps: 100

# Quantization
load_in_4bit: true

# Run training
accelerate launch -m axolotl.cli.train config.yml

# Merge adapter (optional)
python -m axolotl.cli.merge_lora config.yml --lora_model_dir="./output"
Quantitative Evaluation:
def evaluate_model(model, test_set):
    results = {
        "correct": 0,
        "total": 0,
        "errors": []
    }
    for example in test_set:
        prediction = model.generate(example["input"])
        ground_truth = example["output"]

        # Task-specific evaluation
        is_correct = evaluate_answer(prediction, ground_truth)
        if is_correct:
            results["correct"] += 1
        else:
            results["errors"].append({
                "input": example["input"],
                "predicted": prediction,
                "expected": ground_truth
            })
        results["total"] += 1

    results["accuracy"] = results["correct"] / results["total"]
    return results

# Run evaluation
eval_results = evaluate_model(fine_tuned_model, test_data)
print(f"Accuracy: {eval_results['accuracy']:.2%}")

# Analyze errors
for error in eval_results["errors"][:5]:  # First 5 errors
    print(f"\nInput: {error['input']}")
    print(f"Predicted: {error['predicted']}")
    print(f"Expected: {error['expected']}")
Qualitative Evaluation:
# Compare base vs fine-tuned side-by-side
test_prompts = [
    "How do I reset my password?",
    "What's included in Premium plan?",
    "Can I export data to CSV?"
]

for prompt in test_prompts:
    base_response = base_model.generate(prompt)
    ft_response = finetuned_model.generate(prompt)

    print(f"\n{'='*50}")
    print(f"Prompt: {prompt}")
    print(f"\nBase model: {base_response}")
    print(f"\nFine-tuned: {ft_response}")
    print(f"{'='*50}")
A/B Testing (Production):
# Route 50% to each model
import random

def get_response(user_query):
    if random.random() < 0.5:
        model = "base"
        response = base_model.generate(user_query)
    else:
        model = "fine-tuned"
        response = finetuned_model.generate(user_query)

    # Log for analysis
    log_interaction(user_query, response, model)
    return response

# After 1 week, analyze
analyze_ab_test_results()
Deployment options:
1. API endpoint:
from fastapi import FastAPI
from transformers import AutoModelForCausalLM
from peft import PeftModel

app = FastAPI()

# Load base model + adapter once at startup
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")
model = PeftModel.from_pretrained(base_model, "./my-adapter")

@app.post("/generate")
async def generate(prompt: str):
    response = model.generate(prompt)
    return {"response": response}
2. Replace OpenAI calls:
# Before
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}]
)

# After: just swap in the fine-tuned model name
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[{"role": "user", "content": prompt}]
)
3. Gradual rollout:
def get_model_for_user(user_id):
    # Gradual rollout: 0% → 10% → 50% → 100%
    rollout_percentage = get_rollout_percentage()
    if hash(user_id) % 100 < rollout_percentage:
        return fine_tuned_model
    else:
        return base_model
OpenAI API:
GPT-3.5-turbo:
Cost: $8 per 1M training tokens
Example: 1,000 examples × 500 tokens = 500K tokens
Total: $4
GPT-4o-mini:
Cost: $25 per 1M training tokens
Same example: $12.50
Hosting: $0 (served by OpenAI)
Self-hosted (LoRA):
LLaMA 3.1 70B:
GPU: 1x A100 80GB
Time: 3 hours
Cost: $10/hour × 3 = $30
Storage: $0.10/GB/month
Model: 140GB × $0.10 = $14/month
Inference: $0.50-2/hour GPU time
Self-hosted (QLoRA):
LLaMA 3.1 70B:
GPU: 1x RTX 4090 (your own)
Time: 8 hours
Cost: $0 (electricity ~$2)
Or cloud RTX 4090:
Cost: $0.50/hour × 8 = $4
Cost per 1M tokens:
Model | Input | Output | Total (50/50)
-------------------------|--------|--------|---------------
GPT-4o | $2.50 | $10.00 | $6.25
GPT-3.5-turbo | $0.50 | $1.50 | $1.00
Fine-tuned GPT-3.5 | $0.50 | $1.50 | $1.00
Claude Sonnet 3.5 | $3.00 | $15.00 | $9.00
Self-hosted LLaMA 3.1 70B| $0.80 | $0.80 | $0.80
Example 1: Customer Support
Scenario: 100K queries/month
Before (human agents):
Cost: 10 agents × $4K/month = $40K/month
After (fine-tuned GPT-3.5):
API cost: 100K queries × 1K tokens × $1/M = $100/month
1 agent for escalations: $4K/month
Total: $4,100/month
Savings: $35,900/month
ROI: $35,900 saved on ~$100/month of API spend ≈ 359x in month 1
Fine-tuning investment: $50
Break-even: Instant
Example 2: Legal Contract Review
Scenario: 500 contracts/month
Before (paralegals):
Time: 500 contracts × 2 hours = 1,000 hours
Cost: 1,000 hours × $50/hour = $50K/month
After (fine-tuned GPT-4o-mini):
API cost: 500 × 10K tokens × $1.50/M = $7.50
Review time: 500 × 0.25 hours = 125 hours
Cost: 125 × $50 = $6,250
Total: $6,257.50/month
Savings: $43,742.50/month
ROI: ~$43,742 saved each month against roughly $257 in model-related costs ≈ 170x
Fine-tuning investment: $150
Break-even: Day 1
Example 3: Code Generation
Scenario: Internal tool, 10 developers
Before (manual coding):
Time saved: 5 hours/week/dev × 10 devs = 50 hours/week
Value: 50 hours × $100/hour × 4 weeks = $20K/month
After (self-hosted LLaMA-Coder):
Training cost (one-time): $200
Hosting: $500/month (GPU server)
Net savings: $20K - $500 = $19,500/month
ROI: 39x
Break-even: ~1 day (the ~$700 first-month cost is recouped at roughly $650/day of value)
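To sanity-check these figures, here's the arithmetic from Example 1 as a tiny script (the numbers come straight from the scenario above):

# Worked check of Example 1 (customer support)
before = 10 * 4_000                            # 10 human agents
api_cost = 100_000 * 1_000 / 1_000_000 * 1.0   # 100K queries × 1K tokens at $1/M
after = api_cost + 4_000                       # API plus one escalation agent
savings = before - after
print(f"${savings:,.0f}/month saved, {savings / api_cost:.0f}x the API spend")
# -> $35,900/month saved, 359x the API spend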
Pitfall 1: Overfitting
Symptom:
Training accuracy: 99%
Test accuracy: 65%
Causes:
- Too many training epochs
- Too little (or too repetitive) training data
Solutions:
# 1. Reduce epochs
num_epochs = 1  # Start with 1, increase if needed

# 2. Increase dropout
lora_dropout = 0.1  # Regularization

# 3. Early stopping (see the Trainer sketch below)
early_stopping_patience = 3

# 4. Data augmentation
def augment_data(example):
    # Paraphrase questions, add variations (task-specific)
    return variations
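For reference, early stopping with the HuggingFace Trainer is wired up via EarlyStoppingCallback (this assumes your TrainingArguments enable periodic evaluation and load_best_model_at_end=True):

from transformers import EarlyStoppingCallback, Trainer

trainer = Trainer(
    model=model,
    args=training_args,  # needs eval every N steps + load_best_model_at_end=True
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)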
Pitfall 2: Catastrophic Forgetting
Symptom:
Model becomes expert at new task
But forgets how to do basic things
Solutions:
# Use LoRA instead of full fine-tuning
# Keeps base model frozen
# Or: Mix in general examples
training_data = domain_specific_data + general_data
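One simple way to do that mixing, as a sketch (the 20% general-data ratio is an assumption, not a rule):

import random

def mix_datasets(domain_data, general_data, general_fraction=0.2):
    """Return domain data plus enough general examples to hit the target fraction."""
    k = int(len(domain_data) * general_fraction / (1 - general_fraction))
    return domain_data + random.sample(general_data, min(k, len(general_data)))

training_data = mix_datasets(domain_specific_data, general_data)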
Pitfall 3: Poor Data Quality
Symptom:
Model learns bad patterns
Inconsistent outputs
Solutions:
# 1. Manual review sample
review_random_sample(data, n=100)

# 2. Automated checks
def check_quality(example):
    checks = {
        "too_short": len(example["output"]) < 20,
        "repetitive": has_repetition(example["output"]),
        "formatting": not well_formatted(example["output"])
    }
    return checks

# 3. Remove low-quality examples
high_quality_data = [ex for ex in data if passes_quality_checks(ex)]
Pitfall 4: No Baseline Comparison
Symptom:
Fine-tuned model only 2% better
(But you spent 2 weeks fine-tuning)
Solution:
ALWAYS establish baseline first:
1. Try prompt engineering
2. Try few-shot learning
3. Try RAG
4. THEN consider fine-tuning
Only fine-tune if baseline < 80% and you need >90%
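As a sketch, that workflow can be a tiny harness that scores each cheap approach on the same held-out set before you commit (evaluate, zero_shot_fn, few_shot_fn, and rag_fn are placeholders for your own task metric and generators):

def choose_approach(test_set, approaches, threshold=0.80):
    """approaches: dict mapping name -> generate function."""
    scores = {name: evaluate(fn, test_set) for name, fn in approaches.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return f"Use {best} ({scores[best]:.0%}) - no fine-tune needed"
    return f"Best baseline is {best} at {scores[best]:.0%} - consider fine-tuning"

print(choose_approach(test_data, {
    "prompt engineering": zero_shot_fn,
    "few-shot": few_shot_fn,
    "rag": rag_fn,
}))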
BEFORE FINE-TUNING:
☐ Tried prompt engineering?
☐ Tried RAG?
☐ Baseline performance < 80%?
☐ Have 500+ quality examples?
☐ Clear success metrics defined?
☐ Budget allocated?

CHOOSING METHOD:
☐ Unlimited budget? → Full fine-tuning
☐ Normal budget? → LoRA
☐ Low budget? → QLoRA
☐ API user? → OpenAI/Anthropic fine-tuning

DATA PREPARATION:
☐ Format validated?
☐ Train/val/test split done?
☐ Quality checked?
☐ Diverse examples?
☐ Edge cases included?

TRAINING:
☐ Start with 1 epoch
☐ Monitor validation loss
☐ Save checkpoints
☐ Log everything

EVALUATION:
☐ Test on HELD-OUT data
☐ Compare to baseline
☐ Qualitative review
☐ Error analysis
☐ A/B test in production

DEPLOYMENT:
☐ Gradual rollout
☐ Monitoring in place
☐ Rollback plan ready
☐ Cost tracking enabled
Fine-tuning is powerful but not always necessary:

When to fine-tune:
- Specialized domain knowledge, strict format/style requirements, private or proprietary data, higher accuracy on one narrow task, or cost at scale

When NOT to fine-tune:
- Prompt engineering or RAG can solve it, you have fewer than ~100 quality examples, the task changes frequently, or the base model already excels

Best practices:
- Establish a baseline first, default to LoRA/QLoRA over full fine-tuning, keep a held-out test set, and roll out gradually with monitoring and a rollback plan