
Introduction
How do you know if your AI-generated translation is actually good?
Traditional metrics like BLEU scores measure word overlap — but they miss fluency, context, and cultural nuance entirely. A translation can score well on BLEU and still read like gibberish to a native speaker.
This is where LLM-as-a-Judge comes in — using a large language model to evaluate the quality of another model's output. In this tutorial, we'll build a practical evaluation pipeline that scores translation quality across multiple dimensions using Claude as the judge.
By the end, you'll have a working system you can plug into any translation workflow.
What We're Building
A Python-based evaluation pipeline that:
Takes a source text and its translation as input.
Sends both to an LLM judge with a structured scoring prompt.
Returns scores for fluency, accuracy, and cultural appropriateness.
Logs results to a simple JSON file for tracking over time.
Prerequisites
Python 3.9+
An Anthropic API key (or an OpenAI key, if you adapt the client code)
Basic familiarity with REST APIs and Python
Install the dependencies:
pip install anthropic python-dotenv
Step 1: Set Up Your Project Structure
llm-judge-pipeline/
├── evaluator.py
├── prompts.py
├── logger.py
├── results/
│ └── evaluations.json
└── .env
Create your .env:
ANTHROPIC_API_KEY=your_api_key_here
Step 2: Design Your Evaluation Prompt
The quality of your judge depends almost entirely on your prompt. We want structured, consistent output — so we'll ask the model to respond in JSON.
prompts.py
JUDGE_PROMPT = """
You are an expert translation evaluator with deep knowledge of linguistics and cultural context.
You will be given:
- SOURCE: The original text
- TRANSLATION: The translated output to evaluate
Evaluate the translation on three dimensions and return ONLY a JSON object:
{{
  "fluency": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "accuracy": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "cultural_appropriateness": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "overall_score": <average of the three scores>,
  "recommendation": "<pass | review | reject>"
}}
SOURCE: {source}
TRANSLATION: {translation}
TARGET LANGUAGE: {target_language}
"""
Step 3: Build the Evaluator
evaluator.py
import os
import json
import re
import anthropic
from dotenv import load_dotenv
from prompts import JUDGE_PROMPT

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def evaluate_translation(source: str, translation: str, target_language: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        source=source,
        translation=translation,
        target_language=target_language
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    raw_response = message.content[0].text
    try:
        result = json.loads(raw_response)
    except json.JSONDecodeError:
        # Fall back to extracting the first JSON object if the model wrapped it in extra text
        json_match = re.search(r'\{.*\}', raw_response, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
        else:
            raise ValueError("Could not parse JSON from model response")
    return result

def batch_evaluate(pairs: list[dict]) -> list[dict]:
    """Evaluate multiple source/translation pairs."""
    results = []
    for pair in pairs:
        result = evaluate_translation(
            source=pair["source"],
            translation=pair["translation"],
            target_language=pair["target_language"]
        )
        # Keep the original texts alongside the scores so the log is self-describing
        result["source"] = pair["source"]
        result["translation"] = pair["translation"]
        results.append(result)
    return results
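One practical note: batch_evaluate makes one API call per pair, so a transient failure midway aborts the whole batch. Here is a rough sketch of a retry wrapper, assuming the Anthropic SDK's RateLimitError and APIConnectionError are the exceptions worth retrying:

import time

import anthropic

from evaluator import evaluate_translation

def evaluate_with_retry(source: str, translation: str, target_language: str,
                        retries: int = 3, backoff: float = 2.0) -> dict:
    """Retry transient API failures with simple exponential backoff."""
    for attempt in range(retries):
        try:
            return evaluate_translation(source, translation, target_language)
        except (anthropic.RateLimitError, anthropic.APIConnectionError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))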
Step 4: Add a Logger
logger.py
import json
import os
from datetime import datetime, timezone

RESULTS_FILE = "results/evaluations.json"

def log_evaluation(evaluation: dict):
    os.makedirs("results", exist_ok=True)
    existing = []
    if os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE, "r") as f:
            existing = json.load(f)
    evaluation["timestamp"] = datetime.now(timezone.utc).isoformat()
    existing.append(evaluation)
    with open(RESULTS_FILE, "w") as f:
        json.dump(existing, f, indent=2)
    print(f"✅ Logged evaluation. Overall score: {evaluation['overall_score']}/10")
Step 5: Run It
main.py
import json

from evaluator import evaluate_translation
from logger import log_evaluation

source = "Please ensure the patient takes the medication twice daily with food."
translation = "Jọwọ rii daju pe alaisan mu oogun naa lẹmeji lojoojumọ pẹlu ounjẹ."
target_language = "Yoruba"

result = evaluate_translation(source, translation, target_language)
log_evaluation(result)
print(json.dumps(result, indent=2))
Sample Output
{
  "fluency": {
    "score": 9,
    "reason": "The translation reads naturally and follows Yoruba grammatical structure."
  },
  "accuracy": {
    "score": 8,
    "reason": "Core meaning is preserved; minor phrasing differences don't affect intent."
  },
  "cultural_appropriateness": {
    "score": 9,
    "reason": "Terminology is appropriate for a Nigerian Yoruba-speaking audience."
  },
  "overall_score": 8.67,
  "recommendation": "pass"
}
Step 6: Scale It with Batch Processing
from evaluator import batch_evaluate
from logger import log_evaluation

pairs = [
    {
        "source": "Welcome to our platform.",
        "translation": "Kaabọ si pẹpẹ wa.",
        "target_language": "Yoruba"
    },
    {
        "source": "Your payment was successful.",
        "translation": "Isanwo rẹ ti ṣaṣeyọri.",
        "target_language": "Yoruba"
    }
]

results = batch_evaluate(pairs)
for result in results:
    log_evaluation(result)
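To close the loop for human review, you can partition the batch results by the judge's recommendation. This is a small sketch that continues the batch script above; the needs_review queue is just an illustration:

needs_review = [r for r in results if r["recommendation"] != "pass"]
print(f"{len(results) - len(needs_review)} passed, {len(needs_review)} flagged for human review")
for r in needs_review:
    # Surface the translation, its overall score, and the judge's verdict for a reviewer
    print(f"- {r['translation']} (overall {r['overall_score']}/10, {r['recommendation']})")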
When to Use LLM-as-a-Judge
LLM-as-a-Judge works best when:
You need nuanced evaluation beyond keyword matching
You're working with low-resource languages where reference datasets are scarce
You want explainable scores — not just a number, but a reason
You're building human-in-the-loop review systems
Caveats and Best Practices
Cost: Every evaluation is an API call, so batch wisely
Judge bias: LLMs have their own language biases; calibrate against human evaluators
Consistency: Set temperature=0 for more deterministic scoring (a one-line tweak shown after this list)
Self-evaluation: Don't use the same model as judge and translator
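For the consistency point, the judge call in evaluator.py can pin the sampling temperature. A minimal tweak to the existing messages.create call:

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    temperature=0,  # lower temperature keeps repeated judgments more consistent
    messages=[{"role": "user", "content": prompt}]
)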
Next Steps
Add a dashboard to visualize score trends over time.
Integrate with GitHub Actions to auto-evaluate translations in CI/CD (a minimal gate script is sketched below).
Extend to multi-language pairs with language-specific rubrics.
Add human feedback loops to refine your judge prompt over time.
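For the CI/CD idea, one approach (only a sketch; the filename and threshold are placeholders) is a small gate script that a GitHub Actions step runs after batch evaluation, failing the job if anything was rejected:

# ci_gate.py (hypothetical helper): exit non-zero if any logged evaluation was rejected
import json
import sys

with open("results/evaluations.json") as f:
    evaluations = json.load(f)

rejected = [e for e in evaluations if e["recommendation"] == "reject"]
if rejected:
    print(f"{len(rejected)} translation(s) rejected by the judge")
    sys.exit(1)
print("All translations passed or were flagged for review only")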
Conclusion
LLM-as-a-Judge is one of the most practical evaluation techniques available today — especially for language tasks where ground truth is hard to define. With just a well-crafted prompt and a structured output format, you can build an evaluation system that catches what traditional metrics miss.
The full code is available on GitHub.