
Introduction
How do you know if your AI-generated translation is actually good?
Traditional metrics like BLEU scores measure word overlap — but they miss fluency, context, and cultural nuance entirely. A translation can score well on BLEU and still read like gibberish to a native speaker.
This is where LLM-as-a-Judge comes in — using a large language model to evaluate the quality of another model's output. In this tutorial, we'll build a practical evaluation pipeline that scores translation quality across multiple dimensions using Claude as the judge.
By the end, you'll have a working system you can plug into any translation workflow.
What We're Building
A Python-based evaluation pipeline that:
Takes a source text and its translation as input.
Sends both to an LLM judge with a structured scoring prompt.
Returns scores for fluency, accuracy, and cultural appropriateness.
Logs results to a simple JSON file for tracking over time.
Prerequisites
Python 3.9+
An Anthropic API key (or an OpenAI key, if you adapt the client code)
Basic familiarity with REST APIs and Python
Install the dependencies:
pip install anthropic python-dotenv
Step 1: Set Up Your Project Structure
llm-judge-pipeline/
├── evaluator.py
├── prompts.py
├── logger.py
├── results/
│ └── evaluations.json
└── .env
Create your .env:
ANTHROPIC_API_KEY=your_api_key_here
Step 2: Design Your Evaluation Prompt
The quality of your judge depends almost entirely on your prompt. We want structured, consistent output — so we'll ask the model to respond in JSON.
prompts.py
JUDGE_PROMPT = """
You are an expert translation evaluator with deep knowledge of linguistics and cultural context.
You will be given:
- SOURCE: The original text
- TRANSLATION: The translated output to evaluate
Evaluate the translation on three dimensions and return ONLY a JSON object:
{{
  "fluency": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "accuracy": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "cultural_appropriateness": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "overall_score": <average of the three scores>,
  "recommendation": "<pass | review | reject>"
}}
SOURCE: {source}
TRANSLATION: {translation}
TARGET LANGUAGE: {target_language}
"""
Step 3: Build the Evaluator
evaluator.py
import os
import json
import re
import anthropic
from dotenv import load_dotenv
from prompts import JUDGE_PROMPT

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def evaluate_translation(source: str, translation: str, target_language: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        source=source,
        translation=translation,
        target_language=target_language
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    raw_response = message.content[0].text
    try:
        result = json.loads(raw_response)
    except json.JSONDecodeError:
        # Fall back to extracting the first JSON object if the model wrapped it in extra text
        json_match = re.search(r'\{.*\}', raw_response, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
        else:
            raise ValueError("Could not parse JSON from model response")
    return result

def batch_evaluate(pairs: list[dict]) -> list[dict]:
    """Evaluate multiple source/translation pairs."""
    results = []
    for pair in pairs:
        result = evaluate_translation(
            source=pair["source"],
            translation=pair["translation"],
            target_language=pair["target_language"]
        )
        # Keep the original texts alongside the scores so the log is self-describing
        result["source"] = pair["source"]
        result["translation"] = pair["translation"]
        results.append(result)
    return results
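One practical note: batch_evaluate makes one API call per pair, so a transient failure midway aborts the whole batch. Here is a rough sketch of a retry wrapper, assuming the Anthropic SDK's RateLimitError and APIConnectionError are the exceptions worth retrying:

import time

import anthropic

from evaluator import evaluate_translation

def evaluate_with_retry(source: str, translation: str, target_language: str,
                        retries: int = 3, backoff: float = 2.0) -> dict:
    """Retry transient API failures with simple exponential backoff."""
    for attempt in range(retries):
        try:
            return evaluate_translation(source, translation, target_language)
        except (anthropic.RateLimitError, anthropic.APIConnectionError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))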
Step 4: Add a Logger
logger.py
import json
import os
from datetime import datetime, timezone

RESULTS_FILE = "results/evaluations.json"

def log_evaluation(evaluation: dict):
    os.makedirs("results", exist_ok=True)
    existing = []
    if os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE, "r") as f:
            existing = json.load(f)
    evaluation["timestamp"] = datetime.now(timezone.utc).isoformat()
    existing.append(evaluation)
    with open(RESULTS_FILE, "w") as f:
        json.dump(existing, f, indent=2)
    print(f"✅ Logged evaluation. Overall score: {evaluation['overall_score']}/10")
Step 5: Run It
main.py
import json

from evaluator import evaluate_translation
from logger import log_evaluation

source = "Please ensure the patient takes the medication twice daily with food."
translation = "Jọwọ rii daju pe alaisan mu oogun naa lẹmeji lojoojumọ pẹlu ounjẹ."
target_language = "Yoruba"

result = evaluate_translation(source, translation, target_language)
log_evaluation(result)
print(json.dumps(result, indent=2))
Sample Output
{
  "fluency": {
    "score": 9,
    "reason": "The translation reads naturally and follows Yoruba grammatical structure."
  },
  "accuracy": {
    "score": 8,
    "reason": "Core meaning is preserved; minor phrasing differences don't affect intent."
  },
  "cultural_appropriateness": {
    "score": 9,
    "reason": "Terminology is appropriate for a Nigerian Yoruba-speaking audience."
  },
  "overall_score": 8.67,
  "recommendation": "pass"
}
Step 6: Scale It with Batch Processing
from evaluator import batch_evaluate
from logger import log_evaluation

pairs = [
    {
        "source": "Welcome to our platform.",
        "translation": "Kaabọ si pẹpẹ wa.",
        "target_language": "Yoruba"
    },
    {
        "source": "Your payment was successful.",
        "translation": "Isanwo rẹ ti ṣaṣeyọri.",
        "target_language": "Yoruba"
    }
]

results = batch_evaluate(pairs)
for result in results:
    log_evaluation(result)
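To close the loop for human review, you can partition the batch results by the judge's recommendation. This is a small sketch that continues the batch script above; the needs_review queue is just an illustration:

needs_review = [r for r in results if r["recommendation"] != "pass"]
print(f"{len(results) - len(needs_review)} passed, {len(needs_review)} flagged for human review")
for r in needs_review:
    # Surface the translation, its overall score, and the judge's verdict for a reviewer
    print(f"- {r['translation']} (overall {r['overall_score']}/10, {r['recommendation']})")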
When to Use LLM-as-a-Judge
LLM-as-a-Judge works best when:
You need nuanced evaluation beyond keyword matching
You're working with low-resource languages where reference datasets are scarce
You want explainable scores — not just a number, but a reason
You're building human-in-the-loop review systems
Caveats and Best Practices
Cost: Every evaluation is an API call, so batch wisely
Judge bias: LLMs have their own language biases; calibrate against human evaluators
Consistency: Set temperature=0 for more deterministic scoring (a one-line tweak shown after this list)
Self-evaluation: Don't use the same model as judge and translator
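For the consistency point, the judge call in evaluator.py can pin the sampling temperature. A minimal tweak to the existing messages.create call:

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    temperature=0,  # lower temperature keeps repeated judgments more consistent
    messages=[{"role": "user", "content": prompt}]
)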
Next Steps
Add a dashboard to visualize score trends over time.
Integrate with GitHub Actions to auto-evaluate translations in CI/CD (a minimal gate script is sketched below).
Extend to multi-language pairs with language-specific rubrics.
Add human feedback loops to refine your judge prompt over time.
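For the CI/CD idea, one approach (only a sketch; the filename and threshold are placeholders) is a small gate script that a GitHub Actions step runs after batch evaluation, failing the job if anything was rejected:

# ci_gate.py (hypothetical helper): exit non-zero if any logged evaluation was rejected
import json
import sys

with open("results/evaluations.json") as f:
    evaluations = json.load(f)

rejected = [e for e in evaluations if e["recommendation"] == "reject"]
if rejected:
    print(f"{len(rejected)} translation(s) rejected by the judge")
    sys.exit(1)
print("All translations passed or were flagged for review only")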
Conclusion
LLM-as-a-Judge is one of the most practical evaluation techniques available today — especially for language tasks where ground truth is hard to define. With just a well-crafted prompt and a structured output format, you can build an evaluation system that catches what traditional metrics miss.
The full code is available on GitHub.