Evaluation

LLM-as-judge

Also called: LLM as a judge, model-graded eval

An LLM-as-judge uses one model call to score the output of another against a rubric. It is how you evaluate fuzzy, open-ended work at scale when there is no single correct answer to match against.

James Phoenix
Understanding Data Updated July 3, 2026

Some outputs are easy to grade: does this equal that? But most interesting work is fuzzy. Is this summary faithful? Is this answer actually grounded in the retrieved context? Is this reply on-brand? You cannot exact-match your way to those judgments, and you cannot put a human on every one at scale. An LLM-as-judge is the practical middle ground: a model call that scores another model's output against a rubric you write.

Why it matters for context engineering

You cannot improve what you cannot measure. Every context technique in this dictionary, from retrieval to memory, is only worth adding if it moves a number. A judge gives you that number for tasks where "correct" is a matter of degree, which turns tuning your system from guesswork into something you can actually iterate on.

The pattern: force a structured verdict

The key is to make the judge return a structured score, not a paragraph of prose you then have to parse. Give it a schema and it hands back a clean, typed verdict:

TypeScript
import { generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

const { object } = await generateObject({
  model: openai('gpt-5-mini'),
  schema: z.object({
    score: z.number().min(0).max(1),
    reason: z.string(),
  }),
  prompt: 'Rate how well "Paris" answers "capital of France?". Return a score 0-1 and a short reason.',
})

The reason field matters as much as the score: it makes the judgment debuggable, so when a grade looks wrong you can see why the judge decided it.

Using it well

  • Write a rubric, not a vibe. "Rate 1-10" gets you noise. Spell out what each level means, or break the score into named sub-checks.
  • Judge one thing at a time. Faithfulness, relevance, and tone are separate questions; asking for all three in one number hides which one failed.
  • Watch for bias. Judges can favour longer answers, their own style, or the first option shown. Check the judge against a few human labels before you trust it.
Tip
Combine a judge with self-consistency: generate several candidate answers, then let the judge pick the best rather than taking a plain majority vote. It is the same idea, applied at the evaluation step.

Related terms

Engineering context for real systems?

Getting the right information into the window at the right time is most of the job. If you want that thinking applied to your product, that is what I do.

See how I can help