LLM-as-judge - Context Engineering Dictionary

Some outputs are easy to grade: does this equal that? But most interesting work is fuzzy. Is this summary faithful? Is this answer actually grounded in the retrieved context? Is this reply on-brand? You cannot exact-match your way to those judgments, and you cannot put a human on every one at scale. An LLM-as-judge is the practical middle ground: a model call that scores another model's output against a rubric you write.

Why it matters for context engineering

You cannot improve what you cannot measure. Every context technique in this dictionary, from retrieval to memory, is only worth adding if it moves a number. A judge gives you that number for tasks where "correct" is a matter of degree, which turns tuning your system from guesswork into something you can actually iterate on.

The pattern: force a structured verdict

The key is to make the judge return a structured score, not a paragraph of prose you then have to parse. Give it a schema and it hands back a clean, typed verdict:

TypeScript

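// AI SDK: generateObject validates the model's reply against the schema
import { generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

const { object } = await generateObject({
  model: openai('gpt-5-mini'),
  // The verdict shape: a bounded score plus the judge's explanation
  schema: z.object({
    score: z.number().min(0).max(1),
    reason: z.string(),
  }),
  prompt: 'Rate how well "Paris" answers "capital of France?". Return a score 0-1 and a short reason.',
})

// object is typed as { score: number; reason: string }
console.log(object.score, object.reason)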

The reason field matters as much as the score: it makes the judgment debuggable, so when a grade looks wrong you can see why the judge decided it.

Using it well

Write a rubric, not a vibe. "Rate 1-10" gets you noise. Spell out what each level means, or break the score into named sub-checks.
Judge one thing at a time. Faithfulness, relevance, and tone are separate questions; asking for all three in one number hides which one failed.
Watch for bias. Judges can favour longer answers (verbosity bias), their own style (self-preference), or the first option shown (position bias). Check the judge against a few human labels before you trust it.
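To make these points concrete, here is a minimal sketch, assuming a three-level scale per criterion. The criterion names, the rubric wording, and the helpers `buildJudgePrompt` and `agreementRate` are illustrative assumptions, not part of any library. The prompt it builds would go to a judge call like the `generateObject` one above.

```typescript
// Illustrative sketch: all names and rubric text here are assumptions.
type Criterion = 'faithfulness' | 'relevance' | 'tone'

// A rubric, not a vibe: every level of every criterion is spelled out.
const rubrics: Record<Criterion, string[]> = {
  faithfulness: [
    '0: contradicts the source or invents facts',
    '1: mostly supported, with at least one unsupported claim',
    '2: every claim is supported by the source',
  ],
  relevance: [
    '0: does not address the question',
    '1: addresses the question but drifts off topic',
    '2: addresses the question directly and completely',
  ],
  tone: [
    '0: clearly off-brand',
    '1: neutral, neither on-brand nor off-brand',
    '2: clearly on-brand',
  ],
}

// Judge one thing at a time: each prompt asks about exactly one criterion.
function buildJudgePrompt(criterion: Criterion, source: string, output: string): string {
  return [
    `Grade the OUTPUT for ${criterion} only. Ignore every other quality.`,
    'Use this scale:',
    ...rubrics[criterion],
    `SOURCE: ${source}`,
    `OUTPUT: ${output}`,
    'Return the level number and a short reason.',
  ].join('\n')
}

// Check the judge against human labels: the fraction of items where the
// judge's level exactly matches the human's level.
function agreementRate(judgeLevels: number[], humanLevels: number[]): number {
  if (judgeLevels.length === 0 || judgeLevels.length !== humanLevels.length) {
    throw new Error('need two non-empty lists of equal length')
  }
  const matches = judgeLevels.filter((level, i) => level === humanLevels[i]).length
  return matches / judgeLevels.length
}
```

Running one prompt per criterion costs more calls than a single combined score, but a failing grade then tells you which question failed. If the agreement rate on your labelled sample is low, revise the rubric before trusting the judge's numbers.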

Tip

Combine a judge with self-consistency: generate several candidate answers, then let the judge pick the best one instead of taking a plain majority vote. It is the same sample-then-select idea, with the judge's score replacing the vote count.
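A rough sketch of that selection step, assuming the judge is passed in as a function. `Verdict`, `Judge`, and `pickBest` are hypothetical names; in practice the judge would wrap a structured model call that returns a score and a reason.

```typescript
// Hypothetical types: a judge takes a question and an answer and returns a verdict.
type Verdict = { score: number; reason: string }
type Judge = (question: string, answer: string) => Promise<Verdict>

// Score every candidate with the judge and return the highest-scoring one.
// Ties go to the earliest candidate.
async function pickBest(
  question: string,
  candidates: string[],
  judge: Judge,
): Promise<{ answer: string; verdict: Verdict }> {
  if (candidates.length === 0) throw new Error('no candidates to judge')
  // Judge each candidate independently, so none is scored relative to another.
  const verdicts = await Promise.all(candidates.map((c) => judge(question, c)))
  let best = 0
  for (let i = 1; i < verdicts.length; i++) {
    if (verdicts[i].score > verdicts[best].score) best = i
  }
  return { answer: candidates[best], verdict: verdicts[best] }
}
```

Scoring each candidate on its own, rather than showing the judge all of them at once, sidesteps the bias toward the first option shown, at the cost of one judge call per candidate.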

Related terms

Self-consistency

Self-consistency samples the same prompt several times and takes the majority answer. It trades a few extra calls for lower variance, so a model that sometimes slips lands on its most frequent answer far more reliably.


Agents vs. workflows

A workflow follows a path you designed in advance; an agent decides its own path at run time by calling tools in a loop toward a goal. Knowing which one you actually need is the first context-engineering decision.