Self-consistency - Context Engineering Dictionary

A model is non-deterministic: ask it the same question twice and you can get two different answers. Self-consistency turns that from a bug into a feature. Instead of trusting a single sample, you generate several, then let them vote. The most common answer wins, and the odd unlucky miss gets outvoted.

Why it works

Different runs take different reasoning paths, but correct paths tend to converge on the same answer while wrong ones scatter. Counting the answers surfaces the one the model lands on most often, which is usually the right one. You are spending a handful of extra calls to buy down variance.
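To put a rough number on that trade, here is a minimal sketch that computes how often a majority vote is right. It assumes a yes/no answer space and fully independent samples, which is a simplification: samples from the same model are correlated, so real gains are usually smaller. The function names are illustrative, not part of any library.

```typescript
// Binomial coefficient "n choose k", computed iteratively.
function binomial(n: number, k: number): number {
  let result = 1
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i
  }
  return result
}

// Probability that a strict majority of n samples is correct when each
// sample is independently correct with probability p.
// ASSUMPTION: binary answers and independent samples.
function majorityCorrectProbability(p: number, n: number): number {
  let total = 0
  // A strict majority needs more than half of the votes.
  for (let k = Math.floor(n / 2) + 1; k <= n; k++) {
    total += binomial(n, k) * p ** k * (1 - p) ** (n - k)
  }
  return total
}

console.log(majorityCorrectProbability(0.7, 5).toFixed(3)) // 0.837
console.log(majorityCorrectProbability(0.4, 5).toFixed(3)) // 0.317
```

Under these assumptions, a model that is right 70% of the time per sample is right about 84% of the time after a five-way vote, while a model that is right only 40% of the time gets worse, not better.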

The pattern

Sample the prompt N times, normalise the answers, and take the mode:

TypeScript

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const model = openai('gpt-5-mini')

// Sample the same prompt five times in parallel.
const runs = await Promise.all(
  Array.from({ length: 5 }, () =>
    generateText({ model, prompt: 'Is 17 prime? Answer only yes or no.' })
  )
)

// Normalise each reply to bare lowercase letters, then count it.
const tally: Record<string, number> = {}
for (const r of runs) {
  const a = r.text.trim().toLowerCase().replace(/[^a-z]/g, '')
  tally[a] = (tally[a] || 0) + 1
}

// Take the mode: the answer with the highest count.
const answer = Object.entries(tally).sort((a, b) => b[1] - a[1])[0][0]

Five samples, one majority answer. The calls run in parallel, so the cost is roughly five times the tokens, while wall-clock time stays close to that of the slowest single call.

When to use it

Best on tasks with a small answer space: classifications, yes/no calls, short extractions where "the same answer" is easy to define.
One of the simplest reliability levers available. Unlike a full evaluator loop, it needs no reward function, just a vote.
It has a ceiling. If the model is wrong most of the time, the majority is wrong too. Voting sharpens a decent model; it cannot rescue a bad one.
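What counts as "the same answer" comes down to the normalisation step. The sketch below is one possible way to make that explicit: a hypothetical `majorityVote` helper (not part of any SDK) that takes already-collected answers plus a normalise function, and also reports the winner's share of the votes.

```typescript
// Hypothetical helper: tally collected answers and report the winner
// together with its share of the votes.
type VoteResult = { answer: string; votes: number; share: number }

function majorityVote(
  answers: string[],
  normalise: (raw: string) => string = (raw) => raw.trim().toLowerCase()
): VoteResult {
  if (answers.length === 0) {
    throw new Error('majorityVote needs at least one answer')
  }
  // A Map avoids clashes with built-in object keys such as "constructor".
  const tally = new Map<string, number>()
  for (const raw of answers) {
    const key = normalise(raw)
    tally.set(key, (tally.get(key) ?? 0) + 1)
  }
  let best: VoteResult = { answer: '', votes: 0, share: 0 }
  tally.forEach((votes, answer) => {
    // Strict ">" keeps the earliest-seen answer when two answers tie.
    if (votes > best.votes) {
      best = { answer, votes, share: votes / answers.length }
    }
  })
  return best
}

// Example: a classification label with inconsistent casing and punctuation.
const labelOnly = (raw: string) => raw.toLowerCase().replace(/[^a-z]/g, '')
const result = majorityVote(['Spam', 'spam.', ' SPAM', 'ham', 'Spam!'], labelOnly)
console.log(result) // { answer: 'spam', votes: 4, share: 0.8 }
```

The share is a design choice rather than a requirement of the technique: a narrow win may be worth treating differently from a unanimous one.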

Tip

Self-consistency pairs well with an LLM-as-judge. When the answers are longer than a single label, two samples rarely match word for word, so a plain majority vote has little to count. Sample several answers, then let a judge pick the best one instead.
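A rough sketch of that sample-then-judge variant follows. `Generate` is a hypothetical stand-in for any function that sends a prompt to a model and resolves with its text, and `sampleThenJudge` and the judge prompt wording are illustrative assumptions, not an established API.

```typescript
// ASSUMPTION: `Generate` is a hypothetical abstraction over a model call.
type Generate = (prompt: string) => Promise<string>

async function sampleThenJudge(
  generate: Generate,
  judge: Generate,
  prompt: string,
  n = 5
): Promise<string> {
  if (n < 1) throw new Error('n must be at least 1')

  // Draw n candidates in parallel, as in the majority-vote version.
  const candidates = await Promise.all(
    Array.from({ length: n }, () => generate(prompt))
  )

  // Show the judge every candidate, numbered from 1, and ask for one number.
  const numbered = candidates.map((c, i) => `[${i + 1}] ${c}`).join('\n\n')
  const verdict = await judge(
    `Task:\n${prompt}\n\nCandidate answers:\n${numbered}\n\n` +
      `Reply with only the number of the best candidate.`
  )

  // Fall back to the first candidate if the reply is not a usable number.
  const match = verdict.match(/\d+/)
  const index = match ? Number(match[0]) - 1 : 0
  return candidates[index] ?? candidates[0]
}
```

With the AI SDK setup from the pattern above, both `generate` and `judge` could be thin wrappers such as `(p) => generateText({ model, prompt: p }).then((r) => r.text)`. Injecting them as parameters also makes the function easy to exercise with stubs.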

Related terms

Agents vs. workflows

A workflow follows a path you designed in advance; an agent decides its own path at run time by calling tools in a loop toward a goal. Knowing which one you actually need is the first context-engineering decision.

LLM-as-judge

An LLM-as-judge uses one model call to score the output of another against a rubric. It is how you evaluate fuzzy, open-ended work at scale when there is no single correct answer to match against.