Retrieval-augmented generation (RAG) - Context Engineering Dictionary

Retrieval-augmented generation is how you get a model to answer from information it was never trained on. Instead of hoping the model already knows something, you retrieve the relevant material at request time, drop it into the context, and let the model generate grounded in what you just handed it. The model stops guessing from memory and starts reading from a source.

Why it exists

A model knows two kinds of thing: what it absorbed in training (frozen, generic, unsourced) and what is in the context right now. RAG is how you get the second kind in: your docs, your codebase, last week's numbers. It is the difference between a plausible answer and a grounded one.

The shape of it

Every RAG system is three moves: embed a corpus, retrieve the chunks closest to the query, then generate with those chunks in context. Here it is end to end with the Vercel AI SDK, ranking a small corpus by cosine similarity over embeddings:

TypeScript

import { embed, embedMany, cosineSimilarity, generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const emb = openai.embedding('text-embedding-3-small')
const question = 'Who runs Understanding Data?'

// 1. Embed: turn every document in the corpus into a vector, once, up front.
const docs = [
  'James Phoenix runs Understanding Data.',
  'Paris is the capital of France.',
]
const { embeddings } = await embedMany({ model: emb, values: docs })

// 2. Retrieve: embed the query, then rank the documents by similarity to it.
const { embedding } = await embed({ model: emb, value: question })
const ranked = docs
  .map((text, i) => ({ text, score: cosineSimilarity(embedding, embeddings[i]) }))
  .sort((a, b) => b.score - a.score) // highest score first

// 3. Generate: answer with only the top-ranked chunk in context.
const { text } = await generateText({
  model: openai('gpt-5-mini'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Answer the question using only this context:' },
        { type: 'text', text: ranked[0].text },
        { type: 'text', text: `Question: ${question}` },
      ],
    },
  ],
})

The retrieved sentence goes into the prompt, so the model answers from the source instead of its own memory.
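
The ranking step leans entirely on cosine similarity, so it helps to see what that score measures. Here is a minimal hand-rolled sketch, not the SDK's own implementation: the function name `cosine`, the zero-vector handling, and the two-dimensional toy vectors are all assumptions made for illustration. Real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means the vectors point the same way, 0 means they are unrelated (orthogonal).
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // Assumption: treat a zero vector as similar to nothing rather than dividing by zero.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy vectors standing in for real embeddings (hypothetical values).
const queryVec = [1, 0]
const nearVec = [0.9, 0.1] // points almost the same way as the query
const farVec = [0, 1] // orthogonal to the query

const nearScore = cosine(queryVec, nearVec)
const farScore = cosine(queryVec, farVec)
```

Because the score depends on direction and not on length, a long document and a short query can still score as close if they are about the same thing.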

Getting it right

Retrieval quality is everything. If the wrong chunk comes back, the model answers confidently from the wrong context.
Less is more. Stuffing twenty marginal chunks in dilutes attention and invites lost-in-the-middle failures. Retrieve few, retrieve well.
Measure it. Use an LLM-as-judge to score whether each answer is actually supported by the retrieved context.
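
One way to act on "retrieve few, retrieve well" is to cap the number of chunks and drop anything below a similarity floor before building the prompt. This is a sketch under stated assumptions: `selectTopK`, the `Scored` type, the default `k` and `minScore`, and the example scores are all hypothetical, and a sensible threshold depends on your embedding model and corpus.

```typescript
type Scored = { text: string; score: number }

// Keep at most `k` chunks, and only those scoring at least `minScore`.
// Both defaults are illustrative assumptions, not recommended values.
function selectTopK(ranked: Scored[], k = 3, minScore = 0.3): Scored[] {
  return [...ranked] // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score) // highest score first
    .filter((chunk) => chunk.score >= minScore)
    .slice(0, k)
}

// Hypothetical scores, shaped like the output of a ranking step.
const candidates: Scored[] = [
  { text: 'Paris is the capital of France.', score: 0.08 },
  { text: 'James Phoenix runs Understanding Data.', score: 0.71 },
]

// Only the chunk above the floor makes it into the context.
const context = selectTopK(candidates)
  .map((chunk) => chunk.text)
  .join('\n')
```

The floor matters as much as the cap: when nothing clears it, you get an empty context, which is a cleaner signal to handle than a confident answer built on a marginal chunk.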

Tip

RAG is not only for chatbots. The same retrieve-then-generate move powers grounded agents, code assistants that read your repo, and anything that has to answer from information the model never saw in training.

Related terms

Agents vs. workflows

A workflow follows a path you designed in advance; an agent decides its own path at run time by calling tools in a loop toward a goal. Knowing which one you actually need is the first context-engineering decision.

LLM-as-judge

An LLM-as-judge uses one model call to score the output of another against a rubric. It is how you evaluate fuzzy, open-ended work at scale when there is no single correct answer to match against.
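
For a RAG system the rubric is usually groundedness: is the answer supported by the retrieved context? Here is a minimal sketch of the two halves you write yourself, building the judge prompt and parsing its verdict. The model call in between is left out. The function names, the 1 to 5 scale, and the `SCORE:` reply format are all assumptions chosen for illustration, not a standard.

```typescript
// Hypothetical rubric: 1 = not supported by the context, 5 = fully supported.
function buildJudgePrompt(context: string, question: string, answer: string): string {
  return [
    'You are grading whether an answer is supported by the given context.',
    'Score from 1 (not supported) to 5 (fully supported).',
    'Reply with the score on the first line as "SCORE: <n>".',
    `Context: ${context}`,
    `Question: ${question}`,
    `Answer: ${answer}`,
  ].join('\n')
}

// Parse the judge's reply. Returns null when no valid score is found,
// so malformed replies are surfaced instead of silently counted.
function parseJudgeScore(reply: string): number | null {
  const match = reply.match(/SCORE:\s*([1-5])\b/)
  return match ? Number(match[1]) : null
}

const judgePrompt = buildJudgePrompt(
  'James Phoenix runs Understanding Data.',
  'Who runs Understanding Data?',
  'James Phoenix.',
)
```

You would send `judgePrompt` to a second model call and pass its reply to `parseJudgeScore`. Returning `null` for an unparseable reply keeps judge failures separate from low scores.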