Retrieval-augmented generation (RAG)

Also called: RAG

RAG is the workhorse pattern of context engineering: retrieve the material relevant to a request, put it in the context, and let the model generate an answer grounded in it rather than guessing from memory.

James Phoenix
Understanding Data | Updated July 3, 2026

Retrieval-augmented generation is how you get a model to answer from information it was never trained on. Instead of hoping the model already knows something, you retrieve the relevant material at request time, drop it into the context, and let the model generate grounded in what you just handed it. The model stops guessing from memory and starts reading from a source.

Why it exists

A model knows two kinds of thing: what it absorbed in training (frozen, generic, unsourced) and what is in the context right now. RAG is how you get the second kind in: your docs, your codebase, last week's numbers. It is the difference between a plausible answer and a grounded one.

The shape of it

A typical embedding-based RAG system is three moves: embed a corpus, retrieve the chunks closest to the query, then generate with those chunks in context. Here it is end to end with the Vercel AI SDK, ranking a small corpus by cosine similarity over embeddings:

TypeScript
import { embed, embedMany, cosineSimilarity, generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

// 1. Embed: turn every document in the corpus into a vector.
const emb = openai.embedding('text-embedding-3-small')
const docs = [
  'James Phoenix runs Understanding Data.',
  'Paris is the capital of France.',
]
const { embeddings } = await embedMany({ model: emb, values: docs })

// 2. Retrieve: embed the question and rank documents by cosine similarity.
const question = 'Who runs Understanding Data?'
const { embedding } = await embed({ model: emb, value: question })
const ranked = docs
  .map((text, i) => ({ text, score: cosineSimilarity(embedding, embeddings[i]) }))
  .sort((a, b) => b.score - a.score)

// 3. Generate: pass the best-matching document to the model as context.
const { text } = await generateText({
  model: openai('gpt-5-mini'),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Answer the question using only this context:' },
        { type: 'text', text: ranked[0].text },
        { type: 'text', text: `Question: ${question}` },
      ],
    },
  ],
})

The retrieved sentence goes into the prompt, so the model answers from the source instead of its own memory.
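
The example above passes only the single best match (`ranked[0]`). With a larger corpus you would usually keep a handful of top-scoring chunks and drop weak matches. Here is a minimal sketch of that selection step, using hand-written 3-dimensional vectors in place of real embeddings so it runs without an API key. The `cosine` and `selectTopK` helpers, the `k` and `minScore` parameters, the third sentence, and every vector value are illustrative assumptions, not part of the AI SDK.

```typescript
// Hypothetical helpers for illustration only; not part of any SDK.
type Scored = { text: string; score: number }

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Keep at most k chunks, and only those scoring at least minScore.
function selectTopK(
  query: number[],
  candidates: { text: string; vector: number[] }[],
  k: number,
  minScore: number,
): Scored[] {
  return candidates
    .map(c => ({ text: c.text, score: cosine(query, c.vector) }))
    .filter(c => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}

// Made-up vectors standing in for real embeddings.
const toyChunks = [
  { text: 'James Phoenix runs Understanding Data.', vector: [0.9, 0.1, 0.0] },
  { text: 'Paris is the capital of France.', vector: [0.0, 0.2, 0.9] },
  { text: 'France is a country in Western Europe.', vector: [0.1, 0.6, 0.6] },
]
// Pretend this is the embedding of "What is the capital of France?"
const toyQuery = [0, 0, 1]

// Keep up to two chunks; the unrelated sentence falls below the threshold.
const selected = selectTopK(toyQuery, toyChunks, 2, 0.5)
const contextText = selected.map(c => c.text).join('\n')
```

The threshold matters as much as `k`: without it, a query with no good match still returns `k` chunks, and the model answers from whatever came back.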

Getting it right

  • Retrieval quality is everything. If the wrong chunk comes back, the model answers confidently from the wrong context.
  • Less is more. Stuffing in twenty marginal chunks dilutes attention and invites lost-in-the-middle failures. Retrieve few, retrieve well.
  • Measure it. You can use an LLM-as-judge to score whether an answer is actually supported by the retrieved context.
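
One way to act on the last point is to ask a second model whether the retrieved context supports the answer. The sketch below is a minimal version under stated assumptions: the prompt wording, the SUPPORTED / UNSUPPORTED reply format, and the `buildJudgePrompt`, `parseVerdict`, and `judgeGroundedness` names are all hypothetical. The model call is passed in as a function so any client can be plugged in.

```typescript
// Hypothetical LLM-as-judge groundedness check; names and prompt are illustrative.
type Verdict = 'supported' | 'unsupported' | 'unknown'

// Any function that sends a prompt to a model and returns its text reply.
type CallModel = (prompt: string) => Promise<string>

// Build a grading prompt that asks for a one-word verdict.
function buildJudgePrompt(context: string, question: string, answer: string): string {
  return [
    'You are grading an answer for groundedness.',
    'Reply with exactly one word: SUPPORTED if every claim in the answer',
    'is backed by the context, otherwise UNSUPPORTED.',
    '',
    `Context:\n${context}`,
    `Question: ${question}`,
    `Answer: ${answer}`,
  ].join('\n')
}

// Map the judge's free-text reply onto a small set of verdicts.
function parseVerdict(reply: string): Verdict {
  const word = reply.trim().toUpperCase()
  if (word.startsWith('UNSUPPORTED')) return 'unsupported'
  if (word.startsWith('SUPPORTED')) return 'supported'
  return 'unknown' // the judge did not follow the requested format
}

// Send the grading prompt to the injected model and parse its verdict.
async function judgeGroundedness(
  callModel: CallModel,
  context: string,
  question: string,
  answer: string,
): Promise<Verdict> {
  const reply = await callModel(buildJudgePrompt(context, question, answer))
  return parseVerdict(reply)
}
```

With the AI SDK, `callModel` could be a thin wrapper around `generateText` that returns its `text`. Run the judge over a sample of real questions and track the share of `supported` verdicts over time, treating `unknown` as a signal that the judge prompt needs tightening.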

Tip: RAG is not only for chatbots. The same retrieve-then-generate move powers grounded agents, code assistants that read your repo, and anything that has to answer from information the model never saw in training.
