Sycophancy - AI Coding Dictionary

Sycophancy is a model's tendency to tell you what you want to hear. Push back on it and it folds. Propose a bad idea confidently and it finds reasons you are right. It is agreeable by disposition, and that is a problem when you need an honest second opinion.

Why models do it

Models are trained to be helpful and to please the person they are talking to, and "agree with the user" is a reliable way to seem helpful. So the behaviour is baked in, not a bug you can fully prompt away. In coding it shows up constantly:

You ask "this looks right, yeah?" and it agrees, because you signalled the answer you wanted.
You suggest a flawed approach and it builds on it instead of flagging the flaw.
You challenge a correct answer and it caves rather than defending it.

Ask straight questions

The practical danger is that leading questions get leading answers. "Am I right that X?" is not a real check, it is an invitation to agree. To get signal, remove the cue:

Ask "what is wrong with this?" instead of "this is fine, right?"
Have it argue the opposite case, or run a self-critique pass where a fresh agent attacks the work.
Keep a human review for judgment calls, since the model will not reliably volunteer that your plan is bad.

Watch out

Sycophancy and hallucination compound. A model can invent a fact, then, when you express doubt, cheerfully agree its own invention was wrong for the wrong reason. Neither its confidence nor its agreement is evidence. Verify independently.

Related terms

Hallucination

A hallucination is a confident, plausible-sounding output that is simply wrong: an invented API, a fabricated file path, a made-up citation. It is not the model lying. It is the model doing exactly what it always does, predicting plausible text, with no built-in sense of truth.

Read definition →

Human review

Human review is a person actually reading what an agent produced, understanding it, and taking responsibility for shipping it. It is the final quality gate that tests and automated review can support but never replace.

Read definition →

Self-critique

Self-critique is asking a model to attack its own output, or having a fresh agent do it, to catch bugs and bad assumptions before you ship. It is a direct counter to a model's tendency to agree with whatever it just produced.

Read definition →

Why models do it

Ask straight questions

Related terms

Hallucination

Human review

Self-critique

Building with AI agents?