Govern / Risk and Quality

10

Safety and Guardrails

Keep agents inside clear safety boundaries

Constrain agent behaviour with input handling, output validation, permissions, sandboxing, policy checks, and circuit breakers.

Principle

Guardrails belong around the system boundary, not only inside the prompt.

Why it matters

Agents act on untrusted input, call tools, and may touch sensitive systems. Prompt instructions help, but they are not enough. Production safety comes from layered controls: least privilege, validation, approvals, isolation, content policy, and fast shutdown when behaviour drifts.

Build this

  • Threat models for prompt injection, data exfiltration, unsafe tool use, policy violations, and destructive actions.
  • Input classification and output validation for structured data, generated content, and external messages.
  • Permission boundaries by user, tenant, tool, environment, and action risk.
  • Circuit breakers for repeated failures, abnormal spend, policy hits, and unsafe outputs.
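A circuit breaker of the kind listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class name `CircuitBreaker`, its fields, and every threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    """Trips when failures, spend, or policy hits cross a limit (limits are illustrative)."""
    max_consecutive_failures: int = 3
    max_spend_usd: float = 5.0
    max_policy_hits: int = 1
    consecutive_failures: int = 0
    spend_usd: float = 0.0
    policy_hits: int = 0
    tripped_reason: Optional[str] = None

    def record(self, *, ok: bool, cost_usd: float = 0.0, policy_hit: bool = False) -> None:
        """Record the outcome of one agent step and trip if a limit is crossed."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        self.spend_usd += cost_usd
        self.policy_hits += int(policy_hit)
        if self.tripped_reason is not None:
            return  # already open: keep the first reason until someone resets it
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.tripped_reason = "repeated failures"
        elif self.spend_usd > self.max_spend_usd:
            self.tripped_reason = "abnormal spend"
        elif self.policy_hits >= self.max_policy_hits:
            self.tripped_reason = "policy hit"

    @property
    def open(self) -> bool:
        return self.tripped_reason is not None

    def check(self) -> None:
        """Call before every step; raises instead of letting the agent continue."""
        if self.open:
            raise RuntimeError(f"circuit open: {self.tripped_reason}")
```

The agent loop would call `check()` before each step and `record(...)` after it, so a tripped breaker stops the next action rather than reporting on it afterwards.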

Watch for

  • Relying on a system prompt to protect secrets or production systems.
  • Tool descriptions that reveal capabilities the user should not access.
  • Agents reading untrusted content and treating it as instructions.
  • Safety checks that run only after the external side effect has already occurred.
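Two of the pitfalls above, untrusted content treated as instructions and checks that run too late, can be avoided by construction. The sketch below is illustrative only: `wrap_untrusted`, `violates_policy`, `deliver`, and the `BLOCKED_MARKERS` list are hypothetical names, and the string match stands in for whatever policy engine a real system uses.

```python
from typing import Callable

# Illustrative policy only; a real system would use its own classifier or policy engine.
BLOCKED_MARKERS = ("BEGIN PRIVATE KEY", "password=")


def wrap_untrusted(text: str) -> dict:
    """Tag fetched content as data so downstream code never merges it into instructions."""
    return {"role": "data", "trusted": False, "content": text}


def violates_policy(message: str) -> bool:
    """Placeholder content check based on simple substring matching."""
    lowered = message.lower()
    return any(marker.lower() in lowered for marker in BLOCKED_MARKERS)


def deliver(message: str, send: Callable[[str], None]) -> bool:
    """Run the check first; `send` (the external side effect) runs only if it passes."""
    if violates_policy(message):
        return False
    send(message)
    return True
```

The ordering inside `deliver` is the point: the side effect is unreachable unless the check has already passed.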

Proof it works

  • Prompt injection test cases fail to make any tool access forbidden data.
  • Unsafe or policy-violating output is blocked before external delivery.
  • A runaway loop, cost spike, or repeated tool failure trips a circuit breaker.
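The first proof point holds most reliably when authorization never looks at model output. As a rough sketch, assuming a hypothetical `Caller` context and a hard-coded `GRANTS` table in place of a real policy store, a deny-by-default check might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    user: str
    tenant: str
    environment: str  # e.g. "staging" or "production"


# Hypothetical grants: (user, tenant, tool, environment) -> highest risk level allowed.
GRANTS = {
    ("alice", "acme", "read_invoice", "production"): "low",
    ("alice", "acme", "delete_invoice", "staging"): "high",
}
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def is_allowed(caller: Caller, tool: str, resource_tenant: str, risk: str) -> bool:
    """Deny by default; the decision uses the caller's context, never the model's text."""
    if risk not in RISK_ORDER:
        return False  # unknown risk labels are rejected, not guessed
    if resource_tenant != caller.tenant:
        return False  # no cross-tenant access, whatever the prompt says
    key = (caller.user, caller.tenant, tool, caller.environment)
    granted = GRANTS.get(key)
    if granted is None:
        return False  # no grant recorded for this combination
    return RISK_ORDER[risk] <= RISK_ORDER[granted]
```

An injection test can then assert that a tool call naming another tenant's resource is refused regardless of what the injected text asked for.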

Implementation checklist

  01. Apply least privilege to every tool and credential.
  02. Run untrusted work in a sandbox or constrained environment.
  03. Validate structured output with schemas and reject invalid states.
  04. Keep high-risk actions behind approval and make rollback part of the action design.
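Schema validation and the approval gate from the checklist can be sketched with the standard library alone. The refund action, the `REFUND_SCHEMA` fields, and the `APPROVAL_THRESHOLD_CENTS` cut-off are all assumptions made for illustration. Rollback is not shown.

```python
import json

# Illustrative schema for a hypothetical refund action: field name -> expected type.
REFUND_SCHEMA = {"order_id": str, "amount_cents": int, "reason": str}
APPROVAL_THRESHOLD_CENTS = 10_000  # assumed cut-off for "high risk"


def parse_refund(raw: str) -> dict:
    """Parse model output and reject anything that does not match the schema exactly."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(data, dict) or set(data) != set(REFUND_SCHEMA):
        raise ValueError("unexpected or missing fields")
    for field, expected in REFUND_SCHEMA.items():
        # Exact type match, so a JSON `true` is not accepted where an int is required.
        if type(data[field]) is not expected:
            raise ValueError(f"{field} must be {expected.__name__}")
    if data["amount_cents"] <= 0:
        raise ValueError("amount_cents must be positive")  # reject invalid states
    return data


def needs_approval(refund: dict) -> bool:
    """High-risk actions wait for a human; low-risk ones may proceed."""
    return refund["amount_cents"] >= APPROVAL_THRESHOLD_CENTS
```

In this arrangement the tool that issues the refund only ever receives the dictionary returned by `parse_refund`, and only after `needs_approval` has been consulted.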


Keep the framework connected.

Each factor is useful alone, but the system only becomes production-grade when build, run, and govern controls reinforce each other.
