Safety and Guardrails for AI Agents

Principle

Guardrails belong around the system boundary, not only inside the prompt.

Why it matters

Agents act on untrusted input, call tools, and may touch sensitive systems. Prompt instructions help, but they are not enough. Production safety comes from layered controls: least privilege, validation, approvals, isolation, content policy, and fast shutdown when behaviour drifts.

Build this

Threat models for prompt injection, data exfiltration, unsafe tool use, policy violations, and destructive actions.
Input classification and output validation for structured data, generated content, and external messages.
Permission boundaries by user, tenant, tool, environment, and action risk.
Circuit breakers for repeated failures, abnormal spend, policy hits, and unsafe outputs.
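
One way to picture the permission boundary above is a deny-by-default check that runs outside the model, before any tool call is dispatched. The sketch below is illustrative only: Risk, Grant, ToolRequest, is_allowed, and the sample GRANTS table are hypothetical names, and a real system would load grants from its own policy store.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    """Hypothetical action-risk tiers; higher means more dangerous."""
    READ = 1
    WRITE = 2
    DESTRUCTIVE = 3


@dataclass(frozen=True)
class Grant:
    tenant: str
    tool: str
    environment: str
    max_risk: Risk


@dataclass(frozen=True)
class ToolRequest:
    user: str
    tenant: str
    tool: str
    environment: str
    risk: Risk


def is_allowed(request: ToolRequest, grants: dict[str, list[Grant]]) -> bool:
    """Deny by default: allow only if one of the user's grants covers the request."""
    for grant in grants.get(request.user, []):
        if (
            grant.tenant == request.tenant
            and grant.tool == request.tool
            and grant.environment == request.environment
            and request.risk <= grant.max_risk
        ):
            return True
    return False


# Assumed example data: alice may only read from one tool in one tenant.
GRANTS = {
    "alice": [Grant("acme", "crm_lookup", "production", Risk.READ)],
}
```

Because the check keys on user, tenant, tool, environment, and risk together, a request that differs on any one of them falls through to the default denial, whatever the prompt says.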

Watch for

Relying on a system prompt to protect secrets or production systems.
Tool descriptions that reveal capabilities the user should not access.
Agents reading untrusted content and treating it as instructions.
Safety checks that run only after the external side effect has already occurred.
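
The ordering problem in the last item can be sketched as a delivery wrapper that cannot reach the side effect without passing the check first. This is a minimal illustration, not a real content policy: BLOCKED_MARKERS, PolicyViolation, check_outbound, and deliver are hypothetical names, and substring matching stands in for whatever classifier or policy engine a production system would use.

```python
from typing import Callable

# Placeholder policy for illustration; a real deployment would use a proper
# classifier or policy service rather than substring matching.
BLOCKED_MARKERS = ("BEGIN PRIVATE KEY", "password=")


class PolicyViolation(Exception):
    """Raised when outbound text fails the content check."""


def check_outbound(text: str) -> None:
    """Raise PolicyViolation if the text contains a blocked marker."""
    lowered = text.lower()
    for marker in BLOCKED_MARKERS:
        if marker.lower() in lowered:
            raise PolicyViolation(f"blocked marker: {marker!r}")


def deliver(text: str, send: Callable[[str], None]) -> None:
    """Run the check first; send() is only reached if the check passes."""
    check_outbound(text)
    send(text)
```

The point of the shape is that send is never called directly by agent code; the only path to the external effect goes through the check.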

Proof it works

Prompt injection tests fail to make any tool access forbidden data.
Unsafe or policy-violating output is blocked before external delivery.
A runaway loop, cost spike, or repeated tool failure trips a circuit breaker.
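
A circuit breaker of the kind described in the last item might look like the sketch below. The class name, the thresholds, and the reason strings are all assumptions chosen for illustration; the agent loop is expected to call record after every step and check before starting the next one.

```python
from dataclasses import dataclass
from typing import Optional


class CircuitOpen(Exception):
    """Raised when the breaker has tripped and the agent must stop."""


@dataclass
class CircuitBreaker:
    # Illustrative limits; tune them per agent and environment.
    max_consecutive_failures: int = 3
    max_spend_usd: float = 5.0
    max_steps: int = 50
    consecutive_failures: int = 0
    spend_usd: float = 0.0
    steps: int = 0
    tripped_reason: Optional[str] = None

    def record(self, *, ok: bool, cost_usd: float) -> None:
        """Update counters after one agent step and trip if a limit is hit."""
        self.steps += 1
        self.spend_usd += cost_usd
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        if self.tripped_reason is not None:
            return  # stay tripped and keep the first reason
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.tripped_reason = "repeated tool failure"
        elif self.spend_usd > self.max_spend_usd:
            self.tripped_reason = "cost spike"
        elif self.steps >= self.max_steps:
            self.tripped_reason = "runaway loop"

    def check(self) -> None:
        """Call before each step; raises once the breaker has tripped."""
        if self.tripped_reason is not None:
            raise CircuitOpen(self.tripped_reason)
```

Once tripped, this breaker stays open until something outside the agent resets it, which keeps the shutdown decision out of the model's hands.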

Implementation checklist

Apply least privilege to every tool and credential.

Run untrusted work in a sandbox or constrained environment.

Validate structured output with schemas and reject invalid states.

Keep high-risk actions behind approval and make rollback part of the action design.
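
The validation and approval items can be sketched together: parse the model's structured output strictly, reject anything outside the expected shape, and flag high-risk actions for human approval before they run. Everything named here is hypothetical (ALLOWED_ACTIONS, APPROVAL_THRESHOLD, ProposedAction, parse_action, needs_approval), and the hand-written checks stand in for a schema library if the project already uses one.

```python
import json
import math
from dataclasses import dataclass

ALLOWED_ACTIONS = {"refund", "close_ticket"}  # assumed action vocabulary
APPROVAL_THRESHOLD = 100.0  # assumed limit above which a refund needs sign-off


@dataclass(frozen=True)
class ProposedAction:
    action: str
    ticket_id: str
    amount: float


def parse_action(raw: str) -> ProposedAction:
    """Parse model output strictly; raise ValueError on any invalid state."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"action", "ticket_id", "amount"}:
        raise ValueError("unexpected or missing fields")
    action, ticket_id, amount = data["action"], data["ticket_id"], data["amount"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action")
    if not isinstance(ticket_id, str) or not ticket_id:
        raise ValueError("ticket_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount must be a number")
    if not math.isfinite(amount) or amount < 0:
        raise ValueError("amount must be finite and non-negative")
    return ProposedAction(action, ticket_id, float(amount))


def needs_approval(proposed: ProposedAction) -> bool:
    """High-risk here means a refund above the assumed threshold."""
    return proposed.action == "refund" and proposed.amount > APPROVAL_THRESHOLD
```

A caller would run parse_action on every model response, route anything where needs_approval is true to a human, and execute only what passes both steps.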

Related dictionary terms

Sandbox, Permission mode, Tool call

Keep agents inside clear safety boundaries