Model Serving Layer for AI Agents

Principle

Treat models as replaceable runtime dependencies, not as the architecture of the product.

Why it matters

Provider APIs, prices, context windows, tool support, safety behaviour, and latency profiles change. If feature code talks directly to one provider SDK, every model change becomes a risky product change. A serving layer lets you move that risk into a narrow, testable boundary.

Build this

A typed request and response contract for chat, structured output, tool calls, streaming, and usage data.
Provider adapters that normalise errors, retry semantics, model capabilities, and token accounting.
A routing policy that chooses models by task class, budget, latency target, context length, and fallback priority.
A capability registry for JSON mode, tool calling, vision, reasoning effort, context size, and cache support.
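
The four pieces above could fit together roughly as follows. This is a minimal sketch, not a reference design: every name in it (ChatRequest, ChatResponse, Capabilities, REGISTRY, ROUTES, route) and every model identifier, price, and latency figure is hypothetical. Provider adapters, not shown here, would translate a ChatRequest into each provider's call and map the reply back into a ChatResponse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    """One capability-registry entry (all values are illustrative)."""
    json_mode: bool
    tool_calling: bool
    vision: bool
    context_tokens: int
    cost_per_1k_tokens: float   # assumed blended price for the sketch
    typical_latency_ms: int


@dataclass(frozen=True)
class ChatRequest:
    """Provider-neutral request built by feature code."""
    task_class: str
    messages: tuple[dict, ...]
    needs_tools: bool = False
    needs_json: bool = False
    est_input_tokens: int = 0
    max_latency_ms: Optional[int] = None
    max_cost_per_1k: Optional[float] = None


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class ChatResponse:
    """Provider-neutral response that every adapter must return."""
    text: str
    provider: str
    model: str
    usage: Usage
    tool_calls: tuple[dict, ...] = ()


# Hypothetical registry keyed by "provider/model".
REGISTRY = {
    "alpha/large": Capabilities(True, True, True, 200_000, 0.015, 2500),
    "alpha/small": Capabilities(True, True, False, 128_000, 0.001, 600),
    "beta/base": Capabilities(False, False, False, 32_000, 0.0005, 400),
}

# Routing policy: fallback priority per task class, most preferred first.
ROUTES = {
    "summarise": ["beta/base", "alpha/small", "alpha/large"],
    "agent_step": ["alpha/small", "alpha/large"],
}


def rejection_reason(req: ChatRequest, caps: Capabilities) -> Optional[str]:
    """Return why a model cannot serve the request, or None if it can."""
    if req.needs_tools and not caps.tool_calling:
        return "no tool calling"
    if req.needs_json and not caps.json_mode:
        return "no JSON mode"
    if req.est_input_tokens > caps.context_tokens:
        return "context window too small"
    if req.max_latency_ms is not None and caps.typical_latency_ms > req.max_latency_ms:
        return "too slow for latency target"
    if req.max_cost_per_1k is not None and caps.cost_per_1k_tokens > req.max_cost_per_1k:
        return "over budget"
    return None


def route(req: ChatRequest) -> list[str]:
    """Return the eligible models for a request in fallback-priority order."""
    candidates = [
        model for model in ROUTES[req.task_class]
        if rejection_reason(req, REGISTRY[model]) is None
    ]
    if not candidates:
        raise LookupError(f"no eligible model for task class {req.task_class!r}")
    return candidates
```

Because route filters on the registry before returning fallbacks, a fallback that lacks a required capability is never offered, which addresses the "silently drop tool support" failure mode at the routing step rather than after the call.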

Watch for

Prompt code importing provider SDKs directly.
Fallbacks that return a different schema or silently drop tool support.
Model upgrades shipped without a baseline eval or latency comparison.
Cost reports that cannot explain which feature, tenant, or task spent the tokens.
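
The first warning sign, direct SDK imports in prompt or feature code, can be caught mechanically. The sketch below uses Python's standard ast module to list such imports in a source string. The BANNED_SDKS set and the provider_imports function are assumptions made for illustration; a real check would use whatever package names your adapters wrap and would run over the feature directories in CI.

```python
import ast

# Assumed list of provider SDK top-level packages; adjust to your own stack.
BANNED_SDKS = {"openai", "anthropic"}


def provider_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for every import of a banned SDK."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import, not a relative one.
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in BANNED_SDKS:
                hits.append((node.lineno, name))
    return sorted(hits)
```

Only the adapter package would be exempt from this check, so the serving layer stays the single place where provider SDKs are imported.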

Proof it works

A model can be swapped by changing configuration alone, with no code change, for at least one task path.
Traces record provider, model, routing reason, retry count, latency, and token usage.
Failover has been tested against provider timeout, rate limit, schema failure, and safety refusal cases.
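
The failover cases listed above can be exercised with a fake provider call once adapters raise normalised errors. The sketch below is one possible policy, stated as an assumption rather than a recommendation: timeouts and rate limits are retried on the same model, a schema failure moves to the next model, and a safety refusal is surfaced to the caller instead of being retried elsewhere. The error classes, Trace, and call_with_failover are hypothetical names, and the trace records only a subset of the fields listed above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


class ServingError(Exception):
    """Base class for normalised errors raised by provider adapters."""


class ProviderTimeout(ServingError):
    pass


class RateLimited(ServingError):
    pass


class SchemaFailure(ServingError):
    pass


class SafetyRefusal(ServingError):
    pass


@dataclass
class Trace:
    attempts: list = field(default_factory=list)  # one dict per attempt


def call_with_failover(models: list[str], call: Callable[[str], str],
                       trace: Trace, retries_per_model: int = 1) -> str:
    """Try each model in priority order, recording every attempt."""
    for model in models:
        for retry in range(retries_per_model + 1):
            started = time.perf_counter()
            outcome = "ok"
            try:
                return call(model)
            except ServingError as exc:
                outcome = type(exc).__name__
                if isinstance(exc, SafetyRefusal):
                    raise          # assumed policy: surface refusals as-is
                if isinstance(exc, SchemaFailure):
                    break          # assumed policy: skip to the next model
                # Timeouts and rate limits fall through and are retried.
            finally:
                trace.attempts.append({
                    "model": model,
                    "retry": retry,
                    "outcome": outcome,
                    "latency_ms": (time.perf_counter() - started) * 1000,
                })
    raise ServingError("all candidate models failed")
```

A test can then pass in a function that raises a chosen error for a chosen model and assert on both the returned value and the recorded attempts.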

Implementation checklist

Define the application contract first, then map each provider into it.

Keep prompts and routing rules versioned together so output changes can be traced.

Log every model selection as a decision, recording the candidates considered and the reason the chosen model won, not just as metadata.

Run upgrade evals before changing the default model for a production task.
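
The last checklist item can be enforced as a gate rather than left as a habit. The sketch below compares a candidate model's eval results against the current default's baseline and blocks the change if the pass rate drops or median latency grows past a threshold. EvalResult, upgrade_allowed, and both default thresholds are hypothetical; how each eval case is scored is left to your own harness.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class EvalResult:
    passed: bool
    latency_ms: float


def upgrade_allowed(baseline: list[EvalResult], candidate: list[EvalResult],
                    max_pass_rate_drop: float = 0.0,
                    max_latency_ratio: float = 1.2) -> tuple[bool, list[str]]:
    """Return (allowed, reasons); reasons is empty when the gate passes."""

    def pass_rate(results: list[EvalResult]) -> float:
        return sum(r.passed for r in results) / len(results)

    def p50(results: list[EvalResult]) -> float:
        return median(r.latency_ms for r in results)

    reasons = []
    if pass_rate(candidate) < pass_rate(baseline) - max_pass_rate_drop:
        reasons.append("pass rate regressed against baseline")
    if p50(candidate) > p50(baseline) * max_latency_ratio:
        reasons.append("median latency exceeds allowed ratio")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict keeps the gate's output usable as a logged decision when a default model is changed or held back.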

Related dictionary terms

Model provider, Model provider request, Token
