Build / Design and Development

01

Model Serving Layer

Make model choice a runtime decision

Put every model provider behind a stable serving contract so the product can route, fail over, compare, and upgrade models without rewriting feature code.

Principle

Treat models as replaceable runtime dependencies, not as the architecture of the product.

Why it matters

Provider APIs, prices, context windows, tool support, safety behaviour, and latency profiles change. If feature code talks directly to one provider SDK, every model change becomes a risky product change. A serving layer lets you move that risk into a narrow, testable boundary.

Build this

  • A typed request and response contract for chat, structured output, tool calls, streaming, and usage data.
  • Provider adapters that normalise errors, retry semantics, model capabilities, and token accounting.
  • A routing policy that chooses models by task class, budget, latency target, context length, and fallback priority.
  • A capability registry for JSON mode, tool calling, vision, reasoning effort, context size, and cache support.

Watch for

  • Prompt code importing provider SDKs directly.
  • Fallbacks that return a different schema or silently drop tool support.
  • Model upgrades shipped without a baseline eval or latency comparison.
  • Cost reports that cannot explain which feature, tenant, or task spent the tokens.

Proof it works

  • A model can be swapped by configuration for at least one task path.
  • Traces record provider, model, routing reason, retry count, latency, and token usage.
  • Failover has been tested against provider timeout, rate limit, schema failure, and safety refusal cases.

Implementation checklist

01

Define the application contract first, then map each provider into it.

02

Keep prompts and routing rules versioned together so output changes can be traced.

03

Log every model selection as a decision, not just as metadata.

04

Run upgrade evals before changing the default model for a production task.

Related dictionary terms

Keep the framework connected.

Each factor is useful alone, but the system only becomes production-grade when build, run, and govern controls reinforce each other.

Return to the hub