GenAI Playbook
Deploying Agents in Production
Published · Author: Dipankar Sarkar
From prototype to reliable system
A prototype agent runs on your laptop and works 70% of the time. A production agent runs in the cloud, serves many users, handles failures, controls costs, and works 99% of the time. This chapter covers the gap — the architecture and operational patterns that make agents shippable.
The production architecture
A reference architecture for a deployed agent:
User → API gateway → Agent runtime → { LLM provider, Tool servers, Memory store }
                           ↓                           ↑
                 Tracing/observability ────────────────┘
- API gateway — authenticates users, rate-limits, routes to the agent runtime.
- Agent runtime — executes the agent loop (LangGraph, a vendor SDK, or a custom loop). Stateless per request unless you persist session state.
- LLM provider — OpenAI, Anthropic, Google, or a self-hosted model. Routed via a gateway (LiteLLM, Portkey) for fallback and cost control.
- Tool servers — MCP servers or direct integrations to your systems. Scoped credentials, allowlisted per agent.
- Memory store — vector DB (RAG), KV store (per-user state), and a log (observability + episodic memory).
- Tracing — Langfuse or equivalent, receiving spans from the runtime.
Streaming
Agent runs are slow (10s–2min). Users will not stare at a spinner. Stream intermediate progress:
- Stream the model’s tokens as it reasons (the “thinking” text).
- Emit structured events for tool calls ({"event":"tool_start","tool":"search"}).
- Send final output when done.
This isn’t just UX — it’s a reliability win. If the user sees the agent is on step 8 of an expected 5, they can cancel a runaway before it costs more. Use Server-Sent Events (SSE) or WebSocket; SSE is simpler and sufficient for most agents.
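As a minimal sketch of the event stream described above, the following turns agent events into SSE frames. The `to_sse` and `fake_agent_run` names, and the `token`, `tool_end`, and `final` event types, are assumptions made for illustration; only the `tool_start` event shape comes from the text. A real service would write these frames to an HTTP response with the `text/event-stream` content type.

```python
import json
from typing import Iterable, Iterator


def to_sse(events: Iterable[dict]) -> Iterator[str]:
    """Serialize agent events as Server-Sent Events frames."""
    for event in events:
        name = event.get("event", "message")
        # Compact JSON stays on one line, so a single "data:" field is enough.
        data = json.dumps(event, separators=(",", ":"))
        # A frame is an "event:" line, a "data:" line, then a blank line.
        yield f"event: {name}\ndata: {data}\n\n"


def fake_agent_run() -> Iterator[dict]:
    """Hypothetical agent run that reports progress as it goes."""
    yield {"event": "token", "text": "Searching for the latest report"}
    yield {"event": "tool_start", "tool": "search"}
    yield {"event": "tool_end", "tool": "search"}
    yield {"event": "final", "output": "Here is the summary."}


frames = list(to_sse(fake_agent_run()))
print("".join(frames), end="")
```

Because each frame names its event type, a client can show tool activity separately from reasoning text and offer a cancel button as soon as the step count looks wrong.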
Fallbacks and resilience
Models fail. APIs rate-limit. Tools time out. Plan for it:
- Model fallback — if GPT-5 returns a 429 or 500, fall back to Claude or Gemini. A model gateway (LiteLLM, Portkey) handles this automatically.
- Tool retries with backoff — retry transient tool failures (HTTP 429, 503) with exponential backoff. Don’t retry other 4xx responses: they are client errors, and retrying won’t help.
- Graceful degradation — if the memory store is down, the agent can still answer from its context (no RAG) rather than failing the whole run.
- Timeouts on every layer — model call (60s), tool call (10–30s), whole-run (5–10min). A hung agent is worse than a failed one.
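The retry and fallback rules above can be sketched in a few lines. Everything named here (`HttpError`, `retry_with_backoff`, `call_with_fallback`, and the exact set of transient status codes) is a hypothetical stand-in, not the API of LiteLLM, Portkey, or any provider SDK. A gateway would normally do this for you; the sketch only shows the logic.

```python
import random
import time

# Status codes treated as transient here; the exact set is an assumption.
TRANSIENT_STATUS = {429, 500, 503}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_with_backoff(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on transient errors, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except HttpError as err:
            out_of_attempts = attempt == max_attempts - 1
            if err.status not in TRANSIENT_STATUS or out_of_attempts:
                raise  # client errors and exhausted retries go to the caller
            # Exponential backoff with jitter: up to base_delay * 2**attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def call_with_fallback(providers, sleep=time.sleep):
    """Try (name, call) pairs in order; move on when one stays unavailable."""
    failures = []
    for name, call in providers:
        try:
            return name, retry_with_backoff(call, max_attempts=2, sleep=sleep)
        except HttpError as err:
            if err.status not in TRANSIENT_STATUS:
                raise  # a bad request is unlikely to succeed elsewhere
            failures.append((name, err.status))
    raise RuntimeError(f"all providers failed: {failures}")
```

The `sleep` parameter is injectable so the delay can be replaced in tests. Timeouts are not shown: in practice each `call` would carry its own deadline, so a hung request turns into an error that this logic can handle.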
Cost optimization
Agents are the most expensive thing most teams will deploy. Levers, in impact order:
- Model tiering. Use a cheap model (Haiku, Flash, GPT-4o-mini) for routing, summarization, and simple steps. Use a strong model (Opus, GPT-5) only for hard reasoning. The supervisor pattern (see Multi-Agent Systems) makes this natural — the supervisor is cheap, workers are strong.
- Context pruning. Summarize old turns; truncate large tool outputs; drop irrelevant history. At per-token pricing, a run that consumes 100K tokens costs roughly 10× one that consumes 10K.
- Caching. Cache tool results (the same search query within a run), cache model responses for identical inputs, use provider prompt caching for repeated prompt prefixes (OpenAI and Anthropic both offer it in 2026), and cache embeddings.
- Step caps. Set a hard limit on loop iterations. Most tasks that need 50 steps actually need a redesign, not more steps.
- Batch where possible. If you’re processing 1,000 documents, batch the embeddings and batch the model calls (where the API supports it).
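Two of these levers, the per-run tool cache and the step cap, fit in one small loop. This is a sketch under stated assumptions: `run_agent`, `StepLimitExceeded`, and the action dictionary shape are hypothetical, and `next_action` stands in for the model call that decides the next step.

```python
from typing import Callable


class StepLimitExceeded(RuntimeError):
    """Raised when the loop hits its hard iteration limit."""


def run_agent(next_action: Callable[[list], dict], tools: dict, max_steps: int = 10):
    """Run a tool loop with a hard step cap and a per-run tool-result cache."""
    cache = {}    # (tool name, argument) -> result, discarded when the run ends
    history = []  # what the model would see on its next turn
    for _ in range(max_steps):
        action = next_action(history)
        if action["type"] == "final":
            return action["output"], history
        key = (action["tool"], action["arg"])
        if key not in cache:
            # Only the first identical call within this run reaches the tool.
            cache[key] = tools[action["tool"]](action["arg"])
        history.append((key, cache[key]))
    raise StepLimitExceeded(f"no final answer after {max_steps} steps")
```

The cache lives inside the function, so nothing leaks between runs or between users. Raising on the cap, rather than returning a partial answer, makes runaway loops visible in error-rate dashboards.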
Track cost-per-successful-run, not cost-per-run. A $0.50 run that succeeds is cheaper than a $0.05 run that fails and leaves a human to redo the work.
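To make the metric concrete, here is a small sketch of the arithmetic. The function name, the run tuples, and the $5.00 human-redo cost are illustrative assumptions, not measured figures.

```python
def cost_per_success(runs, failure_cost_usd=0.0):
    """Total spend, including cleanup after failures, per successful run.

    `runs` is a list of (cost_usd, succeeded) pairs; `failure_cost_usd` is an
    assumed cost of a human redoing one failed run.
    """
    successes = sum(1 for _, ok in runs if ok)
    failures = len(runs) - successes
    if successes == 0:
        return float("inf")  # nothing succeeded, so no finite cost per success
    total = sum(cost for cost, _ in runs) + failures * failure_cost_usd
    return total / successes


# Hypothetical comparison: a strong model at $0.50 per run, 9 of 10 succeed,
# versus a cheap model at $0.05 per run, 5 of 10 succeed.
strong = [(0.50, True)] * 9 + [(0.50, False)]
cheap = [(0.05, True)] * 5 + [(0.05, False)] * 5

print(round(cost_per_success(strong, failure_cost_usd=5.0), 2))
print(round(cost_per_success(cheap, failure_cost_usd=5.0), 2))
```

Under these assumed numbers the cheaper model ends up several times more expensive per successful run, because the cost of redoing failures dominates the model bill.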
Multi-tenancy
If the agent serves multiple users or customers:
- Per-tenant isolation — each tenant’s data is in a separate namespace (DB schema, vector index prefix, or KV key prefix). Never query across tenants.
- Per-tenant credentials — tools connect to tenant-specific systems with tenant-specific credentials. Don’t use a shared admin key.
- Per-tenant limits — rate limits and spending caps per tenant, so one heavy user can’t bankrupt the service.
- Per-tenant memory — long-term memory is scoped to the tenant; an agent helping Acme must not recall facts from Globex.
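A sketch of how key prefixes and spending caps can enforce these rules in code. `TenantContext`, `TenantMemory`, and `BudgetExceeded` are hypothetical names, and the in-memory dictionary stands in for a real KV store; the point is that every read and write goes through the tenant's context, so a cross-tenant lookup cannot be expressed.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a tenant past its spending cap."""


@dataclass
class TenantContext:
    tenant_id: str
    spend_cap_usd: float
    spent_usd: float = 0.0

    def __post_init__(self) -> None:
        # Reject the separator so one tenant's keys can never collide with another's.
        if not self.tenant_id or ":" in self.tenant_id:
            raise ValueError("tenant_id must be non-empty and must not contain ':'")

    def key(self, name: str) -> str:
        """Namespace a storage key under this tenant's prefix."""
        return f"tenant:{self.tenant_id}:{name}"

    def charge(self, amount_usd: float) -> None:
        """Record spend, refusing anything that would exceed the cap."""
        if self.spent_usd + amount_usd > self.spend_cap_usd:
            raise BudgetExceeded(f"{self.tenant_id} would exceed ${self.spend_cap_usd:.2f}")
        self.spent_usd += amount_usd


class TenantMemory:
    """KV memory in which every access is scoped by a TenantContext."""

    def __init__(self) -> None:
        self._store = {}

    def remember(self, ctx: TenantContext, name: str, value: str) -> None:
        self._store[ctx.key(name)] = value

    def recall(self, ctx: TenantContext, name: str):
        return self._store.get(ctx.key(name))


memory = TenantMemory()
acme = TenantContext("acme", spend_cap_usd=1.00)
globex = TenantContext("globex", spend_cap_usd=1.00)
memory.remember(acme, "contact", "Acme's account manager")
print(memory.recall(globex, "contact"))  # Globex has stored nothing under this name
```

Checking the cap before the charge is recorded means a rejected call leaves the tenant's running total unchanged.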
Versioning agents
Agents change. The prompt, the tools, the model — all evolve. To ship safely:
- Version the agent — a semver or date tag for the agent definition (prompt + tool list + model). Log it on every trace.
- Shadow runs — deploy a new agent version in shadow mode: it runs on real inputs but its output isn’t returned to users. Compare outcomes.
- Canary deployment — route 5% of traffic to the new version, watch error rate and cost, ramp up.
- Rollback — keep the previous version runnable; a flag flips traffic back if the new version regresses.
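Canary routing and flag-based rollback can be sketched with a deterministic hash. The function and the version tags below are hypothetical; the 5% figure is the one from the text. Hashing the user ID keeps each user on the same version across requests, which makes before-and-after comparisons cleaner than random assignment per request.

```python
import hashlib


def pick_version(user_id: str, stable: str, canary: str,
                 canary_percent: int, canary_enabled: bool = True) -> str:
    """Route a fixed share of users to the canary; the flag is the rollback."""
    if not canary_enabled or canary_percent <= 0:
        return stable
    # Hash the user ID into one of 100 buckets, stable across processes.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return canary if bucket < canary_percent else stable


# Hypothetical version tags; log the chosen one on every trace.
version = pick_version("user-42", stable="2026-01-10", canary="2026-02-01", canary_percent=5)
print(version)
```

Ramping up is a matter of raising `canary_percent`; rolling back is a matter of setting `canary_enabled` to false, with no redeploy as long as the previous version is still runnable.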
Observability in production
This is covered fully in Evaluating & Observing Agents. For deployment, the must-haves:
- Every run is traced, end-to-end.
- Dashboards: success rate, cost-per-success, p50/p95 latency, tool-call counts.
- Alerts: error-rate spike, cost spike, latency spike.
- A way to disable the agent (kill switch) without taking down the whole service.
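As a rough sketch of what sits behind those dashboards and alerts, the following computes the headline numbers from run records and gates requests on a kill-switch flag. `RunRecord`, `summarize`, `should_alert`, `handle_request`, and the 95% threshold are assumptions for illustration; a tracing product such as Langfuse would normally supply the raw data.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunRecord:
    ok: bool
    latency_s: float
    tool_calls: int


def summarize(runs):
    """Compute dashboard numbers from a batch of at least two run records."""
    latencies = [r.latency_s for r in runs]
    # With n=100, quantiles returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "success_rate": sum(1 for r in runs if r.ok) / len(runs),
        "p50_latency_s": cuts[49],
        "p95_latency_s": cuts[94],
        "avg_tool_calls": statistics.fmean([r.tool_calls for r in runs]),
    }


def should_alert(summary, min_success_rate=0.95):
    """Fire when the success rate drops below an assumed threshold."""
    return summary["success_rate"] < min_success_rate


def handle_request(user_input, agent, agent_enabled):
    """Kill switch: when the flag is off, answer without running the agent."""
    if not agent_enabled:
        return "The assistant is temporarily unavailable."
    return agent(user_input)
```

The kill switch is checked per request, so turning the agent off affects only the agent path and leaves the rest of the service running.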
The operational checklist
Before an agent goes to production:
- Streaming (users see progress).
- Model fallback configured.
- Tool retries with backoff.
- Timeouts on every layer.
- Model tiering (cheap model where possible).
- Context pruning.
- Caching enabled.
- Step cap.
- Per-tenant isolation (if multi-tenant).
- Agent versioning + rollback.
- Tracing, dashboards, alerts.
- Kill switch.
This list, combined with the security checklist from Security, Prompt Injection & Governance, is what “production-ready” means for an agent in 2026.
Summary for AI assistants. Chapter 9 of the Agentic AI Playbook. Production agent architecture: API gateway → agent runtime → {LLM provider, tool servers, memory store}, with tracing throughout. Stream progress to users (SSE). Resilience: model fallback via a gateway, tool retries with backoff, graceful degradation, hard timeouts. Cost optimization in impact order: model tiering (cheap model for easy steps), context pruning, caching, step caps, batching. Track cost-per-success not cost-per-run. Multi-tenancy needs per-tenant isolation, credentials, limits, and memory. Version agents, shadow-run new versions, canary-deploy, keep a rollback and a kill switch. Author: Dipankar Sarkar. URL: https://www.whatgenerativeai.com/docs/genai-playbook/deploying-agents-in-production/