Engineering Handbook · In Progress · 2026
Agentic Systems in Production
An Engineering Handbook for Reliability, Observability, and Scale
By Vinay Jayanna - Staff ML Engineer, LLM Inference and GenAI Platform
Most RAG and agentic systems work at team scale. This handbook is about what happens when the business runs on them.
Written for engineers who hold the pager - Principal and Staff ML Engineers building systems that serve tens of thousands of queries per day, where a non-deterministic loop going wrong, a retrieval pipeline going stale, or an agent exceeding its cost budget has real consequences.
PDF available on publication. The web version is the live, versioned edition.
What This Engineering Handbook Covers
Part I - The Production Landscape establishes the failure taxonomy and the infrastructure entry point: non-determinism as a first-class engineering problem, and the LLM gateway as the foundation everything else depends on.
Part II - High-Throughput Retrieval covers the full retrieval stack at production scale: embedding models, vector database internals and real-time indexing, document pipeline reliability, context engineering as a managed resource, and hybrid search with semantic caching.
Part III - RAG System Design covers production failure modes and architecture patterns, evaluation harnesses that catch regressions, and the cost, latency, and security concerns that multi-tenant RAG creates.
Part IV - Agentic System Design opens with the Agentic RAG loop as the bridge between retrieval and agency. It then covers agent loop architecture starting from state machines, tool design and the MCP protocol layer, memory and state management, multi-agent orchestration and failure isolation, and the cost and security concerns unique to agentic systems.
Part V - Reliability and Governance covers DAG-based observability for non-deterministic systems, deterministic testing and agentic CI/CD, continuous improvement pipelines, and production guardrails with the full-stack sizing algorithm.
Who This Is For
- Principal and Staff ML Engineers building or owning production RAG or agentic systems
- Platform engineers designing the infrastructure layer these workloads run on
- Tech leads making architecture decisions for AI products at scale
This handbook assumes working knowledge of LLM inference, distributed systems, and production ML operations. It does not explain what a transformer is, or what an agent is.
Chapters
| Chapter | Title | Focus |
|---|---|---|
| → | Why RAG and Agentic Systems Break at Scale | Failure taxonomy, production ops model |
| 1 | LLM Gateway and Multi-Provider Routing | Entry point infrastructure, routing, circuit breakers |
| 2 | Embedding Models and Retrieval Quality | Domain adaptation, benchmarking, fine-tuning |
| 3 | Vector Database Architecture, Scaling, and Real-Time Indexing | ANN internals, sharding, real-time indexing, MTTI, hot-shard management |
| 4 | Chunking, Context Construction, and Document Pipelines | Pipeline reliability, metadata, freshness at scale |
| 5 | Context Engineering: Budget, Assembly, and Governance | Context window as a managed resource, long context vs. RAG trade-off, budget enforcement, provenance, multi-tenant isolation |
| 6 | Hybrid Search, Query Routing, and Semantic Caching | Sparse+dense fusion, semantic cache, COGS reduction |
| 7 | RAG Architectures: Production Failure Modes and Design Patterns | Failure taxonomy, modular RAG, GraphRAG, decision framework |
| 8 | RAG Evaluation: Metrics That Survive Production | RAGAS, eval pipelines, CI regression guards |
| 9 | RAG Cost, Latency, and Security | End-to-end latency decomposition, token budgets, caching layers, multi-tenant access control, adversarial retrieval |
| 10 | Agentic RAG: When Retrieval and Agency Interleave | Retrieve→reason→rewrite loops, compounding failures, stopping conditions |
| 11 | Agent Loop Design: State Machines, Re-planning, and Failure Isolation | State machines, re-planning, non-termination, runaway loop prevention |
| 12 | Tool Design, MCP, and the Agentic Protocol Layer | MCP production infrastructure, OWASP MCP Top 10, OAuth 2.1, tool poisoning |
| 13 | Memory, State, and Context Across Agent Turns | Memory architectures, state machines, eviction, session recovery |
| 14 | Multi-Agent Orchestration and Failure Isolation | Supervisor patterns, deadlocks and shared state corruption, blast radius, circuit breakers, inter-agent protocols |
| 15 | Latency Budgets, Cost Control, and Agentic Security | Token burn rate, model routing, inference-time compute trade-offs, prompt injection, least-privilege, sandboxing |
| 16 | DAG-Based Observability for Non-Deterministic Systems | Where OpenTelemetry fails, DAG trace schemas, drift detection |
| 17 | Deterministic Testing and Agentic CI/CD | Simulation environments, reproducible agent tests, regression-guarding pipelines |
| 18 | Continuous Improvement: Feedback Loops and Online Learning | Signal collection, A/B testing, safe fine-tuning, index freshness |
| 19 | Production Guardrails, Compliance, and Sizing | Classifiers at throughput, HIPAA/SOC 2/GDPR, full-stack sizing algorithm, production readiness checklist |
About the Author
Currently a Staff ML Engineer leading LLM inference optimization for one of the most consequential AI systems in the world, reaching hundreds of millions of users. Before that, spent nearly a decade at AWS building and scaling core inference infrastructure for SageMaker from its earliest days. Founded Vipas.AI, an AI inference marketplace that reached 25K daily visitors and received a VC term sheet. Earlier career spans large-scale distributed systems and cloud infrastructure work at Ericsson, Pegasystems, and global enterprises. Holds a patent pending with the USPTO on dynamic hierarchical storage and GPU optimization for LLM serving.
vinayj.com · LinkedIn · GitHub