Engineering Handbook · In Progress · 2026

Agentic Systems in Production

An Engineering Handbook for Reliability, Observability, and Scale

By Vinay Jayanna - Staff ML Engineer, LLM Inference and GenAI Platform


Most RAG and agentic systems work at team scale. This handbook is about what happens when the business runs on them.

Written for engineers who hold the pager - Principal and Staff ML Engineers building systems that serve tens of thousands of queries per day, where a non-deterministic loop going wrong, a retrieval pipeline going stale, or an agent exceeding its cost budget has real consequences.

Download

PDF available on publication. The web version is the live, versioned edition.


What This Engineering Handbook Covers

Part I - The Production Landscape establishes the failure taxonomy and the infrastructure entry point. It treats non-determinism as a first-class engineering problem and the LLM gateway as the foundation everything else depends on.

Part II - High-Throughput Retrieval covers the full retrieval stack at production scale: embedding models, vector database internals and real-time indexing, document pipeline reliability, context engineering as a managed resource, and hybrid search with semantic caching.

Part III - RAG System Design covers production failure modes and architecture patterns, evaluation harnesses that catch regressions, and the cost, latency, and security concerns that multi-tenant RAG creates.

Part IV - Agentic System Design opens with the Agentic RAG loop as the bridge between retrieval and agency. It then covers agent loop architecture starting from state machines, tool design and MCP as the agentic protocol layer, memory and state management, multi-agent orchestration and failure isolation, and the cost and security concerns unique to agentic systems.

Part V - Reliability and Governance covers DAG-based observability for non-deterministic systems, deterministic testing and agentic CI/CD, continuous improvement pipelines, and production guardrails with the full-stack sizing algorithm.


Who This Is For

  • Principal and Staff ML Engineers building or owning production RAG or agentic systems
  • Platform engineers designing the infrastructure layer these workloads run on
  • Tech leads making architecture decisions for AI products at scale

This handbook assumes working knowledge of LLM inference, distributed systems, and production ML operations. It does not explain what a transformer is, or what an agent is.


Chapters

No. | Chapter | Focus
- | Why RAG and Agentic Systems Break at Scale | Failure taxonomy, production ops model
1 | LLM Gateway and Multi-Provider Routing | Entry point infrastructure, routing, circuit breakers
2 | Embedding Models and Retrieval Quality | Domain adaptation, benchmarking, fine-tuning
3 | Vector Database Architecture, Scaling, and Real-Time Indexing | ANN internals, sharding, real-time indexing, MTTI, hot-shard management
4 | Chunking, Context Construction, and Document Pipelines | Pipeline reliability, metadata, freshness at scale
5 | Context Engineering: Budget, Assembly, and Governance | Context window as a managed resource, long context vs. RAG trade-off, budget enforcement, provenance, multi-tenant isolation
6 | Hybrid Search, Query Routing, and Semantic Caching | Sparse+dense fusion, semantic cache, COGS reduction
7 | RAG Architectures: Production Failure Modes and Design Patterns | Failure taxonomy, modular RAG, GraphRAG, decision framework
8 | RAG Evaluation: Metrics That Survive Production | RAGAS, eval pipelines, CI regression guards
9 | RAG Cost, Latency, and Security | End-to-end latency decomposition, token budgets, caching layers, multi-tenant access control, adversarial retrieval
10 | Agentic RAG: When Retrieval and Agency Interleave | Retrieve→reason→rewrite loops, compounding failures, stopping conditions
11 | Agent Loop Design: State Machines, Re-planning, and Failure Isolation | State machines, re-planning, non-termination, runaway loop prevention
12 | Tool Design, MCP, and the Agentic Protocol Layer | MCP production infrastructure, OWASP MCP Top 10, OAuth 2.1, tool poisoning
13 | Memory, State, and Context Across Agent Turns | Memory architectures, state machines, eviction, session recovery
14 | Multi-Agent Orchestration and Failure Isolation | Supervisor patterns, deadlocks and shared state corruption, blast radius, circuit breakers, inter-agent protocols
15 | Latency Budgets, Cost Control, and Agentic Security | Token burn rate, model routing, inference-time compute trade-offs, prompt injection, least-privilege, sandboxing
16 | DAG-Based Observability for Non-Deterministic Systems | Where OpenTelemetry fails, DAG trace schemas, drift detection
17 | Deterministic Testing and Agentic CI/CD | Simulation environments, reproducible agent tests, regression-guarding pipelines
18 | Continuous Improvement: Feedback Loops and Online Learning | Signal collection, A/B testing, safe fine-tuning, index freshness
19 | Production Guardrails, Compliance, and Sizing | Classifiers at throughput, HIPAA/SOC2/GDPR, full-stack sizing algorithm, production readiness checklist

About the Author

Currently a Staff ML Engineer leading LLM inference optimization for one of the most consequential AI systems in the world, reaching hundreds of millions of users. Before that, spent nearly a decade at AWS building and scaling core inference infrastructure for SageMaker from its earliest days. Founded Vipas.AI, an AI inference marketplace that reached 25K daily visitors and received a VC term sheet. Earlier career spans large-scale distributed systems and cloud infrastructure work at Ericsson, Pegasystems, and other global enterprises. Has a patent pending with the USPTO on dynamic hierarchical storage and GPU optimization for LLM serving.

vinayj.com · LinkedIn · GitHub