
Why RAG and Agentic Systems Break at Scale

You have already launched. The system works. A team is using it, value is being captured, and the business has noticed. Now comes the harder ask: scale it across the entire organization - every team, every region, every business-critical workflow, millions of users. That is when the real engineering challenges begin.


From working to load-bearing

There is a meaningful difference between an agentic system that works in production and one that the business runs on.

A system that works in production has a team that trusts it, a use case that is well-defined, and enough observability that someone notices when things go wrong. The edges are known. The failure modes are manageable. The cost is predictable.

A system the business runs on is something else. It handles workflows that cannot fail. It serves users who have no fallback. It operates across teams with different needs, data with different quality, and query distributions that nobody fully anticipated. The people who depend on it don't know it exists - they just expect it to work, the same way they expect email to work.

The gap between these two states is not closed by adding more GPUs or training more models. It is an architectural and operational gap - and it is where most of the hard engineering problems in this field live.

Think of it like the difference between a restaurant that works - the kitchen runs, customers come back, quality holds - and scaling that into a global operation across thousands of locations in dozens of countries. Even the recipes have to change: what works in one market needs to be adapted for local ingredients, local tastes, local regulations. The supply chain that worked for one kitchen cannot simply be replicated - it has to be redesigned for reliability at a scale where any single failure ripples across the whole network. The cost structure that was acceptable for one location becomes unsustainable when multiplied by thousands. The quality that was maintained by a head chef's direct attention now has to be engineered into the system - through training, tooling, measurement, and feedback loops that work without that direct oversight. What got the original restaurant to work was skill and focus. What makes the global operation work is infrastructure, cost engineering, reliability systems, and the ability to absorb constant change without breaking. These are genuinely different problems, even though both are nominally about serving food.


What the data tells us

The industry is in the middle of this transition right now - and the failure rate reflects how hard it is.

McKinsey's State of AI 2025 found that 88% of organizations regularly use AI in at least one business function. Yet 94% of respondents reported not seeing significant value from those investments. The technology is running. The results at scale are not arriving. McKinsey identifies the core blocker as what they call "pilot purgatory" - systems that work at team scale but never successfully expand to enterprise-wide impact, blocked by data quality, workflow rigidity, and the absence of operational infrastructure designed for scale.

The picture for agentic systems is sharper. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 - not because the underlying technology failed, but because of escalating costs, unclear governance, and inadequate risk controls at scale. These are engineering and architectural failures, not algorithmic ones. And Gartner simultaneously forecasts that 40% of enterprise applications will include task-specific agents by the end of 2026, up from less than 5% in 2025. The systems are being scaled. The engineering discipline to sustain them at that scale is still being built.

By 2029, Gartner predicts 70% of enterprises will deploy agentic AI as part of their core IT infrastructure. The trajectory is clear. So is the challenge: operating these systems reliably when the business depends on them is a different problem from launching them.


Where the challenges compound

When an agentic system moves from a single team to an entire organization, several things happen at once that individually seem manageable but together create the engineering challenges this handbook is built around.

The failure surface expands and becomes invisible. At team scale, someone is watching. Query distributions are narrow, data is relatively homogeneous, and failures are noticed quickly. At enterprise scale, the query distribution expands to cover use cases nobody anticipated, the data spans teams with wildly different quality standards, and answer quality degrades silently for days or weeks before anyone notices. A retrieval index that is a week stale is fine when someone is checking it daily. It is a serious problem when thousands of users depend on it and nobody has visibility into its freshness. The system appears healthy - latency is fine, error rates are nominal - while quietly giving wrong answers to people who have no reason to doubt it.
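One way to make index freshness visible is to treat it as a health signal in its own right, alongside latency and error rate. The sketch below is a minimal illustration under stated assumptions: the function names, the three status labels, and the "warn at half the allowed age" rule are all hypothetical choices, not something this handbook prescribes.

```python
from datetime import datetime, timedelta, timezone


def index_staleness(last_indexed_at: datetime, now: datetime) -> timedelta:
    """Return how long ago the retrieval index was last refreshed."""
    return now - last_indexed_at


def freshness_status(
    last_indexed_at: datetime, now: datetime, max_age: timedelta
) -> str:
    """Classify index freshness as 'fresh', 'warning', or 'stale'.

    Assumption: 'warning' starts at half of max_age so that an alert can
    fire before the index actually breaches its freshness target.
    """
    age = index_staleness(last_indexed_at, now)
    if age > max_age:
        return "stale"
    if age > max_age / 2:
        return "warning"
    return "fresh"


if __name__ == "__main__":
    now = datetime(2025, 1, 8, tzinfo=timezone.utc)
    week_old = now - timedelta(days=7)
    # A week-old index checked against a one-day freshness target.
    print(freshness_status(week_old, now, max_age=timedelta(days=1)))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test and lets the same logic run over historical data.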

Failures in these systems also compound in a way that classical systems do not. A slightly stale retrieval leads to a slightly off-target context, which leads to a subtly wrong generation, which in an agentic loop influences the next tool call, which compounds further. Each individual component looks locally reasonable. The aggregate is a failure. And because no single component crossed a threshold, nothing alerted.
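The arithmetic behind this compounding can be sketched in a few lines. The example assumes, purely for illustration, that every stage succeeds independently with the same probability; real pipelines have correlated errors, so this is a simplification and the 98% figure is a made-up input, not a measurement.

```python
def end_to_end_success(per_stage_success: float, stages: int) -> float:
    """Probability that every stage in a chain succeeds.

    Assumption: stages fail independently and share one success rate,
    so the end-to-end rate is per_stage_success ** stages.
    """
    if not 0.0 <= per_stage_success <= 1.0:
        raise ValueError("per_stage_success must be between 0 and 1")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return per_stage_success ** stages


if __name__ == "__main__":
    # Each stage looks healthy at 98%, yet the chain as a whole does not.
    for n in (1, 5, 12):
        print(n, round(end_to_end_success(0.98, n), 3))
    # prints:
    # 1 0.98
    # 5 0.904
    # 12 0.785
```

Under these assumptions a twelve-step agent loop built from stages that are each right 98% of the time is right end to end only about 78% of the time, even though no single stage would trip a per-component alert.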

Cost becomes unpredictable at the request level. At team scale, cost is manageable and roughly predictable. At enterprise scale, the query distribution includes edge cases that nobody planned for. One class of queries triggers three retrieval rounds instead of one. Another fills the context window and requires compression. An agent loop that should terminate in four steps runs for twelve before hitting a budget limit. An edge case can trigger a chain of retries that costs fifty times more than the normal path. Multiply that across millions of daily requests and the financial model built during the pilot phase no longer holds.
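A budget limit of the kind mentioned above can be sketched as a wrapper around the agent loop. Everything here is hypothetical: `run_with_budget`, `BudgetResult`, and the convention that a step returns `(done, cost)` are illustrative names and shapes, and the cost unit is whatever the caller uses.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class BudgetResult:
    steps: int        # number of steps actually executed
    cost: float       # total cost accumulated across those steps
    stopped_by: str   # "done", "max_cost", or "max_steps"


def run_with_budget(
    step: Callable[[int], Tuple[bool, float]],
    max_steps: int,
    max_cost: float,
) -> BudgetResult:
    """Run step(i) until it reports done or a budget is exhausted.

    Assumption: step(i) returns (done, cost_of_this_step). Cost is checked
    after each step, so the total can overshoot max_cost by one step.
    """
    total = 0.0
    for i in range(max_steps):
        done, cost = step(i)
        total += cost
        if done:
            return BudgetResult(i + 1, total, "done")
        if total >= max_cost:
            return BudgetResult(i + 1, total, "max_cost")
    return BudgetResult(max_steps, total, "max_steps")


if __name__ == "__main__":
    # A loop that never converges: each step costs 1.0 and never finishes.
    runaway = lambda i: (False, 1.0)
    print(run_with_budget(runaway, max_steps=12, max_cost=100.0))
```

The reason the guard matters is visible in simple arithmetic: if 1% of requests take a path that costs fifty times the normal one, mean cost per request is 0.99 × 1 + 0.01 × 50 = 1.49 times the normal-path cost.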

Non-determinism compounds with scale. A non-deterministic system that serves a hundred users a day is manageable - you can review outputs, spot patterns, and course-correct quickly. A non-deterministic system that serves a million users a day produces a distribution of outputs that is nearly impossible to monitor exhaustively. The same query asked by different users at different times can produce meaningfully different answers. At scale, the tail of that distribution - the rare but confidently wrong outputs - becomes a real operational concern. The monitoring, alerting, and testing infrastructure that worked at team scale was not designed for this.
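Two small pieces illustrate why exhaustive review stops working and what a sampled alternative might look like. This is a sketch under assumptions: the 0.1% bad-output rate is an invented input, and `should_sample` is a hypothetical helper that uses a hash of the request ID so that the same request is always either in or out of the review sample.

```python
import hashlib


def expected_bad_outputs(daily_requests: int, bad_rate: float) -> float:
    """Expected number of bad outputs per day at a given per-request rate."""
    return daily_requests * bad_rate


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for review.

    The first 8 bytes of a SHA-256 digest are mapped to a number in [0, 1);
    the request is sampled when that number falls below `rate`.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    # Same hypothetical 0.1% rate, very different operational picture.
    print(expected_bad_outputs(100, 0.001))        # about 0.1 per day
    print(expected_bad_outputs(1_000_000, 0.001))  # about 1000 per day
    print(should_sample("req-123", 0.01))
```

Hash-based sampling is one design choice among several; its main appeal here is that the decision is reproducible, so a sampled request can be re-evaluated later without storing the sampling decision.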

Tool and integration surface area multiplies. A single-team deployment might use three or four tools. An enterprise-wide deployment integrates with the data systems, APIs, and workflows of every team it serves. Each integration is a dependency, a versioning concern, a failure point, and a security boundary. The Model Context Protocol (MCP) has made it dramatically easier to connect agents to external systems - Anthropic introduced it, and OpenAI, Google, and Microsoft have since adopted it. That is genuinely useful. It also means the surface area of a production agentic system at enterprise scale is orders of magnitude larger than what was running in the pilot. Operating that surface area - handling tool failures gracefully, maintaining audit trails, enforcing access boundaries across teams - is work that most organizations have not yet built the capability to do.
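The three operational duties named above (graceful failure handling, audit trails, access boundaries) can be sketched as a single choke point that every tool call passes through. This is not MCP or any real library's API: `call_tool`, `ToolAccessError`, the per-tenant allowlist, and the in-memory audit list are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Any, Callable, Dict, List, Set


class ToolAccessError(Exception):
    """Raised when a tenant calls a tool it is not allowed to use."""


def call_tool(
    tools: Dict[str, Callable[..., Any]],
    allowed: Dict[str, Set[str]],
    tenant: str,
    tool_name: str,
    args: Dict[str, Any],
    audit_log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Invoke a tool on behalf of a tenant, recording every attempt."""
    entry: Dict[str, Any] = {
        "tenant": tenant, "tool": tool_name, "args": args,
        "status": "denied", "error": None,
    }
    # Access boundary: unknown tools and tools outside the tenant's
    # allowlist are refused before anything executes.
    if tool_name not in tools or tool_name not in allowed.get(tenant, set()):
        audit_log.append(entry)
        raise ToolAccessError(f"{tenant!r} may not call {tool_name!r}")
    try:
        result = tools[tool_name](**args)
    except Exception as exc:
        # Graceful failure: the caller gets a structured error, not a crash.
        entry["status"] = "failed"
        entry["error"] = repr(exc)
        audit_log.append(entry)
        return {"ok": False, "error": repr(exc)}
    entry["status"] = "ok"
    audit_log.append(entry)
    return {"ok": True, "result": result}


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    tools = {"lookup": lambda key: key.upper()}
    allowed = {"sales": {"lookup"}}
    print(call_tool(tools, allowed, "sales", "lookup", {"key": "abc"}, log))
    print(log[-1]["status"])
```

In a real deployment the audit log would go to durable storage and the allowlist would come from an identity system; the point of the sketch is only that denied, failed, and successful calls all leave a record.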

Action reliability and blast radius become engineering constraints. At team scale, an agent that takes a wrong action is caught quickly and corrected. At enterprise scale, agents are taking autonomous actions - modifying records, triggering workflows, calling external APIs - across thousands of users simultaneously. The blast radius of a bad action is no longer local. An agent loop with a subtle logic error that fires ten times a day on one team fires ten thousand times a day across the organization. Blast radius containment, versioned tool contracts, tenant-level isolation, and rollback procedures that actually work across a distributed system are not policy decisions - they are engineering decisions with direct latency, cost, and reliability implications. Systems that were not designed for failure isolation at scale will fail at scale, and recovery will be expensive.

Blast radius at scale

At team scale, a bad agent action is an incident. At enterprise scale, it can be a systemic failure affecting thousands of users before anyone can intervene. The difference between these outcomes is how the system was engineered - not how it was prompted.
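One common containment pattern is a per-tenant circuit breaker: when an agent's actions keep failing for one tenant, that tenant's actions are paused while everyone else continues. The sketch below is a minimal illustration, not a production design; the class name, the sliding-window rule, and the explicit `now` argument (seconds, supplied by the caller) are assumptions made for clarity and testability.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TenantCircuitBreaker:
    """Block a tenant's agent actions after too many recent failures."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        # Failure timestamps are tracked per tenant, so one tenant's
        # failures never block another tenant.
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, tenant: str, now: float) -> None:
        """Record a failed action for this tenant at time `now`."""
        self._failures[tenant].append(now)

    def allow(self, tenant: str, now: float) -> bool:
        """Return True if the tenant may take another action at `now`."""
        failures = self._failures[tenant]
        # Drop failures that have aged out of the sliding window.
        while failures and now - failures[0] > self.window_seconds:
            failures.popleft()
        return len(failures) < self.max_failures


if __name__ == "__main__":
    breaker = TenantCircuitBreaker(max_failures=3, window_seconds=60.0)
    for t in (1.0, 2.0, 3.0):
        breaker.record_failure("team-a", now=t)
    print(breaker.allow("team-a", now=4.0))    # False: breaker is open
    print(breaker.allow("team-b", now=4.0))    # True: other tenants unaffected
    print(breaker.allow("team-a", now=100.0))  # True: failures aged out
```

Keying the breaker by tenant is what limits the blast radius: a subtle logic error that trips the breaker for one team stops firing there without taking the whole organization offline.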


What this engineering handbook is built for

The chapters that follow address these challenges layer by layer - from the gateway infrastructure that handles traffic at enterprise scale, through the retrieval stack, the RAG and agentic architecture, the observability systems that surface problems before users do, and the engineering discipline required to keep the whole system running reliably when the business depends on it.

The intended reader is a Principal or Staff Engineer who has already shipped an agentic system and is now responsible for scaling it. The chapters assume working knowledge of LLM inference, agentic systems, RAG pipelines, distributed systems, and production operations. They address the problems that emerge when agents become load-bearing infrastructure - when the business is running on them and the pager is yours.

The failure modes, architectural decision frameworks, and operational principles here are designed to remain relevant across model generations. The specific tools and pricing numbers will change. The engineering problems will not.


Next: Chapter 1 - LLM Gateway and Multi-Provider Routing