Why Inference Sizing Is Now a Capital Allocation Problem

There is a version of the AI infrastructure conversation that peaked around 2022: which cluster to train on, how many GPUs the training run needs, how to parallelize the backward pass. That conversation was important. It is also largely finished for most organizations.

GPT-4 was trained on an estimated $78-100 million of compute according to the Stanford AI Index Report 2025. That number sounds large. It is a one-time event.

Inference is not a one-time event. Inference runs every second, against every user, for the entire operational lifetime of the model. One analysis projected GPT-4's inference bill at approximately $2.3 billion in 2024 - roughly 23-29× the training figure above - simply from the cumulative weight of serving millions of requests per day. Training is a capital expenditure. Inference is an operating expenditure that compounds indefinitely with usage growth.

This shift from training-centric to inference-centric infrastructure spend is now well underway. The inference market is projected to reach $255 billion by 2030. Andreessen Horowitz has documented what they call "LLMflation" - the cost of inference per token has dropped roughly 10× every year since 2022 - yet total inference spend continues to rise because usage is growing faster than prices are falling. Per-token costs are collapsing. Total bills are not.
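To see why collapsing per-token prices do not imply shrinking bills, a toy calculation helps. Only the roughly 10× annual price decline comes from the LLMflation observation above; the starting price, starting volume, and usage growth rate below are illustrative assumptions, not figures from any analysis.

```python
# Toy illustration: per-token prices fall ~10x per year, but if token
# volume grows even faster, total inference spend still rises.
# Starting values and the 15x usage growth rate are assumptions.

price_per_million_tokens = 20.0   # year 0, assumed
tokens_per_year = 1e12            # year 0, assumed

for year in range(4):
    spend = price_per_million_tokens * tokens_per_year / 1e6
    print(f"year {year}: ${spend / 1e6:.1f}M total inference spend")
    price_per_million_tokens /= 10   # price drops 10x per year
    tokens_per_year *= 15            # usage grows 15x per year (assumed)
```

Under these assumptions the bill climbs from $20M to roughly $67M over three years even as the unit price falls a thousandfold - the dynamic the paragraph above describes.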

For organizations running LLMs in production, inference cost is the number that matters most. And yet the discipline of rigorous inference sizing - knowing exactly how many GPUs you need, under what configuration, to serve a defined workload at a defined latency target - remains surprisingly underdeveloped. Most teams either over-provision by instinct ("just add more GPUs") or under-provision until the system breaks under load, then scramble to scale. Neither approach is defensible at production scale.

The "just add more GPUs" instinct is expensive. A fleet running at 40% average GPU utilization is paying roughly 2.5× the cost per token of a well-sized fleet at 70% utilization. One analysis found that chips and staffing together constitute 70-80% of total LLM deployment costs - meaning the hardware sizing decision alone dominates the majority of your production AI budget. Getting it wrong by even 30% has compounding financial consequences over a 3-4 year hardware depreciation cycle.

What makes LLM inference uniquely hard to size is that it is not a single workload. It is two fundamentally different computational workloads - prefill and decode - running on the same hardware, with conflicting hardware bottlenecks, competing for the same GPU memory, and interacting in ways that are non-obvious without understanding the underlying system dynamics. The GPU memory budget is split between a static component (model weights, loaded once) and a highly dynamic component (KV cache, which grows with every token in every active request). The system that fits in memory at idle will OOM in production if you have not accounted for KV cache at peak concurrency. The system that performs well at batch size 1 will behave completely differently at batch size 64.

These are not implementation details. They are the reason inference sizing requires a disciplined framework rather than a rule of thumb - which is what the rest of this guide provides.
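To make the memory-budget point concrete, here is a minimal back-of-envelope sketch. The model configuration (a 7B-parameter, 32-layer model with full multi-head attention in fp16) and the peak workload (64 concurrent requests at a 4,096-token context) are illustrative assumptions, not a reference deployment.

```python
# Back-of-envelope GPU memory budget for LLM inference.
# All model and workload numbers are illustrative assumptions
# (a Llama-2-7B-like config, full multi-head attention, fp16).

BYTES_FP16 = 2

# Static component: model weights, loaded once.
n_params = 7e9
weights_gb = n_params * BYTES_FP16 / 1e9           # ~14 GB

# Dynamic component: KV cache, grows with every token in every request.
n_layers, n_kv_heads, head_dim = 32, 32, 128        # assumed config
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_FP16
# 2x for keys and values -> 524,288 bytes (~0.5 MiB) per token here.

# Assumed peak workload: 64 concurrent requests, 4,096-token context each.
concurrency, context_len = 64, 4096
kv_cache_gb = concurrency * context_len * kv_bytes_per_token / 1e9

print(f"weights:  {weights_gb:.0f} GB")             # ~14 GB
print(f"KV cache: {kv_cache_gb:.0f} GB at peak")    # ~137 GB
print(f"total:    {weights_gb + kv_cache_gb:.0f} GB vs. 80 GB on one H100")
```

Under these assumptions the weights alone fit comfortably on a single 80 GB GPU; it is the KV cache at peak concurrency that pushes the same system out of memory, which is exactly the idle-versus-production gap described above.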