Field Guide · v1.1 · May 2026
Sizing LLM Inference for Production
Inference is where AI infrastructure spend compounds: its cost grows with usage for as long as the product runs. Most production LLM fleets pay 2-3× what they need to, not because of hardware limits but because of sizing decisions made without a disciplined framework. This guide provides that framework: a complete decision sequence from workload characterization to production fleet sizing, grounded in the roofline model, queueing theory, and empirical benchmarking.
107 pages · 9 chapters · 13-step sizing algorithm · Staff / Principal ML Engineers
Contents
| | Chapter | What you'll be able to do |
|---|---|---|
| → | Why Inference Sizing Is a Capital Allocation Problem | Understand why inference - not training - now dominates AI infrastructure spend, and why "just add GPUs" is a compounding financial mistake |
| 1 | Workload Characterization | Derive the P95 input/output length distributions, peak RPS, and latency targets that every downstream calculation depends on - before touching a single config parameter |
| 2 | GPU Memory Sizing | Calculate the minimum GPU count from first principles: model weights, KV cache at peak concurrency, activations, and framework overhead - using your workload numbers, not model card estimates |
| 3 | The Roofline Model | Explain precisely why prefill and decode require different hardware, and why buying H100s for their TFLOPS fails decode-dominated workloads |
| 4 | Quantization | Sequence weight, activation, and KV cache quantization decisions correctly - and understand why FP8 is a GPU count decision, not a quality tuning decision |
| 5 | Parallelism Strategy | Select TP, PP, DP, and EP degrees in the right order, understand why TP=4 on PCIe can be slower than TP=2 on NVLink, and size MoE models correctly |
| 6 | Batching Strategy | Choose between continuous batching, chunked prefill, disaggregated P/D, and speculative decoding based on your specific TTFT/ITL tradeoff - not framework defaults |
| 7 | KV Cache Optimization | Deploy PagedAttention, prefix caching, and cache-aware routing as a coordinated stack - and diagnose the memory pressure cascade before it presents as a latency problem |
| 8 | Latency-Throughput Curve | Generate the empirical curve for your deployment, find the SLO-constrained operating point, and size the fleet from measured data rather than theoretical peaks |
| 9 | Sizing Algorithm & Monitoring | Execute the full 13-step sizing sequence in the correct dependency order, instrument the metrics that fire early enough to intervene, and avoid the utilization trap in TCO |
| ✦ | First Principles, Last | The reasoning framework that outlasts every hardware generation and framework version in this guide |
This field moves fast - specific numbers, framework defaults, and GPU specs will age. The reasoning framework will not. Read the numbers as illustrations of the method, not as configuration targets to copy verbatim.


