
Field Guide · v1.1 · May 2026

Sizing LLM Inference for Production

Inference is where AI infrastructure spend compounds indefinitely with usage growth. Most production LLM fleets pay 2-3× what they need to, not because of hardware limits but because of sizing decisions made without a disciplined framework. This guide provides that framework: a complete decision sequence from workload characterization to production fleet sizing, grounded in the roofline model, queueing theory, and empirical benchmarking.

107 pages · 9 chapters · 13-step sizing algorithm · Staff / Principal ML Engineers


Contents

Chapter | What you'll be able to do
Why Inference Sizing Is a Capital Allocation Problem | Understand why inference - not training - now dominates AI infrastructure spend, and why "just add GPUs" is a compounding financial mistake
1. Workload Characterization | Derive the P95 input/output length distributions, peak RPS, and latency targets that every downstream calculation depends on - before touching a single config parameter
2. GPU Memory Sizing | Calculate the minimum GPU count from first principles: model weights, KV cache at peak concurrency, activations, and framework overhead - using your workload numbers, not model card estimates
3. The Roofline Model | Explain precisely why prefill and decode require different hardware, and why buying H100s for their TFLOPS fails decode-dominated workloads
4. Quantization | Sequence weight, activation, and KV cache quantization decisions correctly - and understand why FP8 is a GPU count decision, not a quality tuning decision
5. Parallelism Strategy | Select TP, PP, DP, and EP degrees in the right order, understand why TP=4 on PCIe can be slower than TP=2 on NVLink, and size MoE models correctly
6. Batching Strategy | Choose between continuous batching, chunked prefill, disaggregated P/D, and speculative decoding based on your specific TTFT/ITL tradeoff - not framework defaults
7. KV Cache Optimization | Deploy PagedAttention, prefix caching, and cache-aware routing as a coordinated stack - and diagnose the memory pressure cascade before it presents as a latency problem
8. Latency-Throughput Curve | Generate the empirical curve for your deployment, find the SLO-constrained operating point, and size the fleet from measured data rather than theoretical peaks
9. Sizing Algorithm & Monitoring | Execute the full 13-step sizing sequence in the correct dependency order, instrument the metrics that fire early enough to intervene, and avoid the utilization trap in TCO
First Principles, Last | The reasoning framework that outlasts every hardware generation and framework version in this guide

This field moves fast - specific numbers, framework defaults, and GPU specs will age. The reasoning framework will not. Read the numbers as illustrations of the method, not as configuration targets to copy verbatim.
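
To give a flavor of the first-principles arithmetic that the GPU Memory Sizing chapter describes (weights, KV cache at peak concurrency, activations, framework overhead), here is a minimal Python sketch. It is not taken from the guide: the function names, the 10% overhead fraction, the flat 2 GiB activation budget, and every model and workload number below are hypothetical placeholders chosen only to show the shape of the calculation. The one fixed piece is the standard per-token KV cache size for a transformer with grouped-query attention: 2 (K and V) × layers × KV heads × head dimension × bytes per element.

```python
import math

GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both the K and the V tensor.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def min_gpu_count(
    param_count,
    weight_bytes_per_param,
    num_layers,
    num_kv_heads,
    head_dim,
    kv_bytes_per_elem,
    peak_concurrency,
    tokens_per_request,
    gpu_mem_gib,
    overhead_fraction=0.10,  # ASSUMPTION: share of each GPU reserved for framework overhead
    activation_gib=2.0,      # ASSUMPTION: flat activation budget for the whole deployment
):
    """Lower bound on GPU count from memory alone (ignores parallelism constraints)."""
    weights = param_count * weight_bytes_per_param
    kv_cache = (
        kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_bytes_per_elem)
        * peak_concurrency
        * tokens_per_request
    )
    total_bytes = weights + kv_cache + activation_gib * GIB
    usable_per_gpu = gpu_mem_gib * GIB * (1 - overhead_fraction)
    return math.ceil(total_bytes / usable_per_gpu)


if __name__ == "__main__":
    # Hypothetical 70B-parameter model served in 16-bit precision on 80 GiB GPUs.
    gpus = min_gpu_count(
        param_count=70e9,
        weight_bytes_per_param=2,
        num_layers=80,
        num_kv_heads=8,
        head_dim=128,
        kv_bytes_per_elem=2,
        peak_concurrency=64,      # hypothetical peak concurrent requests
        tokens_per_request=4096,  # hypothetical P95 input + output length
        gpu_mem_gib=80,
    )
    print(f"Memory-only lower bound: {gpus} GPUs")
```

The result is a memory-only floor. A real deployment would round it up to a tensor-parallel degree the hardware supports and then check it against measured latency and throughput, which is the job of the later chapters.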