Field Guide · v1.0 · April 2026
Sizing LLM Inference for Production
From first principles to cost-efficient scale. A complete decision framework for GPU capacity planning — workload characterization, memory budgeting, parallelism, KV cache, and a 13-step sizing algorithm.
107 pages · 9 chapters · 13-step sizing algorithm · Staff / Principal ML Engineers
Preface
LLM inference has become one of the most consequential infrastructure engineering problems in the industry. The individual techniques are well-documented — quantization, parallelism, batching, KV cache optimization each have deep literature behind them. What is harder to find is a single framework that shows how these pieces fit together, in what order to apply them, and how to turn the output into a GPU count and cost estimate you can stand behind.
This guide is built around that framework. Every decision covered here — memory budgeting, parallelism strategy, quantization format, KV cache configuration — follows from measurable properties of your workload and your hardware. The goal is not to give you configurations to copy. It is to give you the reasoning to derive the right configuration for your specific model, traffic, and constraints — and to understand it well enough to adapt it when things change.
Who this is for. Staff and Principal ML Engineers, ML Platform Engineers, AI Infrastructure Architects, and Applied Scientists moving into production ownership. It assumes you are comfortable with transformer fundamentals, have hands-on GPU experience, and understand distributed systems concepts like memory hierarchies, throughput, and latency. Prior inference optimization experience is not required — that is what the guide builds from the ground up.
How to use this guide. The sections form a decision sequence — each one produces an output that feeds the next. Workload characterization informs memory sizing. Memory sizing constrains parallelism selection. Parallelism selection determines the benchmark configuration. The benchmark produces the operating point. The operating point sizes the fleet.
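
To make the chain concrete, here is a minimal sketch of that decision sequence, assuming illustrative names (`Workload`, `OperatingPoint`, `size_fleet` are not from the guide): each stage's output becomes the next stage's input, and the final step turns a benchmarked operating point into a GPU count.

```python
# Hypothetical sketch of the sizing chain: workload -> operating point -> fleet.
# All names and fields are illustrative assumptions, not the guide's API.
import math
from dataclasses import dataclass

@dataclass
class Workload:                      # Chapter 1 output: what am I serving?
    output_tokens: int               # average generated tokens per request
    peak_requests_per_s: float       # peak arrival rate to size against

@dataclass
class OperatingPoint:                # Chapter 8 output: the benchmarked point you commit to
    gpus_per_replica: int            # set by memory sizing + parallelism (Ch. 2, 5)
    tokens_per_s_per_replica: float  # measured decode throughput at the chosen latency target

def size_fleet(wl: Workload, op: OperatingPoint) -> int:
    """Chapter 9: replicas needed at peak demand, times GPUs per replica."""
    demand_tokens_per_s = wl.peak_requests_per_s * wl.output_tokens
    replicas = math.ceil(demand_tokens_per_s / op.tokens_per_s_per_replica)
    return replicas * op.gpus_per_replica
```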
A note on the pace of change. This field moves fast — specific numbers, framework defaults, and GPU specs will age. The reasoning framework will not. Read the numbers as illustrations of the method, not as configuration targets to copy verbatim.
Contents
| Ch. | Chapter | Key Question |
|---|---|---|
| → | Why Inference Sizing Is a Capital Allocation Problem | Why GPU spend is now an engineering decision |
| 1 | Workload Characterization | What exactly am I serving? |
| 2 | GPU Memory Sizing | How many GPUs do I need at minimum? |
| 3 | The Roofline Model | Why do prefill and decode behave differently? |
| 4 | Quantization | How do I trade precision for scale? |
| 5 | Parallelism Strategy | How do I split the model across GPUs? |
| 6 | Batching Strategy | How do I maximize throughput without hurting latency? |
| 7 | KV Cache Optimization | What is my most under-used lever? |
| 8 | Latency-Throughput Curve | Where is my operating point? |
| 9 | Sizing Algorithm & Monitoring | How do I put it all together? |
| ✦ | First Principles, Last | The mental model underneath it all |


