Field Guide · v1.0 · April 2026

Sizing LLM Inference for Production

From first principles to cost-efficient scale. A complete decision framework for GPU capacity planning — workload characterization, memory budgeting, parallelism, KV cache, and a 13-step sizing algorithm.

107 pages · 9 chapters · 13-step sizing algorithm · Staff / Principal ML Engineers

Preface

LLM inference has become one of the most consequential infrastructure engineering problems in the industry. The individual techniques are well documented: quantization, parallelism, batching, and KV cache optimization each have deep literature behind them. What is harder to find is a single framework that shows how these pieces fit together, in what order to apply them, and how to turn the output into a GPU count and cost estimate you can stand behind.

This guide is built around that framework. Every decision covered here — memory budgeting, parallelism strategy, quantization format, KV cache configuration — follows from measurable properties of your workload and your hardware. The goal is not to give you configurations to copy. It is to give you the reasoning to derive the right configuration for your specific model, traffic, and constraints — and to understand it well enough to adapt it when things change.

Who this is for. Staff and Principal ML Engineers, ML Platform Engineers, AI Infrastructure Architects, and Applied Scientists moving into production ownership. It assumes you are comfortable with transformer fundamentals, have hands-on GPU experience, and understand distributed systems concepts like memory hierarchies, throughput, and latency. Prior inference optimization experience is not required — that is what the guide builds from the ground up.

How to use this guide. The sections form a decision sequence — each one produces an output that feeds the next. Workload characterization informs memory sizing. Memory sizing constrains parallelism selection. Parallelism selection determines the benchmark configuration. The benchmark produces the operating point. The operating point sizes the fleet.

A note on the pace of change. This field moves fast — specific numbers, framework defaults, and GPU specs will age. The reasoning framework will not. Read the numbers as illustrations of the method, not as configuration targets to copy verbatim.

	Chapter	Key Question
→	Why Inference Sizing Is a Capital Allocation Problem	Why GPU spend is now an engineering decision
1	Workload Characterization	What exactly am I serving?
2	GPU Memory Sizing	How many GPUs do I need at minimum?
3	The Roofline Model	Why do prefill and decode behave differently?
4	Quantization	How do I trade precision for scale?
5	Parallelism Strategy	How do I split the model across GPUs?
6	Batching Strategy	How do I maximize throughput without hurting latency?
7	KV Cache Optimization	What is my most under-used lever?
8	Latency-Throughput Curve	Where is my operating point?
9	Sizing Algorithm & Monitoring	How do I put it all together?
✦	First Principles, Last	The mental model underneath it all
