LLM Inference · GPU Systems · Production AI

LLM Inference &
ML Infrastructure

Technical field guides written from production experience - each one a comprehensive decision framework you can apply to real systems.

About the author

Currently a Staff ML Engineer leading LLM inference optimization for one of the most consequential AI systems in the world, reaching hundreds of millions of users. Before that, spent nearly a decade at AWS building and scaling core inference infrastructure for SageMaker from its earliest days. Founded Vipas.AI, an AI inference marketplace that reached 25K daily visitors and received a VC term sheet. Earlier career spans building large-scale distributed systems and cloud infrastructure at Ericsson, Pegasystems, and other global enterprises. Holds a pending USPTO patent on dynamic hierarchical storage and GPU optimization for LLM serving.

Field Guides

Each guide is a complete treatment of a production ML systems topic - not a survey, not a tutorial, but a decision framework you can apply directly to real deployments.

Field Guide · v1.1
Sizing LLM Inference for Production
Inference is where AI infrastructure spend compounds indefinitely with usage growth. Most production LLM fleets are paying 2–3× what they need to - not from hardware limits, but from sizing decisions made without a disciplined framework. This guide covers the complete decision sequence: workload characterization, GPU memory sizing, parallelism strategy, KV cache optimization, and fleet sizing from the latency-throughput curve. A back-of-envelope sketch of the memory-sizing step follows the card.
📄 107 pages · 9 chapters · 👤 Staff / Principal ML Engineers
Read the guide →
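
To give a flavor of the memory-sizing step, here is a minimal back-of-envelope sketch. All numbers are illustrative assumptions (a Llama-3-8B-style config with grouped-query attention on a single 80 GiB accelerator), not figures from the guide:

```python
# Back-of-envelope GPU memory sizing for LLM serving.
# Every constant below is an assumption for illustration only.

GiB = 1024**3

# -- model assumptions (fp16 weights, GQA attention) --
n_params        = 8e9    # parameters
bytes_per_param = 2      # fp16
n_layers        = 32
n_kv_heads      = 8      # grouped-query attention
head_dim        = 128
kv_bytes        = 2      # fp16 KV cache

# -- hardware assumptions: one 80 GiB accelerator --
gpu_mem  = 80 * GiB
overhead = 6 * GiB       # activations, runtime, fragmentation (assumed)

weights = n_params * bytes_per_param

# KV cache per token: K and V tensors, per layer, per KV head.
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes

kv_budget  = gpu_mem - weights - overhead
max_tokens = int(kv_budget // kv_per_token)

ctx_len = 8192           # assumed average context length
print(f"weights:         {weights / GiB:.1f} GiB")
print(f"KV per token:    {kv_per_token / 1024:.0f} KiB")
print(f"KV token budget: {max_tokens:,} tokens")
print(f"~concurrent seqs @ {ctx_len} ctx: {max_tokens // ctx_len}")
```

Under these assumptions, ~15 GiB of weights leave roughly 59 GiB of KV budget, about 480K cacheable tokens, or on the order of 59 concurrent 8K-context sequences per GPU - the kind of arithmetic the guide develops into a full sizing framework.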
Field Guide · In Progress
Agentic Systems in Production
Production architecture for multi-agent LLM systems: orchestration patterns, tool reliability, memory and state management, latency budgets, failure modes, observability, and cost control. The guide that bridges research prototypes and production deployments. A sketch of the latency-budget pattern follows the card.
📄 100+ pages · 10 chapters · 👤 Staff / Principal ML Engineers · In Progress
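
As a taste of the latency-budget and tool-reliability material, here is a minimal sketch of one common pattern: retrying a flaky tool call while never exceeding an overall latency budget. The tool, its failure rate, and all timeout values are hypothetical stand-ins, not the guide's recommendations:

```python
import asyncio
import random

# Hypothetical flaky tool; stands in for any external API an agent calls.
async def search_tool(query):
    await asyncio.sleep(random.uniform(0.05, 0.4))  # simulated latency
    if random.random() < 0.3:                       # simulated failure rate
        raise RuntimeError("tool backend unavailable")
    return f"results for {query!r}"

async def call_with_budget(query, budget_s=1.0,
                           per_call_timeout_s=0.3, max_attempts=3):
    """Retry a tool call with bounded attempts, capped by a total budget."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    last_err = None
    for _ in range(max_attempts):
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # overall budget exhausted; stop retrying
        try:
            # Each attempt gets the smaller of its own timeout and
            # whatever remains of the overall budget.
            return await asyncio.wait_for(
                search_tool(query),
                timeout=min(per_call_timeout_s, remaining),
            )
        except (asyncio.TimeoutError, RuntimeError) as err:
            last_err = err
    raise TimeoutError(f"tool call failed within budget: {last_err}")

# Usage: asyncio.run(call_with_budget("gpu sizing"))
```

The key design choice is that retries consume the same budget as the first attempt, so a misbehaving tool degrades one agent step rather than blowing the end-to-end latency target.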