About
Staff ML Engineer specializing in LLM inference optimization, GPU capacity planning, and the systems decisions that determine whether AI deployments are cost-efficient at scale. Currently leading inference optimization for one of the most consequential AI systems in the world - reaching hundreds of millions of users. I write field guides to make rigorous engineering frameworks accessible to the engineers who need them.
Leading LLM inference optimization and GenAI platform engineering for one of the most consequential AI systems in the world - reaching hundreds of millions of users. Focus areas include GPU memory architecture, KV cache optimization, serving framework evaluation (vLLM, TensorRT-LLM), parallelism strategy for frontier-class models, latency-throughput operating point selection, and production monitoring for LLM serving systems.
Founded and built an AI inference marketplace enabling model creators to host and monetize industry-specific AI models with IP protection, monitoring, and pay-per-prediction APIs. Grew to 25K daily visitors within 90 days. Received a VC term sheet. Accepted into the NVIDIA Inception, AWS Activate, and Google Cloud Scale programs.
Built core inference infrastructure for AWS SageMaker from its earliest days, scaling it to serve production ML workloads for enterprises across finance, healthcare, and telecom. Drove the full ML platform stack: AI inference, training, and MLOps infrastructure, and the abstractions that became the industry standard for managed ML at scale.
Built large-scale distributed systems and cloud infrastructure across global enterprises. Designed and delivered systems operating at significant scale, from real-time network infrastructure at Ericsson to enterprise platform engineering at Pegasystems. This work established the distributed systems foundations that underpin my later work in AI infrastructure.
GPU memory architecture, KV cache design, quantization strategies (FP8, INT4, GPTQ, AWQ), parallelism (tensor, pipeline, data, expert), continuous batching, disaggregated prefill/decode serving, speculative decoding. Operating point selection from latency-throughput curves.
Workload characterization, roofline analysis, fleet sizing from first principles, TCO modeling, utilization trap avoidance, heterogeneous fleet composition. Turning benchmark results into GPU counts and cost estimates you can defend.
LLM serving framework selection and evaluation, multi-LoRA serving architectures, production guardrails, rate limiting strategies, SLO-aware scheduling, and the operational engineering that keeps GenAI systems reliable at scale.
Multi-agent orchestration patterns, tool reliability, memory and state management across agent turns, latency budgets for compound AI systems, failure mode analysis, and observability for non-deterministic LLM pipelines.
A system and method for dynamic tiered memory management across GPU VRAM, CPU DRAM, and NVMe storage for large language model inference workloads, with adaptive scheduling based on request priority and latency constraints.
A complete decision framework for GPU capacity planning: workload characterization, memory budgeting, roofline analysis, quantization, parallelism, batching, KV cache optimization, and a 13-step sizing algorithm. Written for Staff and Principal ML Engineers.
A framework-level comparison of the two dominant LLM serving stacks - throughput characteristics, memory efficiency, ease of deployment, and the workload profiles where each excels in production.
Architecture patterns for deploying LLMs on Ray Serve at scale - autoscaling configuration, batching strategy, multi-model routing, and operational lessons from production deployments.
How multi-LoRA serving, prefix caching, and cache-aware routing combine to make multi-tenant inference economically viable - and where the production engineering challenges diverge from the research framing.
Why KV cache is the most under-used lever in production LLM serving - and how PagedAttention, prefix caching, and quantization combine to turn it into a concurrency multiplier.
The retrieval infrastructure underneath RAG systems - why vector search is a production engineering problem, not just a matter of algorithm selection.
Additional writing on GenAI platform engineering, vector search infrastructure, MLOps, and production AI strategy.