About
Staff ML Engineer with 17+ years building production AI systems. I specialize in LLM inference optimization, GPU capacity planning, and the systems decisions that determine whether AI deployments are cost-efficient at scale. I write field guides to make rigorous engineering frameworks accessible to the engineers who need them.
Leading LLM inference optimization and GenAI platform engineering at scale. Focus areas include GPU memory architecture, KV cache optimization, serving framework evaluation (vLLM, TensorRT-LLM), parallelism strategy for frontier-class models, latency-throughput operating point selection, and production monitoring for LLM serving systems.
Founded and led an AI inference marketplace enabling model creators to host and monetize industry-specific AI models with IP protection, monitoring, and pay-per-prediction APIs. Grew to 25K daily active visitors. Received a VC term sheet. Built end-to-end across engineering, product, and go-to-market.
Led engineering on AWS SageMaker from the 2017 launch through eight years of scale. Drove AI inference infrastructure serving production ML workloads across thousands of enterprise customers. Worked across the full ML platform stack: model hosting, training infrastructure, and the core abstractions that became the industry standard for managed ML. That experience shaped how I think about production AI systems at scale.
GPU memory architecture, KV cache design, quantization strategies (FP8, INT4, GPTQ, AWQ), parallelism (tensor, pipeline, data, expert), continuous and disaggregated batching, speculative decoding, and operating point selection from latency-throughput curves.
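To make that last item concrete: once you have a benchmark sweep, operating point selection reduces to taking the highest-throughput point that still meets the latency SLO. A minimal sketch in Python, with invented sample points and an illustrative 2-second p99 SLO:

```python
# Sketch: pick an operating point from a measured latency-throughput curve.
# The sample points and the 2.0 s p99 SLO are illustrative, not measurements.

from dataclasses import dataclass

@dataclass
class OperatingPoint:
    batch_size: int
    p99_latency_s: float     # measured p99 end-to-end latency
    throughput_tok_s: float  # measured output tokens per second

def pick_operating_point(curve: list[OperatingPoint],
                         slo_p99_s: float) -> OperatingPoint:
    """Highest-throughput point that still meets the latency SLO."""
    feasible = [p for p in curve if p.p99_latency_s <= slo_p99_s]
    if not feasible:
        raise ValueError("no operating point meets the SLO; add capacity or relax it")
    return max(feasible, key=lambda p: p.throughput_tok_s)

# Hypothetical benchmark sweep over batch sizes:
curve = [
    OperatingPoint(1,  0.4,  120),
    OperatingPoint(8,  0.9,  650),
    OperatingPoint(32, 1.8, 1900),
    OperatingPoint(64, 3.5, 2600),  # highest throughput, but violates the SLO
]
print(pick_operating_point(curve, slo_p99_s=2.0))  # -> batch_size=32
```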
Workload characterization, roofline analysis, fleet sizing from first principles, TCO modeling, utilization trap avoidance, heterogeneous fleet composition. Turning benchmark results into GPU counts and cost estimates you can defend.
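The arithmetic at the heart of fleet sizing is short once the per-GPU operating point has been measured. A sketch under placeholder assumptions (the request rates, throughput, and prices below are invented):

```python
# Sketch of first-principles fleet sizing. All inputs are placeholder
# assumptions, not figures from any real deployment.

import math

def size_fleet(peak_req_per_s: float,
               tokens_per_req: float,
               gpu_tok_per_s: float,       # measured at the chosen operating point
               target_utilization: float,  # headroom for spikes and failures
               gpu_hourly_usd: float) -> tuple[int, float]:
    demand_tok_per_s = peak_req_per_s * tokens_per_req
    gpus = math.ceil(demand_tok_per_s / (gpu_tok_per_s * target_utilization))
    monthly_usd = gpus * gpu_hourly_usd * 24 * 30
    return gpus, monthly_usd

gpus, cost = size_fleet(peak_req_per_s=40, tokens_per_req=350,
                        gpu_tok_per_s=1900, target_utilization=0.6,
                        gpu_hourly_usd=2.5)
print(f"{gpus} GPUs, ~${cost:,.0f}/month")  # 13 GPUs, ~$23,400/month
```

The target-utilization divisor is the guard against the utilization trap: sizing to 100% of benchmarked throughput leaves no headroom for traffic spikes, stragglers, or node failures.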
LLM serving framework selection and evaluation, multi-LoRA serving architectures, production guardrails, rate limiting strategies, SLA-aware scheduling, and the operational engineering that keeps GenAI systems reliable at scale.
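As one example of a rate limiting strategy, a per-tenant token bucket is often the first guardrail placed in front of an LLM endpoint. A minimal sketch; the rate and burst values are arbitrary:

```python
# Minimal token-bucket rate limiter of the kind used in front of an LLM
# endpoint; the capacity numbers are illustrative.

import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s  # sustained tokens granted per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate_per_s=5, burst=10)  # ~5 req/s, bursts of 10
print([bucket.allow() for _ in range(12)].count(True))  # first 10 pass
```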
Multi-agent orchestration patterns, tool reliability, memory and state management across agent turns, latency budgets for compound AI systems, failure mode analysis, and observability for non-deterministic LLM pipelines.
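Latency budgets in compound systems usually come down to threading one deadline through every step, so no tool call can spend time the steps after it will need. A hypothetical sketch (the step names, timeouts, and client call are invented):

```python
# Sketch of a per-request latency budget threaded through a compound
# agent pipeline, so each step gets only the time that remains.

import time

class LatencyBudget:
    def __init__(self, total_s: float):
        self.deadline = time.monotonic() + total_s

    def remaining(self) -> float:
        return self.deadline - time.monotonic()

    def slice(self, want_s: float, floor_s: float = 0.05) -> float:
        """Timeout for the next step: what it wants, capped by what's left."""
        left = self.remaining()
        if left < floor_s:
            raise TimeoutError("latency budget exhausted")
        return min(want_s, left)

budget = LatencyBudget(total_s=2.0)
for step, want in [("retrieve", 0.5), ("tool_call", 1.0), ("generate", 1.5)]:
    timeout = budget.slice(want)
    print(f"{step}: timeout={timeout:.2f}s")
    # e.g. client.call(step, timeout=timeout)  # hypothetical client
```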
A system and method for dynamic tiered memory management across GPU VRAM, CPU DRAM, and NVMe storage for large language model inference workloads, with adaptive scheduling based on request priority and latency constraints.
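To give a flavor of tiered placement in general (this toy policy is purely illustrative and is not the claimed method; every threshold below is invented):

```python
# Toy illustration of tiered KV placement (GPU VRAM -> CPU DRAM -> NVMe),
# not the patented method itself. Thresholds are invented.

from enum import Enum

class Tier(Enum):
    GPU_VRAM = 1  # lowest latency, scarcest
    CPU_DRAM = 2  # slower to access, much larger
    NVME = 3      # highest latency, effectively unbounded

def place_kv_block(priority: int, latency_budget_ms: float,
                   vram_free_frac: float) -> Tier:
    # Tight-latency requests stay in VRAM if space allows.
    if latency_budget_ms < 100 and vram_free_frac > 0.1:
        return Tier.GPU_VRAM
    if priority >= 5 or latency_budget_ms < 1000:
        return Tier.CPU_DRAM
    return Tier.NVME  # cold or batch-tier requests spill to storage

print(place_kv_block(priority=8, latency_budget_ms=50, vram_free_frac=0.3))
```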
A complete decision framework for GPU capacity planning: workload characterization, memory budgeting, roofline analysis, quantization, parallelism, batching, KV cache optimization, and a 13-step sizing algorithm. Written for Staff and Principal ML Engineers. Read the guide →
A framework-level comparison of the two dominant LLM serving stacks — throughput characteristics, memory efficiency, ease of deployment, and the workload profiles where each excels in production.
Architecture patterns for deploying LLMs on Ray Serve at scale — autoscaling configuration, batching strategy, multi-model routing, and operational lessons from production deployments.
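The shape of that pattern in code: a minimal Ray Serve sketch pairing an autoscaling config with dynamic request batching. The model call is stubbed out, and autoscaling field names vary across Ray versions, so treat the keys as indicative:

```python
# Minimal Ray Serve sketch pairing autoscaling with dynamic batching.
# The model call is a stub; autoscaling_config keys vary by Ray version.

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # Scale out when in-flight requests per replica exceed this target.
        "target_ongoing_requests": 16,
    },
)
class LLMDeployment:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def generate(self, prompts: list[str]) -> list[str]:
        # Stand-in for a real batched forward pass.
        return [f"echo: {p}" for p in prompts]

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        # Single calls are transparently batched by the decorator above.
        return await self.generate(prompt)

app = LLMDeployment.bind()
# serve.run(app)  # then POST {"prompt": ...} to the HTTP endpoint
```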
Additional writing on GenAI platform engineering, MLOps, LLM fine-tuning, and production AI strategy.