About

Vinay Jayanna

Staff ML Engineer with 17+ years building production AI systems. I specialize in LLM inference optimization, GPU capacity planning, and the systems decisions that determine whether AI deployments are cost-efficient at scale. I write field guides to make rigorous engineering frameworks accessible to the engineers who need them.

GitHub → · LinkedIn →

Experience

2025 — Present

Staff ML Engineer

Large-scale Generative AI Platform

Leading LLM inference optimization and GenAI platform engineering at scale. Focus areas include GPU memory architecture, KV cache optimization, serving framework evaluation (vLLM, TensorRT-LLM), parallelism strategy for frontier-class models, latency-throughput operating point selection, and production monitoring for LLM serving systems.

LLM Inference · GenAI Platform · GPU Optimization · vLLM · TensorRT-LLM · Ray Serve

2024 — 2025

Founder & CEO

Vipas.AI — AI Inference Marketplace

Founded and led an AI inference marketplace enabling model creators to host and monetize industry-specific AI models with IP protection, monitoring, and pay-per-prediction APIs. Grew to 25K daily active visitors and received a VC term sheet. Built the company end-to-end across engineering, product, and go-to-market.

Founder · LLM Serving · Inference Marketplace · Product · Go-to-Market

~8 years

Engineering Leader

AWS SageMaker — Amazon Web Services

Led engineering on AWS SageMaker from the 2017 launch through eight years of scale. Drove AI inference infrastructure serving production ML workloads across thousands of enterprise customers. Worked across the full ML platform stack: model hosting, training infrastructure, and the core abstractions that became the industry standard for managed ML. This experience shaped how I think about production AI systems at scale.

AWS SageMaker · ML Platform · AI Inference · Distributed Systems · Enterprise Scale

Technical Focus

LLM Inference Optimization

GPU memory architecture, KV cache design, quantization strategies (FP8, INT4, GPTQ, AWQ), parallelism (tensor, pipeline, data, expert), continuous batching, disaggregated prefill/decode serving, and speculative decoding. Operating point selection from latency-throughput curves.

GPU Capacity Planning & Cost

Workload characterization, roofline analysis, fleet sizing from first principles, TCO modeling, utilization trap avoidance, heterogeneous fleet composition. Turning benchmark results into GPU counts and cost estimates you can defend.

GenAI Platform Engineering

LLM serving framework selection and evaluation, multi-LoRA serving architectures, production guardrails, rate limiting strategies, SLA-aware scheduling, and the operational engineering that keeps GenAI systems reliable at scale.

Agentic AI Systems

Multi-agent orchestration patterns, tool reliability, memory and state management across agent turns, latency budgets for compound AI systems, failure mode analysis, and observability for non-deterministic LLM pipelines.

Patents & Publications

USPTO Patent — Pending

Dynamic Hierarchical Storage and GPU Optimization for LLM Serving

A system and method for dynamic tiered memory management across GPU VRAM, CPU DRAM, and NVMe storage for large language model inference workloads, with adaptive scheduling based on request priority and latency constraints.

Field Guide — v1.0 · 107 pages

Sizing LLM Inference Systems at Scale

A complete decision framework for GPU capacity planning: workload characterization, memory budgeting, roofline analysis, quantization, parallelism, batching, KV cache optimization, and a 13-step sizing algorithm. Written for Staff and Principal ML Engineers. Read the guide →

Selected Writing

Technical Comparison · LinkedIn

TensorRT-LLM vs. vLLM: A Production Comparison

A framework-level comparison of the two dominant LLM serving stacks — throughput characteristics, memory efficiency, ease of deployment, and the workload profiles where each excels in production.

Production Guide · LinkedIn

Ray Serve for Production LLM Serving

Architecture patterns for deploying LLMs on Ray Serve at scale — autoscaling configuration, batching strategy, multi-model routing, and operational lessons from production deployments.

Contact

I'm reachable on LinkedIn and GitHub. For substantive technical discussions — inference sizing, GenAI platform architecture, or field guide feedback — LinkedIn DMs work best.