
About

Vinay Jayanna

Staff ML Engineer specializing in LLM inference optimization, GPU capacity planning, and the systems decisions that determine whether AI deployments are cost-efficient at scale. Currently leading inference optimization for one of the most consequential AI systems in the world - reaching hundreds of millions of users. I write field guides to make rigorous engineering frameworks accessible to the engineers who need them.

Experience

2025 - Present
Staff ML Engineer
Large-scale Generative AI Platform

Leading LLM inference optimization and GenAI platform engineering for one of the most consequential AI systems in the world - reaching hundreds of millions of users. Focus areas include GPU memory architecture, KV cache optimization, serving framework evaluation (vLLM, TensorRT-LLM), parallelism strategy for frontier-class models, latency-throughput operating point selection, and production monitoring for LLM serving systems.

LLM Inference · GenAI Platform · GPU Optimization · vLLM · TensorRT-LLM · Ray Serve
2024 - 2025
Founder
Vipas.AI - AI Inference Marketplace

Founded and built an AI inference marketplace enabling model creators to host and monetize industry-specific AI models with IP protection, monitoring, and pay-per-prediction APIs. Grew to 25K daily visitors within 90 days. Received a VC term sheet. Selected into NVIDIA Inception, AWS Activate, and Google Cloud Scale programs.

Founder · LLM Serving · Inference Marketplace · Product · Go-to-Market
~8 years
AI Engineering Leader
AWS SageMaker - Amazon Web Services

Built core inference infrastructure for AWS SageMaker from its earliest days - scaling it to serve production ML workloads for enterprises across finance, healthcare, and telecom. Drove the full ML platform stack: inference, training, and MLOps infrastructure, and the abstractions that became the industry standard for managed ML at scale.

AWS SageMaker · ML Platform · AI Inference · Distributed Systems · Enterprise Scale
Earlier career
Engineering Leader
Ericsson · Pegasystems · Global Enterprises

Built large-scale distributed systems and cloud infrastructure across global enterprises. Designed and delivered systems operating at significant scale - from real-time network infrastructure at Ericsson to enterprise platform engineering at Pegasystems. This work established the distributed systems foundations that underpin everything that followed in my AI infrastructure career.

Distributed Systems · Cloud Infrastructure · Enterprise Engineering · Global Scale

Technical Focus

LLM Inference Optimization

GPU memory architecture, KV cache design, quantization strategies (FP8, INT4, GPTQ, AWQ), parallelism (tensor, pipeline, data, expert), continuous batching, disaggregated prefill/decode serving, speculative decoding. Operating point selection from latency-throughput curves.

GPU Capacity Planning & Cost

Workload characterization, roofline analysis, fleet sizing from first principles, TCO modeling, utilization trap avoidance, heterogeneous fleet composition. Turning benchmark results into GPU counts and cost estimates you can defend.

GenAI Platform Engineering

LLM serving framework selection and evaluation, multi-LoRA serving architectures, production guardrails, rate limiting strategies, SLO-aware scheduling, and the operational engineering that keeps GenAI systems reliable at scale.

Agentic AI Systems

Multi-agent orchestration patterns, tool reliability, memory and state management across agent turns, latency budgets for compound AI systems, failure mode analysis, and observability for non-deterministic LLM pipelines.

Patents & Publications

USPTO Patent - Pending
Dynamic Hierarchical Storage and GPU Optimization for LLM Serving

A system and method for dynamic tiered memory management across GPU VRAM, CPU DRAM, and NVMe storage for large language model inference workloads, with adaptive scheduling based on request priority and latency constraints.

Field Guide - v1.1 · 107 pages
Sizing LLM Inference for Production

A complete decision framework for GPU capacity planning: workload characterization, memory budgeting, roofline analysis, quantization, parallelism, batching, KV cache optimization, and a 13-step sizing algorithm. Written for Staff and Principal ML Engineers.

Selected Writing

Technical Comparison · LinkedIn

A framework-level comparison of the two dominant LLM serving stacks - throughput characteristics, memory efficiency, ease of deployment, and the workload profiles where each excels in production.

Production Guide · LinkedIn

Architecture patterns for deploying LLMs on Ray Serve at scale - autoscaling configuration, batching strategy, multi-model routing, and operational lessons from production deployments.

Deep Dive · LinkedIn

How multi-LoRA serving, prefix caching, and cache-aware routing combine to make multi-tenant inference economically viable - and where the production engineering challenges diverge from the research framing.

Technical Explainer · LinkedIn

Why KV cache is the most under-used lever in production LLM serving - and how PagedAttention, prefix caching, and quantization combine to turn it into a concurrency multiplier.

Infrastructure · LinkedIn

The retrieval infrastructure underneath RAG systems - why vector search is a production engineering problem, not just an algorithm selection.

All writing

Additional writing on GenAI platform engineering, vector search infrastructure, MLOps, and production AI strategy.

Contact

Reachable on LinkedIn and GitHub. For substantive technical discussions - inference sizing, GenAI platform architecture, or field guide feedback - LinkedIn DMs work best.