This hands-on session is designed for developers and architects building and scaling generative AI services. We will provide a practical look at Google Kubernetes Engine (GKE) as the foundation for high-performance large language model (LLM) inference. The session will feature a live demo of the GKE Inference Gateway, highlighting its model-aware routing and serving priority features. We will then delve into the open-source llm-d project, showcasing its vLLM-aware scheduling and disaggregated serving capabilities. To cap it off, we'll explore the impressive performance gains of running vLLM on Cloud TPUs for maximum throughput and efficiency. You will leave with actionable insights and code examples to optimize your LLM serving stack.

Ravi Mahendrakar
Ravi Mahendrakar is a Product Management leader at Google, focused on ML Frameworks & Ecosystems. With over 20 years of experience, including product roles at AWS, Aerospike, VAST Data, Pure Storage, Veritas, and IBM, Ravi specializes in bringing innovative data and enterprise software solutions to market. He holds an MBA from Chicago Booth and a Master's in Computer Science from CSU Chico.