Demo: High-throughput LLM inference with Kubernetes, llm-d, and Google Cloud TPUs | Kisaco Research

This hands-on session is designed for developers and architects building and scaling generative AI services. We will provide a practical look at Google Kubernetes Engine (GKE) as the foundation for high-performance large language model (LLM) inference. The session will feature a live demo of the GKE Inference Gateway, highlighting its model-aware routing and serving priority features. We will then delve into the open-source llm-d project, showcasing its vLLM-aware scheduling and disaggregated serving capabilities. To cap it off, we'll explore the impressive performance gains of running vLLM on Cloud TPUs for maximum throughput and efficiency. You will leave with actionable insights and code examples to optimize your LLM serving stack.
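To give a flavor of the model-aware routing and serving-priority features the demo covers, the sketch below shows what a GKE Inference Gateway setup might look like using the Kubernetes Gateway API Inference Extension custom resources (`InferencePool`, `InferenceModel`). The API group, version, and field names here are assumptions based on the open-source extension and may differ from the exact release shown in the session.

```yaml
# Hypothetical sketch (not the session's actual manifests): route requests
# for one model through a pool of vLLM server Pods, with a serving priority
# (criticality) attached. Field names follow the open-source Gateway API
# Inference Extension (v1alpha2) and may differ in the GKE release demoed.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  targetPortNumber: 8000          # port the vLLM servers listen on
  selector:
    app: vllm-llama               # matches the vLLM Deployment's Pods
  extensionRef:
    name: vllm-llama-epp          # endpoint picker performing model-aware routing
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-chat
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  criticality: Critical           # serving priority: Critical > Standard > Sheddable
  poolRef:
    name: vllm-llama-pool         # ties this model to the pool above
```

The idea is that the gateway routes by model name rather than by plain URL path, and sheds lower-criticality traffic first under load.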

Sponsor(s): 
Google
Speaker(s): 

Ravi Mahendrakar

Senior Product Manager
Google Cloud

Ravi Mahendrakar is a Product Management leader at Google, focused on ML Frameworks & Ecosystems. With over 20 years of experience, including product roles at AWS, Aerospike, VAST Data, Pure Storage, Veritas, and IBM, Ravi specializes in bringing innovative data and enterprise software solutions to market. He holds an MBA from Chicago Booth and a Master's in Computer Science from CSU Chico.

 


Session Type: 
General Session (Presentation)