Demo: High-throughput LLM inference with Kubernetes, llm-d, and Google Cloud TPUs | Kisaco Research

This hands-on session is designed for developers and architects building and scaling generative AI services. We will provide a practical look at Google Kubernetes Engine (GKE) as the foundation for high-performance large language model (LLM) inference. The session will feature a live demo of the GKE Inference Gateway, highlighting its model-aware routing and serving priority features. We will then delve into the open-source llm-d project, showcasing its vLLM-aware scheduling and disaggregated serving capabilities. To cap it off, we'll explore the impressive performance gains of running vLLM on Cloud TPUs for maximum throughput and efficiency. You will leave with actionable insights and code examples to optimize your LLM serving stack.
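To give a flavor of the model-aware routing and serving-priority features the demo covers, the sketch below shows what a GKE Inference Gateway setup might look like using the Kubernetes Gateway API Inference Extension custom resources (`InferencePool`, `InferenceModel`). The API group, version, and field names here are assumptions based on the open-source extension and may differ from the exact release shown in the session.

```yaml
# Hypothetical sketch (not the session's actual manifests): route requests
# for one model through a pool of vLLM server Pods, with a serving priority
# (criticality) attached. Field names follow the open-source Gateway API
# Inference Extension (v1alpha2) and may differ in the GKE release demoed.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  targetPortNumber: 8000          # port the vLLM servers listen on
  selector:
    app: vllm-llama               # matches the vLLM Deployment's Pods
  extensionRef:
    name: vllm-llama-epp          # endpoint picker performing model-aware routing
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-chat
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  criticality: Critical           # serving priority: Critical > Standard > Sheddable
  poolRef:
    name: vllm-llama-pool         # ties this model to the pool above
```

The idea is that the gateway routes by model name rather than by plain URL path, and sheds lower-criticality traffic first under load.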

Sponsor(s): 
Google
Speaker(s): 

Ravi Mahendrakar

Senior Product Manager
Google Cloud

Ravi Mahendrakar is a Product Management leader at Google, focused on ML Frameworks & Ecosystems. With over 20 years of experience, including product roles at AWS, Aerospike, VAST Data, Pure Storage, Veritas, and IBM, Ravi specializes in bringing innovative data and enterprise software solutions to market. He holds an MBA from Chicago Booth and a Master's in Computer Science from CSU Chico.

 


Session Type: 
General Session (Presentation)