Short Course

Efficiently Serving LLMs

Instructor: Travis Addair

Predibase
  • Intermediate
  • 2 Hours 31 Minutes
  • 9 Video Lessons
  • 7 Code Examples

What you'll learn

  • Learn how Large Language Models (LLMs) repeatedly predict the next token, and how techniques like KV caching can greatly speed up text generation (a minimal sketch follows this list).

  • Write code to efficiently serve LLM applications to a large number of users, and examine the tradeoffs between quickly returning the output of the model and serving many users at once.

  • Explore the fundamentals of Low-Rank Adaptation (LoRA) and see how Predibase built its LoRAX framework to serve multiple fine-tuned models at once.
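
Below is a minimal, self-contained sketch of that token-by-token generation loop, run with and without a KV cache. It uses Hugging Face transformers and GPT-2 purely for illustration; the model choice, timing approach, and function names are assumptions, not the course's exact notebook code.

```python
# Minimal sketch: greedy next-token generation with and without KV caching.
# Model (gpt2) and timing setup are illustrative, not the course's exact code.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate(prompt, max_new_tokens=50, use_cache=True):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None
    next_input = input_ids
    generated = input_ids
    start = time.time()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(next_input, past_key_values=past_key_values, use_cache=use_cache)
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
            generated = torch.cat([generated, next_token], dim=-1)
            if use_cache:
                # Reuse cached keys/values; only the new token is fed to the model.
                past_key_values = out.past_key_values
                next_input = next_token
            else:
                # Without the cache, the full sequence is re-processed every step.
                next_input = generated
    return tokenizer.decode(generated[0]), time.time() - start

_, t_cached = generate("The future of serving LLMs", use_cache=True)
_, t_uncached = generate("The future of serving LLMs", use_cache=False)
print(f"with KV cache: {t_cached:.2f}s, without: {t_uncached:.2f}s")
```

With the cache enabled, each step feeds only the newly generated token to the model and reuses the attention keys and values computed on earlier steps, which is why decoding gets substantially faster for longer outputs.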

About this course

Join our new short course, Efficiently Serving Large Language Models, to build a ground-up understanding of how to serve LLM applications from Travis Addair, CTO at Predibase. Whether you’re ready to launch your own application or just getting started building it, the topics you’ll explore in this course will deepen your foundational knowledge of how LLMs work, and help you better understand the performance trade-offs you must consider when building LLM applications that will serve large numbers of users.

You’ll walk through the most important optimizations that allow LLM vendors to efficiently serve models to many customers, including strategies for working with multiple fine-tuned models at once. In this course, you will:

  • Learn how auto-regressive large language models generate text one token at a time.
  • Implement the foundational elements of a modern LLM inference stack in code, including KV caching, continuous batching, and model quantization, and benchmark their impact on inference throughput and latency.
  • Explore the details of how LoRA adapters work, and learn how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously (see the sketch after this list).
  • Get hands-on with Predibase's LoRAX framework to see these optimization techniques implemented in a real-world LLM inference server.
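
As a preview of the LoRA and multi-LoRA lessons, here is what the core idea looks like in code: a frozen base weight W plus a trainable low-rank update, so the layer computes Wx + (alpha/r)·BAx. The sketch below is a minimal PyTorch illustration; the dimensions, rank, and initialization are assumptions, not the course's exact code.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update.
# Dimensions, rank, and scaling are illustrative, not the course's exact values.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_features, rank))        # zeros: layer starts as the base
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction: W x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(in_features=768, out_features=768)
x = torch.randn(2, 10, 768)   # (batch, sequence, hidden)
print(layer(x).shape)         # torch.Size([2, 10, 768])
```

Because every adapter shares the same frozen base weights, a server can batch requests destined for many different fine-tuned models together and only swap in each request's small A and B matrices; that is the idea behind the multi-LoRA batching covered later in the course.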

Knowing more about how LLM servers operate under the hood will greatly enhance your understanding of the options you have to increase the performance and efficiency of your LLM-powered applications.

Who should join?

Anyone who wants a step-by-step understanding of the components, techniques, and tradeoffs involved in efficiently serving LLM applications. This course relies on intermediate Python knowledge and demonstrates real-world techniques and applications.

Course Outline

9 Lessons・7 Code Examples
  • Introduction

    Video · 5 mins

  • Text Generation

    Video with code examples · 20 mins

  • Batching

    Video with code examples · 22 mins

  • Continuous Batching

    Video with code examples · 18 mins

  • Quantization

    Video with code examples · 19 mins

  • Low-Rank Adaptation

    Video with code examples · 15 mins

  • Multi-LoRA Inference

    Video with code examples · 19 mins

  • LoRAX

    Video with code examples · 30 mins

  • Conclusion

    Video · 1 min

Instructor

Travis Addair

Co-Founder and CTO at Predibase

Course access is free for a limited time during the DeepLearning.AI learning platform beta!
