Short Course

Attention in Transformers: Concepts and Code in PyTorch

Instructor: Josh Starmer

StatQuest
  • Beginner
  • 1 Hour 6 Minutes
  • 11 Video Lessons
  • 4 Code Examples

What you'll learn

  • Learn how the attention mechanism in LLMs helps convert base token embeddings into rich context-aware embeddings.

  • Understand the Query, Key, and Value matrices, what they are for, how to produce them, and how to use them in attention (see the formula after this list).

  • Learn the difference between self-attention, masked self-attention, and cross-attention, and how multi-head attention scales the algorithm.
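
For orientation, the equation the course builds up to is the scaled dot-product attention from "Attention Is All You Need":

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the Query, Key, and Value matrices and d_k is the dimension of the keys.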

About this course

This course clearly explains the ideas behind the attention mechanism, walks through the algorithm itself, and shows how to code it in PyTorch. Attention in Transformers: Concepts and Code in PyTorch was built in collaboration with StatQuest and is taught by its Founder and CEO, Josh Starmer.

The attention mechanism was a breakthrough that led to transformers, the architecture powering large language models like ChatGPT. Transformers, introduced in the 2017 paper “Attention Is All You Need” by Ashish Vaswani and others, revolutionized AI with their scalable design.

Learn how this foundational architecture works, and sharpen your intuition for building reliable, functional, and scalable AI applications.

What you’ll do: 

  • Understand the evolution of the attention mechanism, a key breakthrough that led to transformers.
  • Learn the relationships between word embeddings, positional embeddings, and attention.
  • Learn about the Query, Key, and Value matrices, how to produce them, and how to use them in attention.
  • Go through the math required to calculate self-attention and masked self-attention to learn how and why the equation works the way it does.
  • Understand the difference between self-attention and masked self-attention, and how one is used in the encoder to build context-aware embeddings and the other is used in the decoder for generative outputs.
  • Learn the details of the encoder-decoder architecture, cross-attention, and multi-head attention, and how they are incorporated into a transformer.
  • Use PyTorch to code a class that implements self-attention, masked self-attention, and multi-head attention (a minimal sketch follows this list).
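
To make the coding lessons concrete, here is a minimal sketch of the kind of single-head attention class the course works toward. It is illustrative only, not the course's actual code: the class name SelfAttention, the tiny d_model=2, and the example tensors are assumptions made for the demo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention with an optional causal mask (illustrative sketch)."""

    def __init__(self, d_model=2):  # d_model=2 is an assumption to keep the demo tiny
        super().__init__()
        # Linear layers whose weights act as the Query, Key, and Value matrices
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_encodings, mask=None):
        q = self.W_q(token_encodings)  # queries
        k = self.W_k(token_encodings)  # keys
        v = self.W_v(token_encodings)  # values

        # Scaled dot-product similarity between every query and every key
        sims = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5

        # Masked self-attention: hide each token from the tokens that follow it
        if mask is not None:
            sims = sims.masked_fill(mask, float("-inf"))

        attention_percents = F.softmax(sims, dim=-1)
        return attention_percents @ v  # context-aware embeddings

torch.manual_seed(42)
encodings = torch.randn(3, 2)  # three tokens, each encoded in 2 dimensions
attention = SelfAttention()

# Upper-triangular mask -> masked (decoder-style) self-attention
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

print(attention(encodings))                    # self-attention
print(attention(encodings, mask=causal_mask))  # masked self-attention
```

Running the same class with and without the mask shows the difference between the encoder's self-attention and the decoder's masked self-attention; multi-head attention amounts to running several such heads in parallel and combining their outputs.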

Who should join?

Anyone who has basic Python knowledge and wants to learn how the attention mechanism in LLMs like ChatGPT works.

Course Outline

11 Lessons・4 Code Examples
  • Introduction

    Video・6 mins

  • The Main Ideas Behind Transformers and Attention

    Video・4 mins

  • The Matrix Math for Calculating Self-Attention

    Video・11 mins

  • Coding Self-Attention in PyTorch

    Video with code examples・8 mins

  • Self-Attention vs Masked Self-Attention

    Video・14 mins

  • The Matrix Math for Calculating Masked Self-Attention

    Video・3 mins

  • Coding Masked Self-Attention in PyTorch

    Video with code examples・5 mins

  • Encoder-Decoder Attention

    Video・4 mins

  • Multi-Head Attention

    Video・2 mins

  • Coding Encoder-Decoder Attention and Multi-Head Attention in PyTorch

    Video with code examples・4 mins

  • Conclusion

    Video・1 min

  • Appendix – Tips and Help

    Code examples・1 min

Instructor

Josh Starmer

Founder and CEO of StatQuest

Course access is free for a limited time during the DeepLearning.AI learning platform beta!
