Short Course

Multimodal RAG: Chat with Videos

Instructor: Vasudev Lal

Intel
  • Intermediate
  • 1 Hour 0 Minutes
  • 8 Video Lessons
  • 6 Code Examples
  • Instructor: Vasudev Lal, Intel

What you'll learn

  • Create a sophisticated question-answering system that processes, understands, and interacts with complex multimodal data.

  • Explore the concept of multimodal semantic space and its importance in AI.

  • Learn the differences between traditional RAG and multimodal RAG systems, focusing on the complexities of integrating different models.

This course, developed in partnership with Intel, teaches you to build an interactive system for querying video content using multimodal AI. You’ll create a sophisticated question-answering system that processes, understands, and interacts with video. 

Increasingly, language models and AI applications can process images, audio, and video. In this course, you will explore these capabilities by implementing a multimodal RAG system. You will use a multimodal embedding model to embed images and their captions in a shared multimodal semantic space, build a retrieval system on that space that returns images for text prompts, and use a Large Vision Language Model (LVLM) to generate responses from the retrieved images and text.
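The retrieval step described above can be sketched in a few lines. In a real system, a model such as BridgeTower maps both images and text into the same high-dimensional space; here the embeddings are small illustrative placeholder vectors, not model outputs, and the frame names are hypothetical.

```python
from math import sqrt

# Placeholder embeddings standing in for a multimodal model's output:
# a real model (e.g., BridgeTower) would produce much higher-dimensional
# vectors for each video frame. These 3-d vectors are purely illustrative.
EMBEDDED_FRAMES = {
    "frame_dog.png":   [0.9, 0.1, 0.0],
    "frame_city.png":  [0.1, 0.8, 0.2],
    "frame_ocean.png": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the k frames whose embeddings are most similar to the query."""
    ranked = sorted(
        EMBEDDED_FRAMES,
        key=lambda f: cosine(query_vec, EMBEDDED_FRAMES[f]),
        reverse=True,
    )
    return ranked[:k]

# Pretend a text query like "a dog playing" embeds near the dog frame.
print(retrieve([1.0, 0.0, 0.1]))  # ['frame_dog.png']
```

Because text and images share one space, the same similarity function retrieves images from text prompts, which is the core idea behind multimodal retrieval.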

By the end of this course, you’ll have the expertise to create AI systems that can intelligently interact with video content. This skill set opens up possibilities for developing advanced search engines that understand visual context, creating AI assistants capable of discussing video content, and building automated systems for video content analysis and summarization. Whether you’re looking to enhance content management systems, improve accessibility features, or push the boundaries of human-AI interaction, the techniques learned in this course will provide a solid foundation for innovation in multimodal AI applications.

In this course, you will make API calls to access multimodal models hosted by Prediction Guard on Intel’s cloud.

About this course

  • Introduction to Multimodal RAG Systems: Understand the architecture of multimodal RAG systems and interact with a Gradio app demonstrating multimodal video chat capabilities.
  • Multimodal Embedding with BridgeTower: Explore the BridgeTower model to create joint embeddings for image-caption pairs, measure similarities, and visualize high-dimensional embeddings.
  • Video Pre-processing for Multimodal RAG: Learn to extract frames and transcripts from videos, generate transcriptions using the Whisper model, and create captions using Large Vision Language Models (LVLMs).
  • Building a Multimodal Vector Database: Implement multimodal retrieval using LanceDB and LangChain, performing similarity searches on multimodal data.
  • Leveraging Large Vision Language Models (LVLMs): Understand the architecture of LVLMs like LLaVA and implement image captioning, visual question answering, and multi-turn conversations.

Key technologies and concepts

  • Multimodal Embedding Models: BridgeTower for creating joint embeddings of image-caption pairs
  • Video Processing: Whisper model for transcription, LVLMs for captioning
  • Vector Stores: LanceDB for efficient storage and retrieval of high-dimensional vectors
  • Retrieval Systems: LangChain for building a retrieval pipeline 
  • Large Vision Language Models (LVLMs): LLaVA 1.5 for advanced visual-textual understanding
  • APIs and Cloud Infrastructure: Prediction Guard APIs, Intel Gaudi AI accelerators, Intel Developer Cloud
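To make the vector-store role above concrete, here is a minimal in-memory stand-in. This is not LanceDB's actual API; it is a toy sketch showing what a vector store fundamentally does: hold vectors with metadata and answer nearest-neighbor queries. Real stores add persistence and approximate-nearest-neighbor indexing.

```python
from math import sqrt

class ToyVectorStore:
    """In-memory stand-in for a vector store such as LanceDB.
    Performs an exact linear scan, which is fine for small demos."""

    def __init__(self):
        self.rows = []  # each row: (vector, metadata dict)

    def add(self, vector, metadata):
        self.rows.append((vector, metadata))

    def search(self, query, k=3):
        """Return metadata of the k rows closest to the query (Euclidean)."""
        def dist(row):
            vec, _ = row
            return sqrt(sum((a - b) ** 2 for a, b in zip(query, vec)))
        return [meta for _, meta in sorted(self.rows, key=dist)[:k]]

# Hypothetical frame/transcript segments, with placeholder 2-d embeddings.
store = ToyVectorStore()
store.add([0.9, 0.1], {"frame": "clip_001.png", "transcript": "welcome to the course"})
store.add([0.1, 0.9], {"frame": "clip_042.png", "transcript": "training the model"})
print(store.search([1.0, 0.0], k=1))
```

In the course, LangChain wires a real store like this into a retrieval pipeline, so the query embedding, search, and result handling happen behind one retriever interface.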

Hands-on project

Throughout the course, you’ll build a complete multimodal RAG system that:

  • Processes and embeds video content (frames, transcripts, and captions)
  • Stores multimodal data in a vector database
  • Retrieves relevant video segments given text queries
  • Generates contextual responses using LVLMs
  • Maintains multi-turn conversations about video content
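The project steps above can be sketched end to end as a chain of functions. Every function and name here is a hypothetical placeholder standing in for the real components the course uses (Whisper for transcription, BridgeTower for embedding, LanceDB for storage, LLaVA for generation); the point is only the shape of the pipeline.

```python
# Illustrative end-to-end pipeline; all implementations are placeholders.

def extract_frames_and_transcript(video_path):
    # Real code would decode frames and run Whisper; here we fake one segment.
    return [{"frame": f"{video_path}:frame_0", "text": "hello from the video"}]

def embed(segment):
    # Real code would call a multimodal embedding model such as BridgeTower.
    return [float(len(segment["text"])), 1.0]

def build_index(segments):
    # Real code would insert into a vector store such as LanceDB.
    return [(embed(s), s) for s in segments]

def retrieve(index, query_vec, k=1):
    # Nearest segments by squared Euclidean distance.
    ranked = sorted(
        index,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query_vec)),
    )
    return [seg for _, seg in ranked[:k]]

def answer(query, context):
    # Real code would prompt an LVLM (e.g., LLaVA) with the frame and text.
    return f"Q: {query} | context: {context['text']}"

segments = extract_frames_and_transcript("lecture.mp4")
index = build_index(segments)
top = retrieve(index, embed({"text": "hello"}))[0]
print(answer("What is said?", top))
```

Multi-turn conversation then amounts to carrying prior question-answer pairs along with the retrieved context into each new LVLM prompt.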

Who should join?

Anyone with intermediate to advanced knowledge of Python programming, familiarity with machine learning concepts and deep learning frameworks, and a basic understanding of natural language processing and computer vision.

Course Outline

8 Lessons・6 Code Examples
  • Introduction

    Video ・ 4 mins

  • Interactive Demo and Multimodal RAG System Architecture

    Video with code examples ・ 7 mins

  • Multimodal Embeddings

    Video with code examples ・ 9 mins

  • Preprocessing Videos for Multimodal RAG

    Video with code examples ・ 9 mins

  • Multimodal Retrieval from Vector Stores

    Video with code examples ・ 6 mins

  • Large Vision-Language Models (LVLMs)

    Video with code examples ・ 7 mins

  • Multimodal RAG with Multimodal LangChain

    Video with code examples ・ 13 mins

  • Conclusion

    Video ・ 1 min

Instructor

Vasudev Lal

Principal AI Research Scientist at Intel Labs

Course access is free for a limited time during the DeepLearning.AI learning platform beta!
