Short Course・Intermediate・1 Hour 0 Minutes

Multimodal RAG: Chat with Videos

Instructor: Vasudev Lal

8 Video Lessons
6 Code Examples
In partnership with Intel

What you'll learn

Create a sophisticated question-answering system that processes, understands, and interacts with complex multimodal data.
Explore the concept of multimodal semantic space and its importance in AI.
Learn the differences between traditional RAG and multimodal RAG systems, focusing on the complexities of integrating different models.

This course, developed in partnership with Intel, teaches you to build an interactive system for querying video content using multimodal AI. You’ll create a sophisticated question-answering system that processes, understands, and interacts with video.

More and more language models and AI applications can process images, audio, and video. In this course, you will learn more about these models and applications by implementing a multimodal RAG system. You will use a multimodal embedding model to embed images and captions in a shared multimodal semantic space. Using that common space, you will build a retrieval system that returns images in response to text prompts. Finally, you will pass the retrieved images and text to a Large Vision Language Model (LVLM) to generate a response.
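
The retrieval idea described above can be sketched in a few lines. This is a minimal illustration, not the course's code: it assumes the text prompt and the video frames have already been embedded into the same space, and it uses made-up 3-dimensional vectors and hypothetical file names in place of real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k=1):
    """Return the k index entries most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vec, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; a real embedding model returns far more dimensions.
index = [
    {"image": "frame_001.jpg", "embedding": [0.9, 0.1, 0.0]},
    {"image": "frame_002.jpg", "embedding": [0.1, 0.9, 0.2]},
    {"image": "frame_003.jpg", "embedding": [0.0, 0.2, 0.9]},
]

# Stands in for the embedding of a text prompt in the same space.
query_embedding = [0.2, 0.8, 0.1]

top = retrieve(query_embedding, index, k=1)
print(top[0]["image"])  # frame_002.jpg
```

Because text and images live in one space, the same similarity function ranks image embeddings against a text query without any modality-specific logic. A vector database performs the same ranking at scale.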

By the end of this course, you’ll have the expertise to create AI systems that can intelligently interact with video content. This skill set opens up possibilities for developing advanced search engines that understand visual context, creating AI assistants capable of discussing video content, and building automated systems for video content analysis and summarization. Whether you’re looking to enhance content management systems, improve accessibility features, or push the boundaries of human-AI interaction, the techniques learned in this course will provide a solid foundation for innovation in multimodal AI applications.

In this course, you will make API calls to access multimodal models hosted by Prediction Guard on Intel’s cloud.

About this course

Introduction to Multimodal RAG Systems: Understand the architecture of multimodal RAG systems and interact with a Gradio app demonstrating multimodal video chat capabilities.
Multimodal Embedding with BridgeTower: Explore the BridgeTower model to create joint embeddings for image-caption pairs, measure similarities, and visualize high-dimensional embeddings.
Video Pre-processing for Multimodal RAG: Learn to extract frames and transcripts from videos, generate transcriptions using the Whisper model, and create captions using Large Vision Language Models (LVLMs).
Building a Multimodal Vector Database: Implement multimodal retrieval using LanceDB and LangChain, performing similarity searches on multimodal data.
Leveraging Large Vision Language Models (LVLMs): Understand the architecture of LVLMs like LLaVA and implement image captioning, visual question answering, and multi-turn conversations.
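
One way to picture the pre-processing step is to pair each transcript segment with a single representative frame. The sketch below assumes WebVTT-style timestamps and picks the midpoint of each segment as the moment to grab a frame; both choices are assumptions for illustration, and the sample transcript lines are invented. Reading the actual frame from the video file (for example with a video library) is left out.

```python
def parse_timestamp(ts):
    """Convert an 'HH:MM:SS.mmm' timestamp into milliseconds."""
    hours, minutes, seconds = ts.split(":")
    total_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return int(round(total_seconds * 1000))


def segment_midpoints(segments):
    """Pair each transcript segment with the time at which to extract a frame."""
    pairs = []
    for start, end, text in segments:
        start_ms = parse_timestamp(start)
        end_ms = parse_timestamp(end)
        pairs.append({
            "frame_time_ms": (start_ms + end_ms) // 2,  # midpoint of the segment
            "transcript": text,
        })
    return pairs


# Invented sample segments: (start, end, text).
segments = [
    ("00:00:00.000", "00:00:04.000", "Welcome to the demo."),
    ("00:00:04.000", "00:00:09.500", "Here the astronaut leaves the station."),
]

pairs = segment_midpoints(segments)
for pair in pairs:
    print(pair["frame_time_ms"], pair["transcript"])
```

Each resulting frame-and-text pair is the unit that would later be embedded and stored for retrieval.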

Key technologies and concepts

Multimodal Embedding Models: BridgeTower for creating joint embeddings of image-caption pairs
Video Processing: Whisper model for transcription, LVLMs for captioning
Vector Stores: LanceDB for efficient storage and retrieval of high-dimensional vectors
Retrieval Systems: LangChain for building a retrieval pipeline
Large Vision Language Models (LVLMs): LLaVA 1.5 for advanced visual-textual understanding
APIs and Cloud Infrastructure: Prediction Guard APIs, Intel Gaudi AI accelerators, Intel Developer Cloud

Hands-on project

Throughout the course, you’ll build a complete multimodal RAG system that:

Processes and embeds video content (frames, transcripts, and captions)
Stores multimodal data in a vector database
Retrieves relevant video segments given text queries
Generates contextual responses using LVLMs
Maintains multi-turn conversations about video content
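
The last three steps in the list above (retrieve, generate, keep the conversation going) can be sketched as a small loop. Everything here is hypothetical: the VideoChat class, the USER/ASSISTANT prompt layout, and the stand-in retrieve and generate functions are illustrative only and are not the API of Prediction Guard, LangChain, or any LVLM.

```python
def build_prompt(history, transcript, question):
    """Assemble a text prompt from earlier turns, retrieved context, and a new question."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"USER: {past_question}")
        lines.append(f"ASSISTANT: {past_answer}")
    lines.append(f"USER: Transcript of the retrieved segment: '{transcript}'. {question}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)


class VideoChat:
    """Hypothetical wrapper that keeps conversation history across questions."""

    def __init__(self, retrieve, generate):
        self.retrieve = retrieve  # question -> {"frame_path": ..., "transcript": ...}
        self.generate = generate  # (prompt, frame_path) -> answer text
        self.history = []

    def ask(self, question):
        hit = self.retrieve(question)
        prompt = build_prompt(self.history, hit["transcript"], question)
        answer = self.generate(prompt, hit["frame_path"])
        self.history.append((question, answer))
        return answer


seen_prompts = []  # records what the stand-in model receives


def fake_retrieve(question):
    # Stand-in for a vector-store lookup.
    return {"frame_path": "frame_002.jpg", "transcript": "The astronaut leaves the station."}


def fake_generate(prompt, frame_path):
    # Stand-in for an LVLM call that would receive the prompt and the image.
    seen_prompts.append(prompt)
    return f"(answer based on {frame_path})"


chat = VideoChat(fake_retrieve, fake_generate)
first = chat.ask("What is happening?")
second = chat.ask("Where does it take place?")
print(seen_prompts[1])
```

Passing the retriever and the model in as plain functions keeps the conversation logic separate from any particular vector store or model endpoint, so either can be swapped without touching the history handling.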

Who should join?

Anyone with intermediate to advanced knowledge of Python programming, familiarity with machine learning concepts and deep learning frameworks, and a basic understanding of natural language processing and computer vision.

Course Outline

8 Lessons・6 Code Examples

Introduction
Video・4 mins
Interactive Demo and Multimodal RAG System Architecture
Video with code examples・7 mins
Multimodal Embeddings
Video with code examples・9 mins
Preprocessing Videos for Multimodal RAG
Video with code examples・9 mins
Multimodal Retrieval from Vector Stores
Video with code examples・6 mins
Large Vision Language Models (LVLMs)
Video with code examples・7 mins
Multimodal RAG with Multimodal LangChain
Video with code examples・13 mins
Conclusion
Video・1 min

Instructor

Vasudev Lal

Principal AI Research Scientist at Intel Labs