Short Course・Beginner・1 Hour 10 Minutes

Preprocessing Unstructured Data for LLM Applications

Instructor: Matt Robinson

Unstructured
  • 8 Video Lessons
  • 6 Code Examples

What you'll learn

  • Learn to extract and normalize content from a wide variety of document types, such as PDF, PowerPoint, Word, and HTML files, as well as tables and images, to expand the information accessible to your LLM.

  • Enrich your content with metadata to improve retrieval-augmented generation (RAG) results and support more nuanced search capabilities.

  • Explore document image analysis techniques, such as layout detection, vision transformers, and table transformers, and learn how to apply these methods to preprocess PDFs, images, and tables.
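
To make the metadata point concrete, here is a minimal sketch of how metadata attached during preprocessing can narrow a search before any relevance ranking happens. It uses only the Python standard library. The `Chunk` class, the `search` function, the metadata keys (`filetype`, `page_number`), and the word-overlap score are all hypothetical stand-ins chosen for illustration; they are not the course's code or any library's API, and a real RAG system would typically rank with embeddings in a vector store.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of document text plus metadata attached during preprocessing."""
    text: str
    metadata: dict = field(default_factory=dict)


def search(chunks, query, **filters):
    """Return chunks that match every metadata filter, ranked by word overlap.

    The overlap score is a toy stand-in for embedding similarity.
    """
    query_words = set(query.lower().split())
    scored = []
    for chunk in chunks:
        # Metadata filter: skip chunks where any requested key does not match.
        if any(chunk.metadata.get(key) != value for key, value in filters.items()):
            continue
        # Toy relevance score: how many query words appear in the chunk text.
        score = len(query_words & set(chunk.text.lower().split()))
        if score > 0:
            scored.append((score, chunk))
    # Sort on the score only, so Chunk objects are never compared directly.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]


# Hypothetical sample data, made up for this sketch.
chunks = [
    Chunk("Revenue grew in the third quarter", {"filetype": "pdf", "page_number": 2}),
    Chunk("Revenue targets for next year", {"filetype": "pptx", "page_number": 1}),
    Chunk("Team offsite agenda", {"filetype": "pdf", "page_number": 5}),
]

all_hits = search(chunks, "revenue growth")
pdf_hits = search(chunks, "revenue growth", filetype="pdf")
```

Without a filter, both revenue chunks come back; with `filetype="pdf"`, only the PDF chunk does. The same idea extends to filtering by page, section, or date when that metadata is captured at preprocessing time.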

About this course

Enhancing a RAG system’s performance depends on efficiently processing diverse unstructured data sources. 

In this course, you’ll learn techniques for representing many kinds of unstructured data, such as text, images, and tables, drawn from many different sources. You’ll then implement them to extend your LLM RAG pipeline to handle Excel, Word, PowerPoint, PDF, and EPUB files.

Join this course and learn:

  • How to preprocess data for LLM application development, with a focus on working with different document types.
  • How to extract and normalize various documents into a common JSON format and enrich it with metadata to improve search results. 
  • Techniques for document image analysis, including layout detection and vision transformers, to extract and understand PDFs, images, and tables. 
  • How to build a RAG bot that can ingest different document types, such as PDFs, PowerPoints, and Markdown files.
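
As a rough illustration of what "normalizing into a common JSON format" can mean, the sketch below converts an HTML snippet and a Markdown snippet into the same list-of-elements shape, each element carrying a type, its text, and source metadata. It relies only on the Python standard library. The function names (`normalize_html`, `normalize_markdown`), the element type labels, and the metadata fields are assumptions made for this example; the course itself may use different tooling and a different schema.

```python
import json
from html.parser import HTMLParser


class _HTMLElements(HTMLParser):
    """Collects heading, paragraph, and list-item text from an HTML string."""

    # Illustrative mapping from HTML tags to element type labels.
    TAG_TYPES = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # element type of the tag being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_TYPES:
            self._current = self.TAG_TYPES[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.TAG_TYPES and self._current is not None:
            # Collapse runs of whitespace so text looks the same across formats.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.elements.append({"type": self._current, "text": text})
            self._current = None


def _with_metadata(elements, filename, filetype):
    """Attach the same metadata fields to every element, whatever its source."""
    for element in elements:
        element["metadata"] = {"filename": filename, "filetype": filetype}
    return elements


def normalize_html(source, filename):
    parser = _HTMLElements()
    parser.feed(source)
    parser.close()
    return _with_metadata(parser.elements, filename, "text/html")


def normalize_markdown(source, filename):
    """Very small Markdown handler: '#' blocks become titles, the rest text."""
    elements = []
    for block in source.split("\n\n"):
        block = " ".join(block.split())
        if not block:
            continue
        if block.startswith("#"):
            elements.append({"type": "Title", "text": block.lstrip("#").strip()})
        else:
            elements.append({"type": "NarrativeText", "text": block})
    return _with_metadata(elements, filename, "text/markdown")


# Hypothetical inputs; both end up in one uniform list of records.
records = normalize_html("<h1>Report</h1><p>Sales rose.</p>", "report.html")
records += normalize_markdown("# Notes\n\nFollow up on sales.", "notes.md")
print(json.dumps(records, indent=2))
```

Because every record has the same keys regardless of where it came from, downstream steps such as chunking, embedding, and metadata filtering can be written once rather than per file format.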

Apply the skills you’ll learn in this course to real-world scenarios, enhancing your RAG application and expanding its versatility.

Who should join?

Anyone who is interested in learning how to effectively process and use diverse data types and formats to build high-performing LLM RAG systems.

Course Outline

8 Lessons・6 Code Examples
  • Introduction

    Video・4 mins

  • Overview of LLM Data Preprocessing

    Video・3 mins

  • Normalizing the Content

    Video with code examples・14 mins

  • Metadata Extraction and Chunking

    Video with code examples・21 mins

  • Preprocessing PDFs and Images

    Video with code examples・10 mins

  • Extracting Tables

    Video with code examples・6 mins

  • Build Your Own RAG Bot

    Video with code examples・9 mins

  • Conclusion

    Video・1 min

  • Appendix - Tips and Help

    Code examples・1 min

Instructor

Matt Robinson

Head of Product at Unstructured

    Course access is free for a limited time during the DeepLearning.AI learning platform beta!
