Evaluating AI Agents
Instructors: John Gilhuly, Aman Khan

- Beginner
- 2 Hours 16 Minutes
- 15 Video Lessons
- 6 Code Examples
What you'll learn
Learn how to add observability to your agent so you can gain insight into each step it takes and debug it effectively.
Learn how to set up evaluations for the agent components by preparing testing examples, choosing the appropriate evaluator (code-based or LLM-as-a-Judge), and identifying the right metrics.
Learn how to structure your evaluation into experiments to iterate on and improve the output quality and the path taken by your agent.
About this course
Learn how to systematically assess and improve your AI agent’s performance in Evaluating AI Agents, a short course built in partnership with Arize AI and taught by John Gilhuly, Head of Developer Relations, and Aman Khan, Director of Product.
When you’re building an AI agent, evaluations (evals) are an important part of the development process. Whether you’re building a shopping assistant, coding agent, or research assistant, a structured evaluation process helps you refine its performance systematically rather than relying on trial and error.
With a systematic approach, you structure your evaluations to assess the performance of each component of the agent, as well as its end-to-end performance. For each component, you select the appropriate evaluators, testing examples, and metrics. This process helps you identify areas for improvement so you can iterate on your agent during development and in production.
In this course, you’ll build an AI agent, add observability to visualize and debug its steps, and evaluate its performance component-wise.
In detail, you’ll:
- Distinguish between evaluating LLM-based systems and testing traditional software.
- Explore the basic structure of AI agents – routers, skills, and memory – and implement an AI agent from scratch (a minimal router sketch follows this list).
- Add observability to the agent by collecting and visualizing traces of the steps it takes (see the tracing sketch below).
- Choose the appropriate evaluator – code-based, LLM-as-a-Judge, or human annotation – for each component of the agent.
- Set up evaluations for the agent’s skills and router decisions using code-based and LLM-as-a-Judge evaluators, creating test examples from collected traces and writing detailed prompts for the LLM-as-a-Judge (see the judge sketch below).
- Compute a convergence score to evaluate whether the agent can respond to a query in an efficient number of steps (see the convergence sketch below).
- Run structured experiments to improve the agent’s performance by exploring changes to the prompt, the underlying LLM, or its logic.
- Understand how to deploy these evaluation techniques to monitor the agent’s performance in production.
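To make the router-and-skills structure concrete, here is a minimal sketch of such an agent in Python. It assumes an OpenAI-style chat API; the skill names, model name, and routing prompt are illustrative, not the course’s code.

```python
# A minimal router-and-skills agent sketch. The skills are stubs and the
# routing prompt is illustrative, not the course's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def lookup_sales_data(question: str) -> str:
    """Skill stub: in a real agent this would query a database."""
    return "placeholder sales data"

def analyze_data(question: str) -> str:
    """Skill stub: in a real agent this would run an analysis prompt."""
    return "placeholder analysis"

SKILLS = {"lookup_sales_data": lookup_sales_data, "analyze_data": analyze_data}

def route(question: str) -> str:
    """Router: ask the LLM which skill should handle the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {"role": "system",
             "content": "Reply with exactly one skill name: " + ", ".join(SKILLS)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

def run_agent(question: str) -> str:
    skill = route(question)
    handler = SKILLS.get(skill, lookup_sales_data)  # fall back if the router misfires
    return handler(question)
```

Component-wise evaluation then means testing the router’s decisions and each skill separately, as the lessons do.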
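Tracing records each step the agent takes so its path can be inspected later. The course uses Arize-style tooling; as a rough sketch of the same idea using plain OpenTelemetry (span names are illustrative, and `route`/`SKILLS` come from the sketch above):

```python
# A minimal tracing sketch with OpenTelemetry; trace-visualization tools
# build on the same span model, though their setup differs.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def traced_run(question: str) -> str:
    # Each agent step becomes a span, so the full path can be replayed later.
    with tracer.start_as_current_span("agent_run") as run_span:
        run_span.set_attribute("input.question", question)
        with tracer.start_as_current_span("router"):
            skill = route(question)  # router from the sketch above
        with tracer.start_as_current_span(f"skill:{skill}"):
            answer = SKILLS.get(skill, lookup_sales_data)(question)
        run_span.set_attribute("output.answer", answer)
        return answer
```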
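An LLM-as-a-Judge evaluator is itself just an LLM call with a grading prompt. A minimal sketch, assuming an OpenAI-style API; the rubric, labels, and model name are illustrative assumptions, not the course’s exact evaluator:

```python
# A minimal LLM-as-a-Judge sketch: grade answers with a labeling prompt,
# then aggregate the labels into a metric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Is the answer relevant and responsive to the question?
Reply with exactly one word: correct or incorrect."""

def judge(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower()

# Score test examples collected from traces (example data is made up).
examples = [("What were Q3 sales?", "Q3 sales were $1.2M, up 8% from Q2.")]
labels = [judge(q, a) for q, a in examples]
print("accuracy:", labels.count("correct") / len(labels))
```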
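One way to formalize a convergence score (assumed here; not necessarily the course’s exact definition) is to treat the shortest observed run as optimal and average optimal-over-actual step counts across runs of the same query:

```python
# Convergence-score sketch: treat the shortest observed run as optimal,
# then average optimal_steps / actual_steps over all runs of the query.
def convergence_score(step_counts: list[int]) -> float:
    optimal = min(step_counts)  # shortest observed path, assumed optimal
    return sum(optimal / n for n in step_counts) / len(step_counts)

# Four runs of the same query took 3, 3, 5, and 6 steps.
print(convergence_score([3, 3, 5, 6]))  # 1.0 would mean every run was optimal
```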
By the end of this course, you’ll know how to trace AI agents, systematically evaluate them, and improve their performance.
Who should join?
Anyone with basic Python knowledge who wants to learn to evaluate, troubleshoot, and improve AI agents effectively, both during development and in production. Familiarity with prompting an LLM is helpful but not required.
Course Outline
15 Lessons・6 Code Examples
Introduction
Video・3 mins
Evaluation in the time of LLMs
Video・7 mins
Decomposing agents
Video・6 mins
Lab 1: Building your agent
Video with code examples・16 mins
Tracing agents
Video・4 mins
Lab 2: Tracing your agent
Video with code examples・16 mins
Adding router and skill evaluations
Video・12 mins
Lab 3: Adding router and skill evaluations
Video with code examples・17 mins
Adding trajectory evaluations
Video・5 mins
Lab 4: Adding trajectory evaluations
Video with code examples・9 mins
Adding structure to your evaluations
Video・7 mins
Lab 5: Adding structure to your evaluations
Video with code examples・15 mins
Improving your LLM-as-a-Judge
Video・4 mins
Monitoring agents
Video・6 mins
Conclusion
Video・1 min
Appendix - Resources, Tips and Help
Code examples・1 min
Instructors
John Gilhuly, Head of Developer Relations, Arize AI
Aman Khan, Director of Product, Arize AI
Course access is free for a limited time during the DeepLearning.AI learning platform beta!