Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

September 25, 2025 · 1 hr 46 min

🤖 AI Summary

Overview

This episode dives into the emerging importance of AI evals (evaluations) as a critical skill for product builders. Hamel Husain and Shreya Shankar, creators of the leading course on AI evals, provide a comprehensive walkthrough of the eval process, including error analysis, open coding, and building automated evaluators. They also address misconceptions, share practical tips, and explore the ongoing debate between vibes and systematic evals.

Notable Quotes

- "The goal is not to do evals perfectly, it's to actionably improve your product." (Shreya Shankar)

- "It's the highest ROI activity you can engage in." (Hamel Husain)

- "I did not realize how much controversy and drama there is around evals." (Lenny Rachitsky)

🧩 What Are Evals and Why They Matter

- Evals are systematic methods to measure and improve AI applications, akin to unit tests for AI.

- They help identify and address errors in AI products, ensuring better user experiences.

- Shreya emphasized that evals are not just about testing but about creating actionable insights to improve product quality.

- The process includes error analysis, categorization (open and axial coding), and building automated evaluators.
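
To make the "unit tests for AI" analogy concrete, here is a minimal sketch in Python (not from the course itself); `run_assistant` and the pass/fail check are placeholders standing in for your own application and your own failure modes:

```python
# Minimal sketch: an eval as a batch of pass/fail "unit tests" over examples.
# Everything here is illustrative, not the course's code.

def run_assistant(user_message: str) -> str:
    """Stand-in for the AI application under evaluation."""
    raise NotImplementedError

def reply_is_nonempty(output: str) -> bool:
    # One concrete, binary check; real checks come from observed failure modes.
    return len(output.strip()) > 0

def run_eval(examples: list[str]) -> float:
    # Run every example through the app and report the fraction that pass.
    results = [reply_is_nonempty(run_assistant(x)) for x in examples]
    return sum(results) / len(results)
```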

🔍 Error Analysis: The Foundation of Evals

- The first step is manually reviewing application logs (traces) to identify errors.

- Hamel demonstrated this with a property management AI assistant, highlighting issues like poor conversational flow and hallucinated responses.

- Notes on errors should be specific and actionable (e.g., "did not confirm call transfer with user").

- Theoretical saturation is the point where no new error types are discovered, signaling when to stop manual analysis.
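
A rough sketch of what open coding during error analysis can look like in practice, assuming traces are stored as JSONL; the file names, fields, and console-based review loop are illustrative, not the tooling shown in the episode:

```python
# Illustrative open-coding loop: read raw traces, attach a free-form note to
# each, and stop once new notes stop surfacing new error types (saturation).
import json

def review_traces(path: str = "traces.jsonl", max_traces: int = 100):
    annotations = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_traces:
                break
            trace = json.loads(line)
            print(f"\n--- Trace {i} ---")
            print(trace.get("conversation", trace))
            note = input("Note the first upstream error you see (Enter to skip): ")
            annotations.append({"trace_id": trace.get("id", i), "note": note})
    # Persist notes for later categorization (axial coding).
    with open("open_coding_notes.jsonl", "w") as out:
        for a in annotations:
            out.write(json.dumps(a) + "\n")
    return annotations
```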

🤖 LLM-as-Judge vs. Code-Based Evals

- Two types of automated evaluators:

- Code-Based Evals: Use Python or other tools to test simple, binary outcomes (e.g., "Is the output valid JSON?"); see the sketch after this list.

- LLM-as-Judge: Employs an LLM to evaluate complex, subjective failure modes (e.g., "Was the human handoff appropriate?").

- LLM-as-Judge prompts should focus on binary outcomes to simplify decision-making.

- Shreya stressed the importance of validating LLM judges against human-labeled data to ensure alignment.
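
The sketch below contrasts the two evaluator types under simple assumptions: `call_llm` is a hypothetical wrapper around whatever model API you use, the judge prompt is illustrative rather than the course's, and the final helper shows one basic way to check a judge's agreement with human-labeled data:

```python
# Hedged sketch of the two automated evaluator types discussed above.
import json

# 1) Code-based eval: cheap, deterministic, binary.
def output_is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# 2) LLM-as-judge: a binary verdict on one subjective failure mode.
JUDGE_PROMPT = """You are evaluating a property-management assistant.
Question: Did the assistant hand off to a human only when appropriate,
and confirm the handoff with the user first?
Answer with exactly one word: PASS or FAIL.

Conversation:
{conversation}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for your model API call."""
    raise NotImplementedError

def judge_handoff(conversation: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(conversation=conversation))
    return verdict.strip().upper().startswith("PASS")

# 3) Validate the judge against human labels before trusting it.
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```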

⚡ Practical Tips for Implementing Evals

- Start with error analysis: Look at real data to uncover unexpected failure modes.

- Use AI tools to synthesize and categorize errors, but don’t rely on them for the initial manual analysis.

- Build lightweight tools to streamline data review and annotation.

- Focus on high-impact failure modes and prioritize evals for persistent or critical issues.

- After the initial setup (3-4 days), maintaining evals requires only ~30 minutes per week.
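
As a small illustration of prioritizing high-impact failure modes, the sketch below tallies how often each axial code (error category) appears in the annotated notes; the file name and `axial_code` field are assumptions, not a prescribed format:

```python
# Count failure-mode categories so the team can prioritize evals for the most
# frequent (or most damaging) issues. Field names are illustrative.
import json
from collections import Counter

def failure_mode_counts(path: str = "axial_coded_notes.jsonl") -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            note = json.loads(line)
            counts[note["axial_code"]] += 1
    return counts

if __name__ == "__main__":
    for code, n in failure_mode_counts().most_common():
        print(f"{n:4d}  {code}")
```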

🔥 The Debate: Vibes vs. Systematic Evals

- Some argue for vibes (intuitive testing) over systematic evals, citing examples like Claude Code.

- Shreya clarified that even vibes often involve implicit evals, such as error analysis and monitoring.

- A/B testing is complementary to evals, as both rely on systematic measurement of quality.

- Misconceptions about evals stem from narrow definitions or poorly implemented processes.

💡 Key Takeaways

- Evals are a powerful tool for improving AI products, but they require thoughtful implementation.

- The process is iterative, flexible, and tailored to the specific product and team.

- Success lies in balancing manual analysis, automation, and continuous learning.

- As Hamel noted, "It's not about having a beautiful eval suite—it's about making your product better."

AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.

📋 Episode Description

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference be