The next big breakthrough will be AIs learning on the job

June 26, 2026 19 min
🤖 AI Summary

Overview

This episode explores the future of AI development, focusing on the challenges and potential breakthroughs in creating artificial general intelligence (AGI). Key topics include reinforcement learning (RL), continual learning, sample efficiency, and speculative ideas like dreaming for AI training. The discussion emphasizes the limitations of current AI paradigms and envisions a future where AIs learn dynamically from real-world deployment.

Notable Quotes

- What if you could just fit six months of on-the-job learning into the context window? – On the potential of extended context windows to replace continual learning.

- Unless you can build a very replayable training target for a domain, the models will struggle to make much progress. – On the importance of grindable environments for AI training.

- Every time you interact with an AI, it'll be smarter—not just from your sessions, but from all its interactions with users worldwide. – On the transformative potential of continual learning.

🧠 The Big Bet on Reinforcement Learning (RL)

- AI labs are betting that training models on millions of verifiable tasks across diverse RL environments will lead to AGI.

- Critics argue that current models are data-inefficient and lack continual learning, but proponents believe scaling training will overcome these deficits.

- RL training has enabled AIs to solve increasingly complex problems, particularly in coding and math, but challenges remain in domains like computer use.

🔍 Grindability vs. Verifiability in AI Training

- Verifiability alone isn’t enough for AI progress; domains must also be grindable, meaning they allow for parallel, replayable simulations.

- Computer use lags behind coding because it lacks scalable, deterministic simulators. Creating high-fidelity clones of applications like Slack or Gmail could accelerate progress.

- Real-world domains like politics or business-building are even harder to simulate, highlighting the need for sample-efficient learning.
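To make "grindable" concrete, here is a minimal Python sketch of a toy environment that is verifiable (a program computes the reward) and also replayable and parallelizable (every episode is fully determined by its seed). The guess-the-number task, the `Episode` record, and the `run_episode` and `run_batch` helpers are all hypothetical illustrations, not anything described in the episode.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    seed: int
    target: int
    guesses: tuple
    reward: float


def run_episode(seed: int, max_steps: int = 8) -> Episode:
    """Play one episode of a toy guess-the-number task.

    All randomness comes from a private RNG built from `seed`, so the
    same seed always reproduces the same episode (replayable).
    """
    rng = random.Random(seed)
    target = rng.randint(0, 99)
    low, high = 0, 99
    guesses = []
    for _ in range(max_steps):
        # Stand-in for a policy: guess uniformly inside the remaining range.
        guess = rng.randint(low, high)
        guesses.append(guess)
        if guess == target:
            break
        if guess < target:
            low = guess + 1
        else:
            high = guess - 1
    # Verifier: the reward is computed by code, with no human judgment.
    reward = 1.0 if guesses[-1] == target else 0.0
    return Episode(seed, target, tuple(guesses), reward)


def run_batch(seeds, workers: int = 4):
    """Run many independent episodes concurrently (parallelizable)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_episode, seeds))


if __name__ == "__main__":
    first = run_batch(range(16))
    second = run_batch(range(16))
    print("replay is identical:", first == second)
    print("mean reward:", sum(e.reward for e in first) / len(first))
```

A live application such as a real Slack workspace has neither property by default: its state changes between runs and it cannot be cheaply cloned thousands of times, which is the gap a high-fidelity simulator would fill.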

⚙️ Continual Learning and Weight Updates

- Current AIs can’t effectively learn from real-world deployment because they don’t update their weights based on new experiences.

- Human learning involves compressing insights into long-term memory, whereas AIs rely on in-context learning, which is memory-intensive and unsustainable.

- Techniques like on-policy self-distillation (OPSD) aim to distill session learnings into model weights, offering a more targeted and efficient approach than traditional RL.
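The summary does not spell out how OPSD works, so the following is only a toy sketch under one plausible reading: the "teacher" is the model with the session in its context, the "student" is the same model without that context, and the student is trained to match the teacher on the student's own outputs (a reverse-KL objective). Everything here is an assumption made for illustration: a single three-way choice stands in for text generation, the teacher distribution is fixed, and the expectation over the student's samples is computed exactly because the action space is tiny.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(p, q):
    """KL(p || q): expectation under the student's own distribution p."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)


def distill(student_logits, teacher_probs, lr=0.5, steps=500):
    """Gradient descent on KL(student || teacher) over the student's logits.

    For softmax logits z, d KL / d z_i = p_i * ((log p_i - log q_i) - KL).
    """
    logits = list(student_logits)
    for _ in range(steps):
        p = softmax(logits)
        kl = reverse_kl(p, teacher_probs)
        grad = [pi * ((math.log(pi) - math.log(qi)) - kl)
                for pi, qi in zip(p, teacher_probs)]
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits


if __name__ == "__main__":
    # Hypothetical numbers: with the session in context, the model
    # strongly prefers option 0; without it, it has no preference.
    teacher = [0.7, 0.2, 0.1]
    student = [0.0, 0.0, 0.0]
    print("KL before:", reverse_kl(softmax(student), teacher))
    trained = distill(student, teacher)
    print("KL after: ", reverse_kl(softmax(trained), teacher))
    print("student now:", [round(p, 3) for p in softmax(trained)])
```

After training, the student reproduces the context-informed preference without needing the context, which is the sense in which session learnings end up in the weights. A real system would work over token sequences and sampled rollouts rather than an exact three-way expectation.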

💭 Dreaming: A Speculative Training Paradigm

- Dreaming involves AIs creating their own simulations to practice skills and strategies, akin to humans rehearsing scenarios in their minds.

- This approach could provide a new axis of scaling, allowing AIs to train on simulated environments tailored to real-world tasks.

- While promising, building accurate simulations of complex real-world domains remains a significant challenge.

🚀 The Vision for 2027 and Beyond

- By 2027, AIs could achieve extended context lengths, enabling week-long collaborative sessions with users.

- Continual learning techniques like OPSD and dreaming could allow AIs to improve dynamically from real-world deployment.

- The ultimate goal is for AIs to expand their capabilities far beyond their initial training, learning from diverse user interactions and tasks across the economy.

AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.

📋 Episode Description

Thanks to Mercury for sponsoring this essay.

Mercury has automated basically my entire bill pay process for my business. I just give contractors a dedicated email address, and when they send an invoice, Mercury automatically creates a draft payment for me to review. I no longer have to hunt through my inbox for invoices or deal with messy spreadsheets to track my bills. Mercury handles it all. Learn more at mercury.com

Timestamps:

(00:00:00) – The big research bet the labs are making

(00:02:12) – Grindability is just as important as verifiability

(00:06:10) – Will RLVR alone generalize?

(00:08:41) – Getting the learning back to the weights

(00:15:22) – Dreaming

(00:17:23) – What 2027 looks like



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe