OpenAI’s IMO Team on Why Models Are Finally Solving Elite-Level Math

July 30, 2025 · 30 min

🤖 AI Summary

Overview

This episode explores how a small team at OpenAI achieved a groundbreaking milestone: gold medal performance at the International Mathematical Olympiad (IMO). The discussion delves into the techniques, challenges, and implications of this achievement, including the use of general-purpose reinforcement learning, the model's surprising self-awareness, and the broader potential for AI in reasoning and problem-solving.

Notable Quotes

- "The pace that it's just blown through all of these math benchmarks is really astonishing." (Noam Brown, on the rapid progress of AI in mathematics)

- "The fact that your model knew that it couldn't solve problem 6 was one of the things that gave you hope." (Sonya Huang, on the model's self-awareness)

- "There's an intimidating gap between these time-boxed competition problems and a real research breakthrough, which takes a year's worth of work." (Alex Wei, on the challenges of scaling AI to solve real-world problems)

🧮 The Journey to IMO Gold

- Origins of the effort: The team had long considered the IMO gold a key milestone, but the final push to achieve it came together in just two months.

- Team dynamics: Despite being a small, three-person team, they built on the foundational work of many others at OpenAI.

- Empowered research culture: OpenAI’s environment allowed researchers like Alex Wei to pursue ambitious, high-risk ideas, even in the face of initial skepticism.

🛠️ Techniques and Innovations

- General-purpose reinforcement learning: The team prioritized scalable, general-purpose techniques over narrow, bespoke solutions like Lean (a formal proof assistant).

- Scaling test-time compute: They developed methods to enable the model to reason for extended periods, from minutes to hours, which was critical for solving IMO-level problems.

- Parallel compute: Multi-agent setups were used to scale test-time compute in parallel, with an emphasis on generality so the approach carries over beyond competition math (a rough illustrative sketch follows this list).
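
The episode doesn't spell out the exact recipe, so the snippet below is only a minimal sketch of the general pattern behind scaled, parallel test-time compute: run many independent, long reasoning attempts and then pick among them. The `generate_attempt` and `score_attempt` functions are hypothetical stand-ins rather than OpenAI APIs, and majority voting is just one of several possible aggregation strategies.

```python
import asyncio
import random
from collections import Counter

# Hypothetical stand-in for one long reasoning rollout by a model.
# In a real system this would be a (possibly hours-long) model call.
async def generate_attempt(problem: str, seed: int) -> str:
    random.seed(seed)
    await asyncio.sleep(0)  # placeholder for actual compute
    return f"candidate-proof-{random.randint(0, 3)}"

# Hypothetical self-evaluation: score how convincing an attempt looks.
async def score_attempt(problem: str, attempt: str) -> float:
    await asyncio.sleep(0)
    return random.random()

async def solve_with_parallel_compute(problem: str, n_attempts: int = 8) -> str:
    # Spend more test-time compute by running many attempts in parallel.
    attempts = await asyncio.gather(
        *(generate_attempt(problem, seed=i) for i in range(n_attempts))
    )
    # One simple aggregation strategy: prefer the most frequent answer,
    # breaking ties with a (model-judged) score.
    counts = Counter(attempts)
    scores = await asyncio.gather(
        *(score_attempt(problem, a) for a in attempts)
    )
    best = max(zip(attempts, scores), key=lambda pair: (counts[pair[0]], pair[1]))
    return best[0]

if __name__ == "__main__":
    print(asyncio.run(solve_with_parallel_compute("IMO-style problem statement")))
```

In practice, the attempt and scoring steps would be model calls that can each run for minutes to hours, which is where the bulk of the test-time compute goes.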

🤔 Self-Awareness and Hard-to-Verify Tasks

- Problem 6 insight: The model's ability to recognize its limitations and refrain from generating incorrect answers marked a significant leap in AI self-awareness (a hedged sketch of how a harness might surface such abstention follows this list).

- Verification challenges: Outputs were graded by external IMO medalists to ensure correctness, as even the team members found the proofs beyond their comprehension.

- Human readability: While the model’s proofs were initially difficult to read, the team opted for transparency by publishing raw outputs.
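
The abstention on problem 6 is described in the episode as emergent model behavior, not hand-coded logic. Purely as an illustration of the surrounding idea, the hypothetical harness below shows one way a system could report "no solution found" instead of emitting a low-confidence proof on a hard-to-verify task; the `Attempt` type and the confidence threshold are assumptions, not details from the episode.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    proof: str
    # Model-reported confidence mapped to [0, 1]; hypothetical, since the
    # real model's self-reports were free-form natural language.
    confidence: float

def select_or_abstain(attempts: list[Attempt], threshold: float = 0.8) -> Optional[str]:
    """Return the most confident proof, or None to signal 'could not solve'."""
    if not attempts:
        return None
    best = max(attempts, key=lambda a: a.confidence)
    # On a hard-to-verify task, refusing to answer below a confidence
    # threshold beats emitting a plausible-looking but wrong proof.
    return best.proof if best.confidence >= threshold else None

# Usage: with only low-confidence attempts, the harness reports abstention.
result = select_or_abstain([Attempt("sketchy argument", 0.35)])
print("no solution found" if result is None else result)
```

On hard-to-verify tasks, an explicit abstention is often more useful than a confident-sounding but wrong answer, which is why the team read the problem 6 behavior as a hopeful sign.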

🚀 Implications and Future Directions

- Beyond competition math: The next frontier involves tackling problems requiring deeper reasoning over longer timeframes, such as research-level mathematics.

- Scientific reasoning: The techniques developed for IMO are expected to enhance AI capabilities in other domains, including scientific and general reasoning.

- Creating novel problems: While the model excels at solving problems, generating new, meaningful challenges remains a significant hurdle.

🎢 The IMO Day Experience

- Real-time monitoring: The team stayed up late to observe the model’s progress, with Alex Wei hand-checking results out of curiosity.

- Model behavior: The AI expressed its confidence or uncertainty in natural language during problem-solving, offering insights into its reasoning process.

- Celebration and reflection: The achievement was both thrilling and humbling, highlighting the vast gap between competition-level and research-level problem-solving.

AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.

📋 Episode Description

In just two months, a scrappy three-person team at OpenAI sprinted to achieve what the entire AI field has been chasing for years—gold-level performance on International Mathematical Olympiad problems. Alex Wei, Sheryl Hsu and Noam Brown discuss their unique approach using general-purpose reinforcement learning techniques on hard-to-verify tasks rather than formal verification tools. The model showed surprising self-awareness by admitting it couldn't solve problem six, and revealed the humbling gap between solving competition problems and genuine mathematical research breakthroughs.


Hosted by Sonya Huang, Sequoia Capital