Eric Jang – Building AlphaGo from scratch

May 15, 2026 • 2 hr 37 min

🤖 AI Summary

Overview

Eric Jang explains how to build AlphaGo from scratch using modern AI tools, offering insights into the foundational principles of intelligence: search, learning from experience, and self-play. The discussion extends to comparing AlphaGo's Monte Carlo Tree Search (MCTS) with reinforcement learning (RL) in large language models (LLMs), highlighting the efficiency of MCTS in sidestepping the credit assignment problem. The episode also explores the potential and limitations of automating AI research using LLMs.

Notable Quotes

- A 10-layer neural network can amortize and approximate to a very high fidelity a nearly intractable search problem. – Eric Jang, on the profound implications of AlphaGo's design.

- The problem with Go and chess is that the other player is always trying to do some shit. – Eric Jang, humorously illustrating the dynamic challenges of adversarial games.

- Automating AI research is one of the most exciting skills being developed right now. – Eric Jang, on the transformative potential of AI in scientific discovery.

🎲 The Basics of Go and AlphaGo's Significance

- Go is a deterministic, perfect-information game with immense complexity due to its 19x19 board and vast branching factor.

- AlphaGo's breakthrough was its ability to combine deep learning with MCTS to solve a problem previously deemed computationally intractable.

- Eric highlights how AlphaGo's neural network compresses complex search processes into a few layers, enabling efficient decision-making.

- Modern Go bots like KataGo have achieved significant compute efficiency improvements, reducing training costs from millions of dollars to a few thousand.
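To put the "immense complexity" of the 19x19 board in numbers, here is a rough back-of-the-envelope sketch. The branching factor (about 250) and game length (about 150 moves) are commonly quoted approximations, not figures from the episode, and the variable names are illustrative.

```python
import math

BOARD_SIZE = 19
points = BOARD_SIZE * BOARD_SIZE  # 361 intersections on the board

# Assumed, commonly quoted approximations for Go (not exact values).
approx_branching = 250   # legal moves available in a typical position
approx_depth = 150       # moves in a typical game

# Size of the game tree, expressed as a power of ten: depth * log10(branching).
log10_game_tree = approx_depth * math.log10(approx_branching)

# Loose upper bound on board configurations: each point is empty, black, or white.
# This overcounts because many of these configurations are illegal.
log10_configs_upper = points * math.log10(3)

print(f"game tree ~ 10^{log10_game_tree:.0f} move sequences")
print(f"board configurations < 10^{log10_configs_upper:.0f}")
```

Under these assumptions the tree is far too large to enumerate, which is why the search has to be guided by a learned evaluation rather than run exhaustively.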

🧠 Monte Carlo Tree Search (MCTS) and Its Superiority

- MCTS iteratively improves policy by exploring game trees, using neural networks to evaluate states and guide search.

- Unlike naive RL, which struggles with credit assignment across long trajectories, MCTS suggests a strictly better action at every move, so each move comes with its own training target.

- The policy network learns from the MCTS distribution, not just the selected action, leveraging the richer information in soft targets.

- MCTS's iterative nature ensures stable training, avoiding the pitfalls of sparse or noisy reward signals common in RL.
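The point about learning from the MCTS distribution rather than only the selected action can be sketched as follows. This is a minimal illustration, assuming the training target is the normalized root visit counts and the loss is cross-entropy against the policy network's logits. The function names, visit counts, and logits are hypothetical, not taken from Eric's implementation.

```python
import math

def visit_counts_to_target(visit_counts, temperature=1.0):
    """Normalize MCTS root visit counts into a probability distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    if total == 0:
        raise ValueError("at least one move must have been visited")
    return [s / total for s in scaled]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0)

# Hypothetical root visit counts for four legal moves after a search.
visit_counts = [60, 30, 10, 0]
# Hypothetical policy logits from an untrained network (uniform prior).
policy_logits = [0.0, 0.0, 0.0, 0.0]

soft_target = visit_counts_to_target(visit_counts)  # full search distribution
hard_target = [1.0, 0.0, 0.0, 0.0]                  # only the most-visited move

soft_loss = cross_entropy(soft_target, policy_logits)
hard_loss = cross_entropy(hard_target, policy_logits)
print(soft_target, soft_loss, hard_loss)
```

The soft target tells the network that the second move was also reasonable and that the fourth was never worth visiting, which the one-hot target discards.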

📈 Scaling Laws and Compute Efficiency

- Eric discusses how scaling laws, like those in LLMs, apply to board games, predicting compute requirements for larger boards or stronger models.

- Advances in hardware (e.g., GPUs) and simplifications in training pipelines have made it possible to replicate AlphaGo's performance with a fraction of the original compute.

- Techniques like pretraining on smaller boards (e.g., 9x9) and leveraging transfer learning significantly reduce training time and cost.
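As a rough illustration of how a scaling law can be used to predict compute requirements, the sketch below fits a power law in log-log space and extrapolates it. The data points are synthetic, generated from an assumed exponent purely for demonstration, and the function names are hypothetical. Real measurements would be noisier and may not follow a clean power law.

```python
import math

def fit_power_law(compute, metric):
    """Least-squares fit of metric = a * compute**b in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(m) for m in metric]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)   # intercept = prefactor
    return a, b

def predict(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic data: loss = 2.0 * C^(-0.05), an assumed relationship for illustration.
compute = [1e15, 1e16, 1e17, 1e18]
loss = [2.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
extrapolated = predict(a, b, 1e19)
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e19 FLOPs: {extrapolated:.4f}")
```

Fitting on small, cheap runs and checking the extrapolation against a larger run is one way such a prediction could be verified.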

🤖 Automating AI Research with LLMs

- LLMs excel at implementing experiments, optimizing hyperparameters, and generating insights from data.

- However, they struggle with lateral thinking, escaping research dead ends, and selecting the next high-impact question to investigate.

- Eric envisions using games like Go as a sandbox for training automated researchers, with win rate or scaling law predictions as verifiable outer loops.

- The challenge lies in designing environments that incentivize both local optimization and broader scientific discovery.

🔍 Comparing AlphaGo and LLM RL

- AlphaGo's MCTS offers a clean, iterative improvement loop, while LLM RL often suffers from sparse rewards and high variance in learning signals.

- Eric contrasts the efficiency of MCTS's local search with the "sucking supervision through a straw" nature of policy gradient RL in LLMs.

- The discussion highlights the need for better methods to handle long-horizon credit assignment in RL, especially as tasks grow more complex.
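The contrast in learning signals can be made concrete with a toy sketch. It assumes a REINFORCE-style update in which the only reward arrives at the end of the trajectory, and compares it with search-derived targets computed separately for each move. All names and numbers are hypothetical and chosen only to illustrate the difference.

```python
def reinforce_step_weights(num_steps, terminal_reward, gamma=1.0):
    """Weight applied to each step's log-prob gradient: the discounted return-to-go.

    With a single terminal reward and gamma=1, every step gets the same scalar,
    so the update cannot tell which step actually mattered.
    """
    return [terminal_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

def search_step_targets(per_step_visit_counts):
    """One target distribution per move, from normalized search visit counts."""
    targets = []
    for counts in per_step_visit_counts:
        total = sum(counts)
        targets.append([n / total for n in counts])
    return targets

# A four-move trajectory that ended in a win (reward 1.0).
weights = reinforce_step_weights(num_steps=4, terminal_reward=1.0)

# Hypothetical visit counts over two candidate moves at each of the four steps.
visit_counts = [[8, 2], [5, 5], [1, 9], [7, 3]]
targets = search_step_targets(visit_counts)

print(weights)   # one shared scalar, repeated for every step
print(targets)   # a distinct, move-specific target at every step
```

In the first case the whole trajectory is reinforced or penalized together. In the second, step three can be corrected toward the move the search preferred even though the game was won.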

AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.

📋 Episode Description

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. It was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). This is informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

Watch on YouTube. Read the transcript.

And check out the flashcards I wrote to retain the insights.