
Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability
🤖 AI Summary
Overview
Eric Ho, founder of Goodfire, discusses the critical importance of AI interpretability as neural networks increasingly take on mission-critical roles in society. He shares breakthroughs in understanding neural networks, including resolving superposition and enabling precise model editing. The conversation explores the potential for interpretability to transform AI development, enhance safety, and even provide insights into human biology.
Notable Quotes
- "What if we could crack open the black box of AI and see exactly how it thinks?" – Eric Ho, on the vision behind Goodfire.
- "AI models are like wild trees; interpretability lets us shape them into bonsai—intentionally designed to serve humanity." – Eric Ho, on the transformative potential of interpretability.
- "By 2028, we’ll fully decode neural nets, transforming AI from black boxes into intentional designs." – Eric Ho, predicting the future of AI interpretability.
🧠 The Importance of AI Interpretability
- Eric Ho emphasizes that as AI systems become integral to critical applications like power grids and investment decisions, understanding their inner workings is essential for safety and reliability.
- He compares current AI development to early steam engines, which were transformative but dangerous until thermodynamics provided a deeper understanding.
- Interpretability allows for white-box AI design, enabling intentional development rather than relying on trial-and-error methods.
🔍 Breakthroughs in Mechanistic Interpretability
- Goodfire has made strides in resolving superposition, where single neurons encode multiple concepts, by using sparse autoencoders to disentangle these neurons into interpretable units (see the sketch after this list).
- Techniques like auto-interpretability leverage AI to analyze AI, scaling interpretability as models grow in complexity.
- These advancements enable partial mapping of neural networks, providing insights into their thought processes and paving the way for more precise interventions.
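To make the sparse-autoencoder idea concrete, here is a minimal, hypothetical sketch in PyTorch. It is not Goodfire's actual code; it only illustrates the core mechanism described above: learning a wider, mostly-zero feature basis that reconstructs a layer's activations, so entangled concepts can be pulled apart into separate features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: maps d_model activations to a wider, mostly-zero feature space."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete projection
        self.decoder = nn.Linear(d_features, d_model)  # reconstruct the original activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that drives most features to zero.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Toy training step on random stand-ins for captured hidden activations.
sae = SparseAutoencoder(d_model=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 512)  # in practice: activations recorded from a model layer
features, reconstruction = sae(batch)
loss = sae_loss(batch, features, reconstruction)
loss.backward()
optimizer.step()
```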
🧬 Applications in Genomics and Beyond
- Goodfire partnered with Arc Institute to interpret DNA foundation models, uncovering biological concepts encoded in AI systems. This work could accelerate discoveries in genomics, such as understanding junk DNA.
- Interpretability tools have also been applied to image models, allowing users to manipulate specific features (e.g., adding a dragon or pyramid to an image) with precision; a rough sketch of this kind of feature steering appears below.
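As an illustration of what feature-level editing might look like in code, the hypothetical sketch below nudges a layer's activations along a single learned feature direction using a PyTorch forward hook. The module path and the `dragon_direction` variable are assumptions for illustration only, not Goodfire's API.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Return a forward hook that shifts a layer's output along one feature direction."""
    direction = feature_direction / feature_direction.norm()
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + strength * direction
    return hook

# Hypothetical usage, assuming `model.layers[10]` outputs a [batch, seq, d_model] tensor
# and `dragon_direction` is a decoder column from a trained sparse autoencoder:
# handle = model.layers[10].register_forward_hook(make_steering_hook(dragon_direction, strength=4.0))
# ...run generation with the steered model...
# handle.remove()
```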
⚖️ Ethical and Practical Implications
- Interpretability can help audit AI models, identify biases, and ensure alignment with societal values. For example, it could detect and mitigate harmful behaviors introduced during fine-tuning.
- The field could play a role in understanding and modifying AI systems developed in other geopolitical contexts, ensuring they align with local values and norms.
- As AI becomes more pervasive, interpretability will be critical for explaining model decisions in legal and ethical contexts, potentially informing expert testimony in trials.
🚀 The Future of AI and Interpretability
- Eric Ho predicts that by 2028, we will achieve a comprehensive understanding of neural networks, enabling intentional design at every stage of AI development.
- He envisions interpretability as foundational to all aspects of AI, from training data selection to real-world deployment.
- The field is evolving rapidly, with independent research companies like Goodfire playing a pivotal role in advancing techniques and collaborating across domains.
AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.
📋 Episode Description
Eric Ho is building Goodfire to solve one of AI’s most critical challenges: understanding what’s actually happening inside neural networks. His team is developing techniques to understand, audit and edit neural networks at the feature level. Eric discusses breakthrough results in resolving superposition through sparse autoencoders, successful model editing demonstrations and real-world applications in genomics with Arc Institute's DNA foundation models. He argues that interpretability will be critical as AI systems become more powerful and take on mission-critical roles in society.
Hosted by Sonya Huang and Roelof Botha, Sequoia Capital
Mentioned in this episode:
- Mech interp: Mechanistic interpretability, list of important papers here
- Phineas Gage: 19th century railway engineer who lost most of his brain’s left frontal lobe in an accident and became a famous case study in neuroscience
- Human Genome Project: Effort from 1990-2003 to generate the first sequence of the human genome, which accelerated the study of human biology
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
- Zoom In: An Introduction to Circuits: First important mechanistic interpretability paper from OpenAI in 2020
- Superposition: Concept from physics applied to interpretability that allows neural networks to simulate larger networks (e.g. representing more concepts than they have neurons)
- Apollo Research: AI safety company that designs AI model evaluations and conducts interpretability research
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning: 2023 Anthropic paper that uses a sparse autoencoder to extract interpretable features; followed by Scaling Monosemanticity
- Under the Hood of a Reasoning Model: 2025 Goodfire paper that interprets DeepSeek’s reasoning model R1