Reiner Pope – The math behind how LLMs are trained and served

April 29, 2026 · 2 hr 13 min

🤖 AI Summary

Overview

This episode features a deep dive into the technical underpinnings of training and serving large language models (LLMs), led by Reiner Pope, CEO of MatX and former TPU architect at Google. Using a blackboard lecture format, the discussion explores the interplay between hardware, model architecture, and economics, revealing how AI labs optimize for performance and cost. Topics include batching, memory constraints, inference latency, and the surprising insights that can be deduced from public API pricing.

Notable Quotes

- "If you do not batch together many users, the costs and economics you get can be like a thousand times worse than if you do batch many users together." – Reiner Pope, on the critical role of batching in AI economics.

- "The reason the bigger scale-up domains matter is not the memory capacity of the whole scale-up, but really the memory bandwidth." – Reiner Pope, on why larger hardware configurations unlock AI progress.

- "Each model should generate the sum of human knowledge on the output that it gets on the input." – Dwarkesh Patel, reflecting on the relationship between training data and inference usage.

🧮 The Economics of Batching and Latency

- Reiner Pope explains how batching multiple users together drastically reduces costs by amortizing memory and compute overheads. Without batching, costs can increase by orders of magnitude.

- The trade-off between latency and cost is visualized through roofline models, showing how batch size and hardware configurations impact performance (a back-of-envelope sketch follows this list).

- Inference latency follows a "train departure" model: GPUs launch a new batch on a fixed cadence (every 20 milliseconds in the example discussed), so a request that just misses a departure must wait for the next one, which is how worst-case latency reaches roughly 40 milliseconds.
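
To make the batching and roofline points above concrete, here is a back-of-envelope sketch of per-token decode cost as a function of batch size. All of the numbers (parameter count, HBM bandwidth, peak FLOP/s, chip rental cost) are illustrative assumptions, not figures from the episode.

```python
# Back-of-envelope roofline for per-token decode cost vs. batch size.
# All numbers below are illustrative assumptions, not figures from the episode.

PARAMS = 70e9              # assumed dense parameter count
BYTES_PER_PARAM = 2        # bf16 weights
HBM_BANDWIDTH = 3.3e12     # bytes/s, roughly one modern accelerator
PEAK_FLOPS = 1.0e15        # FLOP/s, roughly bf16 dense peak
CHIP_COST_PER_SEC = 2.0 / 3600   # assume the chip rents for ~$2/hour

def decode_step_time(batch_size):
    """Time for one decode step: the roofline max of memory and compute time."""
    # Every step streams all weights out of HBM once, regardless of batch size.
    memory_time = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH
    # Each token in the batch costs ~2 FLOPs per parameter.
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS
    return max(memory_time, compute_time)

def cost_per_token(batch_size):
    """Chip-time cost of one generated token at this batch size."""
    return CHIP_COST_PER_SEC * decode_step_time(batch_size) / batch_size

for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: {decode_step_time(b) * 1e3:6.1f} ms/step, "
          f"${cost_per_token(b):.2e} per token")
```

The structure to notice is the max(memory time, compute time) roofline: at batch size 1 the weight fetch is paid in full for every token, while at large batch sizes it is amortized across the whole batch, so per-token cost drops by orders of magnitude.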

💾 Memory Bandwidth vs. Capacity

- Memory bandwidth, not capacity, is the primary bottleneck for scaling LLMs. Larger scale-up domains (e.g., NVIDIA Blackwell racks) increase aggregate memory bandwidth, enabling faster inference and longer context lengths (a rough sketch follows this list).

- Sparse attention mechanisms can mitigate memory bandwidth constraints but come with trade-offs in model quality.

- Pipelining during training reduces memory capacity requirements but introduces complexities like micro-batching, which can limit weight amortization.
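
As a rough illustration of why aggregate memory bandwidth (rather than capacity) sets per-user decode speed, the sketch below scales a single user's token rate with the size of the scale-up domain, assuming the weights are sharded across it. The model size, KV-cache size, and per-chip bandwidth are assumed values, not numbers from the episode.

```python
# Rough sketch: per-user decode speed vs. scale-up domain size when the step
# is memory-bandwidth-bound. All values are illustrative assumptions.

WEIGHT_BYTES = 140e9          # assumed 70B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 800e3    # assumed KV-cache bytes per token of context
CONTEXT_TOKENS = 128_000      # tokens of context held in the KV cache
HBM_BW_PER_CHIP = 3.3e12      # bytes/s of HBM bandwidth per chip

def tokens_per_second(num_chips):
    # Each decode step streams the (sharded) weights plus this request's KV
    # cache out of HBM; aggregate bandwidth grows with the scale-up domain.
    bytes_per_step = WEIGHT_BYTES + KV_BYTES_PER_TOKEN * CONTEXT_TOKENS
    step_time = bytes_per_step / (HBM_BW_PER_CHIP * num_chips)
    return 1.0 / step_time

for n in (1, 8, 72):
    print(f"{n:3d}-chip domain: ~{tokens_per_second(n):7.0f} tokens/s per user")
```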

📊 Insights from API Pricing

- Public API pricing reveals operational constraints. For example, the higher cost of decoding tokens compared to prefill tokens suggests that inference is often memory bandwidth-limited.

- Cache hits are significantly cheaper than cache misses, reflecting the cost of rematerializing KV caches versus storing them in high-bandwidth memory (HBM).

- The pricing structure for longer context lengths aligns with the crossover point where memory fetches dominate compute costs.
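
That crossover can be estimated with one line of arithmetic: compare the KV-cache bytes read per decode step against the weight bytes read per step. The numbers below are assumptions chosen for illustration, not reverse-engineered from any provider's pricing.

```python
# One-line arithmetic for the long-context crossover: the context length at
# which reading the KV cache each step costs as much as reading the weights.
# Both numbers are assumptions chosen for illustration.

WEIGHT_BYTES = 140e9         # assumed 70B parameters at 2 bytes each
KV_BYTES_PER_TOKEN = 800e3   # assumed KV-cache bytes per token of context

crossover_tokens = WEIGHT_BYTES / KV_BYTES_PER_TOKEN
print(f"KV-cache traffic matches weight traffic at ~{crossover_tokens:,.0f} tokens")
# Past this point the memory fetched per decoded token grows with context
# length, which is consistent with separate pricing tiers for long contexts.
```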

🔄 Reversible Networks and Cryptographic Parallels

- Reiner Pope discusses how reversible networks (RevNets) borrow from cryptographic principles like Feistel ciphers to make neural networks invertible. This reduces memory usage during training by rematerializing activations on the fly instead of storing them (a toy example follows this list).

- Neural networks and cryptographic protocols share structural similarities, as both rely on mixing and scrambling information. However, their goals diverge: cryptography seeks to obscure structure, while neural networks aim to extract it.
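
A toy reversible block makes the Feistel analogy concrete: two branches are updated in turn by arbitrary (non-invertible) sub-functions, yet the whole block can be inverted exactly, so activations can be recomputed rather than stored. This is a generic RevNet-style sketch, not the specific architecture discussed in the episode.

```python
# Toy reversible (RevNet-style) block with the two-branch Feistel structure.
# A generic sketch, not the specific architecture discussed in the episode.
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.normal(size=(4, 4))
W_g = rng.normal(size=(4, 4))

def F(x):  # arbitrary sub-function; it does not need to be invertible
    return np.tanh(x @ W_f)

def G(x):  # likewise arbitrary
    return np.tanh(x @ W_g)

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # The inputs (and hence the activations) can be recomputed from the
    # outputs, so they need not be stored during the forward pass of training.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```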

🔍 Overtraining and AI Progress

- The discussion estimates that LLMs are overtrained by up to 100x compared to Chinchilla scaling laws, balancing training, RL fine-tuning, and inference costs (the arithmetic is sketched after this list).

- Larger scale-up domains and improved memory bandwidth have enabled recent increases in model size and context length, though memory constraints still limit further scaling.

- Sparse attention and other architectural innovations may help overcome some of these barriers, but fundamental hardware limitations remain a challenge.
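
The overtraining arithmetic is simple if you take the common rule of thumb that Chinchilla-optimal training uses roughly 20 tokens per parameter. The model size and token count below are hypothetical, picked only to show how a ~100x factor arises.

```python
# Overtraining estimate using the rule of thumb that Chinchilla-optimal
# training is roughly 20 tokens per parameter. The model size and token
# count are hypothetical, chosen purely for illustration.

params = 30e9           # assumed parameter count
train_tokens = 60e12    # assumed number of training tokens

chinchilla_tokens = 20 * params
overtraining_factor = train_tokens / chinchilla_tokens
print(f"Chinchilla-optimal tokens: {chinchilla_tokens:.1e}")
print(f"Overtraining factor: ~{overtraining_factor:.0f}x")
```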

AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.

📋 Episode Description

Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served.

It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.

It’s a bit technical, but I encourage you to hang in there – it’s really worth it.

There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner does. It was a real delight to learn from him.

Recommend watching this one on YouTube so you can see the chalkboard.

Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

Download markdown of transcript here to chat with an LLM.

Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too!

Sponsors

* Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation—which touched on everything from FPGAs to liquid cooling—was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at janestreet.com/dwarkesh

* Google’s Gemma 4 is the first open model that’s let me shut off the internet and create a fully disconnected “focus machine”. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4

* Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure of the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post.