
ElevenLabs’ Mati Staniszewski: Why Voice Will Be the Fundamental Interface for Tech
🤖 AI Summary
Overview
Mati Staniszewski, co-founder of ElevenLabs, discusses how the company has revolutionized text-to-speech technology by focusing on contextual understanding, emotional delivery, and voice replication. He shares insights into the technical challenges of building voice AI, the viral moments that propelled ElevenLabs into the spotlight, and the transformative potential of voice as a primary interface for technology. The conversation also explores the future of real-time translation, voice agents, and the cultural implications of breaking language barriers.
Notable Quotes
- Voice will fundamentally be the interface for interacting with technology—it carries emotions, intonation, and imperfections that text simply cannot.
– Mati Staniszewski
- We hope to cross the Turing test for voice interactions this year, making it indistinguishable from speaking to another human.
– Mati Staniszewski
- The biggest barrier to global communication is not understanding the other person. Real-time voice translation will change the world.
– Mati Staniszewski
🎙️ The Origins of ElevenLabs
- The idea for ElevenLabs stemmed from a frustrating experience watching a Polish-dubbed movie, where all characters were voiced by a single monotonous narrator.
- Co-founders Mati and Piotr, friends since high school, bonded over mathematics and later collaborated on hack projects, including early experiments in audio AI.
- Inspiration came from advancements like the “Attention Is All You Need” paper and open-source models like Tortoise TTS, which hinted at the potential for high-quality voice replication.
🛠️ Building a Defensible Position in Voice AI
- ElevenLabs focused narrowly on audio, avoiding competition with multimodal foundation models.
- Early innovations included applying transformer and diffusion models to text-to-speech, enabling nuanced emotional delivery and contextual understanding.
- Challenges included sourcing high-quality audio data with accurate transcriptions and capturing non-verbal elements like tone and emotion.
- The company developed unique voice encoding techniques, allowing models to replicate voices without hardcoding features like gender or age.
🌍 Breaking Language Barriers and Real-Time Translation
- ElevenLabs aims to eliminate language barriers by enabling real-time voice translation while preserving the speaker's unique tone and emotion.
- Viral use cases include dubbing Lex Fridman’s interview with Prime Minister Modi into Hindi and English, and creating multilingual narration for European languages.
- Staniszewski envisions a future where people can travel and communicate seamlessly in any language, likening it to the Babel fish from The Hitchhiker's Guide to the Galaxy.
🤖 The Rise of Voice Agents
- Voice agents are becoming a popular interface for tasks like customer support, healthcare, and education.
- Examples include automating nurse-patient calls, creating chess tutorials narrated by iconic players, and enabling interactive journalism with Time magazine.
- ElevenLabs powers voice interactions for gaming (e.g., Darth Vader in Fortnite) and enterprise applications, emphasizing quality, low latency, and scalability.
- The company is working on duplex models to improve contextual responsiveness and reduce latency further.
🇪🇺 Building a Global Company from Europe
- Advantages of being based in Europe include access to passionate, high-caliber talent and a growing enthusiasm for AI innovation.
- Challenges include fewer experienced mentors compared to the U.S. and regulatory hurdles like the EU AI Act.
- ElevenLabs’ global mindset from inception has helped it cater to diverse languages and cultures, aligning with its mission to make audio universally accessible.
AI-generated content may not be accurate or complete and should not be relied upon as a sole source of truth.
📋 Episode Description
Mati Staniszewski, co-founder and CEO of ElevenLabs, explains how staying laser-focused on audio innovation has allowed his company to thrive despite the push into multimodality from foundation models. From a high school friendship in Poland to building one of the fastest-growing AI companies, Mati shares how ElevenLabs transformed text-to-speech with contextual understanding and emotional delivery. He discusses the company's viral moments (from Harry Potter by Balenciaga to powering Darth Vader in Fortnite), and explains how ElevenLabs is creating the infrastructure for voice agents and real-time translation that could eliminate language barriers worldwide.
Hosted by: Pat Grady, Sequoia Capital
Mentioned in this episode:
- Attention Is All You Need: The original Transformer paper
- Tortoise-tts: Open-source text-to-speech model that was a starting point for ElevenLabs (which now maintains a v2)
- Harry Potter by Balenciaga: ElevenLabs’ first big viral moment, from 2023
- The first AI that can laugh: 2022 blog post backing up ElevenLabs’ claim of laughter (it got better in v3)
- Darth Vader's voice in Fortnite: ElevenLabs used actual voice clips provided by James Earl Jones before he died
- Lex Fridman interviews Prime Minister Modi: ElevenLabs enabled Fridman to speak in Hindi and Modi to speak in English
- Time Person of the Year 2024: ElevenLabs-powered experiment with “conversational journalism”
- Iconic Voices: Richard Feynman, Deepak Chopra, Maya Angelou and more, available in the ElevenLabs reader app
- SIP trunking: a method of delivering voice, video, and other unified communications over the internet using the Session Initiation P