delayed-streams-modeling
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework, with implementations in PyTorch, Rust, and MLX.
About delayed-streams-modeling
This repository from Kyutai Labs introduces Speech-To-Text (STT) and Text-To-Speech (TTS) models based on their Delayed Streams Modeling (DSM) framework.
For the Non-Technical Reader
Imagine you're watching a live TV show with subtitles. This technology is like the engine that generates those subtitles in real time. Kyutai's models can transcribe speech into text as it's being spoken, with minimal delay. Think of it as a fast, accurate voice-to-text converter that can handle more than one language and even detect when someone is speaking. This could power more responsive voice assistants, real-time translation services, and more accessible communication tools.
For the Technical Reader
The repository provides implementations of Kyutai STT models, optimized for real-time usage and batch processing. Key features include:
- Streaming inference: Processes audio in chunks for real-time transcription.
- Batching: An H100 GPU can process 400 streams in real time.
- Word-level timestamps: Returns precise timing information for each word.
- Semantic VAD: The ~1B-parameter model includes semantic Voice Activity Detection (VAD).
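To make the streaming bullet concrete, here is a minimal sketch of chunked processing in plain Python. It does not use the repository's API: `EchoTranscriber` is a hypothetical stand-in for a stateful streaming model, and the 24 kHz sample rate and 80 ms chunk length are assumptions chosen for illustration.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 24_000  # assumption: not stated in the text above
FRAME_MS = 80            # assumption: illustrative chunk length
CHUNK_SIZE = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per chunk


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks; the last chunk is zero-padded if it is short."""
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])
        if len(chunk) < chunk_size:
            chunk.extend([0.0] * (chunk_size - len(chunk)))
        yield chunk


class EchoTranscriber:
    """Hypothetical streaming model: keeps state across calls to step()."""

    def __init__(self) -> None:
        self.frames_seen = 0

    def step(self, chunk: List[float]) -> str:
        # A real model would consume the audio and emit text here.
        self.frames_seen += 1
        return f"<frame {self.frames_seen}>"


def transcribe_stream(samples: Sequence[float], transcriber: EchoTranscriber,
                      chunk_size: int) -> List[str]:
    """Feed audio to the transcriber one chunk at a time, in order."""
    return [transcriber.step(chunk) for chunk in iter_chunks(samples, chunk_size)]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence
outputs = transcribe_stream(one_second, EchoTranscriber(), CHUNK_SIZE)
print(len(outputs), outputs[0])
```

The point of the sketch is the shape of the loop: audio arrives in small fixed-size pieces, and the model carries state from one piece to the next instead of waiting for the full recording.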
Implementations are available in PyTorch (for research), Rust (for production, with a websocket server example), and MLX (for on-device inference on Apple silicon). The models include a ~1B parameter English/French model with a 0.5-second delay and a ~2.6B parameter English-only model with a 2.5-second delay. The Colab notebook provides an example of streaming audio directly into the PyTorch model.
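The per-model delay can be pictured as the text stream running a fixed number of steps behind the audio stream, so the model hears a little future audio before it commits to a word. The sketch below is only an illustration of that idea, not the repository's code: the 80 ms step length, the `PAD` symbol, and both function names are assumptions.

```python
from typing import List

FRAME_MS = 80  # assumption: illustrative step length, not from the text above
PAD = "_"      # hypothetical padding symbol for "no text at this step"


def delay_to_frames(delay_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Number of whole steps needed to cover a delay (rounded up)."""
    return -(-delay_ms // frame_ms)  # ceiling division on integers


def shift_stream(tokens: List[str], delay_frames: int, pad: str = PAD) -> List[str]:
    """Delay a per-step token stream by prepending padding steps.

    The result is longer than the input, which mirrors the fact that a
    delayed stream needs extra steps at the end to flush its last tokens.
    """
    return [pad] * max(delay_frames, 0) + list(tokens)


small_model_delay = delay_to_frames(500)    # 0.5-second delay
large_model_delay = delay_to_frames(2500)   # 2.5-second delay

aligned = ["hi", PAD, "there"]              # text aligned to audio steps
delayed = shift_stream(aligned, 2)          # same text, emitted 2 steps later
print(small_model_delay, large_model_delay, delayed)
```

Under these assumptions, a longer delay gives the model more audio context per word at the cost of higher latency, which is the trade-off between the two models described above.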
Why It Matters
These models target low-latency, real-time voice processing. The three implementations (PyTorch, Rust, MLX) cover different needs, from research to production to on-device deployment. Efficient batching and low latency are what make voice applications scalable and responsive. The semantic VAD in the smaller model makes it better suited to voice agents, and the word-level timestamps are useful for downstream tasks.
The "Voice AI Space Lab" Idea
Imagine building a real-time collaborative storytelling app. Users could take turns adding to a story, and the app would instantly transcribe their contributions, creating a dynamic, evolving text narrative alongside the audio. The word-level timestamps could be used to create synchronized animations or visual effects.
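As a rough sketch of how word-level timestamps could drive synchronized visuals, the snippet below groups timed words into caption-style cues. The `Word` and `Cue` types, the sample data, and the gap and length thresholds are all hypothetical; the actual output format of the models may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def make_cue(words: List[Word]) -> Cue:
    """Merge consecutive words into one cue spanning first start to last end."""
    return Cue(words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words(words: List[Word], max_gap_s: float = 0.6,
                max_words: int = 7) -> List[Cue]:
    """Start a new cue after a long pause or when the cue is full."""
    cues: List[Cue] = []
    current: List[Word] = []
    for word in words:
        paused = bool(current) and word.start - current[-1].end > max_gap_s
        if current and (paused or len(current) >= max_words):
            cues.append(make_cue(current))
            current = []
        current.append(word)
    if current:
        cues.append(make_cue(current))
    return cues


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the style used by SRT subtitles)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


# Made-up example data for illustration only.
words = [
    Word("Once", 0.0, 0.3), Word("upon", 0.35, 0.6), Word("a", 0.62, 0.7),
    Word("time", 0.72, 1.1), Word("there", 2.0, 2.3), Word("was", 2.35, 2.5),
]
cues = group_words(words)
for cue in cues:
    print(format_timestamp(cue.start), "-->", format_timestamp(cue.end), cue.text)
```

Each cue carries a start and end time, so an app could trigger an animation when playback reaches `cue.start` and clear it at `cue.end`.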
The Collaborative CTA
Given the multiple implementations (PyTorch, Rust, MLX), what are the biggest challenges you foresee in deploying these models across different platforms and use cases? What specific optimizations or adaptations would be most beneficial for your particular application?
#VoiceAI #SpeechRecognition