Raon-Speech
Provides open-source speech AI models for understanding, generation, and real-time full-duplex conversation using the Hugging Face ecosystem.
About Raon-Speech
KRAFTON has released Raon-Speech, a comprehensive suite of open-source speech AI models designed to bridge the gap between static speech processing and fluid, real-time human-machine conversation. By providing models for both offline understanding and full-duplex interaction, this release offers a robust foundation for the next generation of voice-enabled applications.
1. For the Non-Technical Reader
Think of most voice assistants today like a walkie-talkie: you speak, wait for it to process, and then it speaks back. Raon-Speech is more like a natural phone call. It enables "full-duplex" communication, meaning the AI can listen and talk at the same time, allowing for natural interruptions, back-and-forth flow, and a much more human-like rhythm. For users, this means digital assistants and in-game characters that feel less like rigid software and more like active, attentive listeners.
2. For the Technical Reader
The Raon-Speech ecosystem is built on the Hugging Face framework and centers around a 9B parameter backbone. It integrates a Language Model (LM) backbone with an audio encoder and a Mimi codec path. The repository highlights two primary tracks:
- Raon-Speech (Offline): Optimized for standard speech-to-text and text-to-speech tasks.
- Raon-SpeechChat (Full-Duplex): Designed for real-time duplex decoding, allowing simultaneous input and output streams.
The architecture supports speaker-conditioning for TTS and utilizes a JSONL data format for multi-turn dialogues. Developers can leverage FlashAttention for training efficiency and utilize the provided Gradio demos for rapid prototyping. The models are available on Hugging Face: Raon-Speech-9B and Raon-SpeechChat-9B.
3. Why It Matters
In an era where high-performance speech models are often locked behind proprietary APIs, KRAFTON’s decision to open-source a 9B parameter model is significant. It provides a high-quality, privacy-conscious alternative for developers who require low-latency, real-time interaction without the costs or data-sharing concerns of closed-source providers. This move democratizes access to "GPT-4o style" voice capabilities for the open-source community.
4. The "Voice AI Space Lab" Idea
The Dynamic Roleplay Narrator: Use Raon-Speech to build a tabletop RPG game master that doesn't just read a script. Because of its full-duplex capabilities, the AI Narrator could react instantly when a player gasps in surprise or interrupts to ask a question about the environment, adjusting its tone and pace in real-time to match the emotional energy of the room.
Explore the project here: Raon-Speech GitHub and try the Official Demo.