DuplexSLA
Unifies speech, language, and action in a native full-duplex spoken language model for real-time planning and tool calling.
About DuplexSLA
DuplexSLA is a native full-duplex foundation model that synchronizes speech, language, and action on a single conversational clock, replacing traditional turn-based interaction with a continuous, real-time dialogue flow.
For the Non-Technical Reader
Imagine talking to a human assistant who doesn't just wait for a turn to speak, but keeps listening while speaking. DuplexSLA acts like a multitasker who can hold a conversation, plan the next move, and perform tasks (like checking a calendar or booking a flight) all at once. For the user, this means no more awkward pauses where the AI says "Let me look that up for you" and goes silent. The AI can keep speaking to you while executing background tasks, and it can handle interruptions naturally because it is always "listening", even when its own mouth is moving.
For the Technical Reader
DuplexSLA uses a dual-stream, three-channel formulation built on the Step-Audio-2-mini (~7B parameters) backbone. The architecture runs on a 160 ms chunk timeline with the following channels:
- User Audio Channel: Continuous audio features at an 80 ms stride.
- Assistant Audio Channel: Discrete speech tokens in a TA4 layout (1 text anchor + 4 audio tokens) at a 40 ms stride.
- Action Channel: A rate-limited textual stream (≤10 tokens per chunk) for planning, tool calls, and interaction labels.
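The per-chunk budget implied by these numbers can be worked out with a little arithmetic. The sketch below is illustrative only: the `Chunk` class, its field names, and `chunk_start_ms` are hypothetical and not taken from the repository, and it assumes exactly one TA4 group per 160 ms chunk, which is what the 40 ms stride suggests (4 x 40 ms = 160 ms).

```python
from dataclasses import dataclass, field
from typing import List

# Timing figures quoted in the channel list above.
CHUNK_MS = 160
USER_STRIDE_MS = 80
ASSISTANT_STRIDE_MS = 40
MAX_ACTION_TOKENS = 10

# Derived per-chunk counts: 160 / 80 = 2 user frames, 160 / 40 = 4 audio tokens.
USER_FRAMES_PER_CHUNK = CHUNK_MS // USER_STRIDE_MS
AUDIO_TOKENS_PER_CHUNK = CHUNK_MS // ASSISTANT_STRIDE_MS


@dataclass
class Chunk:
    """Hypothetical container for one 160 ms step (not the repository's type)."""

    user_frames: List[List[float]]   # continuous user audio features
    text_anchor: int                 # the "T" in TA4 (assumed one per chunk)
    audio_tokens: List[int]          # the "A4" in TA4
    action_tokens: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the budgets derived from the strides and the action rate limit.
        if len(self.user_frames) != USER_FRAMES_PER_CHUNK:
            raise ValueError(f"expected {USER_FRAMES_PER_CHUNK} user frames")
        if len(self.audio_tokens) != AUDIO_TOKENS_PER_CHUNK:
            raise ValueError(f"expected {AUDIO_TOKENS_PER_CHUNK} audio tokens")
        if len(self.action_tokens) > MAX_ACTION_TOKENS:
            raise ValueError(f"at most {MAX_ACTION_TOKENS} action tokens per chunk")


def chunk_start_ms(index: int) -> int:
    """Start time of the chunk at `index` on the shared timeline."""
    return index * CHUNK_MS
```

Under these assumptions, each step on the shared clock carries 2 user audio frames, 1 text anchor, 4 assistant audio tokens, and up to 10 action tokens.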
A key innovation is semantic-driven turn-taking. Because the model itself emits internal <LISTEN>, <SPEAK>, and <THINK> decisions, it avoids the latency overhead of an external Voice Activity Detection (VAD) module. Performance is validated on the new DuplexSLA-Bench, which measures turn-taking and agentic tool-calling accuracy in real-time scenarios.
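To make the idea concrete, here is a minimal sketch of how a client could turn per-chunk decisions into a playback gate without any VAD. It rests on several assumptions that the description above does not confirm: that the decision tokens arrive as plain strings in the action stream, that a decision persists until the next one is emitted, and that <THINK> is silent. The names `Decision`, `parse_decision`, and `playback_mask` are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    LISTEN = "<LISTEN>"
    SPEAK = "<SPEAK>"
    THINK = "<THINK>"


def parse_decision(action_tokens: List[str]) -> Optional[Decision]:
    """Return the first decision token found in a chunk's action tokens, if any."""
    known = {d.value: d for d in Decision}
    for token in action_tokens:
        if token in known:
            return known[token]
    return None


def playback_mask(chunks: List[List[str]]) -> List[bool]:
    """For each chunk, report whether assistant audio should be played.

    Assumption: the most recent decision stays in force until a new one
    appears, and only SPEAK produces audible output.
    """
    state = Decision.LISTEN  # assumed initial state
    mask = []
    for action_tokens in chunks:
        decision = parse_decision(action_tokens)
        if decision is not None:
            state = decision
        mask.append(state is Decision.SPEAK)
    return mask
```

For example, `playback_mask([["<LISTEN>"], ["<SPEAK>"], [], ["<LISTEN>"]])` marks only the second and third chunks as audible. The point of the sketch is that the gate is driven entirely by tokens the model emits, with no separate signal-level detector in the loop.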
Why It Matters
This repository is a step toward Native Voice Agents. While proprietary models like GPT-4o have demonstrated duplex capabilities, DuplexSLA provides an open-source technical framework for synchronizing tool-calling with continuous speech. By folding turn-taking into the model's internal state, it aims to reduce the "robotic" latency of cascaded systems, which chain separate speech recognition, language model, and speech synthesis stages. That could make high-stakes, real-time voice applications more viable and cost-effective.
The "Voice AI Space Lab" Idea
The Live Podcast Producer: Build an AI co-host that manages the technical side of a live stream via the Action Channel. While it is talking to you about a topic, it can simultaneously monitor a live chat feed, trigger sound effects, or pull up relevant images on screen based on your mid-sentence requests, all without breaking the vocal flow of the conversation.
Explore the repository here: https://github.com/hyzhang24/DuplexSLA