ComfyUI-OmniVoice-TTS
Provides OmniVoice nodes for ComfyUI, enabling zero-shot multilingual text-to-speech, voice cloning, voice design, and multi-speaker dialogue.
About ComfyUI-OmniVoice-TTS
The ComfyUI-OmniVoice-TTS repository adds a set of nodes to ComfyUI that integrate the OmniVoice model, providing zero-shot multilingual text-to-speech, voice cloning from reference audio, and voice design from a text description.
For the Non-Technical Reader
Imagine a professional voice actor who can speak over 600 languages and learn a new voice just by listening to a 10-second clip. That is what this tool brings to your desktop. Rather than simply reading text aloud, it can perform: it handles multi-speaker dialogues and can include human-like expressions such as laughter or sighs. For creators, this means you can "design" a character's voice simply by describing its age, accent, and pitch, without needing any original recording.
For the Technical Reader
OmniVoice is a TTS model optimized for speed and flexibility. Key technical highlights include:
- Performance: Achieves a Real-Time Factor (RTF, synthesis time divided by the duration of the generated audio) as low as 0.025, which is roughly 40x faster than real time.
- Architecture: Uses SageAttention, applied by monkey-patching Qwen3Attention, for faster attention on CUDA GPUs with compute capability SM80 or newer.
- Efficiency: Features automatic CPU offloading, VBAR/aimdo integration, and Whisper ASR caching to prevent redundant processing of reference audio.
- Precision: Supports bf16 and fp16 inference. Synthesis is diffusion-based, with 4 to 64 steps and classifier-free guidance.
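To make the RTF figure concrete, here is a minimal arithmetic sketch. It assumes the usual definition of RTF as synthesis time divided by audio duration; the function names are illustrative and are not part of the OmniVoice or ComfyUI APIs.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time synthesis runs.

    Assumes RTF = synthesis_time / audio_duration, so speedup = 1 / RTF.
    """
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock synthesis time for a clip of the given length."""
    if audio_seconds < 0:
        raise ValueError("audio duration cannot be negative")
    return audio_seconds * rtf


if __name__ == "__main__":
    rtf = 0.025  # best-case figure quoted above
    print(f"speedup: about {speedup_from_rtf(rtf):.0f}x real time")
    print(f"60 s of audio: about {synthesis_seconds(60.0, rtf):.1f} s to synthesize")
```

At the quoted best case, one minute of speech would take about 1.5 seconds to generate. Actual throughput depends on hardware, precision, and the number of diffusion steps.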
Why It Matters
This project is part of a shift toward high-fidelity, local-first voice synthesis. By supporting over 600 languages, it covers many languages that proprietary models, which often prioritize English, tend to neglect. Running it locally within a ComfyUI workflow reduces reliance on costly cloud APIs and keeps text and reference audio on your own machine, while offering fine-grained control over non-verbal cues and multi-character interactions.
The "Voice AI Space Lab" Idea
You could build an Automated Interactive Audiobook Studio. By feeding a script into a ComfyUI workflow, you could use the "Voice Design" feature to automatically generate distinct voices for every character based on their descriptions in the book (e.g., "an elderly man with a raspy voice"). The system could then render the entire book as a multi-speaker dialogue, complete with contextual laughs and sighs, all without hiring a single voice actor or recording a single line of audio.
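As a rough illustration of the preprocessing such a studio might need, the sketch below splits a script into speaker-tagged lines and pairs each speaker with a voice description. The `Speaker: text` script format, the function names, and the default description are all assumptions made for this example; they are not the input format of the OmniVoice nodes, which should be checked in the repository.

```python
import re

# Hypothetical script format: one "Speaker: spoken text" entry per line.
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_script(text):
    """Return a list of (speaker, line) tuples, skipping lines without a speaker tag."""
    segments = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match:
            segments.append((match.group(1), match.group(2)))
    return segments


def build_cast(segments, descriptions, default="neutral adult narrator"):
    """Map each speaker, in order of first appearance, to a voice description."""
    cast = {}
    for speaker, _ in segments:
        cast.setdefault(speaker, descriptions.get(speaker, default))
    return cast


if __name__ == "__main__":
    script = (
        "Narrator: The door creaked open.\n"
        "Grandfather: Who is there?\n"
        "\n"
        "Narrator: Nobody answered.\n"
    )
    segments = parse_script(script)
    cast = build_cast(segments, {"Grandfather": "an elderly man with a raspy voice"})
    for speaker, line in segments:
        print(f"[{speaker} | {cast[speaker]}] {line}")
```

Each (speaker, line) pair, together with its description, could then be passed to whichever voice design and dialogue nodes the workflow uses.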
Explore the project here: GitHub Repository | OmniVoice Model