ComfyUI-OmniVoice-TTS

The ComfyUI-OmniVoice-TTS repository adds a suite of nodes to ComfyUI that integrate the OmniVoice model, providing zero-shot multilingual text-to-speech with voice cloning and synthetic voice design.

For the Non-Technical Reader

Imagine having a professional voice actor who can speak over 600 languages and learn a new voice just by listening to a 10-second clip. That is what this tool brings to your desktop. Rather than simply reading text aloud, it can perform: it handles multi-speaker dialogues and can include human-like expressions such as laughter or sighs. For creators, this means you can "design" a character's voice simply by describing its age, accent, and pitch, without needing any original recording.

For the Technical Reader

OmniVoice is a TTS model optimized for speed and flexibility. Key technical highlights include:

Performance: Achieves a Real-Time Factor (RTF, processing time divided by audio duration) as low as 0.025, which is roughly 40x faster than real time.
Architecture: Uses SageAttention, applied by monkey-patching Qwen3Attention, for faster attention on CUDA GPUs with compute capability 8.0 or higher (SM80+, i.e. Ampere and newer).
Efficiency: Features automatic CPU offloading, VBAR/aimdo integration, and Whisper ASR caching to prevent redundant processing of reference audio.
Precision and sampling: Supports bf16 and fp16 inference; synthesis is diffusion-based, with 4 to 64 steps and classifier-free guidance.
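
To make two of these highlights concrete, here is a minimal Python sketch. It shows how an RTF value translates into a speedup and a synthesis time, and how caching reference-audio transcripts by content hash can avoid repeated ASR runs. The names `speedup_from_rtf`, `synthesis_seconds`, and `TranscriptCache` are hypothetical and are not taken from the repository; the `transcribe` callable is a stand-in for whatever ASR step (such as Whisper) a workflow actually uses.

```python
import hashlib
from typing import Callable, Dict


def speedup_from_rtf(rtf: float) -> float:
    """RTF = processing time / audio duration, so speedup over real time is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


class TranscriptCache:
    """Hypothetical cache: reuse a transcript when identical reference audio recurs."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe      # stand-in for a real ASR call
        self._store: Dict[str, str] = {}   # content hash -> transcript
        self.misses = 0                    # how many times ASR actually ran

    def get(self, audio_bytes: bytes) -> str:
        # Key on the audio content, so a renamed or re-loaded file still hits the cache.
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._transcribe(audio_bytes)
        return self._store[key]
```

Under these assumptions, an RTF of 0.025 means one minute of speech takes about 1.5 seconds to generate, and a reference clip reused across many generations is transcribed only once.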

Why It Matters

This project reflects a shift toward high-fidelity, local-first voice synthesis. By supporting over 600 languages, it fills a gap left by proprietary models that often prioritize English. Running it locally within a ComfyUI workflow reduces reliance on expensive, privacy-invasive cloud APIs while offering professional-grade control over non-verbal cues and multi-character interactions.

The "Voice AI Space Lab" Idea

You could build an Automated Interactive Audiobook Studio. By feeding a script into a ComfyUI workflow, you could use the "Voice Design" feature to automatically generate distinct voices for every character based on their descriptions in the book (e.g., "an elderly man with a raspy voice"). The system could then render the entire book as a multi-speaker dialogue, complete with contextual laughs and sighs, all without hiring a single voice actor or recording a single line of audio.
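
As a rough illustration of the pre-processing such a studio would need, the sketch below turns a script into a per-line render plan that pairs each speaker with a voice description. It assumes a simple `Speaker: line` script format and treats untagged lines as narration. The functions `parse_script` and `build_render_plan` are hypothetical helpers, not part of ComfyUI-OmniVoice-TTS, and the sketch stops short of calling any TTS node.

```python
import re
from typing import Dict, List, Tuple

# Assumed format: "Speaker Name: spoken text". The lazy group stops at the first colon.
LINE_RE = re.compile(r"^\s*([A-Za-z][\w ]*?)\s*:\s*(.+?)\s*$")
DEFAULT_VOICE = "neutral narrator voice"


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Split a script into (speaker, line) pairs; untagged lines go to 'Narrator'."""
    pairs: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        match = LINE_RE.match(raw)
        if match:
            pairs.append((match.group(1), match.group(2)))
        else:
            pairs.append(("Narrator", raw.strip()))
    return pairs


def build_render_plan(script: str, voices: Dict[str, str]) -> List[Dict[str, str]]:
    """Attach a voice description to every line, falling back to a default voice."""
    return [
        {"speaker": speaker, "voice": voices.get(speaker, DEFAULT_VOICE), "text": line}
        for speaker, line in parse_script(script)
    ]


if __name__ == "__main__":
    demo = "Old Tom: I remember the sea.\nThe wind rose.\nMara: Listen: the tide."
    for item in build_render_plan(demo, {"Old Tom": "an elderly man with a raspy voice"}):
        print(item)
```

Each entry in the plan could then be handed to a voice-design step, one line at a time. Note that narration containing a colon near the start would be misread as dialogue under this simple format, so a real pipeline would need a stricter script convention.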

Explore the project here: GitHub Repository | OmniVoice Model
