DramaBox
Implements expressive prompt-driven text-to-speech and voice cloning using an IC-LoRA fine-tune of the LTX-2.3 audio model.
About DramaBox
DramaBox, developed by Resemble AI, is a highly expressive Text-to-Speech (TTS) model that bridges the gap between static voice synthesis and dynamic performance. Built on the LTX-2.3 framework, it allows for granular control over emotion, delivery, and non-verbal cues through simple text prompting.
For the Non-Technical Reader
Imagine you are a film director working with a voice actor. Instead of just giving them a script, you can provide stage directions like "she said with a heavy sigh" or "he laughed mid-sentence." DramaBox acts as that director. It doesn't just read words; it understands context and emotion. By providing a short 10-second clip of a voice, you can clone that specific sound and then use text prompts to make that voice whisper, laugh, or pause naturally. It transforms TTS from a robotic tool into a creative partner for storytelling, gaming, and content creation.
For the Technical Reader
DramaBox is an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model. The architecture is a Diffusion Transformer (DiT), and the LoRA weights are merged into the base model for streamlined inference. Key technical components include:
Text Encoder: Uses Gemma-3-12b-it-bnb-4bit for high-level semantic understanding of prompts.
Hardware Requirements: Peak VRAM usage is approximately 24 GB, and a generation takes roughly 2.5 seconds on an H100.
Control Mechanism: Prompt-driven conditioning where stage directions (outside quotes) and literal sounds (inside quotes like "[laugh]") guide the DiT's output.
Safety: Integrated with Resemble Perth, an imperceptible neural watermark that survives compression and editing.
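To make the control mechanism above concrete, here is a minimal Python sketch that splits a prompt into stage directions (text outside quotes) and spoken text (text inside quotes), and pulls out bracketed sound tags such as "[laugh]". It is only an illustration of the convention described in this post: the names `parse_prompt` and `ParsedPrompt`, the use of straight double quotes, and the tag pattern are all assumptions, and nothing here calls DramaBox itself or reflects how the model tokenizes prompts internally.

```python
import re
from dataclasses import dataclass, field

# Assumption: spoken text sits inside straight double quotes, and literal
# sounds are lowercase tags in square brackets, e.g. [laugh].
QUOTED = re.compile(r'"([^"]*)"')
SOUND_TAG = re.compile(r"\[([a-z ]+)\]")


@dataclass
class ParsedPrompt:
    directions: list[str] = field(default_factory=list)  # text outside quotes
    spoken: list[str] = field(default_factory=list)      # text inside quotes
    sound_tags: list[str] = field(default_factory=list)  # tags found in speech


def parse_prompt(prompt: str) -> ParsedPrompt:
    """Split a prompt into stage directions, spoken lines, and sound tags."""
    parsed = ParsedPrompt()
    pos = 0
    for match in QUOTED.finditer(prompt):
        # Everything between the previous quote and this one is a direction.
        outside = prompt[pos:match.start()].strip(" ,.")
        if outside:
            parsed.directions.append(outside)
        parsed.spoken.append(match.group(1))
        parsed.sound_tags.extend(SOUND_TAG.findall(match.group(1)))
        pos = match.end()
    # Trailing text after the last quote is also a direction.
    tail = prompt[pos:].strip(" ,.")
    if tail:
        parsed.directions.append(tail)
    return parsed


example = parse_prompt('She whispers, "I told you [laugh] it would work," then sighs.')
print(example.directions)
print(example.spoken)
print(example.sound_tags)
```

A helper like this can be handy for checking a prompt before sending it to the model, for instance to confirm that a sound tag landed inside the quotes rather than in the stage directions.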
Why It Matters
This release is a notable entry in the open-source versus proprietary debate. By building on the Lightricks LTX-2.3 base, Resemble AI is giving the community high-tier expressive capabilities that were previously locked behind expensive APIs. The inclusion of robust watermarking also addresses the growing industry concern about AI safety and voice authenticity, offering a template for responsible open-weights deployment.
The Voice AI Space Lab Idea
Why not build an "Interactive NPC Narrator" for tabletop RPGs? Using DramaBox, a Dungeon Master could type out a character's dialogue and include emotional cues like (nervous stutter) or (booming authoritative tone). The model could generate the audio within seconds, allowing for a fully voiced, reactive world where the characters' emotions shift in real time based on the players' decisions.
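As a rough starting point for that idea, the sketch below turns a Dungeon Master's typed line, with cues in parentheses, into a prompt that puts the stage direction outside quotes and the dialogue inside them. The function name `npc_line_to_prompt`, the parenthesized-cue input format, and the exact wording of the generated direction are all hypothetical choices made for illustration. The resulting string would still need to be passed to whatever inference entry point the DramaBox repository provides, which is not shown here.

```python
import re

# Assumption: the Dungeon Master writes emotional cues in parentheses,
# e.g. "(nervous stutter) I... I didn't see anything."
CUE = re.compile(r"\(([^)]*)\)")


def npc_line_to_prompt(speaker: str, raw_line: str) -> str:
    """Convert '(cue) dialogue' into 'stage direction: "dialogue"'."""
    cues = [c.strip() for c in CUE.findall(raw_line) if c.strip()]
    # Remove the cues and collapse leftover whitespace to get the dialogue.
    dialogue = " ".join(CUE.sub(" ", raw_line).split())
    if not dialogue:
        raise ValueError("line has no dialogue to speak")
    # Keep the spoken text free of double quotes so the quoting stays unambiguous.
    dialogue = dialogue.replace('"', "'")
    direction = f"{speaker} speaks"
    if cues:
        direction += " with a " + " and a ".join(cues)
    return f'{direction}: "{dialogue}"'


print(npc_line_to_prompt("The innkeeper", "(nervous stutter) I... I didn't see anything."))
print(npc_line_to_prompt("The king", "(booming authoritative tone) Kneel before the throne!"))
```

Keeping the conversion as a plain string function means the table-side tool can preview or edit the prompt before any audio is generated.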
Explore the repository: GitHub - DramaBox
Try the demo: HuggingFace Space