LLM-Dys
LLM-Dys uses large language models to generate realistic dysfluent speech, with dysfluencies at both the word and phoneme levels.
About LLM-Dys
The LLM-Dys repository provides the code and instructions for generating realistic dysfluent speech with large language models.
For the Non-Technical Reader
Imagine you're building a voice assistant that needs to sound more human. Humans don't speak perfectly; they stumble, repeat words, or pause. LLM-Dys helps create these imperfections artificially, making synthesized speech sound more natural and relatable. Think of it like adding realistic 'umms' and 'ahhs' to a robot's speech so it sounds less robotic and more like a person thinking out loud. This could be used in therapy tools to simulate different speech patterns or in creating more engaging characters for games and virtual assistants.
For the Technical Reader
LLM-Dys uses large language models to produce natural dysfluency patterns, then synthesizes them at both the word and phoneme levels. It supports six dysfluency types: repetition (REP), insertion (INS), deletion (DEL), pause (PAU), substitution (SUB), and prolongation (PRO). The dataset comprises approximately 12,790 hours of speech and supports multi-speaker generation via the VCTK dataset. The data generation process covers word-level synthesis, phoneme-level synthesis, and a dysfluency transcriber. Because the full dataset is large (~5TB), it has to be generated locally by following the repository's setup instructions, which include configuring VITS. The repository also explains how to use the pre-trained models.
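To make the word-level dysfluency types concrete, here is a minimal Python sketch that applies one labeled dysfluency to a sentence before it would be passed to a speech synthesizer. It is an illustration only, not the repository's code: LLM-Dys uses a language model to choose natural patterns, whereas this sketch uses fixed rules. The names `inject_word_dysfluency` and `DysfluentText`, the `<pause>` token, and the default filler word are all assumptions made for this example. Prolongation (PRO) is left out because it stretches a sound inside a word rather than changing the word sequence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DysfluentText:
    """Hypothetical container: the edited word sequence plus its label."""
    words: List[str]
    label: str      # one of REP, INS, DEL, PAU, SUB (labels from the post)
    position: int   # index in the ORIGINAL sentence where the edit applies


def inject_word_dysfluency(
    text: str,
    kind: str,
    position: int,
    filler: str = "um",                # assumed filler word for INS
    substitute: Optional[str] = None,  # replacement word, required for SUB
) -> DysfluentText:
    """Apply a single rule-based, word-level dysfluency to `text`."""
    words = text.split()
    if not 0 <= position < len(words):
        raise IndexError(f"position {position} is outside the sentence")

    before, target, after = words[:position], words[position], words[position + 1:]

    if kind == "REP":      # repeat the target word once
        edited = before + [target, target] + after
    elif kind == "INS":    # insert a filler before the target word
        edited = before + [filler, target] + after
    elif kind == "DEL":    # drop the target word
        edited = before + after
    elif kind == "PAU":    # mark a pause before the target word
        edited = before + ["<pause>", target] + after
    elif kind == "SUB":    # swap the target word for another word
        if substitute is None:
            raise ValueError("SUB needs a substitute word")
        edited = before + [substitute] + after
    else:
        raise ValueError(f"unsupported dysfluency type: {kind}")

    return DysfluentText(words=edited, label=kind, position=position)


if __name__ == "__main__":
    sample = inject_word_dysfluency("please call me later", "REP", 1)
    print(" ".join(sample.words), "|", sample.label, "@", sample.position)
```

Keeping the label and position next to the edited text mirrors the idea that each synthesized utterance carries its own dysfluency annotation, which is what makes generated data usable for training detection models.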
Why It Matters
By open-sourcing a method for generating dysfluent speech, LLM-Dys lowers the barrier to entry for researchers and developers working on more realistic and human-sounding speech synthesis. This is particularly important for applications where naturalness and relatability are crucial, such as assistive technologies and interactive voice agents. The availability of a large-scale dataset also promotes further research in this area. The project's open nature fosters community contribution and accelerates innovation in speech synthesis.
The "Voice AI Space Lab" Idea
Imagine creating a "Speech Personality Generator" where users can dial in different levels and types of dysfluency to create unique vocal profiles for virtual characters. You could even let users upload their own speech samples and have the system analyze and replicate their specific dysfluency patterns!
The Collaborative CTA
How can we best evaluate the perceived naturalness and acceptability of artificially generated dysfluencies in different cultural contexts, and what metrics beyond simple word error rate are most relevant?
GitHub Repository | Demo | Sample Data | HuggingFace Dataset | Paper
#VoiceAI #SpeechSynthesis