    LLM-Dys

    Git Repo
    Berkeley-Speech-Group

    LLM-Dys uses large language models to synthesize realistic dysfluent speech, with dysfluencies at both the word and phoneme levels.

    About LLM-Dys

    This repository introduces LLM-Dys, a project focused on generating realistic dysfluent speech using large language models.

    For the Non-Technical Reader

    Imagine you're building a voice assistant that needs to sound more human. Humans don't speak perfectly; they stumble, repeat words, or pause. LLM-Dys helps create these imperfections artificially, making synthesized speech sound more natural and relatable. Think of it like adding realistic 'umms' and 'ahhs' to a robot's speech so it sounds less robotic and more like a person thinking out loud. This could be used in therapy tools to simulate different speech patterns or in creating more engaging characters for games and virtual assistants.

    For the Technical Reader

    LLM-Dys leverages large language models to introduce natural-sounding dysfluencies at both the word and phoneme levels. It supports six dysfluency types: repetition (REP), insertion (INS), deletion (DEL), pause (PAU), substitution (SUB), and prolongation (PRO). The dataset comprises approximately 12,790 hours of speech and supports multi-speaker generation via the VCTK dataset. The data generation process involves word-level synthesis, phoneme-level synthesis, and a dysfluency transcriber. Because the full dataset is large (~5TB), it has to be generated locally by following the repository's setup instructions, which include configuring VITS. The repository also provides instructions for using pre-trained models.
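
    To make the six dysfluency labels concrete, here is a minimal rule-based sketch that applies one dysfluency to a token sequence (words or phonemes). It is not the LLM-Dys pipeline: the project relies on an LLM to introduce dysfluencies, whereas this toy version takes the type and position as explicit inputs. The names `Dysfluency` and `apply_dysfluency`, the `<pause>` and `<pro>` markers, and the default filler "um" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Labels taken from the text above; everything else here is illustrative.
DYSFLUENCY_TYPES = {"REP", "INS", "DEL", "PAU", "SUB", "PRO"}


@dataclass
class Dysfluency:
    kind: str          # one of DYSFLUENCY_TYPES
    index: int         # position of the affected token
    payload: str = ""  # filler (INS) or replacement (SUB) token


def apply_dysfluency(tokens: List[str], d: Dysfluency) -> List[str]:
    """Return a copy of `tokens` with a single dysfluency applied."""
    if d.kind not in DYSFLUENCY_TYPES:
        raise ValueError(f"unknown dysfluency type: {d.kind}")
    if not 0 <= d.index < len(tokens):
        raise IndexError("dysfluency index out of range")

    out = list(tokens)
    i = d.index
    if d.kind == "REP":
        out.insert(i, tokens[i])         # say the token twice
    elif d.kind == "INS":
        out.insert(i, d.payload or "um")  # filler before the token
    elif d.kind == "DEL":
        del out[i]                        # drop the token
    elif d.kind == "PAU":
        out.insert(i, "<pause>")          # hypothetical pause marker
    elif d.kind == "SUB":
        if not d.payload:
            raise ValueError("SUB needs a replacement token in payload")
        out[i] = d.payload                # replace the token
    elif d.kind == "PRO":
        out.insert(i + 1, "<pro>")        # hypothetical prolongation marker
    return out


words = ["please", "call", "stella"]
repeated = apply_dysfluency(words, Dysfluency("REP", 1))
paused = apply_dysfluency(words, Dysfluency("PAU", 2))
prolonged = apply_dysfluency(["s", "t", "eh"], Dysfluency("PRO", 0))
```

    In this sketch the same function handles word and phoneme sequences, since both are just lists of tokens. How a downstream synthesizer would interpret the markers is left open.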

    Why It Matters

    By open-sourcing a method for generating dysfluent speech, LLM-Dys lowers the barrier to entry for researchers and developers working on more realistic and human-sounding speech synthesis. This is particularly important for applications where naturalness and relatability are crucial, such as assistive technologies and interactive voice agents. The availability of a large-scale dataset also promotes further research in this area. The project's open nature fosters community contribution and accelerates innovation in speech synthesis.

    The "Voice AI Space Lab" Idea

    Imagine creating a "Speech Personality Generator" where users can dial in different levels and types of dysfluency to create unique vocal profiles for virtual characters. You could even let users upload their own speech samples and have the system analyze and replicate their specific dysfluency patterns!

    The Collaborative CTA

    How can we best evaluate the perceived naturalness and acceptability of artificially generated dysfluencies in different cultural contexts, and what metrics beyond simple word error rate are most relevant?
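
    As a reference point for that question, the sketch below computes plain word error rate (word-level edit distance divided by reference length). The scoring setup is an assumption made for illustration: a fluent reference is compared against a verbatim transcript of dysfluent speech, which shows how WER counts intended repetitions and fillers as errors. The function name `word_error_rate` is hypothetical.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


fluent = "please call stella".split()
dysfluent = "please call call um stella".split()
wer = word_error_rate(fluent, dysfluent)  # two insertions over three reference words
```

    Here the repetition and the filler each count as one insertion, so the score reflects the dysfluencies themselves rather than any recognition mistake.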

    GitHub Repository | Demo | Sample Data | HuggingFace Dataset | Paper

    #VoiceAI #SpeechSynthesis