GLM-TTS

GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

GLM-TTS is a text-to-speech (TTS) system that clones voices and generates emotionally expressive speech, with expressiveness tuned through multi-reward reinforcement learning. It operates in two stages: first, a large language model (LLM) generates speech token sequences; then a Flow model, followed by a vocoder, converts these tokens into audio waveforms.

For the Non-Technical Reader

Imagine you want a computer to speak in your voice, or with a specific emotion like happiness or sadness. GLM-TTS lets you do just that. It's like having a voice actor in your computer that can mimic any voice from just a few seconds of audio. This could be used to create personalized audiobooks, more engaging virtual assistants, or even help people who have lost their voice to communicate again.

For the Technical Reader

GLM-TTS employs a two-stage architecture: a Llama-based LLM converts text into speech tokens, and a Flow Matching model converts those tokens into a mel-spectrogram, which a vocoder then turns into a waveform. Zero-shot voice cloning works by extracting speaker features from prompt audio. A Multi-Reward Reinforcement Learning framework is used to enhance emotional expression and prosody control. The system also offers streaming inference and phoneme-level text-to-speech conversion. Chinese is the primary language, with some support for text that mixes in English. The project provides inference scripts and model weights, available on Hugging Face and ModelScope.
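
To make the data flow concrete, here is a minimal Python sketch of the stages described above: speaker features from prompt audio, text to speech tokens, tokens to mel frames, and mel frames to a waveform. Every function name, type, and parameter here is a hypothetical stand-in invented for illustration, and each stage is replaced by trivial arithmetic. It is not the GLM-TTS API and does not reflect how the real models work internally; it only shows how the pieces hand data to each other.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeakerPrompt:
    """A few seconds of reference audio as raw samples (hypothetical container)."""
    samples: List[float]


def extract_speaker_features(prompt: SpeakerPrompt) -> List[float]:
    # Stand-in for a speaker encoder: mean absolute amplitude and peak amplitude.
    if not prompt.samples:
        return [0.0, 0.0]
    mags = [abs(s) for s in prompt.samples]
    return [sum(mags) / len(mags), max(mags)]


def text_to_speech_tokens(text: str, speaker: List[float]) -> List[int]:
    # Stage 1 stand-in: a real system would run an LLM here.
    # This toy version emits one token per character and ignores the speaker.
    return [ord(ch) % 256 for ch in text]


def tokens_to_mel(tokens: List[int], speaker: List[float], n_mels: int = 4) -> List[List[float]]:
    # Stage 2 stand-in: a real system would run a Flow Matching model here.
    # This toy version emits one flat frame per token, scaled by a speaker feature.
    scale = 1.0 + speaker[0]
    return [[scale * tok / 255.0] * n_mels for tok in tokens]


def vocoder(mel: List[List[float]], hop: int = 2) -> List[float]:
    # Vocoder stand-in: expand each frame into `hop` identical samples.
    wav: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wav.extend([level] * hop)
    return wav


def synthesize(text: str, prompt: SpeakerPrompt) -> List[float]:
    # Chain the stages in the order the architecture describes.
    speaker = extract_speaker_features(prompt)
    tokens = text_to_speech_tokens(text, speaker)
    mel = tokens_to_mel(tokens, speaker)
    return vocoder(mel)
```

Calling `synthesize("hi", SpeakerPrompt([0.1, -0.2, 0.3]))` in this toy version yields four samples: two characters, one frame each, two samples per frame. In a real system each stand-in would be a trained model, and the lengths would depend on the audio rather than the character count.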

Why It Matters

GLM-TTS brings zero-shot voice cloning and controllable emotional expression together in one open-source system. By open-sourcing this technology, the project makes advanced voice cloning and emotional speech synthesis available to far more developers and researchers. This can foster innovation in areas like accessibility, entertainment, and personalized communication. Using reinforcement learning for emotion control is a notable step, and it could lead to more natural and engaging human-computer interactions.

The "Voice AI Space Lab" Idea

Imagine building a "Mood Radio" app that reads news articles in a voice that matches the tone of the story. If it's a happy story, the voice is cheerful; if it's a sad story, the voice is somber. This could create a more immersive and emotionally resonant news experience.
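
As a rough starting point for the "Mood Radio" idea, the sketch below picks a tone label for an article using a tiny keyword list. The word lists, the function name `pick_emotion`, and the labels "cheerful", "somber", and "neutral" are all assumptions made up for this example; a real app would likely use a proper sentiment model, and how GLM-TTS accepts emotion control is not covered here.

```python
# Hypothetical keyword lists; a real app would use a trained sentiment model.
POSITIVE = {"win", "celebrate", "record", "joy", "rescued"}
NEGATIVE = {"loss", "crisis", "dies", "flood", "mourning"}


def pick_emotion(article: str) -> str:
    """Return a tone label for the article based on simple keyword counts."""
    # Normalize: split on whitespace, strip common punctuation, lowercase.
    words = {w.strip(".,!?;:\"'").lower() for w in article.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "cheerful"
    if score < 0:
        return "somber"
    return "neutral"
```

The returned label would then be handed to whatever emotion control the TTS system exposes, so a story like "Fans celebrate a record win." is read in a cheerful voice and "Flood crisis deepens." in a somber one.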

The Collaborative CTA

How do you see reinforcement learning impacting the future of personalized voice experiences, and what ethical considerations should we be mindful of as these technologies become more sophisticated?

#TTS #VoiceAI