    hibiki

    Git Repo
    kyutai-labs

    Hibiki is a model for streaming speech-to-speech translation, generating natural speech and text translations in real time, chunk by chunk.

    About hibiki

    Hibiki is a model designed for real-time speech translation, allowing for simultaneous interpretation as someone speaks.

    For the Non-Technical Reader:

    Imagine you're watching a foreign film without subtitles, and a voice actor translates the dialogue as it is spoken, matching the original speaker's tone. That's Hibiki. It's like having a personal interpreter who doesn't wait for the speaker to finish before translating, so conversations flow naturally. This technology could change international business meetings, letting participants understand each other without delays or after-the-fact translation. It also makes global content more accessible by translating lectures, presentations, and even casual conversations as they happen.

    For the Technical Reader:

    Hibiki is a decoder-only model that builds on Moshi's multistream architecture to jointly model source and target speech. It produces text and audio tokens at a constant 12.5 Hz frame rate (one frame every 80 ms), giving continuous audio output alongside a timestamped text translation. Training is supervised on aligned source and target speech and text; because real-world aligned data is scarce, it relies on synthetic data generation. Inference uses simple temperature sampling, which keeps it compatible with batching. Fidelity to the source voice is adjustable through Classifier-Free Guidance. Hibiki currently supports French-to-English translation only. Smaller variants such as Hibiki-M can run on smartphone hardware. Models are trained on sequences of up to 120 seconds and use a 40-second context. Inference code is available for PyTorch, Rust, MLX (macOS), and MLX-swift (iOS). The core implementation is closely tied to the kyutai-labs/moshi repository.
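
    The frame-rate arithmetic and the sampling step described above can be sketched in plain Python. This is a minimal illustration, not Hibiki's code: the function names are hypothetical, the guidance formula is the generic classifier-free guidance combination (assumed here, not confirmed from the Hibiki source), and any logits passed in are placeholder numbers.

```python
import math
import random

FRAME_RATE_HZ = 12.5  # token frame rate stated in the text


def frames_for(seconds: float) -> int:
    """Number of token frames covering `seconds` of audio at 12.5 Hz."""
    return int(seconds * FRAME_RATE_HZ)


def frame_timestamp(frame_index: int) -> float:
    """Start time in seconds of a frame; each frame spans 1 / 12.5 s."""
    return frame_index / FRAME_RATE_HZ


def cfg_logits(cond, uncond, coef):
    """Generic classifier-free guidance (assumed form, not Hibiki's exact one).

    coef = 1.0 returns the conditional logits unchanged; larger values push
    the result further away from the unconditional logits.
    """
    return [u + coef * (c - u) for c, u in zip(cond, uncond)]


def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

    At 12.5 Hz, the 120-second training length corresponds to frames_for(120) = 1500 frames and the 40-second context to frames_for(40) = 500 frames. Because each sampling step is a single independent draw per stream, nothing in this scheme prevents running many streams side by side, which is the sense in which temperature sampling is batch-friendly.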

    Why It Matters:

    Hibiki represents a shift towards more fluid and accessible multilingual communication. By enabling real-time speech-to-speech translation, it can reduce communication barriers in settings such as business meetings, lectures, and everyday conversation. The availability of implementations across multiple platforms (PyTorch, Rust, MLX) and the potential for running smaller models on mobile devices democratize access to this technology. Further, the use of synthetic data for training highlights a cost-effective approach to overcoming data scarcity in speech translation.

    The "Voice AI Space Lab" Idea:

    Imagine building a "Global Podcasting Studio" where creators record in their native language and Hibiki translates their speech into multiple languages in real time, preserving the speaker's voice. This could expand a podcast's reach to a global audience, making content accessible to listeners regardless of their language.

    The Collaborative CTA:

    What are the potential ethical considerations of real-time voice cloning in translation, and how can we ensure responsible deployment of this technology to prevent misuse or misrepresentation? Let's discuss!

    #VoiceAI #SpeechTranslation