    granite-speech-models

    Git Repo
    ibm-granite

    Provides multilingual speech-language models for automatic speech recognition and translation using a two-pass design for enterprise applications.

    About granite-speech-models

    IBM has released Granite Speech Models, a suite of compact speech-language models that connect a speech encoder to a Granite language model. Built on the Granite-3.3 language models, they are optimized for enterprise-grade Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST).

    1. For the Non-Technical Reader

    Imagine having a highly skilled executive assistant who doesn't just transcribe what you say, but understands the context of the conversation. Most speech tools simply turn sound into text; Granite Speech uses a "two-pass" system. First, it listens and writes down the words (ASR). Then, in a separate step, the underlying language model processes that text, whether that means translating it into another language or summarizing the key points. For a human user, this can mean moving from a spoken conversation in Spanish to a written summary in English within one toolchain, with the language model's grasp of context helping on difficult audio.

    2. For the Technical Reader

    The Granite Speech architecture (specifically revision 3.3.2) is a modality-aligned design: a 16-block Conformer speech encoder is connected to granite-3.3-2b-instruct or granite-3.3-8b-instruct. Key technical specifications include:

    • Two-Pass Design: Transcription and LLM processing of the transcript are separate steps, each explicitly initiated by the caller, which allows modular workflows.
    • Training: Modality alignment on public corpora, with the speech encoder trained on character-level targets using Connectionist Temporal Classification (CTC).
    • Multilingual Support: Speech input in English, French, German, Spanish, and Portuguese, with translation capabilities that also cover Japanese and Mandarin.
    • Deployment: Native support in transformers and vLLM for high-throughput inference.
    • License: Apache 2.0, allowing for broad commercial application.
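To make the "character-level targets using CTC" bullet concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not Granite Speech code and not how the model produces its final transcripts (the language model does that); it only illustrates how a CTC-trained encoder's per-frame character predictions collapse into text. The blank symbol `_`, the toy alphabet, and the scores are assumptions made up for this example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Apply the CTC collapsing rule: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: Sequence[str],
                      blank: str = BLANK) -> str:
    """Pick the highest-scoring character per frame, then collapse."""
    labels: List[str] = []
    for row in frame_scores:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(alphabet[best_index])
    return ctc_collapse(labels, blank)


# Toy example: five audio frames scored over a three-symbol alphabet.
alphabet = ["_", "h", "i"]
frame_scores = [
    [0.10, 0.80, 0.10],  # "h"
    [0.20, 0.70, 0.10],  # "h" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "i"
    [0.10, 0.20, 0.70],  # "i" again (merged)
]
print(ctc_greedy_decode(frame_scores, alphabet))
```

The blank symbol is what lets CTC represent genuine double letters: `ll_ll` collapses to `ll`, while `llll` collapses to a single `l`.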

    3. Why It Matters

    In a landscape dominated by massive, proprietary black-box models, Granite Speech offers an open alternative at the 2B and 8B parameter scale. By providing an Apache 2.0 licensed model that can be self-hosted, IBM is enabling enterprises to keep audio and transcripts inside their own infrastructure and to avoid the network latency and per-call costs associated with cloud-based speech APIs. It is a significant step toward democratizing high-fidelity multilingual voice AI.

    4. The "Voice AI Space Lab" Idea

    The "Real-Time Diplomat": Use Granite Speech to build a low-latency wearable or desktop app that listens to a multilingual roundtable. Using the 8B model, the app could transcribe the French and German speakers in real time, while the underlying Granite LLM provides a running "sentiment and summary" feed in English, flagging potential misunderstandings or key consensus points as they happen.
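One way to structure such an app around the two-pass design is sketched below. Everything here is hypothetical: `RoundtableFeed`, the `transcribe` and `summarize` callables, and the stub functions are illustrative names, not part of any Granite API. In a real build, `transcribe` would wrap a Granite Speech ASR call and `summarize` a second, separately initiated call to the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces for the two passes.
Transcriber = Callable[[bytes, str], str]  # (audio chunk, language code) -> transcript
Summarizer = Callable[[List[str]], str]    # (transcript lines so far) -> English summary


@dataclass
class RoundtableFeed:
    transcribe: Transcriber
    summarize: Summarizer
    lines: List[str] = field(default_factory=list)

    def on_segment(self, speaker: str, language: str, audio: bytes) -> str:
        # Pass 1: speech to text for the newest audio segment.
        text = self.transcribe(audio, language)
        self.lines.append(f"{speaker} ({language}): {text}")
        # Pass 2: an explicit second call that processes the text so far.
        return self.summarize(self.lines)


# Stubs so the sketch runs without any model; they only mimic the data flow.
def fake_transcribe(audio: bytes, language: str) -> str:
    return audio.decode("utf-8")


def fake_summarize(lines: List[str]) -> str:
    return f"{len(lines)} segment(s); latest: {lines[-1]}"


feed = RoundtableFeed(fake_transcribe, fake_summarize)
print(feed.on_segment("Speaker A", "fr", b"Bonjour a tous"))
print(feed.on_segment("Speaker B", "de", b"Guten Tag"))
```

Keeping the two passes behind separate callables means either one can be swapped out or rate-limited independently, for example summarizing only every few segments to keep latency down.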

    Explore the project here: GitHub Repository | HuggingFace Collection | Tech Report