
    FireRedVAD

    Git Repo
    FireRedTeam

    Detects voice activity and audio events in over one hundred languages using streaming and non-streaming DFSMN-based models.

    About FireRedVAD

    FireRedVAD is an industrial-grade tool for one of the most fundamental problems in audio processing: accurately detecting when someone is speaking or singing, or when background music is playing, in audio spanning more than 100 languages.

    For the Non-Technical Reader

    Imagine a smart assistant that doesn't just listen, but knows exactly when you start talking and when you've finished, even in a noisy room or while music is playing. FireRedVAD acts like a highly trained "gatekeeper" for audio systems. Instead of a computer trying to process hours of silence or background noise, it instantly flags the meaningful parts. For the user, this means faster response times for voice apps, fewer errors in transcription, and better privacy, as the system only "wakes up" when there is actual human activity.

    For the Technical Reader

    Built on a DFSMN (Deep Feedforward Sequential Memory Network) architecture, FireRedVAD offers streaming and non-streaming Voice Activity Detection (VAD), plus non-streaming Audio Event Detection (AED). Key technical highlights include:

    • Performance: Achieves a 97.57% F1 score on the FLEURS-VAD-102 benchmark, outperforming Silero-VAD, TEN-VAD, and FunASR-VAD.

    • Accuracy: Maintains a low False Alarm Rate of 2.69% and a Miss Rate of 3.62%.

    • Versatility: Supports speech, singing, and music detection across 100+ languages.

    • Deployment: Supports the NCNN runtime for multi-platform deployment and expects input audio as 16 kHz, 16-bit mono PCM.
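
The input-format requirement in the Deployment bullet can be checked before audio is handed to any VAD. The following is a minimal sketch using Python's standard `wave` module. The function name `check_vad_input` is hypothetical and not part of FireRedVAD, and the check only covers WAV files.

```python
import wave


def check_vad_input(path):
    """Return a list of format problems for a WAV file.

    An empty list means the file is 16 kHz, 16-bit, mono.
    Hypothetical helper for illustration; not a FireRedVAD API.
    The standard `wave` module only opens uncompressed PCM WAV files,
    so other encodings raise wave.Error when the file is opened.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        width = wf.getsampwidth()      # bytes per sample
        channels = wf.getnchannels()
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")
        if width != 2:
            problems.append(f"sample width is {width * 8} bits, expected 16 bits")
        if channels != 1:
            problems.append(f"{channels} channels found, expected mono")
    return problems
```

Calling `check_vad_input("clip.wav")` on a conforming file returns an empty list. Files that fail the check would need to be resampled or downmixed first.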

    Explore the technical details in the Research Paper or try the Live Demo.
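
For readers unfamiliar with the metrics quoted above, this sketch shows how a false alarm rate, a miss rate, and an F1 score are commonly computed from frame-level labels, with speech as the positive class. It assumes the standard textbook definitions. The exact evaluation protocol used for FLEURS-VAD-102 is not described here, and `vad_metrics` is an illustrative name.

```python
def vad_metrics(reference, predicted):
    """Compute frame-level VAD metrics, treating speech (truthy) as positive.

    Assumes the usual definitions:
      false alarm rate = FP / (FP + TN)   non-speech frames flagged as speech
      miss rate        = FN / (FN + TP)   speech frames that were not flagged
      F1               = harmonic mean of precision and recall
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "false_alarm_rate": ratio(fp, fp + tn),
        "miss_rate": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy example: 8 frames, one false alarm and one missed speech frame.
reference = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
metrics = vad_metrics(reference, predicted)
```

Note that a false alarm rate and a miss rate alone do not determine F1, since precision also depends on how much of the audio is speech.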

    Why It Matters

    Releasing a high-performance, multilingual VAD as open source reduces dependence on expensive proprietary APIs. By cutting "compute waste" (the cost of processing silence or non-speech audio), it lowers operational overhead for AI startups. Its ability to distinguish speech from singing also makes it useful for content moderation and automated media tagging platforms.
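
To make the "compute waste" point concrete, here is a rough sketch of how per-frame speech decisions could be merged into segments, and how much audio a downstream system would then still need to process. The 10 ms default frame length, the function names, and the input flags are assumptions for illustration, not FireRedVAD's actual output format.

```python
def frames_to_segments(speech_flags, frame_seconds=0.01):
    """Merge consecutive speech frames into (start, end) segments in seconds.

    `speech_flags` is a sequence of truthy/falsy per-frame decisions.
    The 10 ms default frame length is an assumption for this sketch.
    """
    segments = []
    start = None
    for i, flag in enumerate(speech_flags):
        if flag and start is None:
            start = i                      # a speech run begins
        elif not flag and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None                   # the speech run has ended
    if start is not None:                  # audio ended while still in speech
        segments.append((start * frame_seconds, len(speech_flags) * frame_seconds))
    return segments


def retained_fraction(segments, total_seconds):
    """Fraction of the audio that falls inside the speech segments."""
    if total_seconds <= 0:
        return 0.0
    return sum(end - start for start, end in segments) / total_seconds


# Toy input: 10 frames of 0.5 s each, so 5 s of audio in total.
flags = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
segments = frames_to_segments(flags, frame_seconds=0.5)
kept = retained_fraction(segments, total_seconds=5.0)
```

In this toy input half of the audio is speech, so a transcription step that only receives the segments would process half as much audio as one fed the full recording.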

    Check out the repository here: FireRedVAD on GitHub and the models on Hugging Face.