WildASR-public

WildASR: Stress-Testing Voice AI for the Real World

WildASR is a multilingual diagnostic benchmark designed to evaluate the robustness of Automatic Speech Recognition (ASR) systems under real-world, out-of-distribution (OOD) conditions. Developed by Boson AI, it moves beyond clean, lab-recorded datasets to reveal how ASR systems perform when faced with the messy reality of human speech.

For the Non-Technical Reader

Think of most AI speech tools as high-performance sports cars: they work beautifully on a smooth, dry racetrack but might stall the moment they hit a gravel road or heavy rain. WildASR acts as the ultimate all-terrain testing ground. It checks if an AI can still understand a child’s voice, a senior’s speech, or a conversation happening in a noisy cafe. For users, this means fewer "I'm sorry, I didn't catch that" moments and more reliable voice assistants that work for everyone, regardless of their accent or environment.

For the Technical Reader

WildASR decomposes ASR robustness into three critical axes: Environmental Degradation (reverberation, clipping, codecs), Demographic Shift (age groups, accents), and Linguistic Diversity (code-switching, incomplete audio). Key technical highlights include:

Real-World Data: Unlike many benchmarks, WildASR uses 100% real human speech rather than synthetic TTS-generated audio.
Hallucination Detection: Introduces the Hallucination Error Rate (HER), which uses an LLM judge (GPT-4o-mini) to flag content that a model outputs but that was never spoken, typically when the input is degraded.
Broad Model Support: The framework supports local models (Whisper, Canary) and API-based systems (GPT-4o, Gemini, Deepgram, ElevenLabs, DashScope).
Multilingual Scope: Covers four languages with systematic isolation of failure modes.
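
To make the metrics above concrete, here is a minimal sketch of how a robustness check along the environmental axis might be scored. It is not the WildASR implementation: the function names, the Utterance type, and the clipping threshold are hypothetical, and HER is assumed here to be the share of utterances that the judge flags as hallucinated (the benchmark's exact definition may differ). The judge verdict is a hard-coded field standing in for the GPT-4o-mini call.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def hard_clip(samples: list[float], threshold: float) -> list[float]:
    """Simulate clipping by limiting each sample to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


@dataclass
class Utterance:
    reference: str
    hypothesis: str
    # Assumption: in a real run this verdict would come from the LLM judge.
    judged_hallucination: bool


def hallucination_error_rate(utterances: list[Utterance]) -> float:
    """Assumed definition: fraction of utterances flagged by the judge."""
    if not utterances:
        raise ValueError("need at least one utterance")
    return sum(u.judged_hallucination for u in utterances) / len(utterances)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    batch = [
        Utterance("turn on the light", "turn off the light", False),
        Utterance("call my sister", "call my sister and thank you for watching", True),
    ]
    for u in batch:
        print(u.reference, "->", word_error_rate(u.reference, u.hypothesis))
    print("HER:", hallucination_error_rate(batch))
    print("clipped:", hard_clip([0.2, -0.9, 1.0], threshold=0.5))
```

Keeping WER and HER as separate numbers matters because a transcript can have a modest word error rate and still contain invented content, which is the failure the judge is meant to catch.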

Explore the code on GitHub and access the dataset on Hugging Face.

Why It Matters

The industry often claims "human parity" based on clean, in-distribution datasets like FLEURS. WildASR shows that this parity is fragile. By exposing uneven robustness gaps and the safety risks posed by hallucinations, the benchmark pushes the industry toward more inclusive and reliable Voice AI. It aims to shift the economic incentive from simply scaling models to improving their reliability in edge cases.

The Voice AI Space Lab Idea

Imagine building the "Universal Family Bridge." Using WildASR to find and fix your models' weak spots, you could create a smart home hub that reliably transcribes a multilingual household where the kids speak one language, the grandparents speak another with heavy accents, and the TV is always blaring in the background, all without the AI making up words to fill the gaps.
