speech-to-text-benchmark

The speech-to-text-benchmark repository by Picovoice provides a minimalist and extensible framework for objectively evaluating and comparing the performance of various speech-to-text (STT) engines. By standardizing metrics across both cloud-based and on-device solutions, it offers a transparent look at the current state of voice recognition technology.

For the Non-Technical Reader

Imagine you are organizing an international track and field event where every athlete claims to be the fastest. To find the true winner, you need a standard track, a precise stopwatch, and a fair set of rules. This repository is that standardized track for AI. Instead of taking a company's word for how good its voice recognition is, businesses can use this tool to see exactly how many mistakes an AI makes (accuracy) and how much computing power and time it takes to process speech (efficiency). It helps companies choose the right "voice engine" for their apps, ensuring they don't buy a race car when they only need a reliable commuter, or vice versa.

For the Technical Reader

This framework enables a rigorous comparison of engines including OpenAI Whisper, Amazon Transcribe, Google Speech-to-Text, and Picovoice's own Cheetah and Leopard. It moves beyond simple accuracy by measuring:

Word Error Rate (WER) and Punctuation Error Rate (PER): Accuracy metrics computed the same way for every engine across datasets like LibriSpeech, Common Voice, and VoxPopuli. WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the transcript, divided by the number of reference words.
Core-Hour: An efficiency metric giving the CPU hours required to process one hour of audio, which allows a direct comparison of computational overhead.
Word Emission Latency: The delay between the end of a spoken word and the emission of its transcription, essential for real-time streaming applications.
Model Size: The footprint of on-device models, in MB.
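
To make the metrics above concrete, here is a minimal sketch of how three of them could be computed. It is an illustration under stated assumptions, not the repository's own code: the function names (word_error_rate, core_hour, mean_emission_latency) are hypothetical, text normalization is reduced to lowercasing and whitespace splitting, and CPU time is assumed to be the total across all cores.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: normalization is just lowercasing and whitespace splitting.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def core_hour(cpu_seconds: float, audio_seconds: float) -> float:
    """CPU hours needed per hour of audio (a unitless ratio of durations)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return cpu_seconds / audio_seconds


def mean_emission_latency(
    word_end_times: Sequence[float], emit_times: Sequence[float]
) -> float:
    """Average delay, in seconds, between a word ending and its emission."""
    if not word_end_times or len(word_end_times) != len(emit_times):
        raise ValueError("need two non-empty sequences of equal length")
    delays = [emit - end for end, emit in zip(word_end_times, emit_times)]
    return sum(delays) / len(delays)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, and an engine that spends 360 CPU seconds on 3,600 seconds of audio has a Core-Hour of 0.1.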

The tool is built to be extensible, allowing developers to plug in new engines or datasets to validate performance in specific niches.
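
One common way to get this kind of extensibility is a small shared interface that every engine implements. The sketch below shows the idea only: the Engine base class, the CannedEngine toy implementation, and collect_transcripts are hypothetical names chosen for illustration and should not be read as the repository's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple


class Engine(ABC):
    """Hypothetical interface: anything that turns an audio file into text."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript for the audio file at audio_path."""


class CannedEngine(Engine):
    """Toy engine that returns pre-stored transcripts; stands in for a real one."""

    def __init__(self, transcripts: Dict[str, str]) -> None:
        self._transcripts = transcripts

    def transcribe(self, audio_path: str) -> str:
        # Unknown files yield an empty transcript rather than an error.
        return self._transcripts.get(audio_path, "")


def collect_transcripts(
    engine: Engine, dataset: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Run the engine over (audio_path, reference) pairs.

    Returns (reference, hypothesis) pairs ready to be scored by any metric.
    """
    return [(reference, engine.transcribe(path)) for path, reference in dataset]


if __name__ == "__main__":
    engine = CannedEngine({"a.wav": "hello world"})
    dataset = [("a.wav", "hello world"), ("b.wav", "good morning")]
    for reference, hypothesis in collect_transcripts(engine, dataset):
        print(f"{reference!r} -> {hypothesis!r}")
```

With this shape, adding a new engine means writing one subclass, and adding a new dataset means supplying another iterable of (audio path, reference transcript) pairs; the scoring code does not change.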

Why It Matters

In an industry often dominated by "black box" cloud APIs, this benchmark brings transparency and economic clarity. It highlights the trade-offs between high-accuracy, high-latency models (like Whisper) and high-efficiency, low-latency models designed for the edge. For enterprises, this data can be the difference between a project that is financially viable at scale and one that collapses under the weight of cloud compute costs or a poor user experience caused by lag.
