stt-benchmark
Benchmarks Speech-to-Text services measuring Time To Final Segment latency and semantic Word Error Rate for real-time applications.
About stt-benchmark
Measuring the Pulse of Real-Time Voice AI
The stt-benchmark repository by Pipecat AI is a specialized framework designed to measure the two most critical factors in real-time voice applications: how fast a system hears (latency) and how accurately it understands (semantic accuracy). By providing a standardized way to test major providers, it removes the guesswork from building conversational AI.
1. For the Non-Technical Reader
Imagine talking to someone who pauses for two seconds after every sentence you say: the conversation quickly becomes awkward and robotic. This tool acts like a high-precision stopwatch for AI listeners. It helps companies choose the best 'digital ear' for their AI agents, so that when a human speaks, the AI responds almost instantly without misunderstanding the core message. It is the difference between a clunky automated menu and a fluid, lifelike conversation.
2. For the Technical Reader
The framework evaluates STT services, including Deepgram, Soniox, AssemblyAI, and Azure, using two primary metrics: TTFS (Time To Final Segment) and Semantic WER (Word Error Rate). Unlike standard WER, Semantic WER ignores trivialities like punctuation or filler words, focusing on errors that would actually confuse an LLM. Developers can compare providers along the latency-accuracy Pareto frontier, that is, the set of services that no other service beats on both metrics at once, and can look specifically at P95 and P99 latency to judge how a service behaves in its slowest cases rather than only on average. The tool requires Python 3.12+ and uv, and it integrates directly with Pipecat service configurations.
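To make the Pareto frontier idea concrete, here is a minimal sketch. It is not code from the repository: the `Result` class, the `pareto_frontier` function, the provider names, and all numbers are hypothetical and chosen only for illustration. A provider is on the frontier when no other provider is at least as good on both P95 TTFS and Semantic WER and strictly better on one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One provider's summary numbers (hypothetical, not real benchmark data)."""
    provider: str
    p95_ttfs_ms: float   # lower is better
    semantic_wer: float  # lower is better


def pareto_frontier(results):
    """Return the results that no other result dominates on both metrics."""
    frontier = []
    for r in results:
        dominated = any(
            o.p95_ttfs_ms <= r.p95_ttfs_ms
            and o.semantic_wer <= r.semantic_wer
            and (o.p95_ttfs_ms < r.p95_ttfs_ms or o.semantic_wer < r.semantic_wer)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Sort fastest-first so the trade-off reads left to right.
    return sorted(frontier, key=lambda r: r.p95_ttfs_ms)


results = [
    Result("provider_a", 300.0, 0.060),
    Result("provider_b", 450.0, 0.040),
    Result("provider_c", 500.0, 0.070),  # slower AND less accurate than a and b
    Result("provider_d", 700.0, 0.030),
]

frontier_names = [r.provider for r in pareto_frontier(results)]
print(frontier_names)
```

In this made-up data set, `provider_c` drops out because another provider is both faster and more accurate; the remaining three each represent a different point on the speed-versus-accuracy trade-off.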
3. Why It Matters
The Voice AI market is no longer just about who is the most accurate; it is about the latency-vs-accuracy trade-off. This benchmark highlights that a service with a great median speed might have a poor P95, leading to 'jittery' user experiences. As the industry moves toward more agentic workflows, the ability to objectively compare proprietary providers against open-source alternatives based on real-world 'Time to Final Segment' is vital for managing both cost and user retention.
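The median-versus-P95 point can be checked with a few lines of arithmetic. The sketch below uses a simple nearest-rank percentile and two invented latency samples (the numbers are assumptions, not benchmark results): one service that always answers in 400 ms, and one that is usually faster but stalls on one request in ten.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical TTFS samples in milliseconds.
steady = [400] * 100                # always 400 ms
spiky = [250] * 90 + [1500] * 10    # usually 250 ms, but 10% of requests stall

for name, samples in [("steady", steady), ("spiky", spiky)]:
    print(
        name,
        "P50:", percentile(samples, 50),
        "P95:", percentile(samples, 95),
        "P99:", percentile(samples, 99),
    )
```

The spiky service wins on the median (250 ms against 400 ms) yet its P95 is 1500 ms, which is the pattern a user would experience as an occasionally frozen conversation.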
4. The Voice AI Space Lab Idea
Why not build a 'Smart Provider Switcher'? Using these benchmarks, you could create a middleware layer that monitors real-time performance. If your primary STT provider starts lagging during peak hours, your system could automatically hot-swap to the next fastest provider on the Pareto frontier, ensuring your voice bot never misses a beat. Check out the repository here: https://github.com/pipecat-ai/stt-benchmark
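One possible shape for such a switcher is sketched below. Everything here is an assumption made for illustration: the `ProviderSwitcher` class, its parameters, the provider names, and the 600 ms budget are hypothetical, and the sketch does not use any Pipecat or stt-benchmark API. It keeps a rolling window of observed TTFS values per provider and, when the active provider's rolling P95 exceeds the budget, moves to the next provider in a ranking you would derive from your own benchmark runs.

```python
from collections import deque


class ProviderSwitcher:
    """Pick an STT provider based on a rolling P95 of observed TTFS (illustrative only)."""

    def __init__(self, ranked_providers, p95_budget_ms, window=50, min_samples=20):
        self.ranked = list(ranked_providers)  # best first, e.g. from benchmark results
        self.budget = p95_budget_ms
        self.min_samples = min_samples
        self.samples = {p: deque(maxlen=window) for p in self.ranked}
        self.active = self.ranked[0]

    def record(self, provider, ttfs_ms):
        """Store one observed latency; old samples fall out of the window."""
        self.samples[provider].append(ttfs_ms)

    def _p95(self, provider):
        data = sorted(self.samples[provider])
        if len(data) < self.min_samples:
            return None  # not enough evidence to judge this provider yet
        rank = -(-95 * len(data) // 100)  # ceiling division, nearest-rank P95
        return data[rank - 1]

    def choose(self):
        """Return the provider to use for the next utterance."""
        p95 = self._p95(self.active)
        if p95 is not None and p95 > self.budget:
            for candidate in self.ranked:
                if candidate == self.active:
                    continue
                cand_p95 = self._p95(candidate)
                # Accept a candidate that is within budget or not yet measured.
                if cand_p95 is None or cand_p95 <= self.budget:
                    self.active = candidate
                    break
        return self.active


switcher = ProviderSwitcher(["provider_a", "provider_b"], p95_budget_ms=600)
for _ in range(20):
    switcher.record("provider_a", 300)
before = switcher.choose()

for _ in range(5):
    switcher.record("provider_a", 2000)  # simulated peak-hour slowdown
after = switcher.choose()
print(before, "->", after)
```

A real deployment would need more than this: switching is assumed here to happen between utterances rather than mid-stream, and a production version would likely want hysteresis or a cooldown so the system does not flap between providers.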