eot-bench
Provides an open benchmark and multilingual dataset for evaluating end-of-turn detection performance and latency in voice AI systems.
About eot-bench
LiveKit has released eot-bench, an open-source benchmark and dataset aimed at one of the most frustrating problems in voice AI: knowing when a person has finished speaking. The repository provides a standardized way to measure End-of-Turn (EoT) detection, long regarded as one of the harder open problems in voice interaction.
For the Non-Technical Reader
Think about the last time you used a walkie-talkie. You have to wait for the "over" to know it's your turn. Early voice agents were just as clumsy: they either interrupted you mid-sentence or left an awkward silence before responding. eot-bench provides a testing ground for fixing this. It helps developers build AI that "listens" for the difference between a thoughtful pause and the end of a thought, so that conversations feel fluid and natural rather than mechanical.
For the Technical Reader
The eot-bench repository introduces a reproducible framework for evaluating End-of-Turn (EoT) detection. Key technical features include:
- Multilingual Dataset: The first open dataset of its kind, featuring real human-to-agent turns across 14 languages, including Arabic, Chinese, Hindi, and Spanish, annotated with every silence pause of at least 100 ms.
- Dynamic Evaluation: Instead of scoring on isolated clips, it evaluates models at real pauses under specific latency and interruption budgets, mimicking a live voice agent's environment.
- Pareto Frontier: Models are re-ranked at a fixed false-cutoff budget (e.g., 5%), which shows how much "dead air" (endpointing delay) remains before an agent responds once interruptions are held to that rate.
- Licensing: The dataset and benchmark are available under Apache-2.0, promoting open collaboration across the field.
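The budgeted re-ranking described in the list above can be illustrated with a minimal Python sketch. This is not the eot-bench API: the `Pause` record, its field names, the `FALLBACK_MS` timeout, and the sample values are all hypothetical, and the benchmark's actual scoring may differ. The sketch assumes a model emits one end-of-turn score per annotated pause, then picks the threshold that minimizes mean endpointing delay while keeping the false-cutoff rate within a budget.

```python
from dataclasses import dataclass


@dataclass
class Pause:
    """One annotated silence pause (hypothetical schema, not eot-bench's)."""
    score: float          # model's end-of-turn probability at this pause
    is_end_of_turn: bool  # ground truth: did the speaker actually finish here?
    detect_ms: float      # delay after speech offset at which the score is available


# Assumed timeout applied when the model never fires at a true end of turn.
FALLBACK_MS = 2000.0


def evaluate(pauses, threshold):
    """Return (false_cutoff_rate, mean_endpointing_delay_ms) for one threshold."""
    mid_turn = [p for p in pauses if not p.is_end_of_turn]
    turn_ends = [p for p in pauses if p.is_end_of_turn]
    if not mid_turn or not turn_ends:
        raise ValueError("need at least one mid-turn pause and one end-of-turn pause")
    # A false cutoff is the model firing at a pause where the speaker was not done.
    false_cutoff_rate = sum(p.score >= threshold for p in mid_turn) / len(mid_turn)
    # At a true end of turn, the delay is the detection time if the model fires,
    # otherwise the fallback timeout.
    delays = [p.detect_ms if p.score >= threshold else FALLBACK_MS for p in turn_ends]
    return false_cutoff_rate, sum(delays) / len(delays)


def best_under_budget(pauses, budget=0.05):
    """Pick the threshold with the lowest mean delay whose false-cutoff rate fits the budget."""
    # Every distinct score is a candidate; infinity means "never fire" and always fits.
    candidates = sorted({p.score for p in pauses}) + [float("inf")]
    best = None
    for threshold in candidates:
        rate, delay = evaluate(pauses, threshold)
        if rate <= budget and (best is None or delay < best[2]):
            best = (threshold, rate, delay)
    return best


# Made-up sample data for illustration only.
sample = [
    Pause(0.9, True, 200.0),
    Pause(0.8, True, 300.0),
    Pause(0.4, True, 250.0),
    Pause(0.7, False, 200.0),
    Pause(0.2, False, 200.0),
    Pause(0.1, False, 200.0),
]

if __name__ == "__main__":
    threshold, rate, delay = best_under_budget(sample, budget=0.05)
    print(f"threshold={threshold} false_cutoff_rate={rate:.2%} mean_delay_ms={delay:.0f}")
```

Running the same selection for two models at the same budget and comparing the resulting mean delays is one way to read the re-ranking: the model with less remaining delay at equal interruption risk ranks higher.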
Why It Matters
Until now, EoT detection has been measured on fragmented, private datasets, which made performance claims hard to verify or compare across the industry. By providing common ground, this project shifts the field from proprietary black boxes toward transparent, verifiable measurement. It is a useful step for any developer aiming to reduce latency without sacrificing user experience, and it offers a shared way to define a "snappy" conversation.
The "Voice AI Space Lab" Idea
Imagine building a "Multilingual Dinner Party Moderator." With an end-of-turn model selected and tuned against this benchmark, you could build an agent that manages a fast-paced, multi-speaker debate across three languages. It would know when to interject to keep the conversation flowing and when to hold back during a dramatic pause, without the awkward lag or constant interruptions typical of traditional voice assistants.
Explore the project here:
- GitHub: https://github.com/livekit/eot-bench
- Dataset: Hugging Face Dataset
- Leaderboard: Interactive Leaderboard