    aiewf-eval

    Git Repo
    kwindla

    Framework for evaluating multi-turn LLM conversations, including text, realtime audio, and speech-to-speech models with benchmarks.

    About aiewf-eval

    This repository provides a framework for evaluating multi-turn LLM conversations across text, real-time audio, and speech-to-speech models. It includes benchmarks for both long-context and medium-context scenarios.

    For the Non-Technical Reader

    Imagine you're testing a new voice assistant. This tool is like a rigorous exam for that assistant, checking how well it can hold a conversation over multiple turns, understand instructions, use tools, and recall information. Instead of just a quick question and answer, it tests the assistant's ability to maintain context and provide accurate responses throughout a longer dialogue. This ensures the assistant isn't just smart for a single interaction, but can truly engage in meaningful, extended conversations, leading to better user experiences in applications like customer service bots or complex task automation.

    For the Technical Reader

    The repository features benchmarks such as aiwfmediumcontext and reports detailed performance metrics for each model. Key metrics include Pass Rate (the percentage of successful turns), Median Rate (consistency of performance), and TTFB (Time to First Byte), a critical measure of latency. For instance, gpt-5.1, gemini-3-flash-preview, and claude-sonnet-4-5 achieve 100% pass rates on the medium-context benchmark. Notably, nemotron-3-nano-30b-a3b exhibits a very low TTFB (171 ms), indicating potential for real-time applications, especially when running in-cluster on NVIDIA Blackwell hardware. The benchmark scores models on three categories, Tool Use, Instruction Following, and Knowledge Grounding, each assessed over 300 turns.
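
    As a rough illustration of how two of these metrics can be derived from per-turn results, here is a minimal Python sketch. The record layout (the `passed` and `ttfb_ms` fields) and the function names are assumptions made for this example; they are not the repository's actual schema or API, and the sample values are made up.

```python
from statistics import median

# Hypothetical per-turn records. The field names are illustrative only,
# not the format used by aiewf-eval, and the values are invented.
sample_turns = [
    {"passed": True, "ttfb_ms": 150.0},
    {"passed": True, "ttfb_ms": 171.0},
    {"passed": False, "ttfb_ms": 900.0},
    {"passed": True, "ttfb_ms": 200.0},
]


def pass_rate(turns):
    """Return the percentage of turns that were judged successful."""
    if not turns:
        raise ValueError("no turns to score")
    passed = sum(1 for t in turns if t["passed"])
    return 100.0 * passed / len(turns)


def median_ttfb_ms(turns):
    """Return the median time to first byte, in milliseconds."""
    if not turns:
        raise ValueError("no turns to score")
    return median(t["ttfb_ms"] for t in turns)


if __name__ == "__main__":
    print(f"pass rate: {pass_rate(sample_turns):.1f}%")
    print(f"median TTFB: {median_ttfb_ms(sample_turns):.1f} ms")
```

    The median is used for latency here because a single slow turn (900 ms in the sample) would skew a mean far more than it skews the median.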

    Why It Matters

    The focus on multi-turn conversation evaluation is crucial for advancing Voice AI. Lower-latency models such as gemini-2.5-flash and gpt-4.1, which balance intelligence with low TTFB, are favored for production voice agents. Open benchmarks let the community compare models objectively and drive innovation toward more responsive, context-aware conversational AI. More efficient and natural interactions can, in turn, lead to cost savings and improved user satisfaction.