eva

ServiceNow Research has released EVA (End-to-end framework for Evaluating Voice Agents), an open-source tool designed to solve one of the most difficult challenges in conversational AI: objectively measuring how well a voice bot actually performs in a real-world conversation.

For the Non-Technical Reader

Imagine hiring a customer service representative. You wouldn't just care if they successfully rebooked a flight; you'd also care if they were rude, if they kept interrupting, or if they spoke in a robotic, confusing way. EVA acts like an automated "mystery shopper" for AI. It uses one AI to act as a customer and another to act as a judge, testing voice bots on two main things: Accuracy (did it solve the problem?) and Experience (did it feel natural?). A key discovery from this tool is that bots which are great at following rules often struggle to sound human, and vice versa.

For the Technical Reader

EVA uses a bot-to-bot architecture to run fully automated, multi-turn spoken evaluations end to end: a simulated caller talks to the voice agent under test, and judge models score the resulting conversation. The framework is built to benchmark both cascade systems (STT-LLM-TTS) and audio-native (speech-to-speech) models. It introduces two distinct metric suites:

EVA-A (Accuracy): Evaluates task completion and faithfulness using 50 validated airline scenarios.
EVA-X (Experience): Measures conciseness, turn-taking, and naturalness.
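
To make the split between the two suites concrete, here is a minimal sketch of how per-scenario scores might be grouped and averaged. The class name, field names, the 0-to-1 scale, and the unweighted mean are all assumptions for illustration; they are not taken from EVA's codebase, which may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical container: scores for one simulated call, assumed in [0, 1]."""
    scenario_id: str
    task_completion: float  # EVA-A
    faithfulness: float     # EVA-A
    conciseness: float      # EVA-X
    turn_taking: float      # EVA-X
    naturalness: float      # EVA-X

    def accuracy(self) -> float:
        # Assumption: EVA-A as an unweighted mean of its two metrics.
        return mean([self.task_completion, self.faithfulness])

    def experience(self) -> float:
        # Assumption: EVA-X as an unweighted mean of its three metrics.
        return mean([self.conciseness, self.turn_taking, self.naturalness])


def summarize(results: list[ScenarioResult]) -> dict[str, float]:
    """Average both suite scores across all scenarios."""
    return {
        "eva_a": mean(r.accuracy() for r in results),
        "eva_x": mean(r.experience() for r in results),
    }
```

Keeping the two numbers separate, rather than collapsing them into one score, is what lets a tradeoff between accuracy and experience show up at all.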

The evaluation pipeline is powered by a multi-model judge ensemble, including GPT-4o for text metrics, Gemini 1.5 (Vertex AI) for audio judging, and Claude 3 (Bedrock) for faithfulness. It is optimized for Python 3.11+ and uses uv for dependency management.
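
One way to picture the judge ensemble is as a routing table from metric kind to judge model. The sketch below is illustrative only: the three judge assignments come from the description above, but the dictionary names, the `judge_for` helper, and the assignment of individual metrics to the "text" or "audio" kind are assumptions, not EVA's actual configuration.

```python
# Judge per metric kind, as described in the text above.
JUDGE_BY_KIND = {
    "text": "GPT-4o",
    "audio": "Gemini 1.5 (Vertex AI)",
    "faithfulness": "Claude 3 (Bedrock)",
}

# Hypothetical: which kind each metric belongs to (not confirmed by the source).
KIND_BY_METRIC = {
    "task_completion": "text",
    "conciseness": "text",
    "turn_taking": "audio",
    "naturalness": "audio",
    "faithfulness": "faithfulness",
}


def judge_for(metric: str) -> str:
    """Return the judge model name that would score the given metric."""
    try:
        return JUDGE_BY_KIND[KIND_BY_METRIC[metric]]
    except KeyError:
        raise ValueError(f"no judge configured for metric {metric!r}") from None
```

For example, `judge_for("faithfulness")` returns `"Claude 3 (Bedrock)"`, and an unknown metric name raises `ValueError` instead of silently going unscored.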

Why It Matters

The "Accuracy–Experience tradeoff" identified by EVA is a critical insight for the industry. As we move toward audio-native LLMs, an open-source, standardized framework gives developers a shared way to measure progress instead of relying on ad hoc testing. It shifts the focus from simple "success rates" to a holistic view of User Experience (UX), which is a major driver of AI adoption in enterprise environments.

The Voice AI Space Lab Idea

You could use EVA to build an "AI Persona Optimizer." By running your voice agent through EVA’s 50 scenarios and analyzing the EVA-X scores, you can programmatically tweak your system's latency and prompt instructions to find the "Goldilocks zone": the point where the bot is helpful enough to solve the problem but human enough not to frustrate the caller.
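
The optimizer idea can be sketched as a small grid search: try each combination of latency setting and prompt, keep only the candidates that clear an accuracy floor, and pick the one with the best experience score. Everything here is hypothetical, including `Candidate`, `pick_goldilocks`, and especially `toy_evaluate`, which is a made-up stand-in for actually running the agent through EVA and reading back its EVA-A and EVA-X scores.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    latency_ms: int  # hypothetical tunable: target response latency
    prompt: str      # hypothetical tunable: system prompt text


Scores = tuple[float, float]  # (eva_a, eva_x), both assumed in [0, 1]


def pick_goldilocks(
    candidates: Iterable[Candidate],
    evaluate: Callable[[Candidate], Scores],
    min_accuracy: float,
) -> Optional[Candidate]:
    """Return the candidate with the highest EVA-X among those meeting the EVA-A floor."""
    best, best_x = None, float("-inf")
    for cand in candidates:
        eva_a, eva_x = evaluate(cand)
        if eva_a >= min_accuracy and eva_x > best_x:
            best, best_x = cand, eva_x
    return best


def toy_evaluate(cand: Candidate) -> Scores:
    """Made-up scoring stub; replace with a real evaluation run."""
    careful = "verify" in cand.prompt
    eva_a = 0.9 if careful else 0.6
    eva_x = 1.0 - cand.latency_ms / 2000 - (0.2 if careful else 0.0)
    return eva_a, eva_x


grid = [
    Candidate(latency, prompt)
    for latency, prompt in product(
        [300, 800],
        ["Be brief.", "Be brief and verify details."],
    )
]
winner = pick_goldilocks(grid, toy_evaluate, min_accuracy=0.8)
```

Treating accuracy as a hard constraint and experience as the objective is one design choice among several; a weighted sum of the two scores would work as well if some accuracy loss is acceptable.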

Explore the project here:

GitHub: https://github.com/ServiceNow/eva
Blog & Research: https://huggingface.co/blog/ServiceNow-AI/eva
Interactive Demo: https://servicenow.github.io/eva/#demo
Dataset: https://huggingface.co/datasets/ServiceNow-AI/eva
