Real-time speech AI is moving past plain transcription. For voice agents, meetings, contact centers, captions, and live support, the useful output isn't a block of text anymore. The system also needs to know who's talking, what language they're in, when they've actually finished a thought, and whether the account number someone just rattled off came through clean, all while the conversation is still happening.
That's the idea behind Soniox v5 Real-Time, now available through the Soniox API. It replaces stt-rt-v4 and is built for the messy end of live audio: phone calls, noisy rooms, far-field mics, accented speech, multilingual conversations, interruptions, and people talking over each other. And accents aren't an edge case here. Most English speakers in the world aren't native speakers, so a model that only handles clean American English isn't really a global one.
The hard part
The hard part of live speech isn't producing text. It's producing useful output fast enough for another system to act on it, in the middle of the conversation.
A batch model has it easy by comparison: it can wait for the whole recording, revise earlier text, and use the full conversation as context. A real-time model gets none of that. It has to decide what's happening while people are still speaking, which is why these systems increasingly look less like transcription APIs and more like conversation infrastructure: detecting turns, separating speakers, handling language switches, preserving names and codes and numbers, and doing all of it with low enough latency to feel live.
Does it hold up?
Fair question, and we're not going to answer it from a launch post alone. Soniox's previous real-time model gives some signal: on its public benchmark page, stt-rt-v4 is listed at 1.25% semantic WER (word error rate) and a 249 ms median time to final segment, measured on 1,000 real-world samples from Pipecat's smart-turn dataset.
That doesn't prove v5's numbers, and we won't pretend it does. What it tells you is that v5 is replacing a model already built around low-latency, high-accuracy recognition, and Soniox says the new generation improves the parts that matter most in production: speaker separation, language ID, endpointing, translation, context handling, and structured output.
What's new in v5
The cleanest way to read the release is as one streaming layer instead of a stack of separate steps. Transcription, diarization, language detection, translation, endpointing, and formatting all live inside the same API. Here's what Soniox says changed:
Transcription: higher accuracy across 60+ languages, so it performs better outside clean English audio.
Speaker separation: live speaker labeling, rebuilt from scratch, so you know who said what in meetings, calls, healthcare, and support.
Language ID: better detection for multilingual and accented speech, which is core for global products and language switching.
Translation: inline real-time translation, one-way or two-way, so two people speaking different languages can follow each other with no separate step.
Endpointing: faster semantic endpointing with tunable sensitivity, for voice agents that don't interrupt or stall.
Context: names, terms, and translation preferences passed at connection time, for better handling of domain-specific language.
Alphanumerics: improved handling of numbers, emails, IDs, dates, addresses, SKUs, and codes, because one wrong digit can break a whole workflow.
Where it shows up
What that actually buys you depends on what you're building:
Voice agents: a bot that waits for you to finish instead of cutting you off, and replies fast enough that it doesn't feel like dead air. Endpointing and latency are the whole game here.
Contact centers: a bilingual help line that holds up when a caller switches from English to Spanish halfway through, and that gets the confirmation code right the first time.
Meetings and healthcare: transcripts that know who said what, so the action item lands on the right person and the doctor's words don't get blended with the patient's.
Logistics and field work: capturing tracking numbers, addresses, and SKUs by voice without one bad character breaking the automation downstream.
Global products and accessibility: live captions and translated talks for conferences, classrooms, and audiences who don't all share a language.
What you'll still need to test
The honest caveat: a launch post is not your audio. The real test is how v5 does on noisy calls, regional accents, mixed-language conversations, the rare names and vocabulary specific to your product, and latency under real load. That's where real-time speech systems are actually won or lost. Soniox publishes benchmark numbers for the old model only, so v5 is the one to put in front of your own traffic and judge.
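If you want a starting point for that kind of in-house check, here is a minimal sketch in Python. It scores a handful of your own samples with plain word-level WER (edit distance divided by reference word count) and a median latency. It does not attempt to reproduce the "semantic WER" on Soniox's benchmark page, and it makes no calls to the Soniox API: the `summarize` helper, the tuple layout, and the sample data are all made up for illustration. You would fill in the hypotheses and latencies from your own test runs.

```python
from statistics import median


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions, insertions, deletions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def summarize(samples):
    """Score (reference_text, hypothesis_text, latency_ms) tuples.

    Returns corpus-level plain WER and the median latency in milliseconds.
    """
    samples = list(samples)
    errors = 0
    ref_len = 0
    for reference, hypothesis, _latency in samples:
        ref_words = reference.lower().split()
        errors += edit_distance(ref_words, hypothesis.lower().split())
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("need at least one non-empty reference transcript")
    return {
        "wer": errors / ref_len,
        "median_latency_ms": median(latency for _, _, latency in samples),
    }


# Made-up sample data, standing in for your own recordings and measurements.
samples = [
    ("my code is a b seven", "my code is a d seven", 240),
    ("call me back tomorrow", "call me back tomorrow", 260),
    ("thanks", "thanks a lot", 300),
]
result = summarize(samples)
print(result)
```

Plain WER treats every word equally, so a wrong digit in a confirmation code counts the same as a dropped filler word. For alphanumeric-heavy workflows you'd likely want a separate exact-match check on those fields as well.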
Migration
For existing users the switch is trivial: stt-rt-v5 replaces stt-rt-v4, so you change the model name in your request. Soniox is retiring stt-rt-v4 on June 30, 2026, and after that, v4 requests route automatically to v5 with no service interruption and no API changes. Even if you do nothing, nothing breaks.
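As a rough illustration of how small that change is, here is a Python sketch that swaps the model name in a request config. The two model names come from the announcement. Everything else is an assumption: the `model` key, the other fields in the config dict, and the `migrate_config` helper are hypothetical, so check the Soniox API documentation for the real request shape.

```python
RETIRED_MODEL = "stt-rt-v4"
REPLACEMENT_MODEL = "stt-rt-v5"


def migrate_config(config: dict) -> dict:
    """Return a copy of config with the retired model name swapped out.

    Assumes the model is selected with a top-level "model" key; that key
    name is an illustration, not confirmed against the Soniox API.
    """
    updated = dict(config)  # shallow copy, so the caller's dict is untouched
    if updated.get("model") == RETIRED_MODEL:
        updated["model"] = REPLACEMENT_MODEL
    return updated


# Hypothetical config; only the model names are taken from the announcement.
old_config = {"model": "stt-rt-v4", "sample_rate": 16000}
new_config = migrate_config(old_config)
print(new_config["model"])
```

Configs that already name another model pass through unchanged, so the helper is safe to run across every place your code builds a request.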
Why we're covering it
This is part of a bigger shift. The next layer of voice AI isn't really about transcription accuracy on its own. It's about turning a live conversation into structured, speaker-aware, language-aware output an application can use the instant it's spoken, which is what voice agents, meeting tools, contact centers, translation products, and accessibility systems increasingly run on. Soniox v5 Real-Time is one recent example of that move.
Soniox's announcement is at https://soniox.com/blog/soniox-v5-real-time, and the public benchmark is at https://soniox.com/benchmarks.
Product launch · Published in partnership with Soniox



