Open AI Real Time

OpenAI Realtime API: Fast, Multimodal Speech-to-Speech for Developers

OpenAI’s Realtime API is a public beta API that enables developers to build fast, low-latency speech-to-speech experiences-similar to ChatGPT’s Advanced Voice Mode-directly into their applications. Powered by the new GPT-4o model, it supports natural, expressive conversations with six preset voices and can handle both audio and text inputs and outputs.

Unlike previous approaches that required chaining separate ASR, LLM, and TTS models (often with lag and loss of expressiveness), the Realtime API streams audio in and out, allowing for natural, real-time conversations. The API can also handle interruptions smoothly, making interactions feel more human-like.

Audio input and output are also being added to the Chat Completions API (as gpt-4o-audio-preview), supporting use cases that don’t require the ultra-low latency of the Realtime API.

Key Features

Low-latency Speech-to-Speech:
Real-time streaming audio input and output for natural conversations.
Expressive, Multimodal Voices:
Six preset voices with improved range and emotion.
Bidirectional WebSocket API:
Persistent connection for two-way, fast audio exchange.
Function Calling:
Trigger actions or pull in external context during conversations.
Audio & Text Inputs/Outputs:
Flexible multimodal support for diverse use cases.
Interrupt Handling:
Users can interrupt the AI, just like in human conversation.
Scalable Sessions:
No hard limit on simultaneous sessions (see docs for rate limits).
Safety & Privacy:
Multiple layers of automated and human safety review; no training on your data without explicit permission.

Use Cases

Voice assistants and customer support agents
Language learning and educational role-play
Real-time AI coaching, accessibility tools, and translation
Interactive entertainment and outbound marketing calls

Model Selection

gpt-4o-realtime-preview:
For low-latency, real-time speech-to-speech.
gpt-4o-audio-preview:
For audio input/output in the Chat Completions API.

Pricing

Text Input: $5 per 1M tokens
Text Output: $20 per 1M tokens
Audio Input: $100 per 1M tokens (~$0.06/min)
Audio Output: $200 per 1M tokens (~$0.24/min)
Cached Pricing: $2.50 per 1M cached text tokens, $20 per 1M cached audio tokens

Getting Started

Official Overview: Introducing the Realtime API
API Documentation: Realtime API Docs
Playground: Try the API in the OpenAI Playground
Reference Client: Reference Client (see announcement page for link)
Voices: Preset Voices
Partner Integrations:
Function Calling: Function Calling Guide
Usage Policies: OpenAI Usage Policies
Enterprise Privacy: Enterprise Privacy
Pricing Details: OpenAI Pricing

OpenAI’s Realtime API empowers developers to create next-generation, natural voice experiences-removing latency barriers and simplifying the stack for conversational AI across education, customer service, accessibility, and more.