News

OpenAI launches gpt-realtime: Realtime API updates and pricing (-20%)

Article Highlights:
  • gpt-realtime is GA with the Realtime API for voice agents
  • Remote MCP server support enables plug-and-play tools
  • In-session image input for visual grounding
  • Built-in SIP telephony for phone and PBX connectivity
  • New Cedar and Marin voices with more natural speech
  • 82.8% on Big Bench Audio, 30.5% on MultiChallenge
  • 66.5% on ComplexFuncBench and more precise tool calls
  • Async calls keep conversations flowing while awaiting results
  • 20% price cut vs gpt-4o-realtime-preview
  • $32/1M input tokens, $64/1M output, $0.40 cached input
  • EU Data Residency and enterprise privacy commitments
  • Reusable prompts and fine-grained context control
OpenAI launches gpt-realtime: Realtime API updates and pricing (-20%)

Introduction

gpt-realtime is OpenAI's new speech-to-speech model, and the Realtime API is GA with MCP, image input, and SIP for production-ready voice agents.

OpenAI has made the Realtime API generally available alongside gpt-realtime, an advanced voice model for reliable, low-latency, natural-sounding agents. The API now supports remote MCP servers, image inputs in sessions, and phone calling via SIP. The model improves at following complex instructions, precise function calling, and expressive speech, with two new voices (Cedar and Marin). By processing audio end-to-end within a single model, it reduces latency and preserves vocal nuance compared to chained STT/TTS pipelines.

Context

The Realtime API processes and generates audio with a single model. This removes handoffs between separate modules, enhances conversational naturalness, and reduces context errors. Since the public beta, feedback from thousands of developers has driven optimizations for reliability, quality, and responsiveness across customer support, personal assistance, and education use cases.

gpt-realtime: what's new

Audio quality and voices

The model delivers more natural speech, controllable in tone, pace, and style (e.g., fast and professional or empathetic). New voices Cedar and Marin debut with the largest gains; existing eight voices are updated too.

Intelligence and comprehension

gpt-realtime captures non-verbal cues, code-switches mid-sentence, and adapts tone. In internal evals, it scores 82.8% on Big Bench Audio (reasoning), beating the December 2024 model (65.6%). It also improves at detecting alphanumerics across languages.

Instruction following

Higher adherence to fine-grained developer instructions. On the MultiChallenge benchmark, it reaches 30.5% vs 20.6% for the December 2024 model.

Function calling and async

Better at choosing the right tools, timing, and arguments. On ComplexFuncBench (audio), it scores 66.5% vs 49.7%. Async calls keep conversations flowing while results are pending.

Realtime API: new capabilities

The Realtime API adds extensibility (MCP), visual grounding (images), and telephony (SIP) for enterprise-ready integrations.

  • Remote MCP servers: expose tools over MCP and make them available in-session without manual wiring
  • Image input: add photos/screenshots to the dialog for contextual Q&A and text reading
  • SIP: connect to the public phone network, PBX, desk phones, and other SIP endpoints
  • Reusable prompts: save and reuse developer messages, tools, variables, and examples across sessions
  • Fine-grained context control: intelligent token limits and multi-turn truncation to cut long-session costs

Safety and privacy

The Realtime API layers safeguards and active classifiers that can halt harmful conversations. Developers can add guardrails with the Agents SDK. Preset voices help prevent impersonation. EU Data Residency and enterprise privacy commitments are supported.

Pricing and availability

Prices are 20% lower than gpt-4o-realtime-preview: $32 per 1M audio input tokens ($0.40 for cached input) and $64 per 1M audio output tokens. gpt-realtime and the GA Realtime API are available to all developers today.

Context controls and multi-turn truncation significantly reduce costs during long sessions.

Conclusion

gpt-realtime and the Realtime API simplify deploying production-grade voice agents: natural speech, better comprehension, more reliable tools, and ready-to-use integrations (MCP, images, SIP). To get started, explore the documentation, Playground, and Realtime prompting guide.

FAQ

What is gpt-realtime used for?

It is an advanced speech-to-speech model for production voice agents, with natural audio, better comprehension, and reliable tool calling.

How is gpt-realtime different from STT/TTS pipelines?

It processes and generates audio in one model/API, reducing latency and preserving speech nuance versus chained components.

Does the Realtime API support MCP, images, and SIP?

Yes: MCP for remote tools, in-session image input, and SIP for phone network and PBX connectivity.

What is the pricing for gpt-realtime?

$32 per 1M audio input tokens ($0.40 cached) and $64 per 1M audio output tokens, a 20% reduction from the prior model.

What measured gains does gpt-realtime show?

82.8% on Big Bench Audio, 30.5% on MultiChallenge, and 66.5% on ComplexFuncBench, beating the December 2024 model.

How are safety and privacy handled?

Active classifiers and usage policies apply; EU Data Residency is supported, and developers can add guardrails via the Agents SDK.

Introduction gpt-realtime is OpenAI's new speech-to-speech model, and the Realtime API is GA with MCP, image input, and SIP for production-ready voice agents [...] Evol Magazine
Tag:
OpenAI