How does gpt-realtime differ from STT/TTS pipelines?

It uses a single model/API for audio in and out, cutting latency and preserving nuance versus chained components.

Does the Realtime API support MCP, image input, and SIP?

Yes—MCP for remote tools, image input during sessions, and SIP for public phone network and PBX connectivity.

How much does gpt-realtime cost to use?

$32 per 1M audio input tokens ($0.40 when cached) and $64 per 1M audio output tokens, with a 20% price reduction.

What safety and privacy measures are available?

Active classifiers and usage policies; EU Data Residency support, plus extra guardrails via the Agents SDK.

OpenAI gpt-realtime and Realtime API: key features and pricing

Q: What is gpt-realtime used for?

An advanced speech-to-speech model for production voice agents, with natural audio, better comprehension, and reliable tool calling.

Introduction

gpt-realtime is OpenAI's new speech-to-speech model, and the Realtime API is GA with MCP, image input, and SIP for production-ready voice agents.

OpenAI has made the Realtime API generally available alongside gpt-realtime, an advanced voice model for reliable, low-latency, natural-sounding agents. The API now supports remote MCP servers, image inputs in sessions, and phone calling via SIP. The model improves at following complex instructions, precise function calling, and expressive speech, with two new voices (Cedar and Marin). By processing audio end-to-end within a single model, it reduces latency and preserves vocal nuance compared to chained STT/TTS pipelines.

Context

The Realtime API processes and generates audio with a single model. This removes handoffs between separate modules, enhances conversational naturalness, and reduces context errors. Since the public beta, feedback from thousands of developers has driven optimizations for reliability, quality, and responsiveness across customer support, personal assistance, and education use cases.

gpt-realtime: what's new

Audio quality and voices

The model delivers more natural speech, controllable in tone, pace, and style (e.g., fast and professional or empathetic). New voices Cedar and Marin debut with the largest gains; existing eight voices are updated too.

Intelligence and comprehension

gpt-realtime captures non-verbal cues, code-switches mid-sentence, and adapts tone. In internal evals, it scores 82.8% on Big Bench Audio (reasoning), beating the December 2024 model (65.6%). It also improves at detecting alphanumerics across languages.

Instruction following

Higher adherence to fine-grained developer instructions. On the MultiChallenge benchmark, it reaches 30.5% vs 20.6% for the December 2024 model.

Function calling and async

Better at choosing the right tools, timing, and arguments. On ComplexFuncBench (audio), it scores 66.5% vs 49.7%. Async calls keep conversations flowing while results are pending.

Realtime API: new capabilities

The Realtime API adds extensibility (MCP), visual grounding (images), and telephony (SIP) for enterprise-ready integrations.

Remote MCP servers: expose tools over MCP and make them available in-session without manual wiring
Image input: add photos/screenshots to the dialog for contextual Q&A and text reading
SIP: connect to the public phone network, PBX, desk phones, and other SIP endpoints
Reusable prompts: save and reuse developer messages, tools, variables, and examples across sessions
Fine-grained context control: intelligent token limits and multi-turn truncation to cut long-session costs

Safety and privacy

The Realtime API layers safeguards and active classifiers that can halt harmful conversations. Developers can add guardrails with the Agents SDK. Preset voices help prevent impersonation. EU Data Residency and enterprise privacy commitments are supported.

Pricing and availability

Prices are 20% lower than gpt-4o-realtime-preview: $32 per 1M audio input tokens ($0.40 for cached input) and $64 per 1M audio output tokens. gpt-realtime and the GA Realtime API are available to all developers today.

Context controls and multi-turn truncation significantly reduce costs during long sessions.

Conclusion

gpt-realtime and the Realtime API simplify deploying production-grade voice agents: natural speech, better comprehension, more reliable tools, and ready-to-use integrations (MCP, images, SIP). To get started, explore the documentation, Playground, and Realtime prompting guide.

FAQ

What is gpt-realtime used for?

It is an advanced speech-to-speech model for production voice agents, with natural audio, better comprehension, and reliable tool calling.

How is gpt-realtime different from STT/TTS pipelines?

It processes and generates audio in one model/API, reducing latency and preserving speech nuance versus chained components.

Does the Realtime API support MCP, images, and SIP?

Yes: MCP for remote tools, in-session image input, and SIP for phone network and PBX connectivity.

What is the pricing for gpt-realtime?

$32 per 1M audio input tokens ($0.40 cached) and $64 per 1M audio output tokens, a 20% reduction from the prior model.

What measured gains does gpt-realtime show?

82.8% on Big Bench Audio, 30.5% on MultiChallenge, and 66.5% on ComplexFuncBench, beating the December 2024 model.

How are safety and privacy handled?

Active classifiers and usage policies apply; EU Data Residency is supported, and developers can add guardrails via the Agents SDK.

OpenAI launches gpt-realtime: Realtime API updates and pricing (-20%)

Introduction

Context

gpt-realtime: what's new

Audio quality and voices

Intelligence and comprehension

Instruction following

Function calling and async

Realtime API: new capabilities

Safety and privacy

Pricing and availability

Conclusion

FAQ

What is gpt-realtime used for?

How is gpt-realtime different from STT/TTS pipelines?

Does the Realtime API support MCP, images, and SIP?

What is the pricing for gpt-realtime?

What measured gains does gpt-realtime show?

How are safety and privacy handled?

Tag:

Related links:

Introduction

Context

gpt-realtime: what's new

Audio quality and voices

Intelligence and comprehension

Instruction following

Function calling and async

Realtime API: new capabilities

Safety and privacy

Pricing and availability

Conclusion

FAQ

What is gpt-realtime used for?

How is gpt-realtime different from STT/TTS pipelines?

Does the Realtime API support MCP, images, and SIP?

What is the pricing for gpt-realtime?

What measured gains does gpt-realtime show?

How are safety and privacy handled?

Tag:

Related links:

Related Articles

GPT-5.2-Codex: OpenAI Redefines Agentic Coding and Defensive Cybersecurity

The AI Land Grab: Why Google, OpenAI, and Perplexity Are giving It All Away in India

OpenAI Pivots to Platform: Third-Party Apps and MCP Support in ChatGPT