OpenAI has unveiled its new gpt-realtime model, marking the beginning of a new era in voice AI. Announced during a live broadcast on Thursday, the model is described as the company’s most advanced speech-to-speech system to date.
What sets gpt-realtime apart is its ability to deliver speech that closely resembles the human voice—capturing tone, emotion, and natural pacing. The result is a far more realistic and human-like interaction.
Key capabilities of the model include improved handling of complex instructions, precise execution of utility tasks such as ride-hailing, verbatim reading in call center scenarios, and seamless language switching during conversation.
Alongside the model, OpenAI also introduced two new voices—Cedar and Marin—which will be exclusively available through the Realtime API. The API initially launched in beta in October 2024; its latest version promises lower latency, greater reliability, and higher-quality performance.
Traditionally, voice AI has relied on chaining separate speech-to-text and text-to-speech systems. The Realtime API consolidates this process into a single model and a single API, dramatically reducing delays and preserving the natural nuances of speech.
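To make the single-API idea concrete, here is a minimal sketch of the client events a Realtime API session exchanges over its WebSocket connection. It is based on the shape of the October 2024 beta (a `session.update` event to configure the session, then `response.create` to request audio); the endpoint URL, the `gpt-realtime` model name in it, and the exact event fields are assumptions and may differ in the current version.

```python
import json

# Assumed endpoint shape from the beta Realtime API; the model query
# parameter and headers may differ in the latest release.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

def session_update(voice: str, instructions: str) -> str:
    """Build a session.update client event selecting a voice and behavior."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,                   # e.g. the new "marin" voice
            "instructions": instructions,
            "modalities": ["audio", "text"],  # one model covers both, no STT/TTS chain
        },
    })

def response_create() -> str:
    """Ask the model to start generating a spoken response."""
    return json.dumps({"type": "response.create"})

# A client would send these JSON strings over the WebSocket after connecting.
event = json.loads(session_update("marin", "Speak calmly and naturally."))
print(event["type"])              # session.update
print(event["session"]["voice"])  # marin
```

Because audio in and audio out flow through the same session, there is no hand-off between a transcription model and a synthesis model, which is where the latency savings and preserved vocal nuance come from.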
The new platform also supports remote MCP server connections, image input, and SIP (Session Initiation Protocol) integration for telephony—broadening its real-world applications.