OpenAI has released GPT-Realtime, its most advanced speech-to-speech AI model to date, now available through the fully launched Realtime API. The company says the new model is faster, more natural, and more affordable than its earlier voice AI systems.
The Realtime API, first introduced in beta in October 2024, powers ChatGPT’s advanced voice mode and allows developers to build responsive, conversational voice assistants. Until now, most AI voice systems required multiple steps—transcription, language processing, and text-to-speech synthesis—resulting in noticeable lag. GPT-Realtime eliminates that bottleneck by processing audio directly, reducing latency to near real-time.
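Under the hood this is a single bidirectional connection rather than a chain of services. The sketch below shows one speech-to-speech turn over the Realtime API's WebSocket interface; the event names (input_audio_buffer.append, response.create, response.audio.delta) follow OpenAI's beta documentation and may differ slightly in the general-availability schema, and audio capture and playback are stubbed out.

```python
# Sketch: one speech-to-speech turn over the Realtime API WebSocket.
# Event names follow the beta schema and may differ in the GA release.
import asyncio, base64, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def one_turn(pcm16_in: bytes) -> bytes:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # older websockets versions call this parameter extra_headers
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Stream the caller's audio into the input buffer, commit it,
        # and ask for a response; no separate transcription or TTS step.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_in).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        reply = bytearray()
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":   # streamed audio chunks
                reply.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":        # model finished its turn
                break
        return bytes(reply)

# audio_out = asyncio.run(one_turn(audio_in))  # PCM16 in, PCM16 out
```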
The upgraded API introduces several new features:
- More natural, expressive voices, with two new options: Cedar and Marin.
- Seamless multilingual speech, including the ability to switch languages mid-sentence.
- Support for nonverbal cues, like recognizing laughter, adjusting tone, and describing images.
- MCP (Model Context Protocol) integration, a standardized way to connect AI models to external data sources—like a universal “USB port” for business applications (see the configuration sketch after this list).
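If the Realtime API's MCP support mirrors the MCP tool schema OpenAI exposes in its other APIs, pointing a session at a remote server could look roughly like the session.update event below; the field names, server label, and URL are illustrative assumptions, not confirmed parameters.

```python
# Illustrative session.update attaching a remote MCP server to a Realtime
# session. Field names mirror OpenAI's MCP tool schema in other APIs and
# are assumptions; server_label and server_url are hypothetical.
import json

session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",  # assuming the new voices use lowercase ids
        "tools": [{
            "type": "mcp",
            "server_label": "inventory",              # hypothetical label
            "server_url": "https://example.com/mcp",  # hypothetical MCP server
            "require_approval": "never",
        }],
    },
}
# await ws.send(json.dumps(session_update))  # sent over the same WebSocket
```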
Pricing has also been lowered: audio input tokens now cost $32 per million (down from $40), while output tokens are $64 per million (down from $80).
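At those rates, per-call cost is a simple linear function of audio tokens in and out; the token counts below are made up purely to illustrate the arithmetic.

```python
# Back-of-envelope audio cost at the new Realtime API rates (USD).
INPUT_PER_M, OUTPUT_PER_M = 32.00, 64.00  # per million audio tokens

def audio_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Hypothetical session: 50k input and 20k output audio tokens.
print(f"${audio_cost(50_000, 20_000):.2f}")  # $2.88
```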
Early users are already reporting big gains. Josh Weisberg, head of AI at Zillow, said the upgraded API shows stronger reasoning and more natural speech, making it possible to guide customers through complex requests—such as filtering listings by lifestyle needs or walking them through affordability tools like Zillow’s BuyAbility score. According to Weisberg, the improvements make the experience “feel as natural as a conversation with a friend.”
With GPT-Realtime, OpenAI aims to bring human-like voice interactions to customer support, education, travel, e-commerce, and beyond—at both higher quality and lower cost.
Microsoft is expanding its AI footprint with the release of two new models that its teams trained completely in-house. MAI-Voice-1 is the tech major's first natural speech generation model, while MAI-1-preview is text-based and is the company's first foundation model trained end-to-end. MAI-Voice-1 is already powering the Copilot Daily and Podcast features. Microsoft has made MAI-1-preview available for public testing on LMArena and will begin previewing it in select Copilot use cases in the coming weeks.
In an interview with Semafor, Microsoft AI division leader Mustafa Suleyman said the pair of models was developed with a focus on efficiency and cost-effectiveness. MAI-Voice-1 runs on a single GPU, and MAI-1-preview was trained on about 15,000 Nvidia H100 GPUs. For context, other models, such as xAI's Grok, required more than 100,000 of those chips for training. "Increasingly, the art and craft of training models is selecting the perfect data and not wasting any of your flops on unnecessary tokens that didn’t actually teach your model very much," Suleyman said.
Although it is being used to test the in-house models, Microsoft Copilot is still primarily built on OpenAI's GPT technology. The decision to build its own models, despite the billions Microsoft has invested in the younger AI company, signals that it wants to compete independently in this space. Reaching parity with the companies that have emerged as forerunners in AI development could take time, but Suleyman told Semafor that Microsoft has "an enormous five-year roadmap that we’re investing in quarter after quarter." With growing concern that AI may be in a bubble, Microsoft's timeline will need to be aggressive for the independent path to pay off.