OpenAI debuts new model with enhanced real-time voice abilities


OpenAI just took the wraps off a big new update to ChatGPT.

Cofounder and CEO Sam Altman had teased "new stuff" coming to ChatGPT and GPT-4, the AI model that powers its chatbot, and told his followers to tune in Monday at 1 p.m. ET for its "Spring Update" to learn more.


Ahead of the event, Altman ruled out the possibility that it would reveal GPT-5 or a new OpenAI search engine, which is reportedly in the works. OpenAI is said to be planning to eventually take on internet search giant Google with its own AI-powered search product.

But the big news on Monday was OpenAI's new flagship AI model, GPT-4o, which will be free to all users and "can reason across audio, vision, and text in real-time." CTO Mira Murati delivered the updates; Altman did not appear on the livestream.



There were a ton of demos showing off the real-time smarts of GPT-4o.

OpenAI researchers showed how the new ChatGPT can quickly translate speech and help with basic linear algebra using its visual capabilities. Using the tech on school assignments has been a polarizing topic in education ever since ChatGPT first launched.

OpenAI posted another example to X of how one can interact with the new ChatGPT bot. It resembled a video call, and it got pretty meta.

In the video, ChatGPT takes in the room around it, discerns that it's looking at a recording setup, guesses the setup might have something to do with OpenAI since the user is wearing a hoodie with the company's logo, and is then told that the announcement has to do with the AI — it is the AI. It reacts with a voice that sounds more emotive.


OpenAI also announced a desktop version of ChatGPT and a new, improved user interface.

In addition to GPT-4o and ChatGPT, OpenAI's other products include its AI-powered image generator DALL-E, its unreleased text-to-video generator Sora, and its GPT app store.

OpenAI’s big ChatGPT event is over, and I can safely say the company severely downplayed it when it said on Twitter that it would “demo some ChatGPT and GPT-4 updates.” Sam Altman’s teaser that it would be new stuff “we think people will love,” along with his note that it “feels like magic to me,” better describes what OpenAI managed to pull off with the GPT-4o update for ChatGPT.

As rumored, GPT-4o is a faster multimodal update that can handle voice, images, and live video. It also lets you interrupt the model while it’s speaking, and it can detect the tone of the user’s voice.
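For developers, GPT-4o is also exposed through OpenAI’s existing API. The sketch below is not code from OpenAI’s event; it assumes the official openai Python SDK, an OPENAI_API_KEY set in the environment, and a placeholder image URL, and shows roughly what a combined text-and-image request to the model looks like:

```python
# A rough sketch, not OpenAI's demo code: one text-plus-image request to
# GPT-4o via the official openai Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts travel in a single user message.
                {"type": "text", "text": "What is written on this whiteboard?"},
                {
                    "type": "image_url",
                    # Placeholder URL; swap in a real, publicly reachable image.
                    "image_url": {"url": "https://example.com/whiteboard.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

At launch, the API handled text and images along these lines; the real-time voice and video interactions were shown in the ChatGPT app itself.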

The key detail in OpenAI’s tweet was accurate, however: this was going to be a live demo of ChatGPT’s new powers. And that’s really the big story here. GPT-4o appears to be able to do what Google had to fake with Gemini in early December, when it tried to show off similar features.

Google staged the early Gemini demos to make it seem that Gemini could listen to human voices in real-time while also analyzing the contents of pictures or live video. That was the mind-blowing tech that Google was proposing. However, in the days that followed, we learned that Gemini could not do any of that. The demos were sped up for the sake of presenting the results, and prompts were typed rather than spoken.

Yes, Gemini successfully delivered the expected results; there’s no question about that. But the demo Google ultimately showed us was fake. That was a problem in my book, considering that one of the main issues with generative AI products is the risk of getting incorrect answers or hallucinations.

Fast-forward to mid-May, and OpenAI has the technology ready to offer the kind of interaction with AI that Google faked. We just saw it demonstrated live on stage. ChatGPT, powered by the new GPT-4o model, was able to interact with multiple speakers at once and adapt to their voice prompts in real time.

GPT-4o was able to look at images and live video to offer answers to questions based on what it had just seen. It helped with math problems and coding. It then translated a conversation between two people speaking different languages in real-time.

Yes, these features were probably rehearsed and optimized over and over before the event. But OpenAI also took prompts from X for GPT-4o to try during the event.

Plus, I do expect issues with GPT-4o once it rolls out to users. Nothing is perfect. It might have problems handling voice, picture, and video requests. It might not be as fast as in the live demos from OpenAI’s event. But things will get better. The point is that OpenAI feels confident enough in the technology to demo it live.

I have no doubt that Gemini 1.5 (or later versions) will manage to match GPT-4o. And I think Google’s I/O event on Tuesday might even feature demos similar to OpenAI’s. To be fair, I don’t think GPT-4 was ready back in December to offer the features that OpenAI just demoed today.

Still, all of this highlights a big difference between the two companies. OpenAI went forward with a live demo once it had the technology ready. Google, meanwhile, faked a presentation to make Gemini seem more powerful than it was.


Earlier today, OpenAI announced its newest product: GPT-4o, a faster, cheaper, more powerful version of its most advanced large language model, and one that the company has deliberately positioned as the next step in “natural human-computer interaction.” Running on an iPhone in what was purportedly a live demo, the program appeared able to tell a bedtime story with dramatic intonation, understand what it was “seeing” through the device’s camera, and interpret a conversation between Italian and English speakers. The model—which was powering an updated version of the ChatGPT app—even exhibited something like emotion: Shown the sentence “i ♥️ chatgpt” handwritten on a page, it responded, “That’s so sweet of you!”

Although such features are not exactly new to generative AI, seeing them bundled into a single app on an iPhone was striking. Watching the presentation, I felt that I was witnessing the murder of Siri, along with that entire generation of smartphone voice assistants, at the hands of a company most people had not heard of just two years ago.

Apple markets its maligned iPhone voice assistant as a way to “do it all even when your hands are full.” But Siri functions, at its best, like a directory for the rest of your phone: It doesn’t respond to questions so much as offer to search the web for answers; it doesn’t translate so much as offer to open the Translate app. And much of the time, Siri can’t even pick up what you’re saying properly, let alone watch someone solve a math problem through the phone camera and provide real-time assistance, as ChatGPT did earlier today.

Just as chatbots have promised to condense the internet into a single program, generative AI now promises to condense all of a smartphone’s functions into a single app and to add a whole host of new ones: Text friends, draft emails, learn what the name of that beautiful flower is, call an Uber and talk to the driver in their native language, without touching a screen. Whether that future comes to pass is far from certain. Demos happen in controlled environments and are not immediately verifiable. OpenAI’s was certainly not without its stumbles, including choppy audio and small miscues. We don’t know yet to what extent familiar generative AI problems, such as the confident presentation of false information and difficulty in understanding accented speech, may emerge once the app is rolled out to the public over the coming weeks. But at the very least, to call Siri or Google Assistant “assistants” is, by comparison, insulting.

The major smartphone makers seem to recognize this. Apple, notoriously late to the AI rush, is reportedly deep in talks with OpenAI to incorporate ChatGPT features into an upcoming iPhone software update. The company has also reportedly held talks with Google to consider licensing Gemini, the search giant’s flagship AI product, to the iPhone. Samsung has already brought Gemini to its newest devices, and Google tailored its latest smartphone, the Pixel 8 Pro, specifically to run Gemini. Chinese smartphone makers, meanwhile, are racing their American counterparts to put generative AI on their devices.

Today’s demo was a likely death blow not only to Siri but also to a wave of AI start-ups promising a less phone-centric vision of the future. A company named Humane produces an AI pin that is worn on a user’s clothing and responds to spoken questions; it has been pummeled by reviewers for offering an inconsistent and glitchy experience. Rabbit’s R1 is a small handheld box that my colleague Caroline Mimbs Nyce likened to a broken toy.

These gadgets, and others that may be on the horizon, face inevitable hurdles: compressing a decent camera, a good microphone, and a powerful microprocessor into a tiny box, making sure that box is light and stylish, and persuading people to carry yet another device on their body. Apple and Android devices, by comparison, are efficient and beautiful pieces of hardware already ubiquitous in contemporary life. I can’t think of anybody who, forced to choose between their iPhone and a new AI pin, wouldn’t jettison the pin—especially when smartphones are already perfectly positioned to run generative AI programs.

Each year, Apple, Samsung, Google, and others roll out a handful of new phones offering better cameras and more powerful computer chips in thinner bodies. This cycle isn’t ending anytime soon—even if it’s gotten boring—but now the most exciting upgrades clearly aren’t happening in physical space. What really matters is software.

The iPhone was revolutionary not just because it combined a screen, a microphone, and a camera. Allowing people to take photos, listen to music, browse the web, text family members, play games—and now edit videos, write essays, make digital art, translate signs in foreign languages, and more—was the result of a software package that put its screen, microphone, and camera to the best use. And the American tech industry is in the midst of a centi-billion-dollar bet that generative AI will soon be the only software worth having.
