Sunday, May 19, 2024

3 overlapping themes from OpenAI and Google that prove they’re at war


At Google I/O earlier this week, generative AI was unsurprisingly a major focal point.

In fact, Google CEO Sundar Pichai pointed out that “AI” was said 122 times, plus two more times by Pichai as he closed out the event.

The tech giant has injected AI features into seemingly all of its products and services, including Search, Workspace, and creative tools for videos, photos, and music. But arguably the biggest news of the day was how Google’s announcements compared to those from OpenAI. Just a day before Google I/O, OpenAI unveiled GPT-4o, a “natively multimodal” model that can process visuals and audio in real-time, which ostensibly ramped up the burgeoning rivalry.

Google I/O’s vibe was very different from that of OpenAI’s event. Google seemed unfocused, throwing endless AI spaghetti at the wall during an event that lasted almost two hours, compared to OpenAI’s focused and breezy 26-minute show.

But the AI capabilities that the two companies shared were noticeably similar, even using the same rhetoric (the AI is “interruptible”) and examples (AI can help with homework). Below, we’ve rounded up the three big, eerie similarities in the two companies’ messaging.

1. Simulating more than one human-style sensory input at once

Both Google and OpenAI talked about their AI models being “natively multimodal.” In this context, the jargon means a single model has visual, audio, and text understanding all rolled into one; in the AI world, these forms of input and output are called “modalities.”

Google proudly claimed that Gemini has been “natively multimodal” from the beginning. OpenAI’s GPT-4o was its first model that combined voice and image processing with its existing text capabilities. So now, Google and OpenAI are on equal multimodal footing. Both companies showcased what they can do with technologies that can “see” and “hear.”

More to the point, both companies demoed features that explicitly showed off their models’ abilities to “see” and “hear” in real time.

Google VP Sissie Hsiao presented a Live feature for the standalone Gemini app that echoes what DeepMind is working on with Project Astra, which may be the technology powering the feature when it reaches Gemini Advanced subscribers in the coming months. Gemini Live “can better understand you and answer naturally, you can even interrupt while Gemini is responding and it will adapt to your speech pattern,” said Hsiao.

If an AI bot that you can interrupt sounds familiar, that’s because OpenAI said it first. “You can now interrupt the model,” said researcher Mark Chen during OpenAI’s live demo the day before Google I/O. “You don’t have to wait for it to finish [its] turn before you start speaking and you can just butt in whenever you want.”

Later in OpenAI’s live demo, researcher Barrett Zoph used GPT-4o to help him solve a linear equation. Zoph pointed a smartphone camera at a piece of paper with a handwritten equation, and ChatGPT walked him through how to solve for “x.”


Sameer Samat, president of Google’s Android ecosystem, demoed a similar ability to help with physics homework using Google’s existing Circle to Search tool. By circling a physics word problem displayed on a Pixel, Samat showed how Gemini can process the visual and provide step-by-step instructions on how to solve it.

Both companies shared other ways multimodality can help users. Zoph showed off ChatGPT’s new vision capabilities in the desktop app by generating a graph from code, then letting GPT-4o analyze it to demonstrate contextual awareness. ChatGPT accurately identified that the graph showed temperature data over time and provided some analysis of what it meant.

The next day at Google I/O, Labs VP Josh Woodward demonstrated how Notebook LM, Google’s digital scratchpad, could take in information from an open-source physics textbook and turn it into a podcast-style conversation between two bots about Newton’s Laws of Motion. Then, Woodward showed how he could jump into the conversation as if he were calling in to the podcast, and ask it to customize examples for his son.

2. AI that’s your friend thanks to context awareness

The message from both Google and OpenAI was about how multimodal AI can improve people’s lives. “We want everyone to benefit from what Gemini can do,” said Pichai, speaking about Google’s flagship AI model, Gemini 1.5 Pro. This set the stage for announcements throughout the event about Gemini seamlessly fitting into your life by understanding context.

Nowhere was this more clear than in the Project Astra demo video from Google DeepMind. The technology, described as an “advanced seeing and talking responsive agent,” is shown accurately responding to naturally phrased questions referring to visuals that aren’t explicitly mentioned.

With the tester pointing a smartphone camera at various things, Astra describes the code on a desktop screen, identifies the concept of Schrödinger’s Cat from a simple whiteboard drawing of a live cat’s face next to a dead cat’s face and a cardboard box held up by the tester, and comes up with a band name for a tiger stuffed animal and a (real) Golden Retriever. The band name is “Golden Stripes,” by the way.

On Android, Google VP of engineering Dave Burke showed off what context awareness looks like in users’ hands. Burke demonstrated how you can ask specific questions about the contents of a YouTube video, like, say, the rules of pickleball.

OpenAI also demoed contextual understanding. In demos posted to OpenAI’s site, the audio version of GPT-4o “watched” its human conversation partners, flirtatiously noting a demoer’s OpenAI sweatshirt in one instance, and cracking dad jokes, understanding sarcasm, and refereeing an on-camera rock-paper-scissors game in others. In another demo, some code was casually shared with ChatGPT, and the app showed off GPT-4o’s audio capabilities by analyzing the code aloud, apparently without being fed any explicit description of what it was meant to do.

Google DeepMind’s Project Astra is still very much in development, but its contextual understanding on Android will roll out to users in the coming months. OpenAI’s GPT-4o voice mode isn’t available yet, with no details on when it ships, according to CEO Sam Altman.

3. AI helpers that know your schedule and work needs

The overarching message of Google I/O and OpenAI’s event was that AI can take care of tasks in your life that range from visionary to mundane, which normally involve, you know, googling something, or using your own human brain. Google took this a step further with explicit callouts of AI agents, assistants, and teammates (there were a lot of different terms for AI helpers sprinkled throughout, which frankly we’re still a little confused about).

Examples of what Google agents could do included using Gemini to return a pair of shoes by taking a picture of them with your phone, then prompting the agent to search your Gmail inbox for the receipt, locate the order number, fill out a return form, and schedule a pickup. As Pichai noted, Google isn’t quite there yet. More concretely, a Gemini side panel in the Gmail mobile app can summarize relevant emails or draft replies based on context clues mined from your inbox.

This is where Google has the upper hand because AI becomes a lot more useful when it works across different apps like Gmail, Google Calendar, and Search. OpenAI was the one that started this conversation by talking about its goal of achieving AGI (artificial general intelligence) and making references to sci-fi AI assistants like Scarlett Johansson’s character in the film Her. During OpenAI’s event, CEO Sam Altman tweeted “her” in an apparent reference to the film. But despite OpenAI’s explicit or implicit yearnings for this type of use case, there wasn’t much talk about AI agents.

Besides, OpenAI would have an uphill battle to fight if it wanted users to start uploading their work documents and calendars into their ChatGPT accounts. But you know what does have email and calendar apps? Apple. And OpenAI has reportedly finalized a partnership with the iPhone maker to bring ChatGPT to iOS 18. And Apple’s developer conference WWDC is less than a month away.

The tech beef rages on with more battles soon to come.
