Google is taking a shot at OpenAI’s Sora with Veo, an AI model that can generate roughly minute-long 1080p video clips from a text prompt.
Veo, unveiled Tuesday at Google’s I/O 2024 developer conference, can capture different visual and cinematic styles, including landscape shots and timelapses, and make edits and adjustments to footage that’s already been created.
“We’re exploring features like storyboarding and creating bigger scenes to see what Veo can do,” Demis Hassabis, head of Google’s DeepMind AI R&D lab, told reporters during a virtual roundtable. “We’ve made incredible progress in video.”
Veo builds on Google’s early commercial video creation work, previewed in April, which used the company’s Imagen 2 family of image creation models to create video clips.
However, unlike the Imagen 2-based tool, which could only produce low-resolution videos a few seconds long, Veo looks to be competitive with today’s top video-generation models — not just Sora, but models from startups like Pika, Runway and Irreverent Labs.
In a briefing, Douglas Eck, who leads research efforts at DeepMind in generative media, showed me a few hand-picked examples of what Veo can do. One in particular — an aerial shot of a bustling beach — demonstrated Veo’s strengths over competing video models, he said.
“Detailing all the swimmers on the beach has proven difficult for both image- and video-generation models — having so many moving characters,” he said. “If you look closely, the surf looks pretty good. And the sense of the prompt word ‘bustling,’ I’d say, is captured with all the people — the lively beach packed with sunbathers.”
Veo was trained on lots of footage. That’s generally how generative AI models work: fed example after example of some form of data, the models pick up on patterns in that data that allow them to generate new data — videos, in Veo’s case.
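That pattern-learning loop is easier to see in miniature. The toy below is a word-level Markov chain in Python, nothing like Veo’s actual architecture (which Google hasn’t detailed publicly); it only illustrates the principle: count patterns in example data, then sample new data from those patterns.

```python
import random
from collections import defaultdict

# Toy illustration of "learn patterns, then generate."
corpus = "the surf rolls in and the surf rolls out and the beach fills with sunbathers"

# "Training": record which word tends to follow which.
transitions = defaultdict(list)
words = corpus.split()
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

# "Generation": sample new sequences from the learned patterns.
def generate(seed: str, length: int = 10) -> str:
    out = [seed]
    for _ in range(length):
        followers = transitions.get(out[-1])
        if not followers:  # no learned continuation; stop early
            break
        out.append(random.choice(followers))
    return " ".join(out)

print(generate("the"))
```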
Where did the footage to train Veo come from? Eck wouldn’t say for sure, but he admitted that some of it might have come from Google’s own YouTube.
“Google models may be trained on some YouTube content, but always in accordance with our agreement with YouTube creators,” he said.
The “agreement” part may technically be true. But it’s also true that, given YouTube’s network effects, creators have little choice but to play by Google’s rules if they hope to reach the widest possible audience.
As a New York Times report revealed in April, Google expanded its terms of service last year, in part to allow the company to tap more data to train its AI models. Under the old terms, it was unclear whether Google could use YouTube data to build products beyond the video platform. Not so under the new terms, which loosen the reins considerably.
Google is far from the only tech giant leveraging massive amounts of user data to train in-house models. (See: Meta.) But what will surely frustrate some creators is Eck’s insistence that Google sets the “gold standard” here in terms of ethics.
“The solution to this [training data] challenge will be found by bringing all the stakeholders together to figure out what the next steps are,” he said. “Until we take those steps with the stakeholders — we’re talking about the film industry, the music industry, the artists themselves — we’re not going to move quickly.”
However, Google has already made Veo available to select creators, including Donald Glover (AKA Childish Gambino) and his creative company Gilga. (Like OpenAI with Sora, Google positions Veo as a tool for creatives.)
Eck noted that Google provides tools that allow webmasters to prevent the company’s bots from scraping training data from their sites. But the settings don’t apply to YouTube. And Google, unlike some of its rivals, doesn’t offer a mechanism to allow creators to remove their work from training datasets after scraping.
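For context, the opt-out Eck alludes to works through robots.txt: Google documents a “Google-Extended” user-agent token that site owners can disallow to keep their pages out of AI training. As a minimal sketch, a site owner could check how their robots.txt treats that token using Python’s standard library (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# "Google-Extended" is the robots.txt token Google documents for
# controlling whether site content may be used for AI training.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetches and parses the live robots.txt

if robots.can_fetch("Google-Extended", "https://example.com/"):
    print("Google-Extended allowed: content may be used for AI training.")
else:
    print("Google-Extended disallowed: site has opted out of AI training.")
```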
I also asked Eck about regurgitation, which in the context of generative AI refers to when a model generates a mirror copy of a training example. Tools like Midjourney have been found to spit out exact stills from films like “Dune,” “The Avengers” and “Star Wars” given a timestamp — creating a potential legal minefield for users. OpenAI has reportedly gone so far as to block trademarks and creator names in prompts for Sora to try to deflect copyright challenges.
So what steps did Google take to mitigate the risk of regurgitation with Veo? Eck had no answer, short of saying that the research team applied filters for violent and obscene content (so no porn) and uses DeepMind’s SynthID technology to flag videos from Veo as AI-generated.
“We will aim — for something as big as the Veo model — to gradually release it to a small set of stakeholders we can work very closely with to understand the implications of the model, and only then roll it out to a larger group,” he said.
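SynthID’s actual technique isn’t public. As a loose illustration of the underlying idea, an imperceptible but machine-detectable signature embedded in pixel data, here is a toy least-significant-bit scheme; it stands in for whatever SynthID really does and shouldn’t be read as Google’s method:

```python
# Toy watermark sketch, NOT SynthID's real algorithm (which is unpublished).
SIGNATURE = [1, 0, 1, 1, 0, 0, 1, 0]  # arbitrary 8-bit tag

def embed(pixels: list[int]) -> list[int]:
    """Hide SIGNATURE in the lowest bit of the first 8 pixel values."""
    marked = pixels.copy()
    for i, bit in enumerate(SIGNATURE):
        marked[i] = (marked[i] & ~1) | bit  # overwrite the least-significant bit
    return marked

def detect(pixels: list[int]) -> bool:
    """Check whether the signature is present in the pixel data."""
    return [p & 1 for p in pixels[: len(SIGNATURE)]] == SIGNATURE

frame = [200, 201, 199, 180, 175, 240, 12, 64, 90, 33]  # fake pixel values
print(detect(embed(frame)))  # True: watermarked copy is flagged
print(detect(frame))         # False for this particular unmarked frame
```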
Eck had more to share about the technical details of the model.
Eck described Veo as “fairly controllable” in the sense that the model understands camera movements and VFX fairly well from prompts (think descriptors like “pan”, “zoom” and “explosion”). And, like Sora, Veo has some understanding of physics—things like fluid dynamics and gravity—that contribute to the realism of the videos it creates.
Veo also supports masked editing for changes to specific areas of a video, and it can generate videos from a still image, à la generation models like Stability AI’s Stable Video. Perhaps most intriguingly, given a sequence of prompts that together tell a story, Veo can generate longer videos — videos beyond a minute in length.
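Google hasn’t published a Veo API, so the sketch below is hypothetical: VideoModel, generate and generate_story are invented stand-ins. It only illustrates the interaction pattern described above, with camera descriptors steering a single shot and a chain of prompts extending one scene past the minute mark:

```python
# Hypothetical sketch: there is no public Veo API. Every name here is an
# invented stand-in for the interaction pattern the article describes.

class VideoModel:
    """Stand-in for a prompt-driven video generator (not a real client)."""

    def generate(self, prompt: str) -> str:
        # A real model would return video; we return a label for the clip.
        return f"<clip: {prompt!r}>"

    def generate_story(self, prompts: list[str]) -> list[str]:
        # Chaining prompts is how the article says Veo exceeds a minute:
        # each prompt extends the same continuous scene.
        return [self.generate(p) for p in prompts]

veo = VideoModel()

# Camera and VFX descriptors ("pan", "zoom", "explosion") steer the shot.
clip = veo.generate("slow pan across a bustling beach, then zoom to the surf")

# A sequence of prompts that tell one story yields a longer video.
story = veo.generate_story([
    "aerial view of a busy beach at noon",
    "zoom toward a lifeguard tower as clouds roll in",
    "rain begins; sunbathers pack up and run for cover",
])
print(clip, *story, sep="\n")
```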
That’s not to say Veo is perfect. Reflecting the limitations of today’s generative AI, objects in Veo’s videos disappear and reappear without much explanation or consistency. And Veo often gets its physics wrong — for example, cars will inexplicably, improbably reverse on a dime.
That’s why Veo will remain behind a waitlist on Google Labs, the company’s portal for experimental technology, for the foreseeable future, inside a new AI video creation and editing frontend called VideoFX. As the model improves, Google aims to bring some of its capabilities to YouTube Shorts and other products.
“This is very much a work in progress, very experimental … there’s a lot more left to do than what’s been done here,” Eck said. “But I think that’s kind of the raw material for doing something really great in the filmmaking space.”