DeepMind, Google’s AI research lab, says it’s developing AI technology to create soundtracks for videos.
In a post on its official blog, DeepMind says it sees the technology, V2A (short for “video-to-audio”), as an essential piece of the AI-generated media puzzle. While many organizations, including DeepMind, have developed AI models that generate video, those models can’t produce sound effects synced to the videos they generate.
“Video generation models are advancing at an incredible pace, but many current systems can only produce silent output,” writes DeepMind. “V2A technology [could] become a promising approach for bringing generated movies to life.”
DeepMind’s V2A technology takes the description of a soundtrack (e.g., “jellyfish pulsating underwater, sea life, ocean”) paired with a video and generates music, sound effects and even dialogue that match the characters and tone of the video, watermarked by DeepMind’s deepfake-fighting SynthID technology. The AI model powering V2A, a diffusion model, was trained on a combination of audio and dialogue transcripts as well as video clips, DeepMind says.
“By training on video, audio, and additional annotations, our technology learns to associate specific audio events with various visual scenes while responding to information provided in annotations or transcripts,” according to DeepMind.
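DeepMind hasn’t published V2A’s architecture or training code, but the description above, a diffusion model conditioned on video and text annotations that learns to reverse noise added to audio, maps onto a fairly standard recipe. The sketch below is a rough, hypothetical illustration of one such training step; every module name, feature dimension and the noise schedule is invented for the example, and a real system like V2A would operate on encoded audio and video at far larger scale.

```python
# Hypothetical sketch: one training step of an audio diffusion model conditioned
# on video and text-annotation features. Names, dimensions and the noise schedule
# are illustrative; DeepMind has not published V2A's actual design.
import torch
import torch.nn as nn

class ToyV2ADenoiser(nn.Module):
    """Predicts the noise added to an audio feature, given video and text conditioning."""
    def __init__(self, audio_dim=128, video_dim=256, text_dim=256, hidden=512):
        super().__init__()
        self.cond_proj = nn.Linear(video_dim + text_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(audio_dim + hidden + 1, hidden),  # +1 for the diffusion timestep
            nn.SiLU(),
            nn.Linear(hidden, audio_dim),
        )

    def forward(self, noisy_audio, t, video_feat, text_feat):
        cond = self.cond_proj(torch.cat([video_feat, text_feat], dim=-1))
        x = torch.cat([noisy_audio, cond, t.unsqueeze(-1)], dim=-1)
        return self.net(x)

model = ToyV2ADenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batch standing in for (audio features, video embeddings, annotation embeddings).
audio = torch.randn(8, 128)
video_feat = torch.randn(8, 256)
text_feat = torch.randn(8, 256)

# Standard denoising objective: corrupt the audio with noise, train the model to predict that noise.
t = torch.rand(8)                                   # random timesteps in [0, 1)
alpha = 1.0 - t                                     # toy linear noise schedule
noise = torch.randn_like(audio)
noisy_audio = alpha.sqrt().unsqueeze(-1) * audio + (1.0 - alpha).sqrt().unsqueeze(-1) * noise

pred_noise = model(noisy_audio, t, video_feat, text_feat)
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
opt.step()
```

At inference time, the same conditioning would let a model of this kind start from pure noise and iteratively denoise toward audio that tracks the on-screen action and any text prompt.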
Mum’s the word on whether any of the training data was copyrighted, and whether the data’s creators were informed of DeepMind’s work. We’ve reached out to DeepMind for clarification and will update this post if we hear back.
AI-powered sound-generating tools aren’t groundbreaking. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models that create video sound effects. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.
But DeepMind claims its V2A technology is unique in that it can understand the raw pixels from a video and automatically sync generated sounds with the video, optionally without a description.
V2A isn’t perfect, and DeepMind acknowledges this. Because the underlying model wasn’t trained on many videos with artifacts or distortions, it doesn’t generate particularly high-quality audio for them. And in general, the generated audio isn’t super convincing; my colleague Natasha Lomas described it as “a hodgepodge of stereotypical sounds,” and I can’t say I disagree.
For these reasons, and to prevent misuse, DeepMind says it won’t be releasing the technology to the public anytime soon, if ever.
“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and ideas from leading creators and filmmakers and using this valuable feedback to inform our ongoing research and development,” writes DeepMind. “Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”
DeepMind pitches its V2A technology as a particularly useful tool for archivists and people working with historical footage. But generative AI along these lines also threatens to upend the film and television industry. It’s going to take some seriously strong labor protections to ensure that generative media tools don’t eliminate jobs, or, as the case may be, entire professions.