OpenAI’s video generation tool, Sora, wowed the AI community in February with fluid, realistic video that looks miles ahead of the competition. But the carefully curated debut left out a lot of details, details that have since been filled in by a filmmaker given early access to make a short with Sora.
Shy Kids is a Toronto-based digital production team that was selected by OpenAI as one of a few to produce short films, essentially for OpenAI promotional purposes, though they were given considerable creative freedom in creating “Air Head.” In an interview with visual effects news outlet fxguide, post-production artist Patrick Cederberg described “actually using Sora” as part of his work.
Perhaps the most important takeaway for most is simply this: while OpenAI’s post highlighting the shorts lets the reader assume they came more or less fully formed out of Sora, the reality is that these were professional productions, complete with a solid script, editing, color correction, and post work such as rotoscoping and VFX. Just as Apple says “Shot on iPhone” but doesn’t show the studio setup, professional lighting, and color work done afterward, the Sora post only talks about what the model lets people do, not how they actually did it.
Cederberg’s interview is interesting and fairly non-technical, so if you’re at all curious, head over to fxguide and read it. But here are some interesting nuggets about the use of Sora that tell us that, as impressive as it is, the model is perhaps less of a giant leap forward than we thought.
Control is still the most desired and also the most elusive at this point. … The closest we could get was just being hyper-descriptive in our prompts. Explaining the wardrobe for the characters, as well as the balloon type, was our way around consistency, because from shot to shot / generation to generation, there’s not yet a feature set up to fully control consistency.
In other words, things that are simple in traditional filmmaking, such as choosing the color of a character’s clothes, take complex workarounds and checks in a generative system, because each shot is created independently of the others. That could obviously change, but it is certainly much more laborious at the moment.
Sora’s output also had to be monitored for unwanted elements: Cederberg described how the model would routinely generate a face on the balloon that the main character has for a head, or a string dangling down the front. These had to be removed in post, another time-consuming process, if they couldn’t get the prompt to exclude them.
Precise timing and movements of characters or the camera aren’t really possible: “There’s a little bit of temporal control over where these different actions happen in the actual generation, but it’s not precise… it’s kind of a shot in the dark,” said Cederberg.
For example, timing a gesture like a wave is a very approximate, prompt-driven process, unlike manual animation. And a shot like a pan up the character’s body may or may not reflect what the director wants, so in this case the team rendered a shot composed in portrait orientation and did a crop pan in post. The generated clips were also often in slow motion for no particular reason.
In fact, the use of everyday filmmaking language like “panning right” or “tracking shot” was generally inconsistent, Cederberg said, which the team found quite surprising.
“Researchers, before they approached artists to play with the tool, weren’t really thinking like filmmakers,” he said.
As a result, the team made hundreds of generations, each 10 to 20 seconds long, and ended up using only a handful. Cederberg put the ratio at around 300:1, though of course we would probably all be surprised at the ratio on an ordinary shoot.
The team actually made a little behind-the-scenes video explaining some of the issues they ran into, if you’re curious. Like a lot of AI-adjacent content, the comments are quite critical of the whole effort, though not as harsh as those on the AI-assisted advertising we’ve seen criticized recently.
The last interesting aspect concerns copyright: if you ask Sora to give you a “Star Wars” clip, it will refuse. And if you try to get around that with “dressed up man with a lightsaber in a retro-futuristic spaceship,” it will refuse as well, as it somehow recognizes what you’re trying to do. It also refused to do an “Aronofsky-style shot” or a “Hitchcock zoom.”
On the one hand, it makes perfect sense. But it does prompt the question: if Sora knows what these are, does that mean the model was trained on that content, the better to recognize that it is infringing? OpenAI, which keeps its training data cards close to the vest (to the point of absurdity, as with CTO Mira Murati’s interview with Joanna Stern), will almost certainly never tell us.
As for Sora and its use in filmmaking, it’s clearly a powerful and useful tool in its place, but its place is not “making movies out of whole cloth.” Yet. As another villain once said, “that comes later.”