DAVID DING

Last year, we saw an explosion of models that generate high-quality video or audio. In the coming year, I look forward to models that produce video clips complete with audio soundtracks including speech, music, and sound effects. I hope these models will usher in a new era of cinematic creativity.

The technologies required for such cinematic video generators are in place. Several companies provide highly competitive video models, and Udio and others offer music models. All that’s left is to model video and audio simultaneously, including dialog and voiceovers. (In fact, we’ve already seen something like this: Meta’s Movie Gen. Users describe a scene, and Movie Gen produces a video clip complete with a music score and sound effects.)

Of course, training such models will require extensive datasets. But I suspect that the videos used to train existing video generators came with soundtracks that included these elements, so data may not be a barrier to developing them.

Initially, these models won’t produce output that competes with the best work of professional video editors. But they will advance quickly. Before long, they’ll generate videos and soundtracks that approach Hollywood productions in raw quality, just as current image models can produce images that are indistinguishable from high-end photographs.

At the same time, the amount of control users have over the video and audio outputs will continue to increase. For instance, when we first released Udio, users couldn’t control the harmony it generated. A few months later, we launched an update that lets users specify the key, or tonal center, so they can take an existing song and remix it in a different key. We continue to research additional levers of control, such as voice, melody, and beats, and I’m sure video modeling teams are doing similar research on controllability.

Some people may find the prospect of models that generate fully produced cinematic videos unsettling. I understand this feeling. I enjoy photography and playing music, but I’ve found that image and audio generators are helpful starting points for my creative work. If I choose, AI can give me a base image that I can work on in Photoshop, or a musical composition to sample from or build on. Or consider AI coding assistants that generate the files for an entire website. You no longer need to rely on web developers, but if you talk to them, you’ll learn that they don’t always enjoy writing the boilerplate code for a website. Having a tool that builds a site’s scaffold lets them spend their time on development tasks they find more stimulating and fun.

In a similar way, you’ll be able to write a screenplay and quickly produce a rough draft of what the movie might look like. You might generate 1,000 takes, decide which one you like, and draw inspiration from that to guide a videographer and actors.

Art is all about the creative choices that go into it. Both you and I can use Midjourney to make a picture of a landscape, but if you’re an artist and you have a clear idea of the landscape you want to see, your Midjourney output will be more compelling than mine. Similarly, anyone can use Udio to make music with high production quality, but if you have good musical taste, your music will be better than mine. Video will remain an art form, because individuals will choose what their movie is about, how it looks, and how it feels. And they’ll be able to make those choices more fluidly, quickly, and interactively.

David Ding is a lifelong musician and co-founder of Udio, maker of a music-creation web app that empowers users to make original music. Previously, he was a Senior Research Engineer at Google DeepMind.
