Roll Over, Beyoncé: How OpenAI's Jukebox generates synthetic music

Published May 06, 2020

A new generative model croons like Elvis and raps like Eminem. It might even make you think you’re listening to a lost demo by the Beatles.

What’s new: OpenAI released Jukebox, a deep learning system that has generated thousands of songs in styles from country to metal and soul. It even mimics the voices of greats like Frank Sinatra.

How it works: Jukebox learned to generate music from a training set of 1.2 million songs. Where some other AI-powered systems generate symbolic representations such as notes and chords, Jukebox models raw audio recordings, which capture more of music’s subtleties.

In working with raw audio, the biggest bottleneck is its sheer size and complexity, the authors write. They used vector quantized variational autoencoders, or VQ-VAEs, to compress the training set to a lower-dimensional space. Then they trained transformers to generate audio in this compressed space: One produces a coarse version of a new song, and others create successively higher-resolution versions of it. Finally, a decoder turns that output into audio.
The researchers paired each song with metadata including its artist, lyrics, and genre. That helps guide the model as it generates made-to-order music in any designated style.
The model made cohesive music, but it struggled to produce coherent lyrics. To overcome this, the researchers added existing lyrics to the conditioning information. It also had a hard time associating chunks of words with musically appropriate passages, so the researchers used open source tools to align words with the windows of audio in which they appear.
The model requires upward of nine hours of processing to render one minute of audio.
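
To make the compression step concrete, here is a minimal sketch of the nearest-codebook quantization at the heart of a VQ-VAE. It is not Jukebox's implementation: a real VQ-VAE learns its encoder, decoder, and codebook, whereas this toy uses a fixed random codebook and plain framing. The function names, hop sizes, and codebook size are all assumptions chosen for illustration.

```python
import math
import random


def make_codebook(num_codes, dim, seed=0):
    """Return a fixed random codebook (a trained VQ-VAE would learn these vectors)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def frame_signal(samples, hop):
    """Split a waveform into non-overlapping frames of `hop` samples, dropping any remainder."""
    return [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]


def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
            for frame in frames]


def dequantize(codes, codebook):
    """Look up each code and concatenate the vectors back into a waveform."""
    return [value for k in codes for value in codebook[k]]


# Toy input: 64 samples of a sine wave standing in for raw audio.
waveform = [math.sin(2 * math.pi * t / 16) for t in range(64)]

for hop in (4, 16):  # illustrative hop sizes, not Jukebox's settings
    codebook = make_codebook(num_codes=32, dim=hop)
    codes = quantize(frame_signal(waveform, hop), codebook)
    rebuilt = dequantize(codes, codebook)
    print(f"hop={hop}: {len(waveform)} samples -> {len(codes)} codes -> {len(rebuilt)} samples")
```

A larger hop turns the same 64 samples into a shorter sequence of codes (4 rather than 16 here). Shorter sequences are easier for a generative model to handle, which is roughly why coarse and fine levels of compression are useful together.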

Results: OpenAI released over 7,000 songs composed by Jukebox. Many have poor audio quality and garbled lyrics, but there are more than a few gems. Have a listen: Our favorites include the Sinatra-esque “Hot Tub Christmas,” with lyrics co-written by OpenAI engineers and a natural language model, and a country-fied ode to deep learning.

Behind the news: AI engineers have been synthesizing music for some time, but lately the results have been sounding a lot more like human compositions and performances.

In 2016, Sony’s Flow Machines, trained on 13,000 pieces of sheet music, composed a pop song reminiscent of Revolver-era Beatles.
The production company AIVA sells AI-generated background music for video games, patriotic infomercials, and tech company keynotes.
In April 2019, OpenAI released MuseNet, a music generator that predicts a sequence of notes in response to a cue.

Why it matters: Jukebox’s ability to match lyrics and voices to the music it generates can be uncanny. It could herald a new way for human musicians to produce new work. As a percentage of all music consumed, computer-generated music is poised to grow.

We’re thinking: Human artists already produce a huge volume of music, more than any one person can listen to. But we’re particularly excited about the opportunity for customization. What if you could have robo-Beyoncé sing a customized tune for your home movie, or robo-Elton John sing you a song celebrating your birthday?
