Text-to-music generation has arrived.
What's new: Stability AI, maker of the Stable Diffusion image generator and StableLM text generator, launched Stable Audio, a system that generates music and sound effects from text. You can play with it and listen to examples here. The service is free for 20 generations per month up to 45 seconds long. The professional tier allows 500 generations per month, up to 90 seconds long, for $11.99 per month. An enterprise tier is negotiable. The company said it would open-source the model eventually.
How it works: Stable Audio is a latent diffusion model. It generates audio by a process that’s similar to the way Stable Diffusion generates images, but it uses a variational autoencoder to map audio to an embedding for processing and back to audio for your listening pleasure. The authors trained the system on 800,000 audio files containing music, sound effects, and performances on individual instruments, paired with corresponding text descriptions.
- During training, a variational autoencoder learns small embedding representations of audio examples.
- A CLAP transformer pretrained on the same dataset produces an embedding of the text prompt, which can describe musical characteristics like style, instrumentation, tempo, and mood, or any other sort of description. Separate embedding layers represent the duration of the audio to be generated and how many seconds into a given audio file the current training example starts. The latter helps the model learn how musical compositions unfold over time.
- Stable Audio adds noise to the audio embedding. A U-Net convolutional neural network learns to estimate and remove the added noise, conditioned on the text and timing embeddings.
- At inference, the system starts with a pure-noise embedding plus the user’s text prompt and requested output length. It removes noise iteratively to produce an embedding of the generated audio. From that embedding, the variational autoencoder’s decoder produces audio at CD quality (16-bit, 44.1kHz, stereo), as in the simplified sketch below.
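To make that flow concrete, here is a minimal PyTorch sketch of a latent-diffusion training step and sampling loop of this general shape. The classes TinyAudioVAE and TinyConditionalUNet and the timing_embedding helper are invented stand-ins, and the noise schedule and sampler are deliberately oversimplified; none of this reflects Stable Audio’s published architecture or code.

```python
# A minimal sketch of the latent-diffusion flow described above.
# Every class, dimension, and hyperparameter is an illustrative stand-in,
# NOT Stable Audio's actual architecture, training objective, or sampler.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyAudioVAE(nn.Module):
    """Stand-in for the variational autoencoder that maps waveforms to latents and back."""
    def __init__(self, channels=2, latent_dim=64, stride=1024):
        super().__init__()
        self.encoder = nn.Conv1d(channels, latent_dim, kernel_size=stride, stride=stride)
        self.decoder = nn.ConvTranspose1d(latent_dim, channels, kernel_size=stride, stride=stride)

    def encode(self, audio):           # audio: (batch, 2, samples)
        return self.encoder(audio)     # latent: (batch, latent_dim, frames)

    def decode(self, latent):
        return self.decoder(latent)    # back to a stereo waveform


class TinyConditionalUNet(nn.Module):
    """Stand-in for the U-Net that estimates the noise added to a latent,
    conditioned on text and timing embeddings (the timestep is ignored in this toy)."""
    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Conv1d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, t, cond):
        c = self.cond_proj(cond).unsqueeze(-1)    # broadcast conditioning across time
        return self.net(noisy_latent + c)


def timing_embedding(seconds_start, seconds_total, dim=64):
    """Stand-in for the learned embeddings of clip start time and total duration."""
    return torch.cat([
        torch.full((1, dim // 2), float(seconds_start)),
        torch.full((1, dim // 2), float(seconds_total)),
    ], dim=-1)


vae, unet = TinyAudioVAE(), TinyConditionalUNet()

# --- One training step (noise schedule and objective are deliberately simplified) ---
audio = torch.randn(1, 2, 44100 * 4)              # 4 seconds of stereo audio at 44.1 kHz
text_emb = torch.randn(1, 64)                     # placeholder for a CLAP text embedding
cond = torch.cat([text_emb, timing_embedding(0.0, 4.0)], dim=-1)

with torch.no_grad():
    latent = vae.encode(audio)                    # compress the waveform into the latent space
noise = torch.randn_like(latent)
t = torch.rand(1)                                 # random diffusion time in [0, 1)
noisy_latent = (1 - t) * latent + t * noise       # mix signal and noise
pred_noise = unet(noisy_latent, t, cond)
loss = F.mse_loss(pred_noise, noise)              # train the U-Net to recover the added noise
loss.backward()

# --- Inference: start from pure noise, denoise iteratively, then decode to audio ---
with torch.no_grad():
    x = torch.randn(1, 64, latent.shape[-1])      # pure-noise latent sized for the requested length
    for step in torch.linspace(1.0, 1.0 / 50, steps=50):
        pred = unet(x, step, cond)                # estimate the remaining noise
        x = x - (1.0 / 50) * pred                 # crude fixed-step update; real samplers differ
    generated_audio = vae.decode(x)               # the decoder returns a stereo waveform
```

The design point the sketch highlights is that denoising happens in the autoencoder’s compact latent space rather than on raw samples, which presumably contributes to the system’s speed and its ability to produce long, CD-quality clips.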
Behind the news: Stable Audio joins earlier services including Boomy, Mubert, plugger.ai, Soundful, and VEED.IO. It follows tantalizing advances in audio generation.
- Google MusicLM learned to generate music from text descriptions by setting the problem up as a sequence-to-sequence modeling task.
- Riffusion turned spectrograms generated by Stable Diffusion into audio.
- OpenAI’s Jukebox learned to compress its training data and generate audio from this compressed space. The researchers guided generation using metadata including artist, lyrics, and style.
Yes, but: Stable Audio excels when generating instrumental and ambient music, but its output tends to suffer from some of the same flaws as previous text-to-music generators: Longer outputs often lack a coherent structure, and the clarity and detail of individual instruments and sound effects vary wildly. It also doesn’t effectively generate the sound of a vocalist pronouncing words.
Why it matters: AI has demonstrated its prowess at generating convincing text and images. Generated audio has implications for producers not only of music but also of videos, video games, and podcasts. Stable Audio sounds like an early step, but it stands out for its speed, high-resolution output, and the inclusion of a mechanism for learning musical structure.
We're thinking: Stable Audio is impressive, but this doesn’t quite feel like music’s GPT moment. Text and image generation took off as soon as highly capable generative models appeared. Music generation may yet await models that can produce not only high-res output but also sonorities and structures coherent and varied enough to be widely useful.