Neural audio synthesizers like WaveRNN or GANSynth produce impressive sounds, but they require large, data-hungry neural networks. A new code library beefs up the neural music studio with efficient sound modules based on traditional synthesizer designs.
What’s new: Jesse Engel and colleagues at Google Brain introduced Differentiable Digital Signal Processing (DDSP), a set of digital signal processing tools that integrate with neural networks to boost their performance.
Key insight: Traditional synthesizers incorporate powerful sound-generation and -processing tools, but their controls are often limited to sliders and switches that don’t take full advantage of their abilities. A neural network can learn to manipulate such tools more dynamically, potentially producing more realistic renditions of existing instruments as well as novel sounds.
How it works: DDSP offers tools such as oscillators (which generate sound), filters (which modify tone color), envelopes (which shape the sound over time), and reverberators (which mimic sound waves that reflect off walls). Most are implemented as differentiable layers that can be inserted into neural networks without breaking backpropagation, so a network can learn to control them.
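To make that concrete, here's a minimal sketch, in TensorFlow, of an additive (harmonic) oscillator written as a layer a network can drive. It illustrates the general technique rather than the DDSP library's actual API; the HarmonicOscillator name, shapes, and defaults are assumptions for the example.

```python
import numpy as np
import tensorflow as tf

class HarmonicOscillator(tf.keras.layers.Layer):
    """Illustrative differentiable additive oscillator.

    The layer has no trainable weights of its own; it turns
    network-predicted controls (fundamental frequency and per-harmonic
    amplitudes) into audio, so gradients flow back to the controller.
    """

    def __init__(self, sample_rate=16000, **kwargs):
        super().__init__(**kwargs)
        self.sample_rate = sample_rate

    def call(self, f0_hz, harmonic_amps):
        # f0_hz: [batch, time] fundamental frequency at each sample.
        # harmonic_amps: [batch, time, n_harmonics] amplitude of each harmonic.
        n_harmonics = tf.shape(harmonic_amps)[-1]
        harmonic_numbers = tf.cast(tf.range(1, n_harmonics + 1), tf.float32)
        # Frequency of each harmonic over time: [batch, time, n_harmonics].
        freqs = f0_hz[..., tf.newaxis] * harmonic_numbers
        # Integrate frequency to phase via a cumulative sum over time.
        phases = 2.0 * np.pi * tf.cumsum(freqs / self.sample_rate, axis=1)
        # Sum the sinusoids, weighted by the predicted amplitudes.
        return tf.reduce_sum(harmonic_amps * tf.sin(phases), axis=-1)
```

Because every operation here is differentiable, a loss computed on the output audio can send gradients back through the oscillator to whatever network predicts f0_hz and harmonic_amps.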
- The researchers use DDSP to emulate Spectral Modeling Synthesis (SMS), a digital synthesis technique dating to the early 1990s. Once trained, their SMS emulator can mimic input sounds. Also, parts of an SMS network trained on, for instance, violins can be swapped with those of one trained on, say, guitars to reinterpret a violin recording using a guitar sound.
- They re-created the SMS architecture as an autoencoder with additional components. The autoencoder’s encoder maps input sounds to low-dimensional vectors. The decoder’s output drives DDSP’s oscillator and filter, which in turn feed a reverberator to produce the final output.
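Putting the pieces together, the forward pass might look roughly like the sketch below. The encoder, decoder, noise_filter, and reverb here are stand-ins with assumed names and shapes, not the authors' implementation, and the controls are kept at audio rate for simplicity (the real model predicts frame-rate controls and upsamples them).

```python
import tensorflow as tf

def synthesize(features, encoder, decoder, oscillator, noise_filter, reverb):
    # Encoder: compress spectral frames into a low-dimensional code.
    z = encoder(features["spectral_frames"])      # [batch, time, latent_dim]

    # Decoder: map the code plus pitch and loudness to synth controls.
    conditioning = tf.concat(
        [z,
         features["f0_hz"][..., tf.newaxis],
         features["loudness"][..., tf.newaxis]],
        axis=-1)
    controls = decoder(conditioning)              # [batch, time, n_controls]

    # Split the controls: per-harmonic amplitudes for the oscillator and
    # magnitudes for the noise filter (the split point is illustrative).
    harmonic_amps = tf.nn.softplus(controls[..., :60])
    filter_mags = tf.nn.softplus(controls[..., 60:])

    # DDSP-style modules: harmonic oscillator plus filtered noise,
    # then a reverberator applied to the mix.
    harmonic_audio = oscillator(features["f0_hz"], harmonic_amps)
    noise_audio = noise_filter(filter_mags)
    return reverb(harmonic_audio + noise_audio)
```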
Results: The SMS emulator showed that DDSP can make for a high-quality neural sound generator. Compared to WaveRNN, it achieved a lower L1 loudness loss, which measures the difference in loudness between the audio input and the synthesized output (0.07 versus 0.10). It also achieved a lower L1 fundamental-frequency loss, which measures how closely the synthesized audio tracks the pitch of the input (0.02 versus 1.0). And it has one tenth as many parameters!
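For reference, both metrics reduce to mean absolute (L1) errors between curves extracted from the input audio and from the resynthesized audio. A minimal sketch, assuming the loudness and fundamental-frequency curves have already been extracted upstream:

```python
import tensorflow as tf

def l1_loudness_loss(loudness_in, loudness_out):
    # Mean absolute difference between the input's and output's loudness curves.
    return tf.reduce_mean(tf.abs(loudness_in - loudness_out))

def l1_f0_loss(f0_in, f0_out):
    # Mean absolute difference between the input's and output's pitch curves.
    return tf.reduce_mean(tf.abs(f0_in - f0_out))
```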
Why it matters: Audio synthesis is one of several applications migrating from digital signal processing tech to deep learning. Machine learning engineers need not leave the older technology behind — they can build DSP functions into their neural networks.
We’re thinking: The SMS demo is preliminary, but it points toward next-generation audio models that combine deep learning with more intuitive structures and controls.