If you want to both synthesize data and find the probability of any given example — say, generate images of manufacturing defects to train a defect detector and identify the highest-probability defects — you may use the architecture known as a normalizing flow. A new type of layer enables users to boost a normalizing flow’s performance by tuning it to their training data.
What’s new: Gianluigi Silvestri at OnePlanet Research Center and colleagues at Google Research and Radboud University introduced the embedded-model flow (EMF). This architecture uses a probabilistic program — a user-defined probability distribution — to influence the training of a normalizing flow.
Normalizing flow basics: A normalizing flow (NF) is a generative architecture. Like a generative adversarial network (GAN), it learns to synthesize examples similar to its training data. Unlike a GAN, it also learns to calculate the likelihood of existing examples. During training, an NF transforms examples into noise. At inference, it runs in reverse to transform noise into synthetic examples. Thus it requires layers that can execute both forward and backward; that is, layers that are invertible as well as differentiable.
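To make the invertibility requirement concrete, here's a minimal sketch (ours, not the paper's) of an affine flow layer in plain NumPy. It runs forward to push data toward noise, runs backward to recover the data exactly, and exposes the log-determinant term that exact likelihood computation relies on. A real flow would learn the scale and shift, typically conditioning them on part of the input as in coupling layers; here they're hard-coded for brevity.

```python
# Sketch of the invertibility requirement: an affine "flow layer" that runs
# forward (data -> noise) and backward (noise -> data), and reports the
# log|det Jacobian| needed for exact likelihoods. Parameters are fixed here;
# a trained flow would learn them.
import numpy as np

class AffineFlowLayer:
    def __init__(self, log_scale, shift):
        self.log_scale = np.asarray(log_scale, dtype=float)
        self.shift = np.asarray(shift, dtype=float)

    def forward(self, x):
        """Map data toward noise; return output and log|det Jacobian| per example."""
        z = x * np.exp(self.log_scale) + self.shift
        log_det = np.full(len(x), np.sum(self.log_scale))
        return z, log_det

    def inverse(self, z):
        """Map noise back toward data; the exact inverse of forward."""
        return (z - self.shift) * np.exp(-self.log_scale)

layer = AffineFlowLayer(log_scale=[0.3, -0.2], shift=[1.0, 0.5])
x = np.random.default_rng(0).normal(size=(4, 2))
z, log_det = layer.forward(x)
assert np.allclose(layer.inverse(z), x)  # invertible by construction
```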
Key insight: Like a normalizing flow layer, the cumulative distribution function (CDF) of a probability distribution can be both differentiable and invertible. (Where it isn't, the CDF's derivative or inverse can be approximated.) Because a CDF fully characterizes its distribution, CDFs can be assembled into a probabilistic program. Such a program, being differentiable and invertible, can serve as a layer in an NF, where it can transform a random vector to follow a probability distribution and vice versa.
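For illustration (this is not the authors' code), the sketch below uses SciPy's Gaussian CDF and its inverse, the percent-point function ppf, to carry samples to uniform noise and back, which is exactly the invertible, differentiable behavior a flow layer needs.

```python
# Sketch of the probability integral transform: a CDF maps samples from its
# distribution to uniform noise, and the inverse CDF maps uniform noise back.
# The derivative of the CDF is the PDF, so the transform is differentiable.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Forward: Gaussian samples -> approximately uniform "noise" in (0, 1).
x = norm.rvs(loc=2.0, scale=0.5, size=1000, random_state=rng)
u = norm.cdf(x, loc=2.0, scale=0.5)

# Inverse: uniform noise -> samples that follow the Gaussian.
u_new = rng.uniform(size=1000)
x_new = norm.ppf(u_new, loc=2.0, scale=0.5)

# The round trip recovers the original samples (up to numerical precision).
assert np.allclose(norm.ppf(u, loc=2.0, scale=0.5), x)
print(x_new.mean(), x_new.std())  # roughly 2.0 and 0.5
```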
How it works: EMF is a normalizing flow composed of three normalizing flow layers and a user-defined probabilistic program layer. The authors used a dataset of handwritten digits to train the model to generate digits 0 through 9.
- The authors built a probabilistic program using a Gaussian hierarchical distribution, which models a user-defined number of populations (in this case, 10 digits).
- They modeled the distribution using its CDF and implemented the resulting function as a probabilistic program layer (a simplified sketch appears after this list).
- The probabilistic program layer learned to transform the distribution’s 10 populations into random noise. This helped the normalizing flow layers learn to allocate various digits to different parts of the distribution.
- At inference, the authors reversed the network, putting the probabilistic program layer first. It transformed a random vector into the distribution of 10 populations, and the other layers produced a new image.
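The sketch below illustrates how such a CDF-based layer can run in both directions. It is a simplification, not the paper's method: a one-dimensional mixture of 10 Gaussians stands in for the hierarchical distribution over 10 populations, the surrounding normalizing flow layers are omitted, and the inverse is computed numerically with a root finder because the mixture CDF has no closed-form inverse.

```python
# Sketch of a CDF-based probabilistic program layer with 10 modes, one per
# "population." Forward maps samples to roughly uniform noise; the inverse
# maps uniform noise back to samples clustered around the modes.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

means = np.linspace(-9.0, 9.0, 10)  # one mode per population (digit class)
weights = np.full(10, 0.1)

def program_cdf(x):
    """Forward (training direction): a sample -> a value in (0, 1)."""
    return float(np.dot(weights, norm.cdf(x, loc=means, scale=1.0)))

def program_inverse(u):
    """Inverse (inference direction): uniform noise -> a mixture sample."""
    return brentq(lambda x: program_cdf(x) - u, -50.0, 50.0)

rng = np.random.default_rng(0)
noise = rng.uniform(size=5)
samples = np.array([program_inverse(u) for u in noise])  # each lands near a mode

# Running forward again recovers the noise, confirming invertibility.
recovered = np.array([program_cdf(x) for x in samples])
assert np.allclose(recovered, noise)
```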
Results: The authors compared EMF with a baseline made up of a comparable number of normalizing flow layers. On the handwritten-digit test set, EMF achieved a negative log likelihood of 1260.8, while the baseline scored 1307.9 (lower is better). EMF also outperformed similar baselines trained for other tasks. For instance, generating solutions to the differential equations that describe Brownian motion, it achieved a negative log likelihood of -26.4 compared to the baseline's -26.1.
Yes, but: A baseline with an additional normalizing flow layer achieved a better negative log likelihood (1181.3) for generating test-set digits. The authors explain that EMF may have underperformed because it had fewer parameters, although they don’t quantify the difference.
Why it matters: Normalizing flows have their uses, but the requirement that their layers be invertible imposes severe limitations. By proposing a new layer type that improves their performance, this work makes them less forbidding and more useful. In fact, probabilistic programs aren't difficult to make: They're easy to diagram, and the authors offer an algorithm that turns such diagrams into normalizing flow layers.
We’re thinking: The authors achieved intriguing results with a small model (three layers, compared to deeper models in other work) and a small dataset (10,000 examples compared to, say, ImageNet’s 1.28 million). We look forward to learning what EMF-style models can accomplish with more and wider layers, and with larger datasets like ImageNet.