End-to-end backpropagation and labeled data are the peanut butter and chocolate of deep learning. However, recent work suggests that neither is necessary to train effective neural networks to represent complex data.
What’s new: Sindy Löwe, Peter O’Connor, and Bastiaan Veeling propose Greedy InfoMax (GIM), an unsupervised method that learns to extract features by training one layer at a time.
Key insight: The information bottleneck (IB) theory suggests that neural networks work by concentrating information, much like a data-compression algorithm. In data compression, the amount of information retained is measured as the mutual information (MI) between the original and compressed versions. IB says that neural nets maximize MI between each layer’s input and output. Thus GIM reframes learning as a self-supervised compression problem. Unlike earlier MI-based approaches, it optimizes each layer separately.
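For reference, mutual information has the standard definition below; the notation (input X, layer output Z) is ours, not the paper’s.

```latex
I(X; Z) = H(Z) - H(Z \mid X)
        = \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(x,z)}{p(x)\,p(z)}\right]
```

MI is generally intractable to compute directly, so GIM, like contrastive predictive coding before it, maximizes a tractable lower bound estimated via the contrastive prediction task described below.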
How it works: GIM operates on modular networks, in which each layer learns to extract features from its input and passes its output to the next layer, and so on down to the final layer. GIM doesn’t require labels, but if they’re available, a linear classifier can learn from GIM’s compressed output in a supervised manner.
- GIM trains each layer independently, using the previous layer’s output as that layer’s input. This differs from standard backpropagation, in which all layers learn at once (see the sketch after this list).
- The researchers devised a task that teaches each layer to extract features that maximize MI. Given a subsequence of input data compressed according to the current weights, the layer predicts the next element in the compressed sequence, choosing among a random selection of candidates drawn from the input that includes the correct choice. High accuracy demonstrates that the layer compresses the input without losing predictive information.
- The process effectively removes redundancy between nearby regions of the input. For example, a song’s chorus may repeat several times in a recording, so the recording can be represented without encoding each repetition.
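Here is a minimal PyTorch sketch of the idea, assuming a stack of convolutional blocks and a simplified one-step contrastive loss (the paper predicts several steps ahead and samples negatives differently); the module names and sizes are illustrative, not the authors’ architecture.

```python
# Minimal sketch of greedy, layerwise contrastive training in the spirit of
# Greedy InfoMax. Block sizes and the one-step InfoNCE loss are illustrative
# assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """One trainable layer: a strided 1D conv followed by a nonlinearity."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                       # x: (batch, channels, time)
        return F.relu(self.conv(x))

def infonce_loss(z, predictor):
    """Contrastive loss: predict the next encoded step from the current one.
    All other (batch, time) positions serve as negatives."""
    context = z[:, :, :-1]                      # z_t
    target = z[:, :, 1:]                        # z_{t+1}
    pred = predictor(context.transpose(1, 2))   # (batch, time-1, dim)
    target = target.transpose(1, 2)             # (batch, time-1, dim)
    pred = pred.reshape(-1, pred.size(-1))      # flatten batch and time
    target = target.reshape(-1, target.size(-1))
    logits = pred @ target.t()                  # similarity to all candidates
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)      # correct pairing lies on the diagonal

# A stack of blocks, each with its own predictor and its own optimizer.
channels = [1, 32, 64, 64]
blocks = [EncoderBlock(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])]
predictors = [nn.Linear(c, c) for c in channels[1:]]
optims = [torch.optim.Adam(list(b.parameters()) + list(p.parameters()), lr=1e-3)
          for b, p in zip(blocks, predictors)]

def train_step(x):
    """One greedy update: every layer optimizes its own loss, and .detach()
    stops gradients from reaching earlier layers."""
    for block, predictor, opt in zip(blocks, predictors, optims):
        x = block(x.detach())                   # gradient flows only within this layer
        loss = infonce_loss(x, predictor)
        opt.zero_grad()
        loss.backward()
        opt.step()
        x = x.detach()                          # pass a gradient-free output downstream
    return x

# Example with random audio-like input: a batch of 8 one-channel sequences.
features = train_step(torch.randn(8, 1, 128))
```

The key design choice is the detach between blocks: each optimizer updates only its own block and predictor, so no end-to-end gradient is ever formed.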
Results: The researchers pitted Greedy InfoMax against contrastive predictive coding (CPC). In image classification, GIM beat CPC by 1.4 percentage points, achieving 81.9 percent accuracy. In a voice identification task, GIM underperformed CPC by 0.2 percentage points, scoring 99.4 percent accuracy. GIM’s scores are state-of-the-art for models based on mutual information.
Why it matters: Backprop requires storing forward activations, gradients, and weights for an entire network simultaneously. Greedy InfoMax handles each layer individually, making it possible to accommodate much larger models in limited memory.
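To illustrate the memory argument, a hypothetical training loop could push data through already-trained layers without building a computation graph, so only the layer currently being trained keeps activations for backprop. This sketch is ours, not the authors’ code.

```python
# Hypothetical sketch of memory-frugal layerwise training: earlier blocks run
# under no_grad, so their activations are not retained for backprop; only
# block k's graph occupies memory.
import torch

def train_block_k(blocks, k, x, loss_fn, optimizer):
    with torch.no_grad():            # frozen pass through earlier layers
        for block in blocks[:k]:
            x = block(x)
    z = blocks[k](x)                 # only this layer records a backprop graph
    loss = loss_fn(z)                # e.g., the contrastive loss sketched above
    optimizer.zero_grad()
    loss.backward()                  # gradients touch only blocks[k]'s parameters
    optimizer.step()
    return loss.item()
```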
Behind the news: Layerwise training or pretraining has been around for at least a decade. For example, stacked autoencoders use reconstruction error as an alternative unsupervised signal to drive data compression. Many past approaches focus on pretraining and assume that, once each layer has been trained individually, all layers will be trained together on a supervised task.
We’re thinking: Many machine learning applications use a large pretrained network as an initial feature extractor and then apply transfer learning. By maximizing MI between layers, this approach could make it possible to train still larger feature extractors on more data.