Why use a complex model when a simple one will do? New work shows that the simplest multilayer neural network, with a small twist, can perform some tasks as well as today’s most sophisticated architectures.
What’s new: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, and a team at Google Brain revisited multilayer perceptrons (MLPs, also known as vanilla neural networks). They built MLP-Mixer, a no-frills model that approaches state-of-the-art performance in ImageNet classification.
Key insight: Convolutional neural networks excel at processing images because they’re designed to discern spatial relationships, and pixels that are near one another in an image tend to be more related than pixels that are far apart. MLPs have no such bias, so they tend to learn interpixel relationships that hold in the training set but not in real life. By modifying MLPs to operate on image patches rather than individual pixels, MLP-Mixer enables this basic architecture to learn useful image features.
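The patch-based view can be pictured with a short, hypothetical NumPy sketch (not the authors’ code): it splits an image into non-overlapping 16x16 patches and flattens each one into a vector that a linear layer can then embed. The 224x224 image size, the 16-pixel patch size, and the function name are illustrative assumptions.

```python
# Minimal sketch of patch-based tokenization (illustrative, not the paper's code).
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    one row per patch, ready for a linear projection.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Reshape into a grid of patches, then flatten each patch into one row.
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch_h, patch_w, c)
    return patches.reshape(-1, patch_size * patch_size * c)

# Example: a 224x224 RGB image becomes 196 patch vectors of length 768.
image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768)
```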
How it works: The authors pretrained MLP-Mixer for image classification using ImageNet-21k, which contains 21,000 classes, and fine-tuned it on the 1,000-class ImageNet.
- Given an image divided into patches, MLP-Mixer uses an initial linear layer to project each patch to a vector of 1,024 values, or channels. It stacks these vectors in a matrix, so each row contains all the channels of one patch, and each column contains one channel of every patch.
- MLP-Mixer comprises a series of mixer layers, each of which contains two MLPs, each made up of two fully connected layers. Given a matrix, a mixer layer uses one MLP to mix values within columns, that is, across patches (which the authors call token mixing), and another to mix values within rows, across channels (which the authors call channel mixing). This process produces a new matrix to be passed along to the next mixer layer (see the sketch after this list).
- Finally, the model averages the representations across patches and feeds the result to a softmax layer, which renders the classification.
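Below is a minimal, hypothetical NumPy sketch of the forward pass described above, not the authors’ JAX implementation. The shapes (196 patches by 1,024 channels), hidden widths, and random weights are illustrative assumptions; the skip connections and GELU activation mirror the paper’s design, while layer normalization is omitted for brevity.

```python
# Illustrative forward pass through one mixer layer plus a classification head.
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # Two fully connected layers with a nonlinearity in between.
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
num_patches, channels, num_classes = 196, 1024, 1000
token_hidden, channel_hidden = 512, 4096

# Initial linear layer: each flattened patch (here 16*16*3 = 768 values)
# is projected to 1,024 channels, giving a (patches x channels) matrix.
patches = rng.normal(size=(num_patches, 768)).astype(np.float32)
w_embed = rng.normal(scale=0.02, size=(768, channels)).astype(np.float32)
x = patches @ w_embed  # shape (196, 1024): rows = patches, columns = channels

# One mixer layer: token mixing works down each column (across patches),
# channel mixing works along each row (across channels).
w1 = rng.normal(scale=0.02, size=(num_patches, token_hidden)); b1 = np.zeros(token_hidden)
w2 = rng.normal(scale=0.02, size=(token_hidden, num_patches)); b2 = np.zeros(num_patches)
x = x + mlp(x.T, w1, b1, w2, b2).T          # token mixing, with a skip connection

w3 = rng.normal(scale=0.02, size=(channels, channel_hidden)); b3 = np.zeros(channel_hidden)
w4 = rng.normal(scale=0.02, size=(channel_hidden, channels)); b4 = np.zeros(channels)
x = x + mlp(x, w3, b3, w4, b4)              # channel mixing, with a skip connection

# Classification head: average over patches, project to class logits, softmax.
w_head = rng.normal(scale=0.02, size=(channels, num_classes))
logits = x.mean(axis=0) @ w_head
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(probs.shape)  # (1000,)
```

Stacking many such mixer layers and training end to end yields the full model; notably, the whole architecture reduces to matrix multiplications, reshapes, and elementwise nonlinearities.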
Results: A large MLP-Mixer that splits images into 16x16-pixel patches classified ImageNet with 84.15 percent accuracy. That’s comparable to the state-of-the-art 85.8 percent accuracy achieved by a 50-layer HaloNet, a ResNet-like architecture with self-attention.
Yes, but: MLP-Mixer matched state-of-the-art performance only when pretrained on a sufficiently large dataset. Pretrained on 10 percent of JFT-300M and fine-tuned on ImageNet, it achieved 54 percent accuracy, while a ResNet-based BiT trained the same way achieved 67 percent.
Why it matters: MLPs are the simplest building blocks of deep learning, yet this work shows that, given enough pretraining data, they can nearly match the best-performing architectures for image classification.
We’re thinking: If simple neural nets work as well as more complex ones for computer vision, maybe it’s time to rethink architectural approaches in other areas, too.