Deep learning models can be unwieldy and often impractical to run on smaller devices without major modification. Researchers at Facebook AI Research found a way to compress neural networks with minimal sacrifice in accuracy.
What’s new: Building on earlier work, the researchers coaxed networks to learn compact representations of their layers’ weights. Rather than storing weights directly, the technique stores approximate values that stand in for groups of weights.
Key insight: The researchers modified an existing data-compression method, product quantization, to learn viable weight approximations.
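To ground the terms used below, here is a minimal numpy sketch of vanilla product quantization applied to a single weight matrix: split it into subvectors, cluster them with k-means, and keep only the small codebook plus an index per subvector. The subvector length d, codebook size k, and all names are illustrative choices, not values from the paper.

```python
# A minimal sketch of standard product quantization (PQ) of one weight matrix.
import numpy as np

def product_quantize(W, d=4, k=16, iters=20, seed=0):
    """Split each column of W into length-d subvectors, learn a codebook of k
    codewords with plain k-means, and return (codebook, assignments)."""
    rng = np.random.default_rng(seed)
    rows, cols = W.shape
    assert rows % d == 0, "row count must be divisible by the subvector size"
    # One subvector per d consecutive entries of each column.
    subvectors = W.reshape(rows // d, d, cols).transpose(0, 2, 1).reshape(-1, d)

    # Plain k-means on the pool of subvectors.
    codebook = subvectors[rng.choice(len(subvectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each subvector to its nearest codeword.
        dists = ((subvectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Recompute each codeword as the mean of its assigned subvectors.
        for c in range(k):
            members = subvectors[assign == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook, assign

def reconstruct(codebook, assign, shape, d=4):
    """Rebuild an approximate weight matrix from codeword indices."""
    rows, cols = shape
    subvectors = codebook[assign].reshape(rows // d, cols, d).transpose(0, 2, 1)
    return subvectors.reshape(rows, cols)

W = np.random.randn(64, 32).astype(np.float32)
codebook, assign = product_quantize(W, d=4, k=16)
W_hat = reconstruct(codebook, assign, W.shape, d=4)
print(np.abs(W - W_hat).mean())   # average approximation error in the weights
```

In this toy example the stored data shrinks from 2,048 floats to a 16x4 codebook plus 512 four-bit indices, roughly a 16x reduction, at the cost of approximation error in the weights themselves. Closing that gap is what the researchers' modification targets.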
How it works: By representing groups of similar weights with a single shared codeword, the network needs to store only a small set of codewords plus pointers to them. This reduces the amount of storage needed for a layer's weights. The network learns an optimal set of codewords for groups of weights, or subvectors, in a layer by minimizing the difference between the layer outputs of the original and compressed networks (a code sketch of this objective follows the list below).
- For fully connected layers, the authors group the weights into subvectors. (They propose a similar but more involved process for convolutional layers.)
- They pick a random subset of subvectors as the initial codewords, then iteratively refine them, layer by layer, to minimize the difference between the outputs of the compressed and original layers.
- Then they optimize the compressed network representation against multiple layers at a time, starting with the first two and ultimately encompassing the entire network.
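The sketch below illustrates the key modification, under our own simplifying assumptions: each subvector is assigned to, and each codeword refit from, whatever choice best preserves the layer's outputs on a batch of sample inputs (measured through the inputs' Gram matrices), rather than the weight values themselves. It is a rough, single-layer approximation of the procedure described above; the function name and hyperparameters are hypothetical, not the authors'.

```python
# A rough, single-layer sketch of output-aware product quantization, assuming
# blocks of d consecutive input weights per column and a batch X of sample
# layer inputs. Names and hyperparameters are illustrative, not the paper's.
import numpy as np

def activation_aware_pq(W, X, d=4, k=16, iters=15, seed=0):
    """EM-style loop: assign each subvector to the codeword that best preserves
    the layer's outputs on X, then refit each codeword by least squares."""
    rng = np.random.default_rng(seed)
    cin, cout = W.shape
    nblocks = cin // d
    # Subvector (b, j) = W[b*d:(b+1)*d, j]; its contribution to the output
    # error is weighted by the Gram matrix of the matching input slice.
    subvecs = W.reshape(nblocks, d, cout).transpose(0, 2, 1).reshape(-1, d)
    grams = np.stack([X[:, b*d:(b+1)*d].T @ X[:, b*d:(b+1)*d]
                      for b in range(nblocks)])        # (nblocks, d, d)
    gram_of = np.repeat(np.arange(nblocks), cout)      # block index per subvector

    codebook = subvecs[rng.choice(len(subvecs), size=k, replace=False)].copy()
    for _ in range(iters):
        # E-step: distance of subvector v to codeword c is (v - c)^T G (v - c),
        # i.e. the squared output error caused by replacing v with c.
        diffs = subvecs[:, None, :] - codebook[None, :, :]          # (n, k, d)
        dists = np.einsum('nkd,nde,nke->nk', diffs, grams[gram_of], diffs)
        assign = dists.argmin(axis=1)
        # M-step: codeword = (sum of member Grams)^-1 (sum of Gram-weighted
        # members), with a small ridge term for numerical stability.
        for c in range(k):
            idx = np.where(assign == c)[0]
            if len(idx) == 0:
                continue
            A = grams[gram_of[idx]].sum(axis=0) + 1e-6 * np.eye(d)
            rhs = np.einsum('nde,ne->d', grams[gram_of[idx]], subvecs[idx])
            codebook[c] = np.linalg.solve(A, rhs)
    W_hat = (codebook[assign].reshape(nblocks, cout, d)
             .transpose(0, 2, 1).reshape(cin, cout))
    return codebook, assign, W_hat

# Compare the layer's outputs before and after compression.
X = np.random.randn(256, 64).astype(np.float32)   # sample inputs to the layer
W = np.random.randn(64, 32).astype(np.float32)
_, _, W_hat = activation_aware_pq(W, X)
print(np.linalg.norm(X @ W - X @ W_hat) / np.linalg.norm(X @ W))
```

Minimizing output error rather than weight error lets the quantizer spend its limited codewords on the weights that matter most for real inputs; the authors extend the idea to convolutional layers and then fine-tune across layers, as described above.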
Results: The researchers achieve the best top-1 accuracy on ImageNet for model sizes of 5MB and 10MB. (They achieve competitive accuracy for 1MB models.) They also show that their quantization method outperforms previous approaches on ResNet-18.
Why it matters: Typically, researchers establish the best model for a given task, and follow-up studies find new architectures that deliver similar performance using less memory. This work offers a way to compress an existing architecture, potentially taking any model from groundbreaking results in the lab to widespread distribution in the field with minimal degradation in performance.
Yes, but: The authors demonstrate their method only on architectures built from fully connected and convolutional layers. Further research will be needed to find its limits and to optimize compressed models for inference speed.
We’re thinking: The ability to compress top-performing models could put state-of-the-art AI in the palm of your hand and eventually in your pacemaker.