The technique known as attention has yielded spectacular results in speech and natural language tasks. Now researchers have shown that it can improve image recognizers as well.
What’s new: A residual neural network incorporating self-attention layers set a new state of the art in ImageNet classification, improving accuracy by 1.3 percent over a ResNet50 baseline, according to a new paper. It also beat the best object-detection score on Common Objects in Context (COCO) relative to a RetinaNet baseline. This is not the first use of attention in vision tasks, but it outperforms earlier efforts.
How it works: Quoc Le and his colleagues augmented convolution layers with multi-headed self-attention layers. The resulting network runs attention in parallel with convolution and concatenates the two outputs. Attention extends a network’s awareness beyond the data it’s processing at a given moment. Adding it to a convolutional neural network enables the model to weigh relationships between distant regions of an image, not just local neighborhoods, as in the sketch below.
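To make the parallel structure concrete, here is a minimal PyTorch sketch of a convolution layer augmented with self-attention. The class name, channel split, and 1x1 projection are illustrative choices, not the authors’ implementation, and it omits details such as the paper’s relative positional encodings.

```python
import torch
import torch.nn as nn

class AttentionAugmentedConv2d(nn.Module):
    """Convolution and multi-head self-attention run in parallel;
    their outputs are concatenated along the channel dimension.
    A simplified sketch, not the paper's exact architecture."""

    def __init__(self, in_channels, conv_channels, attn_channels, num_heads=8):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, conv_channels, kernel_size=3, padding=1)
        # 1x1 convolution projects the input to the attention branch's width.
        self.proj = nn.Conv2d(in_channels, attn_channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_channels, num_heads, batch_first=True)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)  # local features: (B, conv_channels, H, W)
        # Flatten the spatial grid into a sequence so every position can
        # attend to every other position.
        seq = self.proj(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_channels)
        attn_out, _ = self.attn(seq, seq, seq)
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        # Concatenate the convolutional and attentional features channel-wise.
        return torch.cat([conv_out, attn_out], dim=1)

# Example: 64 input channels split into 48 convolutional and 16 attentional
# output channels, so the output width matches the input.
layer = AttentionAugmentedConv2d(64, conv_channels=48, attn_channels=16)
out = layer(torch.randn(2, 64, 32, 32))  # shape: (2, 64, 32, 32)
```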
Why it matters: Attention layers generally require fewer parameters than convolution layers to achieve a given accuracy target. That makes it possible to improve performance without increasing model size.
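For a rough sense of the savings, here is a comparison using standard PyTorch layers. The channel width of 256 is an arbitrary illustration, not taken from the paper, and the attention count ignores positional encodings.

```python
import torch.nn as nn

c = 256  # illustrative channel width, not the paper's
conv = nn.Conv2d(c, c, kernel_size=3, bias=False)         # 9 * c^2 weights
attn = nn.MultiheadAttention(c, num_heads=8, bias=False)  # 4 * c^2 weights (Q, K, V, output)

print(sum(p.numel() for p in conv.parameters()))  # 589824
print(sum(p.numel() for p in attn.parameters()))  # 262144
```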
Bottom line: Image recognizers based on CNNs already achieve high accuracy and sensitivity, and adding attention makes them even sharper.