The transformer architecture that has shaken up natural language processing may replace recurrent layers in object detection networks.
What’s new: A Facebook team led by Nicolas Carion and Francisco Massa simplified object detection pipelines by using transformers, yielding Detection Transformer (DETR).
Key insight: Images can show multiple objects. Some object detection networks use recurrent layers to predict one object at a time until all objects are accounted for. Language models use transformers to evaluate a sequence of words in one pass. Similarly, DETR uses them to predict all objects in an image in a single pass.
How it works: DETR predicts a fixed number of object bounding boxes and classes per image. First, it extracts image features using convolutional layers. Then a transformer converts those features into a fixed set of object features, one for each region likely to contain an object. Finally, feed-forward layers decode each object feature into a class and a bounding box. (“No object” is a possible class, so the network can predict fewer real objects than it has slots.) A minimal code sketch of this pipeline appears after the list below.
- The transformer generates bounding boxes and labels as a set of predictions whose order is arbitrary, so each prediction must be matched to a ground-truth object before the loss can be computed.
- The loss function uses the Hungarian algorithm to match each ground-truth object with a unique prediction; predictions left unmatched are trained toward the “no object” class (see the matching sketch after this list). This makes anchor boxes (predefined reference boxes) and hand-tuned matching heuristics like non-maximum suppression unnecessary.
- During training, each transformer decoder layer makes its own predictions, and each is evaluated against the ground truth. This ensures that all layers learn to contribute, a technique borrowed from language models that isn’t available with recurrent layers. The additional loss terms especially help the system predict the correct number of objects.
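
The sketch below illustrates the pipeline described above, assuming a PyTorch-style implementation. The class name TinyDETR, the layer sizes, and the ResNet-50 backbone from torchvision are our own illustrative choices rather than the authors’ code, and details such as positional encodings are omitted for brevity.

```python
# A minimal sketch of a DETR-style forward pass in PyTorch. Names (TinyDETR),
# layer sizes, and the ResNet-50 backbone are illustrative choices, not the
# authors' code; positional encodings are omitted for brevity.
import torch.nn as nn
import torchvision


class TinyDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        # Convolutional layers extract image features.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # The transformer attends over every feature-map position at once.
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6)
        # A fixed set of learned "object queries": one slot per predicted object.
        self.query_embed = nn.Embedding(num_queries, d_model)
        # Feed-forward layers turn each object feature into a class and a box.
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Linear(d_model, 4)  # (center x, center y, width, height)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.proj(self.backbone(images))  # (B, d_model, h, w)
        batch = feats.shape[0]
        src = feats.flatten(2).permute(2, 0, 1)   # (h*w, B, d_model) sequence of positions
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, batch, 1)  # (Q, B, d_model)
        obj_feats = self.transformer(src, tgt)    # one feature per object slot
        return self.class_head(obj_feats), self.bbox_head(obj_feats).sigmoid()
```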
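
The set-based loss hinges on the bipartite matching described in the second bullet. Here is a minimal sketch of that step for a single image, assuming SciPy’s linear_sum_assignment as the Hungarian solver; the function name hungarian_match and the cost weighting are illustrative, and DETR’s real matching cost also includes a generalized-IoU term that we omit.

```python
# A minimal sketch of Hungarian matching for one image. Cost weights are
# illustrative; the real matching cost also includes a generalized-IoU term.
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, bbox_weight=5.0):
    """pred_logits: (Q, C+1), pred_boxes: (Q, 4) in [0, 1],
    gt_labels: (G,) class ids, gt_boxes: (G, 4)."""
    probs = pred_logits.softmax(-1)                      # (Q, C+1)
    cost_class = -probs[:, gt_labels]                    # (Q, G): confident slots cost less
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G): L1 box distance
    cost = (cost_class + bbox_weight * cost_bbox).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # unique one-to-one assignment
    # Slots absent from pred_idx are trained toward the "no object" class.
    return pred_idx, gt_idx
```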
Results: The researchers pitted DETR against Faster R-CNN on the canonical object detection dataset COCO. At model sizes of roughly 40 million parameters, DETR beat Faster R-CNN on average precision, which summarizes precision over a range of recall thresholds, scoring 0.420 to Faster R-CNN’s 0.402. DETR was also faster, spotting objects at 28 images per second compared to Faster R-CNN’s 26.
Why it matters: Transformers are changing the way machine learning models handle sequential data in NLP and beyond.
We’re thinking: What happened to the Muppet names for transformer-based models? Fozzie Bear is available.