Object detectors typically detect only items that were labeled in their training data. A new method liberates them to locate and recognize a much wider variety of objects.
What’s new: Xiuye Gu and colleagues at Google Research developed Vision and Language Knowledge Distillation (ViLD) to build a zero-shot object detector — that is, one that can handle classes on which it didn’t train. ViLD takes advantage of representations generated by the pretrained zero-shot classifier CLIP.
Key insight: In knowledge distillation, one model learns to mimic another model’s output. Similarly, one model can learn to mimic another’s representations. An object detector’s representations (which encode several regions and classifications per image) can be aligned with a classifier’s (which encode one classification per image) by cropping an image that contains multiple objects into separate regions and feeding each crop to the classifier. Then the object detector can learn to reproduce the classifier’s representation of each region.
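Here’s a minimal sketch of that idea in PyTorch. The `detector_embed` and `clip_image_encoder` callables are hypothetical stand-ins for the detector’s region head and a frozen pretrained classifier; the shapes, crop size, and choice of L1 loss are illustrative rather than the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def distillation_loss(image, boxes, detector_embed, clip_image_encoder):
    """Train a detector's region embeddings to mimic a frozen classifier's.

    image: (1, 3, H, W) tensor; boxes: (num_boxes, 4) tensor of regions.
    detector_embed and clip_image_encoder are hypothetical callables that
    map their inputs to embeddings of the same dimensionality.
    """
    # The detector produces one embedding per region: (num_boxes, dim).
    region_embeddings = detector_embed(image, boxes)

    # Crop each region out of the image and resize it so the frozen
    # classifier (e.g., CLIP's image encoder) can embed it.
    crops = roi_align(image, [boxes], output_size=(224, 224))
    with torch.no_grad():
        target_embeddings = clip_image_encoder(crops)  # (num_boxes, dim)

    # The detector learns to reproduce the classifier's representation
    # of each cropped region.
    return F.l1_loss(region_embeddings, target_embeddings)
```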
How it works: To understand ViLD, it helps to know a bit about CLIP. CLIP matches images and text using a vision transformer and a text transformer pretrained on 400 million image-text pairs. At inference, users give it a text list of the classes they want to recognize. Fed an image, it returns the most likely class in the list. To that system, the authors added a Mask R-CNN object detector trained on the most common classes in Large Vocabulary Instance Segmentation (LVIS), a dataset that contains images of objects that have been segmented and labeled. They reserved the other LVIS classes for the test set.
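For context, here’s roughly how zero-shot classification works with OpenAI’s open-source clip package. The class prompts and image path are placeholders; the authors used LVIS class names rather than these examples.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The user supplies the class list as text at inference time.
class_prompts = ["a photo of a dog", "a photo of a bicycle", "a photo of a toaster"]
text_tokens = clip.tokenize(class_prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image and each class description,
# followed by a softmax, yields the most likely class in the list.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(class_prompts[probs.argmax().item()])
```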
- Given a list of LVIS classes, CLIP’s text transformer generated a list of class representations.
- Given an image, Mask R-CNN generated object representations. In parallel, CLIP’s vision transformer generated representations of the corresponding cropped regions.
- For each Mask R-CNN object representation, the authors found the closest LVIS class representation. They measured similarity via cosine similarity (the cosine of the angle between two vectors) and applied a softmax over the similarities to predict the object’s class.
- They trained the Mask R-CNN using two loss terms, sketched in code after this list. The first minimized the difference between CLIP’s and Mask R-CNN’s representations of each region. The second encouraged the Mask R-CNN’s predicted class of a region to match the known label.
- At inference, they fed the remaining LVIS classes to CLIP and added the text transformer’s representations to the earlier list. Presented with an object of a class it hadn’t been trained on, the Mask R-CNN generated a representation, and the system found the closest class representation in the expanded list.
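Pulling the training steps above together, here’s a sketch of the two loss terms. The inputs, temperature, and loss weight are illustrative; they stand in for precomputed CLIP text embeddings of the base classes, the Mask R-CNN’s per-region embeddings, and CLIP’s embeddings of the corresponding crops.

```python
import torch
import torch.nn.functional as F

def vild_style_losses(region_embeddings,      # (num_regions, dim) from the detector head
                      clip_crop_embeddings,   # (num_regions, dim) from CLIP's image encoder (frozen)
                      class_text_embeddings,  # (num_base_classes, dim) from CLIP's text encoder
                      labels,                 # (num_regions,) ground-truth base-class indices
                      temperature=0.01,
                      distill_weight=0.5):
    # Classification term: cosine similarity between each region embedding and
    # every class text embedding, then softmax / cross-entropy against the labels.
    regions = F.normalize(region_embeddings, dim=-1)
    classes = F.normalize(class_text_embeddings, dim=-1)
    logits = regions @ classes.T / temperature
    classification_loss = F.cross_entropy(logits, labels)

    # Distillation term: region embeddings should match CLIP's embeddings of the
    # cropped regions (see the earlier sketch under "Key insight").
    distillation_loss = F.l1_loss(region_embeddings, clip_crop_embeddings)

    return classification_loss + distill_weight * distillation_loss
```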
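At inference, the only change is a longer list of class text embeddings: the held-out class names go through CLIP’s text encoder, and their embeddings are appended to the base-class list. A sketch, reusing the same cosine-similarity scoring:

```python
import torch
import torch.nn.functional as F

def classify_regions(region_embeddings, base_text_embeddings, novel_text_embeddings):
    """Assign each detected region to the nearest class text embedding.

    No retraining is needed: novel-class embeddings simply extend the list
    of class representations the detector's embeddings are compared against.
    """
    all_classes = torch.cat([base_text_embeddings, novel_text_embeddings], dim=0)
    regions = F.normalize(region_embeddings, dim=-1)
    classes = F.normalize(all_classes, dim=-1)
    scores = regions @ classes.T       # cosine similarity per (region, class) pair
    return scores.argmax(dim=-1)       # index into the combined base + novel class list
```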
Results: The authors pitted their system against a Mask R-CNN trained on all LVIS classes in a supervised manner. They compared average precision, a measure of how many objects were identified correctly and located accurately (higher is better). The authors’ system achieved 16.1 average precision on novel categories, while the supervised model achieved 12.3.
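For reference, per-class average precision is the area under that class’s precision-recall curve, and the reported figure averages over classes (LVIS-style evaluation also averages over several intersection-over-union thresholds):

$$\mathrm{AP}_c = \int_0^1 p_c(r)\,dr, \qquad \mathrm{AP} = \frac{1}{|C|}\sum_{c \in C} \mathrm{AP}_c$$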
Why it matters: Large, diverse training datasets for object detection are difficult and expensive to obtain. ViLD offers a way to overcome this bottleneck.
We’re thinking: Physicists who want to classify a Bose-Einstein condensate need absolute-zero-shot object detection.