Object Detection for Small Devices

Grounding DINO 1.5, an edge device model built for faster, smarter object detection

Published Nov 27, 2024

An open-source model performs sophisticated object detection on edge devices such as phones, cars, medical equipment, and smart doorbells.

What’s new: Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, and colleagues at the International Digital Economy Academy introduced Grounding DINO 1.5, a system that enables devices with limited processing power to detect arbitrary objects in images based on a text list of objects (also known as open-vocabulary object detection). You can download the code and weights here.

Key insight: Like many of its predecessors, the original Grounding DINO uses image embeddings at several levels. Lower-level embeddings, produced by an image encoder’s earlier layers, are larger and represent simple patterns such as edges. Higher-level embeddings, produced by later layers, are smaller and represent complex patterns such as objects. Combining levels helps the system detect objects at different scales, but it takes a lot of computation. To run on devices that have less processing power, Grounding DINO 1.5 uses only the smallest (highest-level) image embeddings for a crucial part of the process.
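To get a feel for the savings, the sketch below counts image tokens per level and compares attention cost with and without the lower levels. The 640x640 input size and the strides of 8, 16, and 32 are illustrative assumptions, not figures from the authors, and the cost model looks at token counts only.

```python
# Illustrative assumptions (not taken from the authors): a 640x640 input and
# three embedding levels that downsample the image by 8, 16, and 32.
IMAGE_HEIGHT, IMAGE_WIDTH = 640, 640
STRIDES = (8, 16, 32)  # ordered from lowest level to highest level


def tokens_per_level(height, width, strides):
    """Count the tokens (spatial positions) in each level's embedding."""
    return [(height // s) * (width // s) for s in strides]


counts = tokens_per_level(IMAGE_HEIGHT, IMAGE_WIDTH, STRIDES)
all_levels = sum(counts)  # tokens if every level takes part in fusion
top_only = counts[-1]     # tokens if only the highest level takes part

# Cross-attention against a fixed number of text tokens grows linearly with
# the number of image tokens; self-attention among image tokens grows
# quadratically.
linear_ratio = all_levels / top_only
quadratic_ratio = (all_levels / top_only) ** 2

print("tokens per level:", counts)
print("all levels vs. highest only:", all_levels, "vs.", top_only)
print("linear-cost ratio:", linear_ratio)
print("quadratic-cost ratio:", quadratic_ratio)
```

Under these assumptions, the highest level holds 400 of 8,400 tokens, so an attention step whose cost grows linearly with the number of image tokens does about 21 times less work.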

How it works: Grounding DINO 1.5 is made up of components that produce text and image embeddings, fuse them, and classify them. It follows the system architecture and training of Grounding DINO with the following exceptions: (i) it uses a different image encoder, (ii) a different model combines text and image embeddings, and (iii) it was trained on a newer dataset of 20 million publicly available text-image examples.

Given an image, a pretrained EfficientViT-L1 image encoder produced three levels of image embeddings.
Given the corresponding text, BERT produced a text embedding composed of tokens.
Given the highest-level image embedding and the text embedding, a cross-attention model updated each one to incorporate information from the other (fusing text and image modalities, in effect). After the update, a CNN-based model combined the updated highest-level image embedding with the lower-level image embeddings to create a single image embedding.
Grounding DINO 1.5 calculated which 900 tokens in the image embedding were most similar to the tokens in the text embedding.
A cross-attention model detected objects using both the image and text embeddings. For each token in the updated image embedding, it determined (i) which text token(s), if any, matched the image token, thereby assigning each image token a classification (possibly “not an object”), and (ii) a bounding box that enclosed the corresponding object (except for tokens labeled “not an object”).
The system learned to (i) maximize the similarity between matching tokens from the text and image embeddings and minimize the similarity between tokens that didn’t match and (ii) minimize the difference between its own bounding boxes and those in the training dataset.
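
The token selection and text matching steps above can be sketched in plain Python. This is a toy illustration, not the authors' code: the two-dimensional embeddings, the dot-product similarity, the 0.5 threshold, and the function names select_top_k and label_tokens are all assumptions made for the example. It skips the cross-attention fusion and the bounding-box prediction, and it keeps 3 image tokens rather than 900.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors (assumed metric)."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k(image_tokens, text_tokens, k):
    """Return indices of the k image tokens most similar to any text token."""
    best = [max(dot(img, txt) for txt in text_tokens) for img in image_tokens]
    ranked = sorted(range(len(image_tokens)), key=lambda i: best[i], reverse=True)
    return ranked[:k]


def label_tokens(image_tokens, text_tokens, phrases, threshold):
    """Give each image token its best-matching phrase, or 'not an object'."""
    labels = []
    for img in image_tokens:
        sims = [dot(img, txt) for txt in text_tokens]
        j = max(range(len(sims)), key=lambda idx: sims[idx])
        labels.append(phrases[j] if sims[j] >= threshold else "not an object")
    return labels


# Toy data (hypothetical): one text token per phrase, four image tokens.
phrases = ["cat", "dog"]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.9, 0.1], [0.1, 0.8], [0.1, 0.1], [0.2, 0.3]]

top = select_top_k(image_tokens, text_tokens, k=3)  # the article's system keeps 900
selected = [image_tokens[i] for i in top]
labels = label_tokens(selected, text_tokens, phrases, threshold=0.5)

for index, label in zip(top, labels):
    print("image token", index, "->", label)
```

In this toy run, the first two image tokens match "cat" and "dog," while the third selected token falls below the threshold and is labeled "not an object," so it would receive no bounding box.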

Results: Grounding DINO 1.5 ran nearly 10 times faster than the original Grounding DINO: 10.7 frames per second versus 1.1 frames per second on an Nvidia Jetson Orin NX computer. Tested on a dataset of images of common objects annotated with labels and bounding boxes, Grounding DINO 1.5 achieved better average precision (a measure of how many objects it identified correctly in their correct location; higher is better) than both Grounding DINO and YOLO-Worldv2-L (a CNN-based object detector). Grounding DINO 1.5 scored 33.5 percent, Grounding DINO 27.4 percent, and YOLO-Worldv2-L 33 percent.
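
For a quick sense of scale, the reported figures work out as follows. The snippet only restates the numbers in the paragraph above; the variable names are ours.

```python
# Figures as reported in the Results paragraph.
FRAMES_PER_SECOND = {"Grounding DINO 1.5": 10.7, "Grounding DINO": 1.1}
AVERAGE_PRECISION = {
    "Grounding DINO 1.5": 33.5,
    "Grounding DINO": 27.4,
    "YOLO-Worldv2-L": 33.0,
}

# Speedup of the new model over the original on the same hardware.
speedup = FRAMES_PER_SECOND["Grounding DINO 1.5"] / FRAMES_PER_SECOND["Grounding DINO"]

# Average-precision margin (in percentage points) over each baseline.
ap_margin = {
    name: round(AVERAGE_PRECISION["Grounding DINO 1.5"] - score, 1)
    for name, score in AVERAGE_PRECISION.items()
    if name != "Grounding DINO 1.5"
}

print("speedup:", round(speedup, 1))
print("AP margin:", ap_margin)
```

The speedup comes to about 9.7 times, and the average-precision margins are 6.1 points over Grounding DINO and 0.5 points over YOLO-Worldv2-L.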

Why it matters: The authors achieved nearly 10 times the speed with just a couple of small changes (a more efficient image encoder and a smaller image embedding when performing cross-attention between image and text embeddings). Small changes can yield big results.

We’re thinking: Lately model builders have been building better, smaller, faster large language models for edge devices. We’re glad to see object detection get similar treatment.
