Video search engines are often evaluated on how highly they rank a single video when queried with the brief description that accompanies that video in the test set. But this criterion may not reflect a system's utility in the real world, where numerous videos may be highly relevant to the search terms. New work aims to solve this problem.
What’s new: Researchers at the University of Bristol led by Michael Wray propose a new benchmark, Semantic Similarity Video Retrieval (SVR), that evaluates video retrieval systems by their ability to rank many similar videos. They also built a system that performed well on it.
Key insight: To evaluate a video retrieval system based on how similar the top-ranked videos are to an input description, the evaluation process needs a ground-truth measure of similarity between descriptions and videos. There isn’t an automatic way to compare a description to a video, but there are several ways to compare a description to other descriptions. The authors measured the similarity between existing descriptions to approximate ground-truth similarity between descriptions and videos. This let them train their system to rank many videos by their relevance to input text, and to evaluate the quality of its search results.
How it works: The authors generated separate representations for descriptions and videos and trained the system so that representations of matching descriptions and videos were similar. Given a description, the system learned to rank clips whose video representation best matched that of the input (and vice versa). They trained and tested it on videos with descriptions from movies, news, how-tos, and other sources.
- The authors calculated the similarity between each description and every other description using METEOR. If the similarity between two descriptions exceeded a threshold, they matched each description with the video that bore the other (sketched in code after this list).
- They used these matches to train a system that comprised a GPT-based language model, which generated representations of descriptions, and a combination of convolutional neural networks, which generated representations of videos. A triplet loss encouraged the system to produce similar representations of matched descriptions and videos and dissimilar representations of unmatched ones (see the loss sketch below).
- Given input text (for the purpose of evaluation, an existing description), the system ranked the top-matching videos according to the cosine similarity between the representation of the text and the representations of the videos.
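The matching step amounts to thresholding pairwise caption similarity. Here’s a minimal sketch, assuming NLTK’s METEOR scorer; the threshold value and whitespace tokenization are illustrative, not the authors’ exact settings.

```python
# Minimal sketch of the caption-matching step, assuming NLTK's METEOR
# scorer (requires nltk.download('wordnet')); the 0.3 threshold and
# lowercase whitespace tokenization are illustrative choices.
from nltk.translate.meteor_score import meteor_score

THRESHOLD = 0.3  # hypothetical cutoff for treating two captions as a match

def matching_pairs(captions):
    """Return index pairs (i, j): caption i is treated as relevant to video j."""
    tokens = [c.lower().split() for c in captions]
    pairs = []
    for i, query in enumerate(tokens):
        for j, other in enumerate(tokens):
            # METEOR compares a hypothesis against one or more reference token lists.
            if i != j and meteor_score([other], query) >= THRESHOLD:
                pairs.append((i, j))
    return pairs
```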
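The training objective and the retrieval step can be sketched as a generic triplet formulation with cosine similarity; this PyTorch snippet is an illustration under those assumptions, not the authors’ code, and the GPT-based text encoder and convolutional video encoders that produce the embeddings are omitted.

```python
# Triplet loss over text/video embeddings, plus cosine-similarity ranking.
import torch
import torch.nn.functional as F

def triplet_loss(text_emb, pos_video_emb, neg_video_emb, margin=0.2):
    """Pull matched text/video embeddings together, push unmatched ones apart."""
    pos_sim = F.cosine_similarity(text_emb, pos_video_emb)  # matched pair
    neg_sim = F.cosine_similarity(text_emb, neg_video_emb)  # unmatched pair
    return F.relu(margin - pos_sim + neg_sim).mean()

def rank_videos(text_emb, video_embs):
    """Sort candidate videos by cosine similarity to a single query embedding."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), video_embs)
    return torch.argsort(sims, descending=True)
```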
Results: The authors measured how well their system ranked each video with respect to every description (and vice versa) using nDCG. This metric rewards rankings that place similar videos near the top and penalizes rankings that place dissimilar ones there, with similarity measured by METEOR between captions. The authors’ system scored 0.840 out of a perfect 1.0. A baseline system that used two vanilla neural networks to create video and description embeddings scored 0.833.
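nDCG compares a system’s ranking against the ideal ordering of the same items, using graded relevance; here the relevance of a retrieved video could be the METEOR similarity between its caption and the query description. A minimal sketch of the metric:

```python
# nDCG: graded relevance discounted by rank, normalized by the ideal ordering.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))  # rank 1 -> log2(2)
    return float(np.sum(relevances / discounts))

def ndcg(relevances_in_ranked_order):
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0
```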
Why it matters: Rather than designing a system to ace a common test, the authors devised a new test that better reflects what users expect from such systems. That approach should lead to more useful systems all around.
We’re thinking: The more machine learning improves, the more we need benchmarks that are capable of measuring the latest improvements.