Better Text Embeddings

Jina AI launches jina-embeddings-v3, a text embedding model with task-specific adapters

Diagram of a transformer model using Jina embeddings and LoRA adapters, tailored for tasks like sentiment classification.

Text embedding models are often used to retrieve text, cluster text, determine similarity between texts, and generate initial embeddings for text classifiers. A new embedding model comes with adapters that specialize it to each of these use cases.

What’s new: Saba Sturua and colleagues at Jina AI released jina-embeddings-v3, a text embedding model with open weights that processes up to 8,192 input tokens and outputs embeddings of 1,024 values. It’s free for noncommercial use and competes with closed-weight models from Cohere and OpenAI.

How it works: Jina-embeddings-v3 comprises a transformer (559 million parameters) and five LoRA adapters that plug into the transformer and adjust its weights for four tasks: retrieval, clustering, determining similarity, and classification. Two of the adapters handle retrieval: one for documents and one for queries.
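
A LoRA adapter adds a small, low-rank update to a frozen weight matrix, and switching tasks means switching which update is applied. The following is a minimal PyTorch sketch of that idea, not Jina’s implementation; the rank, initialization, and class name are assumptions.

    import torch

    class LoRALinear(torch.nn.Module):
        """A frozen linear layer plus one low-rank weight update per task."""
        def __init__(self, base: torch.nn.Linear, rank: int = 8, num_tasks: int = 5):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                  # pretrained weights stay frozen
            out_features, in_features = base.weight.shape
            # One (A, B) pair per task; B starts at zero so training begins from the base model.
            self.A = torch.nn.Parameter(torch.randn(num_tasks, rank, in_features) * 0.01)
            self.B = torch.nn.Parameter(torch.zeros(num_tasks, out_features, rank))

        def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
            delta = self.B[task] @ self.A[task]          # low-rank weight update for this task
            return self.base(x) + x @ delta.t()

At inference, only the adapter matching the requested task (say, retrieval with a query) is applied on top of the shared 559-million-parameter transformer.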

  • The authors started with a pretrained XLM-RoBERTa model and further pretrained it to predict masked words in text in 89 languages.
  • They added a mean pooling layer that averages the transformer’s output vectors into a single embedding (a pooling sketch appears after this list). Then they fine-tuned the model on an unspecified dataset of 1 billion text pairs in various languages to produce similar embeddings for matching text pairs and dissimilar embeddings for non-matching pairs.
  • They fine-tuned the five adapters on the four tasks. For retrieval, they trained the two adapters to produce similar embeddings for matching queries and documents and dissimilar embeddings for queries and documents that didn’t match. For clustering, they fine-tuned the adapter to produce more-similar embeddings for examples from the same class and less-similar embeddings for examples from different classes. Text similarity worked in a related way: they fine-tuned the adapter to produce more-similar embeddings for similar examples than for dissimilar ones. For classification, they fine-tuned the adapter to produce similar embeddings for examples of the same class and dissimilar embeddings for examples of different classes.
  • They modified the loss function during training using matryoshka representation learning. This method encourages the model to perform the task at hand using only the first 32, 64, 128, 256, 512, or 768 values of the embedding nearly as effectively as it would using all 1,024 values (a training-loss sketch also appears after this list).
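
The mean pooling step fits in a few lines. This is a minimal sketch assuming PyTorch tensors shaped (batch, sequence length, hidden size); it is not Jina’s code, and the padding-mask handling is an assumption.

    import torch

    def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Average a transformer's token vectors into one embedding, ignoring padding."""
        mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1); 1 marks real tokens
        summed = (token_embeddings * mask).sum(dim=1)    # sum over real tokens only
        counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per example
        return summed / counts                           # (batch, hidden) sentence embedding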
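
The adapter fine-tuning and the matryoshka modification can be sketched together: a contrastive pair loss is averaged over nested prefixes of the embedding so that short prefixes remain useful on their own. The InfoNCE-style loss and temperature below are assumptions for illustration, not necessarily Jina’s exact objective.

    import torch
    import torch.nn.functional as F

    PREFIX_SIZES = [32, 64, 128, 256, 512, 768, 1024]    # nested embedding sizes

    def pair_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
        """Contrastive loss: matching pairs (the diagonal) should be most similar."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                 # cosine similarities of all pairs in the batch
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

    def matryoshka_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
        """Average the pair loss over every nested prefix of the embedding."""
        losses = [pair_loss(query_emb[:, :k], doc_emb[:, :k]) for k in PREFIX_SIZES]
        return torch.stack(losses).mean()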

Results: The authors compared jina-embeddings-v3 to Cohere’s multilingual embed v3, OpenAI’s text-embedding-3-large, and Microsoft’s open-weights Multilingual-E5-large-instruct, testing all four on the Massive Text Embedding Benchmark (MTEB).

  • On English-language tasks, jina-embeddings-v3 achieved an average score of 65.52 percent, while OpenAI’s model achieved 64.6 percent, Microsoft’s 64.41 percent, and Cohere’s 64.01 percent. For example, when the authors trained logistic classifiers on embeddings produced by the various models, jina-embeddings-v3 performed best at classification, achieving an average accuracy of 82.58 percent, while OpenAI’s model achieved 75.45 percent, Microsoft’s 77.56 percent, and Cohere’s 76.01 percent.
  • The team also tested how well truncated versions of the embedding performed on retrieval. Moderate truncation reduced performance only slightly. For instance, using all 1,024 values for retrieval, the model achieved 63.35 percent normalized discounted cumulative gain (nDCG), a measure of how well a model ranks retrieved documents (higher is better). Using only the first 32 values, it achieved 52.54 percent nDCG; using the first 128 values, it achieved 61.64 percent nDCG (the sketch below shows how truncated embeddings are used at retrieval time).
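
Using a truncated embedding at retrieval time is straightforward: keep the first k values, re-normalize, and rank documents by cosine similarity. The sketch below assumes NumPy arrays of precomputed embeddings; the function names are illustrative.

    import numpy as np

    def truncate_and_normalize(embeddings: np.ndarray, k: int) -> np.ndarray:
        """Keep the first k values of each embedding and rescale to unit length."""
        short = embeddings[:, :k]
        return short / np.linalg.norm(short, axis=1, keepdims=True)

    def rank_documents(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 128) -> np.ndarray:
        """Return document indices sorted from best to worst cosine-similarity match."""
        q = truncate_and_normalize(query_emb[None, :], k)
        d = truncate_and_normalize(doc_embs, k)
        scores = (d @ q.T).ravel()                       # cosine similarity after normalization
        return np.argsort(-scores)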

Why it matters: Training a set of LoRA adapters is becoming the go-to method for adapting a pretrained model to a variety of tasks. Jina extends the approach to computing embeddings for different language tasks and gives developers a further option for generating high-quality embeddings.

We’re thinking: The authors’ results show that retrieval with embeddings one-eighth the full size (128 of 1,024 values) loses only about 2 percentage points of nDCG. That tradeoff may be worthwhile if your computational budget is constrained or your task is especially data-intensive.
