Llama Herd Expands

Meta updates Llama models with vision-language capabilities, edge-friendly sizes, and agentic APIs

An interior design assistant tool analyzing an image of a modern living room.

Meta extended its Llama family of models in two new directions: vision-language models and text-only models small enough to run on edge devices.

What’s new: Meta introduced Llama 3.2, including two larger vision-language models and two smaller text-only models, as well as developer tools for building agentic applications based on the new models. Weights and code are free to developers who have fewer than 700 million monthly active users. Multiple providers offer cloud access.

How it works: Llama 3.2 90B and 11B accept images as well as text and generate text output (image processing is not available in the European Union). Llama 3.2 1B and 3B accept and generate text. All four models can process 131,072 tokens of input context and generate 2,048 tokens of output.
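
As a rough illustration of how developers can query the vision-language models, here is a minimal sketch that assumes the Hugging Face transformers integration for Llama 3.2 (MllamaForConditionalGeneration) and access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct checkpoint; the image file and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("living_room.jpg")  # placeholder local image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Suggest a color scheme that matches this room."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```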

  • Llama 3.2 90B and 11B are based on Llama 3.1. The team froze a Llama 3.1 model and added an image encoder and cross-attention layers, then trained these new components on matched image-text pairs so that the image embeddings aligned with the corresponding text embeddings. To sharpen the models’ ability to interpret images, the team further fine-tuned the new components via supervised learning and DPO: given an image, the models learned to generate questions and answers that a reward model ranked highly. Because the underlying language model stays frozen, Llama 3.2 responds to text-only input identically to Llama 3.1, making it a viable drop-in replacement. (A conceptual sketch of this adapter setup appears after this list.)
  • Likewise, Llama 3.2 3B and 1B are based on Llama 3.1 8B. The team pruned the 8B model to each smaller size using an unspecified method, then used Llama 3.1 8B and 70B as teacher models, training the Llama 3.2 students to mimic their outputs. Finally, they fine-tuned the models to follow instructions, summarize text, use tools, and perform other tasks on synthetic data generated by Llama 3.1 405B. (The second sketch after this list illustrates the general distillation recipe.)
  • On popular benchmarks, Llama 3.2 90B and 11B perform roughly on par with Claude 3 Haiku and GPT-4o mini, the smaller vision-language models from Anthropic and OpenAI respectively. For example, Llama 3.2 90B beats both closed models on MMMU and MMMU-Pro, which pose visual questions about graphs, charts, diagrams, and other images. The Llama 3.2 models also beat Claude 3 Haiku and GPT-4o mini on GPQA, which tests graduate-level reasoning in various academic subjects. However, on these benchmarks, the larger Llama 3.2 models trail larger proprietary models like o1 and Claude 3.5 Sonnet as well as the similarly sized, open Qwen2-VL.
  • Llama 3.2’s vision-language capabilities now drive the company’s Meta AI chatbot. For example, users can upload a photo of a flower and ask the chatbot to identify it or post a picture of food and request a recipe. Meta AI also uses Llama 3.2’s image understanding to edit images given text instructions.
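
The adapter recipe described in the first bullet (frozen language-model layers plus a new image encoder and cross-attention blocks) can be sketched in a few lines of PyTorch. Everything below is illustrative: the class names, the gating scheme, and the choice to insert an adapter after every fourth layer are assumptions, not details of Meta’s implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """New, trainable block: text hidden states attend to image embeddings."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Zero-initialized gate: at the start of training the block is a no-op,
        # so the frozen language model's behavior is preserved.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_h, image_h):
        attended, _ = self.cross_attn(query=text_h, key=image_h, value=image_h)
        return text_h + torch.tanh(self.gate) * self.norm(attended)

class VisionLanguageAdapter(nn.Module):
    def __init__(self, llm_layers, image_encoder, d_model, n_heads, every_n=4):
        super().__init__()
        self.llm_layers = nn.ModuleList(llm_layers)  # pretrained decoder blocks
        self.image_encoder = image_encoder           # new image encoder (e.g., a ViT)
        self.adapters = nn.ModuleDict({
            str(i): GatedCrossAttention(d_model, n_heads)
            for i in range(0, len(self.llm_layers), every_n)
        })
        for p in self.llm_layers.parameters():
            p.requires_grad = False                  # only the new components train

    def forward(self, text_h, image=None):
        image_h = self.image_encoder(image) if image is not None else None
        for i, layer in enumerate(self.llm_layers):
            text_h = layer(text_h)
            if image_h is not None and str(i) in self.adapters:
                text_h = self.adapters[str(i)](text_h, image_h)
        # Text-only inputs never touch the adapters, so outputs match the base model.
        return text_h
```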

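Meta hasn’t published the exact distillation objective for the 1B and 3B models, but the step in which the students mimic teacher outputs typically uses a loss along these lines; the temperature, weighting, and variable names below are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (mimic the teacher) with the usual next-token loss."""
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)
    soft = F.kl_div(
        F.log_softmax(s / T, dim=-1),
        F.softmax(t / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(s, labels.view(-1))
    return alpha * soft + (1 - alpha) * hard

# Training-loop sketch: the teacher (e.g., Llama 3.1 8B or 70B) runs frozen,
# while the pruned student is updated to match its predictions.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```
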
New tools for developers: Meta announced Llama Stack, a series of APIs for customizing Llama models and building Llama-based agentic applications. Among other services, Llama Stack has APIs for tool use, memory, post-training, and evaluation. Llama Guard, a model designed to evaluate content for sexual themes, violence, criminal planning, and other issues, now flags problematic images as well as text. Llama Guard 3 11B Vision comes with Llama.com’s distributions of Llama 3.2 90B and 11B, while Llama Guard 3 1B comes with Llama 3.2 3B and 1B.
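
A common way to pair the guard models with the main models is to screen both the user’s request and the generated reply. The outline below is purely conceptual; classify_with_guard and generate_with_llama are placeholder stubs for whatever serving calls a given deployment provides, not Llama Stack’s actual API.

```python
def classify_with_guard(text: str, image=None) -> str:
    """Placeholder: call a deployed Llama Guard 3 model (11B Vision or 1B) here.
    Llama Guard replies with a verdict that begins 'safe' or 'unsafe'."""
    raise NotImplementedError

def generate_with_llama(text: str, image=None) -> str:
    """Placeholder: call a deployed Llama 3.2 model (90B/11B or 3B/1B) here."""
    raise NotImplementedError

def answer_safely(user_text: str, user_image=None) -> str:
    # 1. Screen the request (text and, if present, image) with the guard model.
    if classify_with_guard(user_text, user_image).startswith("unsafe"):
        return "Sorry, I can't help with that request."
    # 2. Generate a reply with the main model.
    reply = generate_with_llama(user_text, user_image)
    # 3. Screen the reply before returning it.
    if classify_with_guard(reply).startswith("unsafe"):
        return "Sorry, I can't share that response."
    return reply
```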

Why it matters: Meta’s open models are widely used by everyone from hobbyists to major industry players. Llama 3.2 extends the line in valuable ways. The growing competition between Llama and Qwen shows that smaller, open models can offer multimodal capabilities that are beginning to rival their larger, proprietary counterparts.

We’re thinking: By offering tools to build agentic workflows, Llama Stack takes Llama 3.2 well beyond the models themselves. Our new short course “Introducing Multimodal Llama 3.2” shows you how to put these models to use.
