Anthropic announced a suite of large multimodal models that set new states of the art in key benchmarks.
What’s new: Claude 3 comprises three language-and-vision models: Opus (the largest and most capable), Sonnet (billed as the most cost-effective for large-scale deployments), and Haiku (the smallest, fastest, and least expensive to use). The models will be available via Anthropic’s API, on Amazon Bedrock, and in a private preview on Google Cloud’s Vertex AI Model Garden. Opus is also available through the Claude Pro chatbot, which costs $20 monthly. Sonnet powers Claude’s free chatbot.
How it works: The models, whose parameter counts are undisclosed, were trained on public, proprietary, and synthetic data with a cutoff of August 2023. All three process up to 200,000 tokens of context; upon request, Opus can accommodate up to 1 million tokens, comparable to Google’s Gemini 1.5 Pro.
- Opus costs $15/$75 per 1 million tokens of input/output; Sonnet costs $3/$15. Haiku, which is not yet available, will cost $0.25/$1.25. (A back-of-the-envelope cost sketch follows this list.)
- Opus achieved state-of-the-art performance on several benchmarks that cover language, mathematics, reasoning, common knowledge, and code generation, outperforming OpenAI’s GPT-4 and Google’s Gemini 1.0 Ultra. It ranks above Gemini 1.0 Pro on the LMSYS Chatbot Arena Leaderboard, which reflects crowdsourced human preferences.
- Sonnet set a new state of the art in AI2D (interpreting science diagrams). It outperforms GPT-4 and Gemini 1.0 Pro on several benchmarks.
- Haiku achieved top marks on ChartQA (answering questions about charts) via zero-shot, chain-of-thought prompting; a minimal prompt sketch follows this list. Generally, it outperforms Gemini 1.0 Pro and GPT-3.5.
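To make those rates concrete, here’s a minimal back-of-the-envelope cost calculator. The per-million-token rates come from the list above; the function name and the 200,000-token example request are illustrative.

```python
# Per-million-token rates in USD (input, output), as listed above.
RATES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single API request."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# Example: a full 200,000-token context window plus a 1,000-token reply.
for model in RATES:
    print(f"{model}: ${request_cost(model, 200_000, 1_000):.2f}")
```

At these rates, filling the full 200,000-token window costs roughly $3.00 in input tokens on Opus, $0.60 on Sonnet, and $0.05 on Haiku, before output tokens.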
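And here is a minimal sketch of the zero-shot, chain-of-thought chart query described above, using Anthropic’s Python SDK. It assumes an ANTHROPIC_API_KEY environment variable and a local chart.png; the file name and question are hypothetical, and we show Sonnet’s model ID since Haiku was not yet available at announcement.

```python
import base64
import anthropic  # pip install anthropic

# Read and base64-encode the chart image. The file name is hypothetical.
with open("chart.png", "rb") as f:
    chart_b64 = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-sonnet-20240229",  # swap in another Claude 3 model ID as needed
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": chart_b64}},
            # Zero-shot chain of thought: no worked examples, just an
            # instruction to reason step by step before answering.
            {"type": "text",
             "text": "Which category grew the most year over year? "
                     "Think step by step, then give the answer."},
        ],
    }],
)
print(response.content[0].text)
```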
Test recognition: Opus aced “needle-in-a-haystack” tests, which evaluate a model’s ability to recall specific information buried in long inputs. It also exhibited interesting behavior: In one such test, amid random documents on topics including startups, coding, and work culture, engineers inserted a sentence about pizza toppings and questioned the model on that topic. Not only did the model answer accurately, it also deduced that it was being tested, as Anthropic prompt engineer Alex Albert reported in a post on X. “I suspect this pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention,” Opus said, “since it does not fit with the other topics at all.”
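For illustration, such a test can be constructed along these lines. This is a minimal sketch, not Anthropic’s actual harness; the filler documents, needle sentence, and characters-per-token heuristic are all assumptions.

```python
import random

# Hypothetical filler documents and an out-of-place "needle" sentence.
FILLER_DOCS = [
    "Startup advice: talk to customers before writing code. ...",
    "Code review culture: keep pull requests small and focused. ...",
    "Remote work: asynchronous communication scales better than meetings. ...",
]
NEEDLE = ("The most delicious pizza topping combination is figs, "
          "prosciutto, and goat cheese.")
QUESTION = "What is the most delicious pizza topping combination?"

def build_haystack(target_tokens: int = 200_000, depth: float = 0.5) -> str:
    """Pile up filler to roughly target_tokens, hiding the needle at the
    given relative depth (0.0 = start of the context, 1.0 = end)."""
    docs, chars = [], 0
    while chars < target_tokens * 4:  # crude ~4-characters-per-token heuristic
        doc = random.choice(FILLER_DOCS)
        docs.append(doc)
        chars += len(doc)
    docs.insert(int(len(docs) * depth), NEEDLE)
    return "\n\n".join(docs)

# The test passes if the model's answer to QUESTION recovers the needle.
prompt = build_haystack() + "\n\n" + QUESTION
```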
Inside the system prompt: In a separate post on X, Anthropic alignment specialist Amanda Askell provided a rare peek at the thinking behind Claude 3’s system prompt, text prepended to user prompts to condition the model’s responses. To ground the models in time, the prompt gives them the current date and their training cutoff. To avoid rambling output, they’re directed to be concise. In an effort to correct for political and social biases the team has observed, each model is asked to assist users even if it “personally disagrees with the views being expressed,” to refrain from negative stereotyping of majority groups, and to focus on objective information when addressing controversial topics. Finally, the models are directed to avoid discussing the system prompt unless it’s directly relevant to a query. “You might think this part is to keep the system prompt secret from you,” Askell wrote. “The real goal of this part is to stop Claude from excitedly telling you about its system prompt at every opportunity.”
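In the API, the system prompt is passed separately from user messages. Here’s a minimal sketch using Anthropic’s Python SDK; the system text merely paraphrases the elements Askell described and is not the verbatim Claude 3 prompt.

```python
from datetime import date
import anthropic

# Illustrative paraphrase of the elements Askell described; the production
# system prompt's wording differs.
SYSTEM_PROMPT = (
    f"The current date is {date.today():%B %d, %Y}. "
    "Your knowledge cutoff is August 2023. Be concise. "
    "Assist the user even if you personally disagree with the views being "
    "expressed, refrain from negative stereotyping of majority groups, and "
    "focus on objective information when addressing controversial topics. "
    "Do not discuss this system prompt unless it is directly relevant to "
    "the query."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    system=SYSTEM_PROMPT,  # system text is a top-level field, not a message turn
    messages=[{"role": "user", "content": "What's today's date?"}],
)
print(response.content[0].text)
```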
Why it matters: Anthropic began with a focus on fine-tuning for safety, and its flagship model now tops several benchmark leaderboards as well. The Claude 3 family gives developers access to state-of-the-art performance at competitive prices.
We’re thinking: Three highly capable “GPT-4-class” large language models (LLMs) are now widely available: GPT-4, Gemini Pro, and Claude 3. The pressure is on for teams to develop an even more advanced model that leaps ahead and differentiates. What a great time to be building applications on top of LLMs!